Linguistic Data Consortium
@ldcupenn.bsky.social
30 followers 1 following 27 posts
LDC creates and distributes language resources to universities, labs, companies and libraries for linguistic education, research and technology development.
ldcupenn.bsky.social
More LDC data in the LORELEI series: LORELEI Hindi Representative Language Pack features monolingual and parallel text, annotations, software tools and more for human language technology development to address emergent situations bit.ly/4nCp3ar
ldcupenn.bsky.social
AIDA Scenario 1 Evaluation Topic Source Data, Annotation & Assessment: 10k+ English, Russian & Ukrainian web docs on political relations between Russia & Ukraine in the 2010s annotated for entities & cross-reference, w/ judgments for scoring submissions bit.ly/3K7ynoA
ldcupenn.bsky.social
Mixer 7 English Speech has 12,321 hours of telephone conversations, interviews and transcript readings from 222 English speakers, some collected using a 14-microphone array; speaker metadata is included bit.ly/4nvSYkG
ldcupenn.bsky.social
Check out our September newsletter for three new LDC publications: Mixer 7 English Speech, AIDA Scenario 1 Evaluation Topic Source Data, Annotation and Assessment, and LORELEI Hindi Representative Language Pack ldc-upenn.blogspot.com
ldcupenn.bsky.social
KAIROS Phase 1 Quizlet contains English and Spanish web data annotated for events, relations and arguments and a reference knowledge graph; quizlets were defined tasks to explore evaluation objectives before the full program evaluation bit.ly/3HvDU7k
ldcupenn.bsky.social
Abstract Meaning Representation 2.0 - Machine Translations translates 1,371 English sentences from LDC’s AMR 2.0 corpus into Spanish, German, Italian and Mandarin Chinese using Google Translate bit.ly/4n1m8bp
ldcupenn.bsky.social
Mixer 6 - CHiME 8 Transcribed Calls and Interviews: 80 hours of Mixer 6 English interviews and telephone speech across 13 channels (1063 hours) with transcripts divided into training, development and test sets bit.ly/4oyUCn5
ldcupenn.bsky.social
LDC’s August newsletter has the last call for fall data scholarship applications and details on new publications: Mixer 6 - CHiME 8 Transcribed Calls and Interviews, Abstract Meaning Representation 2.0 - Machine Translations and KAIROS Phase 1 Quizlet ldc-upenn.blogspot.com
ldcupenn.bsky.social
What a great conference #Interspeech2025! There is still time to stop by our booth and grab a limited-edition TIMIT word poetry magnet. Also don’t miss our colleague’s oral session on TELVID: A multilingual, multi-modal corpus for speaker recognition at 13:30, A04, Port 1A @interspeech.bsky.social
ldcupenn.bsky.social
Good morning, #Interspeech2025! Stop by our booth during the coffee breaks today to say hello. Also don't miss today's special session co-organized by LDC on Challenges in Speech Collection, Curation and Annotation in two parts beginning at 13:30, Dock 15. @interspeech.bsky.social
ldcupenn.bsky.social
Good morning Interspeech. It's a great second day. Come by and grab one of our limited giveaways. @interspeech.bsky.social #Interspeech2025
ldcupenn.bsky.social
We are excited to be here at Interspeech 2025 @interspeech.bsky.social Come see us at the first coffee break today to learn more about the latest developments at LDC. #Interspeech2025
ldcupenn.bsky.social
LDC will be exhibiting at #Interspeech2025, August 17-21 in Rotterdam. Stop by our booth to say hello and learn about the latest developments at the Consortium. LDC work will also be featured in presentations, posters and a special session. We look forward to seeing you there. www.interspeech2025.org
ldcupenn.bsky.social
From the LORELEI companion project: LoReHLT Uzbek Representative Language Pack features monolingual and parallel text, annotations, audio recordings, software tools and more for human language technology development to address emergent situations bit.ly/4lL0zuL
ldcupenn.bsky.social
Penn Parsed Corpora of Historical English Second Release: POS-tagged & syntactically annotated British English text (1100 CE - 1914 CE); updates the 2020 release with new annotation, revised guidelines, philological information & the Corpus2 search tool bit.ly/46zR1hR
ldcupenn.bsky.social
AnnoDIFP Session Audio and Transcripts: 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments bit.ly/4nEYQJr
ldcupenn.bsky.social
Check out the July newsletter for Fall 2025 data scholarship application deadlines & 3 new publications: AnnoDIFP Session Audio and Transcripts, Penn Parsed Corpora of Historical English Second Release & LoReHLT Uzbek Representative Language Pack ldc-upenn.blogspot.com
ldcupenn.bsky.social
KAIROS Schema Learning Complex Event Annotation has English and Spanish web text, audio, video and image data labeled for 93 real-world complex events with event, relation and argument annotations linking to document provenance bit.ly/4jNrDIq
ldcupenn.bsky.social
IWSLT 2022 - 2023 Shared Task Training, Development and Test Set: 210 hours of Tunisian Arabic conversational telephone speech, transcripts, English translations, speaker metadata, and documentation used in IWSLT dialectal speech and low resource tracks bit.ly/3HEO4lL
ldcupenn.bsky.social
Chinese Sentence Pattern Structure Treebank contains 5,016 sentences and 119,627 tokens from modern and ancient Chinese works annotated for lexical sense, syntactic structure and inter-clause relations bit.ly/4kZVGh3
ldcupenn.bsky.social
LDC’s June newsletter has the latest on three new publications: Chinese Sentence Pattern Structure Treebank, IWSLT 2022-2023 Shared Task Training, Development and Test Set, and KAIROS Schema Learning Complex Event Annotation ldc-upenn.blogspot.com
ldcupenn.bsky.social
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations: transcripts and English translations for 93 hours of BOLT CTS telephone recordings; all speech was transcribed; 89% of the transcripts were translated bit.ly/4jKul2j
ldcupenn.bsky.social
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio: 93 hours of telephone speech from 236 conversations between native speakers; developed by LDC for the DARPA BOLT program; contains previously unexposed calls from the CF/CH collections bit.ly/4kbsBPy
ldcupenn.bsky.social
Check out LDC’s May newsletter for two new companion releases developed by LDC to support the DARPA BOLT program, BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio and BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations ldc-upenn.blogspot.com
ldcupenn.bsky.social
MATERIAL Kazakh-English Language Pack has 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries designed to support cross language information retrieval bit.ly/42cwe01