OLYMPICS Task
The OLYMPICS task is carried out using parts of the HIT Olympic Trilingual Corpus (HIT), a multilingual corpus that covers 5 domains (traveling, dining, sports, traffic and business) that are closely related to the Beijing 2008 Olympic Games. The HIT corpus contains around 2.8 million words in total.
Moreover, the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus containing tourism-related sentences, is provided as an additional training corpus. The BTEC corpus consists of 20k sentences including the evaluation data sets of previous IWSLT evaluation campaigns.
The monolingual and bilingual language resources that may be used to train the translation engines for the primary runs are limited to the supplied HIT and BTEC corpora. This includes all supplied development sets, i.e., you are free to use these data sets as you wish, for example for tuning model parameters or as training bitext. All other language resources besides the ones supplied for the given translation task, such as additional dictionaries, word lists, or bitext corpora like the ones provided by LDC, are treated as "additional language resources".
MT track: text translation of Chinese sentences into English
- official: training data limited to supplied data only.
- optional: usage of additional language resources (please provide a detailed description of the resources at run submission time).
Participants of the OLYMPICS task have to sign an end-user agreement in order to get access to the data sets. In addition, participants of the OLYMPICS task are requested to fill in the following form and send it to 'iwslt2012.oly AT gmail DOT com'. A single organization can register multiple times as different participants. Registration is possible until the test data run submission deadline.
--- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
Registration Form
IWSLT 2012 Evaluation Campaign - OLYMPICS Task

Organization Name (long, e.g. Fondazione Bruno Kessler):
Organization Name (short, e.g. FBK):
Country:
Contact Person (e.g. John Smith):
E-mail address:
Track:
+ MT (Chinese to English): YES

I understand that participants in this evaluation are requested to submit a system paper describing their work and to present it at the workshop.
--- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< --- 8< ---
Training Corpus:
- parallel text
- each line consists of three fields divided by the character '\'
Format: <SENTENCE_ID>\<PARAPHRASE_ID>\<MT_TRAINING_SENTENCE>
TRAIN_00001\01\This is the first training sentence
TRAIN_00002\01\This is the second training sentence.
...
- Data sets:
- HIT: 50K Chinese-English sentence pairs randomly selected from the HIT corpus
- BTEC: 20K Chinese-English sentence pairs randomly selected from the BTEC corpus
- Corpus specifications:
- coding: UTF-8
- text is case-sensitive and includes punctuation
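The three-field line format above can be read with a few lines of Python. This is a minimal sketch, not an official tool: the function name `parse_corpus_line` is hypothetical, and limiting the split to two separators is an assumption made so that a backslash inside the sentence text would not break parsing.

```python
def parse_corpus_line(line):
    """Split one corpus line into (sentence_id, paraphrase_id, text)."""
    # Assumption: only the first two backslashes are field separators,
    # so the sentence text is kept whole even if it contains a backslash.
    sentence_id, paraphrase_id, text = line.rstrip("\n").split("\\", 2)
    return sentence_id, paraphrase_id, text


# Sample line taken from the format description above.
sample = "TRAIN_00001\\01\\This is the first training sentence"
print(parse_corpus_line(sample))
```

When reading a real corpus file, open it with `encoding="utf-8"`, as stated in the corpus specifications.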
Develop Corpus:
- parallel text
- each line consists of three fields divided by the character '\'
Format: <SENTENCE_ID>\<PARAPHRASE_ID>\<MT_DEVELOP_SENTENCE>
DEV_001\01\1st reference translation for 1st input.
DEV_001\02\2nd reference translation for 1st input
...
DEV_002\01\1st reference translation for 2nd input
DEV_002\02\2nd reference translation for 2nd input
...
- Data sets:
- HIT: 2 develop sets, 1000 Chinese-English sentence pairs each, single reference
- BTEC: 8 develop sets, 500 Chinese-English sentence pairs each, multiple references
- Corpus specifications:
- coding: UTF-8
- text is case-sensitive and includes punctuation
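For the develop sets with multiple references, lines sharing a sentence ID can be grouped so that each input has its list of reference translations. The sketch below assumes the format shown above; the function name `group_references` is hypothetical.

```python
from collections import defaultdict


def group_references(lines):
    """Map each sentence ID to its reference translations, in file order."""
    references = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        sentence_id, _paraphrase_id, text = line.split("\\", 2)
        references[sentence_id].append(text)
    return dict(references)


# Sample lines taken from the format description above.
dev_lines = [
    "DEV_001\\01\\1st reference translation for 1st input.",
    "DEV_001\\02\\2nd reference translation for 1st input",
    "DEV_002\\01\\1st reference translation for 2nd input",
]
print(group_references(dev_lines))
```

A single-reference HIT develop set passes through the same function and simply yields one-element lists.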
Evaluation Corpus:
- source language (Chinese) text
- each line consists of three fields divided by the character '\'
Format: <SENTENCE_ID>\<PARAPHRASE_ID>\<MT_EVAL_SENTENCE>
HIT_TST_IWSLT12_0001\01\1st source language input sentence.
HIT_TST_IWSLT12_0002\01\2nd source language input sentence.
...
- Data sets:
- HIT: 1 evaluation set, 1000 Chinese sentences, MT engine input
- Corpus specifications:
- coding: UTF-8
- text is case-sensitive and includes punctuation
Context Annotations:
- For each sentence of the HIT corpus, context information is provided on the type of text (dialogue, samples, explanation), the scene (airplane, airport, restaurant, water/winter sports, etc.), the topic (ask about traffic conditions, bargain over a price, front desk customer service, etc.), and the speaker (customer, clerk, passenger, receptionist, travel agent, etc.). Each line of the INFO files consists of three fields divided by the character '\'; within the third field, the context annotations are divided by the character '|'.
Format: <SENTENCE_ID>\<PARAPHRASE_ID>\<TEXTTYPE>|<SCENE>|<TOPIC>|<SPEAKER>
TRAIN_00004\01\dialogue|train|when you are feeling ill|conductor
TRAIN_00005\01\dialogue|train|when you are feeling ill|Foreign guest
...
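An INFO line can be split in two stages: first on '\' into the three fields, then on '|' into the four annotations. The following is a sketch under the assumption that every line carries exactly four annotations, as in the format above; the names `ContextInfo` and `parse_info_line` are hypothetical.

```python
from typing import NamedTuple


class ContextInfo(NamedTuple):
    sentence_id: str
    paraphrase_id: str
    text_type: str
    scene: str
    topic: str
    speaker: str


def parse_info_line(line):
    """Parse one INFO line into a ContextInfo record."""
    sentence_id, paraphrase_id, annotations = line.rstrip("\n").split("\\", 2)
    # Assumption: exactly four '|'-separated annotations per line;
    # a line with a different count raises ValueError here.
    text_type, scene, topic, speaker = annotations.split("|")
    return ContextInfo(sentence_id, paraphrase_id, text_type, scene, topic, speaker)


# Sample line taken from the format description above.
info = parse_info_line("TRAIN_00004\\01\\dialogue|train|when you are feeling ill|conductor")
print(info.scene, info.speaker)
```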
Submission Guidelines:
- Each participant has to submit at least one run for the OLYMPICS translation task s/he registered for
- Multiple run submissions are allowed, but participants must explicitly indicate one PRIMARY run for each track. All other run submissions are treated as CONTRASTIVE runs. If none of the runs is marked as PRIMARY, the latest submission (according to the file time-stamp) for the respective track will be used as the PRIMARY run.
- Runs have to be submitted as a gzipped TAR archive (see the format below) and e-mailed to iwslt2012.oly AT gmail DOT com.
TAR archive file structure:
<UserID> = user ID of participant
<Set> = testset_IWSLT12
<Task> = MT_ZhEn
Example:
nict/testset_IWSLT12.MT_ZhEn.nict.primary.txt
/testset_IWSLT12.MT_ZhEn.nict.contrastive1.txt
Run submission file format:
Format: <SENTENCE_ID>\01\<TRANSLATED_SENTENCE>
HIT_TST_IWSLT12_0001\01\This is the first translation
HIT_TST_IWSLT12_0002\01\This is the second translation.
...
- Re-submitting your runs is allowed as long as the mails arrive BEFORE the submission deadline. If multiple TAR archives are submitted by the same participant, only the runs of the most recent submission mail will be used for the IWSLT 2012 evaluation, and previous mails will be ignored.
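The run file format and archive layout described above could be produced along the following lines. This is only a sketch: the helper names are hypothetical, the file naming pattern is inferred from the `nict` example rather than stated as a rule, and the name of the archive itself is not specified in the guidelines, so the caller chooses it.

```python
import io
import tarfile
import time


def format_run_line(sentence_id, translation):
    """Build one run file line; the paraphrase ID is always 01."""
    return f"{sentence_id}\\01\\{translation}"


def run_file_name(user_id, run_type, set_name="testset_IWSLT12", task="MT_ZhEn"):
    """Assumed pattern, inferred from the example: <UserID>/<Set>.<Task>.<UserID>.<run>.txt"""
    return f"{user_id}/{set_name}.{task}.{user_id}.{run_type}.txt"


def build_archive(archive_path, user_id, runs):
    """Write a gzipped TAR archive.

    runs maps a run type (e.g. "primary", "contrastive1") to a list of
    (sentence_id, translation) pairs.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for run_type, translations in runs.items():
            body = "\n".join(format_run_line(s, t) for s, t in translations) + "\n"
            data = body.encode("utf-8")  # UTF-8, as for the supplied corpora
            member = tarfile.TarInfo(name=run_file_name(user_id, run_type))
            member.size = len(data)
            member.mtime = int(time.time())
            tar.addfile(member, io.BytesIO(data))
```

For example, `build_archive("nict.tar.gz", "nict", {"primary": [("HIT_TST_IWSLT12_0001", "This is the first translation")]})` would create an archive containing `nict/testset_IWSLT12.MT_ZhEn.nict.primary.txt`.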
Download:
- The references of the testset_IWSLT12 evaluation data set are available to participants here.
