    <CHAPTER ID=1>
    Resumption of the session
    <SPEAKER ID=1 NAME="President">
    I declare resumed the session of the European Parliament ...
    <P>
    Although, as you will have seen, the dreaded 'millennium bug' ...

Figure 1: Format of the released corpus: beginning of file de-en/en/ep-00-01-17.txt, the English half of the German–English corpus from January 17, 2000.

... a multilingual corpus of 11 languages, or 10 parallel corpora for each language.

2.1 Crawling

The website of the European Parliament (footnote 2) provides the Proceedings of the European Parliament in the form of HTML files. At the time of our most recent crawl, each file contained the utterances of one speaker in turn. The format has changed this year. The URL for each file contains relevant information for identification, such as its language, the day and number of the thread of discussion, and the number of the utterance.

Crawling this web resource with a web spider is done by starting at an index page and following certain links based on inclusion and exclusion rules. Since the corpus consists of many small parts, the crawling process is time-consuming. Per language, it took several days to obtain the roughly 80,000 files each.

Although such crawls are slow, they are typically easier than contacting the technical staff of the website directly and negotiating a transfer procedure. Usually, there are also copyright concerns, although less so for information from government sources. The European Parliament website states: "Except where otherwise indicated, reproduction is authorised, provided that the source is acknowledged." Such a liberal copyright policy cannot necessarily be expected. Often a longer legal process is required to get permission and access to the data.

Besides identifying sources for parallel corpora manually, it is also possible to mine the web for such data. Resnik [1999] proposes such a system, called STRAND.

2.2 Document Alignment

Each sitting of the European Parliament covers a number of topics. A first step is to identify the texts belonging to each topic and to match these between languages. To obtain the maximum amount of data, we match these topics for each of the language pairs.
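As an illustration of the rule-based crawl described in Section 2.1, the following is a minimal Python sketch, not the spider that was actually used. It starts at an index page and follows only links that match an inclusion rule and no exclusion rule. The URL patterns, the toy link graph SITE and the names allowed and crawl are all hypothetical; a real crawl would fetch pages over HTTP instead of reading a dictionary.

```python
import re
from collections import deque

# Hypothetical rules: follow proceedings pages, skip printable duplicates.
INCLUDE = [re.compile(r"/proceedings/")]
EXCLUDE = [re.compile(r"[?&]print=1")]

def allowed(url):
    """Follow a URL if it matches some inclusion rule and no exclusion rule."""
    return (any(p.search(url) for p in INCLUDE)
            and not any(p.search(url) for p in EXCLUDE))

def crawl(index_url, get_links):
    """Breadth-first crawl; get_links(url) returns the links found on a page."""
    seen = {index_url}
    queue = deque([index_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return visited

# Toy in-memory "website" standing in for real HTTP requests (an assumption).
SITE = {
    "/proceedings/index": ["/proceedings/en/day1", "/news/today",
                           "/proceedings/en/day1?print=1"],
    "/proceedings/en/day1": ["/proceedings/en/day1/utt1", "/proceedings/index"],
    "/proceedings/en/day1/utt1": [],
}

pages = crawl("/proceedings/index", lambda url: SITE.get(url, []))
```

With many small files per language, most of the running time of such a crawl goes into the individual page requests, which is consistent with the several days per language reported above.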
Footnote 2: Online at http://www.europarl.eu.int/

Large data collections such as the Proceedings of the European Parliament are created over the period of many years, often with changing formatting standards and other sources of error. For instance, part of the "English" part of the proceedings actually contained French texts (21–24 May 1996) at the time of our crawl.

The extraction of relevant text from noisy HTML is a cumbersome enterprise that requires constant refinement and adaptation. We process the HTML data with a Perl program that uses pattern matching to detect and extract the identity of the speaker and her statements, including paragraph markers.

There is work on automatically learning systems that extract structured information from web sources or other forms of unstructured data. This task is called wrapper induction. See for instance work by Muslea et al. [1999]. For a single data source, however, a manual approach is often more efficient.

For each day, we store the data in one file per language with some meta information, as shown in Figure 1. We created parallel corpora involving English in this format. We also provide corpora in sentence aligned format, which we will describe below. Scripts are provided to generate the other parallel corpora.

The document alignment is done without tokenisation and sentence splitting. The motivation behind this is that these are error-prone processes for which multiple standards could be applied, and we do not want to force any specific standard at this step.

2.3 Sentence Splitting and Tokenisation

Sentence splitting and tokenisation require specialised tools for each language. Unfortunately, we do not have such tools available for all the languages under consideration.

One problem of sentence splitting is the ambiguity of the period "." as either an end of sentence marker or a marker for an abbreviation. For English, French and German, we semi-automatically created a list of known abbreviations that are typically followed by a period. One clue is a lowercased ...

Danish: det er næsten en personlig rekord for mig dette efterår.
German: das ist für mich fast persönlicher rekord in diesem herbst.
Greek: ...
English: that is almost a personal record for me this autumn!
Spanish: es la mejor marca que he alcanzado este otoño.
Finnish: se on melkein minun ennätykseni tänä syksynä!
French: c'est pratiquement un record personnel pour moi, cet automne!
Italian: e' quasi il mio record personale dell'autunno.
Dutch: dit is haast een persoonlijk record deze herfst.
Portuguese: é quase o meu recorde pessoal deste semestre!
Swedish: det är nästan personligt rekord för mig denna höst!

Figure 2: One sentence aligned across 11 languages

Note that this data is also lowercased, which is not done for the released sentence aligned data. Alternatively, truecasing could be applied, although this is a more difficult task.

2.6 Releases of the Corpus

The initial release of this corpus consisted of data up to 2001. The second release added data up to 2003, increasing the size from just over 20 million words to up to 30 million words per language. A forthcoming third release will include data up to early 2005 and will have better tokenisation. For more details, please check the website.

3 110 SMT Systems

The prevailing methodology in statistical machine translation (SMT) has progressed from the initial word-based IBM Models [Brown et al., 1993] to current phrase-based models [Koehn et al., 2003]. To describe the latter quickly: When translating a sentence, source language phrases (any sequences of words) are mapped into phrases in the target language, as specified by a probabilistic phrase translation table. Phrases may be reordered, and a language model in the target language supports fluent output.

The core of this model is the probabilistic phrase translation table that is learned from a parallel corpus. There are various methods to train this; several start with an automatically obtained word alignment and then collect phrase pairs of any length that are consistent with the word alignment.

Decoding is a beam search over all possible segmentations of the input into phrases, any translation for each phrase, and any reordering. Additional component models aid in scoring alternative translations. Translation speed in our case is a few seconds per sentence.

Fuelled by annual competitions and an active research community, we can observe rapid progress in the field. Due to the involvement of US funding agencies, most research groups focus on translation from Arabic to English and Chinese to English. Next to text-to-text translation, there is increasing interest in speech-to-text translation.

Most systems are largely language-independent, and building an SMT system for a new language pair is mostly a matter of availability of parallel texts. Our efforts to explore open-domain German–English SMT led us to collect data from the European Parliament. Incidentally, the existence of translations in 11 languages now enabled us to build translation systems for all 110 language pairs.

Our SMT system [Koehn et al., 2003] includes the decoder Pharaoh [Koehn, 2004], which is freely available for research purposes (footnote 3). Training 110 systems took about 3 weeks on a 16-node Linux cluster. We evaluated the quality of the systems with the widely used BLEU metric [Papineni et al., 2002], which measures overlap with a reference translation.

We tested on a 2000-sentence held-out test set, which is drawn from text from sessions that took place in the last quarter of the year 2000. These sentences are aligned across all 11 languages, so when translating, say, the French sentences into Danish, we can compare the output against the Danish set of sentences. The same test set was used in a shared task at the 2005 ACL Workshop on Parallel Texts [Koehn, 2005].

The scores for the 110 systems are displayed in Table 2. According to these numbers, the easiest translation direction is Spanish to French (BLEU score of 40.2), the hardest Dutch to Finnish (10.3).
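To make "overlap with a reference translation" concrete, the following is a simplified Python sketch of a BLEU-style score. It is an illustration under stated assumptions, not the scoring tool used for the reported numbers: it assumes one reference per sentence, whitespace-tokenised input, n-grams up to length 4 and no smoothing, and the function names ngrams and bleu are hypothetical. It returns a value between 0 and 1; multiplying by 100 gives the scale used for the scores quoted above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with a single reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each n-gram count by how often it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

reference = "that is almost a personal record for me this autumn !".split()
hypothesis = "this is almost a personal record for me this autumn !".split()

exact = bleu([reference], [reference])
partial = bleu([hypothesis], [reference])
```

Output identical to the reference receives the maximum score, while a single changed word already lowers every n-gram precision. A valid translation that is worded differently from the human reference is penalised in the same way.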
Footnote 3: Available online at http://www.isi.edu/licensed-sw/pharaoh/

Source     Target language
language   da    de    el    en    es    fr    fi    it    nl    pt    sv
da         -     18.4  21.1  28.5  26.4  28.7  14.2  22.2  21.4  24.3  28.3
de         22.3  -     20.7  25.3  25.4  27.7  11.8  21.3  23.4  23.2  20.5
el         22.7  17.4  -     27.2  31.2  32.1  11.4  26.8  20.0  27.6  21.2
en         25.2  17.6  23.2  -     30.1  31.1  13.0  25.3  21.0  27.1  24.8
es         24.1  18.2  28.3  30.5  -     40.2  12.5  32.3  21.4  35.9  23.9
fr         23.7  18.5  26.1  30.0  38.4  -     12.6  32.4  21.1  35.3  22.6
fi         20.0  14.5  18.2  21.8  21.1  22.4  -     18.3  17.0  19.1  18.8
it         21.4  16.9  24.8  27.8  34.0  36.0  11.0  -     20.0  31.2  20.2
nl         20.5  18.3  17.4  23.0  22.9  24.6  10.3  20.0  -     20.7  19.0
pt         23.2  18.2  26.4  30.1  37.9  39.0  11.9  32.0  20.2  -     21.9
sv         30.3  18.9  22.8  30.2  28.6  29.7  15.3  23.9  21.9  25.9  -

Table 2: BLEU scores for the 110 translation systems trained on the Europarl corpus

Figure 3: Clustering of languages based on system scores: Language families emerge

4 Language Clustering

Intuitively, languages that are related are easier to translate into each other. We can underscore this with our SMT system scores. When clustering languages together based on their translation score, the 11 languages group together roughly along the lines of their language families, as shown in Figure 3. On the one side, you can find the Romance languages Spanish, French, Portuguese and Italian, on the other side the Germanic languages Danish, Swedish, English, Dutch and German. The close languages Danish and Swedish, as well as Dutch and German, are grouped together first. The graph is not perfect: One would expect Spanish and Portuguese to be joined first, but Spanish is first joined with French.

The clustering algorithm greedily groups languages together that translate into each other most easily. In the first step, Spanish and French are grouped together, since they have the highest translation score (38.4 and 40.2). In the next step Portuguese is added (37.9 and 35.9 with Spanish, 39.0 and 35.3 with French). Always, the two clusters of languages are joined that have the highest average translation score. A bias term of $-|c_1||c_2|/2$ is added to the score to bias toward the emergence of smaller clusters ($|c|$ is the size of the cluster $c$).

Language          From   Into   Diff
Danish (da)       23.4   23.3    0.0
German (de)       22.2   17.7   -4.5
Greek (el)        23.8   22.9   -0.9
English (en)      23.8   27.4   +3.6
Spanish (es)      26.7   29.6   +2.9
French (fr)       26.1   31.1   +5.1
Finnish (fi)      19.1   12.4   -6.7
Italian (it)      24.3   25.4   +1.1
Dutch (nl)        19.7   20.7   +1.1
Portuguese (pt)   26.1   27.0   +0.9
Swedish (sv)      24.8   22.1   -2.6

Table 3: Average translation scores for systems when translating from and into a language. Note that German (de) and English (en) are similarly difficult to translate from, but English is much easier to translate into.

5 Translation Direction

Some languages are more difficult to translate into than from. See Table 3 for details on this. The average score for systems that translate from German into each of the other 10 languages is 22.2, very similar to that for systems translating from English, 23.8. However, the scores for translating into these languages are vastly different: 17.7 for German vs. 27.4 for English.

One apparent reason for the difficulty of translating into a language is morphological richness. Noun phrases in German are marked with case, which manifests itself as different word endings at determiners, adjectives and nouns. Generating the right case markings is hard, especially since nothing in the translation model keeps track of the role of noun phrases and the trigram language model is fairly weak in this respect, since it only considers a three-word window.

Figure 4: Vocabulary size vs. BLEU score when translating into English (which has about 65,000 distinct word forms)

The poor performance of systems involving Finnish can also partly be attributed to its morphology, which is very agglutinative: Some elements that form individual words in English (determiners, prepositions) are included in the morphology. This increases the size of the vocabulary (the Finnish vocabulary is about five times as big as the English), leading to sparse data problems when collecting statistics for word and phrase translation. See Figure 4 for a comparison of BLEU scores when translating into English and vocabulary size.

Intuitively, translating from an information-rich into an information-poor language is easier than the other way around. Researchers have made similar observations about the better performance of Arabic–English SMT systems vs. Chinese–English SMT systems that are trained on similar amounts of training data and tested
on newswire: Translating from Arabic with its rich morphology is easier than translating from Chinese, which is even more frugal than English, often lacking determiners and plural or tense markers.

Note that translating into English is among the easiest. However, since the research community is primarily occupied with translation into English, interesting problems associated with translating into morphologically rich languages have largely been neglected.

6 Back Translation

The quality of machine translation systems is difficult to assess. This is especially true for monolingual speakers, who only know one language. When mainstream journalists report on the progress of machine translation systems, they frequently resort to a seemingly clever trick: They use an MT system to translate a sentence from English into a foreign language, and then use a reverse MT system to translate the sentence back into English. They then judge the quality of the MT systems by how well the English sentence is preserved.

Language   From   Into   Back
da         28.5   25.2   56.6
de         25.3   17.6   48.8
el         27.2   23.2   56.5
es         30.5   30.1   52.6
fi         21.8   13.0   44.4
it         27.8   25.3   49.9
nl         23.0   21.0   46.0
pt         30.1   27.1   53.6
sv         30.2   24.8   54.4

Table 4: Scores for mono-directional systems and back translation: Translating from English to Greek (system score 27.2) and back to English (system score 23.2) results in a BLEU score of 56.5 for the combined translation. The score is higher than for the combination English–Portuguese–English (53.6), although the mono-directional systems are better (30.1, 27.1).

This method is inspired by an urban legend involving a pair of MT systems between Russian and English. The legend proclaims that once someone fed an English–Russian MT system the bible verse "The spirit is willing, but the flesh is weak." When back translating the sentence with the Russian–English system, the system returned "The vodka is good but the meat is rotten."

How well does back translation indicate the translation performance of the MT systems involved? As Table 4 shows, not much.

First of all, while one would expect a degradation of the quality of a sentence when translated into a foreign language, and a further degradation when translated back, the BLEU scores tell a different story: For instance, the quality of the English–Greek system is 27.2 and 23.2 for the Greek–English system. However, translating the test set from English into Greek and back into English gives a BLEU score of 56.5, much higher than either system.

Note that this high score is an artifact of how the BLEU score works: It measures overlap with a reference translation. In the mono-directional systems the reference translation is a human translation. While the system output may be correct, the system may get punished for valid translation choices that differ from those of the human translator. In back translation, however, we compare against exactly the input sentence, which will be easier to match.

The more interesting point of Table 4 is: The back translation scores do not correlate well with the mono-directional system scores. Again, the English–Greek–English combination has system scores of 27.2 and 23.2, and a back translation score of 56.5. This is higher than 53.6, the score for the English–Portuguese–English combination, which has better mono-directional system scores: 30.1 and 27.1.

In conclusion, back translation does not only provide a false sense of the capabilities of MT systems, it is also a lazy and flawed method to compare systems. Back translation unfairly benefits from the ability to reverse errors, which only show up in the foreign language. To drive the point home: a system pair that does nothing, meaning, leaving all English words in place, will do perfectly in back translation, while being utterly useless in practice.

7 Conclusions

We described the acquisition of the Europarl corpus and its application in building statistical machine translation systems for 110 language pairs, maybe the largest number of machine translation systems built within three weeks, and the first serious effort at building such a system for, say, Greek to Finnish. Some sample output is in Figure 5.

The widely ranging quality of the different SMT systems for the different language pairs demonstrates the many different challenges for SMT research, which we have only touched upon. The field's primary occupation with translating a few languages into English ignores many of these challenges.

Finally, we hope that the availability of resources (corpora, tools) continues to make statistical machine translation an exciting and productive field.

References

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation. Computational Linguistics, 19(2):263–313.

Gale, W. and Church, K. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1).

Koehn, P. (2004). Pharaoh: a beam search decoder for statistical machine translation. In 6th Conference of the Association for Machine Translation in the Americas, AMTA, Lecture Notes in Computer Science. Springer.

Koehn, P. (2005). Shared task: Statistical machine translation for European languages. In ACL Workshop on Parallel Texts.
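Supplementary sketch: the per-language averages discussed in Section 5 can be recomputed from the scores in Table 2. The Python code below is only a reader's check, not part of the original experiments. The scores are copied from Table 2 (rows are source languages, columns are target languages), and the names LANGS, ROWS, SCORES, average_from and average_into are hypothetical.

```python
LANGS = ["da", "de", "el", "en", "es", "fr", "fi", "it", "nl", "pt", "sv"]

# BLEU scores from Table 2; None marks the diagonal (no such system).
ROWS = [
    [None, 18.4, 21.1, 28.5, 26.4, 28.7, 14.2, 22.2, 21.4, 24.3, 28.3],  # da
    [22.3, None, 20.7, 25.3, 25.4, 27.7, 11.8, 21.3, 23.4, 23.2, 20.5],  # de
    [22.7, 17.4, None, 27.2, 31.2, 32.1, 11.4, 26.8, 20.0, 27.6, 21.2],  # el
    [25.2, 17.6, 23.2, None, 30.1, 31.1, 13.0, 25.3, 21.0, 27.1, 24.8],  # en
    [24.1, 18.2, 28.3, 30.5, None, 40.2, 12.5, 32.3, 21.4, 35.9, 23.9],  # es
    [23.7, 18.5, 26.1, 30.0, 38.4, None, 12.6, 32.4, 21.1, 35.3, 22.6],  # fr
    [20.0, 14.5, 18.2, 21.8, 21.1, 22.4, None, 18.3, 17.0, 19.1, 18.8],  # fi
    [21.4, 16.9, 24.8, 27.8, 34.0, 36.0, 11.0, None, 20.0, 31.2, 20.2],  # it
    [20.5, 18.3, 17.4, 23.0, 22.9, 24.6, 10.3, 20.0, None, 20.7, 19.0],  # nl
    [23.2, 18.2, 26.4, 30.1, 37.9, 39.0, 11.9, 32.0, 20.2, None, 21.9],  # pt
    [30.3, 18.9, 22.8, 30.2, 28.6, 29.7, 15.3, 23.9, 21.9, 25.9, None],  # sv
]

# Map (source, target) to the score of that translation system.
SCORES = {
    (src, tgt): score
    for src, row in zip(LANGS, ROWS)
    for tgt, score in zip(LANGS, row)
    if score is not None
}

def average_from(lang):
    """Mean score of the systems translating from lang into the other languages."""
    values = [s for (src, _), s in SCORES.items() if src == lang]
    return sum(values) / len(values)

def average_into(lang):
    """Mean score of the systems translating into lang from the other languages."""
    values = [s for (_, tgt), s in SCORES.items() if tgt == lang]
    return sum(values) / len(values)

easiest = max(SCORES, key=SCORES.get)  # (source, target) with the highest score
hardest = min(SCORES, key=SCORES.get)  # (source, target) with the lowest score

for lang in ("de", "en"):
    print(lang, round(average_from(lang), 1), round(average_into(lang), 1))
```

For German and English, the rounded averages agree with the From and Into columns of Table 3 (22.2 and 17.7 for German, 23.8 and 27.4 for English).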