English-Chinese Cross-Language IR using Bilingual Dictionaries

6 Pages · 2001


Aitao Chen, Hailing Jiang, and Fredric Gey
School of Information Management and Systems
UC Data Archive & Technical Assistance (UC DATA)
University of California at Berkeley, CA 94720, USA
{aitao,hjiang1}@sims.berkeley.edu, [email protected]

Abstract

This report describes the English-Chinese cross-language retrieval experiments at Berkeley for the TREC-9 Cross-Language Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.

1 Introduction

In TREC-9 we only participated in the English-Chinese cross-language information retrieval (CLIR) track. We performed one Chinese monolingual retrieval run and three English-Chinese cross-language retrieval runs. Our approach to the cross-language retrieval was to translate the English topics into Chinese by dictionary lookup. An English/Chinese bilingual wordlist compiled by the Linguistic Data Consortium and an online English/Chinese bilingual dictionary were used in our cross-language retrieval experiments.

The four official runs we submitted are BRKCCA1, BRKECA1, BRKECA2, and BRKECM1. BRKCCA1 is a monolingual run, the other three being English-Chinese cross-language runs. The BRKECA1 and BRKECA2 runs are automatic, while BRKECM1 is manual. For all four runs, the same document ranking algorithm based on the logistic regression technique was used. The details of our ranking algorithm can be found in [2].

2 Word Segmentation

The documents and queries in most text retrieval systems are indexed by the words occurring in the text. For languages such as English, in which words are separated by blank space, it is simple to index text by words. To index Chinese text by words, however, one first needs to identify words in the text, since word boundaries are not explicitly marked in Chinese text. There is a large literature on Chinese word segmentation. We will not attempt to survey this field. Two recent papers
on Chinese word segmentation are presented by Dai and Loh in [4] and Sun et al. in [9]. Both corpus-based statistical methods and dictionary-based methods have been developed to break a sentence into individual words. If one has a Chinese word dictionary, one could match the text against the dictionary and output as a word the longest sequence of characters that matches a dictionary entry. When a dictionary is not available, one could collect a large amount of Chinese text and attempt to discover words by examining the occurrence patterns of the characters in the corpus. A major problem with dictionary-based word segmentation methods is dictionary coverage. The corpus-based or statistical methods can be easily applied to a new collection of Chinese text since they do not use word dictionaries.

Overlapping bigram indexing is simple, efficient, and effective as well [7]. One problem with bigram indexing is that the index file produced is two to three times as big as the raw text. Here we refer to single Chinese characters as unigrams and two-character Chinese terms as bigrams.

We present a method that is as efficient and effective as bigram indexing, but produces a much smaller index file than overlapping bigram indexing. Our method is similar to but less general than the work presented by Ge et al. in [5]. Our method breaks a sentence into unigrams and bigrams by maximizing the probability of the sentence. Here we assume that unigrams and bigrams occur independently in the corpus. For a segmented sentence $S = w_1 w_2 \cdots w_n$, if we assume words occur independently, then the probability of the sentence $S$ can be expressed as follows:

$$P(S) = P(w_1 w_2 \cdots w_n) \qquad (1)$$

$$\phantom{P(S)} = \prod_{i=1}^{n} P(w_i) \qquad (2)$$

Since we do not know how to break a sentence into words in advance, we consider all possible ways of segmenting a sentence and estimate the probability of every segmentation of the sentence. We can then use the segmentation of the highest probability to break up the sentence into words. The number of possible ways to break a sentence of $n$ characters into words is $2^{n-1}$ when a word can be arbitrarily long. However, when a word is limited to one or two characters, the number of possible ways to segment a sentence of $n$ characters can be expressed by the recurrence relation $N(n) = N(n-1) + N(n-2)$, where $N(n)$ is the number of ways to break a sentence of $n$ characters into one- or two-character words and $N(0) = 0$, $N(1) = 1$, $N(2) = 2$.

When a sentence is short, one can easily enumerate all possible ways of segmenting the sentence and compute their associated probabilities, then choose the segmentation of the highest probability. But when a sentence is long, the number of possible segmentations is exponential, and it is no longer practical to enumerate all possible ways of breaking the sentence and estimate their probabilities. However, one can apply a dynamic programming technique to find the most likely segmentation efficiently without computing the probabilities of all possible segmentations of a sentence. The best way of breaking a sentence of $n$ characters can be recursively expressed as follows:

$$P_{\max}(c_{1,n}) = \max\bigl( P_{\max}(c_{1,n-1})\, P(c_n),\; P_{\max}(c_{1,n-2})\, P(c_{n-1} c_n) \bigr)$$

where $c_{1,n} = c_1 c_2 \cdots c_n$ and $P_{\max}(c_{1,n})$ is the maximum probability of segmenting a sentence of $n$ characters into one- or two-character words. The probability of a one-character word (i.e., unigram) is estimated by $P(c_i) = N(c_i)/N$, and the probability of a two-character word (i.e., bigram) is estimated by $P(c_i c_{i+1}) = N(c_i c_{i+1})/N$, where $N(c_i)$ is the number of times that character $c_i$ occurs in the corpus, $N(c_i c_{i+1})$ is the number of times that the string $c_i c_{i+1}$ occurs in the corpus, and $N$ is the total number of occurrences of all single-character terms and all two-character terms in the corpus. A sentence is broken into one- or two-character
terms using the most likely segmentation. For example, for the sentence of three characters, $c_1 c_2 c_3$, the probability of the sentence with the three different possible ways of segmentation
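
The counting recurrence and the dynamic-programming recursion described in this section can be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not the implementation used for the TREC-9 runs: the function names `count_segmentations` and `segment`, the use of log-probabilities to avoid numeric underflow, and the `floor` pseudo-count for terms that never occur in the corpus are all hypothetical choices made for this sketch, and the counts in the usage note are toy values.

```python
import math


def count_segmentations(n):
    """Number of ways to split n characters into one- or two-character words,
    following N(n) = N(n-1) + N(n-2) with N(1) = 1 and N(2) = 2."""
    if n <= 0:
        return 0
    a, b = 1, 2  # N(1), N(2)
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, a + b
    return b


def segment(sentence, counts, total, floor=0.5):
    """Return the most likely split of `sentence` into unigrams and bigrams.

    counts: dict mapping a one- or two-character term to its corpus frequency.
    total:  N, the total number of unigram and bigram occurrences.
    floor:  hypothetical pseudo-count for unseen terms (an assumption of this
            sketch; the text does not say how unseen terms are handled).
    """
    def logp(term):
        # Estimated probability N(term) / N, in log space.
        return math.log(counts.get(term, floor) / total)

    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[i]: max log-probability of the first i characters
    back = [0] * (n + 1)            # back[i]: length (1 or 2) of the last word in that split
    for i in range(1, n + 1):
        for size in (1, 2):
            if i - size < 0:
                continue
            score = best[i - size] + logp(sentence[i - size:i])
            if score > best[i]:
                best[i], back[i] = score, size

    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[i - back[i]:i])
        i -= back[i]
    return words[::-1]
```

With the toy counts `{"a": 5, "b": 3, "c": 4, "ab": 6, "bc": 1}` and `total=19`, the three candidate segmentations of `"abc"` have probabilities 60/6859, 24/361, and 5/361, so `segment("abc", ...)` selects `["ab", "c"]`. The letters stand in for Chinese characters. The loop does a constant amount of work per character, so the running time is linear in sentence length even though the number of candidate segmentations grows exponentially.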

