English-Chinese Cross-Language IR using Bilingual Dictionaries


6 Pages · 2001 · 56 KB · English


Aitao Chen, Hailing Jiang, and Fredric Gey
School of Information Management and Systems
UC Data Archive & Technical Assistance (UC DATA)
University of California at Berkeley, CA 94720, USA
{aitao, hjiang1}@sims.berkeley.edu, [email protected]

Abstract

This report describes the English-Chinese cross-language retrieval experiments at Berkeley for the TREC-9 Cross-Language Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.

1 Introduction

In TREC-9 we participated only in the English-Chinese cross-language information retrieval (CLIR) track. We performed one Chinese monolingual retrieval run and three English-Chinese cross-language retrieval runs. Our approach to cross-language retrieval was to translate the English topics into Chinese by dictionary lookup. An English/Chinese bilingual wordlist compiled by the Linguistic Data Consortium and an online English/Chinese bilingual dictionary were used in our cross-language retrieval experiments. The four official runs we submitted are BRKCCA1, BRKECA1, BRKECA2, and BRKECM1. BRKCCA1 is a monolingual run; the other three are English-Chinese cross-language runs. The BRKECA1 and BRKECA2 runs are automatic, while BRKECM1 is manual. For all four runs, the same document ranking algorithm, based on the logistic regression technique, was used. The details of our ranking algorithm can be found in [2].

2 Word Segmentation

The documents and queries in most text retrieval systems are indexed by the words occurring in the text. For languages such as English, in which words are separated by blank space, it is simple to index text by words. To index Chinese text by words, however, one first needs to identify words in the text, since word boundaries are not explicitly marked in Chinese text. There is a large literature on Chinese word segmentation. We will not attempt to survey this field. Two recent papers
on Chinese word segmentation are presented by Dai and Loh in [4] and Sun et al. in [9]. Both corpus-based statistical methods and dictionary-based methods have been developed to break a sentence into individual words. If one has a Chinese word dictionary, one could match the text against the dictionary and output as a word the longest sequence of characters that matches a dictionary entry. When a dictionary is not available, one could collect a large amount of Chinese text and attempt to discover words by examining the occurrence patterns of the characters in the corpus. A major problem with dictionary-based word segmentation methods is dictionary coverage. The corpus-based or statistical methods can be easily applied to a new collection of Chinese text since they do not use word dictionaries. Overlapping bigram indexing is simple, efficient, and effective as well [7]. One problem with bigram indexing is that the index file produced is two to three times as big as the raw text. Here we refer to single Chinese characters as unigrams and two-character Chinese terms as bigrams.

We present a method that is as efficient and effective as bigram indexing but produces a much smaller index file than overlapping bigram indexing. Our method is similar to, but less general than, the work presented by Ge et al. in [5]. Our method breaks a sentence into unigrams and bigrams by maximizing the probability of the sentence. Here we assume that unigrams and bigrams occur independently in the corpus. For a segmented sentence $S = w_1 w_2 \cdots w_m$, if we assume words occur independently, then the probability of the sentence $S$ can be expressed as follows:

$$
\begin{aligned}
p(S) &= p(w_1 w_2 \cdots w_m) && (1) \\
     &= \prod_{i=1}^{m} p(w_i) && (2)
\end{aligned}
$$

Since we do not know in advance how to break a sentence into words, we consider all possible ways of segmenting a sentence and estimate the probability of every segmentation. We can then use the segmentation of the highest probability to break up the sentence into words. The number of possible ways to break a sentence of $n$ characters into words is $2^{n-1}$ when a word can be arbitrarily long. However, when a word is limited to one or two characters, the number of possible ways to segment a sentence of $n$ characters can be expressed by the recurrence relation $N(n) = N(n-1) + N(n-2)$ for $n \geq 3$, where $N(n)$ is the number of ways to break a sentence of $n$ characters into one- or two-character words, and $N(0) = 0$, $N(1) = 1$, $N(2) = 2$.

When a sentence is short, one can easily enumerate all possible ways of segmenting the sentence, compute their associated probabilities, and choose the segmentation of the highest probability. But when a sentence is long, the number of possible segmentations is exponential, and it is no longer practical to enumerate all possible ways of breaking the sentence and estimate their probabilities. However, one can apply a dynamic programming technique to find the most likely segmentation efficiently without computing the probabilities of all possible segmentations of a sentence. The best way of breaking a sentence of $n$ characters can be recursively expressed as follows:

$$
p(S_n) = \max\bigl( p(S_{n-1})\, p(c_n),\; p(S_{n-2})\, p(c_{n-1} c_n) \bigr)
$$

where $S_n = c_1 c_2 \cdots c_n$ and $p(S_n)$ is the maximum probability of segmenting a sentence of $n$ characters into one- or two-character words. The probability of a one-character word (i.e., a unigram) is estimated by $p(c_i) = N(c_i)/N$, and the probability of a two-character word (i.e., a bigram) is estimated by $p(c_i c_j) = N(c_i c_j)/N$, where $N(c_i)$ is the number of times that character $c_i$ occurs in the corpus, $N(c_i c_j)$ is the number of times that the string $c_i c_j$ occurs in the corpus, and $N$ is the total number of times that any single-character term or any two-character term occurs in the corpus. A sentence is broken into one- or two-character terms using the most likely segmentation. For example, for a sentence of three characters, $c_1 c_2 c_3$, the probability of the sentence with the three different possible ways of segmentation
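
The counting recurrence and the dynamic-programming segmentation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`count_segmentations`, `collect_counts`, `segment`), the use of log probabilities to avoid underflow, and the `floor` pseudo-count for terms never seen in the corpus are all hypothetical choices that the text does not specify. Latin letters stand in for Chinese characters in the demo.

```python
import math
from collections import Counter


def count_segmentations(n):
    """Number of ways to split n characters into one- or two-character terms."""
    if n <= 0:
        return 0  # convention stated in the text for a sentence of 0 characters
    a, b = 1, 2  # counts for sentences of 1 and 2 characters
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def collect_counts(corpus):
    """Count every single character and every adjacent character pair."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    # Total occurrences of all one-character and two-character terms.
    total = sum(unigrams.values()) + sum(bigrams.values())
    return unigrams, bigrams, total


def segment(sentence, unigrams, bigrams, total, floor=0.5):
    """Most probable split of `sentence` into one- or two-character terms."""

    def logp(term, counts):
        # `floor` is an assumed pseudo-count for unseen terms (not from the text).
        return math.log(max(counts.get(term, 0), floor) / total)

    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[k]: max log-probability of first k chars
    back = [0] * (n + 1)            # back[k]: length of the last term in that split
    for k in range(1, n + 1):
        # Option 1: the last term is the single character c_k.
        best[k] = best[k - 1] + logp(sentence[k - 1], unigrams)
        back[k] = 1
        # Option 2: the last term is the two-character string c_{k-1} c_k.
        if k >= 2:
            cand = best[k - 2] + logp(sentence[k - 2:k], bigrams)
            if cand > best[k]:
                best[k], back[k] = cand, 2
    # Walk the back-pointers from the end to recover the terms.
    terms, k = [], n
    while k > 0:
        terms.append(sentence[k - back[k]:k])
        k -= back[k]
    return terms[::-1]


if __name__ == "__main__":
    u, b, t = collect_counts(["abab", "abc", "ab", "c"])
    print(count_segmentations(3))   # 3
    print(segment("abc", u, b, t))  # ['ab', 'c']
```

Each position is visited once and only two candidates are compared, so the running time is linear in the sentence length, whereas enumerating all segmentations grows exponentially.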

