English-Chinese Cross-Language IR using Bilingual Dictionaries

6 Pages · 2001 · 56 KB · English

Aitao Chen, Hailing Jiang, and Fredric Gey
School of Information Management and Systems
UC Data Archive & Technical Assistance (UCDATA)
University of California at Berkeley, CA 94720, USA
{aitao, hjiang1}@sims.berkeley.edu, [email protected]

Abstract

This report describes the English-Chinese cross-language retrieval experiments at Berkeley for the TREC-9 Cross-Language Information Retrieval track. We present a simple and effective Chinese word segmentation method and compare the cross-language retrieval performance of two bilingual dictionaries for query translation.

1 Introduction

In TREC-9 we participated only in the English-Chinese cross-language information retrieval (CLIR) track. We performed one Chinese monolingual retrieval run and three English-Chinese cross-language retrieval runs. Our approach to cross-language retrieval was to translate the English topics into Chinese by dictionary lookup. An English/Chinese bilingual wordlist compiled by the Linguistic Data Consortium and an online English/Chinese bilingual dictionary were used in our cross-language retrieval experiments. The four official runs we submitted are BRKCCA1, BRKECA1, BRKECA2, and BRKECM1. BRKCCA1 is a monolingual run; the other three are English-Chinese cross-language runs. The BRKECA1 and BRKECA2 runs are automatic, while BRKECM1 is manual. For all four runs, the same document ranking algorithm, based on logistic regression, was used. Details of our ranking algorithm can be found in [2].

2 Word Segmentation

The documents and queries in most text retrieval systems are indexed by the words occurring in the text. For languages such as English, in which words are separated by blank space, it is simple to index text by words. To index Chinese text by words, however, one first needs to identify the words in the text, since word boundaries are not explicitly marked in Chinese text. There is a large literature on Chinese word segmentation, and we will not attempt to survey this field. Two recent papers on Chinese word segmentation are presented by Dai and Loh [4] and Sun et al. [9].

Both corpus-based statistical methods and dictionary-based methods have been developed to break a sentence into individual words. If one has a Chinese word dictionary, one could match the text against the dictionary and output as a word the longest sequence of characters that matches a dictionary entry. When a dictionary is not available, one could collect a large amount of Chinese text and attempt to discover words by examining the occurrence patterns of the characters in the corpus. A major problem with dictionary-based word segmentation methods is dictionary coverage. The corpus-based or statistical methods can be easily applied to a new collection of Chinese text since they do not use word dictionaries. Overlapping bigram indexing is simple, efficient, and effective as well [7]. One problem with bigram indexing is that the index file produced is two to three times as big as the raw text. Here we refer to single Chinese characters as unigrams and two-character Chinese terms as bigrams.

We present a method that is as efficient and effective as bigram indexing, but produces a much smaller index file than overlapping bigram indexing. Our method is similar to, but less general than, the work presented by Ge et al. [5]. Our method breaks a sentence into unigrams and bigrams by maximizing the probability of the sentence. Here we assume that unigrams and bigrams occur independently in the corpus. For a segmented sentence $S = w_1 w_2 \cdots w_m$, if we assume words occur independently, then the probability of the sentence $S$ can be expressed as follows:

$$P(S) = P(w_1 w_2 \cdots w_m) \tag{1}$$
$$\phantom{P(S)} = \prod_{i=1}^{m} p(w_i) \tag{2}$$

Since we do not know how to break a sentence into words in advance, we consider all possible ways of segmenting a sentence and estimate the probability of every segmentation of the given sentence. We can then use the segmentation of the highest probability to break up the sentence into words. The number of possible ways to break a sentence of $n$ characters into words is $2^{n-1}$ when a word can be arbitrarily long. However, when a word is limited to one or two characters, the number of possible ways to segment a sentence of $n$ characters can be expressed by the recurrence relation $N(n) = N(n-1) + N(n-2)$, where $N(n)$ is the number of ways to break a sentence of $n$ characters into one- or two-character words and $N(0) = 0$, $N(1) = 1$, $N(2) = 2$ (so the recurrence applies for $n \geq 3$).

When a sentence is short, one can easily enumerate all possible ways of segmenting the sentence, compute their associated probabilities, and then choose the segmentation of the highest probability. But when a sentence is long, the number of possible segmentations grows exponentially, and it is no longer practical to enumerate all possible ways of breaking the sentence and estimate their probabilities. However, one can apply dynamic programming to find the most likely segmentation efficiently without computing the probabilities of all possible segmentations of a sentence. The best way of breaking a sentence of $n$ characters can be expressed recursively as follows:

$$P(S_n) = \max\bigl( P(S_{n-1})\, p(c_n),\; P(S_{n-2})\, p(c_{n-1} c_n) \bigr)$$

where $S_n = c_1 c_2 \cdots c_n$ and $P(S_n)$ is the maximum probability of segmenting a sentence of $n$ characters into one- or two-character words. The probability of a one-character word (i.e., unigram) is estimated by $p(c_i) = N(c_i)/N$, and the probability of a two-character word (i.e., bigram) is estimated by $p(c_i c_j) = N(c_i c_j)/N$, where $N(c_i)$ is the number of times that character $c_i$ occurs in the corpus, $N(c_i c_j)$ is the number of times that the string $c_i c_j$ occurs in the corpus, and $N$ is the total number of occurrences of all single-character terms and all two-character terms in the corpus. A sentence is broken into one- or two-character terms using the most likely segmentation.

For example, for the sentence of three characters, $c_1 c_2 c_3$, the probability of the sentence with the three different possible ways of segmentation
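The counting recurrence and the dynamic-programming segmentation described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names (`count_segmentations`, `train_counts`, `segment`) are hypothetical, log probabilities are used to avoid underflow, and the `floor` pseudo-count for terms unseen in the corpus is an assumption, since the text does not say how unseen unigrams or bigrams are handled.

```python
import math
from collections import Counter


def count_segmentations(n):
    """Number of ways to split n characters into one- or two-character words."""
    if n <= 2:
        return max(n, 0)  # base cases from the text: N(0)=0, N(1)=1, N(2)=2
    prev2, prev1 = 1, 2  # N(1), N(2)
    for _ in range(3, n + 1):
        prev2, prev1 = prev1, prev1 + prev2  # N(n) = N(n-1) + N(n-2)
    return prev1


def train_counts(corpus_sentences):
    """Count every single character and every overlapping two-character string."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(sentence)  # unigram occurrences
        counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return counts


def segment(sentence, counts, floor=0.5):
    """Most likely split of `sentence` into one- or two-character terms.

    `floor` is an assumed pseudo-count for unseen terms (not from the paper).
    """
    total = sum(counts.values())  # all unigram plus bigram occurrences

    def log_p(term):
        return math.log(counts.get(term, floor) / total)

    n = len(sentence)
    best = [0.0] + [-math.inf] * n  # best[i]: max log probability of first i chars
    back = [0] * (n + 1)            # back[i]: length of the last term in that split
    for i in range(1, n + 1):
        for length in (1, 2):
            if length > i:
                continue
            score = best[i - length] + log_p(sentence[i - length:i])
            if score > best[i]:
                best[i], back[i] = score, length

    # Walk the back-pointers from the end to recover the terms.
    words = []
    i = n
    while i > 0:
        words.append(sentence[i - back[i]:i])
        i -= back[i]
    return words[::-1]
```

With a toy corpus such as `train_counts(["北京", "北京", "北京", "人"])`, calling `segment("北京人", counts)` keeps the frequent bigram together and splits off the final character. The run time is linear in the sentence length because each position considers only two candidate terms.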

