Mining Web Snippets to Answer List Questions

Alejandro Figueroa, Günter Neumann
Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI,
Stuhlsatzenhausweg 3, D-66123, Saarbrücken, Germany
Email: ffigueroa jneumann [email protected]

The work presented here was partially supported by a research grant from the German Federal Ministry of Education, Science, Research and Technology (BMBF) to the DFKI project HyLaP (FKZ: 01 IW F02) and the EC-funded project QALL-ME.

Copyright © 2007, Australian Computer Society, Inc. This paper appeared at the Second Workshop on Integrating AI and Data Mining (AIDM 2007), Gold Coast, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 84, Kok-Leong Ong, Junbin Gao and Wenyuan Li, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Abstract

This paper presents ListWebQA, a question answering system that is aimed specifically at extracting answers to list questions exclusively from web snippets. Answers are identified in web snippets by means of their semantic and syntactic similarities. Initial results show that they are a promising source of answers to list questions.

Keywords: Web Mining, Question Answering, List Questions, Distinct Answers

1 Introduction

In recent years, search engines have markedly improved their power of indexing, provoked by the sharp increase in the number of documents published on the Internet, in particular, HTML pages. The great success of search engines in linking users to nearly all the sources that satisfy their information needs has caused an explosive growth in their number, and analogously, in their demands for smarter ways of searching and presenting the requested information.

Nowadays, one of these increasing demands is finding answers to natural language questions. Most of the research into this area has been carried out under the umbrella of Question Answering Systems (QAS), especially in the context of the Question Answering track of the Text REtrieval Conference (TREC).

In TREC, QAS are encouraged to answer several kinds of questions, whose difficulty has been systematically increasing during the years. In 2001, TREC incorporated list questions, such as "What are 9 novels written by John Updike?" and "Name 8 Chuck Berry songs", into the question answering track. Simply stated, answering this sort of question consists in discovering a set of different answers in only one or across several documents. QAS must therefore efficiently process a wealth of documents, and identify as well as remove redundant responses in order to satisfactorily answer the question.

Modest results obtained by QAS in TREC show that dealing with this kind of question is particularly difficult (Voorhees 2001, 2002, 2003, 2004), making the research in this area very challenging. Usually, QAS tackle list questions by making use of pre-compiled, often manually checked, lists (i.e. famous persons and countries) and online encyclopedias, like Wikipedia and Encarta, but with moderate success. Research has been hence conducted towards exploiting full web documents, especially their lists and tables.

This paper presents our research in progress ("Greenhouse work") into list question answering on the web. Specifically, it presents ListWebQA, our list question answering system that is aimed at extracting answers to list questions directly from the brief descriptions of websites returned by search engines, called web snippets. ListWebQA is an extension of our current web question answering system¹, which is aimed essentially at mining web snippets for discovering answers to natural language
questions, including factoid and definition questions (Figueroa and Atkinson 2006, Figueroa and Neumann 2006, 2007).

The motivation behind the use of web snippets as a source of answers is threefold: (a) to avoid, whenever possible, the costly retrieval and processing of full documents, (b) to the user, web snippets are the first view of the response, thus highlighting answers would make them more informative, and (c) answers taken from snippets can be useful for determining the most promising documents, that is, where most of the answers are likely to be. An additional strong motivation is that the absence of answers across retrieved web snippets can force QAS a change in

