Web-Scale Language-Independent Cataloging of Noisy Product Listings for E-Commerce

11 Pages · 2017 · 6.59 MB · English

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 969–979, Valencia, Spain, April 3-7, 2017. ©2017 Association for Computational Linguistics

Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta
Rakuten Institute of Technology, Boston, MA, 02110 USA
{pradiptodas, tsyandixia, aaronlevine} [email protected]
{giuseppedifabbrizio, ankurdatta} [email protected]

Abstract

The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. However, manual and rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for categorizing listings in both English and Japanese product catalogs. We show empirically that a combination of words from product titles, navigational breadcrumbs, and list prices, when available, improves results significantly. We outline a novel method using correspondence topic models and a lightweight manual process to reduce noise from mislabeled data in the training set. We contrast linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs), and show that GBTs and CNNs yield the highest gains in error reduction. Finally, we show GBTs applied in a language-agnostic way on a large-scale Japanese e-commerce dataset have improved taxonomy categorization performance over current state-of-the-art based on deep belief network models.

1 Introduction

Web-scale e-commerce catalogs are typically exposed to potential buyers using a taxonomy categorization approach where each product is categorized by a label from the taxonomy tree. Most e-commerce search engines use taxonomy labels to optimize query results and match relevant listings to users' preferences (Ganti et al., 2010). To illustrate the general concept, consider Fig. 1. A merchant pushes new men's clothing listings to an online catalog infrastructure, which then organizes the listings into a taxonomy tree. When a user searches for a denim brand, “DSquared2”, the search engine first has to understand that the user is searching for items in the “Jeans” category. Then, if the specific items cannot be found in the inventory, other relevant items in the “Jeans” category are returned in the search results to encourage the user to browse further.

Figure 1: E-commerce platform using taxonomy categorization to understand query intent, match merchant listings to potential buyers as well as to prevent buyers from navigating away on search misses.

However, achieving good product categorization for e-commerce marketplaces is challenging. Commercial product taxonomies are organized in tree structures three to ten levels deep, with thousands of leaf nodes (Sun et al., 2014; Shen et al., 2012b; Pyo et al., 2016; McAuley et al., 2015). Unavoidable human errors creep in while uploading data using such large taxonomies, contributing to mislabeled listing noise in the data set. Even eBay, where merchants have a unified taxonomy, reported a 15% error rate in categorization (Shen et al., 2012b). Furthermore, most e-commerce companies receive millions of new listings per month from hundreds of merchants composed of wildly different formats, descriptions, prices and metadata for the same products. For instance, the two listings, “University of Alabama all-cotton non iron dress shirt” and “U of Alabama 100% cotton no-iron regular fit shirt” by two merchants refer to the same product.

E-commerce systems trade off between classifying a listing directly into one of thousands of leaf node categories (Sun et al., 2014; ?) and splitting the taxonomy at predefined depths (Shen et al., 2011; ?) with smaller subtree models. In the latter case, there is another trade-off between the number of hierarchical subtrees and the propagation of error in the prediction cascade. Similar to (Shen et al., 2012b; Cevahir and Murakami, 2016), we classify product listings in two or three steps, depending on the taxonomy size. First, we predict the top-level category and then classify the listings using another one or two levels of subtree models selected by the previous predictions. For our large-scale taxonomy categorization experiments on product listings, we use two in-house datasets, a publicly available Amazon product dataset (McAuley et al., 2015), and a publicly available Japanese product dataset.
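
The two-step prediction cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper evaluates linear models, GBTs and CNNs, whereas this example swaps in a tiny multinomial Naive Bayes classifier so that it runs with the Python standard library alone. The names `NaiveBayes`, `train_cascade`, `categorize` and the toy listings are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens.

    Assumption: a stand-in for the stronger classifiers compared in the paper.
    """

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the label
            for tok in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_cascade(listings):
    """Train one top-level model plus one subtree model per top-level category.

    listings: iterable of (title, top_level_category, leaf_category) triples.
    """
    listings = list(listings)
    top_model = NaiveBayes().fit(
        [title for title, _, _ in listings],
        [top for _, top, _ in listings],
    )
    by_top = defaultdict(list)
    for title, top, leaf in listings:
        by_top[top].append((title, leaf))
    subtree_models = {
        top: NaiveBayes().fit([t for t, _ in rows], [leaf for _, leaf in rows])
        for top, rows in by_top.items()
    }
    return top_model, subtree_models


def categorize(title, top_model, subtree_models):
    """Step 1: predict the top-level category. Step 2: use its subtree model."""
    top = top_model.predict(title)
    leaf = subtree_models[top].predict(title)
    return top, leaf


# Hypothetical toy data; real catalogs have thousands of leaf categories.
TOY_LISTINGS = [
    ("slim fit denim jeans", "Clothing", "Jeans"),
    ("stretch denim jeans blue", "Clothing", "Jeans"),
    ("cotton non iron dress shirt", "Clothing", "Shirts"),
    ("regular fit cotton shirt", "Clothing", "Shirts"),
    ("wireless bluetooth headphones", "Electronics", "Headphones"),
    ("noise cancelling headphones", "Electronics", "Headphones"),
    ("usb charging cable", "Electronics", "Cables"),
    ("hdmi cable 2m", "Electronics", "Cables"),
]

top_model, subtree_models = train_cascade(TOY_LISTINGS)
print(categorize("blue denim jeans", top_model, subtree_models))
```

Because the subtree model is selected by the top-level prediction, a mistake at the first step cannot be corrected at the second. This is the error propagation that the introduction weighs against the number of hierarchical subtrees.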
