Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically


9 Pages · 2003 · 364 KB · English



Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically

Steve Webb
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

James Caverlee
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

Calton Pu
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

CEAS 2006 - Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California USA

ABSTRACT

Just as email spam has negatively impacted the user messaging experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Fundamentally, Web spam is designed to pollute search engines and corrupt the user experience by driving traffic to particular spammed Web pages, regardless of the merits of those pages. In this paper, we identify an interesting link between email spam and Web spam, and we use this link to propose a novel technique for extracting large Web spam samples from the Web. Then, we present the Webb Spam Corpus, a first-of-its-kind, large-scale, and publicly available Web spam data set that was created using our automated Web spam collection method. The corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set. Finally, we identify several application areas where the Webb Spam Corpus may be especially helpful. Interestingly, since the Webb Spam Corpus bridges the worlds of email spam and Web spam, we note that it can be used to aid traditional email spam classification algorithms through an analysis of the characteristics of the Web pages referenced by email messages.

1. INTRODUCTION

As the Web grew to become the primary means for sharing information and supporting online commerce, the problems associated with Web spam also grew. Web spam is defined as Web pages that are created to manipulate search engines and deceive Web users [12, 13]. Just as email spam has negatively impacted the email user experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Web spam is regarded as one of the most important challenges facing search engines and Web users [12, 14], and recent studies suggest that it accounts for a significant portion of all Web content, including 8% of Web pages [11] and 18% of Web sites [13].

Although the problems posed by Web spam have been widely acknowledged, we believe research progress has been limited by the lack of a publicly available Web spam corpus. In previous Web spam research [2, 6, 7, 9, 10, 11, 13, 21], proposed solutions have been evaluated on relatively small Web spam data sets (on the order of hundreds of Web pages). In many cases, these previous researchers had access to large samples of Web data (on the order of millions of pages); however, the onerous task of hand labeling each Web page made it impossible for them to evaluate even a small fraction of their data. Given the size and dynamic nature of Web content, a manual approach is neither scalable nor sensible. Additionally, none of the previously cited Web spam data sets have been made publicly available, so the reproducibility of current Web spam research results is somewhat limited. Similar to email spam research on the experimental evaluation of spam filters using large spam corpora such as the Enron [15] and SpamArchive [20] corpora, future Web spam research depends on the availability of large collections of Web spam data.

Thus, the first contribution of the paper is a fully automatic Web spam collection technique for extracting large Web spam samples. Our new collection technique is based on the observation that the URLs found in email spam messages are reliable indicators of Web spam pages. Specifically, we extract the URLs from spam messages, cleanse those URLs of false positives (i.e., URLs for legitimate sites), and collect the corresponding Web pages. Given the dynamic nature of the Web, this collection method is extremely useful because it can be configured to maintain up-to-date Web spam data samples.

The second contribution of the paper is the Webb Spam Corpus, a large-scale and publicly available Web spam data set that was created using our automated Web spam collection method (1). This corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set. We describe interesting characteristics of this corpus, and we encourage Web spam and email spam researchers to use it in their work.

(1) The Webb Spam Corpus can be found at http://www.webbspamcorpus.org/

The third part of the paper outlines the usefulness of the Webb Spam Corpus in several application areas. We summarize related research efforts and describe how our Web spam collection technique and corpus could immediately enhance this previous work. Then, we present other interesting application scenarios that we believe could benefit from our approach. One particularly interesting application area is email filtering. Since the Webb Spam Corpus bridges the worlds of email spam and Web spam, we note that it can be used to aid traditional email spam classification algorithms through an analysis of the characteristics of the Web pages referenced by email messages.

The rest of the paper is organized
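
The introduction describes the collection technique as three steps: extract the URLs from spam messages, cleanse them of URLs for legitimate sites, and collect the corresponding Web pages. The following is a minimal Python sketch of the first two steps only, not the authors' implementation. The regular expression, the function names, and the `LEGITIMATE_HOSTS` allowlist are assumptions made for illustration; this excerpt does not say how the authors actually identify false positives.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist of legitimate hosts (an assumption for this sketch;
# the excerpt does not describe the paper's actual cleansing procedure).
LEGITIMATE_HOSTS = {"www.w3.org", "www.microsoft.com"}

# Simple pattern for http/https URLs embedded in message text or HTML.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_urls(message_body):
    """Return the URLs found in one message body, in order of appearance."""
    # Strip trailing punctuation that often follows a URL in running text.
    return [u.rstrip(".,;:!?") for u in URL_PATTERN.findall(message_body)]


def cleanse(urls, legitimate_hosts=LEGITIMATE_HOSTS):
    """Drop URLs on allowlisted hosts and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        if not host or host in legitimate_hosts or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


def candidate_spam_urls(messages):
    """Steps 1 and 2: extract URLs from every message, then cleanse them."""
    all_urls = []
    for body in messages:
        all_urls.extend(extract_urls(body))
    return cleanse(all_urls)


if __name__ == "__main__":
    sample = [
        "Cheap pills at http://pills.example.net/buy now!",
        'See <a href="http://www.w3.org/TR/html4">spec</a> and http://pills.example.net/buy.',
    ]
    for url in candidate_spam_urls(sample):
        print(url)  # step 3 (fetching each page) is left to the caller
```

Fetching the pages is deliberately left out so that the sketch runs without network access. A periodic job that re-runs these steps over newly received spam would be one way to keep samples up to date, as the text suggests.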

