Introducing the Webb Spam Corpus: Using Email Spam to Identify Web

9 Pages · 2003 · 364 KB · English

Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically

Steve Webb
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
w eb [email protected]

James Caverlee
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
ca ver [email protected]

Calton Pu
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

CEAS 2006, Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California USA

ABSTRACT

Just as email spam has negatively impacted the user messaging experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Fundamentally, Web spam is designed to pollute search engines and corrupt the user experience by driving traffic to particular spammed Web pages, regardless of the merits of those pages. In this paper, we identify an interesting link between email spam and Web spam, and we use this link to propose a novel technique for extracting large Web spam samples from the Web. Then, we present the Webb Spam Corpus, a first-of-its-kind, large-scale, and publicly available Web spam data set that was created using our automated Web spam collection method. The corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set. Finally, we identify several application areas where the Webb Spam Corpus may be especially helpful. Interestingly, since the Webb Spam Corpus bridges the worlds of email spam and Web spam, we note that it can be used to aid traditional email spam classification algorithms through an analysis of the characteristics of the Web pages referenced by email messages.

1. INTRODUCTION

As the Web grew to become the primary means for sharing information and supporting online commerce, the problems associated with Web spam also grew. Web spam is defined as Web pages that are created to manipulate search engines and deceive Web users [12, 13]. Just as email spam has negatively impacted the email user experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Web spam is regarded as one of the most important challenges facing search engines and Web users [12, 14], and recent studies suggest that it accounts for a significant portion of all Web content, including 8% of Web pages [11] and 18% of Web sites [13].

Although the problems posed by Web spam have been widely acknowledged, we believe research progress has been limited by the lack of a publicly available Web spam corpus. In previous Web spam research [2, 6, 7, 9, 10, 11, 13, 21], proposed solutions have been evaluated on relatively small Web spam data sets (on the order of hundreds of Web pages). In many cases, these previous researchers had access to large samples of Web data (on the order of millions of pages); however, the onerous task of hand labeling each Web page made it impossible for them to evaluate even a small fraction of their data. Given the size and dynamic nature of Web content, a manual approach is neither scalable nor sensible. Additionally, none of the previously cited Web spam data sets have been made publicly available, so the reproducibility of current Web spam research results is somewhat limited.

Similar to email spam research on the experimental evaluation of spam filters using large spam corpora such as the Enron [15] and SpamArchive [20] corpora, future Web spam research depends on the availability of large collections of Web spam data. Thus, the first contribution of the paper is a fully automatic Web spam collection technique for extracting large Web spam samples. Our new collection technique is based on the observation that the URLs found in email spam messages are reliable indicators of Web spam pages. Specifically, we extract the URLs from spam messages, cleanse those URLs of false positives (i.e., URLs for legitimate sites), and collect the corresponding Web pages. Given the dynamic nature of the Web, this collection method is extremely useful because it can be configured to maintain up-to-date Web spam data samples.

The second contribution of the paper is the Webb Spam Corpus, a large-scale and publicly available Web spam data set that was created using our automated Web spam collection method (footnote 1). This corpus consists of nearly 350,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited Web spam data set. We describe interesting characteristics of this corpus, and we encourage Web spam and email spam researchers to use it in their work.

Footnote 1: The Webb Spam Corpus can be found at http://www.webbspamcorpus.org/

The third part of the paper outlines the usefulness of the Webb Spam Corpus in several application areas. We summarize related research efforts and describe how our Web spam collection technique and corpus could immediately enhance this previous work. Then, we present other interesting application scenarios that we believe could benefit from our approach. One particularly interesting application area is email filtering. Since the Webb Spam Corpus bridges the worlds of email spam and Web spam, we note that it can be used to aid traditional email spam classification algorithms through an analysis of the characteristics of the Web pages referenced by email messages.

The rest of the paper is organized
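
The three collection steps named in the introduction (extract URLs from spam messages, cleanse them of legitimate sites, collect the corresponding pages) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the excerpt does not say how the cleansing step works, so the host whitelist `LEGITIMATE_HOSTS`, the regular expression, and the function names `extract_urls`, `cleanse`, and `collect_pages` are all assumptions made for illustration.

```python
import re
from email import message_from_string
from urllib.parse import urlsplit

# Hypothetical whitelist of legitimate hosts. The excerpt does not describe
# how the paper identifies false positives; this is an assumed stand-in.
LEGITIMATE_HOSTS = {"www.example.org"}

# Simple pattern for http(s) URLs in message text (an assumption, not a
# complete URL grammar).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(raw_message):
    """Return the http(s) URLs found in the text parts of one email message."""
    message = message_from_string(raw_message)
    urls = []
    for part in message.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8.
            text = payload.decode("utf-8", errors="replace")
        for match in URL_PATTERN.findall(text):
            # Drop sentence punctuation that trails the URL.
            urls.append(match.rstrip(".,;:!?)"))
    return urls


def cleanse(urls, legitimate_hosts=LEGITIMATE_HOSTS):
    """Drop URLs on whitelisted hosts and remove duplicates, keeping order."""
    seen = set()
    kept = []
    for url in urls:
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL
        if not host or host in legitimate_hosts or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


def collect_pages(urls, fetch):
    """Fetch each URL with the supplied callable, skipping URLs that fail."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError:
            continue  # network errors such as urllib.error.URLError
    return pages
```

Passing the fetcher in as a callable keeps the sketch testable offline; a real crawl would supply a function that downloads the page, and could be rerun periodically to keep the sample current.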

