Alex Spengler, PhD Candidate
Laboratoire d'Informatique de Paris 6
Both the structural and the visual heterogeneity of today's web pages should make it difficult, if not impossible, to extract content with common semantics from a sufficiently large subset of the web. However, two developments lead us to believe that sensible solutions to the problem of web content extraction are within reach: the restriction to a particular domain, such as web news, and the arrival of content management systems, which tend to unify the appearance and structure of web sites.
Another reason for this belief is the fact that a web page is far more
than just plain text. It contains a wealth of structural information
that reflects the semantic nature of the page's content. When rendered
in a browser, this information is further enhanced by visual cues,
such as font sizes and element positions, which are, in particular for
individual domains, far from random.
To study and compare web content extraction approaches, there is hence a need for a reference collection of web pages with well-defined semantic labels. To the best of our knowledge, no such data is publicly available today.
The News600 web page corpus is a collection of 604 real-world news web pages from over 170 distinct domains. Every page in the corpus has been annotated at the DOM node level with one of 30 semantic labels, such as author, title, paragraph and advertisement. The corpus has been created to foster research in automatic web content extraction and can be used for several tasks, including title extraction and advertisement removal.
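As an illustration of how node-level labels can serve as a reference for such tasks, the following minimal sketch scores a hypothetical extractor against gold annotations, one label at a time. The dictionary representation (a DOM node path mapped to a label), the example paths and the function name `precision_recall` are assumptions made for this sketch only; they do not describe the corpus's actual file format or any tooling shipped with it.

```python
# Hypothetical gold annotations: DOM node path -> semantic label.
# The paths and this dict layout are illustrative assumptions,
# not the corpus's real storage format.
gold = {
    "/html/body/div[1]/h1": "title",
    "/html/body/div[1]/span": "author",
    "/html/body/div[2]/p[1]": "paragraph",
    "/html/body/div[2]/p[2]": "paragraph",
    "/html/body/div[3]": "advertisement",
}

# Hypothetical output of some extractor on the same page.
predicted = {
    "/html/body/div[1]/h1": "title",
    "/html/body/div[1]/span": "author",
    "/html/body/div[2]/p[1]": "paragraph",
    "/html/body/div[2]/p[2]": "advertisement",  # a mislabelled paragraph
    "/html/body/div[3]": "advertisement",
}


def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one label, computed over DOM nodes."""
    gold_nodes = {node for node, lab in gold.items() if lab == label}
    pred_nodes = {node for node, lab in predicted.items() if lab == label}
    true_positives = len(gold_nodes & pred_nodes)
    # Report 0.0 when a label is never predicted or never annotated.
    precision = true_positives / len(pred_nodes) if pred_nodes else 0.0
    recall = true_positives / len(gold_nodes) if gold_nodes else 0.0
    return precision, recall


for label in ("title", "paragraph", "advertisement"):
    p, r = precision_recall(gold, predicted, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Scoring per label keeps tasks such as title extraction and advertisement removal separately measurable on the same annotated page.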
To download the News600 corpus, click here.
To access the HTML source files together with their semantic XML labels, click here. The complete web pages will be made available for research purposes as soon as possible (we are currently negotiating a license).
© Copyright 2009 Alex Spengler. All rights reserved.
Web page last updated on 29 September 2009.