Alex Spengler

PhD Candidate

Laboratoire d'Informatique de Paris 6
Université Pierre et Marie Curie
104 avenue du président Kennedy
75016 Paris, France

The News600 Web Page Corpus

The structural and visual heterogeneity of today's web pages should make it difficult, if not impossible, to extract content with common semantics from a sufficiently large subset of the web. However, restricting the problem to a particular domain, such as web news, together with the arrival of content management systems, which tend to unify the appearance and structure of web sites, leads us to believe that sensible solutions to the problem of web content extraction are within reach.

[Figure: Example of a webpage from the corpus]

Another reason for this belief is the fact that a web page is far more than just plain text. It contains a wealth of structural information that reflects the semantic nature of the page's content. When rendered in a browser, this information is further enhanced by visual cues, such as font sizes and element positions, which are, in particular for individual domains, far from random.

To study and compare various web content extraction approaches, there is hence a need for a reference collection of web pages with well-defined semantic labels. To the best of our knowledge, no such data is publicly available today. The News600 web page corpus is a collection of 604 real-world news web pages from over 170 distinct domains. Each page in the corpus has been annotated at the DOM node level with one of 30 semantic labels, such as author, title, paragraph and advertisement. The corpus has been created to foster research in automatic web content extraction and can be used for several tasks, including title extraction and advertisement removal.
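As an illustration of how node-level labels of this kind could be used to compare extraction approaches, the following minimal sketch computes per-label precision, recall and F1 between reference labels and the output of some extractor. It does not read the corpus files, whose format is not described here. The function name `per_label_scores`, the node identifiers (`n1`, `n2`, ...) and the toy data are assumptions made purely for illustration; only the label names are taken from the description above.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)}.

    `gold` and `predicted` map a node identifier to a semantic label.
    Hypothetical interface: the corpus itself may identify nodes differently.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for node_id, gold_label in gold.items():
        pred_label = predicted.get(node_id)
        if pred_label == gold_label:
            tp[gold_label] += 1
        else:
            # The reference label was missed ...
            fn[gold_label] += 1
            # ... and the wrongly predicted label, if any, is a false positive.
            if pred_label is not None:
                fp[pred_label] += 1
    # Assumption: predictions for nodes without a reference label are ignored.

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data (hypothetical node ids; label names as listed in the text).
gold = {"n1": "title", "n2": "author", "n3": "paragraph", "n4": "advertisement"}
predicted = {"n1": "title", "n2": "paragraph", "n3": "paragraph", "n4": "advertisement"}

scores = per_label_scores(gold, predicted)
for label in sorted(scores):
    print(label, scores[label])
```

For a task such as title extraction, only the entry for the `title` label would be of interest; for advertisement removal, the `advertisement` entry.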

Download

To download the News600 corpus, click here.

Browse
(Due to copyright issues, access to the HTML sources is currently restricted.)

To access the HTML source files together with their semantic XML labels, click here. The complete web pages will be made available for research purposes as soon as possible (we are currently negotiating a license).

 

© Copyright 2009 Alex Spengler. All rights reserved.
Web page last updated on 29 September 2009.