Alex Spengler

PhD in Informatics

Laboratoire d'Informatique de Paris 6
Université Pierre et Marie Curie
4 place Jussieu
75005 Paris, France

The News600 Web Page Corpus

The structural and visual heterogeneity of today's web pages should make it difficult, if not impossible, to extract content with common semantics from a sufficiently large subset of the web. However, restricting attention to a particular domain, such as the news vertical, and the arrival of content management systems, which tend to unify the appearance and structure of web sites, lead us to believe that sensible solutions to the problem of web content extraction are within reach.

Another reason for this belief is the fact that a web page is far more than just plain text. It contains a wealth of structural information that reflects the semantic nature of the page's content. When rendered in a browser, this information is further enhanced by visual cues, such as font sizes and element positions, which are, in particular for individual domains, far from random.

Figure 1. Some examples of web pages from the News600 corpus.

To study and compare web content extraction approaches, there is hence a need for a reference collection of web pages with well-defined semantic labels. To the best of our knowledge, no such data is publicly available today. The News600 web page corpus is a collection of 604 real-world news web pages from over 170 distinct sites, crawled in March 2008. Each page in the corpus has been annotated at the DOM node level with 30 semantic labels, such as Author, Title, Paragraph and Advertisement. The corpus was created to foster research in automatic web content extraction and can be used for several tasks, including title extraction and advertisement removal.
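
To illustrate how node-level labels support tasks such as title extraction and advertisement removal, here is a minimal Python sketch. The `LabeledNode` class, its fields, and the sample page are hypothetical and chosen for illustration only; they do not reflect the corpus's actual file format, which is not described on this page. Only the label names (Author, Title, Paragraph, Advertisement) are taken from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledNode:
    """Hypothetical stand-in for one annotated DOM node."""
    path: str   # assumed: a DOM path identifying the node
    text: str   # assumed: the node's text content
    label: str  # one of the semantic labels, e.g. "Title"


def extract_title(nodes: List[LabeledNode]) -> Optional[str]:
    """Return the text of the first node labeled Title, or None."""
    for node in nodes:
        if node.label == "Title":
            return node.text
    return None


def remove_advertisements(nodes: List[LabeledNode]) -> List[LabeledNode]:
    """Keep every node except those labeled Advertisement."""
    return [node for node in nodes if node.label != "Advertisement"]


# Invented sample page, for illustration only.
page = [
    LabeledNode("html/body/div[1]", "Buy now!", "Advertisement"),
    LabeledNode("html/body/h1", "Example headline", "Title"),
    LabeledNode("html/body/span", "A. Writer", "Author"),
    LabeledNode("html/body/p[1]", "First paragraph.", "Paragraph"),
]

title = extract_title(page)
cleaned = remove_advertisements(page)
```

With gold labels of this kind, an extraction approach can be evaluated by comparing the labels it predicts for each node against the annotated ones.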


Extracted Features & Labels

A preprocessed version of the News600 corpus, containing extracted features and semantic labels for all visible DOM tree leaf nodes, can be freely downloaded here.
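
For readers unfamiliar with the notion of DOM tree leaf nodes, the following sketch collects text leaves from an HTML string using Python's standard `html.parser` module. It is an approximation under stated assumptions, not the procedure used to build the corpus: visibility is approximated by skipping text inside a fixed set of elements (the `HIDDEN` set is an assumption of this sketch), no rendering is performed, and no visual features are computed.

```python
from html.parser import HTMLParser

# Assumption of this sketch: text inside these elements is never displayed.
HIDDEN = {"script", "style", "head", "title", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class LeafTextCollector(HTMLParser):
    """Collect (tag path, text) pairs for non-empty text leaves."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.leaves = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only text and text nested in a hidden element.
        if text and not HIDDEN.intersection(self.stack):
            self.leaves.append(("/".join(self.stack), text))


def visible_leaves(html):
    """Return the text leaves of an HTML string as (path, text) pairs."""
    parser = LeafTextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.leaves


sample = ("<html><head><title>T</title><style>p{}</style></head>"
          "<body><h1>Headline</h1><p>Body <b>bold</b> text<br>more</p>"
          "<script>var x;</script></body></html>")
leaves = visible_leaves(sample)
```

Each resulting pair corresponds to one leaf to which a feature vector and a semantic label could be attached.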


Original HTML Pages
(Due to copyright issues, access to the HTML sources is restricted.)

To gain access to the original HTML page contents and their corresponding semantic labels, researchers must sign a research-only data usage agreement. If you are a researcher and interested in the page contents, please download the News600 Research-Only Data Usage Agreement, sign it and fax it to:

Alex Spengler
Laboratoire d'Informatique
Université Pierre et Marie Curie

Office 527, wing 25-26, 5th floor
4, place Jussieu
75005 Paris
France

Fax: +33 1 44 27 70 00


Once you have received your credentials (username/password combination) by e-mail, you can access the HTML source files together with their semantic XML labels by clicking here.

Please allow one or two days for processing. Do not hesitate to contact me by e-mail should you have any questions or remarks.