Alex Spengler, PhD in Informatics |
Laboratoire d'Informatique de Paris 6 |
Lip6 |
Both the structural and the visual heterogeneity of today's web pages should make it difficult, if not impossible, to extract content with common semantics from a sufficiently large subset of the web. However, the restriction to a particular domain, such as the news vertical, and the arrival of content management systems, which tend to unify the appearance and structure of web sites, lead us to believe that sensible solutions to the problem of web content extraction are within reach.
Another reason for this belief is the fact that a web page is far more than just plain text. It contains a wealth of structural information that reflects the semantic nature of the page's content. When rendered in a browser, this information is further enhanced by visual cues, such as font sizes and element positions, which are, in particular for individual domains, far from random.
Figure 1. Some examples of web pages from the News600 corpus.
To study and compare various web content extraction approaches, there is hence a need for a reference collection of web pages with well-defined semantic labels. To the best of our knowledge, no such data is publicly available today. The News600 web page corpus is a collection of 604 real-world news web pages from over 170 distinct sites, crawled in March 2008. Each page in the corpus has been annotated at the DOM node level with 30 semantic labels such as Author, Title, Paragraph and Advertisement. The corpus has been created to foster research in automatic web content extraction and can be used for several tasks, including title extraction and advertisement removal.
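As an illustration of how node-level labels could support comparing extraction approaches, the following minimal Python sketch computes per-label precision, recall and F1 from two label sequences, one gold label and one predicted label per DOM node. The function name `per_label_scores` and the toy label lists are hypothetical and do not reflect the actual file format of the corpus; only the label names Author, Title, Paragraph and Advertisement are taken from the description above.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for two equal-length label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # node labeled correctly
        else:
            fp[p] += 1   # predicted label was wrong for this node
            fn[g] += 1   # gold label was missed for this node

    scores = {}
    for label in set(gold) | set(predicted):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical toy data: one label per DOM node of a single page.
gold = ["Title", "Author", "Paragraph", "Paragraph", "Advertisement"]
predicted = ["Title", "Paragraph", "Paragraph", "Paragraph", "Advertisement"]

scores = per_label_scores(gold, predicted)
for label in sorted(scores):
    p, r, f = scores[label]
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting scores per label rather than a single accuracy figure keeps rare labels such as Title or Author visible next to frequent ones such as Paragraph, which matters for tasks like title extraction where only a few nodes per page carry the target label.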
A preprocessed version of the News600 corpus, containing extracted features and semantic labels for all visible DOM tree leaf nodes, can be freely downloaded here.
To gain access to the original HTML page contents and their corresponding semantic labels, researchers must sign a research-only data usage agreement. If you are a researcher and interested in the page contents, please download the agreement, sign it, and fax it to:
Once you have received your credentials (username/password combination) by e-mail, you can access the HTML source files together with their semantic XML labels by clicking here.
Please allow one or two days for processing. Do not hesitate to contact me by e-mail, should you have any questions or remarks!