Alex Spengler, PhD in Informatics
Laboratoire d'Informatique de Paris 6 (LIP6)
The News600 data set was created at the Laboratoire d'Informatique de Paris 6 and is provided to advance research on web content extraction. It comprises 604 real-world news web pages from over 170 distinct sites, all of which have been annotated at the DOM node level with 30 semantic labels such as Author, Title, Paragraph and Advertisement. It can be used for several tasks, including title extraction and advertisement removal.
News600 Corpus (Features & Labels)
Creative Commons Attribution 2.0 France License
23.2 MByte
The News600 Web Page Corpus by Alex Spengler is licensed under a Creative Commons Attribution 2.0 France License. You are free to copy, distribute, display and to make derivative works of the corpus under the condition that you acknowledge its use with the following citation:
Alex Spengler and Patrick Gallinari
Learning to Extract Content from News Webpages
The 2009 IEEE International Symposium on Mining and Web (MAW2009). Bradford, England, UK, 2009
We consider the problem of content extraction from online news
webpages. To explore to what extent the syntactic markup and the
visual structure of a webpage facilitate the extraction of its
content, we compare two state-of-the-art classifiers as first
instantiations of a general framework that allows for proper model
comparison. To this end, we introduce the publicly available
News600 corpus, a set of 604 real-world news webpages which have
been annotated with 30 semantic labels. An empirical analysis of the
two models on this dataset shows that the inclusion of structural
information is indeed advantageous.
@article{spengler2009learning,
author = {Alex Spengler and Patrick Gallinari},
title = {Learning to Extract Content from News Webpages},
journal = {Advanced Information Networking and Applications Workshops, International Conference on},
volume = {0},
year = {2009},
isbn = {978-0-7695-3639-2},
pages = {709-714},
doi = {10.1109/WAINA.2009.97},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA}
}