Alex Spengler

PhD in Informatics

Laboratoire d'Informatique de Paris 6
Université Pierre et Marie Curie
4 place Jussieu
75005 Paris, France

The News600 Web Page Corpus

Download — Features & Labels

The News600 data set was created at the Laboratoire d'Informatique de Paris 6 and is provided to advance research on web content extraction. It comprises 604 real-world news web pages from over 170 distinct sites. Every page has been annotated at the DOM node level with 30 semantic labels such as Author, Title, Paragraph and Advertisement. The data set can be used for several tasks, including title extraction and advertisement removal.
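As a rough illustration of what node-level labels make possible, the sketch below performs title extraction and advertisement removal on a single annotated page. The corpus's actual file format is not described here, so the `LabeledNode` class, its fields, and the sample page are purely hypothetical; only the label names (Title, Author, Paragraph, Advertisement) come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledNode:
    """One DOM node with its annotation (hypothetical in-memory form,
    not the corpus's real storage format)."""
    tag: str    # HTML tag name of the node
    text: str   # text content of the node
    label: str  # semantic label, e.g. "Title" or "Advertisement"


def extract_title(nodes: List[LabeledNode]) -> Optional[str]:
    """Return the text of the first node labeled "Title", or None."""
    for node in nodes:
        if node.label == "Title":
            return node.text
    return None


def remove_advertisements(nodes: List[LabeledNode]) -> List[LabeledNode]:
    """Return the nodes that are not labeled "Advertisement"."""
    return [node for node in nodes if node.label != "Advertisement"]


# A made-up page with four annotated nodes, for illustration only.
page = [
    LabeledNode("h1", "Example headline", "Title"),
    LabeledNode("span", "Jane Doe", "Author"),
    LabeledNode("div", "Buy now!", "Advertisement"),
    LabeledNode("p", "First paragraph of the story.", "Paragraph"),
]

title = extract_title(page)
cleaned = remove_advertisements(page)
```

In practice the gold labels would serve as the target that a learned extractor is trained and evaluated against, rather than being read off directly as done here.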

News600 Corpus (Features & Labels)
Creative Commons Attribution 2.0 France License
23.2 MByte

License

The News600 Web Page Corpus by Alex Spengler is licensed under a Creative Commons Attribution 2.0 France License. You are free to copy, distribute, display, and make derivative works of the corpus, provided that you acknowledge its use with the following citation: