Credits

Four datasets for multiclass classification can be downloaded here. The LETTER and USPS datasets come from the UCI repository 1). The MNIST dataset 2) is a handwritten digit recognition benchmark. The INEX dataset 3) contains scientific articles from 18 journals and proceedings of the IEEE, encoded in a flat TF/IDF feature space.

Datasets

Dataset  Classes  Training examples  Test examples  Features  Download
LETTER   26       16000              4000           16        letter.tar.gz
USPS     10       7291               2007           256       usps.tar.gz
MNIST    10       60000              10000          780       mnist.tar.gz
INEX     18       6053               6054           167295    inex.tar.gz

Format

The files are compressed as tar.gz archives. They use the so-called LibSVM/SVMlight/SVMstruct format. Each example is represented by a line in the following format:

<line>    = <target> <feature>:<value> ... <feature>:<value>
<target>  = <integer>
<feature> = <integer>
<value>   = <float>

The target value and each of the feature/value pairs are separated by a space character. The target value denotes the class of the example. For example, the line "5 1:0.43 3:0.12 9284:0.2" specifies an example belonging to class 5 for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, and feature number 9284 has the value 0.2. All other features have the value 0.
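As an illustration of the format described above, here is a minimal Python sketch that parses one line into a class label and a sparse feature dictionary. The function names parse_line and to_dense are hypothetical and are not part of the datasets or of any library. The sketch assumes that feature numbers start at 1, as the example line suggests, and that lines contain no comments or extra fields.

```python
from typing import Dict, List, Tuple


def parse_line(line: str) -> Tuple[int, Dict[int, float]]:
    """Parse one example into (target, {feature_number: value}).

    Hypothetical helper: assumes "<target> <feature>:<value> ..." with
    whitespace-separated tokens and no trailing comments.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    target = int(tokens[0])  # class of the example
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        number, _, value = token.partition(":")
        features[int(number)] = float(value)
    return target, features


def to_dense(features: Dict[int, float], n_features: int) -> List[float]:
    """Expand a sparse example into a dense list; absent features are 0.

    Assumption: feature numbers are 1-based, so feature k goes to index k - 1.
    """
    dense = [0.0] * n_features
    for number, value in features.items():
        dense[number - 1] = value
    return dense


target, features = parse_line("5 1:0.43 3:0.12 9284:0.2")
dense = to_dense(features, 9284)
```

For a dataset such as INEX, with 167295 features, keeping the sparse dictionary is likely preferable to expanding every example into a dense list.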

3) see Denoyer, L. & Gallinari, P. The XML Document Mining Challenge in INEX 2006.
multiclass_data.txt · Last modified: 2009/06/22 15:43 by antojne