Contact
Antoine Bordes
UdeM - Dpt IRO
Pavillon André Aisenstadt
2920 Chemin de la Tour
Montréal, Qc, Canada, H3T 1J4
Four datasets for multiclass classification can be downloaded here. The LETTER and USPS datasets come from the UCI repository1). The MNIST dataset2) is a handwritten digit recognition benchmark. The INEX3) dataset contains scientific articles from 18 IEEE journals and proceedings, encoded in a flat TF/IDF feature space.
| Dataset | Classes | Train. Ex. | Test. Ex. | Features | Download |
|---|---|---|---|---|---|
| LETTER | 26 | 16000 | 4000 | 16 | letter.tar.gz |
| USPS | 10 | 7291 | 2007 | 256 | usps.tar.gz |
| MNIST | 10 | 60000 | 10000 | 780 | mnist.tar.gz |
| INEX | 18 | 6053 | 6054 | 167295 | inex.tar.gz |
The files are compressed as tar.gz archives. They use the so-called LibSVM/SVMlight/SVMstruct format. Each example is represented by one line in the following format:
    <line> = <target> <feature>:<value> ... <feature>:<value>
    <target> = <int>
    <feature> = <integer>
    <value> = <float>
The target value and each of the feature/value pairs are separated by a space character. The target value denotes the class of the example. For example, the line 5 1:0.43 3:0.12 9284:0.2 specifies an example belonging to class 5 for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0.
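The line format described above can be read with a few lines of Python. The following is only a minimal sketch, not code shipped with the datasets: the names `parse_line` and `load_examples` are hypothetical, and it assumes every line follows the grammar exactly (integer target, integer feature indices, no comments).

```python
def parse_line(line):
    """Parse one '<target> <feature>:<value> ...' line.

    Returns (target, features), where features maps each feature
    number that appears on the line to its value. Features that do
    not appear are implicitly 0 and are simply absent from the dict.
    """
    tokens = line.split()
    target = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return target, features


def load_examples(path):
    """Read every non-blank line of a file (hypothetical helper)."""
    with open(path) as handle:
        return [parse_line(line) for line in handle if line.strip()]


# The example line from the text above.
target, features = parse_line("5 1:0.43 3:0.12 9284:0.2")
print(target, features)        # 5 {1: 0.43, 3: 0.12, 9284: 0.2}
print(features.get(2, 0.0))    # 0.0, since feature 2 is not listed
```

A dictionary keeps only the non-zero entries, which matters for INEX, where each example has 167295 possible features. Existing readers for this format, such as `load_svmlight_file` in scikit-learn, may be a more convenient choice when a sparse matrix is wanted.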