Text classification software

(1. Pre-Processing -> 2. Dictionary -> 3. Data Processing -> 4. Learning a model)

Data Processing

Transforms a classical corpus into a sparse matrix by projecting every document onto the dictionary (bag-of-words representation).
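To make the projection concrete, here is a minimal sketch of turning one tokenized document into a sparse row. It is an illustration only, not the tool's actual implementation: the class name BowSketch, the method project, the toy dictionary and the use of raw term counts are all assumptions, and the real program may weight terms or index them differently.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a bag-of-words projection (not the tool's code).
public class BowSketch {

    // Projects one tokenized document onto the dictionary.
    // Returns a sparse row: term index -> number of occurrences.
    // Tokens that are not in the dictionary are ignored.
    static Map<Integer, Integer> project(List<String> tokens, Map<String, Integer> dictionary) {
        Map<Integer, Integer> row = new TreeMap<>(); // sorted by term index
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) {
                row.merge(index, 1, Integer::sum); // increment the count for this term
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumed): term -> column index.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        dictionary.put("book", 0);
        dictionary.put("read", 1);
        dictionary.put("good", 2);

        List<String> doc = Arrays.asList("good", "book", "a", "good", "read");
        System.out.println(project(doc, dictionary)); // prints {0=1, 1=1, 2=2}
    }
}
```

Only the non-zero entries are stored, which is what keeps the matrix sparse: a document usually contains a small fraction of the dictionary's terms.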

Usage:

 java -Xmx1g -jar bin/classif2012_processData.jar -path2data=data/book/ -dicoFile=dico.txt -spFile=SPbasic.sp -corpusNumFile=corpusnum.txt -labelNumFile=corpusnumL.txt

Options:

  • -path2data= path to the corpus directory
  • -dicoFile= path to the dictionary (dico) file
  • -spFile= path to the SP file
  • -corpusNumFile= name of the OUTPUT file for the corpus (bag-of-words representation)
  • -labelNumFile= name of the OUTPUT file for the labels