Sequential version

To deal with a big corpus, the pipeline proceeds in two sequential steps:

  • sequential computation of the dictionary
  • sequential building of the numerical corpus

Building the pre-processor

v0_06_preprocessing.jar

Example of use (each pre-processor is added with -sp NameOfTheClass, followed by -spargs and the list of the attributes for that class):

 java -jar v0_06_preprocessing.jar -fileout=SPbi.sp  -sp StringProcessor_MarkPunctuation 
 -sp StringProcessor_TreeTagger -spargs -path2tt=../../treetagger -path2ttmodel=../../treetagger/models/english.par 
 -sp StringProcessor_NGram -spargs -size=2

For the detailed options, see the link.
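
The dictionary and corpus steps below use a unigram pre-processor file, SPuni.sp. Assuming the same processor chain as above, it can be built by setting the n-gram size to 1 (a sketch; adapt the chain to your needs):

 java -jar v0_06_preprocessing.jar -fileout=SPuni.sp -sp StringProcessor_MarkPunctuation
 -sp StringProcessor_TreeTagger -spargs -path2tt=../../treetagger -path2ttmodel=../../treetagger/models/english.par
 -sp StringProcessor_NGram -spargs -size=1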

Building the dictionary sequentially

v0_06_dicoBuilding.jar

Example of use:

 java -Xmx5g -jar v0_06_dicoBuilding.jar -dicoFile=tmp/dicospUnisp.txt -nFilter=3
 -path2data=data/reviewsNew.txt -sizeBatch=100000 -spFile=SPuni.sp
  • Processes the corpus reviewsNew.txt (Bing Liu's file format, http://liu.cs.uic.edu/download/data/)
  • Works on blocks of 100000 documents
  • Removes all terms appearing fewer than 3 times
  • Uses the pre-processing file SPuni.sp
  • Each block yields its own dictionary file (the dicoFile name with a suffix); see the sketch below
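
The on-disk layout of the block dictionaries is internal to the tool; the following Java sketch only illustrates the block-wise logic described above (batching, per-block counting, nFilter). The one-document-per-line reading, the whitespace tokenisation standing in for the SPuni.sp pipeline, and the term<TAB>count output layout are all assumptions for illustration.

 import java.io.*;
 import java.util.*;

 public class DicoBuildSketch {
     public static void main(String[] args) throws IOException {
         int sizeBatch = 100000, nFilter = 3, block = 0, nDocs = 0;
         Map<String, Integer> counts = new HashMap<>();
         try (BufferedReader in = new BufferedReader(new FileReader("data/reviewsNew.txt"))) {
             String doc;
             while ((doc = in.readLine()) != null) {        // one document per line (assumption)
                 for (String term : doc.split("\\s+"))      // stand-in for the SPuni.sp pipeline
                     counts.merge(term, 1, Integer::sum);
                 if (++nDocs % sizeBatch == 0)
                     flush(counts, nFilter, block++);       // one dictionary file per block
             }
         }
         if (!counts.isEmpty())
             flush(counts, nFilter, block);                 // last, partial block
     }

     // Write one dictionary file per block, dropping terms seen fewer than nFilter times.
     static void flush(Map<String, Integer> counts, int nFilter, int block) throws IOException {
         try (PrintWriter out = new PrintWriter(new FileWriter("tmp/dicospUnisp.txt." + block))) {
             for (Map.Entry<String, Integer> e : counts.entrySet())
                 if (e.getValue() >= nFilter)
                     out.println(e.getKey() + "\t" + e.getValue());
         }
         counts.clear();
     }
 }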

This creates one dictionary file per block; these block files then need to be merged:

v0_06_dicoFusion.jar

 java -Xmx15g -jar v0_06_dicoFusion.jar -dicoFile=dicospBisp.txt -nFilter=20
  • NB: the dicoFile given to v0_06_dicoFusion.jar must be the same dicoFile that was given to v0_06_dicoBuilding.jar
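
A Java sketch of the fusion logic, under the same assumptions as the previous sketch (per-block files laid out as term<TAB>count): counts are summed across all block files and the stricter nFilter is applied to the totals.

 import java.io.*;
 import java.nio.file.*;
 import java.util.*;

 public class DicoFusionSketch {
     public static void main(String[] args) throws IOException {
         int nFilter = 20;
         Map<String, Long> total = new HashMap<>();
         // Collect every per-block dictionary (suffixed dicoFile names).
         try (DirectoryStream<Path> blocks =
                  Files.newDirectoryStream(Paths.get("tmp"), "dicospUnisp.txt.*")) {
             for (Path p : blocks)
                 for (String line : Files.readAllLines(p)) {
                     String[] f = line.split("\t");          // term <TAB> count
                     total.merge(f[0], Long.parseLong(f[1]), Long::sum);
                 }
         }
         // Keep only terms whose global frequency reaches nFilter.
         try (PrintWriter out = new PrintWriter(new FileWriter("tmp/dicospUnisp.txt"))) {
             for (Map.Entry<String, Long> e : total.entrySet())
                 if (e.getValue() >= nFilter)
                     out.println(e.getKey() + "\t" + e.getValue());
         }
     }
 }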

Building the numerical corpus sequentially (libsvm format)

v0_06_process.jar

Example of use:

 java -Xmx15g -jar v0_06_process.jar -dicoFile=tmp/dicospUnispR20.txt -spFile=SPuni.sp
 -path2data=data/reviewsNew.txt -sizeBatch=100000 -corpusNumFile=data/revuesUni.lib
 [-corpusSeqFile=data/]
  • corpusNumFile is the output corpus in standard libSVM format
  • the last, optional argument produces a sequential version of the numerical corpus, in which the word indices are kept in the original document order (see the sketch below)
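
For reference, a libSVM line is a class label followed by sparse index:value pairs in increasing index order. The Java sketch below converts one tokenised document, assuming the merged dictionary maps each term to an integer index; the optional sequential output would instead keep the indices in the document's original word order. The names and the counting scheme are illustrative assumptions, not the tool's actual code.

 import java.util.*;

 public class LibsvmLineSketch {
     // Turn one tokenised document into a libSVM line: label, then
     // index:count pairs sorted by index (terms absent from the dictionary are skipped).
     static String toLibsvm(int label, List<String> tokens, Map<String, Integer> dico) {
         SortedMap<Integer, Integer> counts = new TreeMap<>();
         for (String t : tokens) {
             Integer id = dico.get(t);
             if (id != null) counts.merge(id, 1, Integer::sum);
         }
         StringBuilder sb = new StringBuilder(String.valueOf(label));
         for (Map.Entry<Integer, Integer> e : counts.entrySet())
             sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
         return sb.toString();
     }

     public static void main(String[] args) {
         Map<String, Integer> dico = Map.of("good", 3, "movie", 7);
         System.out.println(toLibsvm(1, List.of("good", "good", "movie"), dico));
         // prints: 1 3:2 7:1
     }
 }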