Text classification software

(1. Pre-Processing -> 2. Dictionary -> 3. Data Processing -> 4. Learning a model)

Preprocessor

A preprocessor is a chain of small units. This tool builds the processing chain and saves it to a file. Note: the chain is also loaded back and tested as a sanity check.
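
As a rough illustration of the idea (a chain of small units, written to a file, then reloaded and checked), here is a minimal Java sketch. Everything in it is hypothetical: the class ChainSketch, the toy units and the one-name-per-line file format are assumptions made for illustration, and they do not reflect the tool's actual classes or the format of the files it writes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: names and file format are illustrative assumptions.
class ChainSketch {
    // Toy units, each a small String -> String transformation.
    static final Map<String, UnaryOperator<String>> UNITS = new HashMap<>();
    static {
        UNITS.put("ToLower", s -> s.toLowerCase());
        UNITS.put("SquashSpaces", s -> s.replaceAll(" {2,}", " "));
        UNITS.put("TrimEnd", s -> s.replaceAll("\\s+$", ""));
    }

    // Run the text through the named units, in the given order.
    static String apply(List<String> chain, String text) {
        for (String name : chain) {
            UnaryOperator<String> unit = UNITS.get(name);
            if (unit == null) {
                throw new IllegalArgumentException("Unknown unit: " + name);
            }
            text = unit.apply(text);
        }
        return text;
    }

    // Store the chain as one unit name per line (assumed format).
    static void save(List<String> chain, Path file) throws IOException {
        Files.write(file, chain);
    }

    // Read the unit names back, one per line.
    static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    public static void main(String[] args) throws IOException {
        List<String> chain = Arrays.asList("ToLower", "SquashSpaces", "TrimEnd");
        Path file = Files.createTempFile("chain", ".txt");
        save(chain, file);

        // Sanity check: the reloaded chain must behave like the original one.
        List<String> reloaded = load(file);
        String sample = "A  Small   TEST  ";
        String expected = apply(chain, sample);
        String actual = apply(reloaded, sample);
        if (!expected.equals(actual)) {
            throw new IllegalStateException("Reloaded chain differs from the saved one");
        }
        System.out.println("[" + actual + "]"); // prints "[a small test]"
        Files.delete(file);
    }
}
```

The units run in the order in which they are listed, so changing the order can change the result.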

The jar file should be used as follows:

 java -jar bin/classif2012_generateSP.jar OPTION

with OPTION formatted as:

 -fileout=fileOUT -sp StringProcessor_EnglishNegation -sp StringProcessor_MarkPunctuation
  • -fileout: name of the file in which the preprocessor will be stored
  • -sp UNIT_NAME: adds a unit to the chain, identified by its class name
  • [optional] -spargs -NAME1=VALUE1 -NAME2=VALUE2 ...: sets the arguments of a unit

Full typical example:

 java -jar bin/classif2012_generateSP.jar \
  -fileout=SPbasic.sp -sp StringProcessor_MarkPunctuation \
  -sp StringProcessor_TreeTagger -spargs \
     -path2tt=../../treetagger \
     -path2ttmodel=../../treetagger/models/english.par \
  -sp StringProcessor_NGram -spargs -size=2

Resources

See the download section.

The bash script script1_pp.sh illustrates the usage.

List of the preprocessors & arguments

Some other units are implemented but not described here: they implement ad hoc functions for specific datasets.

Basic processing:

Basic regexp replaceAll units, without arguments unless noted:

StringProcessor_EnglishNegation            e.g.: don't -> do not
StringProcessor_LowerCase                  NOT -> not
StringProcessor_MarkDigital                89 -> DIGIT
StringProcessor_MarkHours                  2h30 -> HOUR
StringProcessor_MarkPunctuation            ... -> TROISPOINTS
StringProcessor_MarkPunctuationReverse     TROISPOINTS -> ...
StringProcessor_MarkURL                    www....fr -> URL
StringProcessor_RemoveAllAccents           é -> e
StringProcessor_RemoveAllSpecialCharacter  same as above
StringProcessor_RemoveDigit                65 -> ""
StringProcessor_RemoveEndSpace
StringProcessor_RemoveParentess            () -> ""
StringProcessor_RemovePunctuation
StringProcessor_RemoveShortWords           ARG: limit of characters before removal
StringProcessor_RemoveUnderscore
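
Since these units are described as plain regexp replaceAll rules, the following Java sketch shows how a few of the examples listed above could be reproduced with String.replaceAll. The patterns are guesses that merely match the listed examples, and the class and method names are hypothetical; the actual StringProcessor_* units may use different expressions.

```java
// Hypothetical sketch: the patterns are guesses that reproduce the listed
// examples; they are not taken from the StringProcessor_* classes.
class RegexUnitsSketch {
    // don't -> do not (only this one contraction is handled here)
    static String expandNegation(String s) {
        return s.replaceAll("\\bdon't\\b", "do not");
    }

    // 2h30 -> HOUR
    static String markHours(String s) {
        return s.replaceAll("\\b\\d{1,2}h\\d{2}\\b", "HOUR");
    }

    // 89 -> DIGIT
    static String markDigits(String s) {
        return s.replaceAll("\\d+", "DIGIT");
    }

    // ... -> TROISPOINTS
    static String markEllipsis(String s) {
        return s.replaceAll("\\.{3}", "TROISPOINTS");
    }

    // TROISPOINTS -> ...
    static String unmarkEllipsis(String s) {
        return s.replaceAll("TROISPOINTS", "...");
    }

    // 65 -> ""
    static String removeDigits(String s) {
        return s.replaceAll("\\d", "");
    }

    public static void main(String[] args) {
        String s = "I don't know ... call 89 at 2h30";
        s = expandNegation(s);
        s = markEllipsis(s);
        s = markHours(s);  // before markDigits, otherwise "2h30" becomes "DIGIThDIGIT"
        s = markDigits(s);
        System.out.println(s); // prints "I do not know TROISPOINTS call DIGIT at HOUR"
    }
}
```

With these guessed patterns the order of the rules matters: marking hours has to come before marking digits. Whether the real units interact in the same way is not documented here.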

HTML processing:

StringProcessor_KeepBodyOnly                  remove all content but the body in HTML files
StringProcessor_RemoveTags
StringProcessor_RemoveTagsAndScripts
StringProcessor_TransformAccentedLettersHTML  transform HTML code to ASCII

Stemming/Lemmatization:

StringProcessor_EnglishPorterStemmer  Stemming
StringProcessor_FrenchPorterStemmer   same as above
StringProcessor_TreeTagger            Lemmatization. Arguments: -path2tt=path to the TT exec -path2ttmodel=path to the model file -lemmatization=BOOLEAN -addpos=BOOLEAN

High-level aggregation:

StringProcessor_Concatenate  Concatenates 2 representations from 2 different units described by 2 files. Arguments: -file1=... -file2=...
StringProcessor_NGram        N-gram. Arguments: -size=N -mode=INT
StringProcessor_SubSequence  Combination of words. Arguments: -size=N
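
For readers unfamiliar with word n-grams, here is a minimal Java sketch of what a size of 2 means: every pair of consecutive words becomes one token. The class NGramSketch, the whitespace tokenization and the "_" separator are assumptions made for illustration. The meaning of -mode is not documented here and is not modeled, and the real unit may format its output differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of word n-grams; not the code of StringProcessor_NGram.
class NGramSketch {
    // Join every run of `size` consecutive words with '_'.
    static List<String> wordNGrams(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + size <= words.length; i++) {
            grams.add(String.join("_", Arrays.copyOfRange(words, i, i + size)));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(wordNGrams("the cat sat down", 2));
        // prints "[the_cat, cat_sat, sat_down]"
    }
}
```

A text with fewer than `size` words yields no n-gram at all in this sketch.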

Misc.

StringProcessor_NLetter  Division of documents into groups of letters. Arguments: -size=N
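
The description above does not say whether the groups of letters overlap. The Java sketch below, with the hypothetical class NLetterSketch, shows both readings for a group size of 3: a sliding window (step 1) and disjoint chunks (step equal to the size). Which of the two, if either, StringProcessor_NLetter implements is an open assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: two possible readings of "groups of letters".
class NLetterSketch {
    // step = 1 gives overlapping windows; step = size gives disjoint chunks.
    // Trailing characters that do not fill a whole group are dropped.
    static List<String> letterGroups(String text, int size, int step) {
        if (size < 1 || step < 1) {
            throw new IllegalArgumentException("size and step must be >= 1");
        }
        List<String> groups = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i += step) {
            groups.add(text.substring(i, i + size));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(letterGroups("classif", 3, 1)); // prints "[cla, las, ass, ssi, sif]"
        System.out.println(letterGroups("classif", 3, 3)); // prints "[cla, ssi]"
    }
}
```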