I’ve written a Django web app that I’m still tinkering with. It slowly gathers information from multiple sources and classifies each piece of text for me, and I’m really happy with the progress. NLTK made the implementation pretty straightforward, though there was a definite learning curve for me; I have no background in this field, so I had to learn a bit. For someone approaching this problem who already has the right linguistics and some Python background, I’ll bet it’s amazingly easy to get started.

The tweaking I’ve had to do so far has mostly been 1) filtering out meaningless text that tends to appear in the data, and 2) adding bi-grams to the features being evaluated. Adding bi-grams was definitely an effective change: that was the point where I saw a very nice jump from something like 40% accuracy to 60% accuracy. It is now creeping up toward 80%, mostly due to having more good training data. Without going into the details of the corpora or the classifications I’m doing, your results may vary, and greatly! Bi-grams may not always be that useful, but if you are just getting started with this stuff and looking to improve accuracy, they are worth trying.

In case it’s helpful as a ‘hint’ to someone, here is a small chunk of logic from my text-feature extraction that gathers bi-grams. It lives in the loop that walks through the text one word at a time; the function returns ‘featureDict’, which holds all of the individual words and the bi-grams.

biGram = ""
if(textAncestor1):
    if(text != "writes"):# this is to skip the "authorName writes" in some articles
        biGram = textAncestor1 + " " + text
        featureDict[biGram] = True
textAncestor1 = text
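
For context, here is a minimal, self-contained sketch of the same idea, not my exact code: a feature extractor that collects unigrams plus bi-grams and feeds the resulting dicts to NLTK’s Naive Bayes classifier. The function name extract_features and the tiny labeled_texts sample are made-up placeholders, and it assumes simple whitespace tokenization.

import nltk

def extract_features(text):
    words = text.split()  # assumes simple whitespace tokenization
    featureDict = {}
    previous = None
    for word in words:
        featureDict[word] = True            # unigram feature
        if previous and word != "writes":   # skip the "authorName writes" pattern
            featureDict[previous + " " + word] = True  # bi-gram feature
        previous = word
    return featureDict

# Hypothetical labeled data: a list of (text, label) pairs.
labeled_texts = [
    ("the market rallied sharply today", "finance"),
    ("the team won the championship game", "sports"),
]

train_set = [(extract_features(text), label) for text, label in labeled_texts]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(extract_features("the game was close")))

If you would rather not track the previous word by hand, NLTK also ships a bigrams() helper (nltk.bigrams(tokens)) that yields the same word pairs.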

Robert Arles