robert arles robert.arles.us


NLTK

Naive Bayesian Probability is very cool…add bi-grams for extra coolness.

I’ve written a Django web app that I’m still tinkering with. It slowly gathers text from multiple sources and classifies each piece (each document of the corpus) for me, and I’m really happy with the progress. NLTK made the implementation pretty straightforward, though there was a definite learning curve for me: I have no background in this field, so I had to learn a bit. For someone approaching this problem who already has the right linguistics and some Python background, I’ll bet it’s amazingly easy to get started.

The tweaking I’ve had to do so far has mostly been 1) filtering out meaningless text that tends to appear in the data, and 2) adding bi-grams to the features being evaluated. Adding bi-grams was definitely effective: that was the point where I saw a very nice jump from something like 40% accuracy to 60% accuracy. It is now creeping up toward 80%, mostly due to having more good training data. Without going into the details of the corpora or the classifications I’m doing…your results may vary, and greatly! Bi-grams may not always be that useful, but they’re worth trying if you are just getting started with this stuff and are looking to improve accuracy.

In case it’s helpful as a ‘hint’ to someone, here is a small chunk of logic from my text-feature extraction that gathers bi-grams. It lives in the loop that walks through the text one word at a time. The function returns ‘featureDict’, which holds all of the individual words plus the bi-grams.

biGram = ""
if textAncestor1 and text != "writes":  # skip the "authorName writes" byline in some articles
    biGram = textAncestor1 + " " + text
    featureDict[biGram] = True
textAncestor1 = text  # remember the current word as the ancestor for the next iteration
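To show the snippet above in context, here is a minimal sketch of a complete feature-extraction function. The function name `extract_features` and the whitespace tokenization are my assumptions for illustration, not the original code; the variable names `featureDict` and `textAncestor1` mirror the snippet, and the returned dict is in the `{feature_name: value}` shape that NLTK's classifiers expect.

```python
def extract_features(text):
    """Build a feature dict of unigrams and bigrams from a text.

    Hypothetical sketch: the loop body matches the bigram logic shown
    above, and the "writes" check is the author's data-specific filter.
    """
    featureDict = {}
    textAncestor1 = ""  # previous word; empty at the start of the text
    for word in text.split():          # naive whitespace tokenization
        featureDict[word] = True       # unigram feature
        if textAncestor1 and word != "writes":
            featureDict[textAncestor1 + " " + word] = True  # bigram feature
        textAncestor1 = word
    return featureDict
```

A dict like this can then be paired with a label and fed to something like `nltk.NaiveBayesClassifier.train([(extract_features(t), label), ...])`.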

Robert Arles
