A Couple Personal Projects: NerdlyNews and PageLoadStats

I have worked on several web-based projects. I recently created NerdlyNews, which uses Bayesian logic to pull interesting news from sites I really like. I’m using a WordPress front-end for that one, with the JetPack plugin so the output of the Bayesian algorithm can be posted to the site through the WordPress APIs. It’s really a nice way to go since I’m not a UI designer.

I also created PageLoadStats, a performance monitoring tool with a straightforward front-end that graphs the data it gathers. This one is very useful to me: you just tell it which URLs to monitor and what the alert limits are, and it will monitor, graph, and send email alerts.
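PageLoadStats itself isn’t shown here, but the core check is easy to picture. Here’s a minimal sketch in Python of how a monitor entry and its alert limit might be modeled; the field names and URLs are hypothetical, not taken from the actual tool:

```python
# Hypothetical monitor entries for a PageLoadStats-style tool.
# "alert_ms" is the load-time limit before an email alert would fire.
MONITORS = [
    {"url": "https://example.com/", "alert_ms": 2000},
    {"url": "https://example.com/login", "alert_ms": 3500},
]

def needs_alert(monitor, load_time_ms):
    """Return True when a measured page load exceeds the monitor's limit."""
    return load_time_ms > monitor["alert_ms"]
```

A real monitor would time an HTTP fetch of each URL on a schedule, store the result for graphing, and send an email when the check fires.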

Combining Two iTunes Libraries, No Duplicates Wanted.

I needed to merge my wife’s iTunes library with mine, and decided to write a Python script to handle it for me. My main requirement was to avoid creating duplicates and to copy into my library only the music that was exclusive to hers. Basically, copy from hers what I didn’t have. This script should work fine on any two directories of music files: it looks for matches based on song title, artist, and album info, not on file name or size. It will not modify either library, but simply creates a third library that contains the difference between the two. No warranty or guarantee implied! I simply stopped coding when this worked for me, and I’m sharing it in case it’s interesting or helpful to someone else. Copy this code, comment on it, or ignore it as you like!

import os
from mutagen.easyid3 import EasyID3
from mutagen.easymp4 import EasyMP4
from mutagen.id3 import ID3NoHeaderError
import traceback
import shutil

def getmlib(rootDir, report = False, log = False):
    music_dict = dict()
    failures = list()
    noid3headers = list()
    duplicate_files = list()
    for dirName, subdirList, fileList in os.walk(rootDir):
        #print('Found directory: %s' % dirName)
        for fname in fileList:
            spath = dirName + "/" + fname
            audio = None
            stitle = ''
            salbum = ''
            sbitrate = 0
            slength = 0
            sartist = ''
            try:
                # m4a files use MP4 tags; everything else is treated as ID3
                if fname.lower().endswith('.m4a'):
                    audio = EasyMP4(spath)
                else:
                    audio = EasyID3(spath)
                if 'title' in audio:
                    stitle = audio['title'][0]
                if 'artist' in audio:
                    sartist = audio['artist'][0]
                if 'album' in audio:
                    salbum = audio['album'][0]
                skey = stitle + '::' + sartist + '::' + salbum
                if skey in music_dict:
                    duplicate_files.append(spath)
                else:
                    music_dict[skey] = {'bitrate': sbitrate, 'artist': sartist, 'title': stitle, 'album': salbum, 'file': fname, 'path': dirName}
            except ID3NoHeaderError:
                noid3headers.append(spath)
            except Exception:
                failures.append({spath: "UNKNOWN FAILURE: \n" + traceback.format_exc()})

    if report:
        print('[NOID3HEADERS] ' + str(len(noid3headers)))
        print('[UNKNOWN FAILURES] ' + str(len(failures)))
        print('[INFO] Found [%i] songs' % len(music_dict))
        print('[INFO] Duplicate count is %i' % len(duplicate_files))

    if log:
        with open('lib-noid3headers.log', 'w') as noidf:
            for file in noid3headers:
                noidf.write(file + '\n')

        with open('lib-duplicates.log', 'w') as dupesf:
            for file in duplicate_files:
                dupesf.write(file + '\n')

    return music_dict

def getdifflib(core_lib_dir, alt_lib_dir):
    core_lib = getmlib(core_lib_dir, report=True, log=True)
    alt_lib = getmlib(alt_lib_dir, report=True, log=False)
    diff_lib = dict()
    for song_key in alt_lib:
        if song_key not in core_lib:
            diff_lib[song_key] = alt_lib[song_key]
    return diff_lib

def makedifflib(diff_lib, diff_lib_dir):
    for song_key in diff_lib:
        song = diff_lib[song_key]
        artist = song['artist']
        album = song['album']
        file = song['file']
        orig_path = song['path']
        new_dir = diff_lib_dir + '/' + artist + '/' + album
        if not os.path.exists(new_dir):
            os.makedirs(new_dir)
        try:
            shutil.copy(orig_path + '/' + file, new_dir)
        except (IOError, OSError):
            print('[COPY FAIL] trying to copy')
            print(orig_path)
            print(file)
            print(new_dir)

if __name__ == '__main__':
    core_lib_dir = '[PATHTO]/Music/iTunes/iTunes Media/Music/'
    alt_lib_dir = "[PATHTO]/altmusic/"
    diff_lib_dir = "[PATHTO]/diffmusic/"
    #core_lib = getmlib(core_lib_dir, report = True, log = True)
    diff_lib = getdifflib(core_lib_dir, alt_lib_dir)
    makedifflib(diff_lib, diff_lib_dir)

A Bit of Modular Web Design in Django

I found myself creating a web page intended to display a set of data objects, each similar in format; a pretty common need. The simplest thing to do would be to iterate over the list of data in the Django template, for example:

{% for o in some_list %}
    <div>#display data here#</div>
{% endfor %}

I want to be able to re-use and centrally control how the data is displayed, anywhere on the site. I figured out a nice way to do this using Django templates within Django templates. The code looks something like this:

def index(request):
    things = get_things()
    module_template = loader.get_template("thingModuleTemplate.html")
    things_html = ""
    for thing in things.values():
        thing_context = RequestContext(request, {'thing': thing})
        thing_html = module_template.render(thing_context)
        things_html += thing_html
    c = RequestContext(request, {'things_html': things_html})
    page_template = loader.get_template('index.html')
    return HttpResponse(page_template.render(c))

The module template looks something like this:


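Assuming each object exposes, say, a name and a description (those field names are my own, not from the original), a minimal module template for this pattern could be:

```html
<!-- thingModuleTemplate.html: renders one object; field names are hypothetical -->
<div class="thing-module">
    <h3>{{ thing.name }}</h3>
    <p>{{ thing.description }}</p>
</div>
```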
The code basically does what a normal Django view does, which is to render a page from a template and pass in data via a context variable, but that for loop generates the web module HTML used to display each object using ‘thingModuleTemplate.html’. On each pass through the loop, it renders a module and appends it to the things_html variable. Displaying the data on the index.html template is as simple as this (using the ‘safe’ filter so that things_html is treated as HTML and not automatically escaped):

{{ things_html|safe }}

Robert Arles

Software QA: Jenkins + Jenkins Slave Nodes + Selenium 2 + Browsermob Proxy


I’ve just finished a new test setup that allows me to capture network traffic in a test suite that is launched by Jenkins onto a Selenium 2 Grid. It was as painful as it sounds, but just as satisfying to complete. One of the big hurdles was needing to run the tests via Jenkins and capture the network traffic generated by the test, which runs on an unknown Selenium node.

If the test system we use were not distributed, this would have been relatively simple. Just start up a Java instance of a Browsermob Proxy, run the test, and read the net traffic from that proxy. This works locally, even during development, but in a scenario where a Jenkins server compiles and launches the test job, the Jenkins server would become the proxy. Not a solution. Also, in this setup, the Java code (the Selenium tests in my case) launched by Jenkins doesn’t know the address of the node that the tests are running on, so a proxy cannot be launched on that node. Even if you could, you couldn’t gather the network traffic, because, again, you don’t know which node it’s on. I’ve researched this problem quite a bit and found lots of “solutions” that do not work, at least for this environment. Most looked like partial solutions at best.

The solution: set the tests that need a proxy to run against localhost (wait, wait… I’m not crazy). Create a Jenkins slave node, in my case a JNLP slave on a Windows PC, with a Selenium server running locally (see where I’m going with “localhost” yet?). Set up the job in Jenkins to be restricted to that node. Now, for the proxy part: write the test code to start and use a Browsermob Proxy as an in-process Java instance, rather than using the ‘java -jar filename’ option at the command line. This is a nice way to go because a Java proxy instance is FAR cleaner to work with than writing to and reading from the REST interface you have to use otherwise. A nice side effect of using this setup with a Jenkins slave node is that other tests in the test suite will still be distributed through our Selenium Grid Hub. This means we can still migrate to a slave-node model at any time, but only if and when we decide to do so.

Now when the tests are launched, they run on the Jenkins slave node, which means the Java test code runs on the same server as the tests themselves (remember the crazy part where I said the tests should be set to run against localhost? They are local to the slave node now). Voila: the test code, the proxy, and the Selenium browser session are all running on the same machine, so both can contact the proxy using “localhost” as the server address.

If you find this article helpful, but could use a bit more detail on part of it, please leave a comment/question. I’d be glad to expand on this where I can.

Robert Arles


Naive Bayesian Probability is very cool…add bi-grams for extra coolness.

I’ve written a Django web-app that I’m still tinkering with. I have it slowly gathering information from multiple sources and classifying each piece of text for me. I’m really happy with the progress. NLTK made implementation pretty straightforward, though there was a definite learning curve for me. I have no background in this field, so I had to learn a bit. For someone approaching this problem who already has the right linguistics and some Python background, I’ll bet it’s amazingly easy to get started. The tweaking I have had to do so far has been, mostly, 1) picking out meaningless text that tends to appear in the data, and 2) adding bi-grams to the features being evaluated. Adding bi-grams was definitely an effective addition. That was the point where I saw a very nice jump from roughly 40% accuracy to 60% accuracy. It is now creeping up toward 80% accuracy, mostly due to having more good learning data. Without going into the details of the corpora or the classifications I’m doing… your results may vary, and greatly! I presume that bi-grams may not always be that useful, but if you are just getting started with this stuff and are looking to improve accuracy, they are worth a try.
In case it’s helpful as a ‘hint’ to someone, here is a small chunk of logic from my text-feature extraction that gathers bi-grams. It lives in the loop that walks through the text one word at a time. The function returns ‘featureDict’, which is basically all of the individual words plus the bi-grams.

    biGram = ""
    if text != "writes":  # skip the "authorName writes" boilerplate in some articles
        biGram = textAncestor1 + " " + text
        featureDict[biGram] = True
    textAncestor1 = text
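Pulled out of that loop and made self-contained, the same idea looks like this. The function and variable names here are my own, a sketch of the technique rather than the code from the app:

```python
def extract_features(words):
    """Build a feature dict of unigrams plus adjacent-word bi-grams."""
    feature_dict = {}
    prev = None
    for word in words:
        feature_dict[word] = True  # unigram feature
        # skip bi-grams ending in "writes" ("authorName writes" boilerplate)
        if prev is not None and word != "writes":
            feature_dict[prev + " " + word] = True  # bi-gram feature
        prev = word
    return feature_dict
```

A dict like this can be handed to NLTK’s NaiveBayesClassifier as a feature set, which is where the accuracy jump described above came from.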

Robert Arles

My New Site

Sadly, for me at least, the data for my site was lost. Multiple copies, all useless bits. So I’ve set up a new WordPress site. Never again will I try Drupal, thank you very much.  It was too complicated, required maintenance, and seems to have eaten my data.  That’s not entirely fair, but it did contribute to the problem, making recovery much more difficult. Hopefully I can recover some history, bit by bit, over time.