Converting Gatherer to a Bunch of Uncoupled Micro-Services

I’m currently rebuilding Gatherer, the engine behind NerdlyNews, as a set of uncoupled microservices. It’s going pretty smoothly, and the benefits of creating and internally consuming an API are obvious. It’s interesting how fast development starts moving once some API functionality is up and running. I’m starting work on the part that chooses what’s interesting, and I’m using the chance to expand the knowledge base and the way I use it to pick out what’s interesting in the news. I have pretty high hopes that the results are going to be noticeably better.


Fiddling With the Intelligence Behind NerdlyNews

I’ve spent a bit of time messing around with the algorithms behind NerdlyNews. It was doing what I wanted, picking out interesting articles from a large amount of noise, but given the large number of choices, it was still too noisy itself. I’ve tweaked a couple of parameters over the last few days, and I’m hoping it’s now a solid but not-too-busy stream to keep up with. Maybe it’ll get to the point of surfacing only the really interesting and important articles.

PageLoadStats is now on GitHub

PageLoadStats is a tool I wrote to gather performance stats about web pages and chart the data in a simple, useful way. It can also send alerts when the moving average of page load times exceeds a configurable alert level. It’s written in Python using Django. If you have any interest in digging into the source code, grabbing a copy for ideas, or using it to start a new project, have at it! I’m open to any input as well.
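As a rough illustration of the alerting idea (the function and parameter names here are hypothetical, not from the PageLoadStats source), the moving-average check boils down to something like:

```python
def should_alert(load_times, window=5, alert_level=2.0):
    """Return True when the moving average of the last `window`
    page load times (in seconds) exceeds `alert_level`."""
    if len(load_times) < window:
        return False  # not enough samples yet to average
    recent = load_times[-window:]
    moving_average = sum(recent) / float(window)
    return moving_average > alert_level
```

Averaging over a window instead of alerting on single samples keeps one slow request from firing off an email.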

Waiting for elements on dynamic webpages with Selenium / Webdriver

Below is my current favorite method of waiting for an element to appear or become useful on a dynamic web page. In this case, my example avoids the exception thrown when webdriver fails to find an element by using the ‘find_elements’ (plural) method rather than ‘find_element’ (singular). The ‘find_elements’ methods always return a list, even if it’s empty, rather than throwing an exception. Both are useful, but in this case I like the more readable code without the try/except requirements. I also like to have a way to record wait times for specific events. Here I’m just printing that info as an example, but I often record it in a db for charting or analysis later.

SOME_CSS_SELECTOR = '#interesting_element_selector'
MAX_WAIT = 5  # my style, this is often defined as a class constant
waited = 0
element = None
while waited < self.MAX_WAIT:
    elements = self.driver.find_elements_by_css_selector(self.SOME_CSS_SELECTOR)
    if len(elements) > 0:
        element = elements[0]
        break
    # wait for something. I most often put the sleep(1) just before the 'waited += 1',
    # but put it here in cases where I know there is a slow element load
    sleep(1)
    waited += 1
print '[INFO] Waited %i seconds for %s' % (waited, str(self.SOME_CSS_SELECTOR))

Python ErrorList object for use in Webdriver Testing

Here’s a bit of code from a post that was lost when my old site went down, data and all. I don’t recall if the original post was this Python version or my original Java version (sorry if that’s what you’re here for; ask in the comments and I can find and post that too). It’s an implementation of an ‘assert’ statement that allows the test to continue on failure, storing the error. At any point where I want the test to actually throw an exception, print a report, and halt, I just call ‘tester.assert_errors()’.

# overly simple example.
tester = Tester()
# test something, and store the result to be seen and actually asserted later
tester.assert_true(some_result_boolean, '[FAIL] Aw, shoot. Test failed while checking for some_result')
# some more logic here
# Now print an error report of what has failed so far, and if there are failures, throw an exception
tester.assert_errors()


Here’s the tester class (with a missing send_email implementation)

from testutils.testutil import send_email

class Tester():
    """A test helper that collects errors, prints reports, etc."""

    def __init__(self):
        # set up some instance vars
        self.errors = []
        self.is_error_free = True
        self.test_count = 0

    def assert_true(self, test_result, fail_message):
        """Add failures to an error stack"""
        self.test_count += 1
        if not test_result:
            self.is_error_free = False
            self.errors.append(fail_message)

A Couple Personal Projects: NerdlyNews and PageLoadStats

I have worked on several web-based projects. I recently created NerdlyNews, which uses Bayesian logic to grab interesting news from sites that I really like. I’m using a WordPress front-end for that one, with the JetPack extension, so the output of the Bayesian algorithm can be posted to the site using the WordPress APIs. It’s really a nice way to go since I’m not a UI designer.

I also created PageLoadStats. This one is a performance monitoring tool with a pretty straightforward front-end that graphs the data it gathers, and it is very useful to me. You just tell it what URLs to monitor and what the alert limits are, and it will monitor, graph, and send email alerts.

Combining Two iTunes Libraries, No Duplicates Wanted.

I needed to merge my wife’s iTunes library with mine, and decided to write a Python script to handle it for me. My main requirements were to not create duplicates and to copy into my library only the music that was exclusive to hers; basically, copy from her library what I didn’t have. This script should work fine on any two directories of music files. It looks for music based on song title, artist, and album info, not by file name or size. It will not modify either library; it simply creates a third library containing the difference between the two. No warranty or guarantee implied! I stopped coding when this worked for me, and I’m sharing it in case it’s interesting or helpful to someone else. Copy this code, comment on it, or ignore it as you like!

import os
from mutagen.easyid3 import EasyID3
from mutagen.easymp4 import EasyMP4
from mutagen.id3 import ID3NoHeaderError
import traceback
import shutil

def getmlib(rootDir, report=False, log=False):
    music_dict = dict()
    failures = list()
    noid3headers = list()
    duplicate_files = list()
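The rest of the script is cut off here, but the core idea reduces to a set difference keyed on tag data. Here's a minimal sketch of just that idea, with hypothetical names; this is not the full script above:

```python
def exclusive_tracks(her_lib, my_lib):
    """Given two dicts mapping (title, artist, album) -> file path,
    return the paths present in her_lib but missing from my_lib."""
    missing_keys = set(her_lib) - set(my_lib)
    return [her_lib[key] for key in sorted(missing_keys)]
```

In the real script the keys come from the ID3/MP4 tags that mutagen reads, which is why matching works regardless of file name or size.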

A Bit of Modular Web Design in Django

I found myself creating a web page intended to display a set of data objects, each similar in format; a pretty common need. The simple thing to do would be to iterate over the list of data in the Django template, for example:

{% for o in some_list %}
    <div>#display data here#</div>
{% endfor %}

I want to be able to re-use and centrally control how the data is displayed, anywhere on the site. I figured out a nice way to do this using Django templates within Django templates. The code looks something like this:

def index(request):
    things = get_things()
    module_template = loader.get_template("thingModuleTemplate.html")
    things_html = ""
    for thing in things.itervalues():
        thing_context = RequestContext(request, {'thing': thing})
        things_html += module_template.render(thing_context)
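For completeness, the inner template (thingModuleTemplate.html) could be as simple as a fragment like this (the contents here are hypothetical, since the original file isn't shown):

```
<div class="thing">
    <h3>{{ thing.name }}</h3>
    <p>{{ thing.description }}</p>
</div>
```

Because the fragment lives in one file, any page that renders things through it picks up layout changes automatically.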

Software QA: Jenkins + Jenkins Slave Nodes + Selenium 2 + Browsermob Proxy


I’ve just finished a new test setup that allows me to capture network traffic in a test suite that is launched by Jenkins onto a Selenium 2 Grid. It was as painful as it sounds, but just as satisfying to complete. One of the big hurdles was the need to run the tests via Jenkins and capture the network traffic generated by the test, which occurs on an unknown Selenium node.

If the test system we use were not distributed, this would have been relatively simple. Just start up a Java instance of a Browsermob Proxy, run the test, and read the net traffic from that proxy. This works locally, even during development, but in a scenario where a Jenkins server compiles and launches the test job, the Jenkins server would become the proxy. Not a solution. Also, in this setup, the Java code (the Selenium tests in my case) launched by Jenkins doesn’t know the address of the node that the tests are running on, so a proxy cannot be launched on that node. Even if you could, you couldn’t gather the network traffic, because, again, you don’t know which node it’s on. I’ve researched this problem quite a bit and found lots of “solutions” that do not work, at least for this environment. Most looked like partial solutions at best.

The solution: set the tests needing a proxy to run on localhost (wait, wait… I’m not crazy). Create a Jenkins slave-node, in my case a JNLP slave on a Windows PC, with a Selenium server running locally (see where I’m going with “localhost” yet?). Set up the job in Jenkins to be restricted to that node. Now, for the proxy part: write the test code to start and use a Browsermob proxy as a Java instance, rather than using the ‘java -jar filename’ option at the command line. This is a nice way to go because an in-process Java proxy instance is FAR cleaner to work with than writing to and reading from the REST interface you have to use otherwise. A nice side effect of using this setup with a Jenkins slave-node is that other tests in the suite will still be distributed through our Selenium Grid Hub. This means we can still migrate to a slave-node model at any time, but only if and when we decide to do so.

Now when the tests are launched, they run on the Jenkins slave-node, which means the Java test code runs on the same server as the tests themselves (remember the crazy part where I said the tests should be set to run on localhost? They are local to the slave-node now). Voilà: the test code, the proxy, and the Selenium browser session are all running on the same machine, so they can all reach the proxy using “localhost” as the server address.

If you find this article helpful, but could use a bit more detail on part of it, please leave a comment/question. I’d be glad to expand on this where I can.

Robert Arles


Naive Bayesian Probability is very cool…add bi-grams for extra coolness.

I’ve written a Django web app that I’m still tinkering with. It slowly gathers information from multiple sources and classifies each piece (corpus) for me. I’m really happy with the progress. NLTK made implementation pretty straightforward, though there was a definite learning curve for me. I have no background in this field, so I had to learn a bit. For someone approaching this problem with the right linguistics and some Python background, I’ll bet it’s amazingly easy to get started. The tweaking I’ve had to do so far has mostly been 1) filtering out meaningless text that tends to appear in the data, and 2) adding bi-grams to the features being evaluated. Adding bi-grams was definitely an effective addition; that was the point where I saw a very nice jump from something like 40% accuracy to 60% accuracy. It is now creeping up toward 80% accuracy, mostly due to having more good learning data. Without going into the details of the corpora or the classifications I’m doing… your results may vary, and greatly! I presume bi-grams may not always be this useful, but if you are just getting started with this stuff and are looking to improve accuracy, they are worth a try.
In case it’s helpful as a ‘hint’ to someone, here is a small chunk of logic from my text-feature extraction that gathers bi-grams. It lives in the loop that walks through the text one word at a time. The function returns ‘featureDict’, which is basically all of the individual words plus the bi-grams.

biGram = ""
if text != "writes":  # this is to skip the "authorName writes" in some articles
    biGram = textAncestor1 + " " + text
    featureDict[biGram] = True
textAncestor1 = text
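Expanded into a self-contained function for context (this is my sketch with hypothetical names; the original loop carries more surrounding logic than the snippet shows):

```python
def extract_features(text):
    """Build a feature dict of individual words plus adjacent-word
    bi-grams, skipping bi-grams ending in the 'writes' token."""
    featureDict = {}
    textAncestor1 = ""
    for word in text.split():
        featureDict[word] = True  # the individual word is always a feature
        # pair each word with the previous one to form a bi-gram
        if word != "writes" and textAncestor1:
            featureDict[textAncestor1 + " " + word] = True
        textAncestor1 = word
    return featureDict
```

The returned dict can be fed straight to an NLTK NaiveBayesClassifier as a feature set.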

Robert Arles