Investigative Data Parsing Courtesy of Hacks/Hackers

I haven’t been to a Hacks/Hackers event in a good long time. The sessions are ostensibly geared towards journalists and investigative reporters–of which I’m neither–but the presenters often discuss useful data manipulation tools with wide applications outside the news industry–say, in marketing.

At this last H/H meetup on Wednesday, the audience of mostly newsies with some developers and designers had the chance to preview a few talks that will be given at the NICAR (National Institute for Computer Assisted Reporting) conference. I heard Amanda Hickman, CUNY Professor of Journalism, discuss regular expressions. And you know, she did a darn good job of it.

Checking my resume again, I noticed that I have a computer science degree and in my pre-writing days I was a UNIX developer–not a good one, but certainly capable. And once upon a time I was taught regular expressions, finite state machines, etc. So why now, when I’m up against impenetrable tables containing security breach and privacy statistics, do they put me into a meditative state that ends with alpha wave activity and snoring sounds?

’Cause I don’t want to commit an act of programming, debug regexes, and spend hours getting it right, that’s why. Professor Hickman–btw, I wish I had her teaching and motivating me in the ways of data instead of the CS and Math types I experienced–showed us all how to test the unfriendly regex syntax using something called Rubular.

Rubular is a Ruby app that does a few things for you. After you enter a sample of your target text and the first iteration of your regular expression, it points out any syntax errors. Once you’ve reached acceptable syntax, Rubular will display search results in a separate pane. Oh my gosh, all of this is done on the fly. It’s far better than the pre-Web era in which you stared at command line messages spewed out from grep. For those who remember how to use the “()” for grouping results, Rubular will conveniently display grouped results in another pane. And it even provides a cheat sheet of regex syntax so you don’t have to remember the difference between \s and \w.
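For anyone who has forgotten what grouping buys you, here is a minimal sketch in Python (my choice of language, not something shown at the meetup). The sample line and the pattern are made up for illustration; the point is only that each “()” captures one piece of the match, and that \s means whitespace while \w means a letter, digit, or underscore.

```python
import re

# Hypothetical sample line, invented for this example.
sample = "2014-02-19  Acme Corp  12000 records"

# \d = digit, \s = whitespace, \w = "word" character (letter, digit, underscore).
# Each pair of plain parentheses captures one group; (?: ... ) groups without capturing.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\w+(?:\s\w+)*)\s+(\d+)\s+records")

match = pattern.search(sample)
groups = match.groups() if match else ()

# Prints the date, the organization name, and the record count as separate items.
print(groups)
```

Paste the same sample and pattern into a tester and the grouped results should show up as three separate pieces, which is roughly what that extra pane is for.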

And then Prof. Hickman told us all about another app called regex101 that provides additional diagnostic information about exactly what the expression you’ve come up with is doing. It can be the case that after you get your regex to parse data, it’s not exactly working the way you intended. That’s where regex101 comes in–a definite time saver for researchers, data journalists, marketers, and the rest of us who have to deal with pathological data formats.
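A classic case of a regex that runs fine but doesn’t do what you meant is greedy matching. The sketch below is my own illustration in Python, not an example from the talk, and the row of data is hypothetical. Both patterns are valid; only the second one pulls out each quoted field separately.

```python
import re

# Hypothetical row of quoted fields, made up for this example.
row = '"Acme Corp","12000","hacking"'

# Greedy: .* grabs as much as it can, so the match runs from the
# first quote all the way to the last quote on the line.
greedy = re.findall(r'"(.*)"', row)

# Lazy: .*? grabs as little as it can, so each quoted field
# becomes its own match.
lazy = re.findall(r'"(.*?)"', row)

print(len(greedy), "match with the greedy pattern")
print(len(lazy), "matches with the lazy pattern")
```

The greedy version doesn’t throw an error, it just quietly returns one big match instead of three fields. That’s the kind of surprise a step-by-step explanation of the expression helps you catch.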

I had to run back to Jersey before the session ended. I did, though, learn from The Times’ Robert Gebeloff how to analyze US census data using the University of Minnesota’s powerful IPUMS tool. Thanks, Robert!