Tabula is great at scraping tables from PDFs

Whenever my hunt for numeric datasets led me to a PDF file, I used to raise the white flag of surrender and close up shop for the day. There was little I could do short of copying each row of data manually … for the next six months. Then I found out about Tabula.

Tabula is a great piece of open-source software that liberates table data embedded in PDFs. It’s a kind of magic wand for researchers, marketers, public policy wonks, and data scientists. You can try a demo version here to see what it can do. And there’s a GitHub repository with instructions for setting up your own production version of Tabula.

The software was a joint effort of Knight-Mozilla Fellow Manuel Aristarán and Jeremy B. Merrill of ProPublica.