dataset-fcc

The Journey so Far

(not so) Simple Data Entry

In early 2017, I started working as a Research Assistant to Dr. David Byrne at the University of Melbourne.

I was given 6 “BATCH” folders of scanned microfiche, which some poor souls in Washington D.C. pulled from the depths of the Federal Communications Commission’s archives. Each subfolder (CALLSIGN/) in a BATCH/ corresponded to a set of licence transfer applications (FILING/) for a particular licence number/call-sign. In each FILING/ was a varying number of .jpg images. All the documents date from the late 1980s to the late 1990s; after this period, filings were made digitally.
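For anyone who wants to picture the layout, here is a minimal sketch in Python that walks that folder hierarchy. The root folder name is a placeholder, and it assumes the tidy BATCH/CALLSIGN/FILING nesting described above actually holds (which, as you’ll see, it often didn’t):

    from pathlib import Path

    ARCHIVE = Path("scans")  # hypothetical root holding the six BATCH folders

    # Expected layout: BATCH/CALLSIGN/FILING/*.jpg
    for batch in sorted(ARCHIVE.glob("BATCH*")):
        for callsign in sorted(p for p in batch.iterdir() if p.is_dir()):
            for filing in sorted(p for p in callsign.iterdir() if p.is_dir()):
                pages = sorted(filing.glob("*.jpg"))
                print(f"{batch.name}/{callsign.name}/{filing.name}: "
                      f"{len(pages)} page scans")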

My assigned task was to take each scanned application form for licence transfers and enter the details into an Excel spreadsheet. Simple, right? Not so much.

Let me detail a few of the problems I encountered.

The archived forms were what can only be described as red-tape standard: missing information, inconsistencies, changing formats, and lots of redundancy.

On top of the forms making no sense, the scanned files lacked consistent structure, and image quality was often poor.

The upshot of these issues was that I had no reference for whether I was doing useful work, or making progress towards any tangible goal. It took time to work out a lot of these quirks, and once I did, much of the data entry I had already done turned out to be garbage. For example, in the beginning I was recording the first date I saw as the transaction date. It was how the .jpg files were named, so I figured it was good enough. Turns out that was a mistake.

Data entry is not exactly anyone’s favourite pastime, but realising you’ve been entering junk into your spreadsheets? I can think of few things more demoralising. Except maybe the realisation that you have to do it all again, AND that so far you’ve only touched a fraction of the total number of scans. I was told they’d been scanning these microfiche for almost 4 years…

BASH those files!

After a month or two of creating basically junk data, I was slowly losing my mind. Worse yet, I was constantly either avoiding my supervisor or apologising to him for my lack of progress. I had no idea what to do with all these quirks. The messiest datasets I’d encountered until then were ABS spreadsheets – missing values and weirdly named variables.

My supervisor had told me to just keep a note of any questions or issues I came across. The problem was that my list of “things I’m not sure about” was growing faster than the actual dataset – what’s the difference between an Assignment and a Transfer? What does “Pro-Forma” mean, and why is it written on some transactions? What if there are more than two parties to a transaction?

The dataset itself was also not much of a table. “Record information that seems useful.” What happens when the supporting documents contain all kinds of useful information – lists of subsidiary companies, details of other licences owned, notices of new licence numbers issued for partial assignments? Well, in my case, the number of columns and sheets I was working with just kept growing.

Eventually I figured out that I needed to do some reorganising.
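To give a flavour of what that reorganising might look like, here is a rough sketch of one possible shape for the data: one row per filing, with parties and related licences split into their own linked tables instead of ever more columns. The table and column names here are purely illustrative, not the schema I actually settled on:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE filings (
        filing_id   TEXT PRIMARY KEY,  -- e.g. the BATCH/CALLSIGN/FILING path
        callsign    TEXT,
        filing_type TEXT,              -- assignment, transfer, pro-forma, ...
        txn_date    TEXT
    );
    CREATE TABLE parties (
        filing_id TEXT REFERENCES filings(filing_id),
        role      TEXT,                -- assignor/assignee, transferor/transferee
        name      TEXT
    );
    CREATE TABLE related_licences (
        filing_id TEXT REFERENCES filings(filing_id),
        callsign  TEXT                 -- other licences mentioned in supporting docs
    );
    """)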

Tesseract misadventures (invest in set-up or continue manually? human-in-the-loop decisions)
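As a preview of that section: the core decision was whether OCR output was trustworthy enough to use, or whether a page should go back to a human. A minimal sketch, assuming pytesseract and Pillow are installed and Tesseract is on the PATH; the confidence cut-off is an arbitrary illustration, not a tuned value:

    import pytesseract
    from PIL import Image

    def ocr_with_confidence(path):
        """Run Tesseract on one scan; return (text, mean per-word confidence)."""
        data = pytesseract.image_to_data(Image.open(path),
                                         output_type=pytesseract.Output.DICT)
        pairs = [(w, float(c)) for w, c in zip(data["text"], data["conf"])
                 if w.strip() and float(c) >= 0]  # drop empty boxes and conf == -1
        if not pairs:
            return "", 0.0
        text = " ".join(w for w, _ in pairs)
        return text, sum(c for _, c in pairs) / len(pairs)

    text, conf = ocr_with_confidence("scan-clean.png")
    if conf < 60:  # illustrative cut-off: too noisy, keep a human in the loop
        print("flag this page for manual entry")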

ImageMagick (regularising images into individual source units)
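And a sketch of what “regularising” can mean in practice, assuming ImageMagick’s convert command is on the PATH; the exact flags are illustrative rather than the ones I finally settled on:

    import subprocess
    from pathlib import Path

    def regularise(src: Path, dst: Path) -> None:
        """Deskew, grayscale and normalise one scan so every page looks alike."""
        subprocess.run([
            "convert", str(src),
            "-deskew", "40%",       # straighten crooked microfiche scans
            "-colorspace", "Gray",  # drop colour noise from the photography
            "-normalize",           # stretch contrast so faint text is legible
            str(dst),
        ], check=True)

    regularise(Path("scan.jpg"), Path("scan-clean.png"))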

Meta-Tagging, clustering images, elimination (identifying high density images)
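The idea in that section, sketched with Pillow, under the assumption that ink density (the share of dark pixels) is a workable proxy for “this page actually contains a filled-in form”; the cut-off is illustrative:

    from PIL import Image

    def ink_density(path, threshold=128):
        """Fraction of pixels darker than `threshold` in a grayscale render."""
        img = Image.open(path).convert("L")
        pixels = list(img.getdata())
        return sum(1 for p in pixels if p < threshold) / len(pixels)

    # Near-blank separator pages score close to 0; dense form pages score far higher.
    if ink_density("scan-clean.png") < 0.02:  # cut-off chosen by eye, illustrative
        print("probably a blank or separator page; skip it")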

Finally encoding data, triangulation (detective work)
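The flavour of that detective work, sketched with made-up field names: collect every candidate date a filing offers, take the majority, and flag any disagreement for a human to resolve:

    from collections import Counter

    def triangulate(candidates):
        """Pick the majority date across sources, or flag a conflict."""
        dates = Counter(d for d in candidates.values() if d)
        if not dates:
            return None, "no date found"
        best, votes = dates.most_common(1)[0]
        filled = sum(1 for d in candidates.values() if d)
        return best, "ok" if votes == filled else f"conflict: {dict(candidates)}"

    date, status = triangulate({
        "form_header": "1994-03-17",    # date typed on the application form
        "consent_stamp": "1994-03-17",  # date on the FCC consent stamp
        "jpg_filename": "1994-02-01",   # the date I used to (wrongly) trust
    })
    print(date, status)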

Reinforcements are here! But who does what? (operationalisation)