Where's The Data?

The Web is amazing for answering questions. If you want to know what the .JPG file extension means, the answer is just an internet search away. Millions of answers. However, if you stray from the common path even a tiny bit, things get hairy. What if you want a list of all file extensions? That is much harder to find. Occasionally you might find a PDF listing them, but if you are asking for all file extensions you probably want to do something with that list, which means you want it in some computable form: a database, or at least a JSON file. Now you are in the world of ‘public’ data. You are in a world of pain.
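To make “computable” concrete, here is a minimal sketch of what one record in such a list might look like. The schema and field names are my own invention for illustration, not taken from any real dataset:

```python
import json

# Hypothetical records for a file-extension dataset.
# The schema (extension / mime_type / description) is invented
# for illustration; no canonical public dataset defines it.
extensions = [
    {"extension": ".jpg", "mime_type": "image/jpeg",
     "description": "JPEG raster image"},
    {"extension": ".png", "mime_type": "image/png",
     "description": "Portable Network Graphics image"},
]

# "Computable" just means a program can load and query it directly.
print(json.dumps(extensions, indent=2))
```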

Searching for “list of file extensions” takes you to the Wikipedia page, which is open but not computation-friendly. Every other link you find is spam: an endless parade of sites, each claiming to be the central repository of file extension data. They all have two things in common:

  • They are filled with horrible spam like ‘scan your computer to speed it up’ and ‘best stock images on the web’ and ‘get your free credit report now’.
  • They let you add new extensions but don’t let you download a complete list of the existing ones.

What I want is basic facts about the world: facts generated by the public that really should belong to the public. And I want these facts in a computable form. So far I cannot find such a source for file extensions. These public facts, as they exist on the internet, have morphed into a spam trap, vending tiny bits of knowledge in exchange for eyeball traffic. These sites take a public resource and capture all of its value, providing nothing in return but more virus-scanner downloads. That they also provide so little useful information is the reason I have not linked to them (though they are obviously a search away if you care).

The closest thing I can find to a computable file extension list is the MIME type database inside Linux distros. This brings up a second point: every operating system, and presumably every web browser, needs a list of all file extensions, or at least a reasonable subset, yet each vendor maintains its own list. Again, these are public facts that should be shared, much as the code which processes them is shared.
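As a rough sketch of how that database surfaces in practice, Python’s standard mimetypes module reads the system’s MIME tables (on Linux this typically includes /etc/mime.types; the exact files vary by distro) and exposes an extension-to-type mapping:

```python
import mimetypes

# Load the platform's MIME tables; on Linux this typically
# reads files such as /etc/mime.types (paths vary by distro).
mimetypes.init()

# types_map maps extensions such as '.jpg' to types such as 'image/jpeg'.
for ext, mime in sorted(mimetypes.types_map.items())[:10]:
    print(ext, mime)

# Guess the type of a single file name.
print(mimetypes.guess_type("photo.jpg"))  # ('image/jpeg', None)
```

Each vendor’s copy of these tables is a partial, privately maintained snapshot of the same public facts.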

File extensions are not the only public facts that suffer this fate of spam capture. I think this hints at a larger problem: if humanity is to enable global computing, then we need a global knowledge base to work from. A knowledge base that belongs to everyone, not just a few small companies, and especially not the spammers.

Wikipedia and its various data offshoots would seem to be the logical source of global computable data, yet the results are dismal. After a decade of asking, Wikipedia’s articles still aren’t computable in any real sense. You can do basic full-text search and download articles in (often malformed) wiki markup. That’s it. Want the list of all elements in the periodic table? Not available in computable form from Wikipedia. Want a list of all mammals? Not from them. Each of these datasets can actually be found on the web, unlike the list of file extensions, but not in a central place and not in the same format. The data offshoots of Wikipedia have even bigger problems, which I will address in a follow-up post.

So how do we fix this? Honestly, I don’t know. Many of these datasets do require work to build and maintain, and those maintainers need to recoup their costs (though many of them are already paid for with public funds). If this were source code I’d just say it should be a project on GitHub. I think that’s exactly what we need.

We need a GitHub for data. A place we can all share and fork common data resources, beholden to no one and computable by everyone.
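To show what that could feel like to use, here is a hypothetical sketch. The repository URL, file layout, and schema below are all invented, since no such canonical dataset exists (which is the point of this post):

```python
import json
import urllib.request

# Entirely hypothetical URL, invented for illustration.
URL = ("https://raw.githubusercontent.com/"
       "example/file-extensions/master/extensions.json")

with urllib.request.urlopen(URL) as resp:
    extensions = json.load(resp)

# Query the shared dataset like any other data structure.
jpg = next(e for e in extensions if e["extension"] == ".jpg")
print(jpg["mime_type"])
```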

Building and populating a GitHub for data, at least for these smaller, well-defined datasets, doesn’t seem like a huge technical problem. Why doesn’t it exist yet? What am I missing?

Talk to me about it on Twitter

Posted April 29th, 2014

Tagged: rant data