Runs on Data

A new golden age of Open Data

Leveraging the vast unstructured human effort for good

As mentioned in my Race Roundup announcement post, I’m trying to contribute to Open Data.

I have a pretty big fascination with data, and had a two-year career stint trying to build knowledge graphs for the pharmaceutical industry. This introduced me to the world of ontologies, Wikidata, SPARQL, and all the rest.

I left that project with mixed feelings about trying to classify "all of the data" with an ever-increasing suite of hammers for each use case that came our way. I saw Wikidata as the gold standard for achieving something like this at scale, and, as many people have previously pointed out, it requires a tremendous amount of effort, critically, human effort.

The public benefit of this work was readily apparent, but the effort always seemed futile. You could scale your impact by building a few data pipelines that ingested various datasets into your standardised format, but that simply shifts the human step upstream to the source datasets. Another option was to run NLP pipelines over "all the data", which is what my day job entailed. But despite the results being useful enough for internal purposes, there was always a very clear line in the sand between machine-learning-based approaches and human curation.

As people in the know will tell you, there was a step-change at the frontier of capabilities in December 2025 with the release of Claude Opus 4.5. I'm now convinced we can, in fact, "categorise all the things".

I have my opinions on what the future of AI holds, but for now I consider it a fun sort of volunteer service to manage these agents for open data initiatives.

All of this data is begging to be shared with the masses; it's just waiting to be unlocked.