Posted 29 September, 2013 by Christopher Wilson

Towards Ecologies of Usable Open Data?

This blog post summarizes a workshop held on interoperability and standards for social good data at Open Knowledge Conference. It outlines some of the issues identified for working with disparate data sets in social good and accountability work, and describes working-group outputs, including an introductory guide to resources for working with open data, notes towards principles for producing usable open data, and notes towards a lowest-common-denominator set of standards for different types of data. These outputs are all very preliminary, but we hope that those interested will get in touch and help move some of this work forward.


Last week was #OKCon (eventifier page here). It was a great event, with a host of interesting people, and an agenda that sprawled elegantly from large themes and plenary keynotes, to small, focused, hands-on sessions. iilab and the engine room collaborated on putting together one of the latter. It was a small workshop focused on what we see as one of the big, still unexplored challenges to the ethos of open: when you actually have a pile of data sets released on any given topic, how do you make sense of it?

There’s been some recent discussion about the problem of too much disparately packaged data, and what it means for the demand side of open data. Just last week, Martin Tisne offered a short post concisely framing some of the key questions, Owen Ryan offered some heuristics, and IDS launched an initiative to start digging into some of the issues at the country level. We like that kind of specificity, and focused our workshop on specific use cases because we think the big questions about data standards and interoperability (in non-jargon: connecting data sets to find meaning) are just too hard to handle in the abstract. Our hope was that specific use cases would at least indicate where we should work more, and highlight some challenges data users should watch out for.

To this end, we prepped four data clusters for the workshop (multiple, disparate data sets on extractive industries in Nigeria, aid programs in Nepal, internet freedom in Iran and international social good initiatives using data or technology). Our plan was to warm up and then break into small groups for each use case, to see what each of the data clusters could tell us. Perhaps predictably, it didn’t go as planned. We had a great mix of participants, spanning the full gamut from hard-core geeks to strategic and project experts, and in the first two hours we came up with an almost entirely new agenda of fundamental issues and challenges to address.

Some of these translated directly into things-to-do:

  • Compile a resource list for people and organizations who want to open or use open data, with introductions to basic concepts, strategies and tools: a kind of guided tour through the open data jungle

  • Build an uber-basic standard for all types of data that are opened, a kind of taxonomical lowest common denominator that every data set should adhere to, and upon which any data structure can build

  • Start thrashing out basic principles for opening usable data, including UX and user-centric architecture principles that allow people with varying degrees of data literacy to access, make sense of, use, and connect open data

  • Start building out an “App store-style” repository for open data that would provide easy access to multiple data sources and promote standards for specific types of data

  • Develop specifications for a validator that can help data producers to check if they are adhering to good practices for data interoperability (such as the lowest common standards denominator).

  • Enumerate draft principles for data governance and coordination among data producers (especially those, like development organizations, that are already working in the social good field)

We added the prepared data clusters onto the end of this list, and let people form groups based on their interest. We only had an hour more to go at this point, but we captured some important insights, and laid some groundwork for further work.

Here are some of the insights and areas which might be worth pursuing further (extrapolated from notes, so omissions and missteps are mine):

 

  • Open Data Resource Kit
    Jun Matsushita (iilab.org) and Arnaud Sahuguet (Google) put together this fantastic 8-page introduction to the universe of open, aiming to collect “Everything you ever wanted to know when starting your open data project… but were too afraid to ask”. It’s an open document and they are looking for input and contributions. Take a look.

  • Lowest Common Standards Denominator
    This group started working towards basic minimum standards that should apply to all kinds of open data that can be used for public or social good. They considered a “lowest common denominator” as a kind of datum that cannot be broken down further, one that could enable taxonomic interoperability while remaining content-agnostic. The group saw Dublin Core metadata as an interesting example of this, worth exploring further.
    The group also considered identifying “champion” organizations to sponsor, steward and promote LCD standards for specific areas, such as FAO for food and agriculture data, Open Street Map for location data (point-based or shapefile polygons) or Open Contracting for open contract data. Other ideas included:

    • establishing a resource repository for issue-specific information that can facilitate communication and advocacy around the subject, and informing projects about what has already been mapped or standardized in that field.

    • visualizing existing open datasets and related standards (cloud maps)
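To make the Dublin Core idea concrete, here is a minimal sketch of what such a content-agnostic, lowest-common-denominator record might look like. The field names follow the Dublin Core element set, but the data set and all sample values are hypothetical:

```python
# A minimal description of a data set using a handful of the 15 Dublin Core
# elements. Any data set, whatever its subject matter, could carry a record
# like this alongside it — which is what makes it a candidate lowest common
# denominator. All values below are hypothetical.
dataset_record = {
    "dc:title": "Aid projects in Nepal, 2010-2013",
    "dc:creator": "Example Aid Transparency Initiative",
    "dc:date": "2013-09-01",
    "dc:format": "text/csv",
    "dc:identifier": "http://example.org/datasets/nepal-aid-2013",
    "dc:rights": "CC-BY",
    "dc:description": "Project-level records of foreign aid disbursements.",
}

for field, value in sorted(dataset_record.items()):
    print(f"{field}: {value}")
```

Because every field is plain descriptive metadata rather than content, two entirely unrelated data sets described this way can still be catalogued, searched and compared at this taxonomic level.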

  • Principles for Usable Data
    This group began by proposing some fundamental definitions. They defined Usability as when data can be understood by any person asking a question of it, and Consumability as when data can be read by a machine other than the one storing it.

 

Here is a checklist that the group came up with to make sure your data is ready for an interoperating world:

  • When your data contains unstructured text (freeform qualitative inputs)

    • Use tagging: add tags to your unstructured texts (i.e. create a new field for tags/keywords that capture the key concepts in the unstructured text)

    • Use formats that make your text “machine readable”

    • Use tools and formats for “inline” tagging, such as:

      • Akoma Ntoso (a format for enriching/tagging legislative data)

      • Bungeni

  • When your data contains quantitative aspects (such as statistics, measurements,…)

    • Publish your metadata codes: describe in detail what each of your numbers means. Others can’t reuse your data if they don’t know what you’re measuring.

    • Use standard metadata codes (or map your internal codes to standard ones): if you’re using well-known units of measurement or quantity, try to adhere to existing standards (such as ISO standards)

    • Consider involving Standards Developing Organisations (http://en.wikipedia.org/wiki/Standards_organization) to support the effort to standardize your metadata

  • When giving access to your data

    • Allow your data to be downloaded in multiple formats, keeping the same structure and descriptions across all of them

    • Make your data accessible to humans too, and provide the context (metadata, descriptions, source, …) needed to understand it.
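The checklist above can be sketched in code. The example below (all field names, values, and the internal-to-standard code mapping are hypothetical) adds a tags field to unstructured text, maps an internal unit code to its ISO 4217 currency code, and writes the same records out in two formats while keeping the same structure:

```python
import csv
import io
import json

# Hypothetical records: unstructured text plus a quantitative field.
records = [
    {"id": 1, "notes": "New water pump installed in Kaski district",
     "amount": 5000, "unit": "usd"},
]

# 1. Tagging: a new field of keywords describing the unstructured text.
records[0]["tags"] = ["water", "infrastructure", "Kaski"]

# 2. Standard metadata codes: map internal unit codes to ISO 4217.
UNIT_TO_ISO4217 = {"usd": "USD", "eur": "EUR"}
for r in records:
    r["unit"] = UNIT_TO_ISO4217[r["unit"]]

# 3. Multiple formats, same structure and field names.
as_json = json.dumps(records, indent=2)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "notes", "tags", "amount", "unit"])
writer.writeheader()
for r in records:
    writer.writerow({**r, "tags": ";".join(r["tags"])})  # flatten list for CSV
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```

The point is not the particular formats, but that a consumer of either file sees identical structure and can rely on the unit codes meaning the same thing everywhere.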

 

  • Principles for Data Governance and Coordination
    This group identified a number of activities that could be used to promote better coordination among data producers. A compilation of arguments for the benefits of open data was notably lacking, and would be useful when advocating to governments, NGOs and other stakeholders. They also noted the importance of lobbying the private sector and using the data philanthropy norm to encourage corporations to open more data, and the importance of supporting successful and sustainable services developed by third parties (for example start-ups and social enterprises).

  • Aid Data in Nepal
    This group looked at three data sets on foreign aid projects in Nepal and discussed different approaches to merging them into a single database. Two of the three data sets (OECD Creditor Reporting System and IATI) provided information from the national level, while one data set (AMP) included information on projects as reported by the implementing organization at the local level. Merging the three sources might help to track aid projects, and inconsistencies between the data from the national and local levels might be an indication of fraud. As there was no common identifier across the different data sets, our suggestion was to develop a routine that would compare a set of variables across the different files (starting date, end date, amounts, donor country, project sector). This routine would return projects as matching when there was sufficient overlap across the majority of variables (same starting date plus/minus two months, same donor, same amount plus/minus x%, etc). In our view, this would provide a first indication for identifying the same project across the three data sets. However, our feeling was that these results would still have to be validated by the human eye.
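A minimal sketch of such a matching routine might look like the following. The field names, tolerances and scoring threshold are all hypothetical; real CRS, IATI and AMP records would need field-specific cleaning and normalisation first:

```python
from datetime import date

def project_match_score(a, b, amount_tolerance=0.10, date_tolerance_days=60):
    """Score how likely two project records describe the same project.

    Each record is a dict with hypothetical keys: 'start' and 'end' (dates),
    'amount' (number), 'donor' and 'sector' (strings). Returns the fraction
    of variables that agree within tolerance.
    """
    checks = [
        abs((a["start"] - b["start"]).days) <= date_tolerance_days,
        abs((a["end"] - b["end"]).days) <= date_tolerance_days,
        a["donor"].strip().lower() == b["donor"].strip().lower(),
        a["sector"].strip().lower() == b["sector"].strip().lower(),
    ]
    # Amounts match within a relative tolerance (e.g. plus/minus 10%).
    bigger = max(a["amount"], b["amount"])
    checks.append(bigger > 0 and
                  abs(a["amount"] - b["amount"]) / bigger <= amount_tolerance)
    return sum(checks) / len(checks)

def candidate_matches(left, right, threshold=0.8):
    """Return pairs of records scoring above the threshold, for human review."""
    return [(a, b) for a in left for b in right
            if project_match_score(a, b) >= threshold]

# Tiny worked example with two hypothetical records of the same project.
crs = [{"start": date(2012, 1, 15), "end": date(2013, 6, 30),
        "amount": 100000, "donor": "Norway", "sector": "health"}]
amp = [{"start": date(2012, 2, 1), "end": date(2013, 7, 10),
        "amount": 95000, "donor": "norway", "sector": "Health"}]
print(candidate_matches(crs, amp))  # one candidate pair
```

As the group noted, a score like this only flags candidates; each proposed match would still need to be confirmed by the human eye.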

  • Nigerian Extractives Data
    We actually didn’t get a chance to dig into this data cluster in the workshop, but there is a lot of interest around it, and in the days following, we have come across even more data. We’re also talking with Open Oil and OKFN’s School of Data about organizing a more focused collaborative effort around this data cluster, to try and see what the data can tell us and what we are missing. For us, the big question will be if getting our hands dirty with this data cluster can tell us more about the opportunities and limits for data-driven advocacy around the extractive sector in other countries as well.

If we can get the ball rolling on these seemingly giant issues in a mere 3-hour session, I think there is a lot of room to make things easier for people wanting to use data. Granted, these are all drops in the proverbial bucket, but each of them promises significant potential: to make it easier to open data and to use data, to promote better open data, and to understand the limits and potentials of #open in specific data ecologies.

It is in any case a start, and one that many of us got excited about. If you get excited about any of these issues, and want to be put in touch with other people interested in working on them, give us a shout.
