In April, I attended the corruption data comparability workshop hosted by CIVICUS and the engine room. We spent a lot of the time discussing how to effectively mash up different corruption-related datasets, so I thought I would share a general process that I have found works well for identifying powerful mashups.
I’ve been working on software to help journalists and human rights groups uncover issues like corruption. In particular, I’ve been creating tools that make it easier to rapidly conduct complex analyses that detect specific problems and actionable details. A fair amount of manual analysis is still necessary, but the process below can be a good starting point for analyzing sets of public interest documents.
1. Make Documents Machine Readable
Combining documents is easiest if they are machine readable. In some cases, this means formatting documents as JSON, CSV, or XML files or Excel spreadsheets, or making data accessible via an API, but this is time-consuming for totally unstructured text documents. In those cases, the following techniques can recover some structure and machine readability.
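To make the target concrete, here is a minimal sketch of what that structured output might look like: a handful of extracted records (the field names and values are invented) written to both JSON and CSV with Python's standard library.

```python
import csv
import json

# Hypothetical records extracted from a set of scanned contracts.
records = [
    {"contractor": "Acme Ltd", "amount": 50000, "signed": "2013-04-02"},
    {"contractor": "Bravo SA", "amount": 125000, "signed": "2013-06-17"},
]

# Save as JSON for programmatic use...
with open("contracts.json", "w") as f:
    json.dump(records, f, indent=2)

# ...and as CSV for spreadsheet tools.
with open("contracts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["contractor", "amount", "signed"])
    writer.writeheader()
    writer.writerows(records)
```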
Extract Entities
Documents include entities like names, locations, and dates. When the people involved, the location where an event occurred, or the date something happened are of particular interest, extracting those entities from documents captures most of the necessary details for later comparison with other datasets. It is also possible to extract lists of set terms (to see which terms occur in a document) and terms matching certain patterns.
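As a concrete sketch, here is one way to pull entities out of a sentence with spaCy, a common open source NLP library, plus a regular expression for pattern-matched terms. The sample text and the contract-number pattern are invented for illustration.

```python
import re

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Minister Jane Doe met contractors in Nairobi on 12 March 2013.")

# Group extracted entities by type (PERSON, GPE, DATE, ...).
entities = {}
for ent in doc.ents:
    entities.setdefault(ent.label_, []).append(ent.text)
print(entities)

# Terms matching a pattern, e.g. hypothetical contract IDs like "CT-2013-0042".
contract_ids = re.findall(r"CT-\d{4}-\d{4}", doc.text)
```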
Extract Headings
Most so-called unstructured documents contain some structure in the form of headings. If these headings use a consistent font size or style, they can serve as standalone metadata or as field names for a structured dataset (where the paragraphs that follow are the values for that field).
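For example, here is a rough sketch for .docx files using the python-docx library, assuming the document uses Word's built-in heading styles (the filename is hypothetical):

```python
from docx import Document

doc = Document("report.docx")  # hypothetical file

# Collect the paragraphs that follow each heading.
structured = {}
current_heading = None
for para in doc.paragraphs:
    if para.style.name.startswith("Heading"):
        current_heading = para.text.strip()
        structured[current_heading] = []
    elif current_heading and para.text.strip():
        structured[current_heading].append(para.text.strip())

# Each heading becomes a field name; the following paragraphs are its value.
fields = {heading: "\n".join(paras) for heading, paras in structured.items()}
```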
Extract Metadata
When a file is created, details like the creation date, the user who created it, and the program used to create it are saved alongside the content. Some webpages also contain metadata fields. These details can be extracted and later compared with extracted entities or visualized.
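As an illustration, here is a small sketch that reads embedded metadata from a PDF with the pypdf library. Which fields are actually present varies from file to file, and the filename is hypothetical.

```python
from pypdf import PdfReader

reader = PdfReader("filing.pdf")  # hypothetical file
meta = reader.metadata

# Fields may be None if the producing software did not record them.
if meta:
    print("Author: ", meta.author)
    print("Creator:", meta.creator)        # program used to create the file
    print("Created:", meta.creation_date)
```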
Indexes
Indexes that record details like the number of times each word occurs in a document can be helpful for general content analysis. They can also be used to determine how similar two documents are or which phrases are most important.
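As a sketch, scikit-learn's TF-IDF tools can build such an index and compare two documents (the sample documents here are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The ministry awarded the road contract to Acme Ltd.",
    "Acme Ltd received the contract for the ministry road project.",
]

# Build a weighted term index over both documents.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Similarity of the two documents (1.0 = identical term profiles).
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# The highest-weighted terms in the first document hint at its key phrases.
weights = dict(zip(vectorizer.get_feature_names_out(), tfidf[0].toarray()[0]))
print(sorted(weights, key=weights.get, reverse=True)[:5])
```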
User-Defined Tags/Metadata
Sometimes people will be focused on information not captured by the extracted entities, metadata, headings, or general content of the document. In these cases, users can manually add structure by tagging documents. It may also be possible to build structured datasets out of the parts of a document that users highlight.
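There is no single standard format for this; one simple approach is to keep tags and highlighted snippets in a JSON file alongside the documents, as in this illustrative sketch (the schema and values are invented):

```python
import json

# User-applied tags and highlights, keyed by document filename.
annotations = {
    "filing-2013-04.pdf": {
        "tags": ["procurement", "follow-up"],
        "highlights": [
            {"page": 3, "text": "sole-source contract awarded without bids"},
        ],
    }
}

with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=2)
```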
2. Find Other Related Datasets
It is easiest to get useful and actionable findings by combining disparate sets of documents or data. These are some examples of common datasets that may be useful. In some cases, it might be necessary to use some of the techniques in step one to make the additional datasets machine readable.
Documents Meant for Release
These are datasets and documents that governments and institutions know they will have to release to the public from the start.
Votes/Legislation
Many legislative bodies release data on votes, proposed amendments, text of laws, meeting dates, etc.
Court Documents
Many courts release dockets and other court documents. These are just as important as legislative data because they contain details on how laws are applied and on statements made under oath.
Donations and Finances
Frequently some reporting of donations to political campaigns or non-profits is required. Similarly, governments often have to release detailed budgets. Like court documents, financial data reveals how organizations operate in practice. It can also be used to map influence networks and funding priorities.
Public Registries
Governments and other groups often maintain public registries of things like corporations, land ownership, and government contracts. These registries provide more details on who does what and where they do it.
Government Websites
Government agencies post lots of information and documents on their websites. This information might not be as formal or official as that in laws, court cases, or other reports, but it still provides important operational details.
Unofficial Documents/Info Not Meant for Release
These documents relate to governments and institutions but are released in an unofficial capacity, unexpectedly, or independently by external groups. Some of the most interesting mashups combine documents not initially meant for release with information that was intended to be publicly available from the start.
Social Media
Institutions are made up of people, and corruption ultimately comes down to individuals. Individuals have social media profiles on sites like LinkedIn, Twitter, and Facebook, which can be used to analyze what people are working on, who they know, and how organizations are structured. Beyond individuals, the content of many social media posts can also be correlated with events in other datasets.
News Articles
Journalists regularly report on events all around the world, and these articles contain valuable details. With the extraction tools described in the section above (for extracting dates, names, or locations, or seeing whether lists of terms are mentioned), data on these events as reported by journalists can be compared to information on the same events in other documents. In many cases news articles are written by journalists independently, but in others (when they are based on press releases or heavily influenced by governments) articles fall under the documents-meant-for-release category.
Freedom of Information Laws
Freedom of information laws require governments to release documents in response to requests. Like other government filings, responses to FOI requests contain details on the operation of government agencies and programs, but unlike other filings they were not always written with the knowledge that they would become public.
Leaked Documents
Information unexpectedly disclosed by whistleblowers or otherwise released without permission is incredibly valuable to compare against the publicly stated details on how institutions operate.
3. Combine the Documents
Once you have two (or more) datasets, it is time to combine them. While there are lots of different tools that can be used to mash up data by generating visualizations, running filters, analyzing key terms, and cross-referencing fields, most combinations serve one of only a few goals. These are listed below, starting with the ones that most frequently produce specific and actionable findings.
Uncovering Discrepancies
Documents can be combined to discover discrepancies: for example, comparing statements in a court case to statements in another document that should match. If those statements don't match, someone is probably lying somewhere or there is missing data. Discrepancies are powerful because they may indicate blatant wrongdoing.
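As a simplified sketch, assuming both datasets have already been reduced to records keyed by a shared case identifier (the identifiers and figures here are invented):

```python
# Payment amounts as stated in court records vs. in financial filings.
court = {"case-101": {"payment": 50000}, "case-102": {"payment": 12000}}
filings = {"case-101": {"payment": 50000}, "case-102": {"payment": 9000}}

for case_id, record in court.items():
    other = filings.get(case_id)
    if other is None:
        print(f"{case_id}: missing from filings")  # missing data
    elif other["payment"] != record["payment"]:
        # A mismatch worth investigating manually.
        print(f"{case_id}: {record['payment']} vs {other['payment']}")
```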
Matching/Correlation
Correlations and patterns are also potentially interesting. For example, it is possible to see whether fundraising influences legislators' actions by comparing details of fundraising events to legislative actions. Similarly, general patterns in court outcomes around certain topics could also be useful information. Matching and correlation might not indicate that something is outright wrong, but they can help show when an event or action is a little suspicious.
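Here is a minimal sketch of that kind of temporal matching, with invented names and dates and an assumed fourteen-day window of interest:

```python
from datetime import date, timedelta

# Invented sample data: (legislator, event date) and (legislator, vote date, action).
fundraisers = [("Sen. Smith", date(2013, 5, 1))]
votes = [("Sen. Smith", date(2013, 5, 8), "Yes on mining bill")]

WINDOW = timedelta(days=14)  # assumed window of interest, not a standard

# Flag votes that closely follow a fundraiser for the same legislator.
for name, f_date in fundraisers:
    for v_name, v_date, action in votes:
        if name == v_name and timedelta(0) <= (v_date - f_date) <= WINDOW:
            days = (v_date - f_date).days
            print(f"{name}: '{action}' came {days} days after a fundraiser")
```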
Getting More Data
One dataset can be used to get more data. For example, a list of NSA surveillance programs could be used as search terms to find LinkedIn profiles mentioning those programs, or a list of censored websites could be used to retrieve the text of those pages. This additional information becomes interesting with more analysis: making a network graph from the LinkedIn profiles that links people to surveillance programs, or running content analysis tools on the text of the websites to determine censorship agendas.
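A rough sketch of the term-matching step, with placeholder program names and profile text:

```python
programs = ["PROGRAM-A", "PROGRAM-B"]  # e.g. surveillance program names
profiles = {
    "analyst_1": "Worked on PROGRAM-A data pipelines from 2010 to 2012.",
    "analyst_2": "Database administration and reporting.",
}

# Build edges for a person-to-program network graph.
edges = [
    (person, term)
    for person, text in profiles.items()
    for term in programs
    if term in text
]
print(edges)  # [('analyst_1', 'PROGRAM-A')]
```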
Supplementing Info
Sometimes combining two datasets will not surface discrepancies, correlations, or new data, but it can still be helpful to match fields across the datasets so that all the details can be viewed in one place.
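For example, a pandas merge on a shared field puts the two datasets side by side (the column names and figures here are illustrative):

```python
import pandas as pd

contracts = pd.DataFrame({"company": ["Acme Ltd"], "contract_value": [50000]})
donations = pd.DataFrame({"company": ["Acme Ltd"], "donation_total": [2000]})

# One row per contract, with any matching donation details alongside.
combined = contracts.merge(donations, on="company", how="left")
print(combined)
```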
What’s the takeaway?
While some combination of filtering, cross-referencing, content analysis, and visualization tools can help automate this mashup process, some manual legwork will still be needed to obtain specific findings. However, careful comparison of document content and metadata can make it significantly easier to uncover actionable details about corruption.