Prying the lid off of government data

Open Up

Lead Image © GloriaSanchez, 123rf.com

Article from Issue 272/2023

The open data movement extends the ideals of open source to government data and, ultimately, all the world's knowledge.

For as long as governments have kept records, citizens have debated over access to those records. When the ancient Mesopotamians started recording data on clay tablets, I have no doubt that lively discussions erupted over who should or should not get to see those tablets. Information, after all, is power, and data is the first step on the path to information.

But data has a funny way of becoming more powerful with more eyes looking at it. Enlightened governments eventually came to understand that the whole society could benefit from giving access to outside viewers who might find insights. The rise of democratic governments, where the people (at least in theory) set the rules, and the concept of press freedom brought more attention to the need for public access, leading to the development of massive national archives, as well as university libraries, presidential libraries, and other portals that provided access to data acquired through government funding.

Then came the information age, and the volume of data, as well as the tools for accessing it, became so vast and complicated that the system quite lost its moorings. Back when data was primarily stored and communicated through paper (or clay tablets), the "format" as we use the term today was much less critical. When government data came to be stored electronically, format became a point of control, and we are still trying to work out what that means and what to do about it.

Governments and research institutions funded by governments were early users of computers. In those early days, development teams tended to work independently – often without a long-term view of how to prepare data for future access. Many government agencies didn't have the staff or budget to do their own programming and, instead, outsourced it to contractors. The contractors were well aware of the power of their position, and some became adept at positioning themselves as gatekeepers to the data. The government would commission a study, and the contractor or grant recipient would provide a final report. But the status of the data after that point was often ill-defined. In theory, it was government data, but the knowledge and tools for how to access it belonged to the contractor, and the only way to gain additional insights and value from the data was to pay the contractor for another study.

It took several years for elected officials to understand the full scope of this problem and to start to address it. If you think of source code as "data" describing how a program works, the Free Software movement was an early effort at addressing the problems of secrecy and too many gatekeepers in the software development industry. And certainly, the ideals of Free Software and Open Source software have had a strong influence on recent open data initiatives.

When the smoke cleared from the first computer revolution, officials gradually began to see that public interest was not well served by throwing government money into data that was stored in closed formats. In 2003, the European Union approved the Directive on the Re-Use of Public Sector Information [1], which came to be known as the PSI Directive. The PSI Directive laid out a legislative framework for member states to establish policies that promote data re-use, stating that public information should be "open by design and by default." The PSI Directive has been amended twice, most recently in 2019, and it is now called the Open Data Directive [2]. The PSI and Open Data directives stake out the position that keeping data in a reusable state has value, regardless of whether a predefined plan exists for how and when it will be reused.

In the UK, the Open Knowledge Foundation (OKF) [3] launched in 2004 to provide services, tools, training, and community campaigns in support of open knowledge (Figure 1). One of the first accomplishments of the OKF was the Open Definition, a proposed reference for what it means to be "open." The full Open Definition is available online with later amendments [4]. The OKF summarizes the definition as follows: "Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness)."

Figure 1: The Open Knowledge Foundation was an early advocate for open data and creator of the Open Definition.

Similar forces were at play across the ocean, although in the US, with its long history of skepticism about government, it took a little longer to raise awareness. In 2007, a group of 30 visionaries met in Sebastopol, California, for a summit that would lead to the launch of the Open Government Data initiative [5]. Unsurprisingly, many in the group were familiar figures in FOSS circles, including Tim O'Reilly, founder of O'Reilly Media, and Stanford Law Professor Lawrence Lessig, founder of Creative Commons. One of the youngest members of the group was the late Aaron Swartz, a FOSS and free data activist and co-creator of RSS. The Sebastopol group established eight principles for defining when government data is "open":

  • complete – all public data "that is not subject to valid privacy, security, or privilege limitations" is made available, without missing pieces that could limit its usefulness or obscure its meaning.
  • primary – data is saved as collected at the source, not aggregated or processed into a modified form.
  • timely – data is made available as quickly as possible to preserve its potential value.
  • accessible – data is available on the Internet and stored in a form that promotes the widest possible range of uses.
  • machine processable – data should be "reasonably structured" to allow automated processing.
  • non-discriminatory – anyone can access the data, with no requirement for registration.
  • non-proprietary – data should be stored in a format "over which no entity has exclusive control."
  • license-free – data is not subject to any "copyright, patent, trademark, or trade secret regulation."
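Two of these principles – "machine processable" and "non-proprietary" – are the easiest to demonstrate in code. Data published as plain CSV can be read by the standard library of virtually any language, and once it is reasonably structured, automated processing is trivial. A minimal sketch (the weather dataset here is invented for illustration):

```python
import csv
import io

# Hypothetical open dataset published as plain CSV -- a non-proprietary
# format that no single vendor controls.
raw = """station,year,mean_temp_c
Alpha,2021,9.4
Alpha,2022,9.8
Beta,2021,11.2
Beta,2022,11.5
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# "Machine processable": because the data is reasonably structured,
# aggregating it takes only a few lines.
by_station = {}
for row in rows:
    by_station.setdefault(row["station"], []).append(float(row["mean_temp_c"]))

averages = {s: round(sum(v) / len(v), 2) for s, v in by_station.items()}
print(averages)
```

The same file would be just as easy to process in R, JavaScript, or a spreadsheet – which is exactly the point of the non-proprietary principle.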

The Open Definition of the OKF and the eight principles of the Open Government Data initiative were important for establishing a clear vision for what the term "open data" meant.

In 2013, President Barack Obama signed an executive order making open and machine-readable data the default for government data sets [6]. One of the initiatives launched during the Obama administration was Project Open Data [7], a collection of tools and resources for promoting open data. Project Open Data was replaced by the online repository resources.data.gov (Figure 2) with the passage of the Open Government Data Act in 2019. As of this writing, the US government has posted more than 255,000 open datasets to the data.gov website [8].
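The data.gov catalog runs on CKAN, so its datasets can be queried programmatically through CKAN's JSON Action API (the package_search action returns a list of matching datasets). The sketch below parses the shape of such a response; the response body is abbreviated, hand-written sample data with invented titles and URLs, not a live result:

```python
import json

# Abbreviated, hand-written sample of a CKAN package_search response.
# The real API returns the same top-level envelope: a "success" flag
# and a "result" object containing "count" and "results".
sample = json.loads("""
{
  "success": true,
  "result": {
    "count": 2,
    "results": [
      {"title": "Air Quality Measures",
       "resources": [{"format": "CSV", "url": "https://example.gov/air.csv"}]},
      {"title": "Electric Vehicle Population",
       "resources": [{"format": "JSON", "url": "https://example.gov/ev.json"}]}
    ]
  }
}
""")

if sample["success"]:
    for dataset in sample["result"]["results"]:
        # Each dataset lists one or more downloadable resources,
        # each tagged with its file format.
        formats = sorted({r["format"] for r in dataset["resources"]})
        print(dataset["title"], formats)
```

In practice you would fetch the response over HTTP from the catalog's API endpoint rather than embedding it as a string.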

Figure 2: resources.data.gov is the US government's repository for open data tools and resources.

The Details: What Is It Really?

In one sense, pursuing open data policies doesn't really cost more money once you get organized (and it might cost less in the long run). In another sense, it does take a certain amount of energy and diligence to ensure that things are done in an open way. Government directives and executive orders are useful for ensuring that project managers and bureaucrats stay focused on the goal. Down at the level of implementation, the open data movement depends primarily on three things:

  • open formats
  • specifications and contract language for requiring open data practices
  • tools for supporting and maintaining open data

A number of tools exist for facilitating open data. Many of these tools are focused on bringing existing data into a form that is easy to publish on the web. The tools range from full-scale data management systems to special-purpose data transformation utilities. Project Open Data, for instance, included tools for converting a database or a CSV file into a RESTful API or directly into HTML (Figure 3). These kinds of tools make it easy to take data stored in whatever form it was originally kept and make it available as open data through a web portal. CKAN, which you'll learn more about later in this issue, is an open source data management tool created by the OKF and used throughout the world for publishing and aggregating open data.
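The core transformation these conversion utilities perform – turning tabular legacy data into the JSON records a RESTful endpoint would serve – is small enough to sketch. This is an illustrative stand-in, not the actual Project Open Data tooling; the column names and values are invented:

```python
import csv
import io
import json

def csv_to_records(csv_text):
    """Convert CSV text into a list of JSON-ready dicts -- the kind of
    transformation that open data publishing tools automate."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical legacy export; a real tool would read from a file
# or database instead of an inline string.
legacy = "id,agency,budget\n1,Parks,120000\n2,Transit,450000\n"

records = csv_to_records(legacy)

# The resulting JSON is what an API endpoint would return for
# a request like GET /datasets/budgets.
print(json.dumps(records, indent=2))
```

Note that csv.DictReader leaves every value as a string; a production converter would also apply a schema to type the columns.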

Figure 3: Project Open Data developed a large collection of tools for cataloging and converting data. Many of these tools have been replaced by newer tools at resources.data.gov.

Where It Stands

Governments have made much progress on open data in the past 20 years, but the quest continues. The EU Open Data Directive, for instance, is quite promising, but the Directive itself does not really have the force of law and is, instead, a directive to the member states to pass their own laws facilitating open data. According to Wikipedia, almost half the EU member states were still out of compliance with the Open Data Directive as of July 2022.

In the US, executive orders are not exactly like laws either and are subject to changes and reinterpretation by future administrations. A law like the Open Government Data Act has more teeth than an executive order; however, the law only applies at the federal level. States, localities, and research projects that do not use federal dollars can set their own rules, and different states have varying levels of support for open data. It is worth noting, however, that even in today's hyper-partisan political culture, open government data does have bipartisan support. The Open Government Data Act had four sponsors – two Republicans and two Democrats.

Conclusion

Open data is a big topic that touches on laws, privacy, and politics, as well as programming, data storage, and web development. We can't cover all of it, but we'll give you a look at some things going on right now that are advancing the cause of open data. We'll introduce you to the CKAN data management system, and we'll talk to Lisa Allen of the Open Data Institute.

The topic of open data, as it is commonly used today, applies to government data, but we are all well aware that control of other forms of data is also a problem. I don't think anyone would want private user data to be "open" in the sense that anyone could access it; however, many users at least want access to their own data in order to migrate it to other platforms. The need for more control over personal social media data has led to the rise of open source platforms such as Mastodon and Diaspora. But big tech vendors like Google and Microsoft have also heard the call for easier migration. Also in this month's issue, we take a look at the Data Transfer Project, an effort to arrive at an open format for migrating data between proprietary social media platforms.
