The Information Machine is a short film written, produced and directed by Charles and Ray Eames for the IBM Pavilion at the 1958 Brussels World’s Fair. Animation by John Whitney. Music by Elmer Bernstein. The topic is primarily the computer in the context of human development, but I think it also represents our fascination with, and need for, collecting and organizing data and abstracting the world around us. Since it was written in 1958 it does go on about he, his, him, man and men ad nauseam, but it nonetheless remains a cute, informative short film in the public domain, captured in the Internet Archive, and it represents ideas as relevant to us today as they were then!

via: Information Aesthetics 

Jon Udell writes about the outside edge of what’s happening on the web (including lawnmowers), but his focus is often as much on how regular, non-digerati people might be helped by new changes and technologies. Formerly the blogger-in-chief at InfoWorld, he’s now working with Microsoft. He’s been writing recently about public data, and I wanted to find out why.

1. you seem to be spending much time recently writing about access to public data … why is that?

I’ve always thought the real purpose of information technology was to harness our collective intelligence to tackle complex and pressing problems. When I heard Doug Engelbart’s talk at the 2004 Accelerating Change conference, I realized for the first time how all of his work points toward that one goal. Graphical user interfaces, networks, hyperlinked webs of information — for him, these are all means by which we “augment” our human capabilities so we can have some hope of dealing with the challenges we face as a species.

In that context, getting data into shared information spaces is just part of the story. We’ll also need to be able to share the tools we use to analyze and interpret the data, and the conversations we have about the analysis and interpretation.

2. what do you think is the most compelling argument for making public data available to citizens?

Well, it’s ours; our taxes paid for it, so we should have it. But the compelling reason is that we need more eyeballs, hands, and brains figuring out what’s going on in the world, so that when we debate courses of action we can ground our thinking in the best facts and interpretations.

3. are you convinced by any arguments *against* making public data available to citizens?

Here’s an argument I don’t buy: That amateur analysts will do more harm than good. I don’t buy it because there will be checks and balances. Those who don’t cite data will be laughed at. Those who do cite data but interpret it incorrectly will be corrected. Those who do great work will develop reputations that are discoverable and measurable.

Here’s an argument I do buy: There’s the risk of violating privacy. The District of Columbia, for example, has released a lot of data but has postponed releasing adult arrests and charges until the location information can be aggregated. We will increasingly have to make these kinds of calls.

5. public data is an issue that most people will have trouble getting excited about. how do you think “data activists” should approach it?

The best advice I’ve heard comes from Tom Steinberg, founder of MySociety.org. He counsels activists to use data in ways that matter directly to people. Suppose you could get geographic data on planned highway routes, for example. Nobody cares, until you connect the dots and show people their houses will have to be bulldozed to make way for it. Then they really care.

6. in your experience with government officials, how have *they* reacted to your requests for data?

When I started asking my local police department for crime data, they stonewalled. Eventually I had to get a lawyer to write them a letter citing our state’s ‘Right to Know’ act, and we were both unhappy about having to do that.

But once I met with the police chief and explained my interest in exploring both local patterns and this whole general process, he was OK with that. Better than OK, actually. I think he was relieved when he saw that some questions people have been speculating about might now be discussed in a more rational way. And he’s really excited by the prospect of geographical analysis because they haven’t had that capability.

8. what do you think are the connections between open access to public data and other similar movements – free culture, free software etc?

There’s an arc that runs from free and open-source software, to open data, to Web 2.0-style participation, and now to the collaborative use of software, services, and public data in order to understand and influence public policy.

9. with your crystal ball, where do you think the confluence of these movements will take us in, say, 5 years?

I’m sure it won’t happen that soon, but here’s what I’d like to see. Imagine some local, state, or national debate. The facts and interpretations at issue are rarely attached to URLs, much less to primary sources of data at those URLs and to interactive visualizations of the data. We spend lots of time arguing about facts and interpretations, but mostly in a vacuum with no real shared context, which is wildly unproductive. If we could establish shared context, maybe we could argue more productively, and get more stuff done more quickly and more sanely.

I have been paying attention to infrastructures lately. More recently, I seem to be coming across more stories about infrastructure failures: submarine cables in Asia, or in the Ring of Fire. The most recent failure was in Minneapolis. Today’s Globe and Mail online has a story that links to some AP video data, and this one in particular – U.S. Infrastructure under scrutiny – does a good review of how engineers gather their primary data, the nature of that data, and the making of safety reports. Seems like those reports get shelved a lot! William Ibbs from UC Berkeley, an expert on construction risk, said it well with a knowing smirk on his face:

well, ah, we’ve had had ah maybe some other social priorities for the past few years in the nation and public works have taken, ah, a bit of a back seat.

The map below shows the distribution of deficient bridges in the US. I thought I was hearing more stories, and this data seems to support that my assumptions were not entirely off base!

[Map: Bridges US]

Then I wondered about Canada, so I did some superficial digging and found the following report – The Age of Public Infrastructure, produced by Statistics Canada. The great thing about all of their reports is that you can access their methodology documents, data sources and contacts, which is great educational material for amateur data geeks who want to collect data themselves and want to find a systematic and statistically sound way to do so. I also found an Infrastructure Canada report that discusses the Government’s Infrastructure Assets and their management. The collapse in Minneapolis created a media context and receptivity on the subject, as seen here – Canada’s infrastructure needs urgent attention – while some specialized think tanks look at particular infrastructures related to investment and stock prices in the energy industry – Aging Energy Infrastructure Could Drive Molybdenum Demand Higher – which is loaded with data particular to engineers in that field.

Why talk about that here? Well, mostly because infrastructure is a boring thing that we rarely think about, yet there is a ton of citizen money locked into these huge material, physical artefacts; also because there is little citizen-generated data on the topic, and the data that are available, or the decisions that are being made, rarely have a price tag or the name of the responsible agent attached to them! Yet without infrastructure we cannot function! Infrastructure is what distinguishes a good city to live in from a not-so-good city to live in, and infrastructure is an inseparable part of our human habitat.

Imagine a concerted effort by citizens to collect data about satellite dishes, receiving ground stations, server farms, ISP offices, aging bridges, cool sewers, the complete cycle of one’s local water purification plant, the telephone switching station, where one’s poo goes once flushed, where one’s data is stored – and then sharing and visualizing all that data on a map. We are starting to see some really interesting adventurer/art urban exploration projects, and some boyz navigating the 3D elements of a city’s hardware in parkour. I love stuff like this Pothole reporter. Could we develop collaborative tools to report missing manhole covers, Ottawa’s thriving roadside ragweed cultivations, where the public washrooms and water fountains are (and are not), Montreal’s missing trees in sidewalk planters (Michael‘s idea on location portal content gathering), and so on?
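Purely as a thought experiment, here is a minimal sketch (in Python; the reports, coordinates and field names are all made up) of how a handful of citizen reports – potholes, missing manhole covers, broken water fountains – could be turned into a GeoJSON layer ready to drop onto a web map:

```python
import json

# Hypothetical citizen reports: a category, a short note, and a lat/lon pair.
# In a real collaborative tool these would come from a web form or a phone app.
reports = [
    {"category": "pothole", "note": "deep pothole, east lane", "lat": 45.4215, "lon": -75.6972},
    {"category": "manhole", "note": "missing cover near school", "lat": 45.4290, "lon": -75.6890},
    {"category": "fountain", "note": "public fountain not working", "lat": 45.4119, "lon": -75.6981},
]

def reports_to_geojson(reports):
    """Convert simple report dicts into a GeoJSON FeatureCollection."""
    features = []
    for r in reports:
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
            "properties": {"category": r["category"], "note": r["note"]},
        })
    return {"type": "FeatureCollection", "features": features}

# Write the layer out so any GeoJSON-aware mapping tool can render it.
with open("citizen_reports.geojson", "w") as f:
    json.dump(reports_to_geojson(reports), f, indent=2)
```

Any mapping tool that can read GeoJSON could then render the points, which is really all a collaborative “pothole reporter” needs to get started.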

  1. Accessing literature,
  2. obtaining materials,
  3. and sharing data.

Science is a collaborative endeavour and these 3 roadblocks are impeding scientific discovery according to John Wilbanks, executive director of the Science Commons initiative, founder of the Semantic Web for Life Sciences project and the Neurocommons.

The good folks at freeourdata.org.uk (one of the major inspirations for this blog) met with the UK’s Minister of Information, Michael Wills. The whole interview is interesting, of course, but just the opening remarks from Michael Wills show a remarkable openness to the idea of freed data:

Personally I’m very excited by this area, I asked to do this as part of my portfolio… The whole issue of data is I think tremendously exciting for all the reasons that you’ve said, it’s part of the infrastructure now of our society and our economy and it’s going to become more so with what’s happening with data mashing, the extraordinary intellectual creative energy that’s being unleashed is something that as a government we have to respond to, and the power of information you know is a very exciting document, something that I think is very much where government wants to be.


I met with Wendy Watkins at the Carleton University Data Library yesterday. She is one of the founders and current co-chair of DLI and CAPDU (Canadian Association of Public Data Users), a member of the governing council of the International Association for Social Science Information Services and Technology (IASSIST) and a great advocate for data accessibility and whatever else you can think of in relation to data.

Wendy introduced me to a very interesting project that is happening between and among university libraries in Ontario called the Ontario Data Documentation, Extraction Service Infrastructure Initiative (ODESI). ODESI will make discovery, access and integration of social science data from a variety of databases much easier.

Administration of the Project:

The Carleton University Data Library, in cooperation with the University of Guelph. The portal will be hosted at the Scholars’ Portal at the University of Toronto, which makes online journal discovery and access a dream. The project is partially funded by the Ontario Council of University Libraries (OCUL) and OntarioBuys, operated out of the Ontario Ministry of Finance. It is a 3-year project with $1,040,000 in funding.

How it works:

ODESI operates on a distributed data access model, where servers that host data from a variety of organizations will be accessed via the Scholars’ Portal. The metadata are written in the DDI standard, which is expressed in XML. DDI is the

Data Documentation Initiative [which] is an international effort to establish a standard for technical documentation describing social science data. A membership-based Alliance is developing the DDI specification, which is written in XML.

The standard has been adopted by several international organizations such as IASSIST, the Interuniversity Consortium for Political and Social Research (ICPSR) and the Council of European Social Science Data Archives (CESSDA), and by several governmental departments including Statistics Canada, Health Canada and HRSDC.
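To make the XML part a bit more concrete, here is a rough sketch in Python of pulling a study title and variable list out of a codebook. Note that the document below is an invented, drastically simplified “DDI-like” structure for illustration only; real DDI codebooks use their own element names, namespaces and much richer content:

```python
import xml.etree.ElementTree as ET

# A toy, simplified "DDI-like" codebook, invented here for illustration.
ddi_xml = """
<codeBook>
  <study>
    <title>Hypothetical Household Survey, 2006</title>
  </study>
  <variables>
    <var name="age" label="Age of respondent"/>
    <var name="income" label="Total household income"/>
  </variables>
</codeBook>
"""

root = ET.fromstring(ddi_xml)
print("Study:", root.findtext("study/title"))
for var in root.findall("variables/var"):
    # Each variable carries a machine name and a human-readable label.
    print(f"  {var.get('name')}: {var.get('label')}")
```

The point is simply that once the documentation is structured this way, a portal can index titles, variables and keywords across many collections without anyone reading the codebooks by hand.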

Collaboration:

This project will integrate with and is based on the existing and fully operational Council of European Social Science Data Archives (CESSDA), which is a cross-boundary data initiative. CESSDA

promotes the acquisition, archiving and distribution of electronic data for social science teaching and research in Europe. It encourages the exchange of data and technology and fosters the development of new organisations in sympathy with its aims. It associates and cooperates with other international organisations sharing similar objectives.

The CESSDA Trans-Border Agreement and Constitution are very interesting models of collaboration. CESSDA is the governing body of a group of national European social science data archives. The CESSDA data portal is accompanied by a multilingual thesaurus; currently 13 nations and 20 organizations are involved, and data from thousands of studies are made available to students, faculty and researchers at participating institutions. The portal search mechanism is quite effective, although not pretty!

In addition, CESSDA is associated with a series of national data archives. Wow – Canada does not have a data archive!

Users:

Users would come to the portal, search across the various servers on the metadata fields, and access the data. Additionally, users will be provided with some tools to integrate myriad data sets and conduct analyses with the statistical tools that are part of the service. For some of the data, basic thematic maps can also be made.
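Behind the scenes, that kind of federated search over metadata fields might look something like the sketch below (Python; the server names, study titles and record shape are all invented here, not ODESI’s actual interfaces):

```python
# Each "server" is represented here by an in-memory catalogue of study metadata;
# in the real portal these would be separate institutions' servers queried over the network.
CATALOGUES = {
    "server-a.example.ca": [
        {"title": "Labour Force Survey (sample)", "keywords": ["employment", "labour"]},
        {"title": "Household Spending Survey", "keywords": ["income", "spending"]},
    ],
    "server-b.example.ca": [
        {"title": "General Social Survey (sample)", "keywords": ["time use", "income"]},
    ],
}

def federated_search(term):
    """Search every catalogue on the title and keyword metadata fields and merge the hits."""
    term = term.lower()
    hits = []
    for server, studies in CATALOGUES.items():
        for study in studies:
            text = study["title"].lower() + " " + " ".join(study["keywords"])
            if term in text:
                hits.append({"server": server, "title": study["title"]})
    return hits

for hit in federated_search("income"):
    print(f"{hit['title']}  (held at {hit['server']})")
```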

Eventually the discovery tools will be integrated with the journal search tools of the Scholars’ Portal. You will be able to search for data and find the journals that have used that data, or vice versa: find the journal and then the data. This will hugely simplify the search and integration process of data analysis. At the moment, any data-intensive research endeavour or data-based project needs to dedicate 80-95% of the job to finding the data in a bunch of different databases, navigating the complex licensing and access regimes, maybe paying a large sum of money, and organizing the data in such a way that it is statistically accurate before making those comparisons. Eventually one gets to talk about results!

Data Access:

Both the CESSDA data portal project and ODESI are groundbreaking initiatives that are making data accessible to the research community. These data, however, will only be available to students, faculty and researchers at participating institutions. Citizens who do not fall into those categories can only search the metadata elements and see what is available, but will not get access to the data.

Comment:

It is promising that a social and physical infrastructure exists to make data discoverable and accessible between and among national and international institutions. What is needed is a massive cultural shift in our social science data creating and managing institutions that would make them amenable to creating policies to unlock these same public data assets, along with some of the private sector data assets (polls, etc.), and to make them freely (as in no cost) available to all citizens.

More interesting stuff from Jon Udell, this time taking some climate data for his area, using the Many Eyes platform and trying to see what has been happening in New Hampshire in the last century.
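For anyone who wants to try a similar exercise by hand, here is a rough sketch of the basic arithmetic (Python; the station data below are invented stand-ins, not the actual New Hampshire records Udell used): read a small table of annual mean temperatures and fit an ordinary least-squares trend line.

```python
import csv, io

# Invented stand-in data: annual mean temperatures (°C) for a hypothetical NH station.
raw = """year,mean_temp_c
1900,6.1
1930,6.3
1960,6.2
1990,6.6
2000,6.8
"""

rows = [(int(r["year"]), float(r["mean_temp_c"]))
        for r in csv.DictReader(io.StringIO(raw))]

# Ordinary least-squares slope: degrees of warming (or cooling) per year.
n = len(rows)
mean_x = sum(y for y, _ in rows) / n
mean_y = sum(t for _, t in rows) / n
slope = (sum((y - mean_x) * (t - mean_y) for y, t in rows)
         / sum((y - mean_x) ** 2 for y, _ in rows))

print(f"Trend: {slope * 100:.2f} °C per century")
```

Of course, as the comment thread below makes clear, a naive trend line glosses over exactly the issues (station moves, homogenization, and so on) that the experts worry about.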

The experiment is inconclusive, but there is an excellent debate in the comment thread about the problems with amateurs getting their hands on the data – and the hash they can make of things because they are not experts.

Says one commenter (Brendan Lane Larson, Meteorologist, Weather Informaticist and Member of the American Meteorological Society):

Your vague “we” combined with the demonstration of the Many Eyes site trivializes the process of evidence exploration and collaborative interpretation (community of practice? peer review?) with an American 1960s hippy-like grandiose dream of democratization of visualized data that doesn’t need to be democratized in the first place. Did you read the web page at the URI that Bob Drake posted in comments herein? Do you really think that a collective vague “we” is going to take the time to read and understand (or have enough background to understand) the processes presented on that page such as “homogenization algorithms” and what these algorithms mean generally and specifically?

To which Udell replies:

I really do think that the gap between what science does and what the media says (and what most people understand) about what science does can be significantly narrowed by making the data behind the science, and the interpretation of that data, and the conversations about the interpretations, a lot more accessible.

To turn the question around, do you think we can, as a democratic society, make the kinds of policy decisions we need to make — on a range of issues — without narrowing that gap?

There is much to be said about this … but Larson’s comment “Do you really think that a collective vague “we” is going to take the time to read and understand (or have enough background to understand) the … XYZ…” is the same question that has been asked countless times about all sorts of open approaches (from making software, to encyclopaedias, to news commentary). And the answer in general is “yes.” That is, not every member of the vague “we” will take the time, but very often, with issues of enough importance, many members of the vague “we” can and do take the time to understand, and might just do a better job of demonstrating, interpreting or contextualizing data in ways that other members of the vague “we” can connect with and understand.

The other side of the coin, of course, is that along with the good amateur stuff there is always much dross – data folk are legitimately worried about an uneducated public getting their hands on data and making all sorts of errors with it – which of course is not a good thing. But, I would argue, the potential gains from an open approach to data outweigh the potential problems.

UPDATE: a good addition to the discussion from Mike Caulfield.

The Canadian Recording Industry Association (CRIA) releases all kinds of data related to sales.  It is also an organization that has quite a bit of power with the Canadian Government.

Michael Geist has an interesting piece on interpreting CRIA sales data!  It is an industry I know very little about and I would probably have just accepted their reported numbers as I would not have had the contextual knowledge to frame what they were saying otherwise!

Numbers are tricky rascals at best! Especially when an industry is trying to lobby for its own interests, and at times politicians just believe any ole number thrown at them! Worse, the wrong numbers, or numbers out of context, get picked up by newswires and get repeated ad nauseam! It just depends whose ear a particular industry has, I guess, and how much homework a reporter does.

Quality Repositories is a website that comes out of a stats (?) course at the University of Maryland. It aims to evaluate the usefulness and availability of various sources of public data, from US government, non-US government, academic, and sports-related (?) data sets. Evaluations are based on criteria such as online availability, browsability, searchability, retrievable formats, etc. From the about text:

Data repositories provide a valuable resource for the public; however, the lack of standards in terminology, presentation, and access of this data across repositories reduces the accessibility and usability of these important data sets. This problem is complex and likely requires a community effort to identify what makes a “good” repository, both in technical and information terms. This site provides a starting point for this discussion….

This site suggests criteria for evaluating repositories and applies them to a list of statistical repositories. We’ve selected statistical data because it is one of the simplest data types to access and describe. Since our purpose is partly to encourage visualization tools, statistical data is also one of the easiest to visualize. The list is not comprehensive but should grow over time. By “repositories” we mean a site that provides access to multiple tables of data that they have collected. We did not include sites that linked to other site’s data sources.
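Just to illustrate how such criteria could be applied as a simple scorecard, here is a toy sketch (Python; the repository names and scores are invented, not the site’s actual evaluations):

```python
# Invented example scorecard: each repository scored 0/1 on a few of the
# kinds of criteria the site proposes (availability, browsability, searchability, formats).
CRITERIA = ["online_availability", "browsability", "searchability", "retrievable_formats"]

repositories = {
    "Example National Stats Agency": {"online_availability": 1, "browsability": 1,
                                      "searchability": 1, "retrievable_formats": 0},
    "Example Municipal Portal":      {"online_availability": 1, "browsability": 0,
                                      "searchability": 0, "retrievable_formats": 1},
}

for name, scores in repositories.items():
    total = sum(scores.get(c, 0) for c in CRITERIA)
    print(f"{name}: {total}/{len(CRITERIA)} criteria met")
```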

The site was created by Rachael Bradley, Samah Ramadan and Ben Shneiderman.

(Tip to Jon Udell and http://del.icio.us/tag/publicdata)

One of the great data myths is that cost recovery policies are synonymous with higher data quality. Often the myth-making stems from effective communications from nations with heavy cost recovery policies, such as the UK, which often argue that their data are of better quality than those of the US, which has open access policies. Canada, depending on the data and the agencies they come from, is at either end of this spectrum and often in between.

I just read an interesting study that examined open access versus cost recovery for two framework datasets. The researchers looked at the technical characteristics and use of datasets from jurisdictions of similar socio-economic status, size, population density, and government type (the Netherlands, Denmark, the German state of North Rhine-Westphalia, the US state of Massachusetts and the US metropolitan region of Minneapolis-St. Paul). The study compared parcel and large-scale topographic datasets typically found as framework datasets in geospatial data infrastructures (see SDI def. page 8). Some of these datasets were free, some were extremely expensive, and all were under different licensing regimes that defined use. The researchers looked at both technical characteristics (e.g. data quality, metadata, coverage, etc.) and non-technical characteristics (e.g. legal access, financial access, acquisition procedures, etc.).

For parcel datasets, the study discovered that datasets assembled by a centralized authority were judged to be technically more advanced; those that required assembly from multiple jurisdictions but had standards or a central institution integrating them were of higher quality; and those from multiple jurisdictions without standards were of poor quality, as the sets were not harmonized and/or coverage was inconsistent. Regarding non-technical characteristics, many datasets came at a high cost, most were not easy to access from one location, and there were a variety of access and use restrictions on the data.

For topographic information, the technical averages were less than ideal, while on the non-technical criteria access was impeded in some cases by the involvement of utilities (which tend toward cost recovery); in other cases multiple jurisdictions – over 50 for some – needed to be contacted to acquire complete coverage, and in some cases coverage is simply not complete.

The study’s hypothesis was:

that technically excellent datasets have restrictive-access policies and technically poor datasets have open access policies.

General conclusion:

All five jurisdictions had significant levels of primary and secondary uses but few value-adding activities, possibly because of restrictive-access and cost-recovery policies.

Specific Results:

The case studies yielded conflicting findings. We identified several technically advanced datasets with less advanced non-technical characteristics…We also identified technically insufficient datasets with restrictive-access policies…Thus cost recovery does not necessarily signify excellent quality.

Although the links between access policy and use and between quality and use are apparent, we did not find convincing evidence for a direct relation between the access policy and the quality of a dataset.

Conclusion:

The institutional setting of a jurisdiction affects the way data collection is organized (e.g. centralized versus decentralized control), the extent to which data collection and processing are incorporated in legislation, and the extent to which legislation requires use within government.

…We found a direct link between institutional setting and the characteristics of the datasets.

In jurisdictions where information collection was centralized in a single public organization, datasets (and access policies) were more homogenous than datasets that were not controlled centrally (such as those of local governments). Ensuring that data are prepared to a single consistent specification is more easily done by one organization than by many.

…The institutional setting can affect access policy, accessibility, technical quality, and consequently, the type and number of users.

My Observations:
It is really difficult to find solid studies like this one that systematically look at both the technical and access issues related to data. It is easy to find off-the-cuff statements without sufficient backup proof, though! While these studies are a bit of a dry read, they demonstrate the complexities of the issues, try to tease out the truth, and reveal that there is no one-stop shopping for data at any given scale in any country. In other words, there is merit in pushing for some sort of centralized, standardized and interoperable way – which could also mean distributed – to discover and access public data assets. In addition, there is an argument to be made for making those data freely (no cost) accessible in formats we can readily use and reuse. This of course includes standardizing licensing policies!

Reference: Institutions Matter: The Impact of Institutional Choices Relative to Access Policy and Data Quality on the Development of Geographic Information Infrastructures, by Van Loenen and De Jong, in Research and Theory in Advancing Data Infrastructure Concepts, edited by Harlan Onsrud, ESRI Press, 2007.

If you have references to more studies send them along!
