This Thursday (11th March) I’m speaking at the Forum Virium’s Open Up the City event in Helsinki.

This year their focus is on “open data, design, interfaces and innovation” and I’m speaking under the title “Open Data: What, Why, How?”.

This Wednesday (27th of January) at 1pm I’m giving one of Cambridge University Library’s regular lunch-time talks on Openness and Libraries. Attendance is free and anyone can come along!

Update (28th Jan): talk is done and slides are now up.

Blurb

Over the past few years, open licensing (http://www.opendefinition.org/) has facilitated the explosive growth of a ‘knowledge commons’. To give a few prominent examples: Open Access journals, Open Educational Resources and Open Data in scientific research have all been enabled by licenses which permit material to be freely re-used and re-distributed. This outpouring of support for openness has led to an incredible rise in community-led development and innovative uses.

Bibliographic records are a key part of our shared cultural heritage and essential to anyone working with cultural materials (books, music, films etc). Opening up those records for access and re-use offers a variety of benefits.

First, it would allow libraries to share records more efficiently and improve quality more rapidly through better, easier feedback. Second, easier access to catalogue data would spur development of the multifarious services, technologies and research that use that data, including, for example, search engines, book or music websites, researchers working on information production, journalists writing on orphan works, as well as many other areas we cannot even imagine in advance.

With a growing number of Government agencies and public institutions making data open, is it now time for the library community to do likewise?

Open Notebook Social Science

October 22nd, 2009

The other day I posted up some work-in-progress on the subject of patterns of knowledge production.

That material is still in a fairly preliminary state. However, releasing it in this form was a conscious decision and part of an ongoing attempt on my part to practice a more open “release early, release often” approach to research.

In doing this I’m drawing direct inspiration from the open source and open notebook (science) communities and seeking to engage in what might be termed open notebook social science!

I think most researchers (including myself) feel a reluctance to put out material that isn’t at a reasonable level of maturity. While there are some good reasons for this, I think the main motivations are less positive, and are primarily to do with fear: be it of criticism or that your ideas are “taken” by others. While such fears can have some basis, it seems to me the benefits of an open approach — in terms of visibility, dissemination, and potential for collaboration — significantly outweigh any of the associated risks.

Over the last year, I’ve already been making some effort to move in this direction but from this point on I’m aiming to do this more thoroughly and methodically. A first step in this will be to put all the “patterns” and data online.

The Open Knowledge Foundation’s 2009 Open Knowledge Conference (OKCon), which I help organize, will take place next Saturday 28th March – less than a week away.

Full details including programme can be found either in this blog post or on the OKCon home page.

As usual this will be a fun and informal day so if you’re free this Saturday and interested in “Open” stuff come along to UCL and take part.

I should also add that on the two days before (Thursday + Friday) there is the 5th COMMUNIA Workshop, on Accessing, Using, Reusing Public Sector Content and Data. It is being co-organized by the Open Knowledge Foundation together with the London School of Economics and takes place at LSE (all thanks to the tireless work of Jonathan Gray and Prodromos Tsiavos!).

As a member of the Econometric Society I received the following announcement yesterday:

The Council and the Fellowship of the Econometric Society have both voted in favor of a plan for the Society to publish two open-access journals: Quantitative Economics (QE) and Theoretical Economics (TE). All voting Council members were in favor of the proposal. Among the active Fellows, 277 (66.4% of the total) cast their ballots, with 240 votes (86.6%) in favor, 30 (10.8%) against, and 7 (2.5%) abstentions. An announcement together with a description of the new journals may be found in http://www.econometricsociety.org/news1.asp?ref=81 .

QE will be started from scratch and its first issue is planned for 2010. TE has been published by the Society for Economic Theory (http://econtheory.org/ ), but is to be adopted by the Econometric Society later this year. The first issue in 2010 will be the first one as a Society journal.

This is great news.

Recent Work on Open Economics

January 23rd, 2009

Over the Christmas break I had a chance to make some substantial improvements/additions to our Open Economics project, including:

  1. Improved javascript graphing.
  2. Extended Millennium Development Goals package and added web interface.
  3. First efforts at ‘Where Does My Money Go’

More details on each of these can be found below. Also we’d be delighted to hear from anyone interested in getting involved in this, especially with the last item, so if interested do get in touch.

1. Updated javascript graphing package to use flot.

This also allows us to use javascript to make the graphing stuff more interactive, in particular to select chart type and the series to plot. See e.g. the data on Daily Wages of Thatchers in the Middle Ages or Wheat, barley, oat, mutton and wool prices, and agricultural wages, 1500-1849.

2. Improved Millennium Development Goals package/dataset and added a web interface.

Extended ‘packagization’ of the MDG data by creating a mini-domain model and an associated sql version of data in addition to the existing csv normalized-tabular version of the data:

http://knowledgeforge.net/econ/svn/trunk/econdata/mdg/db.py

This is much more convenient for analysis (e.g. finding all countries which have at least one entry for any of these 3 series between 1995 and 2005 …). It is also essential for:

New web interface for Millennium Development Goals

Using the sql version of the data it was easy to build a quick-and-dirty web interface that enables one to browse and view the data quickly:

http://www.openeconomics.net/mdg/
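As a rough illustration of the kind of analysis query mentioned above (all countries with at least one entry for any of three series between 1995 and 2005), here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are assumptions made for illustration and are not taken from db.py, the sample rows are made up, and series ids 560 and 561 are hypothetical.

```python
import sqlite3

# Hypothetical schema: the real tables defined in db.py may well differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE observation (
    series_id INTEGER, country_id INTEGER, year INTEGER, obs_value REAL
);
""")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [(4, "Afghanistan"), (156, "China"), (356, "India")],
)
# Made-up rows purely for illustration (not real MDG figures).
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", [
    (559, 4, 1997, 1.0),    # matches: listed series, year in range
    (559, 156, 1990, 2.0),  # listed series but year out of range
    (560, 156, 2002, 3.0),  # matches
    (999, 356, 2000, 4.0),  # year in range but series not listed
])

def countries_with_data(conn, series_ids, start, end):
    """Names of countries with at least one entry for any given series in [start, end]."""
    placeholders = ",".join("?" for _ in series_ids)
    sql = f"""
        SELECT DISTINCT c.name
        FROM country c JOIN observation o ON o.country_id = c.id
        WHERE o.series_id IN ({placeholders})
          AND o.year BETWEEN ? AND ?
        ORDER BY c.name
    """
    return [row[0] for row in conn.execute(sql, [*series_ids, start, end])]

result = countries_with_data(conn, [559, 560, 561], 1995, 2005)
print(result)
```

With the csv version the same question would mean loading and joining the files by hand, which is the convenience the sql version is meant to provide.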

For example here’s a chart and data showing “Children under 5 moderately or severely underweight, percentage” for Afghanistan, China, India, United States:

http://www.openeconomics.net/mdg/view?commit=Show+Values&series=559&countries=4&countries=156&countries=356&countries=840
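The link above is just a series id plus a repeated countries parameter, so similar links can be put together programmatically. The following is a small sketch that rebuilds that exact URL with the standard library; the function name is made up for illustration, and the parameter names are simply read off the link rather than from any documented API.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.openeconomics.net/mdg/view"

def mdg_view_url(series, countries):
    """Build a view URL for one series and several country codes.

    Parameter names are copied from the example link; this is not a
    documented interface.
    """
    params = [("commit", "Show Values"), ("series", series)]
    # A list of pairs lets the 'countries' key repeat, once per country.
    params += [("countries", code) for code in countries]
    return BASE_URL + "?" + urlencode(params)

url = mdg_view_url(559, [4, 156, 356, 840])
print(url)
```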

3. First efforts at ‘Where Does My Money Go’

There are two parts to this project: a) getting the data on government revenue/expenditure and b) displaying it nicely in a web interface.

Part (a) is encapsulated in a new ukgovfinances dataset:

http://knowledgeforge.net/econ/svn/trunk/econdata/ukgovfinances/

Using this data we have made a (small) start on the web interface:

http://www.openeconomics.net/wdmmg/

The Open Knowledge Foundation (which I’m involved in) is co-organizing, with MySociety and OPSI, a Workshop on Finding and Re-using Public (Sector) Information.

The event takes place this Saturday (1st of November) at the London Knowledge Lab near Holborn in London. Full details are in this OKFN blog post and you can sign up on the wiki page:

http://okfn.org/wiki/PublicInformation

One of the active Open Knowledge Foundation projects is Open Economics. A substantial part of that effort ends up being data acquisition and ‘cleaning’: getting hold of economic data, parsing it into (computer) usable form and adding it to the Store. (Wouldn’t it be nice if that data was already nicely packaged up or at least in a decent raw form …).

Once this job is done, the data is there in a nice clean state for others to use — plus we can draw some nice graphs (as we will see below). As an illustration of this process, we’ll look at one particular dataset acquired earlier this year when, motivated by the large increases in commodity prices and the concerns expressed regarding their impact, I decided to see what data I could dig up on food prices (starting with Wheat).

As usual, it was US government material that was most easily available (in a decent format) and I decided to start off with historical information on wheat to be found in the Wheat Yearbook, in particular the contents of:

http://www.ers.usda.gov/data/wheat/yearbook/WheatYearbookTables-Recent.xls

While the data was available (and open — since US Govt provided) it was in a format that was not immediately computer usable (lots of blank lines etc). Thus, the first step was to parse this into standard csv file format (see script here) and then upload this to Open Economics. The result:

http://www.openeconomics.net/store/517d7c4e-3cb7-4e8f-aaa1-745dd665ad1f
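To give a flavour of what this kind of cleaning involves, here is a minimal sketch, not the actual script used. It assumes the spreadsheet has already been exported to delimited text, and the sample input is made up rather than taken from the Yearbook. It strips stray whitespace and drops blank rows before writing standard csv.

```python
import csv
import io

def clean_rows(rows):
    """Strip whitespace from each cell and drop rows that are entirely blank."""
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            yield cells

# Made-up sample standing in for an exported sheet (not real Yearbook data).
raw = "Year, Acreage ,Yield\n,,\n1900,10, 2\n\n1901,11,3\n"
cleaned = list(clean_rows(csv.reader(io.StringIO(raw))))

# Write the cleaned rows back out as standard csv.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
```

Real tables usually need more than this (multi-line headers, footnote markers, units in separate cells), so treat it as a starting point only.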

Not only do we now have nice clean data but, thanks to plotkit, Open Economics has javascript graphing so without any more effort we can automatically have graphs of the resulting material. Not only does this allow us to answer our original question (see Fig 4) but these graphs also tell a fascinating historical story:

US Wheat: 1866 – 2007

NB: if the figures are too small click through for the full-size versions on Open Economics (the dates at the bottom run from 1866 to 2007)

Figure 1: Output (Millions of Bushels)


First up is output. As can be seen here output rose steadily (approximately linearly) up until the First World War. It then stayed flat or even fell during the inter-war period: the Great Depression and the Dust Bowl can be seen in the sharp dip in the early 1930s. Following the Second World War output rose, accelerating (exponentially?) up until the early 1980s, since when it has flattened out, even declining (with sharp variations) to the present.

Looking at these raw output figures the immediate question one asks (at least as an economic historian) is: what underlying causes drove these changes in output? In particular, output is the product of two factors: total acreage in use and yield (average output per acre) so it would be interesting to see time-series for them as well. Fortunately this data is also available:

Figure 2: Acreage (Millions of Acres)


The first thing to note is that these series start in 1866, the year after the American Civil War ended. This was a period of great westward expansion in cultivation in the United States — the “Opening of the Prairies”. The graph bears graphic witness to these changes: we can see that harvested acreage tripled between 1866 and the outbreak of WWI in 1914.

This massive expansion was to have a profound effect far outside of the US: food prices dropped around the world due to the increase in supply. In Western Europe this led to a ‘Great Depression’ in agriculture right up until the First World War (which in turn had a significant effect on European politics, creating protectionist alliances between peasants and landowners in many European countries). It also assisted industrialization by keeping the price of bread low for the fast growing industrial proletariat.

However, by the end of WWI most of the acreage that could be cultivated was already in use. After that point, while there has been variation in planted acreage (perhaps driven by substitution between wheat and other crops) there has been no long term trend (whether increasing or decreasing). Thus, while the increase in output up to WWI can be largely explained by increases in acreage under cultivation [^1] the large increases in output in the post-WWII period can’t be. This brings us then to the second major factor in explaining changes in output: yields.

[^1]: a crude eyeballing suggests that output increased somewhere between 3-4 times between 1866 and WWI. This is in line with the increase in acreage. That said, diminishing returns arguments (best land is cultivated first) suggest that maintaining yield per acre on a vastly increased acreage would itself have required some improvement in farming productivity.
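Since output is acreage times yield, the change in yield implied by the eyeballed figures can be backed out by simple division. The sketch below uses only the rough ratios quoted above (output up 3 to 4 times, acreage roughly tripled), not values read from the dataset, and the function name is made up for illustration.

```python
def implied_yield_ratio(output_ratio, acreage_ratio):
    """Return the yield ratio implied by output = acreage * yield."""
    return output_ratio / acreage_ratio

# Eyeballed ratios for 1866 to WWI, as quoted in the text.
low = implied_yield_ratio(3.0, 3.0)   # output tripled
high = implied_yield_ratio(4.0, 3.0)  # output quadrupled
print(f"implied yield change: {low:.2f}x to {high:.2f}x")
```

On these rough numbers yields were somewhere between flat and about a third higher, which is consistent with acreage explaining most of the pre-WWI rise in output.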

Figure 3: Yield (Bushels / Acre)


One could not ask for a sharper confirmation of our previous hypothesis than Figure 3. As it shows, average yields were almost perfectly flat from 1866 up until the end of the Second World War. From that point yields took off, growing sharply but at an almost constant rate up until the mid 70s, following which the growth rate slowed substantially (though yields still continued to grow, albeit with increased variability). In concrete terms this corresponded to a rise in yield from around 12 bushels per acre at the end of WWII to somewhere around 35 bushels per acre in the 70s, and around 40 today.

To put this most starkly: there was a roughly 3-fold increase in yields in this 30 year period. Again this is a particularly ‘graphic’ testament to the ‘green revolution’ of the post-war period which was driven largely by the development and adoption of new corn varieties (hybrid corn), fertilizers etc.
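For the curious, the arithmetic behind that “roughly 3-fold” figure can be spelled out as below. The inputs are the approximate values read off the graph (12 and 35 bushels per acre, about 30 years apart), so the results are only as good as that eyeballing. Both a steady absolute gain and an equivalent compound rate are shown, since either could be meant by “constant rate”.

```python
start_yield = 12.0  # bushels per acre, end of WWII (eyeballed)
end_yield = 35.0    # bushels per acre, mid 1970s (eyeballed)
years = 30

fold = end_yield / start_yield                      # overall multiple
gain_per_year = (end_yield - start_yield) / years   # steady absolute gain
compound_rate = fold ** (1 / years) - 1             # equivalent annual growth rate

print(f"{fold:.2f}-fold increase")
print(f"{gain_per_year:.2f} bushels per acre gained per year")
print(f"{compound_rate:.1%} per year if compounded")
```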

Figure 4: Price ($ per Bushel)


Lastly we come to price. Here, despite substantial fluctuations, the basic trends fit with our historical intuition. There is little change between 1866 and WWI, a sharp rise during the war, a substantial decline in the inter-war period, then another sharp rise during WWII (wars are good for farmers!) followed by stabilization (or even slight decline) until the mid 1970s when there is another sharp rise. Following that there is substantial variation but no great changes until the present when the line shoots up again (doubling from around $3 per bushel to somewhere near $6 in a year).

As basic economics tells us, price should reflect the interaction of supply and demand. The marked stability of price over long periods (particularly those where supply has increased) suggests then that demand has matched supply (or vice-versa) fairly well over this period (one might also need to take account of the fact that there may also have been substantial government intervention to stabilize prices).

Given that supply has risen substantially through the whole period, and especially since WWII (see Fig 1) this means that demand has also been climbing sharply. This is true: world population has increased at least 5x since 1850 and roughly tripled since WWII (in addition many people, especially in developed countries have increased their per-capita consumption, by eating more and better — as well as wasting more).

It would be interesting to imagine what would have happened if this kind of population increase, particularly that since WWII, had occurred without the massive increase in yields shown in Figure 3 (part of the answer may be that population would not have increased so much …). Certainly the price increases seen recently may reflect the kind of growing surplus of demand over supply that we would have seen without the ‘green revolution’. As such, they may be signals of the significant readjustments that will be needed in the near future, whether that be increases in supply, reductions in demand or more efficient use of existing supplies.

Sören Auer posted today to the okfn-discuss lists about plans for Open Participatory Research. Reading this I was particularly struck by his mention of ‘open peer review’ as this seemed directly related to some recent ideas of my own. Specifically I’ve been working on an economics paper with an academic colleague on the subject of dissemination of scholarly information. This is still at an early stage but the basic ideas in it are quite simple — as set out in the current introduction which can be found below.

Introduction

It is well known that in order to (completely) address a given number of (independent) goals one needs an equal number of instruments. For example, if one is seeking to address both congestion and pollution in relation to road-traffic, a single instrument, for example petrol taxes, will be insufficient to address both goals exactly (of course it will allow one to address both goals partially). The same issues arise in relation to the dissemination of scholarly information.

Here too there are multiple independent goals. Traditional academic publishing provides but a single instrument. Originally there was nothing that could be done (for reasons discussed further below), but changes in technology render this restriction to a single instrument unnecessary. Unfortunately, the two-sided nature of the journal market (based on expectations), combined with the current evaluation structure of academia, continues to lock society into this inefficient restriction. Open-access journals provide one, though as we shall argue, not the only, or even most efficient, way to improve the current situation.

Goals and Instruments

Crudely put, the two main goals (or tasks), in relation to the dissemination of scholarly information are:

  • Distribution (transmission of the data/information): ‘Making material available for Reading’
  • Filtering/Recommendation: ‘Deciding what to Read’

It seems clear that these are distinct and hence require distinct instruments for their achievement. Journals can be seen as a single instrument which traditionally has tried to address both ends simultaneously. The deficiency of academic publishing can then be seen as one of insufficient instruments. Initially, because of the limitations of reproduction and distribution technologies, there was little that could be done about this. Today with the advent of the computer and the Internet this is no longer the case and it is possible to address these two distinct goals with two distinct instruments.

Why then did restricted-access Journals originally come about? The answer lies in technology, in particular the nature of the technology available in earlier periods to manage distribution (printing and transmission). When many journals were originally started the cost of transmitting information was very high. Journals essentially acted as a club good by which the costs of reproduction and distribution could be (efficiently) shared (the efficiency arising here from economies of scale).

At the same time, given the limited ‘bandwidth’ it was natural for Journals to take on some filtering role in order to economize on the scarce transmission capacity. In this situation, dissemination is limited and, with only one instrument available (Journals), it is natural to tie dissemination and filtering together (with filtering in many ways secondary). Once filtering is being done it is natural for journals to ‘tie’ material to the journal explicitly via copyright, though at an early stage, given the scale economies of journals, this explicit tying was not actually necessary and was probably done for simple legal convenience.

With the advent of digital communications, in particular the Internet, bandwidth is no longer scarce. What is now scarce is attention. In this setup the importance of a journal is not its role in efficiently sharing reproduction and distribution costs but its role as a filtering mechanism. However, while it is natural to ‘add in’ filtering when distribution is central, it is not natural, or necessary, to tie distribution in to filtering when filtering is central. In fact it seems clear that distribution and filtering can be done entirely separately (i.e. one can have two instruments focused on distribution and filtering respectively). The Open Access movement can be seen as largely about achieving this separation: with open access there is no longer a connection between access/distribution (which would be free) and the filtering mechanism (the choice of which articles go in a particular journal).

That said, the ‘Open Access’ movement still has a large focus on journals, albeit open-access ones. This, in our view, is a mistake. Technology has also affected possibilities for filtering. In particular it is no longer clear why the centralized mechanism of official peer-review and journals is superior to alternative decentralized options. The last decade has witnessed widespread, and often successful, experimentation with distributed voting and evaluation mechanisms (for example Slashdot’s story-ratings and Google’s link-based site rankings).

Thus, to be more radical, it may make sense not only to remove centralized control of distribution but also centralized control of filtering. A more distributed (market-like?) filtering mechanism would permit the same freedom (and same status?) to participate in reviewing and recommendation as exists in the production of scholarly information. At the same time it would deliver greater transparency and, by permitting ‘free-entry’ in filtering, would allow greater specialization, greater diversity, increased participation and greater competition.

As such, the gains from going ‘open’ are not simply wider access, but a reduction in the time and energy scholars spend finding and processing research information. This second item, which is less frequently mentioned in discussions of ‘Open Access’, may well be the more significant.

I’ll be giving a talk at Open Tech 2008 on Saturday (5th July) about some of the work I do at the Open Knowledge Foundation. The talk is entitled “Opening Data” and its rough subject is indicated by the blurb:

We all want more open data to analyse and mashup be it for urban planning or to better understand 12th Century Canon Law. But how do we go about reaching data ‘Nirvana’? What are the obstacles and why is openness so crucial to getting there? This talk explores these questions touching on some of the more prominent recent developments in the area along the way.

OpenTech/NotCon has been a great experience over the years and this time looks to be no exception.