I was recently asked to put together a short document outlining my main policy recommendations in the area of “innovation, creativity and IP”. Below is what I prepared.

General IP Policy

Recommendation: IP policy, and more generally innovation policy, should aim at the improvement of the overall welfare of UK society and citizens and not just at promoting innovation and creativity

Innovation is, of course, a major factor in the improvement of societal welfare, but it is not the only factor: access to the fruits of that innovation is also important.

IP rights are monopolies and such monopolies when over-extended do harm rather than good. The provision of IP rights must balance the promotion of innovation and creativity with the need for adequate access to the results of those efforts both by consumers and those who would seek to innovate and create by building upon them. A policy which aims purely at maximizing innovation, via the use of IP rights, will almost certainly be detrimental to societal welfare, since it will ignore the negative consequences of extending IP on access to innovation and knowledge. As such, IP policy is about having “enough, but not too much”.

This basic point is often overlooked. To help minimize the risk of this occurring in future it is suggested that this basic purpose — of promoting the welfare of UK citizens — be explicitly embedded within the goals of organisations and departments tasked with handling policies related to innovation and IP.

Recommendation: Move away from a focus on intellectual property to look at innovation and information policy more widely

IP rights are but one tool for promoting innovation and often a rather limited one. The focus should be on the general problem — promoting societal welfare through innovation and access to innovation — not on one particular solution to that problem.

Provision and Pricing of Public Sector Information

Background

Public sector information (PSI) is information held by a public sector organisation, for example a government department or, more generally, any entity which is majority owned and/or controlled by government. Classic examples of public sector information in most countries would include, among many others, geospatial data, meteorological information and official statistics.

While much of the data or information used in our society is supplied from outside the public sector, compared to other parts of the economy, the public sector plays an unusually prominent role. In many key areas, a public sector organization may be the only, or one among very few, sources of the particular information it provides (e.g. for geospatial and meteorological information). As such, the policies adopted regarding maintenance, access and re-use of PSI can have a very significant impact on the economy and society more widely.

Funding for public sector information can come from three basic sources: government, ‘updaters’ (those who update or register information) and ‘users’ (those who want to access and use it). Policy-makers control the funding model by setting charges to external groups (‘updaters’ or ‘users’) and committing to make up any shortfall (or receive any surplus) that results. Much of the debate focuses on whether ‘users’ should pay charges sufficient to cover most costs (average cost pricing) or whether they should be given marginal cost access, which equates to free access when the information is digital. However, this should not lead us to neglect the third source of funding: charges for ‘updates’.

Policy-makers must also concern themselves with the regulatory structure in which public sector information holders operate. The need to provide government funding can raise major commitment questions, while the fact that many public sector information holders are the sole source of the information they supply raises serious competition and efficiency issues.

Recommendation: Make digital, non-personal, upstream PSI available at marginal cost (zero)

The case for pricing public sector information to users at marginal cost (equal to zero for digital data) is very strong for a number of complementary reasons. First, the distortionary costs of average rather than marginal cost pricing are likely to be high. Second, the case for hard budget constraints to ensure efficient provision and induce innovative product development is weak. As such, digital upstream public sector information is best funded out of a combination of ‘updater’ fees and direct government contributions with users permitted free and open access. Appropriately managed and regulated, this model offers major societal benefits from increased provision and access to information-based services while imposing a very limited funding burden upon government.

Recommendation: Regulation should be transparent, independent and empowered. For every public sector information holder there should be a single, clear, source of regulatory authority and responsibility, and this ‘regulator’ should be largely independent of government.

This is essential if any pricing-policy is to work well and is especially important for marginal-cost pricing where the Government may be providing direct funding to the information holder. Policy-makers around the world have had substantial experience in recent years with designing these kinds of regulatory systems and this is, therefore, not an issue that should be especially difficult to address.

Copyright Term

Background

The optimal term of copyright has been a very live policy issue over the last decade. Recently, in the European Union, and especially in the UK, there has been much debate over whether to extend the term of copyright in sound recordings from its current 50 years.

The basic trade-off inherent in copyright is a simple one. On the one hand, increasing copyright yields benefits by stimulating the creation of new works but, on the other hand, it reduces access to existing works (the welfare ‘deadweight’ loss). Choosing the optimal term, that is the length of protection, presents these two countervailing forces particularly starkly. By extending the term of protection, the owners of copyrights receive revenue for a little longer. Anticipating this, creators of works which were nearly, but not quite, profitable under the existing term will now produce them, and these works will generate welfare for society both now and in the future. At the same time, the increase in term applies to all works including existing ones (those created under the term of copyright before extension). Extending term on these works prolongs the copyright monopoly and therefore reduces welfare by hindering access to, and reuse of, these works.

Recommendation: Reduce Copyright Term – And Certainly Do Not Extend It

Current copyright term is significantly over-extended. Calculations performed in the course of my own work indicate that optimal copyright term is likely around 15 years and almost certainly below 40 (the breadth of the estimates here is a direct reflection of the existing data limitations, but even this upper bound is far below existing terms).

Even a simple present-value calculation would indicate that the incentives for creativity today offered by extra term 50 years or more in the future are negligible — while the effect on access to knowledge can be very substantial, especially when term extensions are applied retrospectively (as they almost always are).
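A back-of-the-envelope version of such a present-value calculation can be sketched as follows. It assumes a work earns a constant revenue every year (generous to the later years, since most works earn the bulk of their revenue early) and a 5% annual discount rate; both figures are illustrative assumptions rather than estimates taken from this document.

```python
def present_value(start_year, end_year, rate):
    """Present value of 1 unit of revenue per year in years [start_year, end_year)."""
    return sum(1.0 / (1.0 + rate) ** t for t in range(start_year, end_year))


RATE = 0.05  # assumed annual discount rate (illustrative only)

base = present_value(0, 50, RATE)        # value of the first 50 years of term
extension = present_value(50, 70, RATE)  # value added by 20 further years

# How much the extension adds, relative to what the creator already had.
extension_share = extension / base
print(f"Extension adds {extension_share:.1%} to the value of the first 50 years")
```

Under these assumptions the 20 extra years add only around 6% to the present value of the incentive, and any realistic decline in revenue over a work's life would shrink that figure further.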

It is also noteworthy that recent extensions, such as that for authorial copyright in the US (the CTEA) and the proposed extension of recording copyright in the EU, have been opposed well-nigh unanimously by academic economists and other IP scholars. Policy-making in this area should be evidence-based and designed to promote the broader welfare of society as a whole. Policies that appear to reflect nothing more than special-interest lobbying will only perpetuate the “marked lack of public legitimacy” which the Gowers report lamented, discouraging those who wish to contribute constructively to future Government policy-making in these areas, and making enforcement ever harder. Effective enforcement, after all, depends on consent born of respect as well as obedience coerced through punishment.

Deliverance is a great library that lets you easily re-theme external websites on the fly. Designed as WSGI middleware, it can be easily combined with some proxying to integrate a bunch of websites together.

You can use deliverance plus proxying out-of-the-box using the deliverance-proxy command. However, I was interested in using Deliverance as middleware from code. This turned out to be none too trivial to do — all the examples on the internet seemed to focus on using deliverance-proxy or using it in an ini file.

After much wrestling, most notably with odd issues with gzipped (deflated) content, I got it working, and you can find a demo implementation (see demo.py and README.txt) here:

http://rufuspollock.org/code/deliverance/

I should also mention the following sources which were all of help in my quest:

Instructions on using sqlalchemy migrate with Pylons, especially to convert an existing pylons project to use sqlalchemy migrate

This is based off several excellent sources including this guide and these threads.

One important point to note is that you are likely to end up with two versions of your model tables: one in yourapp/model and one in yourapp/migration/versions/*.py, with the former representing your tables at HEAD and the latter containing upgrade/downgrade scripts whose final result is HEAD. This duplication is a bit annoying and I discuss how it can be avoided below.

1. Install sqlalchemy migrate for your project e.g.

  pip -E {your-virtualenv} install sqlalchemy-migrate
  # or
  easy_install sqlalchemy-migrate

NB: the latest versions of migrate are only compatible with sqlalchemy >= 0.5 (for previous versions of sqlalchemy you need migrate <= 0.4.5)

2. Create the migrate repository (i.e. store for upgrade scripts …).

In your project directory

  migrate create myapp/migration/ "MyApp"

Now create a temporary helper script:

  migrate manage dbmanage.py --repository=myapp/migration/ --url={your-sqlalchemy-db-uri}

3. Set up db version control

  python dbmanage.py version_control

Check the current version (should be 0)

  python dbmanage.py version

4. Create a migration script for your existing db

  python dbmanage.py script "Add existing tables"

This will create a script in myapp/migration/versions/001_add_existing_tables.py

Copy into that file the definition for all your existing tables (and other database stuff such as constraints) and then create those tables in the upgrade() function (and delete them in downgrade()).
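As a rough sketch of what such a script might look like: the two tables below are hypothetical stand-ins for your own, and the `upgrade(migrate_engine)` / `downgrade(migrate_engine)` signature is the one used by later migrate releases (older releases instead expect you to use a `migrate_engine` imported from `migrate`, so adapt to whatever the generated template contains).

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        Unicode, create_engine, inspect)

# Hypothetical tables, copied from the model as it stands today.
meta = MetaData()

person = Table(
    "person", meta,
    Column("id", Integer, primary_key=True),
    Column("name", Unicode(255)),
)

item = Table(
    "item", meta,
    Column("id", Integer, primary_key=True),
    Column("title", Unicode(255)),
    Column("person_id", Integer, ForeignKey("person.id")),
)


def upgrade(migrate_engine):
    # Create every table defined above (parents before children).
    meta.create_all(migrate_engine)


def downgrade(migrate_engine):
    # Drop them again, in reverse dependency order.
    meta.drop_all(migrate_engine)


if __name__ == "__main__":
    # Quick local check against a throwaway in-memory database.
    engine = create_engine("sqlite://")
    upgrade(engine)
    print(inspect(engine).get_table_names())
```

Keeping the table definitions inside the script, rather than importing them from the model, means the script keeps describing version 1 even after the model moves on.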

That’s it! (in theory)

Additional Issues

1. Duplication of model/db code

You now have two places for model/db code:

  1. Your migration scripts
  2. Your actual model

This doesn’t have to be a problem but it is an obvious way for bugs to creep in, especially when some people start by creating their DB from the model code and others from the migration scripts.

Warning: the autoload approach described next will not work if you do things in your table creation that are not persisted into the actual DB SQL (e.g. column default values based on a function, custom db types …).

One way to avoid the duplication is to have all table creation and alteration confined to your migration scripts and then have your model tables set up directly from the DB using the autoload=True option. The one disadvantage of this is that you can’t see the complete DB setup in one place, as table construction may be spread over several migrate scripts. One solution to this is provided by the experimental ‘create_model’ command, which dumps the current DB model as the required sqlalchemy table code.

More discussion in this migrate-users thread

Bringing the Migration DB up to date

If you do set up your DB (from new) directly from your model code rather than the migration scripts then you need to set up the migration machinery and update the migrate version to the correct number. (I note there is an experimental update_db_from_model command which is supposed to do this for you.) You can do this as follows (assuming your migrate repository is at YOURAPP/migration):

      from migrate.versioning.api import version_control, version
      import YOURAPP.migration.versions

      # path on disk to the migrate repository
      repository = YOURAPP.migration.__path__[0]
      # latest version number available in the repository
      v = version(repository)
      # log.info("Setting current version to '%s'" % v)
      # url is your sqlalchemy db url
      version_control(url, repository, v)

Extras

  • Should wrap upgrade/downgrade in transactions. I found one way to do this here but testing indicated this didn’t work for me (rollback was not working properly when there was an error)

In doing research for the EU Public Domain project (as here and here) we are often handling large datasets: for example, one national library’s list of pre-1960 books stretched to over 4 million items. In such a situation, an algorithm’s speed (and space) can really matter. To illustrate, consider our ‘loading’ algorithm, i.e. the algorithm to load MARC records into the DB, which had the following steps:

  1. Do a simple load: i.e. for each catalogue entry create a new Item and new Persons for any authors listed
  2. “Consolidate” all the duplicate Persons, i.e. a Person who is really the same but for whom we created duplicate DB entries in step 1 (we can do this because MARC cataloguers try to uniquely identify authors based on name + birth date + death date).
  3. [Not discussed here] Consolidate “items” to “works” (associate multiple items (i.e. distinct catalogue entries) of, say, A Christmas Carol, to a single “work”)
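The consolidation in step 2 can be sketched in a few lines. This is only an in-memory illustration with made-up records and field names, not the project's actual code (which worked against the DB): it groups Persons on name + birth date + death date and maps every duplicate to one canonical id.

```python
from collections import defaultdict


def consolidate(persons):
    """Map each person id to the canonical id of its (name, birth, death) group."""
    groups = defaultdict(list)
    for person in persons:
        # Normalise the name lightly so trivial differences do not split a group.
        key = (person["name"].strip().lower(), person["birth"], person["death"])
        groups[key].append(person["id"])

    canonical = {}
    for ids in groups.values():
        keep = min(ids)  # arbitrarily keep the earliest-created record
        for pid in ids:
            canonical[pid] = keep
    return canonical


# Hypothetical records as they might look after the simple load in step 1.
persons = [
    {"id": 1, "name": "Dickens, Charles", "birth": 1812, "death": 1870},
    {"id": 2, "name": "dickens, charles ", "birth": 1812, "death": 1870},
    {"id": 3, "name": "Dickens, Charles", "birth": None, "death": None},
]
mapping = consolidate(persons)
print(mapping)
```

Record 3 is deliberately left unmerged here because it lacks dates; whether to merge such records is a judgement call that this sketch does not try to settle.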

The first part of this worked great: on a 1 million record load we averaged between 8s and 25s (depending on hardware, DB backend etc) per thousand records, with speed fairly constant throughout (so that’s roughly 2 to 7 hours to load the whole lot). Unfortunately, at the consolidate stage we ran into problems: for a 1 million item DB there were several hundred thousand consolidations and we were averaging 900s per 1000 consolidations! (This also scaled significantly with DB size: a 35k record DB averaged 55s per 1000.) This would mean a full run would require several days! Even worse, because of the form of the algorithm (all the consolidations for a given person were done as a batch) we ran into memory issues on big datasets with some machines.
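The run-time estimates here are simple arithmetic on the per-thousand timings, which the following sketch reproduces. The figure of 300,000 consolidations is an assumed stand-in for "several hundred thousand".

```python
def run_hours(n_records, seconds_per_thousand):
    """Total run time in hours given a per-thousand-records timing."""
    return n_records / 1000 * seconds_per_thousand / 3600


load_fast = run_hours(1_000_000, 8)    # best case for the simple load
load_slow = run_hours(1_000_000, 25)   # worst case for the simple load

# Assumed: 300k consolidations at the measured 900s per thousand.
consolidate_days = run_hours(300_000, 900) / 24

print(f"load: {load_fast:.1f}h to {load_slow:.1f}h")
print(f"consolidation: {consolidate_days:.1f} days")
```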

To address this we switched to performing “consolidation” on load, i.e. when creating each Item for a catalogue entry we’d search for existing authors who matched the information we had on that record. Unfortunately this had a huge impact on the load: time grew superlinearly and had already reached 300s per 1000 records at the 100k mark, having started at 40s (Figure 1 plots this relationship). By extrapolation, 1M records would take 100 hours plus, almost a week!

At this point we went back to the original approach and tried optimizing the consolidation, first by switching to pure sql and then by adding some indexes on join tables (I’d always thought that foreign keys were auto indexed but it turned out not to be the case!). The first of these changes solved the memory issues, while the second resolved the speed problems, providing a speedup of around 30x (30s per 1000 rather than 900s) and reducing the processing time from several days to a few hours.
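The point about foreign keys not being indexed automatically is easy to check. The sketch below uses SQLite (via Python's standard library) and a made-up join table; other backends differ in detail, so treat it as an illustration of the effect rather than a description of our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    -- Join table: declaring foreign keys does not create indexes on them.
    CREATE TABLE item_person (
        item_id INTEGER REFERENCES item(id),
        person_id INTEGER REFERENCES person(id)
    );
""")

QUERY = "SELECT item_id FROM item_person WHERE person_id = ?"


def query_plan(connection):
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # full table scan
conn.execute("CREATE INDEX ix_item_person_person_id ON item_person (person_id)")
plan_after = query_plan(conn)   # index lookup

print(plan_before)
print(plan_after)
```

Before the index is added the plan is a scan of the whole join table for every lookup, which is exactly the kind of cost that grows with DB size.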

Many more examples of this kind of issue could be provided. However, this one already serves to illustrate the two main points:

  • With large datasets speed really matters
  • Even with optimization, algorithms can take a substantial time to run

Both of these have a significant impact on the speed, and form, of the development process. First, because one has to spend time optimizing and profiling — which like all experimentation is time-consuming. Second because longer run-times directly impact the rate at which results are obtained and development can proceed — often bugs or improvements only become obvious once one has run on a large dataset, plus any change to an algorithm that alters output requires that it be rerun.

Figure 1: Load time when doing consolidation on load

Background

I’m working on an EU-funded project to look at the size and value of the Public Domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?

I’ve also been involved for several years in Public Domain Works, a project to create a database of works which are in the public domain (especially recordings).

The Problem

Suppose we have data on cultural items such as books and recordings. For a given item we wish to:

  1. Identify the underlying work(s) that item contains.
  2. Identify the copyright status of that work, in particular whether it is Public Domain (PD)

Putting 1 and 2 together allows us to assign a ‘copyright status’ to a given item.

Aside: We have to be a bit careful here since the copyright status of an item and its work may not be exactly the same: for example, even books containing pure public domain texts may have copyright in their typesetting — or there may be additional non-PD material such as an introduction or commentaries (though, in this case, at least theoretically, we should say the item contains 2 works a) the original PD text b) the non-PD introduction).

Note our terminology here (based off FRBR): by an ‘item’ we mean something like a publication, be that a book, recording or whatever. By a ‘work’ we mean the underlying material (text, sounds etc) contained within that. So for example, Shakespeare’s play “Hamlet” is a single work but there are many associated items (publications). (Note that we would count a translation of a work as a new work, though one derived from the original work.)

Almost all the data available on cultural material is about items. For example, library catalogues list items, databases listing sales (such as Nielsen) list items and online sites providing information on currently available material (along with prices) such as booksinprint, muze or even Amazon list items.

Determining Copyright (or Public Domain) Status

With our terminology in place determining copyright status is, in theory, simple:

  1. Given information on an item match it to a work (or works).
  2. For each work obtain relevant information such as date work first published (as an item) and death dates of author(s)
  3. Compute copyright status based on the copyright laws for your jurisdiction.

While copyright law is not always simple, step three is generally fairly straightforward, especially if one is willing to accept something that is almost but not quite 100% accurate (say 99.99% accurate).[^peterpan]

[^peterpan]: Not being 100% accurate means we can ignore some of the “special cases” and one-off exceptions in copyright law. For example, in the UK the Copyright, Designs and Patents Act section 301 contains a special provision which means that “Peter Pan” by J.M. Barrie will never enter the Public Domain (royalties will be payable in perpetuity for the benefit of Great Ormond Street Hospital).
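For the simplest case, step 3 can be sketched as below. It assumes a plain life-plus-70-years rule running to the end of the calendar year, and it ignores every special case (such as the one in the footnote); the function name and signature are mine, for illustration only.

```python
def is_public_domain(death_years, year, term=70):
    """True only if every author's death year is known and the term has expired.

    Assumes copyright lasts until the end of the calendar year `term` years
    after the author's death, so a work is free from 1 January of
    death_year + term + 1.
    """
    if not death_years or any(d is None for d in death_years):
        return False  # be conservative when information is missing
    return all(d + term < year for d in death_years)


print(is_public_domain([1870], 2007))        # long-dead author
print(is_public_domain([1937], 2007))        # term not yet expired in 2007
print(is_public_domain([1870, None], 2007))  # co-author with unknown death date
```

With these assumptions, a work is PD in 2007 exactly when all of its authors died before 1937.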

What is not so straightforward are the first two steps especially step 1. This is because most datasets give only a limited amount of information on the items they contain.

Frequently information on authors will be limited or non-existent, and they certainly may not be unambiguously identified (this is especially true of datasets containing ‘commercial’ information such as prices and availability). Often the exact form of the title, even for the same item, will vary between datasets, and that leaves aside the possibility of varying titles for different items related to the same work (is it “Hamlet” or “William Shakespeare’s Hamlet” or “Hamlet by William Shakespeare” or “Hamlet, Prince of Denmark” etc).
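To give a flavour of the title problem, here is a rough normalisation heuristic applied to the Hamlet variants. It is purely illustrative (the function is hypothetical, assumes the author's name is already known, and only handles ASCII letters), and it shows that such clean-up gets you part of the way but not all of it.

```python
import re


def normalize_title(title, author):
    """Reduce a catalogue title to a bare, comparable form (a rough heuristic)."""
    text = title.lower().replace("\u2019", "'")  # straighten curly apostrophes
    name = author.lower()
    # Drop "by <author>" and "<author>'s" decorations.
    text = text.replace("by " + name, " ").replace(name + "'s", " ")
    # Strip punctuation and collapse whitespace.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


variants = [
    "Hamlet",
    "William Shakespeare\u2019s Hamlet",
    "Hamlet by William Shakespeare",
    "Hamlet, Prince of Denmark",
]
normalized = [normalize_title(t, "William Shakespeare") for t in variants]
print(normalized)
```

The first three variants collapse to the same string, but the fourth does not, which is where fuzzier matching (and the risk of false matches) comes in.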

At the same time, speed matters because the size of the datasets involved is fairly substantial. For example, there were approx 64 thousand titles that sold more than 5 copies in 2007 in the UK. If computing public domain status for each title takes 1 second then a full run will take 18 hours. If it takes 30s per title it will take 22 days.

Some Examples

To illustrate the difficulties here I present the results of two different attempts at computing the PD status for the list of 64k titles which sold at least 5 copies in the UK in 2007.

Example 1: Open Library

I ran this algorithm (by_work method) against the Open Library database via their web API. This was a very slow process: first, because web APIs are relatively slow and, second, because, perhaps due to overloading, the OL API would stop responding at some point and a manual restart would be required (to try to avoid overloading the API we’d already added a significant delay between requests, another reason the process was quite slow). Overall it took around 10 days to run through the whole 64k item dataset. The results were as follows:

Total PD: 2206.0
Total Items: 63937
Fraction PD: 0.0345027136087
Total Matched: 0.588469900058

As this shows, matching was not that successful, with only around 3/5 of items successfully matched. Part of this may be due to:

  • The limit of 10 title matches that I imposed in order to keep the time within reasonable bounds
  • The difficulty of allowing enough, but not too much, fuzziness in the matching process.

Overall, approximately 3.5% of all items were identified as PD (that being 5.9% of those actually matched). The PD determination algorithm was a conservative one, with an item labelled as PD only if all authors were positively identified as PD.

Thus, this is likely to be a lower bound (at least assuming the match process was reasonable, and allowing for the fact that some PD items included non-PD material such as commentaries). It was certainly clear from basic eyeballing that a substantial number of PD works were either not matched or not computed as PD (because of incorrect authors or missing death dates).

Example 2

Our second algorithm ran against a local copy of Philip Harper’s NGCOBA database (data, code). The algorithm was as follows:

  1. Match by title and authors.
    • If match: compute PD status strictly (all death dates known and all less than 1937)
    • Else: continue
  2. Pick first author and find all (approx) matching authors (allow extra first names)
    • If no match: Not PD
    • Initialize PD score to 0
    • For each matched author alter score in following manner:
      • If author PD: +1
      • If not PD: -3
      • If unknown (no death_date) -0.5
    • PD if score > 0 (Else: Not PD)
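The scoring in step 2 can be sketched as follows. This is an illustration of the rule as listed above, not the code that was actually run: the function names are hypothetical, the matching itself is left out (the input is simply the death years of the approximately matching authors), and the 1937 cut-off is expressed as life plus 70 years for a 2007 dataset.

```python
SCORES = {"pd": 1.0, "not_pd": -3.0, "unknown": -0.5}


def author_status(death_year, year=2007, term=70):
    """Classify one matched author as 'pd', 'not_pd' or 'unknown'."""
    if death_year is None:
        return "unknown"
    return "pd" if death_year + term < year else "not_pd"


def pd_by_score(candidate_death_years, year=2007):
    """Score every approximately matching author; PD only if the total is > 0."""
    if not candidate_death_years:
        return False  # no matching author at all: Not PD
    score = sum(SCORES[author_status(d, year)] for d in candidate_death_years)
    return score > 0


print(pd_by_score([1870, None]))   # one PD match, one unknown
print(pd_by_score([1870, 1990]))   # one PD match outweighed by a non-PD one
```

The asymmetric weights mean a single clearly in-copyright match needs at least four PD matches to outvote it, which is what keeps the leniency in check.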

This algorithm took a few hours to run (this could likely be much improved with a bit of DB optimization and a move from sqlite to something better). The results were:

Total PD: 6404.0
Total Items: 63917
Fraction PD: 0.100192437067

As can be seen the fraction PD here was substantially higher at around 10%. One might be concerned that this was due to our more lenient PD algorithm (the problem was that without such ‘leniency’ a very large number of PD works/authors were being misclassified as not PD). However, basic eye-balling indicates that the number of false positives is not particularly high (and that there are also some false negatives).

Summary

  1. Computing PD status is non-trivial largely because a) it is hard to match a given item to a work or person b) we lack data such as authorial death dates and dates of first publication that are required.
  2. As such we need to adopt approximate and probabilistic methods (such as the scoring approach)
  3. (Very) preliminary calculations suggest that between 3 and 10% of titles actively sold at any one time are public domain
    • NB: this does not mean 3-10% of sales were public domain (in fact this is very unlikely since few, if any of the best-selling items are PD)

Recent Work on Open Economics

January 23rd, 2009

Over the Christmas break I had a chance to make some substantial improvements/additions to our Open Economics project, including:

  1. Improved javascript graphing.
  2. Extended Millennium Development Goals package and added web interface.
  3. First efforts at ‘Where Does My Money Go’

More details on each of these can be found below. Also we’d be delighted to hear from anyone interested in getting involved in this, especially with the last item, so if interested do get in touch.

1. Updated javascript graphing package to use flot.

This also allows us to use javascript to make the graphing stuff more interactive, in particular to select chart type and the series to plot. See e.g. the data on Daily Wages of Thatchers in the Middle Ages or Wheat, barley, oat, mutton and wool prices, and agricultural wages, 1500-1849.

2. Improved Millennium Development Goals package/dataset and added a web interface.

Extended ‘packagization’ of the MDG data by creating a mini-domain model and an associated sql version of data in addition to the existing csv normalized-tabular version of the data:

http://knowledgeforge.net/econ/svn/trunk/econdata/mdg/db.py

This is much more convenient for analysis (e.g. finding all countries which have at least one entry for any of these 3 series between 1995 and 2005 …). It is also essential for:

New web interface for Millennium Development Goals

Using the sql version of the data it was easy to build a quick-and-dirty web interface that enables one to browse and view the data quickly:

http://www.openeconomics.net/mdg/

For example, here’s a chart and data showing “Children under 5 moderately or severely underweight, percentage” for Afghanistan, China, India, United States:

http://www.openeconomics.net/mdg/view?commit=Show+Values&series=559&countries=4&countries=156&countries=356&countries=840

3. First efforts at ‘Where Does My Money Go’

There are two parts to this project: a) getting the data on government revenue/expenditure; b) displaying it nicely in a web interface.

Part (a) is encapsulated in a new ukgovfinances dataset:

http://knowledgeforge.net/econ/svn/trunk/econdata/ukgovfinances/

Using this data we have made a (small) start on the web interface:

http://www.openeconomics.net/wdmmg/

Imagemagick convert notes

January 12th, 2009

Helping myself remember how to do common things using imagemagick’s (excellent but many-optioned) convert utility.

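# scale to 10% of original size
convert -scale 10% x y

# convert to grayscale
convert -type Grayscale x y

# convert to black and white
convert -monochrome x y

# invert colours
convert -negate in out

# rotate clockwise by the given number of degrees (e.g. 90)
convert -rotate 90 x y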

For large data centres a big industry player estimated costs of £22 / GB / Month, i.e. roughly £264k / TB / Year. The majority of this was hardware and energy costs (not costs of human sysadmins). This seems quite a lot. By comparison, Amazon S3 quotes the following for Europe (cheaper for US):

Storage
$0.18 per GB-Month of storage used

Data Transfer
$0.100 per GB - all data transfer in

$0.170 per GB - first 10 TB / month data transfer out
$0.130 per GB - next 40 TB / month data transfer out
$0.110 per GB - next 100 TB / month data transfer out
$0.100 per GB - data transfer out / month over 150 TB

Requests
$0.012 per 1,000 PUT, POST, or LIST requests
$0.012 per 10,000 GET and all other requests*
* No charge for delete requests

Subtracting say £2 for costs of storage and transfer leaves £20 per GB-month, or about $40 per GB-month. On Amazon’s figures ($0.17 per GB out) this buys around 235 GB of transfer (0.235 TB), a ratio of 235 to 1 on the underlying data. That is not necessarily an infeasible level (every stored byte being downloaded 235 times a month). This also demonstrates that bandwidth costs will dwarf storage costs in most cases.
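The transfer arithmetic can be checked with a small sketch of the tiered outbound prices quoted above. It assumes 1 TB = 1000 GB and that tiers are charged cumulatively (each tier's price applies only to the traffic falling within it); the function name is mine.

```python
# Tiered outbound transfer prices quoted above: (tier size in GB, $ per GB).
TIERS = [
    (10_000, 0.170),
    (40_000, 0.130),
    (100_000, 0.110),
    (float("inf"), 0.100),
]


def transfer_out_cost(gb):
    """Cost in $ of transferring `gb` gigabytes out in one month."""
    cost, remaining = 0.0, gb
    for size, price in TIERS:
        used = min(remaining, size)  # traffic charged at this tier's price
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


budget_usd = 40.0  # the $40 per GB-month figure from the text
gb_in_first_tier = budget_usd / TIERS[0][1]

print(f"${budget_usd:.0f} buys about {gb_in_first_tier:.0f} GB of transfer out")
print(f"20 TB out in a month costs ${transfer_out_cost(20_000):,.2f}")
```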

I’ll be giving a talk at Open Tech 2008 on Saturday (5th July) about some of the work I do at the Open Knowledge Foundation. The talk is entitled “Opening Data” and its rough subject is indicated by the blurb:

We all want more open data to analyse and mashup be it for urban planning or to better understand 12th Century Canon Law. But how do we go about reaching data ‘Nirvana’? What are the obstacles and why is openness so crucial to getting there? This talk explores these questions touching on some of the more prominent recent developments in the area along the way.

OpenTech/NotCon has been a great experience over the years and this time looks to be no exception.

A new version (v1.2) of my python script for converting markdown to latex is now done. markdown2latex (renamed from mkdn2latex) has been extensively refactored to become a proper python-markdown extension. This means it can be used seamlessly alongside plain markdown conversion, as well as independently, whether as a module or, in its classic form, from the command line.

In addition for ease of installation it has also been turned into a proper python package and registered on pypi so you can just do:

$ easy_install markdown2latex

Alternatively you can still get it straight from the repository at:

http://knowledgeforge.net/okftext/svn/trunk/python/markdown2latex/