Algorithm Speed and the Challenge of Large Datasets
June 22nd, 2009
In doing research for the EU Public Domain project (as here and here) we are often handling large datasets, for example one national library’s list of pre-1960 books stretched to over 4 million items. In such a situation, an algorithm’s speed (and space) can really matter. To illustrate, consider our ‘loading’ algorithm — i.e. the algorithm to load MARC records into the DB, which had the following steps:
- Do a simple load: i.e. for each catalogue entry create a new Item and new Persons for any authors listed
- “Consolidate” all the duplicate Persons, i.e. a Person who is really the same but for whom we create duplicate DB entries in part 1 (we can do this because MARC cataloguers try to uniquely identify authors based on name + birth date + death date).
- [Not discussed here] Consolidate “items” to “works” (associate multiple items (i.e. distinct catalogue entries) of, say, a Christmas Carol, to a single “work”)
The first part of this worked great: on a 1 million record load we averaged between 8s and 25s (depending on hardware, DB backend etc) per thousand records with speed fairly constant throughout (so that’s between 2.5 and 7.5h to load the whole lot). Unfortunately, at the consolidate stage we ran into problems: for a 1 million item DB there were several 100 thousand consolidations and we were averaging only 900s per 1000 consolidations! (This also scaled significantly with DB size: a 35k records DB averaged 55s per 1000). This would mean a full run would require several days! Even worse, because of the form of the algorithm (all the consolidation for a given person were done as a batch) we ran into memory issues on big datasets with some machines.
To address this we switched to performing “consolidation” on load, i.e. when creating each Item for a catalogue entry we’d search for existing authors who matched the information we had on that record. Unfortunately this had a huge impact on the load: time grew superlinearly and had already reached 300s per 1000 records at the 100k mark having started at 40 — Figure 1 plots this relationship. By extrapolation, 1M records would take 100 hours plus — almost a week!
At this point we went back to the original approach and tried optimizing the consolidation, first by switching to pure sql and then by adding some indexes on join tables (I’d always thought that foreign keys were auto indexed but it turned out not to be the case!). The first of these changes solved the memory issues, while the second resolved the speed problems providing a speedup of more than 30x (30s per 1000 rather 900s) and reduced the processing time from several days to a few hours.
Many more examples of this kind of issue could be provided. However, this one already serves to illustrate the two main points:
- With large datasets speed really matters
- Even with optimization algorithms can take a substantial time to run
Both of these have a significant impact on the speed, and form, of the development process. First, because one has to spend time optimizing and profiling — which like all experimentation is time-consuming. Second because longer run-times directly impact the rate at which results are obtained and development can proceed — often bugs or improvements only become obvious once one has run on a large dataset, plus any change to an algorithm that alters output requires that it be rerun.

Figure 1: Load time when doing consolidation on load
Computing Copyright (or Public Domain) Status of Cultural Works
March 12th, 2009
Background
I’m working on a EU funded project to look at the size and value of the Public Domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?
I’ve also been involved for several years in Public Domain Works, a project to create a database of works which were in the public domain (especially recordings).
The Problem
Suppose we have data on cultural items such as books and recordings. For a given item we wish to:
- Identify the underlying work(s) that item contains.
- Identify the copyright status of that work, in particular whether it is Public Domain (PD)
Putting 1 and 2 together allows us to assign a ‘copyright status’ to a given item.
Aside: We have to be a bit careful here since the copyright status of an item and its work may not be exactly the same: for example, even books containing pure public domain texts may have copyright in their typesetting — or there may be additional non-PD material such as an introduction or commentaries (though, in this case, at least theoretically, we should say the item contains 2 works a) the original PD text b) the non-PD introduction).
Note our terminology here (based off FRBR): by an ‘item’ we mean something like a publication be that book, recording or whatever. By a work we mean the underlying material (text, sounds etc) contained within that. So for example, Shakespeare’s play “Hamlet” is a single work but there are many associated items (publications). (Note that we would count a translation of a work as a new work — though one derived from the original work).
Almost all the data available on cultural material is about items. For example, library catalogues list items, databases listing sales (such as Nielsen) list items and online sites providing information on currently available material (along with prices) such as booksinprint, muze or even Amazon list items.
Determining Copyright (or Public Domain) Status
With our terminology in place determining copyright status is, in theory, simple:
- Given information on an item match it to a work (or works).
- For each work obtain relevant information such as date work first published (as an item) and death dates of author(s)
- Compute copyright status based on the copyright laws for your jurisdiction.
While copyright law is not always simple, step three is generally fairly straightforward, especially if one is willing to accept something that almost but not quite 100% accurate (say 99.99% accurate).[^peterpan]
[^peterpan]: Not being 100% accurate means we can ignore some of the “special cases” and one-off exceptions in copyright law. For example, in the UK the Copyright Designs and Patents Act para 301 contains a special provision which mean that “Peter Pan” by J.M. Barrie will never enter the Public Domain (royalties will be payable in perpetuity for the benefit of Great Ormond Street Hospital).
What is not so straightforward are the first two steps especially step 1. This is because most datasets give only a limited amount of information on the items they contain.
Frequently information on authors will be limited or non-existent, and they certainly may not be unambiguously identified (this is especially true of datasets containing ‘commercial’ information such as prices and availability). Often the exact form of the title, even for the same item will vary between datasets and that leaves aside the possibility of varying titles for different titles related to the same work (is it “Hamlet” or “William Shakespeare’s Hamlet” or “Hamlet by William Shakespeare” or “Hamlet, Prince of Denmark” etc).
At the same time, speed matters because the size of the datasets involved are fairly substantial. For example, there were approx 64 thousand titles that sold more than 5 copies in 2007 in the UK. If computing public domain status for each title takes 1 second then a full run will take 18 hours. If it takes 30s per title it will take 22 days.
Some Examples
To illustrate the difficulties here I present the results of two different attempts at computing the PD status for the list of 64k titles which sold at least 5 copies in the UK in 2007.
Example 1: Open Library
I ran this algorithm (by_work method) against the Open Library database via their web api. This was a very slow process. First, because web apis are relatively slow and second because, perhaps due to overloading, the OL API would stop responding at some point and a manual reboot would be required (to try avoid overloading the API we’d already added a significant delay between requests — another reason the process was quite slow). Overall it took more around 10 days to run through the whole 64k item dataset. The results were as follows:
Total PD: 2206.0
Total Items: 63937
Fraction PD: 0.0345027136087
Total Matched: 0.588469900058
As this shows matching was not that successful with only around 3/5 of items successfully matched. Part of this may be due to the fact that:
- I limit the number of title matches to 10 in order to keep the time within reasonable bounds
- The difficulty of allowing enough, but not too much, fuzziness in the matching process.
Overall, approximately 3.5% of all items were identified as PD (that being 5.8% of those actually matched). The PD determination algorithm was a conservative one with an item labelled as PD only if all authors were positively identified as PD.
Thus, this is likely to be lower bounds (at least assuming the match process was reasonable — and allowing for the fact that some PD items included non-PD material such as commentaries). It was certainly clear from basic eyeballing that a substantial number of PD works were either not matched or not computed as PD (because of incorrect authors or missing death dates).
Example 2
Our second algorithm ran against a local copy of Philip Harper’s NGCOBA database (data, code). The algorithm was as follows:
- Matched by title and authors.
- If match: compute PD status strictly (all death dates known and all less than 1937)
- Else: continue
- Pick first author and find all (approx) matching authors (allow extra first names)
- If no match: Not PD
- Intialize PD score to 0
- For each matched author alter score in following manner:
- If author PD: +1
- If not PD: -3
- If unknown (no death_date) -0.5
- PD if score > 0 (Else: Not PD)
This algorithm took a few hours to run (this could likely be much improved with a bit of DB optimization and a move from sqlite to something better). The results were:
Total PD: 6404.0
Total Items: 63917
Fraction PD: 0.100192437067
As can be seen the fraction PD here was substantially higher at around 10%. One might be concerned that this was due to our more lenient PD algorithm (the problem was that without such ‘leniency’ a very large number of PD works/authors were being misclassified as not PD). However, basic eye-balling indicates that the number of false positives is not particularly high (and that there are also some false negatives).
Summary
- Computing PD status is non-trivial largely because a) it is hard to match a given item to a work or person b) we lack data such as authorial death dates and dates of first publication that are required.
- As such we need to adopt approximate and probabilistic methods (such as the scoring approach)
- (Very) preliminary calculations suggest that between 3 and 10% of titles actively sold at any one time are public domain
- NB: this does not mean 3-10% of sales were public domain (in fact this is very unlikely since few, if any of the best-selling items are PD)
Recent Work on Open Economics
January 23rd, 2009
Over the Christmas break I had a chance to make some substantial improvements/additions to our Open Economics including:
- Improved javascript graphing.
- Extend Millenium Development Goals package and added web interface.
- First efforts at ‘Where Does My Money Go’
- Aim: Dig up govt finance info and visualize the results (online)
- http://okfn.org/wiki/projects/Where_Does_My_Money_Go
More details on each of these can be found below. Also we’d be delighted to here from anyone interested in getting involved in this, especially with the last item, so if interested do get in touch.
1. Updated javascript graphing package to use flot.
This also allows us to use javascript make the graphing stuff more interactive, in particular to select chart type and the series to plot. See e.g. the data on Daily Wages of Thatchers in the Middle Ages or Wheat, barley, oat, mutton and wool prices, and agricultural wages, 1500-1849.
2. Improved Millenium Development Goals package/dataset and added a web interface.
- MDG entry in the store
- Source for MDG package: http://knowledgeforge.net/econ/svn/trunk/econdata/mdg/
Extended ‘packagization’ of the MDG data by creating a mini-domain model and an associated sql version of data in addition to the existing csv normalized-tabular version of the data:
http://knowledgeforge.net/econ/svn/trunk/econdata/mdg/db.py
This is much more convenient for analysis (e.g. finding all countries which have at least one entry for any of these 3 series between 1995 and 2005 …). It is also essential for:
New web interface for Millenium Development Goals
Using the sql version of the data is was easy to build a quick-and-dirty web interface to enables one to browse and view the data quickly:
http://www.openeconomics.net/mdg/
For example here’s chart and data showing “Children under 5 moderately or severely underweight, percentage” for Afghanistan, China, India, United States:
3. First efforts at ‘Where Does My Money Go’
Two parts to this project a) getting the data on government revenue/expenditure b) displaying it nicely in a web interface.
Part (a) is encapsulated in a new ukgovfinances dataset:
http://knowledgeforge.net/econ/svn/trunk/econdata/ukgovfinances/
Using this data we have made a (small) start on the web interface:
Imagemagick convert notes
January 12th, 2009
Helping myself remember how to do common things using imagemagick’s (excellent but many-optioned) convert utility.
convert -scale 10% x y
convert -type Grayscale x y
convert -monochrome x y
convert -rotate x y
Data Storage and Transfer Costs: Some Back of the Envelope Estimates
September 2nd, 2008
For large data centres a big industry player estimated costs of £22 / GB / Month = £250k / TB / Year. Majority of this was hardware and energy costs (not costs of human sysadmins). This seems quite a lot. However, Amazon S3 quote for Europe (cheaper for US):
Storage
$0.18 per GB-Month of storage used
Data Transfer
$0.100 per GB - all data transfer in
$0.170 per GB - first 10 TB / month data transfer out
$0.130 per GB - next 40 TB / month data transfer out
$0.110 per GB - next 100 TB / month data transfer out
$0.100 per GB - data transfer out / month over 150 TB
Requests
$0.012 per 1,000 PUT, POST, or LIST requests
$0.012 per 10,000 GET and all other requests*
* No charge for delete requests
Subtracting say £2 for costs of storage and transfer leaves £20 per GB Month = $40 / GBM. On Amazon’s figures this is around 235 GB of transfer (0.235 TB). A ratio of 235 to 1 on the underlying data. Not necessarily an infeasible level (235 users / byte / month). This also demonstrates that b/w costs will dwarf storage costs in most cases.
Talking at OpenTech on Saturday
July 3rd, 2008
I’ll be giving a talk at Open Tech 2008 on Saturday (5th July) about some of the work I do at the Open Knowledge Foundation. The talk is entitled “Opening Data” and its rough subject is indicated by the blurb:
We all want more open data to analyse and mashup be it for urban planning or to better understand 12th Century Canon Law. But how do we go about reaching data ‘Nirvana’? What are the obstacles and why is openness so crucial to getting there? This talk explores these questions touching on some of the more prominent recent developments in the area along the way.
OpenTech/NotCon has been a great experience over the years and this time looks to be no exception.
A new version (v1.2) of my python script for converting markdown to latex is now done. markdown2latex (renamed from mkdn2latex) has been extensively refactored to become a proper python-markdown extension. This means it can be used seemlessly alongside plain markdown conversion, as well as independently whether as a module or, in its classic form, from the command line.
In addition for ease of installation it has also been turned into a proper python package and registered on pypi so you can just do:
$ easy_install markdown2latex
Alternatively you can still get it straight from the repository at:
http://knowledgeforge.net/okftext/svn/trunk/python/markdown2latex/
Some Notes on Converting From Subversion to Mercurial (hg)
May 17th, 2008
Distributed versioning systems (VCMs) have now matured to the point that I’ve been planning to switch from subversion for quite a while — at least for own personal repositories where there are no coordination issues. Having chosen mercurial (hg) as my DVCM of choice the next step was to actually convert. While there is quite a bit of documentation on this topic available online I didn’t always find these had the necessary info. Combined with my experience of several ’snags’ along the way I thought it worth documenting my experience in case it proves useful to others.
I’d waited until hg 0.9.5 was available on my distro precisely because I wanted to use the hg convert functionality (alteratives such as tailor looked to have difficulties and though it turned out I could have used hgsvn without problem my original impression of it had been it was oriented for integration of hg and svn rather than straight conversion).
Before I document the steps it is important to get clear one v. important thing about how hg works:
There is no distinction between a working copy and a repository.
In particular, each repository is a working copy and vice versa. The actual repo is stored inside the working copy at its root in a .hg directory. When you ‘checkout’ (svn terminology) you do so simply by ‘cloning’ an existing repository (or if just want a limited set of changes — e.g. those since you lasted ‘updated’ (svn terminology) you can do a ‘pull’). In fact you could even just make a plain copy that .hg directory and send it to someone — though obviously this might not work so well if you are moving between 2 OSs with different filesystems.
Anyway, the main point to take from this is that the result of an hg convert will simply be a new directory with all the files (the working copy) plus a .hg directory in that directory (the repo).
To convert all you do is::
$ hg convert <svn-repo-or-co> <some-new-directory>
If you get weird errors while trying to convert (e.g. “Unknown Repository type”) then turn on debug to get more info:
hg -v --debug convert SOURCE DEST
(For example, in my case it turned out I hadn’t got the python svn bindings installed …)
The devil however is in the detail:
- svn-repo-or-co can be the uri of a subversion repo or the path of a svn checkout. Where a checkout hg convert will just work out the source repo and pull from there
- Note however hg convert will not move across working copy files themselves. The obvious solution to this is to do the convert and then just move the .hg file across into your svn checkout and delete all the .svn directories (or vice-versa)
- some-new-directory: this is where the new hg repo/working copy with end up.
- After doing hg convert rather surprisingly all of the files in the new hg repository will be listed as ‘?’ (not tracked) when you do a
hg status. To solve this just do ahg update - To speed up conversions it is often worth getting a local copy of the subversion repo (to save pulling lots of stuff over the network connection). To do this either use svnsync or just dump the remote repo and load into a local one (if converting from a working copy you’ll then just need to do a
svn switch --relocate - My repository did not have a branches/tags/trunk layout (instead it has multiples subprojects …). This led to weird errors involving files and directories at the root of the repository which looked like: ‘hg convert abort: path contains illegal component’. I solved this by using the
--filemapoption tohg convertand putting explicit renames of the form:/root-path-1 root-path-1in that file. What do you for all the other working copies once you have converted the repo/your working copy? This is now simple:
- Clone your hg repo to each of the machines with a working copy. (For this purpose you will probably want to make your original hg repo available over the Internet using either ssh or http protocols (for details see mercurial docs).
- Delete all wc files in this new hg repo
- Copy svn working files (just from trunk …) into this hg repo (alternatively do this the other way round and copy the relevant hg stuff .hg, .hgtags etc into the svn working copy …)
Remove .svn directories:
$ find . -name '\.svn' -exec rm -Rf {} \;- Update inside the new hg repo:
$ hg up
svk Sync Bug “Bad URL passed to RA layer: Malformed URL for repository”
October 13th, 2007
I record briefly my experience resolving this issue in case it helps others. As background I note that I use svk to allow local commit and replay for some of the subversion repos I use and over the last week I’d started encountering problems when trying to svk sync on one of these receiving the following error message:
Bad URL passed to RA layer: Malformed URL for repository
The solution to this is the following patch provided by Peter Werner to the svk-devel list a few days ago:
-------------- next part --------------
--- SVN-Mirror-0.73.orig/lib/SVN/Mirror/Ra.pm 2007-03-19 23:59:12.000000000 +0100
+++ SVN-Mirror-0.73/lib/SVN/Mirror/Ra.pm 2007-10-07 08:37:36.000000000 +0200
@@ -168,6 +168,9 @@
$self->{config} ||= SVN::Core::config_get_config(undef, $self->{pool});
$self->{auth} ||= $self->_new_auth;
+ # escape URI (% is already escaped)
+ $arg{'url'} =~ s/([^-_:.%\/a-zA-Z0-9])/sprintf("%%%02X", ord($1))/eg if defined $arg{'url'};
+
SVN::Ra->new( url => $self->{rsource},
auth => $self->{auth},
config => $self->{config},
In addition to this solution below I report the process by which I discovered it. I do this as it provides an interesting case study of the way that open source communities work, and particularly how ‘user-driven bug-fixing’ happens.
- Searching on the web turned up a variety of earlier reports [1][2][3] of this issue which it seemed related to having a spaces in svn url names (see [1.1] and [2] in particular). This seemed plausible as a source of the error as it occurred after someone had added a directory with spaces in it to the repository (a very rare occurrence).
- This issue did not seem to occur for all users and CLK (the maintainer) suggested upgrading SVN::Mirror to 0.73. [2.1]
- This I did but the bug was still there (as other users had noted [2.2]) however the source now seemed to be pinpointed as being in the SVN::Mirror perl module. Unfortunately I’m not a perl hacker …
- Finally a hand search of the svk lists turned up a post from less than a week ago [4] (obviously too recent for Google to have picked up yet as I had earlier done a specific search for the error name over the svk lists …). In addition to reporting the problem this mail provided a 2 liner patch to a specific perl module. I applied this patch, tried svk sync and hey presto! the bug was gone.
The issue progressed from an unconfirmed one whose aetiology was unclear [1], to a confirmed one whose cause was fairly well known [2] (though not its source in code), solutions were suggested and tested by users [2.1, 2.2], the issue remained unresolved for several more months with the fix eventually provided by an independent user to the list [4].
It is also especially noteworthy that much of this tracking down was only possible because the software involved was open enabling users to poke around to see what was wrong. For example, tying the bug to spaces in the underlying repository url resulted from the original reporter of the issue hand-modifying a svn source file so as to make the error message more verbose [1.1] — something which is clearly only possible if the code is open.
- [1]:http://lists.bestpractical.com/pipermail/svk-devel/2007-January/000570.html
- [1.1]:http://lists.bestpractical.com/pipermail/svk-devel/2007-January/000571.html
- [2]: http://lists.bestpractical.com/pipermail/svk-devel/2007-March/000755.html
- [2.1]: http://lists.bestpractical.com/pipermail/svk-devel/2007-March/000757.html
- [2.2]: http://lists.bestpractical.com/pipermail/svk-devel/2007-March/000766.html
- [3]: http://lists.bestpractical.com/pipermail/svk-devel/2007-June/000944.html
- [4]: http://lists.bestpractical.com/pipermail/svk-devel/2007-October/001065.html
A Review of Plotting Libraries for Python
June 5th, 2007
An (ongoing) summary of my experience with some of the utilities available for plotting from a python perspective.
Last updated: 2008-03-06
Ploticus
- (+) Fast, powerful, mature, well-documented
- (-) Not python based
C-based rather than python-based but fast and powerful. There is a (fairly crude) set of python bindings available here: http://www.srcc.lsu.edu/~davids/ploticus_module.html. Alternatively one can just call the ploticus command from a python script.
Matplotlib
- (+) Fairly powerful, mature, well-documented, nice pure python API
- (-) A little slow; requires a backend to be installed (so installation on a server is a problem)
- Could support object-orientation better
PyChart
- (+) Pure python, quite simple to use, good documentation
- (-) Not quite as nice looking or as powerful as e.g. ploticus
Biggles
http://biggles.sourceforge.net/
- last updated: 2004-03-08
- looks fine but does not seem to be actively developed any longer
Example
See: http://home.gna.org/pychart/examples/index.html. This is the bar/line example from there:

from pychart import *
theme.get_options()
data = [(10, 20, 30), (20, 65, 33),
(30, 55, 30), (40, 45, 51),
(50, 25, 27), (60, 75, 30)]
ar = area.T(size = (150,120),
y_grid_interval=10,
x_axis=axis.X(label="X label", label_offset=(0,-7)),
y_axis=axis.Y(label="Y label"),
legend = legend.T(), y_range = (0, None))
ar.add_plot(bar_plot.T(label="foo", data=data),
line_plot.T(label="bar", data=data, ycol=2))
ar.draw()
