Public Domain Calculators Workshop
November 6th, 2009
I’m one of the co-organizers of a workshop on Public Domain Calculators taking place next week, on the 10th and 11th of November, at Emmanuel College, University of Cambridge.
Hosted by the Open Knowledge Foundation in association with the Centre for Intellectual Property and Information Law at the University of Cambridge, it’s a meeting of European experts on copyright and the digital public domain taking place as part of the Communia project.
The purpose of the workshop is to produce materials such as legal flow charts and public domain “algorithms” which will help with the representation of different national copyright laws and the determination of public domain status.
Details of the meeting are as follows:
- When: 10-11th November 2009
- Where: Emmanuel College, Cambridge
- Wiki: http://wiki.okfn.org/PublicDomainCalculators/Meeting
- Participate: Free but space is limited. If you are interested in coming, email the organizers at: info@okfn.org
Background
There is often a tendency to talk of ‘the public domain’ and of works falling out of copyright and ‘into the public domain’ – as though there is a single set of works which are out of copyright all over the world. In fact, of course, there are different national laws about the nature and duration of copyright in different types of works – and hence what is in the public domain is different in different countries.
Efforts are currently underway to build a series of public domain calculators – which will help to determine whether or not a given work is in copyright in a given jurisdiction. At the time of writing groups and individuals in more than 17 jurisdictions are assisting in this effort.
Talk at ATRIP Conference: How Long Should Copyright Last?
September 22nd, 2009
Last week I was at the ATRIP Conference to give an invited talk on “How Long Should Copyright Last?”, based on my paper: Forever Minus a Day? Calculating the Optimal Term of Copyright.
Slides are here, and you can find the text of the accompanying introduction below (I plan to write up the full exposition as a short essay, but that is to come).
Most ATRIP participants were lawyers not economists, so this was an opportunity to do a more non-technical presentation (so no equations!). As with most economics, the fundamentals of calculating copyright term are simple: it is just a demand curve plus “welfare analysis” (a fancy name for adding up social benefits and costs), and shorn of “obfuscating” algebra these matters should be understandable by anyone.
How Long Should Copyright Last: Introduction
Before I begin it is important to note that in considering copyright and its term, we must leave to one side the questions of attribution and integrity — their existence and term can and should be considered quite separately from the ‘economic’ rights that form the core of copyright as it operates today.
This small caveat done, I beg your indulgence for a brief historical excursion. In particular, I ask you to cast your mind back a century and a half and more to the Houses of Parliament in the February of 1841.
[[Picture of Serjeant Talfourd]]
As many of you will be aware Serjeant Talfourd had, by this point, been doggedly pursuing a new copyright act for four years — since 1837. Originally wide in scope the Bill had been narrowed and the attention of both supporters and critics alike had come to focus on a single feature of that Act: the proposed extension in the term of protection. Specifically Talfourd’s Act proposed changing the then rule of 28 years or life (whichever being the longer) to life plus sixty — remarkably close to the life plus 70 of today.
[[Picture of Macaulay]]
By February 1841 Talfourd’s Bill had failed no fewer than 4 times. On its fifth attempt it had reached a second reading and on the fifth of February it came before the House. After a brief introduction by Talfourd, mindful that this was not the first time the matter had been discussed, Thomas Babington Macaulay rose to speak. In a masterly disquisition, both in content and rhetoric, Macaulay set out his opposition to the Bill, and did so so tellingly that the motion was defeated. Talfourd, who lost his seat at the next election and therefore only saw his Bill pass in the hands of another, and in much reduced form, remained forever embittered by Macaulay’s intervention, coming so late and so decisively in the process.
To read Macaulay’s speech, and, for that matter, the views expressed on all sides in that debate, is to be struck by how little has changed.
[[Valenti Picture]]
When Jack Valenti and Mary Bono are found in recent times calling for a term of ‘Forever Minus a Day’ one hears the echoes of Serjeant Talfourd all those years ago, just as one can hear in those who oppose extensions today echoes of the likes of Henry Warburton, a radical politician and vehement opponent of Talfourd, who claimed the extension was “a robbery upon the public” and that copyright ought to be fixed, “only on such a term of years as would prove a sufficient inducement for authors to write good books”.
And the analogy is telling in other ways. Though Talfourd’s Bill was beaten back by a swell of opposition year after year, eventually it was passed (albeit in reduced form and by Lord Mahon), with this success attributable to a persistence made possible not, primarily, by the size, but by the concentration of the interests who sought its passage. Like Fabius Cunctator the proponents of extension, sustained by deep reservoirs of emotional and financial commitment, can afford to wait, able to return, as necessary, again and again, until an opportune moment presents itself for the attainment of their purposes; for the opposition to extension, though broad, is ‘shallow’ and therefore more easily dissipated by distraction and division.
[[Philosophical Differences]]
Even more striking is the similarity in the issues that occupy centre stage in this debate. First, the fundamental ‘philosophical’ question, which colours all of the discussion, of whether we confer copyright because it is a natural right (which should therefore last forever) or for ‘utilitarian’ purposes, that is the public good (in which case it almost certainly should not). Second, descending from these lofty heights of principle, what is the actual effect of copyright? In particular, does it operate to raise price and restrict access, that is, is it a monopoly? And what specifically are the benefits that accrue to the producers of copyrightable works, and what are the costs to the public and others who wish to use and reuse them?
I think it is clear that economists — or any group for that matter — have no great claim to authority on answering this first question of principle, for it seems, ultimately, one of opinion. That said, I would note two points which must raise grave doubts as to the existence of any fundamental natural right from which copyright might spring.
First, term is limited in all jurisdictions. Second, the breadth of copyright’s application in subject matter, quality and ownership. For can we truly convince ourselves that “eternal expressions of the human spirit”, worthy of exclusivity for all time, subsist in an advert for toothpaste; or convince ourselves of the special status of the creator when so much copyright today, perhaps even the majority, is immediately, and indeed often automatically, assigned from the ‘creator’ to a corporation?
However, it is not my intention to enter into this debate any further here. Rather, in the interests of ‘full disclosure’ I wish only to make clear my views — and those of economists generally — on the matter, namely that copyright is not a natural right but is created and maintained for the purpose of promoting and securing the public good, no more, no less. (These are views which can come as no surprise given the nature of this talk — an analysis of term only makes sense if its basis is a utilitarian one!)
[[My Views]]
Furthermore, let me also make clear, right at the outset, my view, and one again I think shared by almost all economists, that copyright is a monopoly. This is not to say that copyright is bad — far from it. But to deny that copyright is a monopoly is to obscure its basic nature and operation — an obscuration that has, furthermore and unfortunately, been most common and attractive to those pursuing copyright’s enlargement.
[[The Big M Word]]
And what is the general tendency of monopoly — to echo Macaulay once again? It is indeed to raise prices and limit access. Now, of course, we may debate the precise extent of these effects, but there can be no denying that the very purpose of copyright’s existence is to confer on a single entity — the copyright holder — the power to control the dissemination, and hence the price, of all instances of a particular good — i.e. all copies of a given work.
This is the very definition of a monopoly and the fact that there may exist other goods, other works, which compete with that one makes no difference: a monopoly of apples is no less a monopoly because one does not control oranges. Of course, the existence and proximity of substitutes will alter the effect of the monopoly, but one must be cautious here: close substitutes may limit the negative effects of the copyright monopoly but they will, for the very same reasons, also limit the gains (those increased revenues for the copyright holder).
Returning then to Macaulay whose expression of the matter I cannot better:
[[Macaulay Again]]
“It is good that authors be remunerated; and the least exceptionable way of remunerating them is by a monopoly. Yet monopoly is evil. For the sake of the good we must submit to the evil, but the evil ought not to last a day longer than is necessary for securing the good.”
Our task then is to answer the implicit question: how long should copyright last (so as to not be a day longer than is needed)? More specifically what are the degrees of benefit and harm created by copyright’s monopoly and at what level should term be set to achieve the most advantageous balance of the two?
The Dissemination of Scholarly Information: Journals, Open-Access and Distributed Filtering
July 20th, 2009
Current methods of disseminating scholarly information focus on the use of journals which retain exclusive rights in the material they publish. Recently there has been increasing dissatisfaction with this model, with suggestions for alternative approaches such as “Open Access”.
Together with a colleague (Omar Al-Ubaydli) I’ve been working to explore the reasons for the development of the traditional journal model, why it is no longer efficient and how it could be improved upon. We’re particularly interested in going beyond the basic question of distribution (access) to that of filtering, i.e. the process of matching information with the scholars who want it.
With the volume of information production ever growing, and attention ever more scarce, filtering is becoming crucial. Digital technology offers us some radically new possibilities. In particular, distribution and filtering can be separated, which in turn allows filtering to be decentralized and distributed: a model which promises dramatic increases in transparency, innovation and efficiency.
Below is an overview of our analysis with the full version of the current paper here: http://rufuspollock.org/economics/papers/scholars_and_journals.pdf
Overview
It is crucial to the progress of any domain of scholarship that those engaged therein are able to communicate their discoveries and activities to others. As such a variety of systems and institutions have been developed in order to support ‘scholarly communication’ in one form or another, ranging from personal letters to physical meetings. In recent times, the growth of scholarship, combined with its increasing geographical dispersion, has resulted in the centrality of the written word and its dissemination via ‘journals’. In this paper we consider the purposes of any system of scholarly communication and examine the current academic journal system in light of them. This examination highlights several deficiencies and also suggests various possible improvements.
When thinking about the possible mechanisms of scholarly communication it is useful to specify in more detail the criteria against which they should be measured. That is, to put it more succinctly, what do we want a good mechanism for scholarly communication to do? In particular, when we say communicate we must ask ourselves what, to whom, in what form, etc etc. For it is clear that when we talk of communication we usually mean more than the simple transmission of a piece of information. In fact, today, with so much scholarship available, the challenge may often not lie in the transmission from the author to the reader but in the matching of authors and readers — the decision of ‘what to read’. This growing focus on choice is a natural one in a world where time and attention are limited and the amount of scholarship available is ever increasing. As such it suggests that there are at least two distinct functions performed by a system of scholarly communication:
- Distribution — getting information from authors to readers (and back again)
- Selection (filtering) — deciding what to distribute and to whom
In appreciating this distinction it is illuminating to consider how practice has changed over time. Originally communication between scholars, at least in written form, primarily took the form of letters between the individuals involved. As such, the two activities of distribution and filtering would be almost completely identical. Then, as the number of authors and readers grew this became infeasible and dedicated journals would be created which would then disseminate to their particular readers a selection of what was submitted to them. Thus, what was once a direct peer-to-peer relationship became mediated by a new institutional form: the academic journal — though of course journals were often run by the very readers and authors who used them. Finally, today, thanks to digitization and the Internet peer-to-peer is once again a possibility though with important differences: unlike in the past, where a letter writer chooses the recipient, the modern peer-to-peer approach more resembles journals in that the author and reader act independently — the author uploads or publishes his/her work to a repository entirely separately from the reader finding, downloading and reading it. This last discussion suggests breaking down our original two categories a little further:
- ‘Making available’ — publishing material
- Discovery — finding out what is available
- Choice — choosing from what is available
- Reading — getting access to the material (in the form required)
Here, the first and fourth items would come under the ‘distribution’ heading while the second and third would come under ‘selection’. In addition we should mention two other functions performed by such a system, both of which relate to selection: a) improvement of work via peer-review (distinct from the filtering process itself); b) ‘quality signalling’ whereby the selection of work helps signal the quality of its creators which in turn is important for the purpose of resource allocation (jobs, grants etc) within the scholarly community.
With these added to the list we now have a good number of separate goals which a scholarly communication mechanism may seek to satisfy. The next stage is to consider how the current system, largely based on academic journals, fares in respect of them.
Goals, Instruments and the Current Journal System
It is well known that in order to fully address a given number of (independent) goals one needs an equal number of instruments. For example, if one is seeking to address both congestion and pollution in relation to road-traffic, a single instrument such as petrol taxes, will be insufficient.
Here too there are multiple independent goals, most notably distribution and selection (matching). These are clearly distinct goals and require distinct instruments for their achievement, but journals are a single instrument which combines distribution and filtering in one mechanism.
Originally, the restrictions of reproduction and distribution technologies meant journals were the best instrument available. Today, with the advent of the computer and the Internet, this is no longer true: distribution (the uploading and downloading) can be done by almost anyone and quite separately from recommendations and rating of that material.
As such, the traditional journal system is becoming a serious constraint, particularly in its closed access form. There are two distinct aspects of this constraint. First, on the distribution side, journals delay and restrict access as a result of higher prices arising either from simple monopoly control or the costs of the (inefficient) selection mechanism the traditional model necessitates. Second, on the selection side, the forced combination of selection and distribution and the associated monopoly control of content greatly limit the efficiency (and utility) of the selection and filtering processes used to match authors and readers together.
Unfortunately, the two-sided nature of the journal market (based on expectations), combined with the current evaluation structure of academia, continues to lock society into this inefficient restriction. Open-access journals are an important part of improving the current situation. However, as we discuss below, they are only a first step: in order to reap the full benefits of new technology we must move away from the traditional ‘journal’ model to a system that allows for full separation between the distribution and selection operations.
The Technological Origins of Modern Inefficiency
At this point it is worth considering in a little more detail why restricted-access journals originally came about. The answer lies in the nature of the technology available in earlier periods to manage distribution (printing and transmission). When many journals were originally started the cost of transmitting information was very high and journals acted as a club good by which the costs of reproduction and distribution could be (efficiently) shared (the efficiency arising here from economies of scale).
At the same time, given the limited ‘bandwidth’ it was natural for journals to take on some filtering role in order to economize on the scarce distribution capacity. In this situation, dissemination is limited and with only one instrument available (journals), it is natural to tie dissemination and filtering together (with filtering in many ways secondary). Once filtering is being done it is natural for journals to ‘tie’ material to the journal explicitly via copyright — though at an early stage given the scale economies of journals this explicit tying was not actually necessary and was probably done for simple legal convenience.
With the advent of digital communications, in particular the Internet, bandwidth is no longer scarce. What is now scarce is attention. In this setup the importance of a journal is not its role in efficiently sharing reproduction and distribution costs but its role as a filtering mechanism. However, there is now a problem: when distribution is central it is natural to ‘add-in’ filtering, but when filtering is central it is not natural, or necessary, to tie distribution to it. In fact it seems clear that distribution and filtering can be done entirely separately (there are potentially lots of ways for you to download my paper quite separate from getting it from a journal, and lots of ways to do matching and filtering other than by journal editors and reviewers). The Open Access movement can be seen as largely about achieving this separation: with open access there is no longer a connection between access/distribution (which would be free) and the filtering mechanism (the choice of which articles go in a particular journal).
That said, the ‘Open Access’ movement still has a large focus on journals (albeit open-access ones). This, in our view, is a mistake. Technology has also affected possibilities for filtering. In particular it is no longer clear why the centralized mechanism of official peer-review and journals is superior to alternative decentralized options. The last decade has witnessed widespread, and often successful, experimentation with distributed voting and evaluation mechanisms (for example Slashdot’s story-ratings and Google’s link-based site rankings).
Thus, to be more radical, it makes sense not only to remove centralized control of distribution but also centralized control of filtering. A more distributed (market-like?) filtering mechanism would permit the same freedom (and same status) for reviewing and recommendation as exists in the production of scholarly information. At the same time it would deliver greater transparency and, by permitting ‘free-entry’ in filtering, would permit greater specialization, greater diversity, increased participation and the increasing efficiency flowing from greater competition.
As such, the gains from going ‘open’ are not simply wider access, but a reduction in the time and energy scholars spend finding and processing research information. This second item, which is less frequently mentioned in discussions of ‘Open Access’, may well be the more significant.
Size of the Public Domain II
July 16th, 2009
This follows up my previous post. Here we are going to calculate public domain numbers based directly on authorial birth/death date information rather than on guesstimated weightings. We’re going to focus on the Cambridge University Library (CUL) data we used previously.
| Pub. Date | Total | No Author | Any Date | Death Date |
|---|---|---|---|---|
| 1870-1880 | 50564 | 6634 (13%) | 23016 (45%) | 21876 (43%) |
| 1880-1890 | 66857 | 8225 (12%) | 31135 (46%) | 28570 (42%) |
| 1890-1900 | 66883 | 8733 (13%) | 32169 (48%) | 28971 (43%) |
| 1900-1910 | 70360 | 8594 (12%) | 35401 (50%) | 29922 (42%) |
| 1910-1920 | 60489 | 7722 (12%) | 31336 (51%) | 24608 (40%) |
| 1920-1930 | 78670 | 9023 (11%) | 44219 (56%) | 32658 (41%) |
| 1930-1940 | 90576 | 11004 (12%) | 46849 (51%) | 29372 (32%) |
| 1940-1950 | 72692 | 7638 (10%) | 36495 (50%) | 22155 (30%) |
Table 1: PD Relevant Information Availability
Table 1 presents a summary of how much relevant information is available for items (books) of particular vintages in the CUL catalogue — we only show data from 1870 to 1950 on the presumption that (almost) all pre-1870 publications are PD (their authors would have had to live for more than 70 years post-publication for this not to be the case) and almost all publications post 1950 are in copyright today (their authors would have to have died before 1940 for this not to be the case).
As the table shows, at best only just over 40% of items have a recorded authorial death date and extending to include birth dates only raises this proportion to, at best, the low-to-mid fifties. Taking account of items which lack any associated author raises these figures somewhat further to around 60%, though we should note that the reason for the lack of an associated author is not clear: is it because they are genuinely anonymous or simply because the information has not been recorded? Thus, even for the earliest items listed a large proportion of items (50% or more) lack the necessary information for direct computation of public domain status.
At the same time, we can take some heart, and some interesting facts, from this table. First, a reasonable proportion, amounting to many thousands of items, did have associated death dates. Second, at least for older items, the majority of those with any date had a death date (95% for 1870-1880 and still at over 70% for 1920-1930). Third, and this is a more general observation, proportions were surprisingly constant over time. For example, the proportion of ‘anonymous’ items lies in a narrow band between 10% and 13% across all of the periods. Similarly the proportion of items with any date information ranged only from 45% to 56%. At the same time, and reassuringly, though the proportion with death dates is relatively constant for the oldest periods, in the more recent ones it falls substantially, as one would expect given that some of the authors from those more recent eras are still alive.
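The proportions quoted above can be re-derived from the counts in Table 1. The following is only a small sketch: the counts are copied from the table, while the helper name `pct` and the choice to truncate (rather than round) percentages are assumptions, the latter made because it appears to reproduce the figures as printed.

```python
# Counts copied from Table 1: (decade, total, no_author, any_date, death_date)
TABLE_1 = [
    ("1870-1880", 50564, 6634, 23016, 21876),
    ("1880-1890", 66857, 8225, 31135, 28570),
    ("1890-1900", 66883, 8733, 32169, 28971),
    ("1900-1910", 70360, 8594, 35401, 29922),
    ("1910-1920", 60489, 7722, 31336, 24608),
    ("1920-1930", 78670, 9023, 44219, 32658),
    ("1930-1940", 90576, 11004, 46849, 29372),
    ("1940-1950", 72692, 7638, 36495, 22155),
]


def pct(part, whole):
    """Whole-number percentage, truncated (an assumption about the tables)."""
    return 100 * part // whole


# Share of items with no author, and with any date, out of all items.
anonymous_share = {d: pct(no_author, total) for d, total, no_author, _, _ in TABLE_1}
any_date_share = {d: pct(any_date, total) for d, total, _, any_date, _ in TABLE_1}

# Share of dated items that also carry a death date.
death_given_date = {d: pct(death, any_date) for d, _, _, any_date, death in TABLE_1}

for decade in anonymous_share:
    print(decade, anonymous_share[decade], any_date_share[decade], death_given_date[decade])
```

Under these assumptions the ‘anonymous’ share stays between 10% and 13% and the any-date share between 45% and 56% in every decade, which is the constancy noted above.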
| Pub. Date | Total | PD | Not PD | ? | Prop 1 | Prop 2 |
|---|---|---|---|---|---|---|
| 1870-1880 | 50565 | 22157 (43%) | 68 (0%) | 28340 (56%) | 99% | 96% |
| 1880-1890 | 66858 | 28325 (42%) | 649 (0%) | 37884 (56%) | 97% | 90% |
| 1890-1900 | 66884 | 26723 (39%) | 2418 (3%) | 37743 (56%) | 91% | 83% |
| 1900-1910 | 70362 | 24032 (34%) | 5838 (8%) | 40492 (57%) | 80% | 67% |
| 1910-1920 | 60491 | 16200 (26%) | 8306 (13%) | 35985 (59%) | 66% | 51% |
| 1920-1930 | 78671 | 16127 (20%) | 16351 (20%) | 46193 (58%) | 49% | 36% |
| 1930-1940 | 90583 | 8973 (9%) | 20835 (23%) | 60775 (67%) | 30% | 19% |
| 1940-1950 | 72696 | 5000 (6%) | 19316 (26%) | 48380 (66%) | 20% | 13% |
Table 2: PD Status by Decade. ‘?’ indicates items where PD status could not be computed. Prop(ortion) 1 equals total PD divided by total for which status could be computed (sum of total PD and Not PD). Prop(ortion) 2 equals total PD divided by number of items for which any author date was known (‘Any Date’ in previous table).
Table 2 reports the results of direct computation of PD status based on the information available. Note that, in doing these computations, we have augmented the basic life plus 70 rule with the additional assumptions that a) all items published in 1870 or before are PD b) no author is older than 100 (so if a birth date is more than 170 years ago the item is PD) c) every author lives at least until 30 (so that any work published by an author born less than 100 years ago is automatically not PD).
As is to be expected, for the majority of the periods, the availability of PD status (either PD or Not PD) closely tracks the availability of death date information: the total for which PD status can be determined (the sum of PD and Not PD) almost exactly equals the total for which death date information is available. It is only in the last period, 1940-1950, that the birth date appears to make any contribution. More interesting is how the numbers PD and Not PD vary over time, especially relative to each other (and as a proportion of the records for which any date is available).
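As a rough illustration of the rule just described, here is a minimal sketch in Python. It is not the code used to produce Table 2: the function name `pd_status`, the order in which the checks are applied, the reference year of 2009 and the strict treatment of the 70-year boundary are all assumptions made for the example.

```python
def pd_status(pub_year, birth_year=None, death_year=None, now=2009, term=70):
    """Return "PD", "Not PD" or "?" for a single catalogue record.

    Assumed order of checks (the post does not specify one):
    publication date, then death date, then birth date.
    """
    # a) everything published in 1870 or before is treated as PD
    if pub_year is not None and pub_year <= 1870:
        return "PD"

    # Basic life-plus-70 rule when a death date is recorded.
    # Strict inequality is an assumption: death exactly 70 years ago
    # is treated as still in copyright.
    if death_year is not None:
        return "PD" if now - death_year > term else "Not PD"

    if birth_year is not None:
        years_since_birth = now - birth_year
        # b) no author is older than 100, so a birth more than
        #    100 + 70 = 170 years ago implies PD
        if years_since_birth > 100 + term:
            return "PD"
        # c) every author lives to at least 30, so a birth less than
        #    30 + 70 = 100 years ago implies Not PD
        if years_since_birth < 30 + term:
            return "Not PD"

    # Not enough information to decide.
    return "?"


print(pd_status(1900, death_year=1930))   # death date known
print(pd_status(1945, birth_year=1920))   # birth date only, recent
print(pd_status(1900, birth_year=1860))   # birth date only, inconclusive
```

Records with only a birth date between 100 and 170 years before the reference year fall through to "?", which fits the observation that birth dates contribute little outside the most recent period.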
These two proportions/ratios are recorded in the last two columns which record, respectively: 1) the PD total relative to the number of items for which any status could be computed (i.e. the sum of PD and Not PD) 2) the PD total relative to the total number of items for which any date information is available. These ratios change dramatically over the periods shown: starting in the 1870-1880 period in the high 90%s by the 1940s they are down to 20% or below.
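For concreteness, the two ratios can be recomputed from the published counts. This is a sketch only: the PD and Not PD counts are copied from Table 2 and the ‘Any Date’ counts from Table 1; the function name `proportions` and the use of truncation to whole percentages are assumptions (truncation appears to match the printed columns).

```python
# (decade, PD, Not PD, Any Date); first two from Table 2, last from Table 1
ROWS = [
    ("1870-1880", 22157, 68, 23016),
    ("1880-1890", 28325, 649, 31135),
    ("1890-1900", 26723, 2418, 32169),
    ("1900-1910", 24032, 5838, 35401),
    ("1910-1920", 16200, 8306, 31336),
    ("1920-1930", 16127, 16351, 44219),
    ("1930-1940", 8973, 20835, 46849),
    ("1940-1950", 5000, 19316, 36495),
]


def proportions(pd, not_pd, any_date):
    """Return (Prop 1, Prop 2) as truncated whole percentages."""
    prop1 = 100 * pd // (pd + not_pd)   # PD / records with a computable status
    prop2 = 100 * pd // any_date        # PD / records with any author date
    return prop1, prop2


computed = {decade: proportions(pd, not_pd, any_date)
            for decade, pd, not_pd, any_date in ROWS}

for decade, (prop1, prop2) in computed.items():
    print(f"{decade}: Prop 1 = {prop1}%, Prop 2 = {prop2}%")
```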
| Pub. Date | % PD |
|---|---|
| 0000-1870 | 100 |
| 1870-1880 | 95 |
| 1880-1890 | 90 |
| 1890-1900 | 85 |
| 1900-1910 | 65 |
| 1910-1920 | 40 |
| 1920-1930 | 25 |
| 1930-1940 | 10 |
| 1940-1950 | 6 |
| 1950-Now | 0 |
Table 3: Suggested PD Proportions
The key question for us is how to extrapolate these PD proportions to the full set of records — i.e. from the set of records for which there is the necessary birth/death date information to that where there is not. The simplest, and most obvious, approach is to assume that the proportions are identical and therefore that the PD proportions calculated on the partial dataset apply to the whole. However, there are some obvious deficiencies in this approach.
In particular, our ability to compute a PD status is largely linked to the existence of a death date and it is likely that the presence of this information is itself correlated with authorial age — after all a death date can only exist once that person has died! This correlation, and the bias it gives rise to, is probably small in the early periods — the authors of any pre 1930 work are almost certainly no longer alive today. However, for the later periods, the bias may be more substantial — it is in these last two periods (1930-1940 and 1940-1950) that there is a significant reduction in the number of records with a death date and (relatedly) a significant increase in the number of records for whom the PD status is unknown.
Thus, in converting the partial PD proportions to full PD proportions it seems sensible to revise down somewhat the partial figures, with the revision being greater in later periods. Moreover, we have a lower bound for any downwards revision provided by the total PD as a proportion of all records, which even in the 1940-1950 period stood at 6%. In light of these considerations Table 3 gives fairly conservative figures for PD proportions for use when estimating PD size based on publication dates. Interestingly, even with our conservative assumptions, these proportions are rather higher than those used in our previous analysis.
The Size of the Public Domain
June 12th, 2009
This post continues the work begun in this earlier post on “Estimating Information Production and the Size of the Public Domain”. Update: 2009-07-17 there is now a follow-up post.
Having already obtained estimates of the number of items (publications) produced each year based on library catalogue data our next step is to convert this into an estimate of the “size” of the public domain. (NB: as already discussed, “size” could mean several different things. Here, at least to start with, we’re going to take the simplest and crudest approach and equate size with number of publications/items.)
The natural, and most obvious, approach here is to go through our 1 million+ items and compute their public domain status (as discussed in this earlier post). Unfortunately, as detailed there, this is problematic because we often have insufficient information in library catalogues with which to compute PD status with certainty — in particular, author death dates are frequently absent. Thus, it will be necessary to fall back on some approximate method.
For example, we can base PD status on simple publication dates: if a book was published, say, 140 years ago it is very likely it is in the public domain, since for it to be in copyright its author must have lived more than 70 years after the book came out (remember copyright lasts for life plus 70 years in the EU)! Conversely, any publication less than 70 years old is almost certainly not in the public domain. For periods in between we can assume some proportion of publications are PD, starting close to zero for more recent items and rising towards one for older ones. A calculation along those lines is provided in the following table:
| Start | End | Items | % PD | Number PD |
|---|---|---|---|---|
| 1400 | 1870 | 389291 | 100 | 389291 |
| 1870 | 1880 | 50564 | 95 | 48035 |
| 1880 | 1890 | 66857 | 90 | 60171 |
| 1890 | 1900 | 66883 | 80 | 53506 |
| 1900 | 1910 | 70360 | 50 | 35180 |
| 1910 | 1920 | 60489 | 30 | 18146 |
| 1920 | 1930 | 78670 | 10 | 7867 |
| 1930 | 1940 | 90576 | 5 | 4528 |
| Total | | 873690 | 71 | 616724 |
Number of UK Public Domain Publications (Based on Cambridge University Library Catalogue Data)
So, based on the assumptions regarding PD proportions given in the table, there are somewhat over 600 thousand PD books according to the holdings of Cambridge University Library (of which nearly two-thirds, approx 390k, are from before 1870). The British Library dataset is approx 4x as big as Cambridge University Library and the numbers scale up roughly proportionately, giving a total of over 2.4 million items.
Of course this is a fairly crude approach based purely on publication date and it could be improved in a variety of ways, most notably by using the authorial birth date information which is usually present in catalogue data (we can also use death date information where present). This will be the subject of the next post. (2009-07-17 the post is up here).
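The arithmetic behind the table is simple enough to sketch in a few lines of Python. The band boundaries, item counts and assumed PD percentages are copied from the table; the function name `estimate_pd` is hypothetical, and rounding each band down to a whole number of items is an assumption that happens to reproduce the printed ‘Number PD’ column. The 4x scaling factor is the rough British Library multiple mentioned above.

```python
# (start, end, items, assumed % PD), copied from the table
BANDS = [
    (1400, 1870, 389291, 100),
    (1870, 1880, 50564, 95),
    (1880, 1890, 66857, 90),
    (1890, 1900, 66883, 80),
    (1900, 1910, 70360, 50),
    (1910, 1920, 60489, 30),
    (1920, 1930, 78670, 10),
    (1930, 1940, 90576, 5),
]


def estimate_pd(bands):
    """Return (total items, estimated PD items) across all bands."""
    total_items = sum(items for _, _, items, _ in bands)
    # Round each band down to whole items (assumption, matches the table).
    total_pd = sum(items * pct // 100 for _, _, items, pct in bands)
    return total_items, total_pd


total_items, total_pd = estimate_pd(BANDS)
overall_share = total_pd / total_items
bl_estimate = 4 * total_pd  # crude scaling to the British Library dataset

print(total_items, total_pd, round(overall_share, 2), bl_estimate)
```

Changing the percentages in `BANDS` makes it easy to see how sensitive the total is to the assumed proportions, which is the main weakness of this publication-date approach.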
Here we’re going to look at using library catalogue data as a source for estimating information production (over time) and the size of the public domain.
Library Catalogues
Cultural institutions, primarily libraries, have long compiled records of the material they hold in the form of catalogues. Furthermore, most countries have had one or more libraries (usually the national library) whose task included an archival component and, hence, whose collections should be relatively comprehensive, at least as regards published material.
The catalogues of those libraries then provide an invaluable resource for charting, in the form of publications, levels of information production over time (subject, of course, to the obvious caveats about coverage and the relationship of general “information production” to publications).
Furthermore, library catalogue entries record (almost) the right sort of information for computing public domain status, in particular a given record usually has a) a publication date b) unambiguously identified author(s) with birth date(s) (though unfortunately not death date). Thus, we can also use this catalogue data to estimate the size of the public domain — size being equated here to the total number of items currently in the public domain.
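To make concrete how those fields bear on PD status, here is a rough Python sketch for a single-author record under a life-plus-70 rule. It is an illustration only: the function name `pd_status`, the three-way result and the `MAX_LIFESPAN` cut-off of 100 years are all assumptions made for the example, and it also assumes the author was alive when the item was published.

```python
from typing import Optional

TERM_YEARS = 70      # life plus 70 years, as in the EU
MAX_LIFESPAN = 100   # assumption: no author lives beyond this age


def pd_status(pub_year: int, as_of: int,
              birth_year: Optional[int] = None,
              death_year: Optional[int] = None) -> str:
    """Return 'pd', 'not_pd' or 'unknown' for a single catalogue record."""
    if death_year is not None:
        # With a death date the answer is exact: the term runs to the end
        # of the 70th year after the author's death.
        return "pd" if as_of > death_year + TERM_YEARS else "not_pd"
    if birth_year is not None and as_of > birth_year + MAX_LIFESPAN + TERM_YEARS:
        # Even the longest plausible life would have ended over 70 years ago.
        return "pd"
    if as_of <= pub_year + TERM_YEARS:
        # Assumes the author was alive at publication, so the term cannot have expired.
        return "not_pd"
    return "unknown"


print(pd_status(1850, 2009, birth_year=1800))
print(pd_status(1900, 2009, birth_year=1860))
```

The large "unknown" middle ground is exactly why some approximate method is needed when death dates are missing.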
Results
To illustrate, here are some results based on the catalogue of Cambridge University Library, which is one of the UK’s “copyright libraries” (i.e. it has a right to obtain, though not an obligation to hold, one copy of every book published in the UK). This first plot shows the number of publications per year, based on the publication date recorded in the catalogue, up until 1960 (when the dataset ends).
A major concern when basing an analysis on these kinds of trends is that fluctuations over time may derive not from changes in underlying production and publication rates but from changes in the acquisition policies of the library concerned. To check for this, we present a second plot which shows the same information but derived from the British Library’s catalogue. Reassuringly, though there are differences, the basic patterns look remarkably similar.

Number of items (books etc) Per Year in the Cambridge University Library Catalogue (1600-1960).

Number of items (books etc) Per Year in the British Library Catalogue (1600-1960).
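For readers curious how series like these are put together, the counting step is simple once the publication years have been extracted from the catalogue. The Python sketch below is illustrative only and is not the project's actual code: the function names and the toy `sample` list are invented for the example, and the real work lies in parsing and cleaning the records beforehand.

```python
from collections import Counter


def items_per_year(pub_years, start=1600, end=1960):
    """Count items per publication year, filling years with no items with 0."""
    counts = Counter(y for y in pub_years if start <= y <= end)
    return {year: counts.get(year, 0) for year in range(start, end + 1)}


def moving_average(series, window=10):
    """Trailing moving average over a {year: count} series."""
    years = sorted(series)
    averaged = {}
    for i in range(window - 1, len(years)):
        chunk = years[i - window + 1 : i + 1]
        averaged[years[i]] = sum(series[y] for y in chunk) / window
    return averaged


# Toy input: publication years as they might come out of parsed records.
sample = [1600, 1600, 1601, 1959, 1300]
series = items_per_year(sample)
smoothed = moving_average(series)
```

Smoothing with a 10-year window makes the long-run trend easier to read against year-to-year noise.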
What do we learn from these graphs?
- In total there were over a million “Items” in this dataset (and parsing, cleaning, loading and analyzing this data took on the order of days — while the preparation work to develop and perfect these algorithms took weeks if not months)
- The main trend is a fairly consistent, and approximately exponential, increase in the number of publications (items) per year. At the start of our time period in 1600 we have around 400 items a year in the catalogue while by 1960 the number is over 16000.
- This is a forty-fold increase and corresponds to an annual growth rate of approx 0.8%. Assuming “growth” began only around the time of the industrial revolution (~ 1750) when output was around 1000 (10-year moving average) gives a fairly similar growth rate of around 0.89%.
- There are some fairly noticeable fluctuations around this basic trend:
- There appears to be a burst in publications in the decade or decade and a half before 1800. One can conjecture several, more or less intriguing, reasons for this: the cultural impact of the French revolution (esp. on radicalism), the effect of loosening copyright laws after Donaldson v. Beckett, etc. However, without substantial additional work, for example to examine the content of the publications in that period, these must remain little more than conjectures.
- The two world wars appear dramatically in our dataset as sharp dips: the pre-1914 level of around 7k+ falls by over a third during the war to around 4.5k and then rises rapidly again to reach, and pass, 7k per year in the early 20s. Similarly, the late 1930s level of around 9.5k per year drops sharply upon the outbreak of war reaching a low of 5350 in 1942 (a drop of 45%), and then rebounding rapidly at the war’s end: from 5.9k in 1945 to 8k in 1946, 9k in 1947 and 11k in 1948!
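The growth rates mentioned in the list above are compound annual rates. As a hedged illustration, the Python sketch below applies the standard compound-growth formula to the raw endpoint values quoted there; the function name `annual_growth_rate` is made up for the example. On raw endpoints the formula gives roughly 1.0% and 1.3% a year, somewhat above the figures quoted, which suggests those were derived from smoothed values or a fitted trend rather than raw endpoints (that is an assumption; the method is not spelled out here).

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoint values as quoted in the list above.
rate_from_1600 = annual_growth_rate(400, 16000, 1960 - 1600)
rate_from_1750 = annual_growth_rate(1000, 16000, 1960 - 1750)

print(f"{rate_from_1600:.2%} per year from 1600")
print(f"{rate_from_1750:.2%} per year from 1750")
```

Because endpoint-based rates are sensitive to the particular years chosen, a rate fitted to the whole series is usually the more robust summary.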
To do next (but in separate entries — this post is already rather long!):
- Estimates for the size of the public domain: how many of those catalogue items are in the public domain
- Distinguishing Publications (“Items”) from “Works”, i.e. production of new material versus the reissuance of old (see previous post for more on this).
Colophon: Background to this Research
I’m working on an EU-funded project on the Public Domain in Europe, with particular focus on the size and value of the public domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?
I’ve also been involved for several years in Public Domain Works, a project to create a database of works which were in the public domain.
Colophon: Data and Code
All the code used in parsing, loading and analysis is open and available from the Public Domain Works Mercurial repository. Unfortunately, the library catalogue data is not: library catalogue data, at least in the UK, appears to be largely proprietary, and the raw data kindly made available to us for the purposes of this research by the British Library and Cambridge University Library was provided only on a strictly confidential basis.
Empirical Assessment of Impact of DRM on Exceptions and Limitations by Patricia Akester
May 7th, 2009
Patricia Akester, a colleague of mine in the Centre for Intellectual Property and Information Law has just published the results of her recent research in the form of a 208 page report entitled Technological accommodation of conflicts between freedom of expression and DRM: the first empirical assessment.
There has been a lot of debate as to whether DRM/TPM can be used to go ‘beyond copyright’ and restrict legitimate uses of copyrighted material, but little empirical work. Patricia’s work is therefore very valuable in providing the first systematic empirical data that we can use to assess what is going on. Here I’ll let her conclusions speak for themselves, but I strongly encourage readers to take a look at the study itself via the above link:
[From p. 99-100] This project looked at the impact of DRM on the ability of users to take advantage of certain exceptions to copyright. Based on a series of interviews with key organisations and individuals, involved in the use of copyright material and the development and deployment of DRM, this study examined how these issues are working out in practice. While the nightmarish vision of digital lock up has not materialised, this survey concluded, nevertheless, that significant problems do exist, and others can readily be foreseen:
- Although DRM has not impacted on many acts permitted by law, certain permitted acts are being adversely affected by the use of DRM;
- This is in spite of the existence of technological solutions (enabling partitioning and authentication of users) to accommodate those permitted acts (privileged exceptions);
- Beneficiaries of privileged exceptions who have been prevented from carrying out those permitted acts (because of the employment of DRM) have not used the complaints mechanism set out in UK law;
- Article 6(4) of the Information Society Directive put an onus on content owners to accommodate privileged exceptions voluntarily. Voluntary measures have emerged in the publishing field, but not all content owners are ready to act unless they are told to do so by regulatory authorities.
These four conclusions will be explained in more detail and this will be followed by proposed solutions and recommendations.
European Parliament Votes on Term Extension: The Result
April 24th, 2009
Yesterday, the European Parliament voted on the term extension proposal.
Unfortunately, though opposition was substantial, it was not enough to prevent the modified (70-year) extension passing:
- Amendment in favour of the rejection: 222 IN FAVOUR, 370 AGAINST, 10 ABSTENTION
- Key amendment to ensure benefits only to performers: rejected (no roll-call vote so numbers unknown)
- All other good amendments (no ex-post, lifetime of performer only): rejected (~150 in favour 400 against)
Final vote: 317 in favour 178 against 37 abstention
Though this is a depressing result, it is not yet the end of the matter by any means: the Council has not yet resolved its position and there is a possibility of a second reading.
The level of opposition was also impressive given that there was strong support for the extension not only from the rapporteur (Mr Crowley), but also from the main political groupings (EPP and PSE) led by their shadow rapporteurs Mr Toubon and Ms Gill respectively (on a fairly obscure issue such as this most MEPs will have little time to scrutinize the matter and will usually follow the “party line” as determined by the party rapporteur and coordinator for that dossier).
European Parliament Votes on Copyright Term Extension Tomorrow
April 22nd, 2009
Tomorrow, the European Parliament will vote on the issue of copyright term extension for sound recordings, known in Parliamentese as “the Crowley Report (A6-0070/2009) on the Term of protection of copyright and related rights” (Mr Brian Crowley is the rapporteur for this report and a strong supporter of the extension).
Extending term would be a tragic mistake and a blatant example of special-interest lobbying winning out over the interests of society as a whole.
Let us therefore hope that the proposal is rejected.
That’s the line being taken by some right-thinking MEPs, including Eva Lichtenberger (Greens), Sharon Bowles (ALDE), Andrew Duff (ALDE), Zuzana Roithova (EPP), Christofer Fjellner (EPP) and Guy Bono (PSE), who have put forward a rejection amendment (see their excellent justification below). But they need all the support they can get, and remember: it is never too late to act.
Rejection Amendment Justification
The draft Directive is poorly conceived and disproportionate. The Commission claims that the measure is needed in order to benefit poor performers. However, the proposed regulation and procedure is complicated and over-bureaucratic. The biggest beneficiaries will be the four largest record companies. Individual performers will only receive very small amounts each.
Performers could be helped much more effectively by regulating copyright contracts and collecting societies, by setting up appropriate social security and insurance schemes, and by reconsidering remuneration rights and license tariffs.
The draft Directive leaves a large number of questions unanswered. Additional impact assessments are needed to see which measures are best suited to help those performers really in need, to limit the negative impact on consumers and jobs, and to establish if regulation is best done at state or EU level. In these circumstances, it is not wise to proceed to make the long-term permanent changes proposed.
Some of the particular problems are:
- The extension of copyright to 95 or even 70 years will increase the revenue of trust funds of deceased performers instead of living performers.
- Many performers cannot produce proof for the performances they participated in during the past decades. It then becomes difficult to assess their rights to payments.
- The proposed regulation could cause legal uncertainty for all existing audiovisual productions as it will be unclear if the material used is subject to sound copyright.
- There is a risk that all material that is not commercially viable will not be marketed by the copyright owners and will become inaccessible for public use.
- Small record companies currently publishing copyright-free material risk going bankrupt.
Public Domain in Europe (EUPD) Research Project
May 26th, 2008
I’m part of a team, led by Rightscom, which has won a bid to do a major analysis of the scope and nature of the public domain in Europe for the European Commission. As it says in the announcement:
We will assemble quantitative and qualitative data and produce a methodology for measuring the public domain which can be used and refined for future studies both within Europe and further afield. The objectives of the report are fourfold:
- To estimate the number of works in the public domain in the EU and calculate approximately the levels and ways of use and main users of published works
- To estimate the current economic value of public domain works and estimate the value of works that in the next 10-20 years are to be released into the public domain and determine any change in its value whilst under copyright and once it is in the public domain
…
For my part, I’m going to be particularly focused on the size and value questions. This will involve getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?
