In doing research for the EU Public Domain project (as here and here) we are often handling large datasets, for example one national library’s list of pre-1960 books stretched to over 4 million items. In such a situation, an algorithm’s speed (and space) can really matter. To illustrate, consider our ‘loading’ algorithm — i.e. the algorithm to load MARC records into the DB, which had the following steps:

  1. Do a simple load: i.e. for each catalogue entry create a new Item and new Persons for any authors listed
  2. “Consolidate” all the duplicate Persons, i.e. a Person who is really the same but for whom we create duplicate DB entries in part 1 (we can do this because MARC cataloguers try to uniquely identify authors based on name + birth date + death date).
  3. [Not discussed here] Consolidate “items” to “works” (associate multiple items (i.e. distinct catalogue entries) of, say, A Christmas Carol, to a single “work”)

The first part of this worked great: on a 1 million record load we averaged between 8s and 25s (depending on hardware, DB backend etc) per thousand records with speed fairly constant throughout (so that’s between 2.5 and 7.5h to load the whole lot). Unfortunately, at the consolidate stage we ran into problems: for a 1 million item DB there were several hundred thousand consolidations and we were averaging 900s per 1000 consolidations! (This also scaled significantly with DB size: a 35k record DB averaged 55s per 1000.) This would mean a full run would require several days! Even worse, because of the form of the algorithm (all the consolidations for a given person were done as a batch) we ran into memory issues on big datasets with some machines.

To address this we switched to performing “consolidation” on load, i.e. when creating each Item for a catalogue entry we’d search for existing authors who matched the information we had on that record. Unfortunately this had a huge impact on the load: time grew superlinearly and had already reached 300s per 1000 records at the 100k mark having started at 40 — Figure 1 plots this relationship. By extrapolation, 1M records would take 100 hours plus — almost a week!

At this point we went back to the original approach and tried optimizing the consolidation, first by switching to pure SQL and then by adding some indexes on join tables (I’d always thought that foreign keys were auto-indexed but it turned out not to be the case!). The first of these changes solved the memory issues, while the second resolved the speed problems, providing a speedup of around 30x (30s per 1000 rather than 900s) and reducing the processing time from several days to a few hours.
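To make the final approach concrete, here is a minimal sketch of consolidation done in pure SQL with explicit indexes on the join table. It is an illustration only: the table and column names (person, item, item_person) are hypothetical, SQLite is used simply because it ships with Python (the backend actually used is not stated above), and matching on name + birth date + death date follows the description of step 2.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory DB that mimics the result of the 'simple load'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT,
                             birth INTEGER, death INTEGER);
        CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
        -- hypothetical join table linking items to their authors
        CREATE TABLE item_person (
            item_id INTEGER REFERENCES item(id),
            person_id INTEGER REFERENCES person(id)
        );
        -- Foreign key columns are not indexed automatically in many
        -- databases, so the indexes are created explicitly.
        CREATE INDEX ix_item_person_person ON item_person(person_id);
        CREATE INDEX ix_item_person_item ON item_person(item_id);
        CREATE INDEX ix_person_identity ON person(name, birth, death);
    """)
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
        (1, "Dickens, Charles", 1812, 1870),
        (2, "Dickens, Charles", 1812, 1870),  # duplicate created on load
        (3, "Eliot, George", 1819, 1880),
    ])
    conn.executemany("INSERT INTO item VALUES (?, ?)", [
        (1, "A Christmas Carol"), (2, "Oliver Twist"), (3, "Middlemarch"),
    ])
    conn.executemany("INSERT INTO item_person VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 3)])
    conn.commit()
    return conn


def consolidate(conn):
    """Merge duplicate persons entirely inside the database."""
    conn.executescript("""
        -- Map each person to the lowest id sharing name + birth + death.
        -- 'IS' is SQLite's NULL-safe equality, so missing dates also match.
        CREATE TEMP TABLE canonical AS
            SELECT p.id AS old_id,
                   (SELECT MIN(q.id) FROM person q
                     WHERE q.name = p.name
                       AND q.birth IS p.birth
                       AND q.death IS p.death) AS new_id
              FROM person p;
        -- Repoint the join table at the surviving person rows.
        UPDATE item_person
           SET person_id = (SELECT new_id FROM canonical
                             WHERE old_id = item_person.person_id);
        -- Drop the now-unreferenced duplicates.
        DELETE FROM person WHERE id NOT IN (SELECT new_id FROM canonical);
        DROP TABLE canonical;
    """)
```

Because nothing is pulled into application memory, the working set stays inside the database, which is in the spirit of the memory fix described above; the indexes are what keep the lookups on the join table from scanning every row.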

Many more examples of this kind of issue could be provided. However, this one already serves to illustrate the two main points:

  • With large datasets speed really matters
  • Even with optimization, algorithms can take a substantial time to run

Both of these have a significant impact on the speed, and form, of the development process. First, because one has to spend time optimizing and profiling — which like all experimentation is time-consuming. Second because longer run-times directly impact the rate at which results are obtained and development can proceed — often bugs or improvements only become obvious once one has run on a large dataset, plus any change to an algorithm that alters output requires that it be rerun.

Figure 1: Load time when doing consolidation on load

This post continues the work begun in this earlier post on “Estimating Information Production and the Size of the Public Domain”.

Having already obtained estimates of the number of items (publications) produced each year based on library catalogue data our next step is to convert this into an estimate of the “size” of the public domain. (NB: as already discussed, “size” could mean several different things. Here, at least to start with, we’re going to take the simplest and crudest approach and equate size with number of publications/items.)

The natural, and most obvious, approach here is to go through our 1 million+ items and compute their public domain status (as discussed in this earlier post). Unfortunately, as detailed there, this is problematic because we often have insufficient information in library catalogues with which to compute PD status with certainty — in particular, author death dates are frequently absent. Thus, it will be necessary to fall back on some approximate method.

For example, we can base PD status on simple publication dates: if a book was published, say, 140 years ago it is very likely it is in the public domain, since for it to be in copyright its author must have lived more than 70 years after the book came out (remember copyright lasts for life plus 70 years in the EU)! Conversely, any publication less than 70 years old is almost certainly not in the public domain. For periods in between we can assume some proportion of publications are PD, starting close to zero for more recent items and rising towards one for older ones. A calculation along those lines is provided in the following table:

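| Start | End  | Items  | % PD | Number PD |
|-------|------|--------|------|-----------|
| 1400  | 1870 | 389291 | 100  | 389291    |
| 1870  | 1880 | 50564  | 95   | 48035     |
| 1880  | 1890 | 66857  | 90   | 60171     |
| 1890  | 1900 | 66883  | 80   | 53506     |
| 1900  | 1910 | 70360  | 50   | 35180     |
| 1910  | 1920 | 60489  | 30   | 18146     |
| 1920  | 1930 | 78670  | 10   | 7867      |
| 1930  | 1940 | 90576  | 5    | 4528      |
| Total |      | 873690 | 71   | 616724    |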

Number of UK Public Domain Publications (Based on Cambridge University Library Catalogue Data)

So, based on the assumptions regarding PD proportions given in the table, there are somewhat over 600 thousand PD books according to the holdings of Cambridge University Library (of which just over half, approx 390k are from before 1870). The British Library dataset is approx 4x as big as Cambridge University Library and the numbers scale up roughly proportionately giving a total of over 2.4 million items.
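The arithmetic behind the table can be reproduced with a few lines of Python. This is only a sketch of the calculation as described: the item counts and assumed percentages are copied from the table, while the function name and the choice to round each band down to a whole number of items are my own assumptions.

```python
# (start year, end year, items in catalogue, assumed % in the public domain)
BANDS = [
    (1400, 1870, 389291, 100),
    (1870, 1880, 50564, 95),
    (1880, 1890, 66857, 90),
    (1890, 1900, 66883, 80),
    (1900, 1910, 70360, 50),
    (1910, 1920, 60489, 30),
    (1920, 1930, 78670, 10),
    (1930, 1940, 90576, 5),
]


def estimate_pd(bands):
    """Return (total items, estimated PD items, overall PD share)."""
    total_items = sum(items for _, _, items, _ in bands)
    # Integer arithmetic avoids floating-point surprises; each band is
    # rounded down to a whole number of items.
    total_pd = sum(items * pct // 100 for _, _, items, pct in bands)
    return total_items, total_pd, total_pd / total_items


if __name__ == "__main__":
    items, pd_items, share = estimate_pd(BANDS)
    print(items, pd_items, round(share, 2))
    # Scaling by the rough 4x size ratio of the British Library dataset
    print(4 * pd_items)
```

Changing the assumed percentages in BANDS is an easy way to see how sensitive the headline figure is to those assumptions.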

Of course this is a fairly crude approach based purely on publication date and it could be improved in a variety of ways, most notably by using the authorial birth date information which is usually present in catalogue data (we can also use death date information where present). This will be the subject of the next post.

Here we’re going to look at using library catalogue data as a source for estimating information production (over time) and the size of the public domain.

Library Catalogues

Cultural institutions, primarily libraries, have long compiled records of the material they hold in the form of catalogues. Furthermore, most countries have had one or more libraries (usually the national library) whose task included an archival component and, hence, whose collections should be relatively comprehensive, at least as regards published material.

The catalogues of those libraries then provide an invaluable resource for charting, in the form of publications, levels of information production over time (subject, of course, to the obvious caveats about coverage and the relationship of general “information production” to publications).

Furthermore, library catalogue entries record (almost) the right sort of information for computing public domain status, in particular a given record usually has a) a publication date b) unambiguously identified author(s) with birth date(s) (though unfortunately not death date). Thus, we can also use this catalogue data to estimate the size of the public domain — size being equated here to the total number of items currently in the public domain.

Results

To illustrate, here are some results based on the catalogue of Cambridge University Library which is one of the UK’s “copyright libraries” (i.e. they have a right to obtain, though not an obligation to hold, one copy of every book published in the UK). This first plot shows the number of publications per year, as determined by the publication date recorded in the catalogue, up until 1960 (when the dataset ends).

A major concern when basing an analysis on these kinds of trends is that fluctuations over time derive not from changes in underlying production and publication rates but from changes in the acquisition policies of the library concerned. To check for this, we present a second plot which shows the same information but derived from the British Library’s catalogue. Reassuringly, though there are differences, the basic patterns look remarkably similar.

CUL data 1600-1960

Number of items (books etc) Per Year in the Cambridge University Library Catalogue (1600-1960).

BL data 1600-1960

Number of items (books etc) Per Year in the British Library Catalogue (1600-1960).

What do we learn from these graphs?

  • In total there were over a million “Items” in this dataset (and parsing, cleaning, loading and analyzing this data took on the order of days — while the preparation work to develop and perfect these algorithms took weeks if not months)
  • The main trend is a fairly consistent, and approximately exponential, increase in the number of publications (items) per year. At the start of our time period in 1600 we have around 400 items a year in the catalogue while by 1960 the number is over 16000.
  • This is a forty-fold increase and corresponds to an annual growth rate of approx 0.8%. Assuming “growth” began only around the time of the industrial revolution (~ 1750) when output was around 1000 (10-year moving average) gives a fairly similar growth rate of around 0.89%.
  • There are some fairly noticeable fluctuations around this basic trend:
    1. There appears to be a burst in publications in the decade or decade and a half before 1800. One can conjecture several, more or less intriguing, reasons for this: the cultural impact of the French revolution (esp. on radicalism), the effect of loosening copyright laws after Donaldson v. Beckett, etc. However, without substantial additional work, for example to examine the content of the publications in that period these must remain little more than conjectures.
    2. The two world wars appear dramatically in our dataset as sharp dips: the pre-1914 level of around 7k+ falls by over a third during the war to around 4.5k and then rises rapidly again to reach, and pass, 7k per year in the early 20s. Similarly, the late 1930s level of around 9.5k per year drops sharply upon the outbreak of war reaching a low of 5350 in 1942 (a drop of 45%), and then rebounding rapidly at the war’s end: from 5.9k in 1945 to 8k in 1946, 9k in 1947 and 11k in 1948!
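For readers who want to check figures of this kind themselves, here is a small sketch of the two calculations involved: a compound annual growth rate between two endpoints, and a percentage drop. The function names are my own. Note that an endpoint-only calculation is sensitive to the years and values chosen, so it need not reproduce the growth rates quoted above, which may rest on smoothed or fitted values from the full series.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def percent_drop(before, after):
    """Fall from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Endpoint figures quoted in the text: ~400 items in 1600, ~16000 in 1960
    print("1600-1960: {:.2%}".format(annual_growth_rate(400, 16000, 1960 - 1600)))
    # ~1000 items (10-year moving average) around 1750
    print("1750-1960: {:.2%}".format(annual_growth_rate(1000, 16000, 1960 - 1750)))
    # Late 1930s level of ~9.5k falling to 5350 in 1942
    print("WWII dip: {:.1f}%".format(percent_drop(9500, 5350)))
```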

To do next (but in separate entries — this post is already rather long!):

  • Estimates for the size of the public domain: how many of those catalogue items are in the public domain
  • Distinguishing Publications (“Items”) from “Works”, i.e. production of new material versus the reissuance of old (see previous post for more on this).

Colophon: Background to this Research

I’m working on an EU-funded project on the Public Domain in Europe, with particular focus on the size and value of the public domain. This involves getting large datasets about cultural material and trying to answer questions like: How many of these items are in the public domain? What’s the difference in price and availability of public domain versus non public domain items?

I’ve also been involved for several years in Public Domain Works, a project to create a database of works which were in the public domain.

Colophon: Data and Code

All the code used in parsing, loading and analysis is open and available from the Public Domain Works Mercurial repository. Unfortunately, the library catalogue data is not: library catalogue data, at least in the UK, appears to be largely proprietary and the raw data kindly made available to us for the purposes of this research by the British Library and Cambridge University Library was provided only on a strictly confidential basis.

In my original post on Visualizing Technology Flows from Patent Data I just presented static information — flows for a single year. As I said there:

The next step is to watch how these flows, and the relationships implied by them, have evolved over time. We can do this by plotting the same graph say, every 3 years, from 1975 up until the present.

At the time I had already coded up, and computed, snapshots for each year. However, considerations of space, as well as a desire to find a way to display the information in a ‘nice’ (animated) form, warranted a separate entry. After what, as usual, has turned out to be a rather longer delay than intended, I’ve finally got round to having a first stab at this using simple animated gifs:

Technology flows 1975-1994

Animated Citation Flows 1975-1994 (1994 base year).

Here I’ve fixed the layout of the nodes based on the final year (1994) flows. I’ve also done quite a lot of tedious playing around (if only one had stylesheets!) with edge and node sizes to try and improve the look and they are still far from perfect (NB: this means edge/node sizes differ slightly from the images in the original post). As before:

  • Size of nodes indicates total citation flows from that area in that year
  • Yellow portion is citations back into that subcategory while black represents portion that is into other subcategories (comparison by area).
  • Direction of flow is indicated by an arrow head (a rectangular block) with size of flow measured by width of edge and size of head.
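The node quantities described in the list above can be sketched as follows. This is an illustration rather than the code used for the animation: it assumes one year's flows are held as a dictionary keyed by (citing category, cited category), and the function name and sample numbers are made up.

```python
from collections import defaultdict


def node_summary(flows):
    """Summarise one year's citation flows per source category.

    flows: {(citing_category, cited_category): citation_count}
    Returns, for each citing category, the total flow (node size), the part
    going back into the same category (yellow) and the remainder (black).
    """
    total = defaultdict(int)
    internal = defaultdict(int)
    for (src, dst), count in flows.items():
        total[src] += count
        if src == dst:
            internal[src] += count
    return {
        cat: {"total": total[cat],
              "self": internal[cat],
              "other": total[cat] - internal[cat]}
        for cat in total
    }


if __name__ == "__main__":
    # Made-up numbers purely for illustration
    example = {
        ("Drugs", "Drugs"): 30,
        ("Drugs", "Biotechnology"): 10,
        ("Biotechnology", "Drugs"): 5,
    }
    for category, stats in sorted(node_summary(example).items()):
        print(category, stats)
```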

Note that we are displaying year values not cumulative values — so, for example, links between nodes may get smaller or even disappear from one year to the next. What jumps out from this?

  • The substantial increase in flows over time (most obviously seen in the size of the nodes).
  • (At least based on examination by eye) no great change in the balance of these flows between cites outside and cites within a category (relative sizes of black and yellow in nodes).
  • Growth has varied substantially across areas (largely, I would hazard, in line with the no. of patents in that area). In particular, the “Computer/Electronics” cluster (top-right) has grown substantially faster than the “Chemicals” sector at centre-left. Individual categories showing especially marked growth include: Biotechnology, Computer Hardware and Software, Communications, Information Storage, and Drugs.
  • It also looks like some areas have grown more strongly linked and “clustered” over time (e.g. Computer/Electronics, and Drugs to Organic Compounds) though it is hard to tell from this visualization (pointing to the need for more formal techniques …).
  • Something which is very clear from the visualization is that there is significant year-to-year variation with clear drops in flows in some cases year-on-year

I also computed another version where the network layout is based on that year’s flows — rather than with a fixed layout based on a given base year.

Unfortunately, this looks too “busy”, particularly as the sensitivity of the network layout algorithm (networkx.graphviz_layout) means that categories move around a lot. (To save on space — the files are big — I haven’t posted this up but if anyone is interested let me know and I’ll upload it).

One solution to this would be to move to rendering cumulative, rather than per-year, flows. This might also improve the base-year case: even there, it might be more natural, at least from a visual point of view, to display changes in flows over time via their impacts on “stocks” rather than displaying the “flows” themselves.
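Moving from per-year flows to cumulative ones is a simple running sum. A minimal sketch, again assuming (hypothetically) that each year's flows are stored as a dictionary keyed by (citing category, cited category):

```python
from collections import Counter


def cumulative_flows(flows_by_year):
    """Turn per-year flows into running totals ("stocks").

    flows_by_year: {year: {(citing, cited): count}}
    Returns {year: {(citing, cited): cumulative count up to that year}}.
    """
    running = Counter()
    result = {}
    for year in sorted(flows_by_year):
        running.update(flows_by_year[year])  # adds counts edge by edge
        result[year] = dict(running)         # snapshot for this year
    return result
```

With cumulative values an edge can only grow or stay the same from one frame to the next, which should make the animation less jumpy than the per-year version.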

So, next steps:

  • Plot cumulative flows
  • Write up a more formal analysis based on e.g. PCA. I’ve already done PCAs on individual years and an animation might be interesting.
  • Do animations right: the proper way to do this would be with a proper “slider” widget and stop/start control. It looks like this should be possible in javascript using e.g. jquery but it doesn’t look to be trivial (if it is, please let me know how!). (BTW: I know I could use Flash but it’s proprietary …).

I’m posting up an essay on “Discounting and Self-Control” (pdf). The essay, which I haven’t really touched for over a year, is still in its early stages but having lacked the time to do much on it over the last year, and going on the motto of “release early, release often”, I’m posting it up as a form of alpha version.

… then must you speak
Of one that loved not wisely, but too well;
Of one not easily jealous, but, being wrought,
Perplex’d in the extreme; of one whose hand,
Like the base Judean, threw a pearl away
Richer than all his tribe; …

Othello, The Moor of Venice

Abstract

An agent’s intertemporal choices depend on a variety of factors, most prominently, their valuation of future payoffs as encapsulated in a discount function. However, it is also clear that factors such as self-control may also play an important role, and given the similarity of impact, a confounding one. We explore the literature on this issue as well as examining what occurs when those with higher time-preference (whether arising from discounting or self-control) also enjoy their consumption more.

Introduction

The exercise of will, especially in the form of self-control, has long been recognized as central to human existence, experience, and morality. Over the last few decades there has been increasing interest in the issue from a scientific perspective. At the same time, it has also long been appreciated that humans (and other animals) make trade-offs between the present and the future — as well as between different points in the future, and that events taking place closer to the present are given greater weight than those which are more distant. Traditionally, at least in economics, this type of behaviour has been subsumed under the heading of discounting.

Both of these factors, self-control and discounting, affect behaviour, and choices, in relation to outcomes which do not (all) take place in the present. However they are distinct. Specifically, consider a very simple case of two outcomes A and B where B occurs after A (for example, A might be one ice cream today and B an ice cream and a doughnut tomorrow). Self-control issues arise where one prefers B over A but is unable to execute on this preference and therefore actually takes (‘chooses’) A. By contrast, in the discounting case A is actually preferred over B and therefore is chosen (freely) by the decision maker.

It would seem important to keep these two aspects of decision making clearly separated. While lack of ‘self-control’ is usually seen as disadvantageous and a reason for adopting various ‘commitment strategies’ (for example, opting to remove various items from the choice set, such as having no cigarettes in the house), the simple preference for the present over the future incorporated in the discounting model would seem to generate no such difficulties.

However, empirically it may prove rather difficult to do so. As shown by the simple example above the same observed ‘choice’ for A (one ice cream today) over B (ice cream plus doughnut tomorrow) can be the result of two very different processes. Thus if we only observe choices, and not the underlying preferences and/or the process by which the choice is arrived at, it may be impossible to distinguish the two.
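One way to make the distinction concrete is to write it down in symbols. The notation here is not taken from the essay and is an assumption for illustration only: $u(\cdot)$ is the utility of an outcome when it is received and $\delta \in (0, 1]$ is a one-period discount factor.

```latex
\begin{align*}
\text{Discounting:} \quad
  & u(A) > \delta\, u(B)
  \;\Rightarrow\; A \text{ is preferred, and } A \text{ is chosen}, \\
\text{Self-control failure:} \quad
  & \delta\, u(B) > u(A)
  \;\text{ (so } B \text{ is preferred), yet } A \text{ is chosen}.
\end{align*}
```

In both cases the observed choice is A, so choice data alone cannot tell us which of the two inequalities holds.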

It is perhaps for this reason that these distinct aspects are sometimes conflated. Consider, for example, Mischel et al 1989 which is entitled “Delay of Gratification in Children” and summarizes much of Mischel’s pioneering work in this area. Mischel’s approach is clearly more oriented along the self-control aspect, and this is borne out in the types of experiments conducted (more on this below). Nevertheless they state (p.934) “The obtained concurrent associations [between treatments and delay] are extensive, indicating that such preferences reflect a meaningful dimension of individual differences, and point to some of the many determinants and correlates of decisions to delay (18).” Here the orientation towards self-control has become a general “decision to delay” and this is borne out by the associated footnote (18) which references related literature in other disciplines and is worth quoting in its entirety:

[… see full essay for more]

Last Tuesday I was at the RES Annual Conference to present my paper “Is Google the Next Microsoft? Competition, Welfare and Regulation in Internet Search”. I’ve uploaded my slides from the talk here and below is a recently prepared overview. The full paper can be found online on the SSRN site at:

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1265521

Overview

Beginning from nothing twelve years ago, today online search is a multi-billion dollar business and search engine providers such as Google and Yahoo! have become household names.

While search has become increasingly ubiquitous it has also grown increasingly dominated by a single firm: Google. For example today in the UK Google accounts for 90% of all searches and in many other countries Google has a similar lead over its rivals.

In this paper I investigate why the search engine market is so concentrated and what implications this has for us both now, and in the future. I also look at whether search engines will require regulation and if so in what form. In doing so I also give a detailed explanation of how the search engine market works, its history, and how it has come to be such a lucrative, and important, activity.

To summarize the main points:

(a) Though search engines provide ordinary users with a ‘free’ service they gain something very valuable in exchange: attention. Attention is an increasingly valuable good, being in ever more limited supply: after all, each of us has a maximum of 24 hours of attention available in any one day (and usually much, much less). Access to that attention is correspondingly valuable, especially for those who have products or services to advertise. Thus, while web search engines do not charge users, they can retail the attention generated by their service to those who are willing to pay for access to it.

(b) The search engine market is already extremely concentrated. In many countries a single firm (usually Google) possesses a market share an order of magnitude larger than its rivals. As stated, in the UK Google already holds over 90% market share. However, it is also noteworthy that there are some marked variations, for example in China Google trails the leaders.

(c) Competition issues are likely to become more serious as this dominance becomes established. It is important to realise that while search appears ‘free’ we do pay indirectly via the charges to advertisers, who must in turn recoup that money from consumers. A dominant search engine may have incentives to distort its ‘results’ in ways that increase its own profits but harm society, for example by suppressing organic search results that would substitute for or harm associated ‘sponsored’ results (adverts).

(d) There are a number of approaches that regulators and policy-makers could take to protect against these adverse consequences. For example, policy-makers could look at ways to separate the ‘software’ and ‘service’ parts of a search engine’s activity, or less dramatically, they could set up a regulatory body to review search result rankings and choices.

Conclusion: it will be increasingly necessary for there to be some form of oversight, possibly extending to formal regulation, of the search engine market. In several markets monopoly, or near monopoly, already exists and there is every reason to think this situation will persist. Left unchecked by competition the private interests of a search engine and the interests of society as a whole will diverge and, thus, left entirely unregulated, online search will develop in ways that are harmful to the general welfare.

It is therefore important that policy-makers begin now to develop their strategy in relation to this key area of the knowledge economy. The power rapidly accumulating in the hands of a few major search providers is a great one. It behoves us to ensure that it is used in a way that brings the greatest benefit to society as a whole.

A couple of weeks ago I was back at City University’s Centre for Competition and Regulatory Policy for their winter workshop to present a new paper. Entitled Changing the Numbers: UK Directory Enquiries Deregulation and the Failure of Choice it looked at what happened when the UK deregulated its directory enquiries market in the early 2000s. From the abstract:

In 2003, the UK ‘liberalised’ its telephone directory enquiries service with the aim of introducing competition so as to improve quality and lower costs. Unfortunately the results did not match expectations. Proliferation of numbers led to consumer confusion and high-price firms with no discernible quality advantages but which employed heavy advertising came to dominate the market. Consumer and total welfare appear to have declined. This example raises important questions for regulators. In particular, with limits on information and rationality, it may sometimes be better to limit choice but increase competition to supply that choice.

Link to Paper

Recent Work on Open Economics

January 23rd, 2009

Over the Christmas break I had a chance to make some substantial improvements/additions to our Open Economics including:

  1. Improved javascript graphing.
  2. Extended Millennium Development Goals package and added web interface.
  3. First efforts at ‘Where Does My Money Go’

More details on each of these can be found below. Also we’d be delighted to hear from anyone interested in getting involved in this, especially with the last item, so if interested do get in touch.

1. Updated javascript graphing package to use flot.

This also allows us to use javascript to make the graphing stuff more interactive, in particular to select chart type and the series to plot. See e.g. the data on Daily Wages of Thatchers in the Middle Ages or Wheat, barley, oat, mutton and wool prices, and agricultural wages, 1500-1849.

2. Improved Millennium Development Goals package/dataset and added a web interface.

Extended ‘packagization’ of the MDG data by creating a mini-domain model and an associated sql version of data in addition to the existing csv normalized-tabular version of the data:

http://knowledgeforge.net/econ/svn/trunk/econdata/mdg/db.py

This is much more convenient for analysis (e.g. finding all countries which have at least one entry for any of these 3 series between 1995 and 2005 …). It is also essential for:

New web interface for Millennium Development Goals

Using the sql version of the data it was easy to build a quick-and-dirty web interface that enables one to browse and view the data quickly:

http://www.openeconomics.net/mdg/

For example here’s a chart and data showing “Children under 5 moderately or severely underweight, percentage” for Afghanistan, China, India, United States:

http://www.openeconomics.net/mdg/view?commit=Show+Values&series=559&countries=4&countries=156&countries=356&countries=840
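Returning to the kind of query mentioned earlier (all countries which have at least one entry for any of a given set of series between 1995 and 2005), here is a minimal sketch of how it might look against a SQL version of the data. The table and column names (observation, country, series_id, year, value) are hypothetical and may well differ from the real domain model in db.py, and the rows are dummy values used only to exercise the query.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table with dummy rows (not real MDG data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation "
        "(country TEXT, series_id INTEGER, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", [
        ("India", 559, 1999, 1.0),
        ("China", 560, 2001, 2.0),
        ("Afghanistan", 559, 1990, 3.0),    # outside the year range
        ("United States", 999, 2000, 4.0),  # different series
    ])
    conn.commit()
    return conn


def countries_with_data(conn, series_ids, start_year, end_year):
    """Countries with at least one entry for any of the given series."""
    placeholders = ",".join("?" for _ in series_ids)
    sql = (
        "SELECT DISTINCT country FROM observation "
        "WHERE series_id IN ({}) AND year BETWEEN ? AND ? "
        "ORDER BY country"
    ).format(placeholders)
    params = list(series_ids) + [start_year, end_year]
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = make_demo_db()
    print(countries_with_data(conn, [559, 560, 561], 1995, 2005))
```

Doing the same against the normalized csv version would mean loading and filtering everything by hand, which is the convenience the sql version buys.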

3. First efforts at ‘Where Does My Money Go’

There are two parts to this project: a) getting the data on government revenue/expenditure and b) displaying it nicely in a web interface.

Part (a) is encapsulated in a new ukgovfinances dataset:

http://knowledgeforge.net/econ/svn/trunk/econdata/ukgovfinances/

Using this data we have made a (small) start on the web interface:

http://www.openeconomics.net/wdmmg/

The Open Knowledge Foundation (which I’m involved in) is co-organizing with MySociety and OPSI, a Workshop on Finding and Re-using Public (Sector) Information.

The event takes place this Saturday (1st of November) at the London Knowledge Lab near Holborn in London. Full details are in this OKFN blog post and you can sign up on the wiki page:

http://okfn.org/wiki/PublicInformation

Last Friday and Saturday I was at the 2008 European Policy for Intellectual Property (EPIP) conference, held this year in Bern. I presented my paper on the optimal term of copyright and discussed a paper of Luca Spinesi’s on ‘Imperfect IPR enforcement, inequality, and growth’. Below can be found ‘impressionistic’ notes from some of the other sessions I had a chance to attend.

Jim Bessen: How can and how should economics inform patent policy?

  • What is aim of ‘Property Rights’
  • Look at example of tradable permits for pollution
    1. Do institutions do their jobs
    2. Resources (is air cleaner)
    3. Social welfare
  • For patent system, thanks to recent work, first two are within our reach (though not within our grasp)
  • Institutions. Want:
    1. Specificity
    2. Searchability
    3. Predictability
    4. Transactability
    5. Enforceability
  • Patent system is not doing so well
    1. Specify: reasonable but lots of debate about what claims mean (40% overturn rate on appeal of district court decision re. claim construction)
    2. Search: pretty poor (esp. in ICT). Many firms do not bother to search.
    3. Predictability: low (e.g. no defense insurance)
    4. Transact: can be anti-commons
    5. Enforce: pretty unpredictable
  • Resources (Innovation)
    • Patent system is not doing so well due to overlapping claim (pooling problem)
    • Fuzzy boundaries: dispute costs
      • Value patents (upper bound from renewal, re-assignment, int’l filings, firm market value, surveys, case-studies)
      • Dispute costs (lower bound)
      • For pharma: value ~ $12 billion/year, costs ~ $1 billion
      • Other industries: value ~ $2 billion/year (from 80s to present), costs ~ $1 billion / year up until mid 90s since when they have spiked and now much higher than value — e.g. in late 90s costs 3x value
      • Could use fees to address this (raise from ~$5000 to ~$30000)

Reto Hilty: Enforcement of intellectual property rights (IPRs)

  • Huge figures circulate about losses from piracy
    • Most figures are (very) dubious and produced by the industry
  • History of IPRED (and IPRED2)
  • More intl stuff:
    • TRIPS+
    • FTAs (US)
    • EPAs (EU)
    • ACTA
  • Why has this focus on enforcement happened
    • General mantra that strengthening IP rights is good for innovation
    • Patents: probably have over-protection
      • Full patent protection (EPC 1973) — i.e. patent covers subsequent uses even if not anticipated. (probably a mistake)
      • Biological substances — full patent protection particularly problematic
      • Software patents …
      • Drugs and developing countries
    • Copyright law
      • Internet users see constriction not justice
      • Entertainment + TPMs — “unjustified profits”
      • Scientific research: unnecessary constrictions (Open Access)
    • Industrial design
    • Trade-mark law: large extensions in the late 80s (protection of colours, shapes unjustified)
    • Eventually this constant extension generated such opposition that it is now at a standstill
    • Thus, rightsholders move focus to enforcement (focus on ‘efficiency’)
  • But stronger enforcement also causes problems [ed: the strength of a right in fact is the product of enforcement and strength in theory]
    • will there be a backlash?
  • Also extension of IP geographically — esp. to developing countries
  • What justifications are there for IP enforcement
    • IPR not valuable without some enforcement, certainty …
  • One size cannot fit all: whether for IP itself or for enforcement
    • If IPR is misused enforcement can make things worse
  • Suggestions:
    • Decriminalize where too much IP protection
    • Strengthen enforcement where IP truly detrimental
    • Distinguish IP protection from consumer protection (counterfeiting not the same as IP protection)
    • [ed: one concern here is that it seems here we are using enforcement/non-enforcement to correct IP rights which are themselves wrong — enforce where good, don’t enforce where not good. But if that were agreed why couldn’t we correct the underlying problem]

Davis, Davis and Hoisl: Leisure time invention

  • PatVal data (10.5k German patents sampled with survey of inventors)
  • Leisure time has +ve impact on inventive output
  • Leisure time invention +vely linked to interactions with co-workers and outsiders
  • More leisure time invention in conceptual-based technologies rather than science-based technologies
  • Incidence of leisure time invention will be -vely related to project size
  • Most hypotheses confirmed

Ashish Arora: Patents and Innovation

  • Evidence for benefits of patents on innovation is mixed
    • Example of early Swiss and German dye and chemical industries
    • Surveys are the main evidence: they show there are rents from patents but with an equivalent subsidy ratio that is not that high
  • Kyle and McGahan: no inducement of research in diseases of poor countries after TRIPs
    • Even if patent protection is important no reason for developing countries to have them (already have protection in developed countries)
  • Thickets, patent litigation and trolls
    • Cockburn MacGarvie and Mueller (2008): fragmentation increasing across all industries
    • Substantial litigation costs
    • Geraldin, … find no thicket problem in 3G telephony
  • Anti-commons
    • Completely unpersuaded by the evidence
    • All examples came from universities: US research universities have made a mess of tech-transfer and patenting, alienating faculty and angering corporate partners (Bayh-Dole has had significant unintended bad consequences)
  • Markets for technology (specialization)
    • The first order effect of patents may be on trade in technology
    • Having people whose business it is to sell technology is really important (particularly if you are a developing country)
    • Licensing flows in US: $66 billion in 2006 (Carol Robbins). Good proportion of domestic R&D
    • Hall and Ziedonis evidence on specialist semiconductor firms
    • Gambardella and Giarratana (2007): software security patents
  • Making patents more useful
    • Much of the problem is bad patents due to:
      1. Invention is poorly understood (underlying knowledge base is poor)
      2. The claims are written with the intent of claiming as much as possible while revealing as little as possible
    • ‘Metes and bounds’ of the patent are unclear to all except handful of patent lawyers
    • Not new: cf. German chemical industry back in 19th century
    • Solution:
      1. Force patents to be written using (i) standard terms (ii) without legal jargon (whose only justification is a futile reach for precision)
      2. (i) Patents should be published expeditiously (ii) transactions (licenses, assignments, beneficial interests) in patents should be recorded and disclosed

Survey on Patent Licensing: Dominique Guellec (OECD)

  • Why licensing out:
    • Value from unused inventions
    • Inventions with applications elsewhere
    • Fabless firms
    • Establishing technology as a standard (may raise Competition issues)
    • Cross-licensing deals (ditto)
  • Expected Economic Effects (+ve)
    • Increases diffusion
    • Reduces duplication
    • Boost downstream competition
    • Facilitates specialization
  • Can also be -ve (mirror image of +ve ones e.g. reduced duplication = less competition)
  • Graph showing huge increase in royalty/license payments since mid 80s (~$10B/year to ~$110B/year) (source: World Bank)
    • But how much of this is real (i.e. not tax manipulation etc)? It also includes copyright etc
  • OECD survey on licensing behaviour, implemented by EPO and by JPO/University of Japan
    • focuses on licensing out
    • response rate: 42% in Europe, 34% in Japan [ed: Japan responses are less reliable for reasons not entirely clear to me]
    • no questions on revenues (people don’t respond when you ask this — either don’t know or don’t want to tell)
  • Results:
    • 35% of European companies license out, 59% of Japanese firms
    • Licensing to non-affiliated companies: 20% of Eur, 27% of Japanese
    • U-shaped prob of licensing as a function of size
    • By tech field: highest in chemistry and electronics
    • Younger companies do it more (controlling for size) [ed: issues here though. Old firms which are small are not the same as young firms that are small]
    • Why do it?
      • Earning revenue: 60% EUR, 52% JPN; cross-licensing: 18%, 18%
    • Patents you would have licensed but could not/did not: ~20%
      • Why? Difficulty of finding a partner (25% of EUR and 18% of JPN)
      • Not important: problems of drafting contracts or technology not mature
  • Difficulty of finding partners could be for several reasons but suggests could be role for more/better intermediaries to facilitate transactions (INPIT in Japan)

Patent Thickets and the Market for Ideas: Mark Schankerman (LSE)

  • Market for ideas (patent licensing and sale of patents) [ed: this is obviously not the whole market for ideas …]
  • Study market through a new lens: settlement of patent infringement disputes
    • Do not know whether when settlements happen licensing actually occurs
  • Focus on 2 key aspects:
    • Fragmentation of rights (‘patent thickets’)
    • Certainty of enforcement (CAFC led to more certainty — not worrying here about pro-patent bias)
  • Fragmentation:
    • Trad story: bad (higher transaction costs, bargaining failure …)
    • Dissenting voice (Lichtman 2006): greater fragmentation lowers the value at stake in each negotiation and this reduces the incentive to bargain hard. This speeds up settlement. Of course still leaves question of whether this reduces total negotiation time.
  • Model gives us various hypotheses:
    • H1: more complementarity means longer negotiation
    • H2: more fragmentation means shorter negotiations
    • H3: Settlement negotiations will be shorter for patents litigated after CAFC (1982)
    • H4: Impact of fragmentation of external rights will be lower after the introduction of CAFC
    • H5: CAFC has a bigger impact where the preceding circuit had more uncertainty
  • Results
    • More fragmentation: leads to lower dispute duration (19.6 months for < 50th percentile frag vs. ~16 months for > 90th percentile)
    • CAFC has a big effect on dispute duration (~33 months to ~18 months)
  • Conclusion: looking at delay (not royalty stacking or other issues)
    • Certainty: good
    • Fragmentation: not bad (and maybe good)