Author “Significance” From Catalogue Data
November 5th, 2009
Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).
I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:
- Prolificness — how many distinct works an author produced (since usually each work will get an item)
- Popularity — this influences how many times the same work gets reissued as a new ‘item’ and the library decision to keep the item
- Merit — as for popularity
The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:
- Shakespeare is number 1 among individual authors (number 2 in the table once the institutional entries are counted; table ranks are given in brackets below)
- Classical (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
- Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith, Gilbert Burnet (perhaps accounted for by his prolificity).
- Also surprising are the limited entries from the 19th-century UK, with only Scott (26), Dickens (28) and Byron (41).
| Rank | No. of Items | Name |
|---|---|---|
| 1 | 3112 | Great Britain. Parliament. |
| 2 | 1154 | Shakespeare, William |
| 3 | 1076 | Church of England. |
| 4 | 973 | Cicero, Marcus Tullius |
| 5 | 825 | Great Britain. |
| 6 | 766 | Catholic Church. |
| 7 | 721 | Erasmus, Desiderius |
| 8 | 654 | Defoe, Daniel |
| 9 | 620 | Horace |
| 10 | 599 | Aristotle |
| 11 | 547 | Voltaire |
| 12 | 539 | Virgil |
| 13 | 527 | Swift, Jonathan |
| 14 | 520 | Goethe, Johann Wolfgang Von |
| 15 | 486 | Rousseau, Jean-Jacques |
| 16 | 479 | Homer |
| 17 | 444 | Milton, John |
| 18 | 388 | Sterne, Laurence |
| 19 | 387 | England and Wales. Sovereign (1660-1685 : Charles II) |
| 20 | 386 | Euripides |
| 21 | 372 | Ovid |
| 22 | 358 | Goldsmith, Oliver |
| 23 | 358 | Plato |
| 24 | 351 | Wang |
| 25 | 349 | Alighieri, Dante |
| 26 | 338 | Scott, Walter (Sir) |
| 27 | 326 | More, Hannah |
| 28 | 322 | Dickens, Charles |
| 29 | 315 | Aeschylus |
| 30 | 304 | Burnet, Gilbert |
| 31 | 302 | Luther, Martin |
| 32 | 295 | Dryden, John |
| 33 | 290 | Xenophon |
| 34 | 280 | Sophocles |
| 35 | 262 | Pope, Alexander |
| 36 | 259 | Fielding, Henry |
| 37 | 258 | Li |
| 38 | 250 | Calvin, Jean |
| 39 | 248 | Zhang |
| 40 | 247 | Aristophanes |
| 41 | 247 | Byron, George Gordon Byron (Baron) |
| 42 | 247 | Bacon, Francis |
| 43 | 247 | Chen |
| 44 | 245 | Terence |
| 45 | 241 | Euclid |
| 46 | 235 | Augustine (Saint, Bishop of Hippo.) |
| 47 | 232 | Burke, Edmund |
| 48 | 223 | Johnson, Samuel |
| 49 | 222 | Bunyan, John |
| 50 | 222 | De la Mare, Walter |
Top 50 authors based on CUL Catalogue 1400-1960
The other thing we could look at is the overall distribution of titles per author (and how it varies with rank — a classic “is it a power law” question). Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, v. crudely, to a transposed plot of the tail of the histogram …). In both cases it looks (!) like a power-law is a reasonable fit given the (approximate) linearity but this should be backed up with a proper K-S test.
Histogram of items-per-author distribution (log-log)
Rank versus no. of items (log-log)
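As a first step towards the K-S check mentioned above, here is a minimal sketch in Python. It is not the code from the repository: the function names are hypothetical, the exponent is estimated with the standard continuous maximum-likelihood (Hill) estimator, and the cutoff `xmin` is assumed to be chosen by hand. Item counts are integers, so treating them as continuous is only an approximation, and the function returns the K-S distance only (a proper test would also need a p-value, e.g. from a bootstrap).

```python
import math


def fit_power_law(counts, xmin):
    """Estimate the exponent alpha of p(x) ~ x^(-alpha) for the tail x >= xmin.

    Uses the continuous maximum-likelihood estimator
    alpha = 1 + n / sum(ln(x_i / xmin)).
    """
    tail = [x for x in counts if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need at least one observation strictly above xmin")
    return 1.0 + len(tail) / log_sum


def ks_distance(counts, xmin, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    tail = sorted(x for x in counts if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)  # fitted power-law CDF at x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst
```

One would call `fit_power_law(items_per_author, xmin)` on the full list of items-per-author counts and then pass the estimate to `ks_distance`; a small distance is consistent with, but does not prove, a power-law tail.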
TODO
- K-S tests
- Extend data to present day
- Check against other catalogue data
- Look at occurrence of people in title names
- Look at when items appear over time
Colophon
The code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py
Exploring Patterns of Knowledge Production
October 15th, 2009
I’m posting up some work-in-progress entitled Exploring Patterns of Knowledge Production (link to full pdf). Below I’ve excerpted the introduction plus list of motivational questions. Comments (and critique) very welcome!
Introduction
In what follows the term ‘knowledge’ is used broadly to signify all forms of information production including those involved in technological innovation, cultural creativity and academic advance.
Today, thanks to rapid advances in IT, we have available substantial datasets pertaining both to the extent and the structure of knowledge production across disciplines, space and time.
Especially recent is the availability of good ‘structural’ data, that is, data on the linkages and relationships of different pieces of knowledge, for example as provided by citation information. This new material allows us to explore the “patterns of knowledge production” in deeper and richer ways than ever previously possible and often using entirely new methods.
For example, it has long been accepted that innovation and creativity are cumulative processes, in which new ideas build upon old. However, other than anecdotal and case-study material provided by historians of ideas and sociologists of science there has been little data with which to study this issue — and almost none of a comprehensive kind that would make possible a systematic examination.
However, the recent availability of comprehensive databases containing ‘citation’ information has allowed us to begin really examining the extent to which new work, be it a new technology as represented by a patent or a new idea in academia as represented by a paper, builds upon old.
Similar opportunities present themselves in relation to identifying the creation of new fields of research or technology, and tracing their evolution over time. Here the existence of extensive “structural information” as presented, for example, by citation databases, enables new systematic approaches. For example, can new fields be identified (or perhaps defined) as points in ‘knowledge space’ far away from the existing loci of effort? Or, alternatively, by the nature of their connections to the existing body of work?
Structural information of this kind can also be used in charting other changes in the life-cycle of knowledge creation. For example, to offer a specific conjecture, a field entering decline, though still exhibiting a similar level of output (papers etc) and even citations to a field in rude health, may display a citation structure which is markedly different — for example, more clustered within the field itself. Thus, by using this additional structural information we may be able to gain insights not available with simpler approaches.
At the same time, structure must also play a central role in any attempt to estimate knowledge-related ‘output’ measures. This is of course not true for other forms of ‘output’, for example that of corn or steel, where we have relatively well-defined objective measures available: tonnes of such-and-such a quality.
But knowledge is different: the most obvious metrics, such as number of patents or papers produced, seem entirely inadequate: one particular innovation or paper may be ‘worth’ as much as a hundred or a thousand others.
The issue here is that, compared to corn or steel, knowledge is extremely inhomogeneous, or put slightly differently, quality (or significance) differs very substantially across the individual pieces of knowledge (papers, patents etc).
Thus, any serious attempt to measure the progress of knowledge must find some way to do this quality-adjustment, and structural information seems essential to this.
What specific questions might we explore with such datasets?
The following is a (non-exhaustive) list of the kinds of questions one might explore using these new datasets:
- Can we use structure to infer information about quality of individual items? Clearly the answer is yes, for example by using a citation-based metric where a work’s value is estimated based on its citation by others.
- Can we then use this information together with more global structure of the production network to gain a better idea of total (quality-adjusted) output? This would allow one to chart progress, or the lack of it, over time.
- Can we use structural information to investigate the life-cycle of fields? For example, can we see fields ‘dying out’ or the onset of diminishing returns? Can we see new fields coming into existence and their initial growth patterns?
- What about productivity per capita and its variation across the population? It is likely that one would need to focus here within a discipline as it would be difficult to directly compare across disciplines, at least when using quality adjusted productivity.
- Do the structures of knowledge production vary over time and across disciplines and does this have implications for their productivity? Can we compare the structure of evolution in technology or economics with that in ‘natural’ evolution and, if not, what are the primary differences?
- How do other (observable) attributes related to the producers of knowledge (their collaboration with others, their geographical location) affect the structures we observe and the associated outcomes (output, productivity) already discussed above?
- Do different policies (for example openness vs. closedness — weak vs. strong IP) have implications for the structure of production and hence for output and productivity?
- Is knowledge production (in a particular area) ergodic or path-dependent? Crudely: do we always end up in the same place or do small shocks have large long-term effects?
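To make the first two questions (and the earlier conjecture about citations clustering within a declining field) a little more concrete, here is a rough Python sketch. Everything in it is an illustrative assumption rather than a method from the paper: citations are given as `(citing, cited)` pairs, a work's quality is crudely proxied by one plus the number of citations it receives, and the function names are made up for this example.

```python
from collections import Counter


def citation_counts(citations):
    """Count how often each work is cited; citations are (citing, cited) pairs."""
    return Counter(cited for _citing, cited in citations)


def quality_adjusted_output(works_by_year, citations):
    """Per-year output where each work counts as 1 + citations received.

    The weighting is a deliberately crude assumption, chosen only to show
    how a quality adjustment changes a raw count of works.
    """
    counts = citation_counts(citations)
    return {
        year: sum(1 + counts[work] for work in works)
        for year, works in works_by_year.items()
    }


def within_field_share(citations, field_of, field):
    """Share of citations made by works in `field` that stay inside `field`."""
    outgoing = [(a, b) for a, b in citations if field_of[a] == field]
    if not outgoing:
        return 0.0
    inside = sum(1 for _a, b in outgoing if field_of[b] == field)
    return inside / len(outgoing)
```

Tracking `within_field_share` for a field over time would be one simple way to look for the increased clustering conjectured above; a more careful version would need to control for field size and citation lag.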
Exploring Patterns of Knowledge Production
March 18th, 2008
A definition: the term ‘knowledge’ is here used broadly to signify all forms of information production including those involved in technological innovation, cultural creativity and academic advance.
Largely as a result of better ICT we now have available some very substantial datasets regarding both the extent and structure of knowledge production across different jurisdictions and different disciplines.
Of particular interest here is the second aspect, the structure of knowledge production, as it has long been accepted that innovation and creativity are cumulative processes, in which new ideas build upon old.
However, other than the anecdotal and case-study material provided by historians of ideas and sociologists of science there has been little evidence on this issue — and almost none of a comprehensive kind that would make a systematic examination possible.
In particular, the existence of databases containing ‘citation’ information allows us to, at least partially, determine the extent to which new work, be it a new technology as represented by a patent or a new idea in academia as represented by a paper, builds upon old.
What specific issues might we explore with such datasets?
Given the availability of these new datasets and the basic cumulative nature of most knowledge production, what specific issues and questions might we explore? The following provides a basic, but non-exhaustive, list:
- Can we use structure to infer information about quality of individual items? Clearly the answer is yes, for example by using a citation-based metric where a work’s value is computed on its citation by others.
- Can we then use this information together with more global structure of the production network to gain a better idea of total (quality-adjusted) output? This would allow one to chart progress, or the lack of it, over time.
- What about productivity per capita and its variation across the population? It is likely that one would need to focus here within a discipline as it would be difficult to directly compare across disciplines, at least when using quality adjusted productivity.
- Do the structures of knowledge production vary over time and across disciplines and does this have implications for their productivity? Can we compare the structure of evolution in technology or economics with that in ‘natural’ evolution and, if not, what are the primary differences?
- How do other (observable) attributes related to the producers of knowledge (their collaboration with others, their geographical location) affect the structures we observe and the associated outcomes (output, productivity) already discussed above?
- Do different policies (for example openness vs. closedness — weak vs. strong IP) have implications for the structure of production and hence for output and productivity?
- Is knowledge production (in a particular area) ergodic or path-dependent? Crudely: do we always end up in the same place and do small shocks have small or large effects in the long term?
Overlord: D-Day and the Battle for Normandy 1944 by Max Hastings
February 9th, 2008
7.5/10. Finished a few weeks ago, this is another (rather earlier) example of Hastings’ skill in writing penetrating and engaging military history, as well as his willingness to be critical of existing ‘sacred cows’. Among other things Hastings:
- Argues that the famous Mulberrys were probably a waste of time and resources.
- Shows how the Air Force’s extreme unhelpfulness (largely driven by their own ambitions and obsession with civilian bombing) was a serious handicap to the whole campaign.
- Supplies a sharp corrective regarding Patton’s reputation, pointing out that up against reasonable German opposition Patton did little better than anyone else.
- Shows clearly how it was Hitler, almost more than anyone else, who contributed to the disastrous collapse of German forces in August-October 1944 by his insistence that no retreat of any kind be considered.
- Provides many examples of the poor quality of equipment, leadership, and men, especially among the American forces, and of how these deficiencies hindered the Allied campaign. In particular, Allied tanks were almost never a match for their German counterparts, and on any occasion that Allied and German troops met on anything near equal footing the Germans won.[^1] In addition he details several clear cases of simple cowardice or unwillingness to fight among the Allied troops and/or extremely poor leadership stretching from the lowest levels to the highest. This is not to criticize (who can say what they would do in such circumstances?) and in many cases it reflects the fact that, while the Germans were a nation that had for many years been ‘obsessed’ with soldiering, the Allied troops were ‘civilians in uniform’. But it does supply a useful corrective to those rose-tinted visions supplied by films such as The Longest Day or the newsreel footage showing Allied soldiers racing past cheering French civilians.
Finally, and as an aside, while good, the book also displays the limitations of the traditional book format as a method for presenting this sort of material (i.e. military history with its strong connections between the temporal and spatial aspects of events). At least for me, the attempt to render particular troop movements, or the direction of battles, in prose never really succeeds, and one finds oneself constantly flicking back to the (rather limited) maps in an attempt to connect the descriptions of events, the failures and successes of particular thrusts, with their location, both geographically and within the overall direction of the campaign. Thus, it seems to me that this kind of subject is the sort of thing most suited to being integrated with the kind of approach proposed by the Microfacts / Weaving History project currently in the early stages of its development at the Open Knowledge Foundation. Here one would be able to marry maps with descriptions, photos with actions, time with space to provide a much clearer insight into what was going on.
[^1]: From p. 84 ff. “The American Colonel Trevor Dupuy has conducted a detailed statistical study of German actions in the Second World War. Some of his explanations as to why Hitler’s armies performed so much more impressively than their enemies seem fanciful. But no critic has challenged his essential finding that on almost every battlefield of the war, including Normandy, the German soldier performed more impressively than his opponents:
On a man for man basis, the German ground soldier consistently inflicted casualties at about a 50% higher rate than they incurred from opposing British and American troops UNDER ALL CIRCUMSTANCES. [emphasis in original] This was true when they were attacking and when they were defending, when they had local numerical superiority and when, as was usually the case, they were outnumbered, when they had air superiority and when they did not, when they won and when they lost.
It is undoubtedly true that the Germans were much more efficient than the Americans in making use of available manpower. An American army corps staff contained 55 per cent more officers and 44 per cent fewer other ranks than its German equivalent. …
Events on the Normandy battlefield demonstrated that most British or American troops continued a given operation for as long as reasonable men could. Then – when they had fought for many hours, suffered many casualties, or were running low on fuel or ammunition – they disengaged. The story of German operations, however, is landmarked with repeated examples of what could be achieved by soldiers prepared to attempt more than reasonable men could.”
Path-Dependent vs. Ergodic Systems
January 11th, 2008
Consider a metal arm fixed by a pin. If it is hung vertically then the arm, no matter where it starts, will always end up in the same position. However, if you fix the arm (perfectly) horizontally it will stay forever in its initial position. The first case is ergodic: we converge independent of the starting point to some particular configuration; while the second is ‘path-dependent’ (or dependent on initial conditions): where you end up depends crucially on where you start. The question:
Is animal/technological/historical/linguistic evolution ergodic or path dependent?
More generally, how ergodic or path-dependent are the following processes?
- (Natural) Evolution
- Technological change
- Human history
- Communication systems such as natural languages
- Other symbol systems (e.g. games or mathematics)
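The metal-arm example can be mimicked with a toy simulation. The sketch below is only an illustration under simple assumptions: the arm's angle is a single number, the vertically hung arm is modelled as relaxing towards angle 0 by a fixed fraction `pull` each step, and the arm turning in a horizontal plane is modelled as having no pull at all. The function name and parameters are invented for this example.

```python
def final_angle(start, pull, steps=200):
    """Return the arm's angle after `steps` updates.

    Each step moves the angle a fraction `pull` of the way back to 0.
    pull > 0 models the hanging arm (a restoring force plus friction);
    pull = 0 models the arm with no restoring force.
    """
    angle = start
    for _ in range(steps):
        angle -= pull * angle
    return angle


starts = [-1.0, 0.3, 2.0]
# With a restoring pull, every start ends up (almost exactly) at 0.
hanging = [final_angle(s, pull=0.1) for s in starts]
# With no pull, where you end up is exactly where you started.
horizontal = [final_angle(s, pull=0.0) for s in starts]
```

In the first case the starting point is forgotten; in the second it determines the outcome completely. The processes in the list above presumably sit somewhere between these two extremes.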
Versioned Domain Models
March 22nd, 2007
For over two years I’ve been thinking about how to have a versioned domain model, similar to the way we have versioned filesystems (e.g. subversion). Over the last few months whatever bits of free time I’ve had have gone into developing a prototype built on top of SQLObject and I’ve now got a rough and ready (but fully functional) library:
http://project.knowledgeforge.net/ckan/svn/vdm/branches/sqlobj/
A demo of how it is used is best shown by the tests:
http://project.knowledgeforge.net/ckan/svn/vdm/branches/sqlobj/vdm/dm_test.py
Why be tied to SQLObject: obviously being so directly tied to sqlobject is not such a great thing but I intentionally chose to build on it because so many people will already be writing their domain models using SQLObject.
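For readers who want a feel for the idea without reading the tests, here is a deliberately tiny, library-independent sketch in Python. It is not the vdm API and does not use SQLObject: the class and method names are hypothetical, and it stores a full snapshot per revision purely for simplicity (a real implementation would more likely store changes).

```python
import copy


class VersionedStore:
    """Toy versioned domain model: every commit produces a new revision."""

    def __init__(self):
        # Revision 0 is the empty state, much like a fresh repository.
        self._revisions = [{}]
        self._messages = ["initial empty revision"]

    @property
    def head(self):
        """Number of the latest revision."""
        return len(self._revisions) - 1

    def commit(self, changes, message=""):
        """Apply {object_id: attrs} changes (attrs=None deletes the object).

        Returns the number of the newly created revision.
        """
        state = copy.deepcopy(self._revisions[-1])
        for object_id, attrs in changes.items():
            if attrs is None:
                state.pop(object_id, None)
            else:
                state.setdefault(object_id, {}).update(attrs)
        self._revisions.append(state)
        self._messages.append(message)
        return self.head

    def get(self, object_id, revision=None):
        """Return the object's attributes at `revision` (default: head)."""
        if revision is None:
            revision = self.head
        return copy.deepcopy(self._revisions[revision].get(object_id))
```

As with a versioned filesystem, old states stay readable: after two commits that change the same object, `get(object_id, 1)` still returns the attributes as they were at revision 1.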
The Robustness Principle
February 22nd, 2007
2.10. Robustness Principle
TCP implementations will follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.
Source:
- rfc793: specification for TCP
- date: 1981
- editor: Jon Postel
- url: http://www.ibiblio.org/pub/docs/rfc/rfc793.txt
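The principle applies well beyond TCP. As a small illustration (not taken from the RFC; the functions and the header-line setting are assumptions chosen for this example), here is a Python sketch of a reader that accepts sloppy `Key: value` lines and a writer that only ever emits one canonical form.

```python
def parse_header(line):
    """Liberal in what we accept: tolerate stray whitespace, any key case,
    a missing space after the colon and either line ending."""
    key, sep, value = line.partition(":")
    if not sep or not key.strip():
        raise ValueError(f"not a header line: {line!r}")
    return key.strip().lower(), value.strip()


def format_header(key, value):
    """Conservative in what we do: always emit 'Key-Name: value' + CRLF."""
    canonical_key = "-".join(
        part.capitalize() for part in key.strip().lower().split("-")
    )
    return f"{canonical_key}: {value.strip()}\r\n"
```

For instance, `parse_header("  content-TYPE:text/plain \r\n")` yields the pair `("content-type", "text/plain")`, which `format_header` writes back out as `Content-Type: text/plain` followed by CRLF.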
Thinking about Annotation
January 17th, 2007
Annotation means the adding of comments/notes/etc to an underlying resource. For the present I’ll focus on the situation where the underlying resource is textual (as opposed to being an image, or a piece of film or some data). Various things to consider when implementing an annotation/comment system:
Addressing and atomisation: Are annotations specific to particular parts of the resource? If so, how do we store this address (relatedly: how is the resource ‘atomised’ and how do we address these atoms, or ranges of atoms)? For example, do we address by word, by character, by paragraph or by section? Do we wish to store ranges rather than a single address? Do we wish to allow a given annotation to be associated with multiple ranges/atoms?
Permissions: Are there restrictions on the creation (deletion/updating etc) of annotations?
Will the underlying resource change and, if so, are annotations intended to be robust to those changes?
Let’s concentrate on the first issue for the time being as it is the most immediately important. Furthermore, defining the ‘atoms’ of the resource sharply narrows the implementation options.
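To make the addressing questions concrete, here is one possible data model, sketched in Python. It assumes the finest-grained choice discussed above (atoms are characters, addresses are half-open character-offset ranges, and an annotation may carry several ranges); the class and field names are hypothetical and other atomisations would need a different `Range` type.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Range:
    """A run of characters: `start` is inclusive, `end` is exclusive."""
    start: int
    end: int

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError(f"invalid range: {self.start}-{self.end}")


@dataclass
class Annotation:
    """A note attached to one or more ranges of a textual resource."""
    resource_uri: str
    note: str
    ranges: list = field(default_factory=list)

    def quoted_text(self, text):
        """Return the annotated snippets, given the resource's full text."""
        return [text[r.start:r.end] for r in self.ranges]
```

Choosing larger atoms (paragraphs, sections) would simply replace the character offsets with atom identifiers; the rest of the structure stays the same.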
The Simple Case: Mod a Blog
If one is happy to have fairly large atoms (pages, or even sections of some piece of text) then implementing an annotation system can be reduced to grabbing your favourite CMS or blogging software and feeding the text in in appropriate chunks. This is often satisfactory and is a simple, low-tech solution that will pretty much work out of the box. A classic example of this approach is http://www.pepysdiary.com/ which works so well because the subject matter (Samuel Pepys’s diary) has a very obvious atomisation (namely the daily diary entries) perfectly suited to blog software (in this case Movable Type).
You can even start doing a bit of modding, for example to present recent annotations (http://www.pepysdiary.com/recent/) or to present the text plus annotations all in one piece. (Given that commentonpower seems to fall neatly into this category, with most commentable atoms of the right size for ‘blog’ entries, I wonder why they didn’t just implement it as a plugin for WordPress; perhaps it was such a simple app that it was easier to ‘roll their own’.)
Getting More Atomic
Once you want to have atoms below a size comfortable for individual html pages/blog entries, wish to allow people to comment on chunks too large for an individual page, or wish to comment on ranges, one starts to have problems with this approach. The main challenge at this point is to find some way to extract the addressing information from the client doing the annotation. Confining ourselves to the web, the challenge becomes finding a way to structure the interface and the text so that one can determine range start and end points. This is a non-trivial matter. Possible options include:
- JavaScript: in theory the selection/range objects should help us out here; unfortunately cross-browser support is patchy (Firefox as usual is excellent and IE pretty bad). If one does not want to be so precise as to get ranges, JavaScript could also be used to extract e.g. element ids.
- Copy and paste of the quote to annotate with some backend algorithm to determine the actual range. Nice and simple but not clear that one can ‘invert’ (i.e. find a unique range from a given selection) unless the selection is large.
- If addressing fairly large atoms (e.g. a paragraph or larger) one could just insert a unique piece of user interface equipment (e.g. a button or link) with each atom. Note however that this prevents support for ranges.
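The copy-and-paste option is easy to prototype, and doing so shows the ‘inversion’ problem directly. The following Python sketch is a naive backend (exact substring matching, hypothetical function names) that finds every place a pasted quote occurs and refuses to guess when the quote is ambiguous.

```python
def find_ranges(text, quote):
    """Return every (start, end) character range where `quote` occurs.

    Overlapping occurrences are included; `end` is exclusive.
    """
    if not quote:
        return []
    ranges = []
    start = text.find(quote)
    while start != -1:
        ranges.append((start, start + len(quote)))
        start = text.find(quote, start + 1)
    return ranges


def resolve_quote(text, quote):
    """Map a pasted quote to its unique range, or fail if it is ambiguous."""
    matches = find_ranges(text, quote)
    if len(matches) != 1:
        raise ValueError(
            f"quote matches {len(matches)} ranges; a longer selection is needed"
        )
    return matches[0]
```

In the text "To be, or not to be" the quote "be" matches twice and is rejected, whereas "or not" resolves to a single range. A real backend would probably also want to normalise whitespace before matching, since pasted text rarely matches the source character for character.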
Separating Data and Presentation
Whatever one chooses to do it does seem sensible to clearly separate data and presentation. This is particularly important when there is so much uncertainty over the user interface. In particular, it would be good to clearly specify the annotation format and implement a programmatic interface to it independent of the standard (human) user interface. That way it is easy to switch interfaces (or have multiple ones). Given that an annotation is essentially just a comment it would seem sensible to try to reuse an existing format such as Atom (or RSS) for the machine interface to the comment store. [marginalia] already had such a format based on Atom. I’ve recently reimplemented a stripped-down version of this format for the annotation store backend in Python in preparation for adding annotation support to the openshakespeare web interface, see:
http://project.knowledgeforge.net/shakespeare/svn/annotater/trunk/
Of course as discussed above this isn’t quite as simple as it looks as your user interface can constrain what you can and can’t store (using a blog approach you can’t store ranges and from what I have read getting reliable character offsets is problematic). Nevertheless it seems the best place to start.
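To give a flavour of what an Atom-based machine interface might carry, here is a small Python sketch that serialises one annotation as an Atom entry using only the standard library. It is not the marginalia format and not the code in the repository above: the `anno` namespace URI, the `range` element and the function name are all hypothetical, and the entry is not checked against the full Atom requirements (for example it has no author).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ANNO_NS = "http://example.org/annotation"  # hypothetical extension namespace

ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("anno", ANNO_NS)


def annotation_to_atom(entry_id, resource_url, note, start, end, updated):
    """Serialise one annotation as an Atom <entry> string.

    `start`/`end` are character offsets into the annotated resource and
    `updated` is an RFC 3339 timestamp string.
    """
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = note[:40]
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # Link back to the resource being annotated.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="related", href=resource_url)
    ET.SubElement(entry, f"{{{ATOM_NS}}}content", type="text").text = note
    # Extension element (an assumption) carrying the addressing information.
    ET.SubElement(entry, f"{{{ANNO_NS}}}range", start=str(start), end=str(end))
    return ET.tostring(entry, encoding="unicode")
```

Keeping the range in a separate extension element means a plain feed reader can still display the comment, while an annotation-aware client can read the addressing information.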
Technology and History
May 28th, 2006
Found in a review by Garry Wills of Taylor Branch’s At Canaan’s Edge: America in the King Years, 1965-1968 in the NYRB (2006-04-06, p. 20):
It is amazing how Branch can marshal so much material along so many tracks, moving it ahead stage by stage in coordination with King’s actions. Then I saw Branch in a three-hour television interview with C-SPAN and learned part of his secret. He showed the interviewer his computer with its expertly programmed chronological record of all the information he had acquired from so many sources — over 17,000 items arranged year by year, day by day. The book probably could not have been written — surely not in so relatively short a time — without the computer.
The Invention of Symbols
November 3rd, 2005
We believe that we invent symbols. The truth is that they invent us.
Gene Wolfe, The Book of the New Sun.


