Author “Significance” From Catalogue Data
November 5th, 2009
Continuing the series of posts on analyzing catalogue data, here are some stats on author “significance” as measured by the number of book entries (‘items’) for that author in the Cambridge University Library catalogue from 1400-1960 (there being 1m+ such entries).
I’ve termed this measure “significance” (with intentional quotes) as it co-mingles a variety of factors:
- Prolificness: how many distinct works an author produced (since each work will usually get an item)
- Popularity: this influences how many times the same work gets reissued as a new ‘item’ and the library’s decision to keep the item
- Merit: as for popularity, this influences both reissues and the library’s decision to keep the item
The following table shows the top 50 authors by “significance”. Some of the authors aren’t real people but entities such as “Great Britain. Parliament” and for our purposes can be ignored. What’s most striking to me is how closely the listing correlates with the standard literary canon. Other features of note:
- Shakespeare is number 1 (2). Here and below, the figure in brackets is the rank in the table, which also counts the corporate entries.
- Classical (Latin/Greek) authors are well represented, with Cicero at number 2 (4) and Horace at 5 (9), followed by Homer, Euripides, Ovid, Plato, Aeschylus, Xenophon, Sophocles, Aristophanes and Euclid.
- Surprise entries (from a contemporary perspective): Hannah More, Oliver Goldsmith and Gilbert Burnet (perhaps accounted for by his prolificity).
- Also surprising are the limited entries from the 19th century UK, with only Scott (26), Dickens (28) and Byron (41).
| Rank | No. of Items | Name |
|---|---|---|
| 1 | 3112 | Great Britain. Parliament. |
| 2 | 1154 | Shakespeare, William |
| 3 | 1076 | Church of England. |
| 4 | 973 | Cicero, Marcus Tullius |
| 5 | 825 | Great Britain. |
| 6 | 766 | Catholic Church. |
| 7 | 721 | Erasmus, Desiderius |
| 8 | 654 | Defoe, Daniel |
| 9 | 620 | Horace |
| 10 | 599 | Aristotle |
| 11 | 547 | Voltaire |
| 12 | 539 | Virgil |
| 13 | 527 | Swift, Jonathan |
| 14 | 520 | Goethe, Johann Wolfgang Von |
| 15 | 486 | Rousseau, Jean-Jacques |
| 16 | 479 | Homer |
| 17 | 444 | Milton, John |
| 18 | 388 | Sterne, Laurence |
| 19 | 387 | England and Wales. Sovereign (1660-1685 : Charles II) |
| 20 | 386 | Euripides |
| 21 | 372 | Ovid |
| 22 | 358 | Goldsmith, Oliver |
| 23 | 358 | Plato |
| 24 | 351 | Wang |
| 25 | 349 | Alighieri, Dante |
| 26 | 338 | Scott, Walter (Sir) |
| 27 | 326 | More, Hannah |
| 28 | 322 | Dickens, Charles |
| 29 | 315 | Aeschylus |
| 30 | 304 | Burnet, Gilbert |
| 31 | 302 | Luther, Martin |
| 32 | 295 | Dryden, John |
| 33 | 290 | Xenophon |
| 34 | 280 | Sophocles |
| 35 | 262 | Pope, Alexander |
| 36 | 259 | Fielding, Henry |
| 37 | 258 | Li |
| 38 | 250 | Calvin, Jean |
| 39 | 248 | Zhang |
| 40 | 247 | Aristophanes |
| 41 | 247 | Byron, George Gordon Byron (Baron) |
| 42 | 247 | Bacon, Francis |
| 43 | 247 | Chen |
| 44 | 245 | Terence |
| 45 | 241 | Euclid |
| 46 | 235 | Augustine (Saint, Bishop of Hippo.) |
| 47 | 232 | Burke, Edmund |
| 48 | 223 | Johnson, Samuel |
| 49 | 222 | Bunyan, John |
| 50 | 222 | De la Mare, Walter |
Top 50 authors based on CUL Catalogue 1400-1960
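For illustration, here is a minimal sketch of how a ranking like the one above can be produced from a flat list of catalogue items. It is not the code from the repository: the function name `rank_authors`, the sample records and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter

def rank_authors(item_authors, top=50):
    """Rank authors by number of catalogue items.

    item_authors: one author name per catalogue item (hypothetical input format).
    Returns (rank, no_of_items, name) rows, most items first.
    """
    counts = Counter(item_authors)
    # Ties are broken alphabetically here; that is an arbitrary choice.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, n, name) for rank, (name, n) in enumerate(ordered[:top], start=1)]

# Made-up records, one entry per item.
items = ["Shakespeare, William", "Horace", "Shakespeare, William", "Homer", "Horace", "Shakespeare, William"]
for row in rank_authors(items, top=3):
    print(row)
```

Corporate entries such as “Great Britain. Parliament.” would need to be filtered out of the input (or skipped when reading the output) to get a people-only ranking.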
The other thing we could look at is the overall distribution of items per author and how it varies with rank (a classic “is it a power law” question). Below are the histogram (NB log scale for counts) together with a plot of rank against count (which equates, v. crudely, to a transposed plot of the tail of the histogram …). In both cases a power law looks (!) like a reasonable fit given the (approximate) linearity, but this should be backed up with a proper K-S test.
Histogram of items-per-author distribution (log-log)
Rank versus no. of items (log-log)
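As a rough idea of what the K-S check could look like, the sketch below fits a power-law exponent by maximum likelihood and then measures the Kolmogorov-Smirnov distance between the data and the fitted curve. It rests on several assumptions: it uses the continuous approximation even though item counts are integers, the cut-off `xmin` is chosen by hand, and the function names and sample counts are made up for the example. The distance on its own is not a significance test; turning it into a p-value would need a comparison against synthetic samples drawn from the fitted law.

```python
import math

def fit_alpha(counts, xmin):
    """MLE of alpha for p(x) ~ x**(-alpha) on x >= xmin (continuous approximation)."""
    tail = sorted(x for x in counts if x >= xmin)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one count above xmin")
    return 1.0 + len(tail) / log_sum, tail

def ks_distance(tail, alpha, xmin):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    n = len(tail)
    gap = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at the i-th sorted value.
        gap = max(gap, abs(model - i / n), abs(model - (i + 1) / n))
    return gap

# Made-up items-per-author counts, purely for illustration.
counts = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
alpha, tail = fit_alpha(counts, xmin=1)
print(alpha, ks_distance(tail, alpha, xmin=1))
```

A small distance says the fitted curve tracks the data closely above `xmin`; it does not rule out other heavy-tailed distributions fitting equally well.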
TODO
- K-S tests
- Extend data to present day
- Check against other catalogue data
- Look at occurrence of people in title names
- Look at when items appear over time
Colophon
Code to generate the table and graphs is in the open Public Domain Works repository, specifically the method ‘person_work_and_item_counts’ in this file: http://knowledgeforge.net/pdw/hg/file/tip/contrib/stats.py



January 27th, 2010 at 9:57 pm
I heard your talk at Cambridge UL today, so was interested to read more about the project. I suspect prolificity skews the data quite heavily: two of the authors I was surprised to see missing were Chaucer and Jane Austen. As for the poor showing for the C19 stuff, is it possible this was influenced by some books only being recorded in the supplementary catalogues from 1800 onwards?
January 28th, 2010 at 9:34 am
magistra: your point about prolificity is well taken. As I mentioned at the start of the post, simple counts co-mingle a variety of factors.
The obvious way to deal with the prolificity issue would be to separate the counts into “number of distinct works” and “counts per work”. Doing this requires us to distinguish Publications (“Items”) from “Works”. Unfortunately this is a tricky issue and one we have had only mixed success with (see previous posts in the EUPD series such as this one for more on this).
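To make the split concrete, here is a minimal sketch that assumes the hard part is already done, i.e. that every item carries a reliable work identifier. The function name `works_and_items`, the `(author, work_id)` record format and the sample data are hypothetical and only there for illustration.

```python
from collections import defaultdict

def works_and_items(records):
    """Summarise items per author as (distinct works, total items, items per work).

    records: iterable of (author, work_id) pairs, one per catalogue item.
    The work_id is assumed to exist; deriving it is the tricky part.
    """
    items_per_work = defaultdict(lambda: defaultdict(int))
    for author, work_id in records:
        items_per_work[author][work_id] += 1
    summary = {}
    for author, works in items_per_work.items():
        n_items = sum(works.values())
        summary[author] = (len(works), n_items, n_items / len(works))
    return summary

# Made-up records: three items of one work, one item of another.
records = [
    ("Austen, Jane", "emma"),
    ("Austen, Jane", "emma"),
    ("Austen, Jane", "emma"),
    ("Austen, Jane", "persuasion"),
]
print(works_and_items(records))
```

The first number in each tuple tracks prolificity, while the last one (items per work) is closer to the popularity and merit side of the measure.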
Checking on the two authors you mention reveals Chaucer clocking in at 149 and Austen at 112. I would point out as well that this analysis is only based on the catalogue up to 1960 (Austen’s popularity seems to have grown particularly rapidly in recent times).
On the supplementary catalogue point, could you give me a bit more information about what it would mean if material were only recorded in supplementary catalogues? Would it mean that it wouldn’t be in the main (digitized) catalogue I am using?