Web-Based Annotation

December 19th, 2006

We intend to add annotation/commentary support to the open shakespeare web demo either in this release or the next. As a first step I’ve been looking to see what (open-source) web-based annotation systems are already out there. Below is a list of what I’ve been able to find so far (if you know of more please post a comment). After examining several of these in some detail, the one we’re going to try out properly is marginalia (if you’re interested, our current efforts to do this, including writing a python wsgi annotation service backend, can be found here in the subversion repository).

  1. stet: javascript annotation system used for gpl v3 comments system

  2. commentary: javascript based wsgi middleware developed by ian bicking

    • http://pythonpaste.org/commentary/
    • Rather hacked together (apparently he coded it in a week). Had problems getting it working locally and no documentation to help in adaptation. Seems to be unmaintained (demo site is currently down) which is perhaps not surprising given how many other projects Ian has on the go.
    • One nice feature is that you don’t seem to have to mess with the underlying web pages you want to add comments to (this only works if you are sitting on top of another wsgi application)
  3. marginalia: javascript library and spec for adding web annotation to pages

  4. annotea: W3C project based on RDF

    • http://www.w3.org/2001/Annotea/
    • Been around a long time and now seems to be inactive
    • Server and client support rather lacking. No simple interface based on, e.g., javascript — you have to write a special client yourself — which is a major drawback
    • That said the protocol is well-documented and so writing a client (or a server) shouldn’t be that hard (other than having to mess around with rdf in javascript …)
    • The Schema seems reasonable
    • xpointer based which according to the marginalia site is a problem

UPDATE (2008-06): a new version is available (v1.2): http://www.rufuspollock.org/2008/06/23/markdown2latex-mkdn2latex-12/

Over the last year I’ve written quite a few papers using markdown plus asciimathml. While this is great for web publication (and editing) and gives me lots of styling freedom via css, it doesn’t produce output that’s as nice as that produced by latex, especially in paginated form (latex mathematics support is also currently better than that obtained from asciimathml or latexmathml).

Unable to find any python code that would do what I wanted, I played around for a couple of hours with the python-markdown script until I got something functional. After a few weeks of use, which has allowed me to iron out the bugs and make several improvements, I feel the script is now ready for public release. Hope people find it useful.

Download

Get it from: http://project.knowledgeforge.net/okftext/svn/trunk/python/mkdn2latex.py

(You can also check it out using subversion from the same url if you want)

For the script to function you will also need to install the python-markdown module v1.5 (make sure you install it under the name markdown.py).

Usage

The following will print the latex output to the console (standard out):

 $ mkdn2latex.py path-to-markdown-file.mkd

To convert a markdown file straight to a latex output file do:

 $ mkdn2latex.py path-to-markdown-file.mkd > path-to-output-file.ltx

NB: As provided the script expects mathematics in your markdown file to be delimited with ‘$\$’ (this should be dollar dollar — the slash is there to stop this being rendered as maths in the blog) as opposed to the standard asciimathml delimiters of ‘`’ or ‘$’.

Whenever I’ve had a few spare minutes over the last couple of months I’ve been hacking away on svnrepo, a pythonic API to local subversion repositories, and it is now robust enough to warrant a 0.1 release. svnrepo is (and was intended to be) very small, just a single module, that wraps the python subversion bindings for repository access to make them simpler to use and more object-oriented. At present the module requires subversion >= 1.3 but I’m hoping to scale that dependency back in future releases.

Getting it

The module is Open Source software (MIT-licensed) and you can either:

  1. Download it directly from: http://www.rufuspollock.org/code/svnrepo/svnrepo.py

  2. Or get it from the python package index. If you are using setuptools just do:

    $ easy_install svnrepo

What it looks like

There are unit tests at:

http://www.rufuspollock.org/code/svnrepo/svnrepo_test.py

They are pretty good at demonstrating how to use the API, but for the sake of demonstration here is a short example. Assume that you have an existing subversion repository at REPOSPATH.


from svnrepo import *
REPOSPATH = ...
repos = Repository(REPOSPATH)

history = repos.history('/')
for revision in history:
    print revision

rev = repos.get_revision() # get the youngest revision
print rev.log # the log message of the revision
print rev.date

# get a node
rootdir = rev.get_node('/')
print rootdir.is_dir()
print rootdir.list_dir()

# create a new revision
newrev = repos.new_revision()
newrev.log = 'My new revision'
newrev.author = 'me'
fs = newrev.file_system
filepath = 'tmp.txt'
newfile = fs.make_file(filepath)

text = 'nothing ever exists entirely alone'
newfile.write(text)

propname = 'copyright'
propval = 'nemo'
newfile.set_property(propname, propval)

newrev.commit()

Having looked around for a while without success for something that would spit out csv files as ascii tables I decided to hack something together. The result is a small python script, csv2ascii.py. It is currently fairly crude (for example, it just truncates cell text which is too long), but I hope I’ll have some more time to improve it soon.

Example

Suppose you had the following in a file called example.csv:

"YEAR","PH","RPH","RPH_1","LN_RPH","LN_RPH_1","HH","LN_HH"
1971,7.8523,43.9168,42.9594,3.7822,3.7602,16185,9.691843   
1972,10.5047,55.1134,43.9168370988587,4.0093,3.7822,16397,9.704855

Running:

 $ ./csv2ascii.py example.csv

Would result in:

+------+------+------+------+------+------+------+------+
| YEAR |  PH  | RPH  |RPH_1 |LN_RPH|LN_RPH|  HH  |LN_HH |
+------+------+------+------+------+------+------+------+
| 1971 |7.8523|43.916|42.959|3.7822|3.7602|16185 |9.6918|
+------+------+------+------+------+------+------+------+
| 1972 |10.504|55.113|43.916|4.0093|3.7822|16397 |9.7048|
+------+------+------+------+------+------+------+------+
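
For anyone curious how little is needed to get output of this shape, here is a minimal sketch in Python 3. It is not the csv2ascii.py script itself: the function name csv_to_ascii and the fixed cell width of 6 are assumptions made for this illustration, and like the script it simply truncates cells that are too long.

```python
import csv
import io

def csv_to_ascii(text, width=6):
    """Render CSV text as an ASCII table with fixed-width, centred cells."""
    rows = list(csv.reader(io.StringIO(text)))
    ncols = max(len(row) for row in rows)
    border = "+" + "+".join("-" * width for _ in range(ncols)) + "+"
    lines = [border]
    for row in rows:
        # Strip stray whitespace, truncate over-long cells, centre the rest.
        cells = [cell.strip()[:width].center(width) for cell in row]
        # Pad short rows so every line has the same number of columns.
        cells += [" " * width] * (ncols - len(cells))
        lines.append("|" + "|".join(cells) + "|")
        lines.append(border)
    return "\n".join(lines)

example = '''"YEAR","PH","RPH","RPH_1","LN_RPH","LN_RPH_1","HH","LN_HH"
1971,7.8523,43.9168,42.9594,3.7822,3.7602,16185,9.691843
1972,10.5047,55.1134,43.9168370988587,4.0093,3.7822,16397,9.704855
'''
table = csv_to_ascii(example)
print(table)
```

A fixed width keeps the code short at the cost of losing data in long cells; computing each column's width from its longest cell would be the obvious next improvement.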

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is growing rapidly as the standard web protocol for making metadata, primarily bibliographic information, available online for programmatic access, and I’ve long meant to write something that would allow me to pull information down from remote repositories into my local bibliographic database automatically (it would save an awful lot of typing).

I’ve mentioned the oaipmh package provided by infrae.com before however the documentation they provide has got rather out of date and though I’ve made a few attempts I’ve never quite been able to get it to work. However after a bit more effort recently with the newer v2.0+ of the package I’ve managed to get something basic working which you can find at http://www.rufuspollock.org/code/oaipmh/demo.py.

I should note that my main interest, at least at present, is in the client-side, not the server-side of oaipmh so the code is oriented in that direction — as I mentioned above my aim is to automatically pull down article metadata into my local bibliographic system from sites such as repec (repec oai url).
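
To give a feel for what the client side involves, here is a minimal sketch in Python 3 using only the standard library. It does not use the infrae oaipmh package and is not the demo.py mentioned above: the function names, the example.org base url and the SAMPLE response are all made up for illustration. Only the ListRecords verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request url for an OAI-PMH repository."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

def parse_records(xml_text):
    """Yield (identifier, title, creators) for each record in a response."""
    root = ET.fromstring(xml_text)
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = next((e.text for e in record.iter(DC_NS + "title")), None)
        creators = [e.text for e in record.iter(DC_NS + "creator")]
        yield identifier, title, creators

# Hypothetical, hand-written response used only for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
for rec in parse_records(SAMPLE):
    print(rec)
```

A real harvester would also need to fetch the url and follow resumption tokens for large result sets, which is exactly the kind of plumbing a dedicated package takes care of.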

WSGI Middleware

September 28th, 2006

In a previous tutorial we just wrote a basic ‘Hello World’ application in WSGI. At the end of it you might, rightly, have been wondering what the point of WSGI is. After all, you could have written that ‘Hello World’ app using plain CGI (or anything else for that matter). In this tutorial we are going to start answering that question by taking a look at WSGI middleware and writing a simple piece of middleware ourselves.

A Simple Example

Here is a simple piece of middleware that adds authentication based on the remote address of the client (this tutorial and its code are available in raw form at http://www.rufuspollock.org/code/wsgi/):


from wsgiref.simple_server import make_server, demo_app

class AuthenticationMiddleware:
    """A modified version of an original example at:
    http://isapi-wsgi.python-hosting.com/wiki/WSGI-Gateway-or-Glue
    """

    def __init__(self, app, allowed_addresses):
        """
        @param app: the WSGI app that comes after us
        @param allowed_addresses: list of remote addresses from which to allow
                                  access
        """
        self.app = app
        self.allowed_addresses = allowed_addresses

    def __call__(self, environ, start_response):
        """The standard WSGI interface"""
        addr = environ.get('REMOTE_ADDR','UNKNOWN') 

        if addr in self.allowed_addresses: # pass through to the next app
            return self.app(environ, start_response)
        else: # put up a response denied
            start_response(
                '403 Forbidden', [('Content-type', 'text/html')])
            return ['You are forbidden to view this resource']

addresses = [ '127.0.0.1' ]
simple_app_with_auth = AuthenticationMiddleware(demo_app, addresses)

if __name__ == '__main__': 

    httpd = make_server('', 8000, simple_app_with_auth)
    print "Serving HTTP on port 8000..."

    # Respond to requests until process is killed
    httpd.serve_forever()

The Basic Idea

As explained in [pep-333] the basic idea of middleware is of something that ‘plays both sides’:

Note that a single object may play the role of a server with respect to some application(s), while also acting as an application with respect to some server(s). Such “middleware” components can perform such functions as:

  • Routing a request to different application objects based on the target URL, after rewriting the environ accordingly.
  • Allowing multiple applications or frameworks to run side-by-side in the same process
  • Load balancing and remote processing, by forwarding requests and responses over a network
  • Perform content postprocessing, such as applying XSL stylesheets

A diagram helps:

             WSGI SERVER

               V   A
               V   A
               |   |
               |   |
      +---------------------+
      |        |   |        |
      |   +-------------+   |
      |   |    V   A    |   |
      |   |   +-----+   |   |
      |   |   | APP |   |   |
      |   |   +-----+   |   |
      |   | MIDDLEWARE1 |   |
      |   +-------------+   |
      |     MIDDLEWARE2     |
      +---------------------+

   The WSGI Application + Middleware 'Onion'

Basically middleware wraps an underlying wsgi application and then presents itself as the new wsgi application to external callers. In python code the above would look like:

core_app = SomeWsgiApplication()
# remember the middleware is itself a wsgi application
wrapped_once = Middleware1(core_app)
# wrap the new wsgi application!
wrapped_twice = Middleware2(wrapped_once)

# alternatively we could do it all in one
wrapped = Middleware2(Middleware1(core_app))
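
To make the onion concrete, here is a small runnable sketch. It is written for Python 3, so the response body is bytes rather than str, unlike the Python 2 code elsewhere in this tutorial. AddHeader and call are hypothetical names invented for this illustration. Each layer adds one response header on the way out, so you can see the order in which the layers are passed through.

```python
from wsgiref.util import setup_testing_defaults

def core_app(environ, start_response):
    """The innermost application: the 'APP' box in the diagram."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"core\n"]

class AddHeader:
    """Hypothetical middleware: adds one response header, then delegates."""
    def __init__(self, app, name, value):
        self.app, self.name, self.value = app, name, value

    def __call__(self, environ, start_response):
        def wrapped_start(status, headers, exc_info=None):
            # Append our header before passing the call outward.
            extra = [(self.name, self.value)]
            return start_response(status, headers + extra, exc_info)
        return self.app(environ, wrapped_start)

# Wrap twice, exactly as in the one-line form above.
wrapped = AddHeader(AddHeader(core_app, "X-Layer", "inner"), "X-Layer", "outer")

def call(app):
    """Play the role of the server: build an environ and collect the response."""
    environ = {}
    setup_testing_defaults(environ)
    seen = {}
    def start_response(status, headers, exc_info=None):
        seen["status"], seen["headers"] = status, headers
    body = b"".join(app(environ, start_response))
    return seen["status"], seen["headers"], body

status, headers, body = call(wrapped)
print(status, headers, body)
```

The request travels inward (outer, inner, core) while the response headers travel outward, which is why the inner layer's header ends up in the list before the outer one.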

Remarks

Middleware is useful because it dramatically increases the possibilities for using standard web application plumbing — any piece of middleware can now be plugged together very easily with either other middleware or an application.

Middleware is usually one of three types:

  • pre-processors
  • post-processors
  • those that do both (rare)

Examples of pre-processors are:

  • Authenticators (including session management)
  • Dispatchers including proxies and controllers

Examples of post-processors:

In general, pre-processors are a little simpler because they don’t have to deal with the ‘chunking’ aspect of WSGI (a WSGI application returns an iterable rather than just a single buffer so as to allow ‘chunking’ of output; this is useful, for example, when streaming large files. See the ‘Buffering and Streaming’ section in PEP 333 for more information).
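
As a rough illustration of what dealing with chunking means for a post-processor, here is a minimal sketch in Python 3 (bytes bodies). UpperCaseMiddleware and streaming_app are hypothetical names made up for this example. The middleware transforms each chunk as it arrives rather than joining the whole response into one buffer, and calls close() on the wrapped iterable if it has one, as PEP 333 asks.

```python
class UpperCaseMiddleware:
    """Hypothetical post-processor: upper-cases each chunk of the response."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        chunks = self.app(environ, start_response)
        return self._process(chunks)

    @staticmethod
    def _process(chunks):
        try:
            for chunk in chunks:
                # Work chunk by chunk so streaming still works.
                yield chunk.upper()
        finally:
            # Give the wrapped iterable a chance to release resources.
            close = getattr(chunks, "close", None)
            if close is not None:
                close()

def streaming_app(environ, start_response):
    """An application that produces its output in two chunks."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    yield b"hello "
    yield b"world\n"

def dummy_start_response(status, headers, exc_info=None):
    return None

result = list(UpperCaseMiddleware(streaming_app)({}, dummy_start_response))
print(result)
```

Upper-casing was chosen because it leaves the length of the body unchanged; a post-processor that changes the length would also have to adjust or drop any Content-Length header.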

‘Hello World’ with WSGI

August 31st, 2006

I’ve been seeing a lot of talk about WSGI (Web Server Gateway Interface) and its benefits over the last six months or so and I’ve been meaning to take a look — not least because of the potential to use wsgi middleware to make a nice front-controller for KForge.

First Stop

A quick google takes me to: http://www.wsgi.org/wsgi. I’m looking to just write the proverbial ‘hello world’ app at this stage. Most of the references are a bit too high level (or complex) for me (though this one is an exception). So here I’m going to detail my experiences of familiarizing myself with wsgi by writing the classic ‘hello world’ app (if you are looking to do something more sophisticated with wsgi check out a toolkit such as paste, or pylons, the framework built on top of paste).

Hello World

1. Install wsgiref

wsgiref is the wsgi reference implementation that is now part of the python 2.5 standard library. If you are running a python version earlier than 2.5 you will want to do:

  $ sudo easy_install wsgiref

2. Get a web server

We’ll use the wsgiref simple server as detailed in the docs (if you want to use a ‘proper’ webserver see the section below on making your wsgi app available via fastcgi). Create a python module, simpletest.py say, and insert:

  from wsgiref.simple_server import make_server, demo_app

  httpd = make_server('', 8000, demo_app)
  print "Serving HTTP on port 8000..."

  # Respond to requests until process is killed
  httpd.serve_forever()

  # Alternative: serve one request, then exit
  ##httpd.handle_request()

3. Run it

Start the server:

  $ python simpletest.py

Then visit http://localhost:8000/

Bingo! We’ve got our first working wsgi app (demo_app should output ‘Hello world!’ followed by a list of variable values).

4. Make our own Hello World app

We haven’t yet written anything ourselves — we’re just using the demo_app bundled with wsgiref. So change simpletest.py to be:

  def simple_app(environ, start_response):
      """Simplest possible application object""" 
      status = '200 OK'
      response_headers = [('Content-type','text/plain')]
      start_response(status, response_headers)
      return ['My Own Hello World!\n']

  from wsgiref.simple_server import make_server, demo_app

  httpd = make_server('', 8000, simple_app)
  print "Serving HTTP on port 8000..."

  # Respond to requests until process is killed
  httpd.serve_forever()

Run this and visit http://localhost:8000/ and you should see a blank page containing ‘My Own Hello World!’.

5. Using a Class

Finally for completeness here’s the same application but done as a class:

  class SimpleApp:
      """Produce the same output, but using a class
      """
      def __init__(self, environ, start_response):
          self.environ = environ
          self.start = start_response

      def __iter__(self):
          status = '200 OK'
          response_headers = [('Content-type','text/plain')]
          self.start(status, response_headers)
          yield 'My Own Hello world!\n'

  from wsgiref.simple_server import make_server, demo_app

  # httpd = make_server('', 8000, simple_app)
  # the same but using a class
  httpd = make_server('', 8000, SimpleApp)

  print "Serving HTTP on port 8000..."

  # Respond to requests until process is killed
  httpd.serve_forever()
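
If you want to see what the server is actually doing with these two applications, you can call them by hand. The sketch below is written for Python 3 (so the bodies are bytes) and call_app is a hypothetical helper invented for this illustration. It builds a minimal environ, supplies its own start_response and joins whatever the application returns.

```python
from wsgiref.util import setup_testing_defaults

def simple_app(environ, start_response):
    """Function form of the application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"My Own Hello World!\n"]

class SimpleApp:
    """Class form: calling the class creates the iterable response."""
    def __init__(self, environ, start_response):
        self.environ = environ
        self.start = start_response

    def __iter__(self):
        self.start("200 OK", [("Content-Type", "text/plain")])
        yield b"My Own Hello World!\n"

def call_app(app):
    """Do by hand what a WSGI server does for each request."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required CGI-style keys
    captured = []
    def start_response(status, headers, exc_info=None):
        captured.append((status, headers))
    body = b"".join(app(environ, start_response))
    return captured[0][0], body

print(call_app(simple_app))
print(call_app(SimpleApp))
```

Both forms look identical from the server's point of view: something callable with (environ, start_response) that hands back an iterable of body chunks.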

Serving a WSGI App via FastCGI

This section explains how to serve your WSGI app via FastCGI (other methods using scgi or even cgi take an almost identical approach).

1. Install a fastcgi interface to wsgi:

Use flup which provides a fastcgi and scgi interface to wsgi:

  $ sudo easy_install flup

2. Install a simple standalone fastcgi implementation:

  1. Download http://www.saddi.com/software/py-lib/py-lib/fcgi.py
  2. Install this somewhere you can import it as import fcgi

3. Attach your wsgi application to this fcgi server

Create a python file (server.fcgi) and paste in the following:

  #!/usr/bin/env python
  from myapplication import app # Assume app is your WSGI application object
  from fcgi import WSGIServer
  WSGIServer(app).run()

Now you can just point your webserver at this file (make sure you’ve configured it to handle .fcgi files using fastcgi) and your app is available via fastcgi.

References

debian/ubuntu: apt-get install wdiff

macosx: fink install wdiff

From apt-cache show:

‘wdiff’ is a front-end to GNU diff. It compares two files, finding which words have been deleted or added to the first in order to create the second. It has many output formats and interacts well with terminals and pagers (notably with ‘less’). ‘wdiff’ is particularly useful when two texts differ only by a few words and paragraphs have been refilled.
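
To illustrate the idea of a word-level diff, here is a minimal sketch in Python 3 using the standard library’s difflib. It is not how wdiff itself is implemented, and word_diff is a hypothetical name. It only mimics wdiff’s default markers, [-...-] for deleted words and {+...+} for added ones, and it ignores the original line breaks.

```python
import difflib

def word_diff(old, new):
    """Return a wdiff-style description of how to turn old into new."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)

diff = word_diff("the quick brown fox", "the slow brown fox jumps")
print(diff)
```

Because the comparison is done on words rather than lines, re-wrapping a paragraph produces no differences at all, which is the case the description above singles out.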

BSA Study on Software Piracy

December 14th, 2005

The BSA (Business Software Alliance) has released a study (prepared by IDC) along with a heavy PR blitz claiming all kinds of absolutely astronomical benefits from reducing software piracy.

There are immediate reasons to be suspicious. While the study is touted as an ‘economic’ analysis of the benefits of combatting piracy, no self-respecting economist I know of would come out with an ‘estimate’ in the way the BSA does because we just don’t have the data to do that. At the same time there are no difficulties in pointing out multiple errors in the BSA’s estimates which indicate they are almost certainly wildly too high.

Of course we should remember, as we get hot under the collar about this latest piece of misinformation, that for those of us opposed to software monopolies the best thing right now for the developing world would be a really massive crackdown on piracy with draconian enforcement. That way people would stop with the free candy (MS windows + office) and get into linux before it is too late :).

Below is a quick analysis of the paper pointing out some of the bigger howlers.

1. they claim with a section title: “Lower Software Piracy Produces Higher IT Benefits” (p.4) and then produce as their evidence the fact that:

“A country’s software piracy rate is a key differentiator among countries that enjoy vast IT economic benefits and those that have yet to unlock them …

In general, there is an inverse relationship between a country’s software piracy rate and the size of its IT sector as a percentage of GDP. Thus, the lower the software piracy rate, the higher the IT related benefits, including IT-generated taxes.”

[+ nice accompanying graph and repetition innumerable times in the section]

This is outrageous and is just the old analytical fallacy of equating correlation with causation: just because more piracy is associated with a smaller IT industry does not mean piracy causes a smaller IT industry and that reducing piracy will increase the size of the IT industry. In fact it seems more likely that causation goes the other way (smaller IT sector -> more piracy) or that the correlation is simply the result of omitted variables: developing countries have both a smaller IT sector (e.g. due to education levels etc.) and more piracy (because commercial software prices are higher, enforcement is worse etc.). Putting these two facts together but leaving out conditioning on development will give you the resulting correlation.

2. p.13 Summary of IDC’s methodology:

‘the theoretical losses from piracy in terms of revenue to software vendors, software-related revenues to services firms, and software-related revenues to channel players. Employment losses are calculated from revenue losses, and only apply to employment in the IT industry, not IT professionals in end-user organizations (although IDC believes there is some impact.) Tax revenue losses are calculated from revenue losses (VAT and corporate income tax) and employment losses (income and social taxes). The software losses are based on the piracy rate and equal the value of software installed and not paid for, adjusted by IDC’s software analysts to account for software in a country (such as enterprise and server software, not measured in the annual BSA study).’

So there it is baldly stated, they calculate societal loss by:

a) assuming that every pirated copy would have been purchased if piracy were prevented (blatantly false)

b) taking no account of general equilibrium issues: i.e. that people have a fixed budget so that if you suddenly spend a load more on ‘commercial’ software (because you pay for everything) that means less to spend on other goods (such as bespoke software, fridges, apples etc)

This suggests:

  1. They have never heard of a demand curve (i.e. that people buy different amounts depending on price!)

  2. Sampling effect: ‘piracy’ is one way for people to sample a product before buying. Even in the developed world we often see people using a friend’s copy at home but the purchased version at work which is a similar situation (Let’s remember that the biggest shareware company in the world is microsoft)

  3. That estimates of welfare/revenue losses (at least on this scale) have to be done on a society wide basis so that you include the displacement effects from spending more on non-pirated software
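
The demand curve point can be made concrete with a toy calculation. Everything in the sketch below is hypothetical: the price, the willingness-to-pay figures and the function names are invented purely to show the shape of the argument, not to estimate anything.

```python
def naive_loss(num_pirated_copies, price):
    """BSA-style estimate: every pirated copy counts as a lost sale."""
    return num_pirated_copies * price

def demand_based_loss(willingness_to_pay, price):
    """Count only those users who would actually pay the asking price."""
    buyers = [w for w in willingness_to_pay if w >= price]
    return len(buyers) * price

price = 100
# Hypothetical maximum amounts seven current 'pirates' would be willing to pay.
willingness_to_pay = [5, 10, 20, 40, 80, 120, 150]

naive = naive_loss(len(willingness_to_pay), price)
realistic = demand_based_loss(willingness_to_pay, price)
print(naive, realistic)
```

With these made-up numbers the naive method reports 700 in lost revenue while only 200 would actually be recovered, and even that ignores the displacement and sampling effects listed above.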

Finally and most importantly, the whole study gets its analysis wrong from the very start. For information goods, once the good is made it is optimal for societal welfare to distribute at marginal cost (i.e. at 0 for software).

Thus piracy of e.g. MS windows should be welfare improving for the world as a whole and definitely welfare improving for every developing country. Of course this may be bad for MS (though that depends on sampling effect).

Now the usual trade-off (and the reason we have IP) is that if it was known in advance that everything would be sold at marginal cost there would not be the money to pay for the original development. Thus we do grant a temporary monopoly to help rather than distributing at marginal cost.

What this means is that any ‘economic’ study of the effects of piracy is really about the trade-off of lower consumer prices (due to piracy) vs. less innovation.

Given that most piracy involves:

a) developing countries (so that loss of sales is low since they wouldn’t be able to afford the stuff anyway)

b) big name products (which are already deep in the black in relation to paying for the innovation)

it seems on the face of it likely that software piracy is actually welfare improving. (Backing this up with proper empirical studies is a mammoth task because we don’t really know the demand function for software or the supply function of innovation).

Migrating Drupal to Wordpress

October 10th, 2005

Here are some scripts along with instructions for migrating a drupal site to wordpress.

README.txt

These instructions are ‘implemented’ in code as a small python script called migrate-drupal-data.py which you can find below.

  1. Dump your drupal database. To find out how to do this refer to the manual for your db (for mysql from the command line you use mysqldump and for postgres it is pg_dump)

  2. Load the drupal dump into your wordpress db:

      mysql -u [username] -p [database] < [path-to-drupal-dump]
    
  3. Edit the convert-drupal-data.sql script replacing ‘weblog_’ with your wordpress prefix

  4. Run convert-drupal-data.sql script against your wordpress db:

      mysql -u [username] -p [database] < convert-drupal-data.sql
    

    WARNING: this will delete all existing posts, comments and categories from your wordpress db. If you don’t want to do that, edit the script as indicated therein.

  5. Finished.

Scripts

convert-drupal-data.sql

-- taken from
-- http://vrypan.net/log/archives/2005/03/10/migrating-from-drupal-to-wordpress/
-- and then improved :-) 
-- these lines will result in the deletion of all existing posts, comments
-- and categories. Comment these out using '--' or /* ... */ if you don't
-- want that to happen 
DELETE FROM weblog_wp_categories ;
DELETE FROM weblog_wp_posts ;
DELETE FROM weblog_wp_post2cat ;
DELETE FROM weblog_wp_comments ;

/*
-- does not work (think it is to do with password issues)
-- also causes problems with auto-increment
delete from weblog_wp_users;

INSERT INTO weblog_wp_users
  (ID, user_login, user_pass, user_nickname, user_email, user_registered)
  SELECT
  uid, name, pass, name, mail, FROM_UNIXTIME(timestamp)
  FROM users;
*/

INSERT into weblog_wp_categories(
  cat_ID,cat_name, category_nicename, category_description, category_parent
  )
  SELECT term_data.tid, name, name, description, parent
  FROM term_data, term_hierarchy
  WHERE term_data.tid=term_hierarchy.tid;

INSERT INTO weblog_wp_posts(
  ID, post_author, post_date, post_content, post_title, post_excerpt,
  post_name, post_modified
  )
  SELECT nid, 1, FROM_UNIXTIME(created), body, title, teaser, concat('OLD',nid), FROM_UNIXTIME(changed)
  FROM node
  WHERE type='blog' OR type='page' OR type='story' OR type='forum' ;

INSERT INTO weblog_wp_post2cat (post_id,category_id)
  SELECT nid,tid
  FROM term_node ;

INSERT INTO weblog_wp_comments (
  comment_post_ID, comment_date, comment_content, comment_parent
  )
  SELECT nid, FROM_UNIXTIME(timestamp), concat(subject,' ', comment), thread
  FROM comments ;

DROP TABLE term_data;
DROP TABLE term_hierarchy;
DROP TABLE node;
DROP TABLE term_node;
DROP TABLE comments;
DROP TABLE users;

migrate-drupal-data.py

#!/usr/bin/env python
import os

# target wordpress db information
user = 'your_user_name'
db = 'your_db_name'

# replace this with the path to your dump from drupal
drupalDbDump = '~/tmp/drupal-db-dump.sql'

# path to drupal conversion sql script
templateConvertDrupalSql = 'convert-drupal-data.sql'

# tmp file used as an intermediary 
convertDrupalData = 'tmp_sql.sql'

# default wordpress prefix used in templateConvertDrupalSql script
# you shouldn't have to change this
templatePrefix = 'weblog_'

# your wordpress prefix
wpPrefix = ''

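def prepareConvertSql():
    # read the template sql and swap in the target wordpress prefix
    ff = open(templateConvertDrupalSql)
    tstr = ff.read()
    ff.close()
    tstr = tstr.replace(templatePrefix, wpPrefix)
    # write the result to the intermediary file
    ff2 = open(convertDrupalData, 'w')
    ff2.write(tstr)
    ff2.close()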

if __name__ == '__main__':
    sqlCmd = 'mysql -u %s -p %s < %s'
    cmdInsertDrupalData = sqlCmd % (user, db, drupalDbDump)
    cmdConvertDrupalData = sqlCmd % (user, db, convertDrupalData)

    # getData()
    prepareConvertSql()
    os.system(cmdInsertDrupalData)
    os.system(cmdConvertDrupalData)
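
For anyone adapting this today, here is a rough Python 3 sketch of the same two steps using subprocess instead of os.system. The function names are hypothetical and it assumes the mysql command-line client is on your path. Note that a bare -p makes mysql prompt for the password, and the argument that follows it is taken as the database name.

```python
import os
import subprocess

def apply_prefix(sql, template_prefix, wp_prefix):
    """Swap the template table prefix for the target wordpress prefix."""
    return sql.replace(template_prefix, wp_prefix)

def mysql_command(user, db):
    """Build the mysql invocation; a bare -p prompts for the password."""
    return ["mysql", "-u", user, "-p", db]

def run_sql_file(user, db, path):
    """Feed a sql file to mysql on stdin, raising if mysql reports failure."""
    # open() does not expand '~' the way the shell does, so do it here.
    with open(os.path.expanduser(path), "rb") as sql_file:
        subprocess.run(mysql_command(user, db), stdin=sql_file, check=True)

print(apply_prefix("DELETE FROM weblog_wp_posts ;", "weblog_", "myblog_"))
print(mysql_command("your_user_name", "your_db_name"))
```

Passing the arguments as a list avoids shell quoting problems, and check=True stops the second step from running against a half-loaded database if the first one fails.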