Sunday, April 27, 2008

How to find the cheapest train tickets by website scraping

One of my niggles with train timetable websites is that it can be very tricky to find the cheapest ticket when you want to ask questions like "What is the cheapest ticket if I could travel on Sunday, Monday or Tuesday next month?" or, even worse, "What would be the cheapest 1st class ticket if I can travel any weekend in June?". If you phone a train hotline, the operator will probably either give up (while you are normally paying 10p/min for the attempt) or guess and possibly miss the best deal.

My solution was to use iMacros, a macro-recording plugin for Firefox that can automate interactions with a website. I can now set up an initial script and leave the browser to look for the cheapest possible ticket.

For example, below are the results for London to Nottingham, travelling in the morning (9am to 1pm) from 3rd to 5th May and searching for the cheapest direct train ticket in each 2-hour slot. It took 20 seconds to configure the script (after a day spent developing it) and 5 minutes for my computer to run the search, because it has to enter data on the timetable website 12 times (3 days x 2 time slots x 2 classes of travel) and wait each time for the website to refresh with new data.

I chose the Virgin Trains website because its interface made it easy to pick the first radio button to find the cheapest ticket, and because it seemed to return the most comprehensive results. (I kept finding discrepancies on other sites, which surprised me as I had assumed they all shared the same underlying data.)
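
The count of 12 searches is simply the product of the three lists being combined. Here is a minimal sketch of enumerating them; the function and property names (`buildQueries`, `travelClass`) are illustrative assumptions and are not taken from the actual script.

```javascript
// Enumerate every search the macro would have to submit:
// one per (date, slot start hour, class of travel).
// All names here are illustrative assumptions, not the real script's.
function buildQueries(dates, slotHours, classes) {
  const queries = [];
  for (const date of dates) {
    for (const hour of slotHours) {
      for (const travelClass of classes) {
        queries.push({ date, hour, travelClass });
      }
    }
  }
  return queries;
}

const queries = buildQueries(
  ["2008-05-03", "2008-05-04", "2008-05-05"], // 3 days
  [9, 11],                                     // 2 two-hour slots covering 9am to 1pm
  ["Std", "1st"]                               // 2 classes of travel
);
console.log(queries.length); // 3 x 2 x 2 = 12
```

Adding a day, a slot or a class multiplies the number of page loads, which is why longer searches take so much longer to run.
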
Results for London to Nottingham 

Searched on Sun, 27 Apr 2008 12:44:50 GMT

Saturday 3 May 2008
Slot | Std time | Std fare | 1st time | 1st fare
09:00 | 08:55 | £ 15.00 | 08:55 | £ 18.00
11:00 | 12:55 | £ 15.00 | 11:55 | £ 18.00

Sunday 4 May 2008
Slot | Std time | Std fare | 1st time | 1st fare
09:00 | 09:00 | £ 15.00 | 09:00 | £ 18.00
11:00 | 11:00 | £ 15.00 | 11:00 | £ 18.00

Monday 5 May 2008
Slot | Std time | Std fare | 1st time | 1st fare
09:00 | 08:55 | £ 11.00 | 08:55 | £ 18.00
11:00 | 10:55 | £ 11.00 | 10:55 | £ 18.00
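
Once the fares are collected, picking the overall cheapest is a small reduction over the rows. The sketch below uses the standard class figures from the tables above; the helper names (`fareToPence`, `cheapest`) and the row layout are assumptions made for illustration, not part of the actual script.

```javascript
// Standard class fares copied from the results tables above.
const results = [
  { date: "Saturday 3 May 2008", slot: "09:00", time: "08:55", fare: "£ 15.00" },
  { date: "Saturday 3 May 2008", slot: "11:00", time: "12:55", fare: "£ 15.00" },
  { date: "Sunday 4 May 2008",   slot: "09:00", time: "09:00", fare: "£ 15.00" },
  { date: "Sunday 4 May 2008",   slot: "11:00", time: "11:00", fare: "£ 15.00" },
  { date: "Monday 5 May 2008",   slot: "09:00", time: "08:55", fare: "£ 11.00" },
  { date: "Monday 5 May 2008",   slot: "11:00", time: "10:55", fare: "£ 11.00" }
];

// Convert a fare string such as "£ 15.00" to whole pence,
// so that comparisons do not rely on floating point.
function fareToPence(fare) {
  const match = /(\d+)\.(\d{2})/.exec(fare);
  if (!match) throw new Error("Unrecognised fare: " + fare);
  return Number(match[1]) * 100 + Number(match[2]);
}

// Return the first row that has the lowest fare.
function cheapest(rows) {
  return rows.reduce((best, row) =>
    fareToPence(row.fare) < fareToPence(best.fare) ? row : best);
}

const best = cheapest(results);
console.log(best.date + " " + best.time + " " + best.fare);
// prints "Monday 5 May 2008 08:55 £ 11.00"
```

Ties are resolved in favour of the earliest row, so the 08:55 on Monday is reported rather than the equally priced 10:55.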

I have tested very long queries that take more than an hour to run, and the only likely problem is time-outs from the website. In these cases iMacros appears to lock up, but pressing the Pause/Resume button makes it continue without losing data. Note that it is best not to use the browser while an iMacros script is running, although I have successfully used a different browser (Safari) at the same time without any problems.
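
An alternative to pressing Pause/Resume by hand would be to have the script retry a failed query itself. The following is only a sketch of that idea and has not been tried inside iMacros: `runQuery` is a hypothetical callback standing in for whatever call drives the browser, and it is assumed to throw when the website times out.

```javascript
// Retry a single query a few times before giving up.
// `runQuery` is a hypothetical callback (an assumption for this sketch):
// it returns a result on success and throws on a time-out.
function runWithRetry(runQuery, maxAttempts) {
  let lastError = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, attempts: attempt, value: runQuery(attempt) };
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  return { ok: false, attempts: maxAttempts, error: lastError };
}

// Simulated query that times out twice and then succeeds.
function flakyQuery(attempt) {
  if (attempt < 3) throw new Error("time-out");
  return "£ 11.00";
}

const outcome = runWithRetry(flakyQuery, 5);
console.log(outcome.ok, outcome.attempts, outcome.value); // true 3 £ 11.00
```

Because each query is retried on its own, results already collected from earlier queries are not affected by a later time-out.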

The macro is run from a JavaScript file in iMacros which calls iMacros commands (previously I passed variables to an iim file, but it seems easier to put everything in one file). To query a particular train route I tweak the "Set up query" section of the JavaScript; this could easily be set from user prompts instead. For the time being it is always cheapest to buy two singles rather than a return, so I've only bothered writing the macro for a single. Note that the slot hours (set in the array myhour) are based on the maximum number of hours that Virgin Trains will display for the particular route: 2 hours for London/Nottingham and 3 hours for London/Redruth. As a bonus, the script also returns the date in long text form from the Virgin site and keeps the iMacros code display updated with the estimated time left.
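
To make the slot hours and the time estimate concrete, here is a hedged sketch of how both could be computed. Only the array name `myhour` comes from the script; `slotStarts`, `estimateSecondsLeft` and their parameters are names made up for this example, and the real code may work differently.

```javascript
// Build slot start hours so that consecutive searches cover the whole
// period without gaps. `windowHours` is the widest span the site shows
// per search (2 for London/Nottingham, 3 for London/Redruth).
function slotStarts(fromHour, toHour, windowHours) {
  const starts = [];
  for (let hour = fromHour; hour < toHour; hour += windowHours) {
    starts.push(hour);
  }
  return starts;
}

// Estimate the time remaining from the average duration of the queries so far.
function estimateSecondsLeft(elapsedSeconds, queriesDone, queriesTotal) {
  if (queriesDone === 0) return null; // nothing to base an estimate on yet
  const averagePerQuery = elapsedSeconds / queriesDone;
  return Math.round(averagePerQuery * (queriesTotal - queriesDone));
}

const myhour = slotStarts(9, 13, 2);
console.log(myhour); // [ 9, 11 ]
console.log(estimateSecondsLeft(100, 4, 12)); // 200
```

Using a window wider than the site displays would leave trains unsearched between slots, while a narrower one just adds unnecessary page loads.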

The latest version opens a results window to show the data in an HTML table. You can use iimDisplay() instead, but it is limited to a tiny window that was (at the time of writing) not resizable by scripting.
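
Building the table itself only needs string handling. The sketch below shows one way to do it; `escapeHtml` and `toHtmlTable` are hypothetical helpers written for this example, and the step of writing the string into a new browser window is left out because it depends on the browser environment.

```javascript
// Escape text so that it is safe to place inside HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn an array of header names and an array of rows into an HTML table string.
function toHtmlTable(headers, rows) {
  const head = "<tr>" +
    headers.map(h => "<th>" + escapeHtml(h) + "</th>").join("") + "</tr>";
  const body = rows
    .map(row => "<tr>" +
      row.map(cell => "<td>" + escapeHtml(cell) + "</td>").join("") + "</tr>")
    .join("\n");
  return "<table>\n" + head + "\n" + body + "\n</table>";
}

const html = toHtmlTable(
  ["Slot", "Time", "Std"],
  [["09:00", "08:55", "£ 11.00"], ["11:00", "10:55", "£ 11.00"]]
);
console.log(html);
```

Keeping the table generation separate from the window handling means the same string could also be saved to a file or pasted elsewhere.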

Here's the source code. Note that the source code is not word-wrapped and long lines may appear truncated, but if you copy and paste it into your editor you should see all the text.

JavaScript source code: Virgin-Trains.js
