Over the last year, we’ve been posed two screen-scraping challenges, both from the world of media. In both cases the purpose was to consolidate publicly available, non-commercially sensitive data. Whilst the overarching business imperatives were different – one was business intelligence and the other providing a consumer service – the challenges were the same, which we thought we’d share:
1. Tread considerately
The overriding consideration when overcoming the challenges was to minimise the server load on the sites we were scraping. Yes, part of that consideration is self-serving, i.e. not to get black-listed, but part is about doing the job well.
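How the load was actually kept low isn't spelled out here, so the following is only a minimal sketch of one common approach: enforcing a minimum gap between requests to the same host. The class name PoliteThrottle, the 10-second default and the injectable clock and sleep parameters are all assumptions made for illustration.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforces a minimum gap between requests to the same host (hypothetical helper)."""

    def __init__(self, min_gap_seconds=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap_seconds
        self._clock = clock          # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_hit = {}          # host -> time of the most recent request

    def wait_for(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            # Only wait for whatever part of the gap has not already elapsed.
            delay = max(0.0, self.min_gap - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_hit[host] = now + delay
        return delay
```

Calling wait_for(url) immediately before each download keeps any single site from being hit more often than the configured gap, while requests to different hosts are not held up by each other.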
2. The curse of the modern browser
Modern browsers are so capable at handling client-side code that, as part of delivering a good User Experience, many websites make extensive use of client-side code and AJAX call-backs to load data dynamically or on the fly. Unfortunately – for us – this blocks traditional scraping techniques, which effectively download the raw HTML content, but are incapable of executing post-download client-side code.
Add on top of that the use of static URLs. Historically, each page on a web-site would have had a unique URL. Today, however, the URL often remains static even though a user may have clicked on a menu and gone to another page.
3. Replicating humans
Many of the web-sites that we scraped contained data only accessible after clicking on various menus, scrolling, clicking on arrows and generally being a human being. In effect, we had to replicate programmatically a human interacting with a modern browser. This repeatable set of interactions had to be executed by an automated service at a user-definable frequency, because some sites show more data and some less, depending on their time window.
4. Common time-zone
Working on a global scale, the sites we needed to scrape often contained times in their respective country’s time-zone. In order to compare one country’s data against another these needed to be converted to a standard time-zone.
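As a rough illustration of that conversion, here is a minimal Python sketch using the standard zoneinfo module. It assumes UTC as the common time-zone and assumes each site is configured with an IANA zone name such as "Europe/Paris"; neither detail comes from the projects themselves, and the function name to_utc is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_text, fmt, zone_name):
    """Parse a naive local timestamp scraped from a site and return it in UTC.

    zone_name is an IANA zone such as "Europe/Paris" (assumed to be stored
    alongside each site's configuration).
    """
    naive = datetime.strptime(local_text, fmt)
    # Attach the site's zone; zoneinfo applies the correct daylight-saving offset.
    local = naive.replace(tzinfo=ZoneInfo(zone_name))
    return local.astimezone(timezone.utc)
```

Using named zones rather than fixed offsets means daylight-saving changes are handled by the time-zone database instead of by the scrape configuration.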
5. Common language
We also needed to convert from the local language to English; this was overcome by using Google Translate.
Overcoming these challenges
In each case we built a system that worked like this operationally:
A semi-technical operator would view the site in Google Chrome and, using the Developer tools, produce XPath queries to obtain relevant data. These queries would be entered into our custom built application.
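To give a feel for what such operator-supplied queries do, here is a small Python sketch. It is not the application described here: it uses the limited XPath subset in Python's standard xml.etree.ElementTree, it only works on well-formed markup (real pages would need a proper HTML parser), and the sample page, field names and scrape function are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup standing in for a rendered page.
PAGE = """
<html><body>
  <div class="listing"><span class="title">Evening News</span><span class="time">18:00</span></div>
  <div class="listing"><span class="title">Late Film</span><span class="time">22:30</span></div>
</body></html>
""".strip()


def scrape(markup, row_xpath, field_xpaths):
    """Return one dict per node matching row_xpath, with one entry per field query."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(row_xpath):
        # Each field query is evaluated relative to the matched row.
        rows.append({name: node.findtext(xp) for name, xp in field_xpaths.items()})
    return rows


rows = scrape(
    PAGE,
    ".//div[@class='listing']",
    {"title": "span[@class='title']", "time": "span[@class='time']"},
)
```

Separating the row query from the per-field queries mirrors the idea of an operator entering several XPath expressions for one page.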
The application would also allow a user to configure user-interactions. We did this by building a command-driven mechanism, allowing a page to have as many interactions as required. The commands are fairly simple e.g. wait for 15 seconds, press button with class name = “classname”, wait 15 seconds, press button with ID, “Id”.
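A command mechanism of this kind could be sketched in Python roughly as follows. The text syntax ("wait 15", "click class=classname", "click id=Id") and the browser object with click_by_class and click_by_id methods are assumptions made for illustration, not the actual command format or browser interface we built.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    action: str      # "wait" or "click"
    target: str      # seconds for "wait"; "class" or "id" for "click"
    value: str = ""  # class name or element id for "click"


def parse_script(text):
    """Turn lines such as 'wait 15' or 'click class=next' into Command objects."""
    commands = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        action, _, rest = line.partition(" ")
        if action == "wait":
            float(rest)  # raises ValueError early if the duration is not a number
            commands.append(Command("wait", rest))
        elif action == "click":
            kind, sep, value = rest.partition("=")
            if kind not in ("class", "id") or not sep or not value:
                raise ValueError(f"bad click target: {line!r}")
            commands.append(Command("click", kind, value))
        else:
            raise ValueError(f"unknown command: {line!r}")
    return commands


def run_script(commands, browser, sleep=time.sleep):
    """Execute commands in order against a browser-like object (hypothetical interface)."""
    for cmd in commands:
        if cmd.action == "wait":
            sleep(float(cmd.target))
        elif cmd.target == "class":
            browser.click_by_class(cmd.value)
        else:
            browser.click_by_id(cmd.value)


class RecordingBrowser:
    """Stand-in that records clicks instead of driving a real browser."""

    def __init__(self):
        self.clicks = []

    def click_by_class(self, name):
        self.clicks.append(("class", name))

    def click_by_id(self, element_id):
        self.clicks.append(("id", element_id))
```

Keeping parsing separate from execution means a script can be validated when the operator saves it, before any browser is launched.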
The operator would then click a test button that displayed a browser view (with capabilities equivalent to IE11, including execution of client-side code), showing the user interactions being performed in real time and scraping the XPath-defined data from the site.
Once the operator was happy the data was correct, they saved the new scrape to the system. That scrape was then run programmatically at a set interval by a server-side scrape engine.
The engine was built with concurrency in mind, meaning multiple scrape engines can run simultaneously. Each scrape communicates with the database server via a secure RESTful Web API service layer. The scrape engine attempts each scrape, as defined in the tool, as and when it is deemed necessary (determined by the scrape frequency the operator configured).
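The "as and when it is deemed necessary" decision can be illustrated with a short Python sketch. This is a simplification rather than the engine itself: it runs due scrapes on a thread pool inside one process instead of as separate engines talking to a Web API, and the names ScrapeJob, run_due_jobs and run_one are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScrapeJob:
    name: str
    frequency: timedelta                 # how often the operator wants this scrape run
    last_run: Optional[datetime] = None  # None means the scrape has never run

    def is_due(self, now):
        return self.last_run is None or now - self.last_run >= self.frequency


def run_due_jobs(jobs, now, run_one, max_workers=4):
    """Run every due job concurrently and return {job name: result}."""
    due = [job for job in jobs if job.is_due(now)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `due`.
        results = list(pool.map(run_one, due))
    for job in due:
        job.last_run = now
    return {job.name: result for job, result in zip(due, results)}
```

In this sketch run_one stands in for whatever performs a single scrape; jobs that are not yet due are left untouched, so the function can be called on a short polling loop.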
All of the scrape attempts are logged with any issues shown on-screen in the web-admin portal. Issues are also sent via e-mail to selected people.
Consolidating the data was only the first part of the overall journey. The data needed to be exposed in a consumer-facing website in one instance and a complex reporting portal in the other – but as these are fairly standard things to do, they aren’t covered here.
The biggest lesson learned is that the configuration of the scraping (i.e. the operator tasks above) needs development skills (XPath), a business understanding (why am I doing this?) and attention to detail. A reasonably challenging combination!