Friday, March 6, 2015

Case Study: Migrate Web Scraper from dotCloud

SUMMARY:
  • Migrate existing code in less than 2 minutes
  • No need to mess with servers at all
  • Hookscript automatically scales and adds redundancy


BACKGROUND:
PriceCharting.com tracks prices for every video game. Some of this data is scraped from various websites.

PROBLEM:
The scrapers needed to scale as product lists grew, and they needed debugging whenever a site changed its HTML. dotCloud didn't provide enough logging detail and required choosing the number of servers up front.

ANSWER:
PriceCharting migrated the code to Hookscript for improved logging with no need to manage servers.

STACK:
Written in Prolog. Scrapes HTML using XPath. Outputs data in JSON.

% built in modules
:- use_module(library(hookscript)).
:- use_module(library(debug), [assertion/1]).
:- use_module(library(dcg/basics), [integer//1]).
:- use_module(library(http/json), [json_write_dict/2]).
:- use_module(library(web), []).
:- use_module(library(xpath)).

hook :-
    % fetch the GameStop product page
    req:param(id, Id),
    format(string(Url),"http://www.gamestop.com/-/games/-/~d",[Id]),
    web:get(Url,[status_code(200),html5(Dom)]),
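
The excerpt above stops after fetching the page. As a rough sketch of the last step named under STACK (outputting JSON), the following uses json_write_dict/2 from the module list. The predicate name emit_prices/3 and the keys id, new, and used are hypothetical and not taken from the original scraper. Writing to current output is also an assumption about how a script returns its response.

```prolog
:- use_module(library(http/json), [json_write_dict/2]).

% emit_prices(+Id, +New, +Used)
% Hypothetical helper, not part of the original scraper.
% Writes one JSON object with the product id and two prices
% to the current output stream, followed by a newline.
emit_prices(Id, New, Used) :-
    current_output(Out),
    json_write_dict(Out, _{id: Id, new: New, used: Used}),
    nl(Out).
```

Calling emit_prices(42, 1999, 1499) writes a single JSON object with those three keys.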
        

Migrating the code took roughly 2 minutes. Copy and paste the existing code, then add one line:
:- use_module(library(hookscript)).

Like all HTML scrapers, this one can break when the underlying HTML changes. Hookscript makes debugging easier because the script logs show the full HTTP response and the incoming HTTP request.

% extract price from html
price(Dom, Condition, Price) :-
    gamestop_condition(GameStopCondition, Condition),
    xpath(Dom, //div(h2/strong(text=RawCondition))/h3(text), RawPriceAtom),
    normalize_space(atom(GameStopCondition), RawCondition),
    normalize_space(codes(PriceCodes), RawPriceAtom),
    phrase(currency(Price),PriceCodes),
    !.
price(_, _, 0).  % a missing price is ok
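
price/3 relies on a currency//1 grammar that is not shown. A minimal sketch of what such a grammar could look like, built on integer//1 from the module list, is below. It is a guess and not the original code. It assumes prices look like "$12.99" and that Price is counted in pennies. It does not handle thousands separators.

```prolog
:- use_module(library(dcg/basics), [integer//1, digit//1]).

% currency(-Pennies)//
% Hypothetical reconstruction, not the original grammar.
% Parses a dollar sign, whole dollars, a dot, and exactly two
% cent digits, then converts the amount to pennies.
currency(Pennies) -->
    "$",
    integer(Dollars),
    ".",
    digit(D1), digit(D2),
    { number_codes(Cents, [D1, D2]),
      Pennies is Dollars * 100 + Cents }.
```

With this sketch, the query atom_codes('$12.99', Cs), phrase(currency(P), Cs) binds P to 1299. Text that does not match makes phrase/2 fail, so price/3 falls through to its second clause and reports 0.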
        

[Screenshot: Hookscript request log]

PriceCharting doesn't have to worry about the number of servers, redundancy, or scaling. Hookscript takes care of all that and bills only for runtime consumed.