Web scraping “describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context” (Wikipedia).
This is exactly what I needed for a revived project, soon to be revealed, so I “crawled” the net a bit for solutions available in Ruby. The following is an overview of the alternatives I’ve found, cleaned up and published as a short essay for the “Web programming” course at university. I’ll go from the simplest to the most complex.
Regex is the first thing that comes to mind when you want to extract data from text and it can also work well for HTML.
Advantages:
- doesn’t depend on Ruby, as regex is supported in any decent programming environment
- can eat any HTML, no matter how garbled it is
Disadvantages:
- you can’t take advantage of the semantics of the HTML structure
- ugly syntax: a week after you write it, nobody (including you) will be able to figure out what it means
My opinion is that using regex as the single scraping tool is too hard-core but on the other hand it is the best tool for polishing up results from the tools below.
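As a minimal sketch of both uses, the snippet below pulls link targets out of a deliberately sloppy HTML fragment with one regular expression and polishes a scraped value with another. The sample markup, the `extract_links` and `clean_price` helpers and the patterns themselves are my own illustrative assumptions, not part of any library.

```ruby
# Hypothetical sample input: unclosed tags, mixed quoting and mixed case.
SAMPLE_HTML = <<~PAGE
  <p>Some <b>broken markup
  <a href="/first.html">First</a>
  <A HREF='/second.html'>Second
  <a class="x" href="http://example.com/third">Third</a>
PAGE

# Matches the href attribute of <a> tags, quoted with either ' or ".
LINK_PATTERN = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/i

# Returns every link target found in the given HTML string.
def extract_links(html)
  html.scan(LINK_PATTERN).map { |_quote, target| target }
end

# Polishes a scraped price such as "  $1,299.00 USD " into a Float.
# Returns nil when the text contains no number at all.
def clean_price(text)
  number = text[/\d[\d,]*(?:\.\d+)?/]
  number && number.delete(",").to_f
end

puts extract_links(SAMPLE_HTML)
puts clean_price("  $1,299.00 USD ")
```

Note that the link pattern knows nothing about the document structure: it would happily match a link inside a comment or a script block, which is exactly the first disadvantage listed above.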
The following approaches profit from the fact that HTML is (or should be) actually a “tree” - or, if you want to be particular or idealistic, “XML”. Idealistic because HTML is too rarely XML-clean. There is a solution, though:
HTree is a nice little (I think) tool that is very good at chewing HTML and spitting out valid XML. Considering this, the tool is a must, but you don’t have to worry about it too much as it has already been integrated into many of the solutions below.
REXML is the standard XML processing library, implemented in Ruby. It can digest the tidy HTML from HTree very well and lets you play with it as if it were made of Ruby objects. It also has XPath support, which makes it one of the most powerful solutions. So, to keep with the advantages/disadvantages scheme:
Advantages:
- play with HTML subtrees as if they were Ruby objects, a nice way of coding
- full XPath support!
- implemented in Ruby so you can tweak it (if you’re into that stuff)
- mature, stable tool
Disadvantages:
- I haven’t noticed it first-hand, but I understand it can be slow on large datasets
- requires XPath knowledge to work well with it
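To give an idea of what this looks like in practice, here is a small sketch that runs two XPath queries with REXML. It assumes the input is already well-formed (the kind of output a cleanup step such as HTree is meant to produce); the sample document and the `news_stories` and `about_link_text` helpers are hypothetical names I made up for the illustration.

```ruby
require "rexml/document"

# Hypothetical input that is already well-formed XML.
SAMPLE_XHTML = <<~PAGE
  <html>
    <body>
      <div id="news">
        <a href="/a.html" class="story">First story</a>
        <a href="/b.html">Second story</a>
      </div>
      <a href="/about.html">About</a>
    </body>
  </html>
PAGE

# Collects the links inside the div whose id is "news".
def news_stories(xml)
  doc = REXML::Document.new(xml)
  # XPath with an attribute predicate; each match is a REXML::Element.
  REXML::XPath.match(doc, "//div[@id='news']/a").map do |link|
    { title: link.text, url: link.attributes["href"] }
  end
end

# Uses an XPath predicate function (contains) to find a single link.
def about_link_text(xml)
  doc = REXML::Document.new(xml)
  link = REXML::XPath.first(doc, "//a[contains(@href, 'about')]")
  link && link.text
end

news_stories(SAMPLE_XHTML).each do |story|
  puts "#{story[:title]} -> #{story[:url]}"
end
puts about_link_text(SAMPLE_XHTML)
```

On recent Ruby versions REXML ships as a bundled gem rather than as part of the core standard library, so depending on your setup it may need to be listed in the Gemfile.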
XMLParser is an alternative to HTree and REXML that uses expat for parsing XML. The fact that expat is a C library should make it much faster, but I cannot tell you much about it as I haven’t used it. It also requires expat to be installed on the server (which I think also means compiling it), so if you don’t have good access to your deployment server it is not an option.
Hpricot is a nice HTML parsing tool that brings together ideas from HTree and jQuery with a Ragel-generated scanner in a simple and fun-to-use way.
Advantages:
- nice to use - Ruby-ish and everything
- pretty fast
- it supports queries using CSS selectors
- it has some XPath support
Disadvantages:
- it has only some XPath support (no predicate functions, for example)
Hpricot is the tool I’ve chosen for the “still in stealth mode” project mentioned above and I like it so far.
Rubyful Soup is another HTML parser that looks like a good alternative for somebody who doesn’t know or doesn’t want to learn XPath. I haven’t used it first-hand, but the code looks nice and it brags that it won’t “choke if you give it bad markup”. So:
Advantages:
- produces nice-looking code
- accepts bad markup
- ruby is all you need to know - and HTML, of course
Disadvantages:
scrAPI uses CSS selectors as the way of querying the HTML document. Its suggested way of coding is rather different from the parsers above, but it is intuitive and might prove easy to use.
Advantages:
- CSS selectors
- accepts invalid HTML - it actually uses Tidy to clean it up
Disadvantages:
The above are the main HTML parsers I’ve found. But the purpose is not only parsing HTML, it is actually extracting information from a whole site. In order to do this we also need a way of navigating the site to gather the pages required for parsing. The following tools facilitate this.
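Before getting to the tools, the navigation problem itself can be sketched with nothing but the standard library: resolve every link on a page against that page’s URL, stay on the same host, and never visit a page twice. To keep the sketch runnable without a network, it crawls a hypothetical in-memory “site”; the `PAGES` hash and the `links_on` and `crawl` helpers are assumptions of mine, and a real crawler would also have to deal with cookies, redirects and errors, which is what the tools below are for.

```ruby
require "set"
require "uri"

# Hypothetical in-memory "site" (URL => HTML) standing in for real HTTP requests.
PAGES = {
  "http://example.com/" => '<a href="/a.html">A</a> <a href="b.html">B</a>',
  "http://example.com/a.html" => '<a href="/">Home</a> <a href="http://other.example/x">Ext</a>',
  "http://example.com/b.html" => '<a href="/a.html">A</a>'
}

# Resolves every href found on a page against the URL of that page.
def links_on(page_url, html)
  html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten.map do |href|
    URI.join(page_url, href).to_s
  end
end

# Breadth-first crawl that stays on the host of the start URL.
# `fetch` is any callable that returns the HTML for a URL, or nil if unavailable.
def crawl(start_url, fetch)
  host = URI(start_url).host
  seen = Set[start_url]
  queue = [start_url]
  pages = {}

  until queue.empty?
    url = queue.shift
    html = fetch.call(url)
    next if html.nil?

    pages[url] = html
    links_on(url, html).each do |link|
      next unless URI(link).host == host
      # Set#add? returns nil when the link was already seen.
      queue << link if seen.add?(link)
    end
  end
  pages
end

site = crawl("http://example.com/", ->(url) { PAGES[url] })
puts site.keys
```

Passing the fetcher in as a callable keeps the crawl logic separate from how pages are actually retrieved, so the lambda could later be swapped for one that makes real HTTP requests.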
WWW::Mechanize does this by providing a layer over Ruby’s HTTP library that also integrates by default with Hpricot (this is recent; before, it used the standard REXML). It emulates the way a browser would crawl pages, including storing cookies, following redirects, etc. It’s a nice tool, its only drawback being the lack of support for JavaScript interaction with the site, in case you need it. This is the tool we are currently using.
scRUBYt is a tool that uses WWW::Mechanize, Hpricot and some probably complicated code to build web scrapers from an example you give it in a page. The idea is nice, but we (me, Silviu and Cristi) spent over a day trying to make it work because of its crazy complicated dependencies, only to finally figure out that it works at its full strength only on 32-bit Linux systems. On other systems it either doesn’t work or doesn’t support some features, like the “example to code” module. Anyway, the examples and documentation suggest that it handles 80% of cases very well, but I couldn’t figure out from the documentation how to build complex crawling scenarios (not just nested ones), so I assume for now they can’t be done. The dependency problems and the lack of flexibility in the other 20% of cases are the reasons we dropped it.
I’ll move on, as criticizing scRUBYt should have been a different post.
Watir is, as they describe it, a “web application testing in ruby” tool, so it might not sound like a tool for crawling, but it can very well be used for that. It provides the same stuff as WWW::Mechanize, with similar-looking code, and it can also play with JavaScript. The downside is that it uses IE’s engine, so it only works on Windows for now.
The above solutions are all available for standard Ruby. If JRuby is used, then this opens access to a whole bunch of Java tools that serve the same purpose. IronRuby will do the same thing for .NET soon.
Lastly, the above overview is based partly on my experience with the tools and partly on just reading about them in different sources. Please feel free to correct me if I’ve got something wrong, and let me know if there are any other tools that would fit here.