Roy van Kaathoven
Full Stack Developer

24 Sep

Scraping with Scala

A few months ago I was asked whether I could scrape data from various websites, store it in a database and output some nice reports. Back then I wrote a simple scraper with Ruby and Nokogiri, saved everything to a Microsoft SQL Server database and created the reports in Excel. Because of the tight budget the script became a quick and dirty solution, but it did the job and the client was happy.

Now that scrape requests are becoming more common, I wanted to design a framework in Scala that makes scrape jobs a lot easier, faster and more fun to write. But why develop a new web scraping framework when there are already plenty of alternatives in various languages?

Firstly, because I like learning new things in Scala, and web scraping is something I have not tried in Scala yet. Secondly, because after some research I found that the available frameworks all lack something. These are the things I was not happy with:

  • Most frameworks are single-threaded and can't download pages in parallel
  • They do not scale horizontally
  • The amount of code required to do something simple is too much

Akka solves point 1 with round-robin routing and point 2 with remote actors. Point 3 can be solved with Scala alone.
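
To give an idea of how round-robin routing spreads downloads over parallel workers, here is a minimal, self-contained Akka sketch. The Downloader actor, the pool size and the blocking fetch are assumptions made purely for illustration; this is not code from the framework, and the router API differs slightly between Akka versions (RoundRobinRouter before 2.3, RoundRobinPool afterwards).

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Hypothetical worker: fetches one URL and reports how much it downloaded
class Downloader extends Actor {
  def receive = {
    case url: String =>
      // Blocking fetch, just to keep the example short
      val body = scala.io.Source.fromURL(url).mkString
      println(s"$url -> ${body.length} characters")
  }
}

object ParallelDownloads extends App {
  val system = ActorSystem("scraper")

  // Five Downloader actors behind a round-robin router: every incoming URL
  // is handed to the next worker in turn, so pages download in parallel
  val downloaders = system.actorOf(
    RoundRobinPool(5).props(Props[Downloader]), "downloaders")

  List(
    "http://www.scala-lang.org/",
    "http://akka.io/",
    "http://events.stanford.edu/"
  ) foreach (downloaders ! _)
}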

So far I have come up with the following DSL, which searches Google for PHP elephants and reads the results:

import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element

object Google {
  val results = "#res li.g h3.r a"
  def search(term: String) = {
    "http://www.google.com/search?q=" + term.replace(" ", "+")
  }
}

// Open the search results page for the query "php elephant"
scrape from Google.search("php elephant") open { implicit page =>

  // Iterate through every result link
  Google.results each { x: Element =>

    // substring(28) strips Google's redirect prefix from the absolute href
    val link = x.select("a[href]").attr("abs:href").substring(28)
    if (link.isValidURL) {
      // Print every link found on the scraped page
      scrape from link each (x => println("found: " + x))
    }
    }
  }
}
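
The snippet above leans on a few implicit enrichments: the CSS selector string gets an each method that runs against the implicit page, and plain strings get isValidURL. Purely as a sketch of how such a DSL can be wired up on top of jsoup (none of this is the framework's actual code, and DslSketch, SelectorOps and UrlOps are made-up names), it could look roughly like this:

import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}
import scala.collection.JavaConverters._
import scala.util.Try

object DslSketch {

  // Minimal stand-in for the page that is passed around as `implicit page`
  case class WebPage(doc: Document)

  // One way to obtain a page: fetch and parse it with jsoup
  def open(url: String): WebPage = WebPage(Jsoup.connect(url).get())

  // Lets a CSS selector string be used as `"selector" each { element => ... }`,
  // resolved against whatever page is implicitly in scope
  implicit class SelectorOps(selector: String) {
    def each(f: Element => Unit)(implicit page: WebPage): Unit =
      page.doc.select(selector).asScala.foreach(f)
  }

  // Backs the `isValidURL` check used on the extracted links
  implicit class UrlOps(url: String) {
    def isValidURL: Boolean = Try(new java.net.URL(url)).isSuccess
  }
}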

To make common tasks easier there are spiders available which crawl a website, following every link within the allowed domains and opening each page for reading. A simple spider reads the entire domain and stops when it has nothing left to do.

new Spider with EmailSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onEmailFound ::= { email: String =>
    // Email found
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()

The project is far from finished, but hopefully it gives you an idea of the upcoming features. The project can be found on GitHub.