Web Jazz: A simple distributed crawler in Ruby

A week ago, I took a break from Mobtropolis and, of all things, ended up writing a simple distributed crawler in Ruby. I hesitated to post it at first, since crawlers are conceptually pretty simple. But eh, what the heck.

This was just an exercise to play with DRb and Hpricot, so don't use it for your production work, whatever that may be. A real crawler has to be far more robust than what I wrote. And don't leave it running and hammering away at a site, since that'll get you banned.

First, this is how you use it:


WebCrawler.start("http://en.wikipedia.org/") do |doc|
  puts "#{doc.search("title").inner_html}"
end

And that's it. It hands you each document in an XPath-traversable form, courtesy of Hpricot.

A web crawler is a program that simply downloads a page, takes note of the links on that page, and puts those links on its queue of links to crawl. Then it takes the next link off its queue, downloads that page, and does the same thing. Rinse and repeat.
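
To make that loop concrete without touching the network, here's a minimal sketch that crawls a made-up in-memory "web". FAKE_WEB and crawl_order are hypothetical names for this illustration only; they aren't part of the crawler in this post.

```ruby
# A toy "web": each page maps to the links found on it (made-up data).
FAKE_WEB = {
  "a" => ["b", "c"],
  "b" => ["c", "a"],
  "c" => ["d"],
  "d" => []
}

# Returns the pages in the order they were visited.
def crawl_order(start, web)
  queue   = [start]   # links waiting to be crawled
  visited = {}        # pages we've already downloaded
  order   = []

  until queue.empty?
    page = queue.shift          # take the next link off the queue
    next if visited[page]       # skip pages we've already seen
    visited[page] = true
    order << page
    # "Download" the page and queue up the links found on it.
    web.fetch(page, []).each { |link| queue << link unless visited[link] }
  end
  order
end

p crawl_order("a", FAKE_WEB)
```

The visited hash is what keeps the loop from going around in circles when pages link back to each other.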

First, we create a class method named start that creates an instance of a webcrawler and then starts it. Of course, we could have done without this helper method, but it makes it easier to call.


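module Crawler
  class WebCrawler
    class << self
      # Convenience entry point: build a crawler, start it, and pass each
      # parsed document to the caller's block.
      def start(url)
        crawler = WebCrawler.new
        crawler.start(url) do |doc|
          yield doc
        end
        return crawler
      end
    end
  end
end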

So next, we define the initialization method.


require 'drb'
require 'rinda/ring'
require 'rinda/tuplespace'

module Crawler
  class WebCrawler
    def initialize
      puts "Starting WebCrawler..."
      begin
        DRb.start_service "druby://localhost:9999"
        puts "Initializing first crawler"
        puts "Starting RingServer..."
        Rinda::RingServer.new(Rinda::TupleSpace.new)
        
        puts "Starting URL work queue"
        @work_provider = Rinda::RingProvider.new(:urls_to_crawl, Rinda::TupleSpace.new, "Queue of URLs to crawl")
        @work_provider.provide
        
        puts "Starting URL visited tuple"
        @visited_provider = Rinda::RingProvider.new(:urls_status, Hash.new, "Tuplespace of URLs visited")
        @visited_provider.provide
      rescue Errno::EADDRINUSE => e
        puts "Initialize other crawlers"
        DRb.start_service
      end
      puts "Looking for RingServer..."
      @ring_server = Rinda::RingFinger.primary
      
      @urls_to_crawl = @ring_server.read([:name, :urls_to_crawl, nil, nil])[2]
      @urls_status = @ring_server.read([:name, :urls_status, nil, nil])[2]
      @delay = 1
    end
  end
end

This bears a little explaining. The first webcrawler you start will create a DRb server, if one doesn't already exist, and do the setup. Every subsequent webcrawler will then connect to that server and start picking URLs off the work queue.

So when you start a DRb server, you call DRb.start_service with a URI, then you start a RingServer. What a RingServer provides is a way for subsequent clients to find services provided by the server or by other clients.
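
If you haven't used DRb before, here's a minimal sketch of just the DRb part, leaving the RingServer out. VisitedRegistry is a hypothetical class made up for this example, and the address 127.0.0.1 with port 0 (which asks the OS for any free port) is an assumption for the demo; the crawler above uses druby://localhost:9999.

```ruby
require 'drb'

# Hypothetical shared service: a record of visited URLs that lives in
# the server process.
class VisitedRegistry
  def initialize
    @seen = {}
    @lock = Mutex.new
  end

  # Returns true the first time a URL is marked, false afterwards.
  def mark(url)
    @lock.synchronize do
      return false if @seen[url]
      @seen[url] = true
    end
  end

  def visited?(url)
    @lock.synchronize { @seen.key?(url) }
  end
end

# Server side: expose one "front" object at a druby:// URI.
DRb.start_service("druby://127.0.0.1:0", VisitedRegistry.new)

# Client side: get a proxy for whatever object is served at that URI.
# Normally this would run in a different process.
registry = DRbObject.new_with_uri(DRb.uri)
registry.mark("http://example.com/")
puts registry.visited?("http://example.com/")
```

Method calls on the proxy are forwarded to the object on the server, which is why every crawler ends up sharing the same state.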

Next, we register a URL work queue and a visited-URLs hash as services. The URL work queue is a TupleSpace. If you haven't heard of a TupleSpace, the easiest way to think of it is as a bulletin board. Clients post items on it, and other clients can take them off. This is what we'll use as a work queue of URLs to crawl.
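
Here's a small sketch of the bulletin board idea using a local Rinda::TupleSpace, with no DRb server involved. The tuples and URLs are made up for the example.

```ruby
require 'rinda/tuplespace'

board = Rinda::TupleSpace.new

# Post a few items on the board. A tuple is just an array.
board.write([:url, "http://example.com/a"])
board.write([:url, "http://example.com/b"])
board.write([:note, "not a URL"])

# take() removes a tuple matching the pattern; nil is a wildcard.
# I wouldn't rely on which of the matching tuples comes back.
first = board.take([:url, nil])
puts first[1]

# read_all() lists the matching tuples without removing them.
remaining = board.read_all([:url, nil])
puts remaining.length
```

Because take() removes the tuple, two crawlers pulling from the same TupleSpace never get handed the same posting.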

The visited-URLs service is a Hash, so we can check which URLs we've already visited. Ideally, we'd use the URL work queue itself for that, but Rinda seems to only provide blocking calls for reading/taking from the TupleSpace. That doesn't make sense to me, but I couldn't find a non-blocking call that day. Lemme know if I'm wrong.
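
For what it's worth, as far as I know Rinda's read() and take() accept an optional timeout in seconds as a second argument and raise Rinda::RequestExpiredError when it runs out, and read_all() returns the current matches without waiting. Here's a sketch of non-blocking access built on that. The helper names take_url_nonblocking and queued? are hypothetical, and I haven't tried this against the crawler above.

```ruby
require 'rinda/tuplespace'

# Hypothetical helper: take a URL if one is waiting, otherwise return nil
# instead of blocking. A timeout of 0 means "don't wait at all".
def take_url_nonblocking(space)
  space.take([:url, nil], 0)[1]
rescue Rinda::RequestExpiredError
  nil
end

# Hypothetical helper: is this exact URL currently sitting in the queue?
# read_all() only looks; it doesn't remove anything.
def queued?(space, url)
  !space.read_all([:url, url]).empty?
end

queue = Rinda::TupleSpace.new
queue.write([:url, "http://example.com/"])

p queued?(queue, "http://example.com/")   # look without removing
p take_url_nonblocking(queue)             # removes the one URL
p take_url_nonblocking(queue)             # queue is empty now, no blocking
```

Note that this only tells you what is in the queue right now. A URL that was already taken and crawled is gone from the TupleSpace, so something like the visited hash is still needed.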


module Crawler
  class WebCrawler
    def start(start_url)
      @urls_to_crawl.write([:url, URI(start_url)])
      crawl do |doc|
        yield doc
      end
    end

    private

    def crawl
      loop do
        url = @urls_to_crawl.take([:url, nil])[1]
        @urls_status[url.to_s] = true
        
        doc = download_resource(url) do |file|
          Hpricot(file)
        end or next
        yield doc

        time_begin = Time.now
        add_new_urls(extract_urls(doc, url))
        puts "Elapsed: #{Time.now - time_begin}"
      end
    end
  end
end

Here are the guts of the crawler. It loops forever, taking a URL off the work queue using take(). take() looks for a pattern in the TupleSpace and returns a tuple that matches, waiting until one shows up. Then we mark the URL as 'visited' in @urls_status. Next, we download the resource at the URL, use Hpricot to parse it into a document, and yield it. If we can't download it for whatever reason, we move on to the next URL. Lastly, we extract all the URLs in the document and add them to the work queue TupleSpace. Then we do it again.
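
To give a feel for that last step, here's a rough sketch of turning the hrefs found on a page into absolute URLs worth queueing. This is not the code from webcrawler.rb: resolve_links is a hypothetical helper, it uses only Ruby's standard uri library, and it assumes the href strings have already been pulled out of the document.

```ruby
require 'uri'

# Hypothetical helper: resolve hrefs against the page they were found on,
# keep only http(s) links, drop #fragments, and remove duplicates.
def resolve_links(hrefs, base_url)
  links = hrefs.map do |href|
    begin
      link = URI.join(base_url, href)   # handles relative and absolute hrefs
      link.fragment = nil               # "#section" points at the same page
      link.to_s if link.is_a?(URI::HTTP)  # URI::HTTPS is a subclass of URI::HTTP
    rescue URI::Error
      nil                               # skip hrefs that aren't valid URIs
    end
  end
  links.compact.uniq
end

hrefs = ["/wiki/Ruby", "other.html", "#top",
         "mailto:me@example.com", "http://example.org/x#frag", "/wiki/Ruby"]
links = resolve_links(hrefs, "http://example.com/docs/index.html")

# Only queue what hasn't been visited yet (a plain Hash stands in for
# the shared visited hash here).
visited = { "http://example.com/wiki/Ruby" => true }
puts links.reject { |link| visited.key?(link) }
```

Dropping fragments and duplicates up front keeps the work queue from filling with several spellings of the same page.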

The private methods download_resource(), extract_urls(), and add_new_urls() are merely details, and I won't go over them. But if you want to check them out, you can download the entire file. There are weaknesses that I haven't solved, of course. If the first client goes down, everyone goes down. Also, there's no central place to put the results of the processing done by the clients. But like lazy textbook writers, I'll leave that as an exercise for the reader.

webcrawler.rb

Friday, December 07, 2007