Friday, December 07, 2007

A simple distributed crawler in Ruby

A week ago, I took a break from Mobtropolis, and...of all things ended up writing a simple distributed crawler in Ruby. I hesitated posting it at first, since crawlers are conceptually pretty simple. But eh, what the heck.

This was just an exercise to try out DRb and Hpricot, so don't use it for your production work, whatever that may be. An actual crawler is far more robust than what I wrote. And don't leave it running, hammering away at sites, since that'll get you banned.

First, this is how you use it:

Crawler::WebCrawler.start("http://en.wikipedia.org/") do |doc|
  puts doc.search("title").inner_html
end


And that's it. It yields each document in an XPath-traversable form, courtesy of Hpricot.
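For instance, Hpricot lets you poke at the parsed document with XPath- or CSS-style selectors. A quick standalone illustration, not part of the crawler, just Hpricot on a hard-coded snippet of HTML:

require 'hpricot'

doc = Hpricot("<html><head><title>Hello</title></head><body><a href='/about'>About</a></body></html>")
puts doc.search("title").inner_html    # => Hello
puts doc.search("//a").first["href"]   # => /about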

A web crawler is a program that simply downloads pages, takes note of what links there are on each page, and puts those links on its queue of links to crawl. Then it takes the next link off its queue, downloads that page, and does the same thing. Rinse and repeat.
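In its simplest, single-process form, that loop looks something like the sketch below (using Hpricot and open-uri; this is just for illustration, not the code this post builds):

require 'hpricot'
require 'open-uri'
require 'uri'

queue   = ["http://en.wikipedia.org/"]
visited = {}

# Keeps going until the queue runs dry; in practice you'd cap the number of pages.
until queue.empty?
  url = queue.shift
  next if visited[url]
  visited[url] = true

  doc = Hpricot(open(url)) rescue next
  (doc/"a").each do |link|
    href = link["href"] or next
    absolute = URI.join(url, href).to_s rescue next
    queue << absolute unless visited[absolute]
  end
end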

First, we create a class method named start that creates an instance of a webcrawler and then starts it. Of course, we could have done without this helper method, but it makes it easier to call.


require 'drb'
require 'rinda/ring'
require 'rinda/tuplespace'
require 'hpricot'
require 'open-uri'

module Crawler
  class WebCrawler
    class << self
      def start(url)
        crawler = WebCrawler.new
        crawler.start(url) do |doc|
          yield doc
        end
        return crawler
      end
    end
  end
end


So next, we define the initialization method.


module Crawler
  class WebCrawler
    def initialize
      puts "Starting WebCrawler..."
      begin
        # The first crawler to run becomes the DRb server and sets up the shared services.
        DRb.start_service "druby://localhost:9999"
        puts "Initializing first crawler"
        puts "Starting RingServer..."
        Rinda::RingServer.new(Rinda::TupleSpace.new)

        puts "Starting URL work queue"
        @work_provider = Rinda::RingProvider.new(:urls_to_crawl, Rinda::TupleSpace.new, "Queue of URLs to crawl")
        @work_provider.provide

        puts "Starting URL visited hash"
        @visited_provider = Rinda::RingProvider.new(:urls_status, Hash.new, "Hash of URLs visited")
        @visited_provider.provide
      rescue Errno::EADDRINUSE => e
        # The port is already taken, so a server exists; just join it as another crawler.
        puts "Initializing another crawler"
        DRb.start_service
      end
      puts "Looking for RingServer..."
      @ring_server = Rinda::RingFinger.primary

      # Look up the shared work queue and visited hash registered with the RingServer.
      @urls_to_crawl = @ring_server.read([:name, :urls_to_crawl, nil, nil])[2]
      @urls_status   = @ring_server.read([:name, :urls_status, nil, nil])[2]
      @delay = 1
    end
  end
end

This bears a little explaining. The first webcrawler you start will create a DRb server if one doesn't already exist and do the setup. Every subsequent webcrawler will then connect to that server and start picking URLs off the work queue.

So when you start a DRb server, you call DRb.start_service with a URI, and then you start a RingServer. What a RingServer provides is a way for subsequent clients to find services offered by the server or by other clients.
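As a bare-bones illustration of that handshake (two separate processes shown together; the host and port are just the ones used below):

require 'drb'
require 'rinda/ring'
require 'rinda/tuplespace'

# --- In the first process: become the DRb server and advertise a RingServer.
DRb.start_service "druby://localhost:9999"
Rinda::RingServer.new(Rinda::TupleSpace.new)

# --- In any later process: start DRb without a URI and go find the ring.
# (RingFinger locates RingServers with a UDP broadcast on the local network.)
DRb.start_service
ring_server = Rinda::RingFinger.primary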

Next, we register a URL work queue and a URLs-visited hash as services. The URL work queue is a TupleSpace. If you haven't heard of a TupleSpace, the easiest way to think of it is as a bulletin board. Clients post items on it, and other clients can take them off. This is what we'll use as a work queue of URLs to crawl.
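Here's the bulletin-board idea in miniature (a toy, in-process example; over DRb the same calls just go through a remote reference):

require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new
ts.write([:url, "http://example.com/"])   # one client pins a tuple to the board
tuple = ts.take([:url, nil])              # another client takes the first match down
puts tuple[1]                             # => http://example.com/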

The URLs visited is a Hash so we can check which URLs we've already visited. Ideally, we'd just query the URL work queue, but DRb seems to only provide blocking calls for reading/taking from the TupleSpace. That doesn't quite make sense, but I couldn't find such a call that day. Lemme know if I'm wrong.
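(Aside: if memory serves, read and take accept an optional timeout in seconds, and there's also read_all, which doesn't block. Something like the following might work as a presence check, though I haven't folded it into the crawler, so treat it as an untested sketch:)

require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new
ts.write([:url, "http://example.com/"])

begin
  tuple = ts.read([:url, nil], 0)   # zero-second timeout: don't block
  puts "seen: #{tuple[1]}"
rescue Rinda::RequestExpiredError
  puts "no matching tuple yet"
end

# read_all returns every match without blocking (an empty array if there are none).
puts ts.read_all([:url, nil]).size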


module Crawler
  class WebCrawler
    def start(start_url)
      @urls_to_crawl.write([:url, URI(start_url)])
      crawl do |doc|
        yield doc
      end
    end

    private

    def crawl
      loop do
        url = @urls_to_crawl.take([:url, nil])[1]
        @urls_status[url.to_s] = true

        doc = download_resource(url) do |file|
          Hpricot(file)
        end or next
        yield doc

        time_begin = Time.now
        add_new_urls(extract_urls(doc, url))
        puts "Elapsed: #{Time.now - time_begin}"
      end
    end
  end
end


Here are the guts of the crawler. It loops forever, taking a URL off the work queue with take(), which looks for a tuple matching the given pattern and returns the first match. Then we mark the URL as visited in @urls_status, download the resource at that URL, parse it into a document with Hpricot, and yield it. If we can't download it for whatever reason, we just grab the next URL. Lastly, we extract all the URLs in the document and add them to the work queue TupleSpace. Then we do it all again.

The private methods download_resource(), extract_urls(), and add_new_urls() are merely details, and I won't go over them here. But if you want to check them out, you can download the entire file below. There are weaknesses to it that I haven't solved, of course. If the first client goes down, everyone goes down. Also, there's no central place to collect the processing done by the clients. But like lazy textbook writers, I'll say I'll leave that as an exercise for the reader.
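For the curious, here's a rough sketch of what those three helpers might look like. This is a reconstruction for illustration only; the names match the calls above, but the real webcrawler.rb may well differ:

# These would sit in the private section of WebCrawler, alongside crawl.
def download_resource(url)
  return nil unless url.scheme == "http"
  sleep @delay                          # be polite between requests
  open(url.to_s) { |file| yield file }  # open-uri; returns the block's value (the Hpricot doc)
rescue StandardError
  nil
end

def extract_urls(doc, base_url)
  (doc/"a").map do |link|
    href = link["href"] or next
    URI.join(base_url.to_s, href) rescue nil
  end.compact
end

def add_new_urls(urls)
  urls.each do |url|
    next if @urls_status[url.to_s]
    @urls_to_crawl.write([:url, url])
  end
end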


webcrawler.rb
