Friday, September 12, 2008

Algorithm always matters, even when parallel

I've finally had some time to look into things that interest me, and one of them is parallel algorithms.

I was reading Hacker News, and I saw some link that led me to discover Cilk, a C extension that makes it easy to write parallel programs.
thread int fib(int n) {
    if (n < 2)
        return n;
    else {
        cont int x, y;
        x = spawn fib (n-2);
        y = spawn fib (n-1);
        sync;
        return x+y;
    }
}

This was the example given for calculating the Fibonacci sequence in parallel. It follows the standard mathematical definition, and it looks clean enough. So instead of trying it out in Cilk, I fired up Erlang to try my hand at a port. I found it a little difficult: while you can easily spawn processes in Erlang, there was no quick way for the parent process to wait on (sync/join) its child processes and get their results. Since that was beside the point of the exercise, I fired up Ruby instead, even though it has a slow threading library (which is supposed to be fixed in 1.9, plus Fibers!). I'll do it with Erlang some other time.

First I wrote the threaded version to mimic the Cilk code:
def fib_threaded(n)
  if n < 2
    return 1
  else
    threads = []
    x = 0
    y = 0
    threads << Thread.new(n - 2) { |i| x = fib_threaded(i) }
    threads << Thread.new(n - 1) { |i| y = fib_threaded(i) }
    threads.each { |thread| thread.join }
    return x + y
  end
end

I don't have a multi-core processor; I'm on a 3-year-old 1GHz laptop. Even at a mere fib(18), it took about 21 seconds to run. To see if there was a difference, I wrote a serial version.

def fib_serial(n)
  n < 2 ? 1 : fib_serial(n - 1) + fib_serial(n - 2)
end

This one ran much, much faster: about 0.02594 seconds. At this point, it's probably the overhead of thread creation that makes the threaded version take so long. Maybe with green threads or lightweight threads, the threaded version would run much faster. That makes me want to try it in Erlang just to compare. But wtf, adding shouldn't take that long, even if it is only 0.025 seconds.
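Here's a rough sketch of how a comparison like this could be timed with Ruby's standard Benchmark module. It is not the original script: the input n = 10 is an arbitrary small value so it finishes quickly, and the threaded version is condensed to use Thread#value (which joins the thread and returns the block's result) instead of shared x and y variables.

```ruby
require 'benchmark'

# Serial version, same convention as above: fib(0) = fib(1) = 1.
def fib_serial(n)
  n < 2 ? 1 : fib_serial(n - 1) + fib_serial(n - 2)
end

# Threaded version: one new thread per recursive call.
def fib_threaded(n)
  return 1 if n < 2
  x = Thread.new { fib_threaded(n - 2) }
  y = Thread.new { fib_threaded(n - 1) }
  x.value + y.value # value waits for each thread, then returns its result
end

n = 10 # assumed small input; raise it carefully, the thread count explodes
serial_time   = Benchmark.realtime { fib_serial(n) }
threaded_time = Benchmark.realtime { fib_threaded(n) }
puts format('serial: %.5fs  threaded: %.5fs', serial_time, threaded_time)
```

The absolute numbers will depend entirely on the machine and the Ruby version, so treat them as a relative comparison only.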

When I thought about it, it was an inefficient algorithm: there are a lot of wasted cycles. In order to compute f(n), you have to calculate f(n - 1) and f(n - 2) in separate threads.
  • The f(n - 1) thread has to spawn two more threads to compute f(n - 2) and f(n - 3).
  • The f(n - 2) thread has to spawn two more threads to compute f(n - 3) and f(n - 4).
Notice that the threads for f(n - 1) and f(n - 2) each spawn their own thread to calculate f(n - 3). And since this algorithm has no way for threads to share their results, they have to recompute values all over again. The higher the n, the worse the problem gets, exponentially.
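To make the duplication concrete, here's a small sketch that tallies how many times the naive recursion asks for each f(k). It also includes a memoized variant as one possible way of sharing results. Both helpers (count_calls and fib_memo) are hypothetical names for illustration, and both are serial: sharing a cache between real threads would need extra care that this sketch doesn't attempt.

```ruby
# Tally how often the naive recursion is asked for each argument.
def count_calls(n, counts = Hash.new(0))
  counts[n] += 1
  if n >= 2
    count_calls(n - 1, counts)
    count_calls(n - 2, counts)
  end
  counts
end

# Same recursion, but each f(k) is stored the first time it is computed.
def fib_memo(n, memo = {})
  return 1 if n < 2
  memo[n] ||= fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
end

counts = count_calls(10)
counts.sort.reverse.each do |arg, times|
  puts "f(#{arg}) requested #{times} time(s)"
end
puts "total calls: #{counts.values.inject(:+)}" # 177 calls for n = 10
puts "fib_memo(10) = #{fib_memo(10)}"           # 89
```

For n = 10 the naive version makes 177 calls in total, and f(3) alone is requested 21 times, while the memoized version computes each value once.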

To calculate the speedup an algorithm can get from adding more processors, you take the total amount of work required and divide it by the span of the parallel graph. If that didn't make sense, read lecture one for Cilk, which is where the diagram comes from. So for fib(n):

Twork = O(2^n)

The total amount of work is the total number of processes spawned. Since every f(n) recursively spawns two other processes, and the recursion goes about n levels deep, it's about 2^n processes (the exact count grows like the Fibonacci numbers themselves, roughly 1.618^n).

Tspan = O(n)

The span is the number of nodes on the longest chain of calculations that have to happen one after another. A la the diagram, it's the height of the tree, and since the deepest path goes f(n), f(n - 1), f(n - 2), and so on down to f(1), that's about n nodes.

Therefore, for fib(n), the processor speedup is at most:

Tw / Ts = O(2^n) / O(n)

That ratio grows exponentially with n. On one hand, it means this algorithm can keep benefiting from more processors, up to roughly 2^n / n of them. On the other hand, all of that parallelism comes from doing an exponential amount of redundant work, so to make this algorithm go as fast as it can for fib(1000), you'd need astronomically more than 1000 processors. Not so good for a parallel program that's just doing addition.
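One way to sanity-check estimates like these is to measure the recursion tree directly. The sketch below is my own illustration, with a hypothetical helper name (work_and_span): it counts every call as one unit of work and takes the span to be the number of calls on the longest chain from the root down to a leaf.

```ruby
# Returns [work, span] for the call tree of the naive fib(n):
#   work = total number of calls in the tree
#   span = number of calls on the longest root-to-leaf path
def work_and_span(n)
  return [1, 1] if n < 2
  w1, s1 = work_and_span(n - 1)
  w2, s2 = work_and_span(n - 2)
  [1 + w1 + w2, 1 + [s1, s2].max]
end

(5..20).step(5) do |n|
  work, span = work_and_span(n)
  puts "n=#{n}  work=#{work}  span=#{span}  work/span=#{work / span}"
end
```

For n = 10 this gives a work of 177 calls and a span of 10, so the work balloons while the span only creeps up one step per increment of n.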

As a last version, I wrote one that computes the Fibonacci numbers from 0 up to n, keeping the running totals as I go, instead of the recursive version that has to work its way from n back down to zero.

def fib_loop(n)
  sequence = [1, 1]
  return sequence[n] if n < 2
  # Append f(2)..f(n), each built from the two values before it.
  (1..n-1).inject(sequence) { |t, _| t << (t[-2] + t[-1]) }
  sequence[-1]
end

It's not space efficient since I wrote it quickly, but it beat the pants off the other two, running in 0.00014 seconds. As you can see, you're not calculating any f(n) more than once.
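For what it's worth, the space problem goes away if you only keep the last two values instead of the whole sequence. This is a minimal sketch under the same convention as the versions above (fib(0) = fib(1) = 1); fib_pair is just a name I picked for it.

```ruby
# Constant-space variant: carry only the two most recent values forward.
def fib_pair(n)
  a, b = 1, 1
  n.times { a, b = b, a + b } # after k steps, a holds f(k)
  a
end

puts (0..10).map { |i| fib_pair(i) }.join(', ')
# prints 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
```

It still does one addition per step like the loop version, just without the array growing alongside n.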

I wish Cilk had a better first example for parallel programs. Given that the guy making Cilk is the same guy that co-wrote the famous mobile book for algorithms, I guess I was surprised. However, it was a fun exercise, and it made me think about algorithms again.

I'll find some other little project that requires me to write in Erlang, rather than falling back on the comfortable Ruby. The snippet is below if you want to run it yourself.

2 comments:

  1. Anonymous, 12:10 AM

    The reason the recursive fib is used as an example is because it's easy to read, easy to write and recurses a whole lot. If you can make recursive fib efficient in a parallel system, you can make pretty much any tree-recursive function efficient. The actual result matters very little, it's all about measuring the overhead.

    So making an optimized non-recursive version or an optimized non-threading version is completely beside the point, because what you're measuring with fib is the speed of recursion and threading.

  2. What you've arrived at is one of the canonical examples of dynamic programming.

    As "Anonymous" said, there are reasons they use recursive Fibonacci in places as tests, and none of it has to do with it being the algorithm with least time complexity, since (as you pointed out), it's not. =)
