Friday, October 24, 2008

Frogmetrics on techcrunch

As some of you know, I joined a YCombinator company called Frogmetrics. We've launched, and was covered by Techcrunch last night.

I work on the server side of things, which means the analytics package which we call Mastermind. Working on the architecture over the summer was a little bit different than the typical rails app.

Mastermind allows users to view graphs of survey data for a particular region, location, or employee at a specific time period.

One of the comments that I saw on techcrunch was why didn't we use iPhones? That's a common question, and the answer is a mix of technical and business reasons. If we had iPhones to give to people, they would readily know that it was an iPhone and be more likely to run off with one. Sorry Nokia, but your tablets just aren't as sexy to run off with. Alternatively, telling people to go to a webpage on their iPhones would be another barrier to taking the survey. It wouldn't get the high response rates that we're seeing right now. The idea is to make it as easy and as painless as possible to take a survey. In our pilots, we found that older folks had a much tougher time reading the text. Nokias have bigger screens, and we could hence put bigger text on there, not to mention big beautiful buttons that affords pushing.

Well, if you or your company has interest in our product, feel free to let us know on the contact page. I dislike that we don't have our prices on the web page as of yet, but that'll be worked out soon.

Wish us luck (and hard work)!

As a side note, interesting things that I post on here don't always have to do with Frogmetrics. Like any good engineer, I play or think about interesting things whether they're related to work or not. So don't take it that whatever I'm playing with here is what's going into Frogmetrics. :)

Edited

Thursday, October 02, 2008

Benchmarking row multiply in Ruby's Linalg

While using LinAlg to write and understand/review information gain, I found that I wanted the equivalent of a row multiply, notated as [*] (it's non-standard. I made up the notation), where:
A [*] x = B

[[a00, ..., a0M] [[x0] [[a00 * x0, ..., a0M * x0]
[... ...] [*] [x1] = [... * x1, .... ...]
[aN0, ..., aNM]] [x2]] [aN0 * x2, ..., aNM * x2]]

I want to map each row with each corresponding value of x. This operation is not standard matrix multiplication, though I feel like it should be!

I admit I slept in Linear Algebra class. Thus I was a bit dumbfounded about how to express it using normal matrix multiplication. While I could do it using LinAlg's mapping functions, I knew that it could be done because it was all linear transformations.

Luckily James Lawrence, who maintains LinAlg was emailing me, and asked me about it (thanks!). He gave me some code that did it both ways. After I read it, I slapped myself.

# using straight up matrix multiplication
def calculate1(con)
DMatrix.diagonal(con*DMatrix.new(con.hsize,1,1)).inverse*con
end

# using map
def calculate2(con)
DMatrix.diagonal((con*DMatrix.new(con.hsize,1,1)).map { |e| 1.0/e })*con
end

I wasn't sure which one was faster. Maybe they'd be about the same. Maybe doing matrix multiply would be slower because you'd have to go through all the multiplications. Per James's suggestion, I benchmarked them, in the fine form of yak shaving.

Run on a laptop 1.1GHz Pentium M, 758MBs, I did two experiments. For one experiment, I kept the matrix at 800 rows, then I grew the columns. Let's see what it looks like:

time (secs) vs column size

I should have labeled the axis in GnuPlot, but you'll live. The y axis is the number of seconds, and the x axis is the column size. Uninteresting, but nice to know it's linear. The two functions aren't significantly different from each other in the amount of time that it takes. I expected the map (calculate2) to be much faster since it didn't have to do all the multiplies. oh well.

I almost didn't run the second test, but it proved to be a bit more interesting. This time I kept 800 columns and grew the number of rows. Same axis. Different graph:

time (secs) vs row size

Whoa! It's exponential. Or quadratic. I can't tell. Anyway, anything curving up is bad news. I suspect this might have something to do with row major/column major ordering. C stores matrices row by row, whereas Fortran stores it column by column.. Update: As corrected by James L., in the first experiment, growing columns will create more multiplies inside the diag() call, but the size of the diagonal will stay the same. Growing the rows, however, will create less multiplies inside the diag() call, but each increase in row size will increase both dimensions of the resulting diagonal matrix, giving us n^2. So it's quadratic.

So what I went looking for wasn't what I meant to find. But given that I've read about it before, it wasn't super interestingit would have made sense had I thought about it, I guess it's not super surprising. Let's see if we can use transpose to cut down on the time. For one part, we'll grow the rows as before, and compare it to growing rows, but transposing the input then transposing the output, to get the same deal. What's it look like:

time(secs) vs row size

This is good news. Even if transpose is an extra couple of manipulations, it saves us computation for bigger matrix sizes. The most interesting part of the graph is the crossing of the two graphs. If somehow, LinAlg (or any other package for that matter) can detect where that inflection point is going to be, it can switch off between the two. The only thing I can think of is another package lying underneath doing sampling of each implementation randomly whenever a user calls the function to do interpolation of its growth curve, and then calculate the crossing analytically. I don't currently know of any package that does this (or if it does, I don't know about it, cuz it already performs so well by doing the switch!)

This was a nice little diversion from my side projects...a side project of side projects. Back to learning about information gain and its ilk. At least something came out of it. I have a nice little experiment module that I can use to do other experiments. And I spent way too much time on it not to post something...

I share my code below, and you can run it if you want. snippet!

# These are the experiments that I ran
#!/usr/bin/ruby

require 'experiment'

require 'linalg'
include Linalg

def calculate1(con)
DMatrix.diagonal(con*DMatrix.new(con.hsize,1,1)).inverse*con
end

def calculate2(con)
DMatrix.diagonal((con*DMatrix.new(con.hsize,1,1)).map { |e| 1.0/e })*con
end

DATA_PTS = 10
MAX = 800
MID = MAX / 2
STEP = MAX / DATA_PTS

# Growing columns
Experiment::compare(1..MAX, STEP,
proc { |col_size| DMatrix.rand(MID, col_size) },
proc { |matrix, col_size|
calculate1(matrix)
},
proc { |matrix, col_size|
calculate2(matrix)
})

# Growing rows
Experiment::compare(1..MAX, STEP,
proc { |row_size| DMatrix.rand(row_size, MID) },
proc { |matrix, row_size|
calculate1(matrix)
},
proc { |matrix, row_size|
calculate2(matrix)
})

# Growing rows, transposing
Experiment::compare(1..MAX, STEP,
proc { |row_size| DMatrix.rand(row_size, MID) },
proc { |matrix, col_size|
calculate1(matrix)
},
proc { |matrix, col_size|
calculate1(matrix.transpose).transpose
})


# Experiment.rb - perform timed experiments across on one dimension
require 'rubygems'
require 'gnuplot'

# Experiment is a module where you can use to plot and benchmark pieces of code
# and plot it on a gnuplot
#
#
module Experiment
class << self

def timer
start_time = Time.now
yield
diff_time = Time.now - start_time
puts "That run took #{diff_time} seconds"
return diff_time
end

# takes range to try and the step resolution of the range and runs
# the benchmark_block with each of the different ranges.
# the init_proc only gets run once during setup.
#
# Experiment::benchmark((1..50), 10,
# proc { |col_size| DMatrix.rand(500, col_size)},
# proc { |matrix, col_size| ProbSpace.new(matrix).entropy })
def benchmark(range, resolution, init_proc, benchmark_block)
xs, ts = [], []
range.step(resolution) do |x_size|
object = init_proc.call(x_size)
xs << x_size
ts << timer { benchmark_block.call(object, x_size) }
end
plot(xs, ts)
end

# same idea as benchmark, but does two at the same time. So
# it takes an init_proc to initialize the experiment,
# and two different procs, standard and comparison, to run
# for each step in the range at the given resolution.
#
# Experiment::benchmark((1..50), 10,
# proc { |col_size| DMatrix.rand(500, col_size)},
# proc { |matrix, col_size| ProbSpace.new(matrix).entropy1 },
# proc { |matrix, col_size| ProbSpace.new(matrix).entropy2 })
def compare(range, resolution, init_proc,
standard_block, comparison_block)
xs, s_ts, c_ts = [], [], []
range.step(resolution) do |x_size|
object = init_proc.call(x_size)
xs << x_size
s_ts << timer { standard_block.call(object, x_size) }
c_ts << timer { comparison_block.call(object, x_size) }
puts "#{x_size} = standard : comparison :: #{s_ts.last} : #{c_ts.last} secs"
end
plot(xs, s_ts, c_ts)
end

def plot(x, *ys)
Gnuplot.open do |gp|
Gnuplot::Plot.new(gp) do |plot|
ys.each do |y|
plot.data << Gnuplot::DataSet.new([x, y]) do |ds|
ds.with = "linespoints"
ds.notitle
end
end
end
end
end

end
end