Sunday, July 29, 2007

Erlang and Neural Networks article on TrapExit.org

Thanks for the reader that submitted my article to Trapexit.org, the guy that runs it asked me to post the full article on there. It ended up being about 7th on programming.reddit.com for about a day or two. So far, I haven't had any complaints on it, so either people haven't read it, or it's been pretty accurate. :)

I didn't think writing that was going to be such a big job when I started, but I suppose I took the position that anyone reading it had minimal mathematical and Erlang background. Therefore, there was a lot of explain. I can appreciate why good textbooks are hard to come by now. :P

But I did learn a lot about Erlang and functional style programming. In addition, Neural networks aren't really mystifyingly magical, like I think many people think of them. I use to think you can just solve anything with them. And while they solve a particular class of problems quite well, they're essentially just a high dimensional gradient descent.

I'll probably work on the neural network code later on--as I haven't written a trainer for it. And I'll probably try to do a particle swarm optimization article in Erlang some other time. In the meantime, I have other things to experiment with and work on. You'll hear about it here first!

Saturday, July 28, 2007

Self-referential many-to-many join with extra data and non-standard foreign keys

Here's another gotcha that I ran into that's kind of an edge case in Rails. So I'm documenting it, so that the rest of you can go about your business of building great apps, instead of scratching your heads. So what was I doing? I wanted a self-referential join table with rich associations for an "acts_as_friendable" plugin I was writing.

I made table in database as follows:
  create_table "friendships", :force => true do |t|
t.column "owner_id", :integer, :null => false
t.column "friend_id", :integer, :null => false
end

In the model, put:
  class Friendship < ActiveRecord::Base
belongs_to :owner, :class_name => "User", :foreign_key => :owner_id
belongs_to :friend, :class_name => "User", :foreign_key => :friend_id
end

Then in the table that you want to act as friendable, you put:
  class User < ActiveRecord::Base

acts_as_friendable

end

And that's it. you can now access friends() and friendships() of a person.

However, it doesn't work when you try to push the association through, like:
user = User.find(1)
friend = User.find(2)
user.friends << friend

It will fail complain about how "user_id" doesn't exist in any of the tables. I was scratching my head for a good 2 hours before I figured that it wasn't actually me this time. I think the << method doesn't correctly use the foreign keys correctly when they're non-standard like this. According to #6466 ([PATCH] fix for has_many :through push and delete with legacy/non-standard foreign_keys), this doesn't work, until a patch is written for it. I'm not as familiar with the Rails code base as I should be. This is probably a good chance to get started looking at it...unless someone else fixes it first...

As a workaround, I simply used the Friendship ActiveRecord object directly to create the association. I'll have to override the method in the plugin so that it allows correct use of << for my specific case. Tip!

Saturday, July 21, 2007

Erlang and Neural Networks Part V

Here's part V of neural networks.

Last time, we learned how the neural network learns. The most important part was the very last section, where we figured out what each perceptron needs to do to adjust its weights to learn.

(5) wji = ɳ * [∑k=1 to c wkjδk] * f'(netj) * xi

To recap:
  • Eta - The learning rate
  • Sensitivity from the next layer
  • Inputs from the previous layer




In order to achieve this, we'll have to refactor some of our code. Let's do so.

Vector map


map() is a function that applies the same operation to all elements in an array. I needed a function that is a map for two different lists of the same length. Vector and matrix addition relies on this:
[1,2,3] + [4,5,6] = [5,10,18]

I couldn't find one in the erlang APIs, so I wrote one myself. Ends up this code is used when adjusting weights.
% like map, but with two lists instead.
vector_map(Func, [], []) ->
[];
vector_map(Func, [Xh | Xt], [Yh | Yt]) ->
[Func(Xh, Yh) | vector_map(Func, Xt, Yt)].

This looks very much like the code for the previous dot_prod(). Let's keep things DRY and make dot_prot() use vector_map.
% Calculates the dot product of two lists
dot_prod(X, Y) ->
lists:foldl(fun(E, Sum) -> E + Sum end, 0,
vector_map(fun(Ex, Ey) -> Ex * Ey end, X, Y)).

So vector_map() takes the the nth element and multiplies them together. foldl() is very much like reduce() in MapReduce or inject() in ruby. It takes all the elements and sums it up.

f'(netj)


We need to calculate the feed forward, but with the derivative of the sigmoid function.

Recall that our feed_forward function looks like this:

sigmoid(X) ->
1 / (1 + math:exp(-X)).

feed_forward(Weights, Inputs) ->
sigmoid(dot_prod(Weights, Inputs)).

According to our calculations in the last part, we also need the derivative of the sigmoid function, but with the same stuff passed into it. So we could write another function like this:

sigmoid_deriv(X) ->
math:exp(-X) / (1 + math:exp(-2 * X)).

feed_forward_deriv(Weights, Inputs) ->
sigmoid_deriv(dot_prod(Weights, Inputs)).

This is a waste, since the two versions of feed forward are essentially doing the same thing. Here is where we can use first-class functions to get around it.

Languages that support first class functions are powerful because you can pass functions in and out of other functions as arguments. This is nice because you don't need a lot of structure to get things done--no messing around with anonymous inner classes and function pointers.

Instead, we'll write a new Sigmoid functions in perceptron and the feed_forward function outside of it.

perceptron(Weights, Inputs, [other variables]) ->
Sigmoid = fun(X) ->
1 / (1 + math:exp(-X))
end,

Sigmoid_deriv = fun(X) ->
math:exp(-X) / (1 + math:exp(-2 * X))
end,
receive
% perceptron messages...
end.

feed_forward(Func, Weights, Inputs) ->
Func(dot_prod(Weights, Inputs)).

So now if we want to feed forward with a particular non-linear activation function, we just tack it on as an argument!
feed_forward(Sigmoid, [1, 2], [2, 3]).        % f(net)
feed_forward(Sigmoid_deriv, [1, 2], [2, 3]). % f'(net)

Some helper functions to refactor


Remember convert_to_list?
convert_to_list(Inputs) ->
lists:map(fun(Tup) ->
{_, Val} = Tup,
Val
end,
Inputs).


Let's change it to convert_to_value and add a convert_to_key function:

convert_to_values(Tuple_list) ->
lists:map(fun(Tup) ->
{_, Val} = Tup,
Val
end,
Tuple_list).

convert_to_keys(Tuple_list) ->
lists:map(fun(Tup) ->
{Key, _} = Tup,
Key
end,
Tuple_list).


Taking note of sensitivities


We also need to adjust our perceptron to keep the sensitivities from the next layer, in the same way we keep the outputs from the previous layer. That means that our perceptron will no longer keep a list of just Output_PIDs, but a list of tuples [{OutputPID_1, Sensitivity_1}, ... {OutputPID_n, Sensitivity_n}], called Sensitivities.

perceptron(Weights, Inputs, Sensitivities) ->
receive
{stimulate, Input} ->
New_inputs = replace_input(Inputs, Input),
Output_value = feed_forward(Sigmoid, Weights, convert_to_values(New_inputs)),
if Sensitivities =/= [] ->
% My output's connected to at least one perceptron:
lists:foreach(fun(Output_PID) ->
Output_PID ! {stimulate, {self(), Output_value}}
end,
convert_to_keys(Sensitivities));
Sensitivities =:= [] ->
% My output's connected to no one:
io:format("~w outputs: ~w~n", [self(), Output_value]),
% Call a trainer here instead and
self() ! {learn, {self(), 1}}
end,
perceptron(Weights, New_inputs, Sensitivities);

% other messages....
end.

We'll need to do this for all occurrences of Output_PID in other messages, such as pass, connect_to_output, and connect_to_input. When we need a list of Output_PIDs, we simply call convert_to_keys(Sensitivities).

The actual learning


Now that we have all that other stuff out of the way, we can actually write the learning message of a perceptron. The steps will be:
  1. Update the list of sensitivities from the previous layer
  2. Calculate the sensitivity for this node
  3. Adjust all the weights based on the sensitivity
  4. Propagate sensitivities and associated weight back to the previous layer


perceptron(Weights, Inputs, Sensitivities) ->
receive
{learn, Backprop} ->
Learning_rate = 0.5,

% Calculate the correct sensitivities
New_sensitivities = add_sensitivity(Sensitivities, Backprop),
Output_value = feed_forward(Sigmoid, Weights, convert_to_values(Inputs)),
Derv_value = feed_forward(Sigmoid_deriv, Weights, convert_to_values(Inputs)),
Sensitivity = calculate_sensitivity(Backprop, Inputs, New_sensitivities,
Output_value, Derv_value),
io:format("(~w) New Sensitivities: ~w~n", [self(), New_sensitivities]),
io:format("(~w) Calculated Sensitivity: ~w~n", [self(), Sensitivity]),

% Adjust all the weights
Weight_adjustments = lists:map(fun(Input) ->
Learning_rate * Sensitivity * Input
end,
convert_to_values(Inputs)),
New_weights = vector_map(fun(W, D) -> W + D end, Weights, Weight_adjustments),
io:format("(~w) Adjusted Weights: ~w~n", [self(), Weights]),

% propagate sensitivities and associated weights back to the previous layer
vector_map(fun(Weight, Input_PID) ->
Input_PID ! {learn, {self(), Sensitivity * Weight}}
end,
New_weights,
convert_to_keys(Inputs)),

perceptron(New_weights, Inputs, New_sensitivities);

% The other messages...
end.

Notice there's a couple helper functions in there to help with the sensitivities.
% adds the propagating sensitivity to the Sensitivities Hash
add_sensitivity(Sensitivities, Backprop) when Sensitivities =/= [] ->
replace_input(Sensitivities, Backprop);
add_sensitivity(Sensitivities, Backprop) when Sensitivities =:= [] ->
[].

% Calculates the sensitivity of this particular node
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =/= [], Inputs =:= [] -> % When the node is an input node:
null;
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =:= [], Inputs =/= [] -> % When the node is an output node:
{_, Training_value} = Backprop,
(Training_value - Output_value) * Derv_value;
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =/= [], Inputs =/= [] -> % When the node is a hidden node:
Derv_value * lists:foldl(fun(E, T) -> E + T end, 0, convert_to_values(Sensitivities)).

Note that there are guards (the conditions that start with "when") on each of these functions. They are the different cases for how to calculate the sensitivity, given whether the node was an input node, an output node, or a hidden node.

Test Run


So as a little test, we can exercise it by writing a function called run(), and running it from the command line:
run() ->
X1_pid = spawn(ann, perceptron, [[],[],[]]),
X2_pid = spawn(ann, perceptron, [[],[],[]]),
H1_pid = spawn(ann, perceptron, [[],[],[]]),
H2_pid = spawn(ann, perceptron, [[],[],[]]),

O_pid = spawn(ann, perceptron, [[],[],[]]),

% Connect input node X1 to hidden nodes H1 and H2
ann:connect(X1_pid, H1_pid),
ann:connect(X1_pid, H2_pid),

% Connect input node X2 to hidden nodes H1 and H2
ann:connect(X2_pid, H1_pid),
ann:connect(X2_pid, H2_pid),

% Connect input node H1 and H2 to output node O
ann:connect(H1_pid, O_pid),
ann:connect(H2_pid, O_pid),

X1_pid ! {status},
X2_pid ! {status},
H1_pid ! {status},
H2_pid ! {status},
O_pid ! {status},

X1_pid ! {pass, 1.8},
X2_pid ! {pass, 1.3}.


233> ann:run().
Output of <0.841.0> connected to Input of <0.843.0>
Output of <0.841.0> connected to Input of <0.844.0>
Output of <0.842.0> connected to Input of <0.843.0>
Output of <0.842.0> connected to Input of <0.844.0>
Output of <0.843.0> connected to Input of <0.845.0>
Output of <0.844.0> connected to Input of <0.845.0>
Status of Node(<0.841.0>)
W: []
I: []
S: [{<0.844.0>,0},{<0.843.0>,0}]
Status of Node(<0.842.0>)
W: []
I: []
S: [{<0.844.0>,0},{<0.843.0>,0}]
Status of Node(<0.843.0>)
W: [0.891633,-0.112831]
I: [{<0.842.0>,0.446080},{<0.841.0>,-0.815398}]
S: [{<0.845.0>,0}]
Status of Node(<0.844.0>)
W: [0.891633,-0.112831]
I: [{<0.842.0>,0.446080},{<0.841.0>,-0.815398}]
S: [{<0.845.0>,0}]
Status of Node(<0.845.0>)
W: [0.891633,-0.112831]
I: [{<0.844.0>,0.446080},{<0.843.0>,-0.815398}]
S: []
{pass,1.30000}Stimulating <0.844.0> with 0.200000
Stimulating <0.844.0> with 0.300000

Stimulating <0.843.0> with 0.200000
Stimulating <0.843.0> with 0.300000
<0.845.0> outputs: 0.650328
Stimulating <0.844.0> with 1.80000
Stimulating <0.844.0> with 1.30000
<0.845.0> outputs: 0.643857
Stimulating <0.843.0> with 1.80000
Stimulating <0.843.0> with 1.30000
<0.845.0> outputs: 0.606653
<0.845.0> outputs: 0.607508
(<0.845.0>) New Sensitivities: []
(<0.845.0>) Calculated Sensitivity: 0.178902
(<0.845.0>) Adjusted Weights: [0.891633,-0.112831]
<0.845.0> outputs: 0.610857
(<0.844.0>) New Sensitivities: [{<0.845.0>,0.168491}]
(<0.843.0>) New Sensitivities: [{<0.845.0>,-1.12092e-2}]
<0.845.0> outputs: 0.655916
(<0.844.0>) Calculated Sensitivity: 5.64317e-2
(<0.843.0>) Calculated Sensitivity: -3.75421e-3
(<0.845.0>) New Sensitivities: []
(<0.844.0>) Adjusted Weights: [0.891633,-0.112831]
(<0.843.0>) Adjusted Weights: [0.891633,-0.112831]
(<0.845.0>) Calculated Sensitivity: 0.141549
(<0.842.0>) New Sensitivities: [{<0.844.0>,5.23863e-2},{<0.843.0>,0}]
(<0.841.0>) New Sensitivities: [{<0.844.0>,-3.50115e-3},{<0.843.0>,0}]
(<0.845.0>) Adjusted Weights: [0.941808,-6.26553e-2]
(<0.842.0>) Calculated Sensitivity: null
(<0.841.0>) Calculated Sensitivity: null
<0.845.0> outputs: 0.669378
(<0.844.0>) New Sensitivities: [{<0.845.0>,0.140548}]
(<0.843.0>) New Sensitivities: [{<0.845.0>,-3.24941e-3}]
(<0.842.0>) Adjusted Weights: []
(<0.841.0>) Adjusted Weights: []
<0.845.0> outputs: 0.668329
...(clipped)...

This is an asynchronous implementation of a feed-forward network, so there will be a lot of messages flying around back and forth. It is outputting the results of the network at the same time it's adjusting its weights. One will see a finer-grain adjustment of the outputs, and will have to wait until all stimulations have passed through the system to grab the output.

One can cut down on these messages if you implement something that keeps track of which inputs or sensitivities have been updated. Wait until all inputs have been updated to stimulate the next layer, and wait until all sensitivities have been updated to propagate back to the next layer.

The trainer


So now the neural network knows how to learn! All the hard part is done. However, there is one last part that I will leave as an exercise to the reader--that is to write a trainer for the network. Right now, in the stimulate message, it simply calls the learn method directly when an output node prints its output. It always uses "1" as the default training value. This is just a place holder. The learn method should be inside a trainer.

A trainer takes a list of examples, and feeds them to the network, one at a time, and for each valid output(remember the asynchronous network will output a series of values for any set of inputs), find the difference with the training value and send it to the learn message.

In Conclusion


I went over most of the hard part, so the last mile should be relatively easy. It's been a long trip to write these articles--longer than I had ever imagined when I started. So it is with all things. Hopefully, you have now a better understanding of neural networks, and realize that they're really not that magical. They're just a high-dimensional gradient descent. Let me know if you've found the series helpful. Maybe I'll try writing another set later on. Until then, I have other cool things to code. I'll let you know when I get it up and running.

Monday, July 16, 2007

Email attachments not downloading on Textdrive servers

This was one of those things that had me reeling in pain a month ago. I couldn't figure out what exactly was wrong...and I've run into this problem twice! The symptom was ActionMailer, which in turn is using Rails TMail extensions, didn't seem to be downloading attachment correctly on the server. It worked correctly on my development machine, but when I run it on the production server, it just didn't want to work. All attachments would be two bytes. I had suspected a host of other things, but I'll spare you the stupidity. It's a long chain from mail attachment to webpage, and it took a lot of work to narrow it down to Rail's TMail extensions.

It ends up that this is only an issue if base64.so is on the user's system. According to this two year old post on Textdrive, ActionMailer's TMail will choose between two implementations of base64, a C one or a Ruby one, based on the existence of the library base64.so

The problem is, the ruby version assumes that a string will return, and the C version assumes an array will return. And because of that, calling first() on a string only gives the first byte or two, as opposed to calling first on an array, which gives the first element (a string)

I hotfixed it, because I couldn't get a plugin to work and override the classes. I didn't spend too much time with it, so if anyone else wants to show me how to do the plugin for this simple fix, I'd be happy to learn.

In base64.rb in the rails directory (I froze my rails in my vendors directory) I changed the code to:
module TMail

module Base64
def rb_decode( str, strict = false )
str.unpack('m').join
end
end

end


And subsequently, in unquoter.rb
module Unquoter

class << self
def unquote_base64_and_convert_to(text, to, from)
convert_to(Base64.decode(text), to, from)
end
end

end


That should fix your problems for now. I was using rails 1.2.0 and it still hasn't been fixed. I know someone else submitted a patch for it already (Rails bug #7861). Tip!

Thursday, July 12, 2007

Nerd time, issue 7

Hi all,

There's more interesting things lately than usual. Here's another set of stuff to discover. I've put the more easily digestible stuff up top. This time, it's a lot geekier. The lower you go, the nerdier it gets.

Humanized Endo--CLI + GUI
The Humanized interface combines GUI and CLI, which really reminds me of emac's interface...but tons prettier. You can eval text as commands on the page in emacs, as well as executing commands. They basically want to do away with the desktop metaphor. The second link is a presentation. The guy presenting, Asa Raskin, is Jef Raskin's son...Jef is the guy that started the Macintosh before Steve Jobs took over.
http://www.humanized.com/
http://www.humanized.com/weblog/2007/05/18/die_desktop_die/

Multi-touch interfaces
More demo eye candy from Jeff Han.
http://www.thelastminuteblog.com/2007/03/19/new-jeff-han-video-multi-touch-ui/

How google earth works.
It explains some of the MIP-mapping techniques that GE uses, to get good filtering characteristics on its texture maps, so that things look crisp and clear, even at sharp angles.
http://www.realityprime.com/articles/how-google-earth-really-works

Distributed version control.
The second link is a talk given by Linus talking up distributed version control and his own version of it called Git. He also spends time ragging on SVN and how much it sucks. Ian's the only other person I know that's been using distributed version control with darcs. I've tried it, and it's not too bad. It took some time to understand some of the implications of DVC.
http://ianclatworthy.wordpress.com/2007/06/21/version-control-the-future-is-adaptive/
www.youtube.com/watch?v=4XpnKHJAok8

Haskell Faster than C on Great Language shootout benchmark.
I didn't read into detail, but what I gleamed is that lazy evaluation has its advantages. Read into it what you will. Overall, if haskell has to do work, it is slower than C. But if it can 'cheat', it will be faster in some cases.
http://neilmitchell.blogspot.com/2007/07/making-haskell-faster-than-c.html
http://www.haskell.org//pipermail/haskell/2006-June/018127.html

More on functional style programming.
I've been using more functional style programming lately. I like being able to chain things together, though sometimes, it doesn't make it necessarily easier to read. That's still dependent on the coder. Functional style programming has its advantages, but it's not made obvious here.
http://gensym.org/2007/4/7/enumerate-map-filter-accumulate

Lock-free hash tables.
This is kinda neat, actually. It's a talk on a concurrent hash table algorithm, where it doesn't use any locks (but it does use fencing during table resizes), and scales to 4000 processors. What I found neat is that the table resizing can be stacked, so that if you have 700 threads writing to a hash table all at once, it'll exponentially resize as it's reading and writing, where the reading and writing threads do some of the work copying table entries from the old table to the new table during the resize. More than one resize can be happening at the same time too.
http://video.google.com/videoplay?docid=2139967204534450862

Monday, July 09, 2007

Erlang and Neural Networks Part IV

Ahh, Part IV. It's been long overdue, mostly because I've been changing directions with my startup. I decided to drop everything I was doing, since it wasn't working, and head in another direction with the startup. And lately, I've been messing around with mobile platforms as well as zoomable interfaces. I'll talk more about that another time! But you came for neural networks. Last time in part III, we were able to connect the perceptrons to each other. This time, we're going to look at how you'd actually learn.

The ways of learning

There are many types of neural networks. But this one that we're building is a classic feed-forward neural network. A feed-forward neural network is a linear classifier, and the way it learns is to adjust the hyperplane that separates different classes in multi-dimensional space to minimize classification error, according to what it has seen before. The way that one would adjust the hyperplane is to change the value of the weights in the neural network. But how much to adjust it?

The classic way is to use back propagation, which we'll explore here. People since then have used other methods to calculate the weights, such as genetic algorithms and particle swarm optimization. You can basically use any type of optimization algorithm to adjust the weights.

Carrying the Error Backwards

To figure out the error at the output node is easy. You simply subtract the output from what the output was suppose to be, and that's your error (not exactly, but that's the idea). The problem was, how do you assign weights to the hidden layers when you can't directly see their output? Even if you could, how would you know which way to adjust it, since it would affect other nodes?

The basic idea of back propagation is to get the output of the network and compare its decision with the decision it should have made, and more importantly, how far off it was. That is the error rate of decision. We'll take that error and propagate it backwards towards the input so we will know how to adjust the weights, layer by layer.

I'm not going to go too much into the hows and whys back propagation, since I feel like there's a lot of tutorials out there that do it justice. And I won't go into the proof either. It's mainly just multi-dimensional calculus. It's not too hard to follow, actually. It's really just a matter of keeping the variables straight, since there are so many. I'll skip all that. But I will show and explain the result, since it makes understanding the code a lot easier.

I'm going to assume that most of my audience are programmers that didn't much like math. If they did, they probably wouldn't be reading this, and would have read the proof themselves from a textbook. Therefore, I'll explain some math things that I otherwise would not. Math people, bear with me...or correct me.

Starting from the back of the bus

Calculating the change in weights for the output node isn't too bad. Using my "awesome" GIMP skillz...it looks like this:

We'll start from the back. I color coded it to make it easier to figure out what the equations are saying. (If a variable is bolded, that means it's a vector) The error of output of the training input is:

(1) J(w) = ½ ∑ (tk - zk)2 = ½ * ||t - z||2

where t is what the output should have been, and z is what we actually got from the neural network. J(w) is basically a sum of all the errors across all output nodes. You'd want a square of the differences because you want to make all differences positive before you sum them, so the errors don't cancel each other out. The double lines stand for norm. You can think of norm as "length of vector". Norm is just a convenient way to write it.

If you wanted to derive back propagation, you'd take the derivative of J(w) with respect to w, and try to minimize J. Remember what I said about going in the direction of steepest change in error? Well, to calculate change, you calculate the derivative (since derivative means change), and that's why you'd do it in the proof. If you want to follow the proof, check out page 290-293 of Pattern Classification by Duda, Hart, and Stork.

The hyperplane

So skipping all the proof, you'd get two equations. One for calculating the adjustment of weights in the output layer (red layer), and the other for calculating the adjustment in weights of all other layers before that (yellow and green layers).

(2) wkj = ɳ * (tk - zk) * f'(netk) * yj

This is the equation to adjust the purple weights. It's not too bad, and I'll go through each part.
  • ɳ - The eta (funny looking 'n') in the beginning is the learning rate. This is a variable you tweak to adjust how fast the neural network learns. I'll talk more about that some other time, but don't think that you'd want to set this as high as possible.
  • (tk - zk) - Next, note that tk - zk aren't bolded, so they are what the output was suppose to be, and the output of the neural network of the kth output node. For us, we only have one output node.
  • f'(netk) - Remember back in part II, where we were talking about the sigmoid function? f'(x) is the derivative of the sigmoid function. If I haven't forgotten my calculus, it should be:

    (3) f'(x) = e-x / (1 + e-2x)

  • netk is the dot product of the output node weights with the inputs (yj) of the output node. Note that yj is also the outputs of the hidden layer, and it is calculated by f(netj)--note that this is a regular sigmoid.
In equation (2) above, we'll need a part of it to send back to the hidden layers. We'll represent it by a lower case delta (looks like an 'o' with a squiggly on top). It is called the sensitivity. This is what we propagate back to the other layers, and where the technique gets its name.

(4) δk = (tk - zk) * f'(netk)

The second equation dictates how to adjust all hidden layers. Note that it uses the sensitivity variable:

(5) wji = ɳ * [∑k=1 to c wkjδk] * f'(netj) * xi
  • As you can see, this is more of the same. The only difference is the second term, which is the dot product of all the output node input weights (wkj) from a hidden node and the sensitivities (δk) across all output nodes the hidden node is connected to.
  • netj is like as before--it's the dot product of the inputs xi with the inputs weights of the hidden nodes.
You'll note that from the perspective a single hidden node, the adjustment of its input weights depends on the set of inputs from the previous layer that is connected to it, and the set of sensitivities and the associated weights of the output layer from the next layer that the hidden node is connected to. netj is no exception since it is the dot product of xi and wji for all i. You can better see this in a picture. GIMP again.

I know we don't have 3 output nodes and 4 input nodes. It's just to illustrate that from the perspective of the hidden node, this would be the information it needs from the layers surrounding it. In the code base we've written so far, the weights are contained in the node it's connected to. So wji would belong to the hidden layer, and wkj would belong to the output layer. Therefore, the output layer would need to send both the sensitivity and the output layer input weights back to the hidden node.

This perspective is important, because Erlang follows an Actor model, where you model the problem as individual agents that pass messages back and forth to each other. We have now written how each individual node adjusts its weights, and that will help us in our coding.

This also means that as the current implemention is headed, I am assuming an asynchronous model of the neural network. Each perceptron will update when any of its inputs change. That means, like a digital circuit, there will be a minimum time that it takes for the output to reach a correct steady state and for the weight adjustments to propagate back. What this minimum time will be, will probably depend on the number of hidden layers. We'll see if it'll work. I have a hunch it should be ok, as long as the inputs are throttled to wait until the minimal time passes before feeding it a new set of inputs. It might result a lot of unnecessary messages, but if we can get away with it while keeping the code simple, I think it's probably worth it.

Whew. That all took a long time. Probably a good four or five hours. Well, I was hoping to be done by part IV when I started this, but it looks like there'll still probably one or two more installments to this series. Next time, we'll get to the code. I had intended to get to it this installment, but the code will make a lot more sense if you know what the math is saying about it.

In the meantime, I've gotta get to bed. It's like 2am.

Erlang and Neural Networks Part I
Erlang and Neural Networks Part II
Erlang and Neural Networks Part III

Friday, July 06, 2007

Expoential backoff, mofo!

I noticed that facebook doesn't check my blog's RSS feed at a regular interval. I hadn't blogged in a while on my other blog, and facebook stopped checking it. It was only just now that it imported 4 posts at once. It can only mean that they're doing one of two things.

1) They're so overloaded, that they import peoples' blog RSS when they get around to it, and can spare some cycles.

2) They've implemented something like Ethernet's exponential back off when contention happens on the wire.

To me, #2 makes sense. RSS is known for its bandwidth hogging nature, since readers keep pinging the feed. I wrote about adaptive polling before, and it shouldn't be too hard. Based on a recent history of when someone posts on their blog, you can make pretty good predictions as to when they're going to blog next. Therefore, you can check the blog's RSS feed based on that, rather than wasting bandwidth. An easy way is to use a Bayesian classifier to do this.

Thursday, July 05, 2007

Nerd time - Issue 6

Hey all,

Hope you had a good July 4th. It's more nerd time--bringing the curiosities of the net at your doorstep. This time is more techcrunchy stuff. So if you read that, you can skip it.

Video's taken off ever since Youtube, facilitated by the ability of flash to play video. Slap Vid is notable because it's the bittorrent of flash video clients. It's a P2P client for video. This is a ycombinator company.
http://www.slapvid.com

The idea to make a clickable world has been around for a while. The idea is to be able to print up URLs as 2D barcodes, so people with camera phones can take a picture of it, and it takes them to the URL. If you're old enough to remember the CueCat, you remember what a spectacular failure that was. But the market was different back then.
http://www.smartpox.com

This is an article on mapping. Google Maps is old news, but the implications of being able to overlay virtual information on top of the real world while you're in it is pretty exciting. The ability to create your own maps has been available for a while now, but discovery of those maps hasn't been easy. And there currently is no mobile earth browser, like there is for the desktop. This in conjunction with the clickable world is worth a thought. Smells like market opportunity to me. Mobile platforms aren't quite mature yet, but they're getting there.
http://www.wired.com/techbiz/it/magazine/15-07/ff_maps

This is a talk on the implications of OpenID. OpenID is a distributed authentication mechanism, aimed to eliminate the need for multiple logins for multiple websites. OpenID in combination with semantic technologies like microformats seems like a neat idea. It's gaining some momentum, as both Sun and AOL have implemented it.
http://video.google.com/videoplay?docid=2288395847791059857

And just for emacs fans:
http://robrohan.com/projects/9ne/
Emacs-like editor on the web!

has_many :jobs, :limit => 4


I saw this ad in my gmail, and found it pretty funny. Swivel is a startup that is trying to be the "YouTube of Data". What they're doing is pretty neat, esp if they can get the correlation of data down. I wasn't aware that they were using Rails, but perhaps they're only using it as a front end, and have some Java backend processing.

While we're at trivialities, you might as well read about "Mel the Programmer." Though Mel's a real person, this is what legends are made of.

More substance next time.

Monday, July 02, 2007

How do you get to work with Richard Feynman?

Long Now: Views: Essays

This is a nice article describing Thinking Machines and Richard Feynman, a physicist that worked on the bomb, and is known for his sense of humor. The article got me thinking...how would I get to work with Richard Feynman? This was what I came up with off the top of my head.

Going to MIT or Caltech would help. Not necessarily those actual places that matters, but being at a place that attracts passionate and smart like-minded individuals would help. That way, you can learn from them and expose yourself to different ideas. I've found that knowing about different far-flung things really help in your creativity and problem solving skills. For example, I never thought reading and learning about the election process was anything other than being a responsible citizen of a Republic. But with the rise of social media news sites, different election systems gives a good perspective on it.

The other point is probably to be working on an interesting problem. Nothing swarms geeks like an interesting problem. I think this is what managers and business fails to take into account. If you're going to get engineers to work on something, you have to cast it as an interesting problem. Don't tell engineers that they're going to work on airline reservation system. Tell them they're going to work on a distributed large-scale scheduling problem. (On the other hand, engineers need to learn to explain to their cocktail party cohorts that they work on airline reservations systems.)

In this day and age of the Internet, the bar is lower than before to get started on working on something important. You can learn all sorts of stuff to get started. But you still need to be in the right environment to help you along.