Last time, we learned how the neural network learns. The most important part was the very last section, where we figured out what each perceptron needs to do to adjust its weights to learn.
(5) ∆wji = ɳ * [∑k=1 to c wkjδk] * f'(netj) * xi
To recap:
- Eta - The learning rate
- Sensitivity from the next layer
- Inputs from the previous layer
In order to achieve this, we'll have to refactor some of our code. Let's do so.
Vector map
map() is a function that applies the same operation to all elements in an array. I needed a function that is a map for two different lists of the same length. Vector and matrix addition relies on this:
[1,2,3] + [4,5,6] = [5,10,18]
I couldn't find one in the erlang APIs, so I wrote one myself. Ends up this code is used when adjusting weights.
% like map, but with two lists instead.
vector_map(Func, [], []) ->
[];
vector_map(Func, [Xh | Xt], [Yh | Yt]) ->
[Func(Xh, Yh) | vector_map(Func, Xt, Yt)].
This looks very much like the code for the previous dot_prod(). Let's keep things DRY and make dot_prot() use vector_map.
% Calculates the dot product of two lists
dot_prod(X, Y) ->
lists:foldl(fun(E, Sum) -> E + Sum end, 0,
vector_map(fun(Ex, Ey) -> Ex * Ey end, X, Y)).
So vector_map() takes the the nth element and multiplies them together. foldl() is very much like reduce() in MapReduce or inject() in ruby. It takes all the elements and sums it up.
f'(netj)
We need to calculate the feed forward, but with the derivative of the sigmoid function.
Recall that our feed_forward function looks like this:
sigmoid(X) ->
1 / (1 + math:exp(-X)).
feed_forward(Weights, Inputs) ->
sigmoid(dot_prod(Weights, Inputs)).
According to our calculations in the last part, we also need the derivative of the sigmoid function, but with the same stuff passed into it. So we could write another function like this:
sigmoid_deriv(X) ->
math:exp(-X) / (1 + math:exp(-2 * X)).
feed_forward_deriv(Weights, Inputs) ->
sigmoid_deriv(dot_prod(Weights, Inputs)).
This is a waste, since the two versions of feed forward are essentially doing the same thing. Here is where we can use first-class functions to get around it.
Languages that support first class functions are powerful because you can pass functions in and out of other functions as arguments. This is nice because you don't need a lot of structure to get things done--no messing around with anonymous inner classes and function pointers.
Instead, we'll write a new Sigmoid functions in perceptron and the feed_forward function outside of it.
perceptron(Weights, Inputs, [other variables]) ->
Sigmoid = fun(X) ->
1 / (1 + math:exp(-X))
end,
Sigmoid_deriv = fun(X) ->
math:exp(-X) / (1 + math:exp(-2 * X))
end,
receive
% perceptron messages...
end.
feed_forward(Func, Weights, Inputs) ->
Func(dot_prod(Weights, Inputs)).
So now if we want to feed forward with a particular non-linear activation function, we just tack it on as an argument!
feed_forward(Sigmoid, [1, 2], [2, 3]). % f(net)
feed_forward(Sigmoid_deriv, [1, 2], [2, 3]). % f'(net)
Some helper functions to refactor
Remember convert_to_list?
convert_to_list(Inputs) ->
lists:map(fun(Tup) ->
{_, Val} = Tup,
Val
end,
Inputs).
Let's change it to convert_to_value and add a convert_to_key function:
convert_to_values(Tuple_list) ->
lists:map(fun(Tup) ->
{_, Val} = Tup,
Val
end,
Tuple_list).
convert_to_keys(Tuple_list) ->
lists:map(fun(Tup) ->
{Key, _} = Tup,
Key
end,
Tuple_list).
Taking note of sensitivities
We also need to adjust our perceptron to keep the sensitivities from the next layer, in the same way we keep the outputs from the previous layer. That means that our perceptron will no longer keep a list of just Output_PIDs, but a list of tuples [{OutputPID_1, Sensitivity_1}, ... {OutputPID_n, Sensitivity_n}], called Sensitivities.
perceptron(Weights, Inputs, Sensitivities) ->
receive
{stimulate, Input} ->
New_inputs = replace_input(Inputs, Input),
Output_value = feed_forward(Sigmoid, Weights, convert_to_values(New_inputs)),
if Sensitivities =/= [] ->
% My output's connected to at least one perceptron:
lists:foreach(fun(Output_PID) ->
Output_PID ! {stimulate, {self(), Output_value}}
end,
convert_to_keys(Sensitivities));
Sensitivities =:= [] ->
% My output's connected to no one:
io:format("~w outputs: ~w~n", [self(), Output_value]),
% Call a trainer here instead and
self() ! {learn, {self(), 1}}
end,
perceptron(Weights, New_inputs, Sensitivities);
% other messages....
end.
We'll need to do this for all occurrences of Output_PID in other messages, such as pass, connect_to_output, and connect_to_input. When we need a list of Output_PIDs, we simply call convert_to_keys(Sensitivities).
The actual learning
Now that we have all that other stuff out of the way, we can actually write the learning message of a perceptron. The steps will be:
- Update the list of sensitivities from the previous layer
- Calculate the sensitivity for this node
- Adjust all the weights based on the sensitivity
- Propagate sensitivities and associated weight back to the previous layer
perceptron(Weights, Inputs, Sensitivities) ->
receive
{learn, Backprop} ->
Learning_rate = 0.5,
% Calculate the correct sensitivities
New_sensitivities = add_sensitivity(Sensitivities, Backprop),
Output_value = feed_forward(Sigmoid, Weights, convert_to_values(Inputs)),
Derv_value = feed_forward(Sigmoid_deriv, Weights, convert_to_values(Inputs)),
Sensitivity = calculate_sensitivity(Backprop, Inputs, New_sensitivities,
Output_value, Derv_value),
io:format("(~w) New Sensitivities: ~w~n", [self(), New_sensitivities]),
io:format("(~w) Calculated Sensitivity: ~w~n", [self(), Sensitivity]),
% Adjust all the weights
Weight_adjustments = lists:map(fun(Input) ->
Learning_rate * Sensitivity * Input
end,
convert_to_values(Inputs)),
New_weights = vector_map(fun(W, D) -> W + D end, Weights, Weight_adjustments),
io:format("(~w) Adjusted Weights: ~w~n", [self(), Weights]),
% propagate sensitivities and associated weights back to the previous layer
vector_map(fun(Weight, Input_PID) ->
Input_PID ! {learn, {self(), Sensitivity * Weight}}
end,
New_weights,
convert_to_keys(Inputs)),
perceptron(New_weights, Inputs, New_sensitivities);
% The other messages...
end.
Notice there's a couple helper functions in there to help with the sensitivities.
% adds the propagating sensitivity to the Sensitivities Hash
add_sensitivity(Sensitivities, Backprop) when Sensitivities =/= [] ->
replace_input(Sensitivities, Backprop);
add_sensitivity(Sensitivities, Backprop) when Sensitivities =:= [] ->
[].
% Calculates the sensitivity of this particular node
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =/= [], Inputs =:= [] -> % When the node is an input node:
null;
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =:= [], Inputs =/= [] -> % When the node is an output node:
{_, Training_value} = Backprop,
(Training_value - Output_value) * Derv_value;
calculate_sensitivity(Backprop, Inputs, Sensitivities, Output_value, Derv_value)
when Sensitivities =/= [], Inputs =/= [] -> % When the node is a hidden node:
Derv_value * lists:foldl(fun(E, T) -> E + T end, 0, convert_to_values(Sensitivities)).
Note that there are guards (the conditions that start with "when") on each of these functions. They are the different cases for how to calculate the sensitivity, given whether the node was an input node, an output node, or a hidden node.
Test Run
So as a little test, we can exercise it by writing a function called run(), and running it from the command line:
run() ->
X1_pid = spawn(ann, perceptron, [[],[],[]]),
X2_pid = spawn(ann, perceptron, [[],[],[]]),
H1_pid = spawn(ann, perceptron, [[],[],[]]),
H2_pid = spawn(ann, perceptron, [[],[],[]]),
O_pid = spawn(ann, perceptron, [[],[],[]]),
% Connect input node X1 to hidden nodes H1 and H2
ann:connect(X1_pid, H1_pid),
ann:connect(X1_pid, H2_pid),
% Connect input node X2 to hidden nodes H1 and H2
ann:connect(X2_pid, H1_pid),
ann:connect(X2_pid, H2_pid),
% Connect input node H1 and H2 to output node O
ann:connect(H1_pid, O_pid),
ann:connect(H2_pid, O_pid),
X1_pid ! {status},
X2_pid ! {status},
H1_pid ! {status},
H2_pid ! {status},
O_pid ! {status},
X1_pid ! {pass, 1.8},
X2_pid ! {pass, 1.3}.
233> ann:run().
Output of <0.841.0> connected to Input of <0.843.0>
Output of <0.841.0> connected to Input of <0.844.0>
Output of <0.842.0> connected to Input of <0.843.0>
Output of <0.842.0> connected to Input of <0.844.0>
Output of <0.843.0> connected to Input of <0.845.0>
Output of <0.844.0> connected to Input of <0.845.0>
Status of Node(<0.841.0>)
W: []
I: []
S: [{<0.844.0>,0},{<0.843.0>,0}]
Status of Node(<0.842.0>)
W: []
I: []
S: [{<0.844.0>,0},{<0.843.0>,0}]
Status of Node(<0.843.0>)
W: [0.891633,-0.112831]
I: [{<0.842.0>,0.446080},{<0.841.0>,-0.815398}]
S: [{<0.845.0>,0}]
Status of Node(<0.844.0>)
W: [0.891633,-0.112831]
I: [{<0.842.0>,0.446080},{<0.841.0>,-0.815398}]
S: [{<0.845.0>,0}]
Status of Node(<0.845.0>)
W: [0.891633,-0.112831]
I: [{<0.844.0>,0.446080},{<0.843.0>,-0.815398}]
S: []
{pass,1.30000}Stimulating <0.844.0> with 0.200000
Stimulating <0.844.0> with 0.300000
Stimulating <0.843.0> with 0.200000
Stimulating <0.843.0> with 0.300000
<0.845.0> outputs: 0.650328
Stimulating <0.844.0> with 1.80000
Stimulating <0.844.0> with 1.30000
<0.845.0> outputs: 0.643857
Stimulating <0.843.0> with 1.80000
Stimulating <0.843.0> with 1.30000
<0.845.0> outputs: 0.606653
<0.845.0> outputs: 0.607508
(<0.845.0>) New Sensitivities: []
(<0.845.0>) Calculated Sensitivity: 0.178902
(<0.845.0>) Adjusted Weights: [0.891633,-0.112831]
<0.845.0> outputs: 0.610857
(<0.844.0>) New Sensitivities: [{<0.845.0>,0.168491}]
(<0.843.0>) New Sensitivities: [{<0.845.0>,-1.12092e-2}]
<0.845.0> outputs: 0.655916
(<0.844.0>) Calculated Sensitivity: 5.64317e-2
(<0.843.0>) Calculated Sensitivity: -3.75421e-3
(<0.845.0>) New Sensitivities: []
(<0.844.0>) Adjusted Weights: [0.891633,-0.112831]
(<0.843.0>) Adjusted Weights: [0.891633,-0.112831]
(<0.845.0>) Calculated Sensitivity: 0.141549
(<0.842.0>) New Sensitivities: [{<0.844.0>,5.23863e-2},{<0.843.0>,0}]
(<0.841.0>) New Sensitivities: [{<0.844.0>,-3.50115e-3},{<0.843.0>,0}]
(<0.845.0>) Adjusted Weights: [0.941808,-6.26553e-2]
(<0.842.0>) Calculated Sensitivity: null
(<0.841.0>) Calculated Sensitivity: null
<0.845.0> outputs: 0.669378
(<0.844.0>) New Sensitivities: [{<0.845.0>,0.140548}]
(<0.843.0>) New Sensitivities: [{<0.845.0>,-3.24941e-3}]
(<0.842.0>) Adjusted Weights: []
(<0.841.0>) Adjusted Weights: []
<0.845.0> outputs: 0.668329
...(clipped)...
This is an asynchronous implementation of a feed-forward network, so there will be a lot of messages flying around back and forth. It is outputting the results of the network at the same time it's adjusting its weights. One will see a finer-grain adjustment of the outputs, and will have to wait until all stimulations have passed through the system to grab the output.
One can cut down on these messages if you implement something that keeps track of which inputs or sensitivities have been updated. Wait until all inputs have been updated to stimulate the next layer, and wait until all sensitivities have been updated to propagate back to the next layer.
The trainer
So now the neural network knows how to learn! All the hard part is done. However, there is one last part that I will leave as an exercise to the reader--that is to write a trainer for the network. Right now, in the stimulate message, it simply calls the learn method directly when an output node prints its output. It always uses "1" as the default training value. This is just a place holder. The learn method should be inside a trainer.
A trainer takes a list of examples, and feeds them to the network, one at a time, and for each valid output(remember the asynchronous network will output a series of values for any set of inputs), find the difference with the training value and send it to the learn message.
In Conclusion
I went over most of the hard part, so the last mile should be relatively easy. It's been a long trip to write these articles--longer than I had ever imagined when I started. So it is with all things. Hopefully, you have now a better understanding of neural networks, and realize that they're really not that magical. They're just a high-dimensional gradient descent. Let me know if you've found the series helpful. Maybe I'll try writing another set later on. Until then, I have other cool things to code. I'll let you know when I get it up and running.
I read through the concatenated version at trapexit.org. It's a good article, but you're probably confusing everyone by calling Erlang processes threads.
ReplyDeleteHrm, ok. I'll change the wording around. Thanks!
ReplyDelete