Friday 6 July 2018

Really REALLY simple Neural Network Reinforcement Learning

Well, it was always a project I never quite finished, and I found a way back in by taking some well-written basic AI scripts and mixing them together to make a Neural Q-learner. Recipe as follows:

Take one ANN. For this I used the following:

http://mnemstudio.org/neural-networks-backpropagation-xor.htm

Add one Q-learner from here:

http://mnemstudio.org/path-finding-q-learning-example-1.htm

and create a Neural Q-learner:

1. Add a second hidden layer to the ANN (giving two hidden layers in total).
2. Take the complete Bellman update formula from the Q-learner; it supplies the target Q-value:

Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

3. Create this learning loop (a code sketch follows the list):
i. Randomly choose a legal action (not a wall).
ii. Connect CurrentState and Action to the inputs of the ANN.
iii. Calculate the target Q using the formula above.
iv. Calculate the error between that target and the Q output by the ANN.
v. Update all the weights for each layer, starting with the last and ending with the first, feeding the error back in the same way as in the XOR problem.

4. Test the net
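
To make the recipe concrete, here is a minimal sketch of steps 1-3. It is not the original program: it assumes the 6-room reward matrix from the linked Q-learning example (goal = room 5), one-hot state and action inputs, and my own guesses for the learning rate, reward scaling, and number of training steps.

// A minimal sketch of steps 1-3 (not the original program): assumes the
// 6-room reward matrix from the linked Q-learning example (goal = room 5),
// one-hot state/action inputs, and made-up learning rate / reward scaling.
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>
#include <algorithm>

const int STATES = 6, ACTIONS = 6;
const int INPUTS = STATES + ACTIONS;  // one-hot state + one-hot action
const int HIDDEN = 8;                 // 8 neurons per hidden layer
const int GOAL = 5;
const double GAMMA = 0.8;             // the post reports 1.8 worked better
const double RATE = 0.2;              // assumed learning rate

// Reward matrix: -1 = wall (illegal move), 100 = reaching the goal.
const int R[STATES][ACTIONS] = {
    { -1, -1, -1, -1,  0, -1 }, { -1, -1, -1,  0, -1, 100 },
    { -1, -1, -1,  0, -1, -1 }, { -1,  0,  0, -1,  0, -1 },
    {  0, -1, -1,  0, -1, 100 }, { -1,  0, -1, -1,  0, 100 } };

double w1[HIDDEN][INPUTS + 1], w2[HIDDEN][HIDDEN + 1], w3[HIDDEN + 1];
double in[INPUTS], h1[HIDDEN], h2[HIDDEN];

double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }
double rnd() { return (double)rand() / RAND_MAX - 0.5; }

// Forward pass: clamp (state, action) onto the inputs, return the Q estimate.
double forward(int s, int a) {
    for (int i = 0; i < INPUTS; i++) in[i] = 0.0;
    in[s] = 1.0; in[STATES + a] = 1.0;
    for (int j = 0; j < HIDDEN; j++) {
        double sum = w1[j][INPUTS];   // bias weight
        for (int i = 0; i < INPUTS; i++) sum += w1[j][i] * in[i];
        h1[j] = sigmoid(sum);
    }
    for (int j = 0; j < HIDDEN; j++) {
        double sum = w2[j][HIDDEN];
        for (int i = 0; i < HIDDEN; i++) sum += w2[j][i] * h1[i];
        h2[j] = sigmoid(sum);
    }
    double sum = w3[HIDDEN];
    for (int i = 0; i < HIDDEN; i++) sum += w3[i] * h2[i];
    return sigmoid(sum);
}

// One backprop step toward `target`: last layer first, as in the XOR net.
void train(int s, int a, double target) {
    double out = forward(s, a);
    double dOut = (target - out) * out * (1.0 - out);
    double d1[HIDDEN], d2[HIDDEN];
    for (int j = 0; j < HIDDEN; j++)
        d2[j] = dOut * w3[j] * h2[j] * (1.0 - h2[j]);
    for (int j = 0; j < HIDDEN; j++) {
        double e = 0.0;
        for (int k = 0; k < HIDDEN; k++) e += d2[k] * w2[k][j];
        d1[j] = e * h1[j] * (1.0 - h1[j]);
    }
    for (int j = 0; j < HIDDEN; j++) w3[j] += RATE * dOut * h2[j];
    w3[HIDDEN] += RATE * dOut;
    for (int j = 0; j < HIDDEN; j++) {
        for (int i = 0; i < HIDDEN; i++) w2[j][i] += RATE * d2[j] * h1[i];
        w2[j][HIDDEN] += RATE * d2[j];
    }
    for (int j = 0; j < HIDDEN; j++) {
        for (int i = 0; i < INPUTS; i++) w1[j][i] += RATE * d1[j] * in[i];
        w1[j][INPUTS] += RATE * d1[j];
    }
}

int main() {
    srand((unsigned)time(0));
    for (int j = 0; j < HIDDEN; j++) {
        for (int i = 0; i <= INPUTS; i++) w1[j][i] = rnd();
        for (int i = 0; i <= HIDDEN; i++) w2[j][i] = rnd();
    }
    for (int i = 0; i <= HIDDEN; i++) w3[i] = rnd();
    for (int step = 0; step < 50000; step++) {
        int s = rand() % STATES, a;
        do { a = rand() % ACTIONS; } while (R[s][a] < 0);  // legal action only
        int s2 = a;                          // taking action a enters room a
        double target = R[s][a] / 100.0;     // rewards scaled towards [0,1]
        if (s2 != GOAL) {                    // bootstrap unless episode ends
            double best = 0.0;
            for (int a2 = 0; a2 < ACTIONS; a2++)
                if (R[s2][a2] >= 0) best = std::max(best, forward(s2, a2));
            target += GAMMA * best;
        }
        train(s, a, std::min(target, 1.0));  // sigmoid output tops out at 1
    }
    for (int a = 0; a < ACTIONS; a++)        // spot-check one state
        if (R[1][a] >= 0) printf("Q(1,%d)=%f\n", a, forward(1, a));
    return 0;
}

Scaling the rewards by 100 is a choice of this sketch: it keeps the targets inside the range a sigmoid output can actually reach, which also matches the sub-1.0 Q-values in the log below.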

I found it worked best with 8 neurons in each hidden layer, and with the Gamma of the Bellman equation set to 1.8 instead of 0.8.

Here is the net solving the state matrix of the Q-learning program:

First state=1,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=1
Found a winner
Q value=0.757337 Action=2
Final winner for state 1 =2
First state=2,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=2
Found a winner
Q value=0.757337 Action=3
Final winner for state 2 =3
First state=3,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=3
Found a winner
Q value=0.757337 Action=4
Final winner for state 3 =4
First state=4,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=4
Found a winner
Q value=0.757337 Action=5
Final winner for state 4 =5

First state=3,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=3
Found a winner
Q value=0.757337 Action=4
Final winner for state 3 =4
First state=4,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=4
Found a winner
Q value=0.757337 Action=5
Final winner for state 4 =5

First state=5,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=5
Final winner for state 5 =5

First state=2,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=2
Found a winner
Q value=0.757337 Action=3
Final winner for state 2 =3
First state=3,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=3
Found a winner
Q value=0.757337 Action=4
Final winner for state 3 =4
First state=4,
Found a winner
Q value=0.539621 Action=0
Found a winner
Q value=0.542437 Action=4
Found a winner
Q value=0.757337 Action=5
Final winner for state 4 =5

First state=4,
Found a winner

You can see the different Q-values the Neural Network outputs for each action. It makes mistakes and doesn't converge yet, but with more work it might!
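
For step 4, here's a guess at the shape of the test routine behind that log: start from a room, scan the actions, keep a running best, print a line each time the best improves, then follow the winning action to the next room until the goal. The qValue() stub below is hypothetical and just stands in for the trained net's forward pass.

#include <cstdio>

// Hypothetical stand-in for the trained net's forward pass; the real
// program would run the ANN on the one-hot (state, action) pair here.
double qValue(int state, int action) {
    (void)state;                     // this dummy ignores the state
    return 0.5 + 0.04 * action;     // placeholder numbers only
}

// Scan the actions for a state, announcing every improvement on the
// running best -- the shape that would produce the "Found a winner" lines.
int finalWinner(int state) {
    printf("First state=%d,\n", state);
    int best = 0;
    double bestQ = -1.0;
    for (int a = 0; a < 6; a++) {
        double q = qValue(state, a);
        if (q > bestQ) {
            bestQ = q;
            best = a;
            printf("Found a winner\nQ value=%f Action=%d\n", q, a);
        }
    }
    printf("Final winner for state %d =%d\n", state, best);
    return best;
}

int main() {
    int state = 1;                   // start the walk in room 1
    do {                             // follow the winning action from
        state = finalWinner(state);  // room to room until the goal
    } while (state != 5);
    return 0;
}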

If anyone wants the code for this one (based on the mnemstudio C++ original), please add a comment.