Q-learning
We will now look at a popular reinforcement learning algorithm called Q-learning. Q-learning is used to determine an optimal action selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, S; an action space, A; a set of immediate rewards, R; a transition probability for the next state, S(t+1), given the current state, S(t), and the current action, a(t), of the form P(S(t+1) | S(t), a(t)); and a discount factor, γ. The following diagram illustrates a Markov decision process, where the next state is dependent on the current state and any action taken in the current state:
*Figure: A Markov decision process; the next state depends on the current state and the action taken in it.*
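To make these ingredients concrete, here is a minimal sketch of a toy MDP written with plain Python dictionaries; the two states, two actions, transition probabilities, and rewards are invented purely for illustration and do not come from this chapter:

```python
import random

# Hypothetical two-state, two-action MDP, invented for illustration.
states = ['s0', 's1']
actions = ['left', 'right']
gamma = 0.9  # discount factor

# P[(s, a)] -> list of (next_state, probability, immediate_reward)
P = {
    ('s0', 'left'):  [('s0', 0.8, 0.0), ('s1', 0.2, 1.0)],
    ('s0', 'right'): [('s1', 0.9, 1.0), ('s0', 0.1, 0.0)],
    ('s1', 'left'):  [('s0', 1.0, 0.0)],
    ('s1', 'right'): [('s1', 1.0, 2.0)],
}

def step(state, action):
    """Sample the next state and the immediate reward from P(S(t+1) | S(t), a(t))."""
    outcomes = P[(state, action)]
    next_states, probs, rewards = zip(*outcomes)
    idx = random.choices(range(len(outcomes)), weights=probs)[0]
    return next_states[idx], rewards[idx]

print(step('s0', 'right'))  # for example, ('s1', 1.0)
```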
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
$$s_0, a_0, r_0,\; s_1, a_1, r_1,\; \ldots,\; s_t, a_t, r_t,\; \ldots,\; s_N, a_N, r_N$$
If we consider the long-term reward, R_t, at step t, it is equal to the sum of the immediate rewards at each step, from t until the end, as follows:
$$R_t = r_t + r_{t+1} + r_{t+2} + \cdots + r_N$$
Now, a Markov decision process is a stochastic process, and we cannot be sure of getting the same next state, S(t+1), from S(t) and a(t) every time; so, we apply a discount factor, γ, to future rewards. This means that the long-term reward can be better expressed as follows:
$$R_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots = r_t + \gamma\, R_{t+1}$$
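As a quick sanity check of the discounted sum, the following short snippet computes R_t for an assumed list of future rewards; the reward values and the choice of γ = 0.9 are arbitrary and used only for illustration:

```python
# Rewards r_t, r_{t+1}, ..., r_N observed from step t onward (made-up values).
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9  # discount factor

# R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
R_t = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(R_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```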
Since the immediate reward, r(t), is already realized at time step t, to maximize the long-term reward we need to maximize the long-term reward from time step t+1 onward (that is, R_{t+1}) by choosing an optimal action. The maximum long-term reward expected at a state S(t) by taking an action a(t) is represented by the following Q-function:
$$Q(s_t, a_t) = r_t + \gamma\, \mathbb{E}\!\left[\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right] \qquad (1)$$
At each state, s ∈ S, the agent in Q-learning tries to take an action, a*, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:
$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right]$$
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, Q(s(t), a(t)), of taking action a(t) in state s(t) depends on the immediate reward, r(t), and the maximum long-term reward that we can hope for in the new state, s(t+1). In a Markov decision process, the new state, s(t+1), depends stochastically on the current state, s(t), and the action taken, a(t), through a probability mass (or density) function of the form P(S(t+1) | S(t), a(t)).
The algorithm keeps on updating the expected long-term cumulative reward by taking a weighted average of the old expectation and the new long-term reward, based on the value of the learning rate, α.
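The update rule translates almost line for line into code. The following is a minimal tabular sketch on an invented toy environment; ToyChainEnv, its reset/step interface, and all the hyperparameter values are assumptions made purely for illustration, and the key line is the weighted-average update of Q[s, a]:

```python
import numpy as np

class ToyChainEnv:
    """Hypothetical 5-state chain (invented for illustration): action 1 moves right,
    action 0 moves left, and reaching the last state gives a reward of 1 and ends the episode."""
    n_states, n_actions = 5, 2

    def reset(self):
        return 0  # always start in state 0

    def step(self, s, a):
        s_next = min(s + 1, self.n_states - 1) if a == 1 else max(s - 1, 0)
        done = (s_next == self.n_states - 1)
        r = 1.0 if done else 0.0
        return s_next, r, done

alpha, gamma, epsilon = 0.1, 0.9, 0.1         # learning rate, discount factor, exploration rate
env = ToyChainEnv()
Q = np.zeros((env.n_states, env.n_actions))   # tabular Q(s, a), initialized to zero

for episode in range(500):
    s = env.reset()
    for _ in range(200):                      # cap the episode length
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = np.random.randint(env.n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(s, a)
        # Q-learning update: weighted average of the old estimate and the new target
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
        s = s_next
        if done:
            break

print(Q)  # Q-values should grow toward the right end of the chain
```

The epsilon-greedy step is there because a purely greedy agent with a zero-initialized table would rarely discover the reward at the far end of the chain.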
Once we have built the Q(s,a) function through the iterative algorithm, while playing the game we can, for a given state s, take the best action, a*, as the policy that maximizes the Q-function:
$$a^{*} = \arg\max_{a} Q(s, a)$$
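As a small, self-contained illustration of extracting this greedy policy from a learned table (the Q-values below are made up), the argmax is simply taken row by row:

```python
import numpy as np

# Made-up Q-table for 3 states and 2 actions, purely for illustration.
Q = np.array([[0.2, 0.8],
              [0.5, 0.1],
              [0.0, 0.3]])

# Greedy policy: for each state s, pick a* = argmax_a Q(s, a)
policy = np.argmax(Q, axis=1)
print(policy)  # [1 0 1]
```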