Reinforcement learning (RL) is centered on the principle of action followed by reward. Like most machine learning algorithms, it is an optimization technique, in this case one that maximizes reward.
$$R_t = \sum_{i=1}^{t} r_i$$
where $r_i$ is the reward at time i. Essentially, $R_t$ is the total reward accumulated up until time t.
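As a quick illustration of the total reward accumulated up until time t, the sketch below sums a made-up reward sequence. The reward values and the helper name `total_reward` are assumptions chosen for the example.

```python
from itertools import accumulate

# Hypothetical per-step rewards r_1..r_5 for one episode (illustrative values only).
rewards = [1.0, 0.0, -1.0, 2.0, 0.5]

# running_total[t - 1] holds r_1 + ... + r_t, the reward accumulated up to time t.
running_total = list(accumulate(rewards))

def total_reward(rewards, t):
    """Return the sum of the first t rewards (t is 1-indexed)."""
    return sum(rewards[:t])

print(running_total)
print(total_reward(rewards, 4))
```

The agent's goal is to choose actions that make this running total as large as possible.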
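This is what we optimize (maximize), i.e., we want to converge to the optimal action-value function $Q^*(s,a)$.
This can be done via Bellman’s equation as follows
$$Q^*(s,a) = r + \gamma \max_{a'} Q^*(s',a')$$
where s' is the next state, r is the reward for action a on state s, and $\gamma$ is the discount factor applied to future rewards.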
Q-Learning builds upon Bellman’s equation by introducing stochasticity into the outcome of the agent's actions, while optimality still hinges on maximizing the reward or, more specifically, the action-value. This is defined as follows.
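Denoting r as R(s,a), and given a probability P(s,a,s') of transitioning to state s', we get
$$Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{a'} Q(s',a')$$
where Q(s,a) is the Q-value (action-value) for action a on state s and $\gamma$ is the discount factor.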
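To make the roles of R(s,a), P(s,a,s') and Q(s,a) concrete, here is a minimal sketch that repeatedly applies the Bellman relation to a tiny, made-up environment until the Q-values stop changing. The two-state MDP, the discount factor `gamma`, and the function name `bellman_backup` are assumptions for illustration. The sketch also assumes R and P are known, which is usually not the case in practice.

```python
import numpy as np

def bellman_backup(Q, R, P, gamma):
    """One sweep of Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a')."""
    best_next = Q.max(axis=1)             # max over a' of Q(s', a'), shape (S,)
    return R + gamma * (P @ best_next)    # P has shape (S, A, S), result has shape (S, A)

# Hypothetical MDP with 2 states and 2 actions.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([
    [[1.0, 0.0], [0.2, 0.8]],   # from state 0: action 0 stays, action 1 mostly moves to state 1
    [[0.0, 1.0], [0.9, 0.1]],   # from state 1: action 0 stays, action 1 mostly moves to state 0
])
gamma = 0.9                      # assumed discount factor

Q = np.zeros_like(R)
for _ in range(500):             # enough sweeps for this small example to settle
    Q = bellman_backup(Q, R, P, gamma)

print(Q)
```

Once the table has settled, applying `bellman_backup` again leaves it unchanged, which is exactly what satisfying the Bellman relation means.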
In the RL environment, for a given set of starting states and actions, we want the table of state-action Q-values to converge to their corresponding optimal values.
So how do we train the agent so that the Q-values converge to, and then stay at, their optimal values?
Answer: Use the Temporal Difference (TD).
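Referring to the previously defined approximation of the Q-value, we can define TD as its difference from the current Q-value. TD represents how far off we are from the ideal or target value compared to an earlier stage.
$$TD(a,s) = R(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{a'} Q(s',a') - Q(s,a)$$
where $\gamma$ is the discount factor.
Given a learning rate $\alpha$, the Q-value updates are done as follows for time t
$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha \, TD_t(a,s)$$
Inserting the value for TD(a,s), evaluated on the Q-values from time t-1, we get
$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha \left( R(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{a'} Q_{t-1}(s',a') - Q_{t-1}(s,a) \right)$$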
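The update can be sketched in a few lines of NumPy. This version assumes the common sample-based form, in which a single observed transition (s, a, r, s') stands in for the transition probabilities. The discount factor `gamma`, the table size, and the function name `q_update` are assumptions for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one temporal-difference update to Q[s, a] in place and return the TD value."""
    td = r + gamma * Q[s_next].max() - Q[s, a]   # target value minus current Q-value
    Q[s, a] += alpha * td                        # move the estimate a fraction alpha toward the target
    return td

Q = np.zeros((3, 2))   # hypothetical table: 3 states, 2 actions
td = q_update(Q, s=0, a=1, r=1.0, s_next=2, alpha=0.5, gamma=0.9)

print(td)
print(Q)
```

Only the entry for the visited state-action pair changes in a single update. Repeating this over many transitions is what gradually pulls the whole table toward the optimal values.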
In Q-Learning, we need to store and update a Q-value for every state-action pair. This is practical only for a reasonably small set of discrete states, because the learning complexity is primarily determined by the size of the state space, i.e., the more states, the longer learning takes. This raises the question…
What happens if there are just too many states to calculate for?
Answer: Use deep learning to approximate the optimal Q-value.
Critically, Deep Q-Learning replaces the regular Q-table with a neural network. Rather than mapping a state-action pair to a Q-value, the neural network maps an input state to a Q-value for every possible action, i.e., a set of (action, Q-value) pairs.
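A minimal sketch of such a mapping is shown below, using a small fully connected network written in plain NumPy. The layer sizes, the random initialization, and the names `params` and `q_values` are assumptions for illustration, and the weights here are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 state features, 16 hidden units, 3 possible actions.
n_features, n_hidden, n_actions = 4, 16, 3
params = {
    "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(0.0, 0.1, (n_hidden, n_actions)),
    "b2": np.zeros(n_actions),
}

def q_values(params, state):
    """Forward pass: one state vector in, one Q-value per action out."""
    hidden = np.maximum(0.0, state @ params["W1"] + params["b1"])  # ReLU hidden layer
    return hidden @ params["W2"] + params["b2"]                    # linear output, shape (n_actions,)

state = np.array([0.1, -0.2, 0.05, 0.3])   # made-up state features
q = q_values(params, state)
greedy_action = int(np.argmax(q))           # action with the highest predicted Q-value

print(q, greedy_action)
```

A single forward pass yields the Q-values for all actions at once, so picking the best action does not require evaluating the network once per action.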
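Using the highest Q-value across all actions, we update the neural network weights by minimizing the mean squared loss over $TD$, thereby iteratively converging towards the optimal Q-values for all states. The loss is defined as follows.
$$L(\theta) = \mathbb{E}\left[\big(TD(a,s)\big)^2\right]$$
where $\theta$ denotes the network weights and the expectation is taken over the transitions used for the update.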
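The sketch below computes such a mean squared TD loss for a small batch of transitions. It assumes the sample-based TD form with a discount factor `gamma`. The function name `td_loss` and all numbers are made up for illustration. Only the loss value is computed here; in a real setup an autodiff framework would take its gradient with respect to the network weights.

```python
import numpy as np

def td_loss(q_pred, actions, rewards, q_next, gamma):
    """Mean squared TD value over a batch of transitions.

    q_pred: (B, A) predicted Q-values for the current states.
    q_next: (B, A) predicted Q-values for the next states.
    """
    batch = np.arange(len(actions))
    chosen = q_pred[batch, actions]                  # Q-value of the action actually taken
    target = rewards + gamma * q_next.max(axis=1)    # reward plus discounted highest next Q-value
    td = target - chosen
    return np.mean(td ** 2)

# Hypothetical batch of 2 transitions with 2 possible actions.
q_pred = np.array([[1.0, 2.0], [0.5, 0.0]])
q_next = np.array([[0.0, 1.0], [2.0, 3.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, 0.0])

loss = td_loss(q_pred, actions, rewards, q_next, gamma=0.5)
print(loss)
```

The loss is zero only when every predicted Q-value already equals its target, which is the fixed point the training loop is driving towards.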
Reinforcement learning that
RL with the same core principle of Q Learning, but
The deep convolutional neural network used to ingest the state as image input (tensors) and output Q-values for all possible actions on that state.
RL algorithms can be mainly divided into two categories – model-based and model-free.
A model-based agent, as the name suggests, tries to understand its environment and builds a model of it from its interactions with that environment. In such a system, preferences take priority over the consequences of actions, i.e., the greedy agent will always try to perform the action that yields the maximum reward, irrespective of what that action may cause.
On the other hand, model-free approaches such as Policy Gradient and Q-Learning seek to learn the consequences of their actions through experience. In other words, such an algorithm carries out an action multiple times and, based on the outcomes, adjusts its policy (the strategy behind its actions) for optimal rewards.