1. What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative reward.
In RL, an agent:
- Perceives the environment’s state
- Takes actions
- Receives feedback (reward or penalty)
- Learns a policy that maximizes long-term reward
It’s trial-and-error learning, where the agent explores and improves over time.
2. Key Components of RL
| Component | Description |
|---|---|
| Agent | Learner or decision maker |
| Environment | External system the agent interacts with |
| State (S) | Current situation of the environment |
| Action (A) | Choices the agent can make |
| Reward (R) | Feedback signal (positive or negative) |
| Policy (π) | Strategy mapping states to actions |
| Value Function (V) | Expected long-term reward from a state |
| Q-Function (Q) | Expected long-term reward of taking action a in state s |
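These components can be sketched as a tiny hypothetical environment and policy (the two-state setup, the `reset`/`step` method names, and the reward scheme are all assumptions made up for this illustration):

```python
class TwoStateEnv:
    """Hypothetical environment: two states (0 and 1), two actions.

    The action directly chooses the next state; being in state 1 pays +1.
    """

    def reset(self):
        self.state = 0                       # State (S): current situation
        return self.state

    def step(self, action):                  # Action (A): 0 or 1
        self.state = action                  # transition chosen by the action
        reward = 1.0 if self.state == 1 else 0.0  # Reward (R): feedback signal
        return self.state, reward


def policy(state):                           # Policy (π): maps state -> action
    return 1                                 # always head for the rewarding state


env = TwoStateEnv()
s = env.reset()
s, r = env.step(policy(s))
print(s, r)  # 1 1.0
```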
3. The RL Cycle
- The agent observes the state
- It selects an action based on its policy
- The environment returns a new state and a reward
- The agent updates its policy based on the experience
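The cycle above can be sketched as a generic interaction loop. The `reset`/`step`/`act`/`update` names are assumptions for the sketch (loosely following common RL library conventions, not any specific API), and the stub environment and agent exist only so the loop runs:

```python
class CoinEnv:
    """Hypothetical one-step environment: action 1 earns reward 1, then the episode ends."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, float(action == 1), True   # next_state, reward, done


class AlwaysOneAgent:
    """Trivial agent with a fixed policy; update is a no-op placeholder."""
    def act(self, state):
        return 1
    def update(self, state, action, reward, next_state):
        pass                                 # a learning agent would adjust its policy here


def run_episode(env, agent, max_steps=100):
    """One pass of the RL cycle: observe, act, receive feedback, update."""
    state = env.reset()                                   # 1. observe the state
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                         # 2. select action from policy
        next_state, reward, done = env.step(action)       # 3. env returns new state + reward
        agent.update(state, action, reward, next_state)   # 4. update from experience
        total += reward
        state = next_state
        if done:
            break
    return total


print(run_episode(CoinEnv(), AlwaysOneAgent()))  # 1.0
```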
4. Types of Reinforcement Learning
(a) Model-Free RL

- Learns directly from experience
- No knowledge of environment dynamics

Examples:

- Q-Learning
- SARSA

(b) Model-Based RL

- Learns a model of the environment (transition and reward functions)
- Uses the model to plan
5. Popular Algorithms
Q-Learning (Off-policy)

- Learns Q-values with the update Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
- Doesn’t follow the same policy it updates (it acts exploratorily but learns about the greedy policy)

SARSA (On-policy)

- Learns Q-values while following the current policy (the update uses the action actually taken next)

Deep Q-Networks (DQN)

- Combines Q-learning with neural networks
- Handles high-dimensional state spaces (like games)
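As an illustration, here is tabular Q-learning on a small chain environment. The environment (5 states in a row, +1 for reaching the rightmost one) is made up for this sketch; the update line itself is the standard Q-learning rule:

```python
import random

# Hypothetical chain environment: states 0..4; action 1 moves right,
# action 0 moves left. Reaching state 4 pays +1 and ends the episode.
N_STATES, GOAL = 5, 4

def env_step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

alpha, gamma, epsilon = 0.5, 0.9, 0.1        # learning rate, discount, exploration rate
Q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q[state][action]

random.seed(0)
for _ in range(500):                         # training episodes
    s, done = 0, False
    while not done:
        # ε-greedy behaviour policy
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = env_step(s, a)
        # Off-policy update: bootstrap from the max over next actions,
        # regardless of which action the behaviour policy takes next
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(q), 2) for q in Q])
```

The learned values decay by a factor of γ per step away from the goal, which is exactly the long-term-planning effect the discount factor is meant to produce.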
6. Exploration vs Exploitation
- Exploration: Try new actions to discover better rewards
- Exploitation: Use known actions that give high reward
A balance is needed (e.g., using ε-greedy policy).
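An ε-greedy choice takes only a few lines; the `q_values` list here is a hypothetical set of action values for one state:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability ε explore (random action); otherwise exploit (best-valued action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

random.seed(0)
picks = [epsilon_greedy([0.1, 0.5, 0.2]) for _ in range(1000)]
print(picks.count(1) / len(picks))  # the greedy action (index 1) dominates
```

With ε = 0.1 the greedy action is chosen roughly 90% of the time, while the remaining probability keeps every action being tried occasionally.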
7. Reward Signal
The reward is crucial in shaping agent behavior:
- Immediate reward teaches short-term behavior
- Discounted future rewards (via discount factor γ) encourage long-term planning
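The discounted return G = r₀ + γ·r₁ + γ²·r₂ + … can be computed with a simple backward pass (the reward sequence is made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r0 + γ·r1 + γ²·r2 + ..., computed backwards via G_t = r_t + γ·G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 arriving two steps in the future is worth γ² = 0.81 now
print(round(discounted_return([0.0, 0.0, 1.0]), 2))  # 0.81
```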
8. Applications of RL
- Game Playing: AlphaGo, Dota bots
- Robotics: Learning to walk, grip objects
- Recommendation Systems
- Autonomous Vehicles
- Industrial Control Systems
9. Challenges in RL
- Delayed rewards: action effects are seen only after many steps
- Exploration complexity: large action/state spaces
- Sample inefficiency: requires lots of interaction data
- Stability and convergence in training (especially deep RL)
Summary
| Feature | Reinforcement Learning |
|---|---|
| Learning Style | Trial-and-error |
| Feedback Type | Scalar reward signal |
| Goal | Maximize total reward |
| Best For | Sequential decision-making |
| Core Idea | Learn from interaction with environment |
Reinforcement learning is like teaching a dog tricks by giving it treats: it improves by trying, failing, and adjusting its behavior to earn more rewards over time.