DeepRL Notes & Algorithms

Spinning Up Algorithms

RL Notes from Spinning Up Docs

Introduced terminology:

states and observations
action spaces
policies
trajectories
different formulations of return
the RL optimization problem
value function

States and observations: State is a complete description of the world that the agent lives in, and observation is an incomplete picture of the state. State is most often represented in vector form. "s" represents state and "o" represents observation in equations.

Action spaces: The action space is the set of actions an agent can take in a given state. The available actions may differ from state to state within a state space. Action spaces can be discrete or continuous.

Policies: A policy is the rule an agent uses to decide what to do. Policies can be deterministic (nothing is random) or stochastic (the action is sampled with some randomness). In equations, a deterministic policy is written $a_t = \mu(s_t)$ and a stochastic policy is written $a_t \sim \pi(\cdot \mid s_t)$. Deep RL uses parameterized policies, whose outputs depend on a set of parameters such as the weights of a NN; the parameters are represented by $\theta$ or $\phi$.

Among stochastic policies in Deep RL, categorical policies (for discrete action spaces) and diagonal Gaussian policies (for continuous action spaces) are the most used.
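
As a minimal sketch of how the two kinds of stochastic policy pick an action, the snippet below samples from a categorical policy and from a diagonal Gaussian policy using only the standard library. The logits, means, and log standard deviations are made-up numbers; in Deep RL they would be produced by a neural network with parameters $\theta$. The function names are illustrative, not taken from any library.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(logits, rng):
    """Sample a discrete action index from a categorical policy."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_diagonal_gaussian(mean, log_std, rng):
    """Sample a continuous action; each dimension is an independent Gaussian."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical network outputs for one state.
discrete_action = sample_categorical([2.0, 0.5, -1.0], rng)
continuous_action = sample_diagonal_gaussian([0.0, 1.0], [-0.5, -0.5], rng)
```

Using log standard deviations rather than standard deviations is a common choice because the log can take any real value, while the standard deviation itself must stay positive.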

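Trajectories: A trajectory is a sequence of states and actions in the world: $\tau = (s_0, a_0, s_1, a_1, \ldots)$. State transitions can be deterministic: $s_{t+1} = f(s_t, a_t)$, or stochastic: $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.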

Trajectories may also be referred to as episodes.
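
To make the idea of a trajectory concrete, here is a small sketch that rolls out a stochastic policy in a stochastic toy environment and records the visited states and chosen actions. The environment (a walk along a line with a 10% chance of slipping) and the policy are hypothetical and exist only for illustration.

```python
import random

def step(state, action, rng):
    """Hypothetical stochastic transition: move by `action`, or slip and stay put 10% of the time."""
    if rng.random() < 0.1:
        return state
    return state + action

def policy(state, rng):
    """Hypothetical stochastic policy: step right (+1) 80% of the time, otherwise left (-1)."""
    return 1 if rng.random() < 0.8 else -1

def rollout(start_state, horizon, rng):
    """Collect one trajectory as a list of states and a list of actions."""
    states, actions = [start_state], []
    for _ in range(horizon):
        action = policy(states[-1], rng)
        actions.append(action)
        states.append(step(states[-1], action, rng))
    return states, actions

states, actions = rollout(start_state=0, horizon=5, rng=random.Random(0))
```

A rollout of `horizon` steps yields `horizon` actions and `horizon + 1` states, since the final state is recorded after the last action.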

Different formulations of return: The reward function is very important. It is denoted by "R" and depends on the current state, the action taken, and the next state resulting from that action: $r_t = R(s_t, a_t, s_{t+1})$. The goal of the agent is to maximize the cumulative reward over the trajectory, unlike a greedy algorithm, which would only try to maximize the reward at the current step.

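There is the finite-horizon undiscounted return (the sum of rewards in a fixed window of steps): $R(\tau) = \sum_{t=0}^{T} r_t$, and the infinite-horizon discounted return (the sum of all rewards, each discounted by how far in the future it is obtained): $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, with discount factor $\gamma \in (0, 1)$.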
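
The two formulations of return can be sketched in a few lines. The reward list below is made up, and because a program cannot sum infinitely many terms, the discounted version here is applied to a finite list of rewards, which is an assumption of this example.

```python
def finite_horizon_return(rewards):
    """Undiscounted return: the plain sum of rewards over a fixed window."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return: the reward at step t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical rewards r_0 .. r_3

undiscounted = finite_horizon_return(rewards)           # 1 + 0 + 2 + 1 = 4.0
discounted = discounted_return(rewards, gamma=0.5)      # 1 + 0 + 0.5 + 0.125 = 1.625
```

With a discount factor of 0.5, the reward of 2.0 at step 2 only contributes 0.5 to the return, which shows how discounting shrinks the influence of later rewards.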

The RL optimization problem: The goal in RL is to select the policy that maximizes expected return. Writing the expected return of a policy as $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$, the optimal policy is $\pi^* = \arg\max_{\pi} J(\pi)$.
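
For reference, the optimization problem can be written out in full when both the policy and the environment transitions are stochastic. The symbol $\rho_0$ for the start-state distribution is an added assumption here; the other symbols are the trajectory $\tau$, the return $R(\tau)$, and the policy $\pi$.

```latex
\begin{align*}
P(\tau \mid \pi) &= \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t) \\
J(\pi) &= \int_{\tau} P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right] \\
\pi^{*} &= \arg\max_{\pi} J(\pi)
\end{align*}
```

The first line is the probability of a $T$-step trajectory under the policy, the second is the expected return, and the third says the optimal policy is the one that maximizes it.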

Value function: A value function gives you the value of a state or state-action pair, meaning the expected return if you start there and then act according to a particular policy, like showing how good having a flush is in poker. There are four main value functions:

On-Policy Value Function: always act according to policy
On-Policy Action-Value Function: take an arbitrary action then always act according to policy
Optimal Value Function: always act according to the optimal policy
Optimal Action-Value Function: take an arbitrary action then always act according to optimal policy
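
The four value functions in the list above can be written as conditional expectations of the return $R(\tau)$, where $s_0$ and $a_0$ are the first state and action of the trajectory $\tau$:

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right] \\
V^{*}(s) &= \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right] \\
Q^{*}(s, a) &= \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s,\ a_0 = a \right]
\end{align*}
```

Two relations follow directly from these definitions: $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right]$ and $V^{*}(s) = \max_{a} Q^{*}(s, a)$.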

Other important tidbits of info

Model-free vs. model-based: a model of an environment is a function which predicts state transitions and rewards. The algorithms covered here will be mostly model-free.

Udacity DeepRL Notes

Episodic vs. Continuing tasks: Episodic tasks have a well-defined ending point, like a game of chess, and continuing tasks do not, like the lifespan of a puppy robot perhaps.

The reward hypothesis: All goals can be framed as the maximization of expected cumulative rewards.

Reward equations can be very important and interesting, prioritizing certain actions over time in a mathematical way. Take a robot that you want to teach how to walk from A to B as fast as possible. The reward equation might be a multi-term equation with a term for the robot's velocity, a term for its deviation from the forward direction, and a constant for each time step it does not fall over. Each term adds to or subtracts from the total reward given to the robot in that time frame.
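
A multi-term reward like the walking example might be sketched as follows. The weights, the argument names, and the choice to grant the constant bonus only on steps where the robot has not fallen are all assumptions made for illustration, not values from any real environment.

```python
def walking_reward(forward_velocity, heading_deviation, has_fallen,
                   w_velocity=1.0, w_deviation=0.5, alive_bonus=0.1):
    """Hypothetical per-step reward for a walking robot."""
    reward = w_velocity * forward_velocity          # reward moving forward quickly
    reward -= w_deviation * abs(heading_deviation)  # penalize drifting off the forward direction
    if not has_fallen:
        reward += alive_bonus                       # constant bonus for staying upright this step
    return reward

# A fast, straight, upright step versus a slow, crooked step that ends in a fall.
good_step = walking_reward(forward_velocity=1.2, heading_deviation=0.1, has_fallen=False)
bad_step = walking_reward(forward_velocity=0.2, heading_deviation=0.8, has_fallen=True)
```

Changing the weights changes what the robot prioritizes: a larger `w_deviation` favors walking straight over walking fast, for example.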

Cumulative reward: It is important to give the robot the objective of maximizing cumulative reward and not just the nearest reward, for that would be a greedy algorithm, which often performs poorly. We call this cumulative reward the return.

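Discounted return: The discounted return makes future expected rewards less significant than immediate ones, by weighting each reward with a discount rate that shrinks the further in the future the reward arrives.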

Markov Decision Process: The MDP is very important! And useful. We define reinforcement learning problems as MDPs so they are in a format that we can apply our algorithms to. An MDP is defined as a set of states, actions, and rewards, together with one-step dynamics and a discount rate.
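
One way to hold the pieces of an MDP in code is a small container like the sketch below. The `MDP` class, its field names, and the two-state battery example are hypothetical; the one-step dynamics are stored as a table from (state, action) to a list of (probability, next state, reward) outcomes.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: tuple
    actions: tuple
    dynamics: dict  # dynamics[(s, a)] -> list of (probability, next_state, reward)
    gamma: float    # discount rate

def check_dynamics(mdp):
    """Return True if the outcome probabilities of every (state, action) pair sum to 1."""
    for s in mdp.states:
        for a in mdp.actions:
            total = sum(p for p, _, _ in mdp.dynamics[(s, a)])
            if abs(total - 1.0) > 1e-9:
                return False
    return True

# Hypothetical robot whose battery level is either "high" or "low".
mdp = MDP(
    states=("high", "low"),
    actions=("search", "wait"),
    dynamics={
        ("high", "search"): [(0.7, "high", 4.0), (0.3, "low", 4.0)],
        ("high", "wait"): [(1.0, "high", 1.0)],
        ("low", "search"): [(0.8, "low", 4.0), (0.2, "high", -3.0)],
        ("low", "wait"): [(1.0, "low", 1.0)],
    },
    gamma=0.9,
)
valid = check_dynamics(mdp)
```

"One-step" means the outcomes depend only on the current state and action, not on anything earlier in the trajectory.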

Policy: A policy takes in the current state and produces an output, which is an action. Hopefully a good one. Always choosing one definite action for a state is deterministic, and choosing an action with some randomness is stochastic.

State-value function: The state-value function yields the expected return if the agent starts in a specific state and follows the policy for all subsequent time steps.

Bellman equations: The Bellman equations can be simplified down to this: the state-value of a certain state can be obtained recursively as the expected immediate reward from that state plus the discounted state-value of the next state.
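
The recursion can be turned into a simple procedure, often called iterative policy evaluation: start with all state-values at zero and repeatedly replace each value with the expected reward plus the discounted value of the next state. The sketch below assumes a hypothetical two-state chain under a fixed policy, where A always moves to B with reward 1 and B always moves back to A with reward 0.

```python
def evaluate_policy(transitions, gamma, sweeps=200):
    """Iterative policy evaluation using the Bellman equation.

    transitions[s] is a list of (probability, next_state, reward) outcomes
    under the fixed policy being evaluated.
    """
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        # Build the new values from the old ones (synchronous update).
        values = {
            s: sum(p * (r + gamma * values[s_next]) for p, s_next, r in outcomes)
            for s, outcomes in transitions.items()
        }
    return values

# Hypothetical two-state chain under a fixed policy.
transitions = {
    "A": [(1.0, "B", 1.0)],
    "B": [(1.0, "A", 0.0)],
}
values = evaluate_policy(transitions, gamma=0.5)
```

Solving the two equations by hand, V(A) = 1 + 0.5 V(B) and V(B) = 0.5 V(A), gives V(A) = 4/3 and V(B) = 2/3, which is what the iteration converges to.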
