"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

February 10, 2019

Day #208 - OpenAI - Spinning Up in Deep RL Workshop - Part 1

Key Lessons
  • AGI - Artificial General Intelligence: systems that can do most economically valuable work
  • Deep Reinforcement Learning trains deep networks by trial and error
  • Function approximators - deep networks


Reinforcement Learning
  • Good for sequential decision problems
  • Good when we do not know the optimal behavior
  • RL is useful when evaluating behaviors is easier than generating them
Deep Learning
  • Good for High Dimensional Data
  • Approximate a function
Deep RL
  • Video games
  • For learning decision rules (policies)


Recap of DL Patterns
  • Finding a model that gives the right output for given inputs
  • Output of each layer is a re-arrangement of its input with a non-linearity applied
  • Loss function is differentiable with respect to the model parameters
  • Compute how the loss changes with respect to changes in the parameters (see the sketch after this list)
  • Function composition is the core of the model
  • Different architectures come from different function-composition topologies
  • Non-linearity does a lot of work
  • Successive layers represent more complex features
  • LSTM (RNN) - Accepts a time series of inputs and produces a time series of outputs
  • Transformer - Allows the network to attend over several inputs
  • Attention networks - Select the most meaningful details from the data; make decisions based on a lot of data
  • Regularizers - Trade off the loss against a term that is not task-dependent; they improve generalization
  • Adaptive optimizers (e.g., Adam)
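
A minimal sketch of these ideas (assuming PyTorch; not code from the workshop): a small composed model with a non-linearity, a loss that is differentiable in the parameters, gradients from backpropagation, and an adaptive optimizer taking the step.

import torch
import torch.nn as nn

# function composition with a non-linearity in between
model = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive optimizer

x = torch.randn(16, 4)           # a batch of inputs
y = torch.randint(0, 2, (16,))   # target labels for those inputs

loss = nn.CrossEntropyLoss()(model(x), y)   # scalar loss, differentiable in the parameters
optimizer.zero_grad()
loss.backward()    # how the loss changes with respect to every parameter
optimizer.step()   # adjust the parameters to reduce the loss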




Formulate RL Problem
  • An agent interacts with an environment
  • The agent picks and executes an action
  • The environment transitions to a new state and the agent proceeds from there
  • Which decisions maximize reward?
  • The goal is attained by trial and error (see the loop sketch below)
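
A minimal sketch of that interaction loop (assuming the OpenAI Gym API of the time; a random agent stands in for a real policy):

import gym

env = gym.make("CartPole-v0")
obs = env.reset()                 # initial observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # placeholder agent: random action
    obs, reward, done, info = env.step(action)  # environment returns the next observation and a reward
    total_reward += reward
print(total_reward)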

Observations And Actions
  • Observations are typically continuous (e.g., real-valued vectors or images)
  • Actions may be discrete or continuous (examples below)
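
For example, in Gym (an assumption; the notes do not name a library) the spaces make this explicit:

import gym

print(gym.make("CartPole-v0").observation_space)  # Box(4,)      - continuous observations
print(gym.make("CartPole-v0").action_space)       # Discrete(2)  - discrete actions
print(gym.make("Pendulum-v0").action_space)       # Box(1,)      - continuous actions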

Policy
  • Stochastic - actions are sampled randomly from a distribution
  • Deterministic - maps a state directly to an action with no randomness
  • Randomness is helpful (it lets the agent explore)
  • Logits - unnormalized scores, one per action
  • Action probabilities come from a softmax over the logits; the deterministic choice is the arg-max (sketch after this list)
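
A small sketch of a stochastic policy head (assuming PyTorch; not the workshop's exact code):

import torch

logits = torch.tensor([2.0, 0.5, -1.0])        # unnormalized scores, one per action
probs = torch.softmax(logits, dim=-1)          # softmax turns logits into action probabilities
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()                         # stochastic policy: sample an action
greedy_action = probs.argmax()                 # deterministic policy: always take the most likely action
log_prob = dist.log_prob(action)               # needed later for the policy gradient
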
Trajectory - Sequence of states and actions in an environment

Reward function - Measures how good or bad the agent's behavior is; higher is better


Value functions - How much reward the agent expects to get from a state onwards
Value functions satisfy the Bellman equation (written out below)
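
The Bellman equation referred to here, in standard notation (not copied from the slides):

V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s),\, s' \sim P(\cdot|s,a)}\big[\, r(s,a) + \gamma\, V^{\pi}(s') \,\big]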


Types of RL Algos
Model-free vs. model-based - whether the agent uses (or learns) a model of the environment


Try - Evaluate - Improve the policy
Policy Optimization


  • Run the policy to collect complete trajectories
  • Represent the policy with a neural network
Derive Policy Gradient
  • The parameters sit inside the distribution the expectation is taken over
  • Bring the gradient inside the expectation (the log-derivative trick, written out below)
  • The expectation is over trajectories
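
In standard notation, that step gives the basic policy gradient:

\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \,\big]
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \,\Big]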

  • The starting state is drawn from a start-state distribution
  • Markov property - the next state depends only on the current state and action, not on earlier states
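
Those two assumptions give the probability of a trajectory (standard notation):

P(\tau \mid \theta) = \rho_0(s_0)\, \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)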

  • Every action in the trajectory gets some update
  • Reward-to-go policy gradient - weight each action only by the rewards that come after it (sketch below)
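
A short sketch of the reward-to-go weighting (assuming PyTorch, simplified; not the workshop's exact code):

import torch

def reward_to_go(rewards, gamma=1.0):
    # rewards: per-step rewards from one trajectory
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # sum of rewards from step t onwards
        rtg[t] = running
    return torch.tensor(rtg)

# With log_probs = log pi_theta(a_t | s_t) collected while running the policy,
# the loss whose gradient is the estimator above would be:
#   loss = -(log_probs * reward_to_go(rewards)).mean()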


  • Advantage-function form of the policy gradient
  • How much better an action is than the policy's average behavior (formula below)
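
In standard notation, the advantage function is:

A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)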


  • N-Step Advantage Estimates
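
The n-step advantage estimate, in standard notation:

\hat{A}^{(n)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)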



  • Initial weight assignments do matter when setting up the system
Next is Part II


Happy Mastering DL!!!
