| Title | Model-Free Reinforcement Learning |
| --- | --- |
| Description | Performs model-free reinforcement learning in R. This implementation enables the learning of an optimal policy based on sample sequences consisting of states, actions and rewards. In addition, it supplies multiple predefined reinforcement learning algorithms, such as experience replay. Methodological details can be found in Sutton and Barto (1998) <ISBN:0262039249>. |
| Authors | Nicolas Proellochs [aut, cre], Stefan Feuerriegel [aut] |
| Maintainer | Nicolas Proellochs <[email protected]> |
| License | MIT + file LICENSE |
| Version | 1.0.5 |
| Built | 2024-11-10 03:14:26 UTC |
| Source | https://github.com/nproellochs/reinforcementlearning |
Computes the reinforcement learning policy from a given state-action table Q. The policy is the decision-making function of the agent: it defines the agent's behavior at a given time.
computePolicy(x)
x | Variable which encodes the behavior of the agent. This can be either a matrix, data.frame, or an rl object. |
Returns the learned policy.
# Create exemplary state-action table (Q) with 2 actions and 3 states
Q <- data.frame("up" = c(-1, 0, 1), "down" = c(-1, 1, 0))

# Show best possible action in each state
computePolicy(Q)
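computePolicy() also accepts a fitted rl object. A brief sketch, assuming model is the object returned by ReinforcementLearning() further below:

# Derive the policy from a learned model of class rl
policy <- computePolicy(model)
policy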
ε-greedy action selection (deprecated). Please use [ReinforcementLearning::selectEpsilonGreedyAction()] instead.
epsilonGreedyActionSelection(Q, state, epsilon)
Q | State-action table of type hash. |
state | The current state. |
epsilon | Exploration rate between 0 and 1. |
Character value defining the next action.
Sutton and Barto (1998). "Reinforcement Learning: An Introduction", MIT Press, Cambridge, MA.
Deprecated. Please use [ReinforcementLearning::replayExperience()] instead.
experienceReplay(D, Q, control, ...)
D | A dataframe containing the input data for reinforcement learning. Each row represents a state transition tuple (s,a,r,s_new). |
Q | Existing state-action table of type hash. |
control | Control parameters defining the behavior of the agent. |
... | Additional parameters passed to function. |
Returns an object of class hash that contains the learned Q-table.
Lin (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching", Machine Learning (8:3), pp. 293–321.
Watkins (1992). "Q-learning". Machine Learning (8:3), pp. 279–292.
Function defines an environment for a 2x2 gridworld example. Here an agent is intended to navigate from an arbitrary starting position to a goal position. The grid is surrounded by a wall, which makes it impossible for the agent to move off the grid. In addition, the agent faces a wall between s1 and s4. If the agent reaches the goal position, it earns a reward of 10. Crossing each square of the grid results in a negative reward of -1.
gridworldEnvironment(state, action)
state | The current state. |
action | Action to be executed. |
List containing the next state and the reward.
# Load gridworld environment
gridworld <- gridworldEnvironment

# Define state and action
state <- "s1"
action <- "down"

# Observe next state and reward
gridworld(state, action)
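To inspect the full dynamics of the 2x2 grid, the environment function can be queried for every state-action pair. A short sketch, assuming the states s1-s4 and the actions used elsewhere in this documentation:

states <- c("s1", "s2", "s3", "s4")
actions <- c("up", "down", "left", "right")

# Print the next state and reward for every state-action pair
for (s in states) {
  for (a in actions) {
    cat(s, a, "->", unlist(gridworldEnvironment(s, a)), "\n")
  }
}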
Decides upon an action selection function. Input is a name for the action selection strategy, while output is the corresponding function object.
lookupActionSelection(type)
type | A string denoting the type of action selection. Allowed values are epsilon.greedy or random. |
Function that implements the specific action selection strategy.
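A minimal usage sketch:

# Resolve an action selection name to its function object
selectFun <- lookupActionSelection("random")
selectFun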
Decides upon a learning rule for reinforcement learning. Input is a name for the learning rule, while output is the corresponding function object.
lookupLearningRule(type)
type | A string denoting the learning rule. Allowed values are experienceReplay. |
Function that implements the specific learning rule.
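A minimal usage sketch (experienceReplay is the only learning rule available in the current package version, as noted for ReinforcementLearning() below):

# Resolve the learning rule name to its function object
learnFun <- lookupLearningRule("experienceReplay")
learnFun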
Deprecated. Please use [ReinforcementLearning::computePolicy()] instead.
policy(x)
x | Variable which encodes the behavior of the agent. This can be either a matrix, data.frame, or an rl object. |
Returns the learned policy.
Deprecated. Please use [ReinforcementLearning::selectRandomAction()] instead.
randomActionSelection(Q, state, epsilon)
Q | State-action table of type hash. |
state | The current state. |
epsilon | Exploration rate between 0 and 1 (not used). |
Character value defining the next action.
Performs model-free reinforcement learning. Requires input data in the form of sample sequences consisting of states, actions and rewards. The result of the learning process is a state-action table and an optimal policy that defines the best possible action in each state.
ReinforcementLearning(data, s = "s", a = "a", r = "r", s_new = "s_new",
  learningRule = "experienceReplay", iter = 1,
  control = list(alpha = 0.1, gamma = 0.1, epsilon = 0.1),
  verbose = F, model = NULL, ...)
data | A dataframe containing the input sequences for reinforcement learning. Each row represents a state transition tuple (s,a,r,s_new). |
s | A string defining the column name of the current state in data. |
a | A string defining the column name of the selected action for the current state in data. |
r | A string defining the column name of the reward in the current state in data. |
s_new | A string defining the column name of the next state in data. |
learningRule | A string defining the selected reinforcement learning agent. The default value and only option in the current package version is experienceReplay. |
iter | (optional) Iterations to be done. iter is an integer greater than 0. By default, iter is set to 1. |
control | (optional) Control parameters defining the behavior of the agent. Default: alpha = 0.1; gamma = 0.1; epsilon = 0.1. |
verbose | If true, progress report is shown. Default: false. |
model | (optional) Existing model of class rl. Default: NULL. |
... | Additional parameters passed to function. |
An object of class rl with the following components:
Q | Resulting state-action table. |
Q_hash | Resulting state-action table in hash format. |
Actions | Set of actions. |
States | Set of states. |
Policy | Resulting policy defining the best possible action in each state. |
RewardSequence | Rewards collected during each learning episode in iter. |
Reward | Total reward collected during the last learning iteration in iter. |
Sutton and Barto (1998). Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA.
# Sampling data (1000 grid sequences)
data <- sampleGridSequence(1000)

# Setting reinforcement learning parameters
control <- list(alpha = 0.1, gamma = 0.1, epsilon = 0.1)

# Performing reinforcement learning
model <- ReinforcementLearning(data, s = "State", a = "Action", r = "Reward",
                               s_new = "NextState", control = control)

# Printing model
print(model)

# Plotting learning curve
plot(model)
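Building on this example, the model argument allows continued learning from an existing agent. A sketch, assuming the gridworld setup from sampleExperience() below and that "epsilon.greedy" is the accepted value for epsilon-greedy sampling:

# Sample new experience using the learned policy (epsilon-greedy exploration)
data_new <- sampleExperience(N = 1000, env = gridworldEnvironment,
                             states = c("s1", "s2", "s3", "s4"),
                             actions = c("up", "down", "left", "right"),
                             actionSelection = "epsilon.greedy",
                             model = model, control = control)

# Update the existing agent with the new observations
model_new <- ReinforcementLearning(data_new, s = "State", a = "Action", r = "Reward",
                                   s_new = "NextState", control = control, model = model)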
Performs experience replay. Experience replay allows reinforcement learning agents to remember and reuse experiences from the past. The algorithm requires input data in the form of sample sequences consisting of states, actions and rewards. The result of the learning process is a state-action table Q that allows one to infer the best possible action in each state.
replayExperience(D, Q, control, ...)
D | A dataframe containing the input data for reinforcement learning. Each row represents a state transition tuple (s,a,r,s_new). |
Q | Existing state-action table of type hash. |
control | Control parameters defining the behavior of the agent. |
... | Additional parameters passed to function. |
Returns an object of class hash that contains the learned Q-table.
Lin (1992). "Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching", Machine Learning (8:3), pp. 293–321.
Watkins (1992). "Q-learning". Machine Learning (8:3), pp. 279–292.
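The package stores the Q-table in a hash object internally. As a conceptual illustration only (not the package's implementation), the update applied to replayed tuples can be sketched with a matrix-based Q-table and hypothetical column names s, a, r, s_new:

# Conceptual sketch of experience replay via repeated Q-learning updates
replay_sketch <- function(D, states, actions, alpha = 0.1, gamma = 0.1) {
  Q <- matrix(0, nrow = length(states), ncol = length(actions),
              dimnames = list(states, actions))
  for (i in seq_len(nrow(D))) {
    s <- as.character(D$s[i]); a <- as.character(D$a[i])
    r <- D$r[i]; s_new <- as.character(D$s_new[i])
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    Q[s, a] <- Q[s, a] + alpha * (r + gamma * max(Q[s_new, ]) - Q[s, a])
  }
  Q
}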
Function generates sample experience in the form of state transition tuples.
sampleExperience(N, env, states, actions, actionSelection = "random",
  control = list(alpha = 0.1, gamma = 0.1, epsilon = 0.1), model = NULL, ...)
N | Number of samples. |
env | An environment function. |
states | A character vector defining the environment states. |
actions | A character vector defining the available actions. |
actionSelection | (optional) Defines the action selection mode of the reinforcement learning agent. Default: random. |
control | (optional) Control parameters defining the behavior of the agent. Default: alpha = 0.1; gamma = 0.1; epsilon = 0.1. |
model | (optional) Existing model of class rl. Default: NULL. |
... | Additional parameters passed to function. |
A dataframe containing the experienced state transition tuples (s,a,r,s_new). The individual columns are as follows:
State | The current state. |
Action | The selected action for the current state. |
Reward | The reward in the current state. |
NextState | The next state. |
# Define environment
env <- gridworldEnvironment

# Define states and actions
states <- c("s1", "s2", "s3", "s4")
actions <- c("up", "down", "left", "right")

# Sample 1000 training examples
data <- sampleExperience(N = 1000, env = env, states = states, actions = actions)
Function uses the gridworld environment to generate sample experience in the form of state transition tuples.
sampleGridSequence(N, actionSelection = "random",
  control = list(alpha = 0.1, gamma = 0.1, epsilon = 0.1), model = NULL, ...)
N | Number of samples. |
actionSelection | (optional) Defines the action selection mode of the reinforcement learning agent. Default: random. |
control | (optional) Control parameters defining the behavior of the agent. Default: alpha = 0.1; gamma = 0.1; epsilon = 0.1. |
model | (optional) Existing model of class rl. Default: NULL. |
... | Additional parameters passed to function. |
A dataframe containing the experienced state transition tuples (s,a,r,s_new). The individual columns are as follows:
State | The current state. |
Action | The selected action for the current state. |
Reward | The reward in the current state. |
NextState | The next state. |
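A minimal usage example, mirroring the call in the ReinforcementLearning() example above:

# Sample 1000 state transition tuples from the 2x2 gridworld
data <- sampleGridSequence(N = 1000)
head(data)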
Implements ε-greedy action selection. In this strategy, the agent explores the environment by selecting an action at random with probability ε. Alternatively, the agent exploits its current knowledge by choosing the optimal action with probability 1 - ε.
selectEpsilonGreedyAction(Q, state, epsilon)
Q | State-action table of type hash. |
state | The current state. |
epsilon | Exploration rate between 0 and 1. |
Character value defining the next action.
Sutton and Barto (1998). "Reinforcement Learning: An Introduction", MIT Press, Cambridge, MA.
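As a conceptual illustration (independent of the package's hash-based Q-table), the strategy amounts to the following sketch over a named vector of action values:

# Conceptual sketch of epsilon-greedy selection
epsilon_greedy_sketch <- function(action_values, epsilon = 0.1) {
  if (runif(1) <= epsilon) {
    # Explore: choose uniformly at random among all actions
    sample(names(action_values), size = 1)
  } else {
    # Exploit: choose the action with the highest estimated value
    names(which.max(action_values))
  }
}

epsilon_greedy_sketch(c(up = 0.5, down = -0.2, left = 0.1, right = 0.0), epsilon = 0.1)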
Performs random action selection. In this strategy, the agent always selects an action at random.
selectRandomAction(Q, state, epsilon)
Q | State-action table of type hash. |
state | The current state. |
epsilon | Exploration rate between 0 and 1 (not used). |
Character value defining the next action.
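Conceptually, this reduces to a uniform draw over the available actions (a sketch, not the package internals):

# Uniform random choice among the actions known to the agent
sample(c("up", "down", "left", "right"), size = 1)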
Converts an object of any class to a reinforcement learning state of type character.
state(x, ...)
x | An object of any class. |
... | Additional parameters passed to function. |
Character value defining the state representation of the given object.
A dataset containing 406,541 game states of Tic-Tac-Toe. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. All states are observed from the perspective of player X, who is also assumed to have played first.
tictactoe
A data frame with 406,541 rows and 4 variables:
State | The current game state, i.e. the state of the 3x3 grid. |
Action | The move of player X in the current game state. |
NextState | The next observed state after action selection of players X and B. |
Reward | Indicates terminal and non-terminal game states. Reward is +1 for 'win', 0 for 'draw', and -1 for 'loss'. |
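For illustration, the dataset can be passed directly to ReinforcementLearning(). A sketch, assuming the column names State, Action, NextState and Reward listed above:

# Load the bundled dataset and learn a policy for player X
data("tictactoe")
model <- ReinforcementLearning(tictactoe, s = "State", a = "Action", r = "Reward",
                               s_new = "NextState", iter = 1)

# Inspect the learned policy (best move per game state)
head(computePolicy(model))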