AForge.MachineLearning
Boltzmann distribution exploration policy.
The class implements an exploration policy based on the Boltzmann distribution.
According to the policy, action a at state s is selected with the following probability:
p( s, a ) = exp( Q( s, a ) / t ) / SUM over all actions b of exp( Q( s, b ) / t )
where Q(s, a) is the estimation (usefulness) of action a at state s and
t is the temperature parameter of the distribution.
Temperature parameter of Boltzmann distribution, >0.
The property sets the balance between exploration and greedy actions.
If the temperature is low, the policy tends to be more greedy.
Initializes a new instance of the class.
Temperature parameter of Boltzmann distribution.
Choose an action.
Action estimates.
Returns selected action.
The method chooses an action depending on the provided estimates. The
estimates can be any sort of estimate which values the usefulness of the action
(expected summary reward, discounted reward, etc.).
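As an illustration, here is a minimal usage sketch in C#. It assumes the AForge.MachineLearning.BoltzmannExploration class takes the temperature as a constructor argument and exposes a ChooseAction method taking the estimates array, as the summaries above suggest; exact member names and signatures should be verified against the actual assembly.

using System;
using AForge.MachineLearning;

class BoltzmannExplorationSample
{
    static void Main( )
    {
        // create the policy with temperature t = 0.5 (assumed constructor argument)
        BoltzmannExploration policy = new BoltzmannExploration( 0.5 );

        // usefulness estimates Q( s, a ) of each action in the current state
        double[] estimates = new double[] { 1.0, 2.5, 0.7 };

        // higher valued actions are selected more often, but lower valued
        // ones still get a chance proportional to exp( Q( s, a ) / t )
        int action = policy.ChooseAction( estimates );

        Console.WriteLine( "chosen action: " + action );
    }
}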
Epsilon greedy exploration policy.
The class implements the epsilon greedy exploration policy. According to the policy,
the best action is chosen with probability 1-epsilon. Otherwise,
with probability epsilon, any other action, except the best one, is
chosen randomly.
The epsilon value is also known as the exploration rate.
Epsilon value (exploration rate), [0, 1].
The value determines the amount of exploration driven by the policy.
If the value is high, the policy leans towards exploration, choosing a random
action which excludes the best one. If the value is low, the policy is more
greedy, choosing the best action found so far.
Initializes a new instance of the class.
Epsilon value (exploration rate).
Choose an action.
Action estimates.
Returns selected action.
The method chooses an action depending on the provided estimates. The
estimates can be any sort of estimate which values the usefulness of the action
(expected summary reward, discounted reward, etc.).
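A hedged usage sketch: the snippet below assumes the EpsilonGreedyExploration class takes the epsilon value as a constructor argument and exposes an Epsilon property, as described above, and shows the common pattern of decaying the exploration rate over time (the decay schedule itself is arbitrary and not part of the library).

using System;
using AForge.MachineLearning;

class EpsilonGreedySample
{
    static void Main( )
    {
        // start with a high exploration rate, so early choices are mostly random
        EpsilonGreedyExploration policy = new EpsilonGreedyExploration( 0.9 );

        double[] estimates = new double[] { 0.2, 0.8, 0.5 };

        for ( int episode = 0; episode < 100; episode++ )
        {
            int action = policy.ChooseAction( estimates );
            // ... take the action, observe the reward, update the estimates ...
            Console.WriteLine( "episode {0}: action {1}, epsilon {2:F2}", episode, action, policy.Epsilon );

            // decay epsilon towards a small floor, making the policy greedier over time
            policy.Epsilon = Math.Max( 0.05, policy.Epsilon * 0.95 );
        }
    }
}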
Exploration policy interface.
The interface describes exploration policies, which are used in Reinforcement
Learning to explore the state space.
Choose an action.
Action estimates.
Returns selected action.
The method chooses an action depending on the provided estimates. The
estimates can be any sort of estimate which values the usefulness of the action
(expected summary reward, discounted reward, etc.).
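To illustrate the contract, here is a sketch of a custom policy implementing the interface, assuming the interface declares a single ChooseAction member taking the action estimates and returning the selected action index, as the summaries above suggest. This purely greedy policy is hypothetical and not part of the library.

using AForge.MachineLearning;

// hypothetical purely greedy policy: always picks the action with the highest estimate
public class GreedyExploration : IExplorationPolicy
{
    public int ChooseAction( double[] actionEstimates )
    {
        int best = 0;

        for ( int i = 1; i < actionEstimates.Length; i++ )
        {
            if ( actionEstimates[i] > actionEstimates[best] )
            {
                best = i;
            }
        }
        return best;
    }
}

public class GreedyExplorationSample
{
    public static void Main( )
    {
        IExplorationPolicy policy = new GreedyExploration( );
        // prints 1, the index of the highest estimate
        System.Console.WriteLine( policy.ChooseAction( new double[] { 0.2, 0.9, 0.5 } ) );
    }
}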
Roulette wheel exploration policy.
The class implements the roulette wheel exploration policy. According to the policy,
action a at state s is selected with the following probability:
p( s, a ) = Q( s, a ) / SUM over all actions b of Q( s, b )
where Q(s, a) is the estimation (usefulness) of action a at state s.
The exploration policy may be applied only in cases when action estimates (usefulness)
are represented by positive values greater than 0.
Initializes a new instance of the class.
Choose an action.
Action estimates.
Returns selected action.
The method chooses an action depending on the provided estimates. The
estimates can be any sort of estimate which values the usefulness of the action
(expected summary reward, discounted reward, etc.).
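A small sketch of the proportional behaviour (assuming a parameterless RouletteWheelExploration constructor and the ChooseAction member described above): sampling many times should give selection frequencies roughly proportional to the positive estimates.

using System;
using AForge.MachineLearning;

class RouletteWheelSample
{
    static void Main( )
    {
        RouletteWheelExploration policy = new RouletteWheelExploration( );

        // estimates must all be positive for this policy
        double[] estimates = new double[] { 1.0, 3.0, 6.0 };
        int[] counts = new int[3];

        // selection frequencies should come out roughly as 10% / 30% / 60%
        for ( int i = 0; i < 10000; i++ )
        {
            counts[policy.ChooseAction( estimates )]++;
        }

        Console.WriteLine( "{0} {1} {2}", counts[0], counts[1], counts[2] );
    }
}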
Tabu search exploration policy.
The class implements a simple tabu search exploration policy,
which allows setting certain actions as tabu for a specified number of
iterations. The actual exploration and choosing among non-tabu actions
is done by the base exploration policy.
Base exploration policy.
The base exploration policy is the policy which is used
to choose among non-tabu actions.
Initializes a new instance of the class.
Total actions count.
Base exploration policy.
Choose an action.
Action estimates.
Returns selected action.
The method chooses an action depending on the provided estimates. The
estimates can be any sort of estimate which values the usefulness of the action
(expected summary reward, discounted reward, etc.). The action is chosen from
non-tabu actions only.
Reset tabu list.
Clears the tabu list, making all actions allowed.
Set tabu action.
Action to set tabu for.
Tabu time in iterations.
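A hedged sketch of the tabu mechanism, assuming the constructor takes the total actions count and the base policy (in that order) and that the members are named along the lines of SetTabuAction and ResetTabuList; exact signatures should be checked against the assembly.

using System;
using AForge.MachineLearning;

class TabuSearchSample
{
    static void Main( )
    {
        // wrap an epsilon greedy policy; 4 is the total number of actions
        TabuSearchExploration tabuPolicy = new TabuSearchExploration(
            4, new EpsilonGreedyExploration( 0.2 ) );

        // forbid action 2 for the next 10 iterations
        tabuPolicy.SetTabuAction( 2, 10 );

        double[] estimates = new double[] { 0.1, 0.4, 0.9, 0.3 };

        // action 2 is not returned while it stays in the tabu list,
        // even though it has the highest estimate
        int action = tabuPolicy.ChooseAction( estimates );

        // make all actions allowed again
        tabuPolicy.ResetTabuList( );

        Console.WriteLine( "chosen action: " + action );
    }
}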
QLearning learning algorithm.
The class provides an implementation of the Q-Learning algorithm, known as
off-policy Temporal Difference control.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Policy, which is used to select actions.
Learning rate, [0, 1].
The value determines how much the Q-function is updated at each learning step.
The greater the value, the larger the updates the function receives.
The lower the value, the smaller the updates it receives.
Discount factor, [0, 1].
Discount factor for the expected summary reward. The value serves as a
multiplier for the expected reward, so if the value is set to 1,
the expected summary reward is not discounted. As the value gets
smaller, a smaller portion of the expected reward is used for updating
actions' estimates.
Initializes a new instance of the class.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Action estimates are randomized when this constructor
is used.
Initializes a new instance of the class.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Randomize action estimates or not.
The randomize parameter specifies whether initial action estimates should be randomized
with small values or not. Randomization of action values may be useful when greedy exploration
policies are used. In this case randomization ensures that the same actions are not always chosen.
Get next action from the specified state.
Current state to get an action for.
Returns the action for the state.
The method returns an action according to the current
exploration policy.
Update Q-function's value for the previous state-action pair.
Previous state.
Action, which leads from previous to the next state.
Reward value, received by taking specified action from previous state.
Next state.
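A minimal learning-loop sketch, assuming the constructor and member names suggested by the summaries above (QLearning constructed from states, actions and exploration policy; GetAction; UpdateState; LearningRate and DiscountFactor properties). The tiny Step function stands in for a real environment and is purely hypothetical.

using AForge.MachineLearning;

class QLearningSample
{
    // hypothetical environment transition - not part of the library
    static int Step( int state, int action )
    {
        return ( state + action + 1 ) % 16;
    }

    static void Main( )
    {
        // 16 states, 4 actions, epsilon greedy exploration
        QLearning qLearning = new QLearning( 16, 4, new EpsilonGreedyExploration( 0.3 ) );

        qLearning.LearningRate   = 0.5;
        qLearning.DiscountFactor = 0.9;

        for ( int episode = 0; episode < 1000; episode++ )
        {
            int state = 0;

            for ( int step = 0; step < 100; step++ )
            {
                // select an action with the exploration policy
                int action = qLearning.GetAction( state );

                // act in the hypothetical environment; reward 1 for reaching state 15
                int    nextState = Step( state, action );
                double reward    = ( nextState == 15 ) ? 1.0 : 0.0;

                // off-policy TD update of Q( state, action )
                qLearning.UpdateState( state, action, reward, nextState );

                state = nextState;
            }
        }
    }
}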
Sarsa learning algorithm.
The class provides an implementation of the Sarsa algorithm, known as
on-policy Temporal Difference control.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Policy, which is used to select actions.
Learning rate, [0, 1].
The value determines how much the Q-function is updated at each learning step.
The greater the value, the larger the updates the function receives.
The lower the value, the smaller the updates it receives.
Discount factor, [0, 1].
Discount factor for the expected summary reward. The value serves as a
multiplier for the expected reward, so if the value is set to 1,
the expected summary reward is not discounted. As the value gets
smaller, a smaller portion of the expected reward is used for updating
actions' estimates.
Initializes a new instance of the class.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Action estimates are randomized when this constructor
is used.
Initializes a new instance of the class.
Amount of possible states.
Amount of possible actions.
Exploration policy.
Randomize action estimates or not.
The randomize parameter specifies whether initial action estimates should be randomized
with small values or not. Randomization of action values may be useful when greedy exploration
policies are used. In this case randomization ensures that the same actions are not always chosen.
Get next action from the specified state.
Current state to get an action for.
Returns the action for the state.
The method returns an action according to the current
exploration policy.
Update Q-function's value for the previous state-action pair.
Previous state.
Action, which leads from the previous to the next state.
Reward value, received by taking specified action from previous state.
Next state.
Next action.
Updates the Q-function's value for the previous state-action pair in
the case when the next state is non-terminal.
Update Q-function's value for the previous state-action pair.
Previous state.
Action, which leads from the previous to the next state.
Reward value, received by taking specified action from previous state.
Updates the Q-function's value for the previous state-action pair in
the case when the next state is terminal.
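A matching sketch for Sarsa, again under the assumption that the member names follow the summaries above (GetAction plus the two UpdateState variants for non-terminal and terminal next states); the Step and IsTerminal helpers are hypothetical environment stand-ins.

using AForge.MachineLearning;

class SarsaSample
{
    // hypothetical environment stand-ins - not part of the library
    static int Step( int state, int action )
    {
        return ( state + action + 1 ) % 16;
    }
    static bool IsTerminal( int state )
    {
        return ( state == 15 );
    }

    static void Main( )
    {
        Sarsa sarsa = new Sarsa( 16, 4, new EpsilonGreedyExploration( 0.3 ) );

        sarsa.LearningRate   = 0.5;
        sarsa.DiscountFactor = 0.9;

        for ( int episode = 0; episode < 1000; episode++ )
        {
            int state  = 0;
            int action = sarsa.GetAction( state );

            for ( int step = 0; step < 100; step++ )
            {
                int    nextState = Step( state, action );
                double reward    = IsTerminal( nextState ) ? 1.0 : 0.0;

                if ( IsTerminal( nextState ) )
                {
                    // terminal next state: the update variant without next state/action
                    sarsa.UpdateState( state, action, reward );
                    break;
                }

                // on-policy: the next action is chosen first and then used in the update
                int nextAction = sarsa.GetAction( nextState );
                sarsa.UpdateState( state, action, reward, nextState, nextAction );

                state  = nextState;
                action = nextAction;
            }
        }
    }
}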