Monday, January 14, 2013

What is value -- accumulated reward or evidence? I

Friston, K., Adams, R., & Montague, R. (2012). What is value -- accumulated reward or evidence? Frontiers in Neurorobotics, 6(11).

This is a reformulation of reinforcement learning, not as maximization of integrated reward, but as minimization of the free energy (roughly, the prediction error) of a generative model. Optimal behavior is thus cast purely in terms of inference--policies are replaced by inferences.

Basic principles of self-organization: minimize surprise (maximize evidence) associated with sensory states, and minimize uncertainty about the inferred causes of input. Value then becomes log-evidence, or negative surprise.

A hierarchical perspective on optimization is implicit in this account of optimal behavior. In hierarchical Bayesian inference, optimization at one level is constrained by empirical priors from a higher level. Yuille and Kersten: the hierarchical aspect of inference emerges naturally from a separation of temporal scales.

Markovian formulations of value and optimal control.
Begin with Markov decision processes. The aim is to link "free energy minimization" (active inference) with optimal decision theory (reinforcement learning, roughly: optimizing policies). The two are joined by augmenting hidden states with future states, which entails a model of agency. From the text:


"The key distinction between optimal control and active infer-
ence is that in optimal control, action optimizes the expected cost
associated with the hidden states a system or agent visits. In con-
trast, active inference requires action to optimize the marginal
likelihood (Bayesian model evidence) of observed states, under
a generative model. This introduces a distinction between cost-
based optimal control and Bayes-optimal control that eschews
cost. The two approaches are easily reconciled by ensuring the
generative model embodies prior beliefs about state transitions
that minimize expected cost. Our purpose is therefore not to
propose an alternative implementation of optimal control but
accommodate optimal control within the larger framework of
active inference."

The setup for Markov decision processes is to maximize cumulative reward. Solutions can be divided into reinforcement learning schemes that compute the value function explicitly, and direct policy searches that find the optimal policy directly. Which works best depends on how complex the value function is and how large the state-space is (i.e., is it possible to visit all states?).
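To make "compute the value function explicitly" concrete, here is a minimal sketch of value iteration on a made-up 3-state, 2-action MDP (my toy example, not anything from the paper):

```python
# Toy 3-state, 2-action MDP (all numbers invented for illustration).
import numpy as np

gamma = 0.9
# P[a, s, s'] = transition probability, R[s, a] = immediate reward
P = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])
R = np.array([[0.0, 0.1],
              [0.0, 0.0],
              [1.0, 0.5]])

V = np.zeros(3)
for _ in range(200):
    # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    Q = R + gamma * np.einsum('asj,j->sa', P, V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)   # greedy policy with respect to the converged value function
print(V, policy)
```

A direct policy search would instead parameterize the policy and optimize it without ever representing V explicitly.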

We cannot always know the current state; the relevant extension of the MDP is the partially observed MDP (POMDP). Reinforcement learning cannot be performed directly on a POMDP, but a POMDP can be converted to an MDP over beliefs about the current state. Reward is then replaced by its expected value under the current belief state.
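A minimal sketch of that conversion (my own toy notation -- O for the observation likelihood, T for the transition model, r for reward; none of this is from the paper): the belief is updated by a Bayes filter, and reward becomes an expectation under the belief.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter: predict with the transition model T[a][s, s'],
    then weight by the observation likelihood O[s', o]."""
    predicted = T[a].T @ b            # p(s' | b, a)
    posterior = O[:, o] * predicted   # p(s' | b, a, o), unnormalized
    return posterior / posterior.sum()

def expected_reward(b, r):
    """In the belief MDP, reward r(s) is replaced by its expectation under the belief b."""
    return b @ r
```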

Initial approaches replaced reward with desired observations. The authors show that any optimal control problem can be formulated as a Bayesian inference problem within the active inference framework: action does not minimize cumulative cost, but maximizes the marginal likelihood of observations under a generative model that entails an optimal policy.

Active inference
"In active inference, action elicits observations that are the most plausible under beliefs about (future) states. This is in contrast to conventional formulations, in which actions are chosen to elicit (valuable) states."


Definition: Active inference rests on the tuple
(X, A, ϑ, P, Q, R, S) comprising:

  • A finite set of hidden states X
  • Real-valued hidden parameters ϑ ∈ ℝ^d
  • A finite set of sensory states S
  • A finite set of actions A
  • Real-valued internal states μ ∈ ℝ^d that parameterize a conditional density
  • A sampling probability R(s′ | s, a) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) that observation s′ ∈ S at time t + 1 follows action a ∈ A, given observation s ∈ S at time t
  • A generative probability P(s, x, ϑ | m)
  • A conditional probability Q(x, ϑ | μ)
(Wait, is mu part of the tuple? Did they just miss it?).
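To keep the pieces straight, here is the tuple written out as a container (my own structuring, not code from the paper); the probability terms are left as abstract callables:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class ActiveInferenceModel:
    hidden_states: Sequence    # X: finite set of hidden states (past and future)
    actions: Sequence          # A: finite set of actions
    sensory_states: Sequence   # S: finite set of sensory states
    sampling: Callable         # R(s' | s, a): how the next observation depends on action
    generative: Callable       # P(s, x, ϑ | m): generative probability over observations,
                               #   hidden states, and parameters ϑ
    conditional: Callable      # Q(x, ϑ | μ): conditional density parameterized by
                               #   internal states μ
```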

Three distinctions between this and an MDP:
1. The agent is equipped with a probabilistic mapping between actions and their direct sensory consequences (the sampling probability).
2. Hidden states include future and past states; that is, the agent represents a sequence or trajectory over states.
3. There are no reward or cost functions. Cost functions are replaced by priors over hidden states and transitions, such that costly states are surprising and are avoided by action.

The goal is to minimize the "free energy": internal states minimize the free energy of currently observed states, while action selects the next observation that, on average, has the smallest free energy. Free energy can be expressed as Gibbs energy (expected under the conditional distribution Q) minus the entropy of the conditional distribution. When free energy is minimized, the conditional distribution Q approximates the posterior distribution P. Under some simplifying assumptions this corresponds to predictive coding.
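For a discrete hidden state this decomposition is easy to write down. A minimal sketch (my own, not the paper's update scheme), assuming we have the log of the generative joint P(s, x) for the observed s:

```python
import numpy as np

def free_energy(q, log_joint):
    """F = E_q[-log P(s, x)] - H[q]  =  E_q[log q(x) - log P(s, x)].

    q         : approximate posterior over hidden states, shape (n_states,)
    log_joint : log P(s, x) for the observed s, shape (n_states,)
    """
    gibbs_energy = -(q @ log_joint)        # expected energy under q
    entropy = -(q @ np.log(q + 1e-12))     # entropy of q
    return gibbs_energy - entropy

# Minimizing F over q drives q toward the true posterior P(x | s); at the minimum,
# F equals -log P(s), i.e., surprise (negative log-evidence).
```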

By setting prior beliefs one can generalize the active inference model to any optimal control problem.
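This is the reconciliation from the quote above: fold the cost function into the generative model as a prior over hidden states, so that costly states become a priori improbable (surprising) and are therefore avoided by action. A minimal sketch of that conversion (my framing, not code from the paper):

```python
import numpy as np

def cost_to_prior(cost):
    """Turn a cost vector c(x) into prior beliefs p(x) proportional to exp(-c(x))."""
    p = np.exp(-np.asarray(cost, dtype=float))
    return p / p.sum()

print(cost_to_prior([0.0, 1.0, 5.0]))  # low-cost states get high prior probability
```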
