Lecture Notes in Computer Science
Estimating Internal Variables of a Decision Maker’s Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima¹ and Kenji Doya²
¹ Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan, samejima@lab.tamagawa.ac.jp
² Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan, doya@oist.jp

Abstract. A major problem in the search for the neural substrates of learning and decision making is that the process is highly stochastic and subject dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and learning parameters) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for these variables by particle filtering, a recent method for dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.

1 Introduction

The traditional approach in neuroscience to discovering information-processing mechanisms is to correlate neuronal activity with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory, and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive and learning processes have become available that make quantitative predictions of a subject’s behavioral responses. A possible new approach is therefore to look for neural activity that correlates with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change over time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model, or to select the best model among a number of candidates.
The framework of Bayesian inference can provide coherent solutions to the issues of estimating hidden variables, including meta-parameters, from observable experimental data and of selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and to human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).

2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can serve as a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule for taking an action a_t at state s_t, so that the resulting reward
is maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as a value function for each state, and then to improve the policy based on that value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function

Q(s_t, a_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | s_t, a_t ]   (1)

which estimates the cumulative future reward when action a_t is taken at state s_t. The discount factor 0 ≤ γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action values, e.g., according to the Boltzmann distribution
π(a | s_t) = exp(β Q(s_t, a)) / Σ_{a'∈A} exp(β Q(s_t, a'))   (2)

where the inverse temperature β > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as

δ_t = r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t),
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α δ_t   (3)

where α > 0 is the meta-parameter for the learning rate. For such a reinforcement learning agent, we thus have three meta-parameters. A reinforcement learning model of behavior learning not only predicts the subject's actions, but can also provide candidates for the brain's internal processes of decision making, which may be captured in neural recording or brain imaging data. However, a major problem is that the predictions depend on the setting of the meta-parameters: the learning rate α, the action randomness β, and the discount factor γ.
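To make these update and action-selection rules concrete, the following is a minimal Python sketch of a Q-learning agent that chooses actions with the softmax rule (2) and updates its action values with (3). The meta-parameter values and reward probabilities are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_values, beta):
    """Boltzmann action selection, Eq. (2): p(a) proportional to exp(beta * Q(s, a))."""
    prefs = beta * q_values
    prefs -= prefs.max()                  # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update, Eq. (3)."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

# Toy example: one state, two actions (as in the bandit task of Sec. 4.1).
Q = np.zeros((1, 2))                      # Q[state, action]
alpha, beta, gamma = 0.2, 3.0, 0.0        # assumed meta-parameter values
reward_prob = np.array([0.9, 0.5])        # assumed reward probabilities

for t in range(200):
    s = 0
    a = rng.choice(2, p=softmax_policy(Q[s], beta))
    r = float(rng.random() < reward_prob[a])
    q_learning_step(Q, s, a, r, s, alpha, gamma)

print(Q)   # with gamma = 0, the Q-values approach the reward probabilities
```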
3 Probabilistic Dynamic Evolution of Internal Variables for a Q-Learning Agent

Let us consider the problem of estimating the course of the action values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T} and the meta-parameters α, β, and γ of a reinforcement learner by observing only the sequence of states s_t, actions a_t, and rewards r_t.
To solve this problem, we use a Bayesian method for estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N}. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N}, with marginal distribution p(y_t | x_t). The problem to solve in this setting is to estimate recursively in time the posterior distribution of the hidden variable, p(x_{1:t} | y_{1:t}), where
x_{0:T} = {x_0, …, x_T} and y_{1:T} = {y_1, …, y_T}. The marginal distribution is given by recursion of the following prediction and updating steps:

Predicting: p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

Updating: p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t
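For intuition, the following short Python sketch (not part of the original paper) runs this prediction-update recursion for a one-dimensional hidden variable discretized on a grid, so the integrals become sums. The Gaussian transition and observation models are arbitrary placeholders.

```python
import numpy as np

# Hidden variable discretized on a grid; the integrals above become sums.
grid = np.linspace(0.0, 1.0, 101)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Placeholder transition model p(x_t | x_{t-1}): small Gaussian drift.
trans = gaussian(grid[:, None], grid[None, :], 0.05)   # trans[i, j] ~ p(grid[i] | grid[j])
trans /= trans.sum(axis=0, keepdims=True)

def filter_step(posterior, y, obs_sigma=0.1):
    """One Bayesian filtering step: predict, then update with observation y."""
    predicted = trans @ posterior                       # p(x_t | y_{1:t-1})
    likelihood = gaussian(grid, y, obs_sigma)           # placeholder p(y_t | x_t)
    updated = likelihood * predicted
    return updated / updated.sum()                      # p(x_t | y_{1:t})

posterior = np.full(grid.size, 1.0 / grid.size)         # uniform p(x_0)
for y in [0.4, 0.5, 0.45, 0.6]:                         # toy observations
    posterior = filter_step(posterior, y)
print(grid[posterior.argmax()])                         # MAP estimate of x_t
```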
For the Q-learning agent, whose hidden variable is continuous and multi-dimensional, we use a numerical method proposed for solving this Bayesian recursion, the particle filter (Doucet et al., 2001). In a particle filter, the distribution over the sequence of hidden variables is represented by a set of random samples, also called "particles." We use a bootstrap filter to compute the prediction and update recursion over the distribution of particles (Doucet et al., 2001). Figure 1 shows the dynamic Bayesian network representation of the evolution of the internal variables of a Q-learning agent. The hidden variable x_t consists of the action values Q(s, a) for each state-action pair, the learning rate α, the inverse temperature β, and the discount factor γ. The observable variable y_t consists of the state s_t, the action a_t, and the reward r_t.
The observation probability p(y_t | x_t) is given by the softmax action selection rule (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) together with an assumption about the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant up to small drifts. Because α, β, and γ should all be positive, we assume random-walk dynamics in logarithmic space,

log(x_{t+1}) = log(x_t) + ε_x,  ε_x ~ N(0, σ_x)   (4)

where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
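Below is a minimal sketch of such a bootstrap particle filter in Python, written for the one-state, two-action case with γ fixed at 0 (as in Sec. 4.1). The particle count, the prior ranges for the meta-parameters, and the value of σ_x are assumptions for illustration, not settings from the original studies.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000           # number of particles (assumed)
SIGMA_X = 0.02     # assumed meta-meta-parameter: random-walk variability, Eq. (4)

def softmax(q, beta):
    p = np.exp(beta * (q - q.max(axis=-1, keepdims=True)))
    return p / p.sum(axis=-1, keepdims=True)

# Each particle carries its own action values and meta-parameters (one state, two actions).
particles = {
    "Q": np.zeros((N, 2)),
    "log_alpha": np.log(rng.uniform(0.05, 0.5, N)),   # assumed prior over learning rates
    "log_beta": np.log(rng.uniform(0.5, 5.0, N)),     # assumed prior over inverse temperatures
}

def filter_step(particles, a, r):
    """One bootstrap-filter step for an observed choice a (0 or 1) and reward r."""
    # Update: weight each particle by the probability it assigns to the observed choice, Eq. (2).
    beta = np.exp(particles["log_beta"])
    w = softmax(particles["Q"], beta[:, None])[:, a]
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                  # resample (bootstrap filter)
    particles = {k: v[idx] for k, v in particles.items()}
    # Predict: transition to the next trial -- Q-learning update with each particle's own
    # learning rate (Eq. 3 with gamma = 0) and log-space drift of the meta-parameters (Eq. 4).
    alpha = np.exp(particles["log_alpha"])
    particles["Q"][:, a] += alpha * (r - particles["Q"][:, a])
    for key in ("log_alpha", "log_beta"):
        particles[key] = particles[key] + rng.normal(0.0, SIGMA_X, N)
    return particles

# Usage: feed the observed (action, reward) sequence trial by trial.
for a, r in [(0, 1), (0, 1), (1, 0), (0, 1)]:         # toy behavioral data
    particles = filter_step(particles, a, r)
print(particles["Q"].mean(axis=0), np.exp(particles["log_alpha"]).mean())
```

The posterior over the action values and meta-parameters at each trial is then summarized by the particle means (or full histograms), which can in turn be correlated with neural or imaging data as in the next section.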
Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, the reward probability, the state transition, and the update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependencies.

4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (2005) used this internal-variable approach with a Q-learning model for a monkey free-choice task, a two-armed bandit problem
(Figure 2). The task has only one state, two actions, and stochastic binary reward. The reward probability of each action is fixed within a block of 30-150 trials, but is chosen randomly, block by block, from five probability combinations: the reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected at the beginning of each block from the settings [P(a=L), P(a=R)] = {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}.
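The block structure can be made concrete with a short simulation sketch (only the block lengths and probability settings follow the description above; the number of blocks and their order are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)

# Reward-probability settings [P(reward | a=L), P(reward | a=R)], chosen block by block.
SETTINGS = [(0.5, 0.5), (0.5, 0.1), (0.1, 0.5), (0.5, 0.9), (0.9, 0.5)]

def generate_session(n_blocks=8):
    """Return a list of (block_index, p_left, p_right), one entry per trial."""
    trials = []
    for b in range(n_blocks):
        p_left, p_right = SETTINGS[rng.integers(len(SETTINGS))]
        for _ in range(rng.integers(30, 151)):        # block length of 30-150 trials
            trials.append((b, p_left, p_right))
    return trials

session = generate_session()
# On each trial a choice is rewarded with that block's probability for the chosen action:
block, p_left, p_right = session[0]
choice = "L"                                           # e.g., a leftward choice
rewarded = rng.random() < (p_left if choice == "L" else p_right)
print(len(session), block, p_left, p_right, rewarded)
```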
Fig. 2. Two-armed bandit task for the monkey's behavioral choice. The monkey faced a panel in which three LEDs, right, left, and up, were embedded, with a small LED in the middle. When the small LED was illuminated red, the monkey grasped a handle with its right hand and held it at the center position. If the monkey held the handle at the center position for 1 s, the small LED was turned off as the GO signal. The monkey then turned the handle to either the right or the left side, which was associated with a shift of the yellow LED illumination from up to the turned direction. After 0.5 s, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. Circles indicate states; arrows indicate possible actions and state transitions.

The Q-learning model of the monkey's behavior tries to learn the reward expectation of each action, its action value, and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take the next state's value into account, and thus we set the discount factor to γ = 0. Samejima et al. (2005) showed that the computed internal variable, the action value for a particular movement direction (left/right), estimated from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). Action value is thus an example of a variable that is not immediately obvious from the observable experimental parameters but can be inferred using a computational model that predicts actions.
Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red = right, blue = left), rewards (dot = small, circle = large), choice ratio (cyan line, Gaussian smoothed, σ = 2.5), and predicted choice probability (black line). The color of the upper bar indicates the reward-probability combination. Lower panel: estimated action values (blue = Q-value for left, red = Q-value for right). (From Samejima et al., 2005.)
Fig. 4. Activity of a striatal projection neuron plotted against the estimated action values Q_L(t) and Q_R(t). Left panel: three-dimensional plot of the neural activity against the estimated Q_L(t) and Q_R(t). Right panel: two-dimensional projections of the neuron's discharge rate onto the Q_L axis (left side) and the Q_R axis (right side). Grey lines are derived from the regression model. Circles and error bars indicate the mean and standard deviation of the neural discharge rate for each of 10 equally populated action-value bins. (From Samejima et al., 2005.)
Furthermore, the activity of most dorsal striatum projection neurons correlates with the estimated action value for a particular action (Figure 4).
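As a rough illustration of this type of analysis (not the authors' actual code), one can regress a neuron's trial-by-trial discharge rate on the estimated action values and summarize the relation in equally populated bins, as in Figure 4. The data below are synthetic stand-ins for real recordings and estimated values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for estimated action values and a neuron's firing rate (spikes/s).
n_trials = 400
q_left = rng.uniform(0.0, 1.0, n_trials)
q_right = rng.uniform(0.0, 1.0, n_trials)
rate = 5.0 + 8.0 * q_left + 0.5 * q_right + rng.normal(0.0, 2.0, n_trials)  # a "Q_L-coding" cell

# Linear regression of the firing rate on [1, Q_L, Q_R].
X = np.column_stack([np.ones(n_trials), q_left, q_right])
coef, *_ = np.linalg.lstsq(X, rate, rcond=None)
print("intercept, Q_L weight, Q_R weight:", coef)

# Summary as in Fig. 4: mean and SD of the rate in 10 equally populated Q_L bins.
order = np.argsort(q_left)
for chunk in np.array_split(order, 10):
    print(round(q_left[chunk].mean(), 2), round(rate[chunk].mean(), 1), round(rate[chunk].std(), 1))
```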
4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g., the learning rate, action stochasticity, and discount rate for future reward) can be estimated with this methodology. Although the learning meta-parameters may differ between individual subjects, the model-based approach can track each subject's subjective internal values under those different meta-parameters. In human imaging studies in particular, this
capability is effective for extracting common neural-circuit activations in experiments with multiple subjects. One problem in cognitive neuroscience studies that use decision-making tasks is the lack of controllability of the internal variables. In conventional neuroscience and brain-imaging analyses, the experimenter tries to control a cognitive state, or an assumed internal parameter, through task demands or the experimental setting, and the observed brain activity is then compared against the assumed parameter. However, the subjective internal variables may depend on personal behavioral tendencies and may differ from the parameters the experimenter assumed. Bayesian estimation of the internal variables, including the meta-parameters, can reduce this noise from individual differences by fitting the meta-parameters. Tanaka et al. (2006) showed that the variety of behavioral tendencies across multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distributions of the three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. Subjects with lower estimated γ tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, subjects whose learning rate α and inverse temperature β were estimated to be lower than the others reported in a post-experimental questionnaire that they could not find any confident action selection in each state, even in the later experimental sessions of the task (Figure 5, right panel). Regardless of this variety in the subjects' behavioral tendencies, an fMRI signal correlated with the estimated action value of the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions are completely random, whereas the dorsal striatum correlated with the action value in the predictable environment, in which the state transitions are deterministic. This suggests that different cortico-basal ganglia circuits might be involved depending on the predictability of the environment (Tanaka et al., 2006).