Lecture Notes in Computer Science
Estimating Internal Variables of a Decision Maker’s Brain: A Model-Based Approach for Neuroscience

Kazuyuki Samejima¹ and Kenji Doya²
¹ Brain Science Institute, Tamagawa University, 6-1-1 Tamagawa-gakuen, Machida, Tokyo 194-8610, Japan, samejima@lab.tamagawa.ac.jp
² Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan, doya@oist.jp

Abstract. A major problem in the search for the neural substrates of learning and decision making is that the process is highly stochastic and subject dependent, making simple stimulus- or output-triggered averaging inadequate. This paper presents a novel approach of characterizing neural recording or brain imaging data in reference to the internal variables of learning models (such as connection weights and learning parameters) estimated from the history of external variables within a Bayesian inference framework. We specifically focus on reinforcement learning (RL) models of decision making and derive an estimation method for these variables by particle filtering, a recent method for dynamic Bayesian inference. We present the results of its application to decision-making experiments in monkeys and humans. The framework is applicable to a wide range of behavioral data analysis and diagnosis.

1 Introduction

The traditional approach in neuroscience to discovering information-processing mechanisms is to correlate neuronal activity with external physical variables, such as sensory stimuli or motor outputs. However, when we search for neural correlates of higher-order brain functions, such as attention, memory, and learning, a problem has been that there are no external physical variables to correlate with. Recently, with advances in computational neuroscience, a number of computational models of such cognitive and learning processes have become available that make quantitative predictions of a subject’s behavioral responses. A possible new approach is therefore to look for neural activity that correlates with the internal variables of such computational models (Corrado and Doya, 2007). A major issue in such model-based analysis of neural data is how to estimate the hidden variables of the model. For example, in learning agents, hidden variables such as connection weights change over time. In addition, the course of learning is regulated by hidden meta-parameters such as learning rates. Another important issue is how to judge the validity of a model, or to select the best model among a number of candidates.
The framework of Bayesian inference can provide coherent solutions to the issues of estimating hidden variables, including meta-parameters, from observable experimental data and of selecting the most plausible computational model out of multiple candidates. In this paper, we first review the reinforcement learning model of reward-based decision making (Sutton and Barto, 1998) and derive a Bayesian estimation method for the hidden variables of a reinforcement learning model by particle filtering (Samejima et al., 2004). We then review examples of application of the method to monkey neural recording (Samejima et al., 2005) and to human imaging studies (Haruno et al., 2004; Tanaka et al., 2006; Behrens et al., 2007).

2 Reinforcement Learning Model as an Animal or Human Decision Maker

Reinforcement learning can serve as a model of animal or human decision making based on reward delivery. Notably, the responses of monkey midbrain dopamine neurons are successfully explained by the temporal difference (TD) error of reinforcement learning models (Schultz et al., 1997). The goal of reinforcement learning is to improve the policy, the rule for taking an action a_t at state s_t, so that the resulting reward
is maximized in the long run. The basic strategy of reinforcement learning is to estimate the cumulative future reward under the current policy as a value function for each state, and then to improve the policy based on that value function. In a standard reinforcement learning algorithm called "Q-learning," an agent learns the action-value function

Q(s_t, a_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | s_t, a_t ]   (1)

which estimates the cumulative future reward when action a_t is taken at state s_t. The discount factor 0 ≤ γ < 1 is a meta-parameter that controls the time scale of prediction. The policy of the learner is then given by comparing action values, e.g., according to the Boltzmann distribution
π(a | s_t) = exp(β Q(s_t, a)) / Σ_{a'∈A} exp(β Q(s_t, a'))   (2)

where the inverse temperature β > 0 is another meta-parameter that controls the randomness of action selection. From an experience of state s_t, action a_t, reward r_t, and next state s_{t+1}, the action-value function is updated by the Q-learning algorithm (Sutton and Barto, 1998) as

δ_t = r_t + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t),
Q(s_t, a_t) ⇐ Q(s_t, a_t) + α δ_t   (3)

where α > 0 is the meta-parameter for the learning rate. For such a reinforcement learning agent, we thus have three meta-parameters. A reinforcement learning model of behavior learning not only predicts the subject's actions, but can also provide candidates for the brain's internal processes of decision making, which may be captured in neural recording or brain imaging data. However, a major problem is that the predictions depend on the setting of the meta-parameters: the learning rate α, the action randomness β, and the discount factor γ.
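To make these update and action-selection rules concrete, the following is a minimal Python sketch of a Q-learning agent that chooses actions with the softmax rule (2) and updates its action values with (3). The meta-parameter values and reward probabilities are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(q_values, beta):
    """Boltzmann action selection, Eq. (2): p(a) proportional to exp(beta * Q(s, a))."""
    prefs = beta * q_values
    prefs -= prefs.max()                  # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning update, Eq. (3)."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

# Toy example: one state, two actions (as in the bandit task of Sec. 4.1).
Q = np.zeros((1, 2))                      # Q[state, action]
alpha, beta, gamma = 0.2, 3.0, 0.0        # assumed meta-parameter values
reward_prob = np.array([0.9, 0.5])        # assumed reward probabilities

for t in range(200):
    s = 0
    a = rng.choice(2, p=softmax_policy(Q[s], beta))
    r = float(rng.random() < reward_prob[a])
    q_learning_step(Q, s, a, r, s, alpha, gamma)

print(Q)   # with gamma = 0, the Q-values approach the reward probabilities
```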
3 Probabilistic Dynamic Evolution of Internal Variables for a Q-Learning Agent

Let us consider the problem of estimating the course of the action values {Q_t(s, a); s ∈ S, a ∈ A, 0 < t < T} and the meta-parameters α, β, and γ of a reinforcement learner by observing only the sequence of states s_t, actions a_t, and rewards r_t.
To solve this problem, we use a Bayesian method for estimating a dynamical hidden variable {x_t; t ∈ N} from a sequence of observable variables {y_t; t ∈ N}. We assume that the unobservable signal (hidden variable) is modeled as a Markov process with initial distribution p(x_0) and transition probability p(x_{t+1} | x_t). The observations {y_t; t ∈ N} are assumed to be conditionally independent given the process {x_t; t ∈ N}, with marginal distribution p(y_t | x_t). The problem to solve in this setting is to estimate recursively in time the posterior distribution of the hidden variable, p(x_{1:t} | y_{1:t}), where
x_{0:T} = {x_0, …, x_T} and y_{1:T} = {y_1, …, y_T}. The marginal distribution is given by recursion of the following prediction and updating steps:

Predicting: p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}

Updating: p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t−1}) / ∫ p(y_t | x_t) p(x_t | y_{1:t−1}) dx_t
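For intuition, the following short Python sketch (not part of the original paper) runs this prediction-update recursion for a one-dimensional hidden variable discretized on a grid, so the integrals become sums. The Gaussian transition and observation models are arbitrary placeholders.

```python
import numpy as np

# Hidden variable discretized on a grid; the integrals above become sums.
grid = np.linspace(0.0, 1.0, 101)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Placeholder transition model p(x_t | x_{t-1}): small Gaussian drift.
trans = gaussian(grid[:, None], grid[None, :], 0.05)   # trans[i, j] ~ p(grid[i] | grid[j])
trans /= trans.sum(axis=0, keepdims=True)

def filter_step(posterior, y, obs_sigma=0.1):
    """One Bayesian filtering step: predict, then update with observation y."""
    predicted = trans @ posterior                       # p(x_t | y_{1:t-1})
    likelihood = gaussian(grid, y, obs_sigma)           # placeholder p(y_t | x_t)
    updated = likelihood * predicted
    return updated / updated.sum()                      # p(x_t | y_{1:t})

posterior = np.full(grid.size, 1.0 / grid.size)         # uniform p(x_0)
for y in [0.4, 0.5, 0.45, 0.6]:                         # toy observations
    posterior = filter_step(posterior, y)
print(grid[posterior.argmax()])                         # MAP estimate of x_t
```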
For the Q-learning agent, whose hidden variable is continuous and multi-dimensional, we use a numerical method proposed for solving this Bayesian recursion, the particle filter (Doucet et al., 2001). In a particle filter, the distribution over the sequence of hidden variables is represented by a set of random samples, also called "particles." We use a bootstrap filter to compute the prediction and update recursion over the distribution of particles (Doucet et al., 2001). Figure 1 shows the dynamic Bayesian network representation of the evolution of the internal variables of a Q-learning agent. The hidden variable x_t consists of the action values Q(s, a) for each state-action pair, the learning rate α, the inverse temperature β, and the discount factor γ. The observable variable y_t consists of the state s_t, the action a_t, and the reward r_t.
The observation probability p(y_t | x_t) is given by the softmax action selection rule (2). The transition probability p(x_{t+1} | x_t) of the hidden variable is given by the Q-learning rule (3) together with an assumption about the meta-parameter dynamics. Here we assume that the meta-parameters (α, β, and γ) are constant up to small drifts. Because α, β, and γ should all be positive, we assume random-walk dynamics in logarithmic space,

log(x_{t+1}) = log(x_t) + ε_x,  ε_x ~ N(0, σ_x)   (4)

where σ_x is a meta-meta-parameter that defines the random-walk variability of the meta-parameters.
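Below is a minimal sketch of such a bootstrap particle filter in Python, written for the one-state, two-action case with γ fixed at 0 (as in Sec. 4.1). The particle count, the prior ranges for the meta-parameters, and the value of σ_x are assumptions for illustration, not settings from the original studies.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000           # number of particles (assumed)
SIGMA_X = 0.02     # assumed meta-meta-parameter: random-walk variability, Eq. (4)

def softmax(q, beta):
    p = np.exp(beta * (q - q.max(axis=-1, keepdims=True)))
    return p / p.sum(axis=-1, keepdims=True)

# Each particle carries its own action values and meta-parameters (one state, two actions).
particles = {
    "Q": np.zeros((N, 2)),
    "log_alpha": np.log(rng.uniform(0.05, 0.5, N)),   # assumed prior over learning rates
    "log_beta": np.log(rng.uniform(0.5, 5.0, N)),     # assumed prior over inverse temperatures
}

def filter_step(particles, a, r):
    """One bootstrap-filter step for an observed choice a (0 or 1) and reward r."""
    # Update: weight each particle by the probability it assigns to the observed choice, Eq. (2).
    beta = np.exp(particles["log_beta"])
    w = softmax(particles["Q"], beta[:, None])[:, a]
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)                  # resample (bootstrap filter)
    particles = {k: v[idx] for k, v in particles.items()}
    # Predict: transition to the next trial -- Q-learning update with each particle's own
    # learning rate (Eq. 3 with gamma = 0) and log-space drift of the meta-parameters (Eq. 4).
    alpha = np.exp(particles["log_alpha"])
    particles["Q"][:, a] += alpha * (r - particles["Q"][:, a])
    for key in ("log_alpha", "log_beta"):
        particles[key] = particles[key] + rng.normal(0.0, SIGMA_X, N)
    return particles

# Usage: feed the observed (action, reward) sequence trial by trial.
for a, r in [(0, 1), (0, 1), (1, 0), (0, 1)]:         # toy behavioral data
    particles = filter_step(particles, a, r)
print(particles["Q"].mean(axis=0), np.exp(particles["log_alpha"]).mean())
```

The posterior over the action values and meta-parameters at each trial is then summarized by the particle means (or full histograms), which can in turn be correlated with neural or imaging data as in the next section.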
Fig. 1. A Bayesian network representation of a Q-learning agent: the dynamics of the observable and unobservable variables depend on the decision, the reward probability, the state transition, and the update rule for the value function. Circles: hidden variables. Double boxes: observable variables. Arrows: probabilistic dependencies.

4 Computational Model-Based Analysis of Brain Activity

4.1 Application to Monkey Choice Behavior and Striatal Neural Activity

Samejima et al. (2005) used this internal-variable approach with a Q-learning model for a monkey free-choice task, a two-armed bandit problem
(Figure 2). The task has only one state, two actions, and stochastic binary reward. The reward probability of each action is fixed within a block of 30-150 trials, but is chosen randomly, block by block, from five probability combinations: the reward probabilities P(a=L) for action a=L and P(a=R) for action a=R are selected at the beginning of each block from the settings [P(a=L), P(a=R)] = {[0.5, 0.5], [0.5, 0.1], [0.1, 0.5], [0.5, 0.9], [0.9, 0.5]}.
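The block structure can be made concrete with a short simulation sketch (only the block lengths and probability settings follow the description above; the number of blocks and their order are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)

# Reward-probability settings [P(reward | a=L), P(reward | a=R)], chosen block by block.
SETTINGS = [(0.5, 0.5), (0.5, 0.1), (0.1, 0.5), (0.5, 0.9), (0.9, 0.5)]

def generate_session(n_blocks=8):
    """Return a list of (block_index, p_left, p_right), one entry per trial."""
    trials = []
    for b in range(n_blocks):
        p_left, p_right = SETTINGS[rng.integers(len(SETTINGS))]
        for _ in range(rng.integers(30, 151)):        # block length of 30-150 trials
            trials.append((b, p_left, p_right))
    return trials

session = generate_session()
# On each trial a choice is rewarded with that block's probability for the chosen action:
block, p_left, p_right = session[0]
choice = "L"                                           # e.g., a leftward choice
rewarded = rng.random() < (p_left if choice == "L" else p_right)
print(len(session), block, p_left, p_right, rewarded)
```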
Fig. 2. Two-armed bandit task for the monkey's behavioral choice. The monkey faced a panel in which three LEDs, right, left, and up, were embedded, with a small LED in the middle. When the small LED was illuminated red, the monkey grasped a handle with its right hand and held it at the center position. If the monkey held the handle at the center position for 1 s, the small LED was turned off as the GO signal. The monkey then turned the handle to either the right or the left side, which was associated with a shift of the yellow LED illumination from up to the turned direction. After 0.5 s, the color of the LED changed from yellow to either green or red. A green LED was followed by a large amount of reward water, while a red LED was followed by a small amount of water. Lower panel: state diagram of the task. Circles indicate states; arrows indicate possible actions and state transitions.

The Q-learning model of the monkey's behavior tries to learn the reward expectation of each action, its action value, and to maximize the reward acquired in each block. Because the task has only one state, the agent does not need to take the next state's value into account, and thus we set the discount factor to γ = 0. Samejima et al. (2005) showed that the computed internal variable, the action value for a particular movement direction (left/right), estimated from the past history of choices and outcomes (rewards), could predict the monkey's future choice probability (Figure 3). Action value is thus an example of a variable that is not immediately obvious from the observable experimental parameters but can be inferred using a computational model that predicts actions.
Fig. 3. Time course of predicted choice probability and estimated action values. Upper panel: an example history of actions (red = right, blue = left), rewards (dot = small, circle = large), choice ratio (cyan line, Gaussian smoothed, σ = 2.5), and predicted choice probability (black line). The color of the upper bar indicates the reward-probability combination. Lower panel: estimated action values (blue = Q-value for left, red = Q-value for right). (From Samejima et al., 2005.)
Fig. 4. Activity of a striatal projection neuron plotted against the estimated action values Q_L(t) and Q_R(t). Left panel: three-dimensional plot of the neural activity against the estimated Q_L(t) and Q_R(t). Right panel: two-dimensional projections of the neuron's discharge rate onto the Q_L axis (left side) and the Q_R axis (right side). Grey lines are derived from the regression model. Circles and error bars indicate the mean and standard deviation of the neural discharge rate for each of 10 equally populated action-value bins. (From Samejima et al., 2005.)
Furthermore, the activity of most dorsal striatum projection neurons correlates with the estimated action value for a particular action (Figure 4).
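As a rough illustration of this type of analysis (not the authors' actual code), one can regress a neuron's trial-by-trial discharge rate on the estimated action values and summarize the relation in equally populated bins, as in Figure 4. The data below are synthetic stand-ins for real recordings and estimated values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for estimated action values and a neuron's firing rate (spikes/s).
n_trials = 400
q_left = rng.uniform(0.0, 1.0, n_trials)
q_right = rng.uniform(0.0, 1.0, n_trials)
rate = 5.0 + 8.0 * q_left + 0.5 * q_right + rng.normal(0.0, 2.0, n_trials)  # a "Q_L-coding" cell

# Linear regression of the firing rate on [1, Q_L, Q_R].
X = np.column_stack([np.ones(n_trials), q_left, q_right])
coef, *_ = np.linalg.lstsq(X, rate, rcond=None)
print("intercept, Q_L weight, Q_R weight:", coef)

# Summary as in Fig. 4: mean and SD of the rate in 10 equally populated Q_L bins.
order = np.argsort(q_left)
for chunk in np.array_split(order, 10):
    print(round(q_left[chunk].mean(), 2), round(rate[chunk].mean(), 1), round(rate[chunk].std(), 1))
```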
4.2 Application to Human Imaging Data

Not only the internal variables but also the meta-parameters (e.g., the learning rate, action stochasticity, and discount rate for future reward) can be estimated with this methodology. Although the learning meta-parameters may differ between individual subjects, the model-based approach can track each subject's subjective internal values under those different meta-parameters. In human imaging studies in particular, this
capability is effective for extracting common neural-circuit activations in experiments with multiple subjects. One problem in cognitive neuroscience studies that use decision-making tasks is the lack of controllability of the internal variables. In conventional neuroscience and brain-imaging analyses, the experimenter tries to control a cognitive state, or an assumed internal parameter, through task demands or the experimental setting, and the observed brain activity is then compared against the assumed parameter. However, the subjective internal variables may depend on personal behavioral tendencies and may differ from the parameters the experimenter assumed. Bayesian estimation of the internal variables, including the meta-parameters, can reduce this noise from individual differences by fitting the meta-parameters. Tanaka et al. (2006) showed that the variety of behavioral tendencies across multiple human subjects could be characterized by the estimated meta-parameters of a Q-learning agent. Figure 5 shows the distributions of the three meta-parameters: learning rate α, action stochasticity β, and discount rate γ. Subjects with lower estimated γ tended to be trapped in a locally optimal policy and could not reach the optimal choice sequence (Figure 5, left panel). On the other hand, subjects whose learning rate α and inverse temperature β were estimated to be lower than the others reported in a post-experimental questionnaire that they could not find any confident action selection in each state, even in the later experimental sessions of the task (Figure 5, right panel). Regardless of this variety in the subjects' behavioral tendencies, an fMRI signal correlated with the estimated action value of the selected action was observed in the ventral striatum in the unpredictable condition, in which the state transitions are completely random, whereas the dorsal striatum correlated with the action value in the predictable environment, in which the state transitions are deterministic. This suggests that different cortico-basal ganglia circuits might be involved depending on the predictability of the environment (Tanaka et al., 2006).