The authors have declared that no competing interests exist.

Decision-making in the real world presents the challenge of requiring flexible yet prompt behavior, a balance that has been characterized in terms of a trade-off between a slower, prospective, goal-directed model-based (MB) strategy and a fast, retrospective, habitual model-free (MF) strategy. Theory predicts that flexibility to changes in both reward values and transition contingencies can determine the relative influence of the two systems in reinforcement learning, but few studies have manipulated the latter. We therefore developed a novel two-level contingency change task in which the transition contingencies between states change every few trials; MB and MF control predict different responses following these contingency changes, allowing their relative influence to be inferred. Additionally, we manipulated the rate of contingency changes to determine whether the volatility of contingency changes would shift subjects between MB and MF strategies. We found that human subjects employed a hybrid MB/MF strategy on the task, corroborating the parallel contribution of MB and MF systems to reinforcement learning. Further, subjects did not remain at one level of MB/MF behavior but displayed a shift towards more MB behavior over the first two blocks, attributable not to the rate of contingency changes but to the extent of training. We demonstrate that flexibility to contingency changes can distinguish MB and MF strategies, with human subjects utilizing a hybrid strategy that shifts towards more MB behavior over blocks, which in turn corresponds to a higher payoff.

To make good decisions, we must learn to associate actions with their true outcomes. Flexibility to changes in action/outcome relationships is therefore essential for optimal decision-making. For example, actions can lead to outcomes that change in value: one day, your favorite food is poorly made and thus less pleasant. Alternatively, changes can occur in the contingencies themselves: ordering a dish of one kind and instead receiving another. How we respond to such changes is indicative of our decision-making strategy; habitual learners will continue to choose their favorite food even if the quality has gone down, whereas goal-directed learners will soon learn it is better to choose another dish. A popular paradigm probes the effect of value changes on decision-making, but the effect of contingency changes remains largely unexplored. We therefore developed a novel task to study the latter. We find that humans used a mixed habitual/goal-directed strategy, became more goal-directed over the course of the task, and earned more rewards with increasing goal-directed behavior. This shows that flexibility to contingency changes is adaptive for learning from rewards, and indicates that such flexibility can reveal which decision-making strategy is used.

For optimal decision-making, animals must learn to associate the choices they make with the outcomes that arise from them. Classical learning theories suggest that this problem is addressed by habitual or goal-directed strategies for reinforcement learning [

Recent studies have emphasized that MB and MF systems work in parallel rather than in isolation [

Theory predicts that flexibility to transition contingency changes can, like flexibility to reward structure, determine the relative influence of MB and MF strategies [

On top of a hybrid MB/MF strategy, subjects may not remain at one level of MB/MF control but instead shift their relative weight in accordance with environmental factors. In general, animals show habit formation with time, a robust effect reported since early reward devaluation studies [

We found that human subjects indeed showed a hybrid strategy in reacting to contingency changes in our task, with an increased influence of MB control over the first two blocks. However, relative MB/MF control did not significantly differ across rates of contingency changes; thus, the increase in MB control may be a more global effect of “anti-habitization” over time.

Subjects (N = 16) performed a two-level contingency change task that consisted of 600 trials (

(A) Each trial started from either the first-level state (S0), with 50% probability, or one of the two second-level states (S1 or S2), each with 25% probability. While two choices were available at S0, only a single forced choice was available at the second-level states. The transition structure from the second-level states to terminal states repeatedly flipped after a random number of trials (every 3–14), in an unsignalled fashion. One of the two terminal states (S3 or S4) was associated with a high reward outcome and the other resulted in a low reward outcome. (B) Timeline of the task for one example trial.
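The task structure described in this caption can be summarized as a small state machine. The sketch below is purely illustrative: the class name, the deterministic mapping from first-level actions to second-level states, and the flip bookkeeping are our assumptions, not the authors' implementation.

```python
import random

class ContingencyChangeTask:
    """Illustrative state machine for the two-level contingency change task.
    States: 0 = S0 (first level), 1/2 = S1/S2 (second level),
    3/4 = S3/S4 (terminal; one carries the high reward)."""

    def __init__(self, flip_range=(3, 14)):
        self.flip_range = flip_range
        self.mapping = {1: 3, 2: 4}  # second-level state -> terminal state
        self.trials_until_flip = random.randint(*flip_range)

    def start_state(self):
        # 50% of trials start at S0, 25% each at S1 or S2
        return random.choices([0, 1, 2], weights=[0.5, 0.25, 0.25])[0]

    def second_level(self, first_action):
        # assumed deterministic: action 1 -> S1, action 2 -> S2
        return first_action

    def terminal(self, second_state):
        self.trials_until_flip -= 1
        if self.trials_until_flip <= 0:
            # unsignalled flip of the second-level -> terminal contingency;
            # applied before the transition, so a change trial already
            # exposes the new structure (as described in the text)
            self.mapping = {1: self.mapping[2], 2: self.mapping[1]}
            self.trials_until_flip = random.randint(*self.flip_range)
        return self.mapping[second_state]
```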

If a contingency change occurred, subjects always experienced the new transition structure regardless of whether they started at the first or second level, as contingency could only change between second-level and terminal states. Therefore, provided that an action was possible at the next trial (i.e. that the next trial started at the first level) the MB system would plan using the updated causal structure and thus would take the action that led under the new transition contingencies to the high reward terminal state. However, if a contingency change trial started from the second level, the MF system would not choose the optimal action on the next trial, as neither the received reward nor the new contingency would update the cached values of first-level actions, simply because no first-level action was experienced on those trials. As a result, the relative contribution of MB and MF systems can be measured by the degree of behavioral flexibility on first-level trials following contingency change trials starting from the second level.

To examine the effect of environmental volatility on the contribution of the two systems, the frequency of contingency changes was varied: changes occurred every 3–6 trials for 200 trials, every 7–10 trials for another 200 trials, and every 11–14 trials for the final 200 trials. The order of fast and medium contingency-change blocks was counterbalanced across two subject groups (

Simulated choices on the task were implemented according to MB and MF reinforcement learning algorithms (see

Across these conditions, MB and MF systems showed different stay probability patterns. The MF system, having no experience of the action that led to the new contingency, was more likely to stay on the action leading to the high reward state, and shift on the action leading to the low reward state, under “fixed” than “changed” conditions (

(A) Stay probabilities from human subjects (N = 16) showed significant effects of both model-based (

While stay probabilities ruled out a purely MB or purely MF strategy, this measure could not quantify the degree to which subjects used the hybrid strategy; therefore, we used a hierarchical Bayesian method to fit candidate models of behavior to the subjects’ data, to determine which model best explained subjects’ choices and to obtain parameter estimates for the MB/MF weighting used by the subjects. The models tested included a pure MB model, a pure MF model, a hybrid model with one constant weight

Model-fitting results supported the existence of a hybrid MB/MF strategy in our task. Candidate models were compared using two criteria: the integrated Bayesian Information Criterion (iBIC), which controls for the number of parameters [

The median fitted

To confirm that the increase in model-based weight was not due to differences in the rate of contingency changes, we further analysed the fitted weights from the three-frequency hybrid model, which had a different

As subjects became more model-based, high reward choices and consequently reward rate also increased. Choice probabilities for the high reward action differed over blocks,

Additionally, the hybrid model was simulated using a range of MB weights (0, 0.2, 0.4, 0.6, 0.8 and 1) using the one-weight hybrid model for simplicity. Other free parameters were set to values fitted to the participants’ data. There was a significant effect of MB weight on reward rate, (

We developed a novel two-level contingency change task in which flexibility to frequently-changing transition contingencies between states could determine the extent to which subjects were using a model-based or a model-free strategy. Subjects showed a hybrid strategy when reacting to contingency changes, corroborating recent evidence of the parallel contribution of MB and MF systems in reward-guided decision-making. Importantly, this finding confirmed that changes to transition contingencies can elicit a balance of MB and MF behavior akin to changes to reward structure. Model-fitting analyses indicated that a hybrid model with three MB weights best explained subjects’ choices, with relative MB control increasing over blocks. The rate of contingency changes did not significantly shift the MB/MF balance; rather, MB control increased over the first two blocks of trials. This increase in MB control was concurrent with an increased proportion of high reward choices and consequently increased reward rate; individually, each subject’s

In all, these results illustrated that not only do subjects use a mixed MB/MF strategy, but within this hybrid strategy, the trade-off shifts towards “anti-habitization” across the first two blocks. This agrees with a previous study [

This finding of an increase in MB control over blocks, however, goes against another study [

Manipulations of the rate of contingency changes did not seem to affect MB/MF control. While it has been shown that environmental volatility can influence MB/MF levels in the context of common or rare updates of reward structure [

In conclusion, in a two-level contingency change task, subjects showed a hybrid MB/MF strategy, emphasizing the parallel contribution of the two systems in reacting to changes in transition contingencies. The inclusion of multiple, frequent changes allowed us to perform model-fitting; by doing so, we found an increase in MB control over the first two blocks, a result not detectable in model-agnostic analyses alone. Our results build on the literature reporting the use of a hybrid MB/MF strategy in reacting to changes in information about reward structure, here demonstrating a mixture of strategies in reacting to multiple, frequent contingency changes, a situation that had previously been unexplored.

In addition to MB and MF systems, a third reinforcement learning algorithm known as the successor representation (SR) [

Sixteen subjects (nine males, mean age 24 years) took part. The study was approved by the University College London Research Ethics Committee (Project ID 3450/002). All subjects provided written informed consent.

Subjects performed 600 trials of three blocks (200 each) which differed in frequency of contingency changes: fast (contingency change every 3–6 trials), medium (every 7–10 trials) or slow (every 11–14 trials). Each subject was assigned to one of two groups (

To ensure subjects understood the task structure, they were first trained with practice trials (

At the first level, subjects had a two-alternative forced choice between two actions (pressing ‘S’ for the action available on the left side of the screen, ‘L’ for the right), with the presentation of stimuli randomized over the left/right sides of the screen. To ensure that subjects recognized second-level states, they had to press ‘D’ if they encountered one of these states, and ‘K’ for the other. Both responses had a time limit of 750 ms, after which the trial ended with no reward. Missed trials were not repeated.

Payoff at the high-reward terminal state varied according to a Gaussian random walk (
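A drifting payoff of this kind can be sketched as below; the starting value, step size, and bounds are illustrative placeholders, since the actual random-walk parameters are given in the referenced figure rather than here.

```python
import random

def gaussian_random_walk(n_trials, start=50.0, sd=2.5, lo=0.0, hi=100.0):
    """Illustrative drifting payoff: each trial adds Gaussian noise to the
    previous value and clips at the bounds. All numbers are assumptions."""
    payoffs, x = [], start
    for _ in range(n_trials):
        x = min(hi, max(lo, x + random.gauss(0.0, sd)))
        payoffs.append(x)
    return payoffs
```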

Both model-free and model-based algorithms seek to estimate the values of state-action pairs in order to choose the actions that maximize expected future rewards. The state space was modelled as having a first-level state S_{0} with two actions a_{1} and a_{2}, two possible second-level states S_{1} and S_{2}, and two possible terminal states S_{3} and S_{4}. Only one action was available at the second-level and terminal states, as the subject did not have any choices at these levels.

The model-free algorithm updates values of state-action pairs using temporal difference Q-learning []. The reward r_{t} is used to compute a reward prediction error δ_{t}, which updates the action values for that state, Q_{MF}(s_{t}, a_{t}). At the first level, r_{t} is set to 0 as there is no reward at this level.

The reward prediction error updates existing action values according to a learning rate α_{MF}, and is modified by the eligibility trace λ, which governs how much credit past actions are given for outcomes. In a
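Stated as code, one trial of such an MF update might look like the following sketch. The SARSA-style TD(λ) form and the dictionary data structures are our assumptions; only the roles of r_{t}, δ_{t}, α, and λ come from the text.

```python
def mf_trial_update(q, trajectory, rewards, alpha, lam):
    """Illustrative TD(lambda) value update for one trial.
    q: dict (state, action) -> cached value
    trajectory: [(state, action), ...] visited this trial
    rewards: reward after each step (0 everywhere except the final step)"""
    elig = {}
    for i, (s, a) in enumerate(trajectory):
        next_q = q.get(trajectory[i + 1], 0.0) if i + 1 < len(trajectory) else 0.0
        delta = rewards[i] + next_q - q.get((s, a), 0.0)  # reward prediction error
        elig[(s, a)] = elig.get((s, a), 0.0) + 1.0        # mark eligibility
        for sa in elig:
            q[sa] = q.get(sa, 0.0) + alpha * delta * elig[sa]
            elig[sa] = elig[sa] * lam                     # decay credit for past actions
    return q
```

With λ = 1, the terminal prediction error propagates fully back to the first-level action within the same trial; with λ = 0, first-level values are updated only via the second-level value estimate.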

The model-based algorithm learns both transition probabilities and reward probabilities. The transition probabilities track the transition contingencies between states

The reward probabilities use the received reward r_{t} to update the subjective reward

These learned transition and reward functions are then used to update the action values for the model-based system, Q_{MB}.
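A minimal sketch of such model-based planning follows. The dictionary representation of the transition and reward estimates is our own; the instant transition update (learning rate 1) mirrors the assumption stated in the supplementary analyses, but the exact update equations are illustrative.

```python
def update_transition(trans, state, observed_next, alpha_t=1.0):
    """Move the transition estimate for `state` toward the observed successor.
    alpha_t = 1 mirrors the paper's assumption of immediate transition updating."""
    for s_next in trans[state]:
        target = 1.0 if s_next == observed_next else 0.0
        trans[state][s_next] += alpha_t * (target - trans[state][s_next])

def mb_values(trans, reward_est, actions=("a1", "a2")):
    """Model-based first-level action values: expected reward of the terminal
    state reached via each action, computed by chaining the learned
    transition probabilities (illustrative two-step planner).
    trans: dict state (or (state, action)) -> {next_state: probability}
    reward_est: dict terminal_state -> estimated reward"""
    q_mb = {}
    for a in actions:
        q = 0.0
        for s2, p2 in trans[("S0", a)].items():      # second-level distribution
            for s3, p3 in trans[s2].items():         # terminal distribution
                q += p2 * p3 * reward_est[s3]
        q_mb[a] = q
    return q_mb
```

Because planning reads the current transition estimates, a single observed contingency flip immediately redirects the MB system's first-level preference, which is exactly the flexibility the task exploits.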

Other parameters of the simulated models included learning rates for the model-based and model-free systems, α_{MB} and α_{MF}, and a stay bias which temporarily increased the action value of the previously selected action regardless of outcome, to quantify a perseveration bias. These additional parameters improved fit even when controlling for model complexity (

For both systems, the values of the non-selected action were updated as well, assuming that subjects knew that the reward for the selected action and the reward for the non-selected action were negatively related, according to proposals of fictive reward [], alongside the values Q(s_{t}, a_{t}) of the visited states. The inclusion of fictive reward updates resulted in a better fit to the subjects’ choices (
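A counterfactual update of this kind can be sketched as follows; the complementary-outcome form (r_max − r) is our illustrative reading of the anti-correlation assumption, and r_max is a placeholder.

```python
def fictive_update(q, state, chosen, unchosen, reward, alpha, r_max=1.0):
    """Illustrative fictive-reward update: the chosen action moves toward the
    received reward, while the unchosen action moves toward the assumed
    complementary outcome (r_max - reward)."""
    q_c = q.get((state, chosen), 0.0)
    q[(state, chosen)] = q_c + alpha * (reward - q_c)
    fictive_r = r_max - reward                      # assumed anti-correlated outcome
    q_u = q.get((state, unchosen), 0.0)
    q[(state, unchosen)] = q_u + alpha * (fictive_r - q_u)
    return q
```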

The hybrid model weighted MB and MF action values according to a parameter

Action selection was then determined for all models according to a “softmax” rule which computes action probabilities as proportional to the exponential of the action values.

The inverse temperature
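Putting the weighting, stay bias, and softmax together, a hybrid policy can be sketched as below. The parameter roles (w, β, stay bias) follow the text, but the additive form of the stay bias and the data structures are our assumptions.

```python
import math

def hybrid_choice_probs(q_mb, q_mf, w, beta, stay_bias=0.0, prev_action=None):
    """Illustrative hybrid policy: Q = w*Q_MB + (1 - w)*Q_MF, plus a stay
    bias on the previously chosen action, passed through a softmax with
    inverse temperature beta."""
    actions = sorted(q_mb)
    q = {a: w * q_mb[a] + (1.0 - w) * q_mf.get(a, 0.0) for a in actions}
    if prev_action is not None:
        q[prev_action] += stay_bias        # perseveration, regardless of outcome
    exps = {a: math.exp(beta * q[a]) for a in actions}
    z = sum(exps.values())
    return {a: exps[a] / z for a in actions}
```

With w = 1 the policy is purely model-based, with w = 0 purely model-free; β scales how deterministically the higher-valued action is chosen.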

To best replicate the subjects’ data of 600 trials for 16 subjects, each simulation was run for 16 initializations of 600 trials each. All reported simulations used the parameters fitted with the three-block hybrid model for the learning rates α_{MF} and α_{MB}, inverse temperature

Subjects’ data were fit to the models using mixed-effects hierarchical model fitting. Expectation-maximisation was used, which iteratively generates group-level distributions over individual subject parameter estimates, choosing the parameters that maximize the likelihood of the data given those estimates. In each iteration, parameters were estimated by minimizing the negative log-likelihood of parameter estimates using

The group-level distributions over all free parameters were assumed to be Gaussian, with no constraint. To then impose sensible constraints (0 ≤
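The standard trick for fitting bounded parameters with unconstrained Gaussians is to transform them, as sketched below; which transform applies to which parameter here is our inference from the stated constraints (sigmoid for quantities in [0, 1] such as learning rates and weights, exponential for strictly positive quantities such as the inverse temperature).

```python
import math

def to_constrained(x_unconstrained, kind):
    """Map an unconstrained (Gaussian-distributed) parameter onto its
    natural range; an illustrative sketch of the usual transformation."""
    if kind == "unit":          # 0 <= param <= 1 (learning rates, weights)
        return 1.0 / (1.0 + math.exp(-x_unconstrained))
    if kind == "positive":      # param > 0 (inverse temperature)
        return math.exp(x_unconstrained)
    raise ValueError(kind)
```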

To ensure the efficacy of _{block 1}_{block 2}_{block 3}

The integrated Bayesian information criterion (iBIC) [
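An integrated BIC of this kind is commonly computed via a Monte Carlo approximation of each subject's marginal likelihood under the fitted group-level distribution. The sketch below is our generic reading of that recipe (one scalar parameter per subject, Gaussian group prior, group-level parameter-count penalty), not the authors' exact procedure.

```python
import math, random

def ibic(subject_loglik_fns, group_mean, group_sd, n_total_choices, n_samples=1000):
    """Illustrative Monte Carlo iBIC: average each subject's likelihood over
    parameters sampled from the group-level Gaussian, sum the log marginal
    likelihoods, and penalize the number of group-level parameters.
    subject_loglik_fns: list of functions theta -> subject log-likelihood."""
    total = 0.0
    for loglik in subject_loglik_fns:
        samples = [loglik(random.gauss(group_mean, group_sd))
                   for _ in range(n_samples)]
        m = max(samples)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(l - m) for l in samples) / n_samples)
    k_group = 2  # mean and sd of the single group-level distribution assumed here
    return -2.0 * total + k_group * math.log(n_total_choices)
```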

Permutation tests were run to evaluate the probability that the across-block increases in the model-based weight arose by chance. The differences (w_{block 2} − w_{block 1}) and (w_{block 3} − w_{block 2}) were evaluated for each permutation. The occurrences of random permutations with a smaller (w_{block 2} − w_{block 1}) and (w_{block 3} − w_{block 2}) than the true permutation were then tallied.
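One way to implement such a test is sketched below; shuffling block labels within subject and the one-sided counting rule are our illustrative reading of the procedure described.

```python
import random

def perm_test_increase(w_early, w_late, n_perm=10000, seed=0):
    """Illustrative permutation test for an across-block increase in the MB
    weight: shuffle block labels within each subject and count how often the
    shuffled mean difference reaches the observed one (one-sided p-value)."""
    rng = random.Random(seed)
    n = len(w_early)
    observed = sum(w_late[i] - w_early[i] for i in range(n)) / n
    count = 0
    for _ in range(n_perm):
        diff = 0.0
        for i in range(n):
            a, b = w_early[i], w_late[i]
            if rng.random() < 0.5:      # swap the two block labels for this subject
                a, b = b, a
            diff += b - a
        if diff / n >= observed:
            count += 1
    return count / n_perm
```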

Likewise, to evaluate the effect of frequency of contingency changes, permutation tests were run to compare

To rule out the possibility that the effective learning rates of the MB and MF systems, rather than their fundamental differences, produced the behavior, we conducted several further analyses. A hybrid model composed of two MF systems with small (0.25) and large (0.75) learning rates could not replicate the stay probability patterns observed in subjects and in the MB+MF hybrid system. Even a more extreme hybrid model, composed of two MF systems, one with a small (0.25) learning rate and no eligibility trace (λ = 0) and another with a large (0.75) learning rate and a full eligibility trace (λ = 1), could not replicate those patterns. Furthermore, a hybrid model composed of two MB systems with small (0.25) and large (0.75) learning rates could not replicate the patterns either.

When fitted to the behavioral data, all three hybrids of MF(

Together, these results rule out the possibility that merely different effective learning rates of the two systems produced the observed behavior.

We further fitted a hybrid MB+MF model with a free learning rate parameter α_{Transition} for updating transitions in the MB system (rather than assuming this parameter was 1). In terms of model comparison, this model fit the data better than the pure MB and MF models, and even slightly better than the hybrid MB(α_{Transition} = 1)+MF model. In all previous analyses, the non-diagonal elements of the covariance matrix were set to zero (i.e., assuming no correlation between free parameters). However, for the hybrid MB+MF model with a free α_{Transition}, we observed strong correlations between parameters when the covariances were allowed to change freely. The α_{Transition} parameter was highly negatively correlated with w_{block 1} and w_{block 2}; we therefore report the hybrid MB(α_{Transition} = 1)+MF model throughout the paper.

Choices were simulated with six different model-based weights (0, 0.2, 0.4, 0.6, 0.8, 1, with n = 16 iterations each) and the mean reward rate was computed. There was a significant difference in reward rate across different w_{MB} values,

(TIF)

Best-fitting parameter estimates over the subjects from model-fitting.

(DOCX)

Integrated Bayesian Information Criterion (iBIC) and negative log-likelihood of all candidate models from model-fitting. The models tested were: pure model-free (“MF”), pure model-based (“MB”), hybrid MB/MF (“hybrid”), hybrid MB/MF with different weights fitted for each of the three 200-trial blocks (“three-block hybrid”), and a hybrid model with different weights fitted for each frequency of contingency changes (“three-frequency hybrid”). The winning model was the three-block hybrid, highlighted in gray, according to iBIC [

(DOCX)

Integrated Bayesian Information Criterion (iBIC) and negative log-likelihood of the winning three-block hybrid model with different weights fitted for each of the three 200-trial blocks and the same model without stay bias, with

(DOCX)

Thanks to Peter Dayan for helpful discussions and comments, and Thomas Akam for comments on the manuscript.