The authors have declared that no competing interests exist.

Current address: CortAIx, Thales, Montreal, Canada

Choice history effects describe how future choices depend on the history of past choices. In experimental tasks this is typically framed as a bias because it often diminishes the experienced reward rates. However, in natural habitats, choices made in the past constrain choices that can be made in the future. For foraging animals, the probability of earning a reward in a given patch depends on the degree to which the animals have exploited the patch in the past. One problem with many experimental tasks that show choice history effects is that such tasks artificially decouple choice history from its consequences on reward availability over time. To circumvent this, we use a variable interval (VI) reward schedule that reinstates a more natural contingency between past choices and future reward availability. By examining the behavior of optimal agents in the VI task we discover that choice history effects observed in animals serve to maximize reward harvesting efficiency. We further distil the function of choice history effects by manipulating first- and second-order statistics of the environment. We find that choice history effects primarily reflect the growth rate of the reward probability of the unchosen option, whereas reward history effects primarily reflect environmental volatility. Based on observed choice history effects in animals, we develop a reinforcement learning model that explicitly incorporates choice history over multiple time scales into the decision process, and we assess its predictive adequacy in accounting for the associated behavior. We show that this new variant, known as the double trace model, predicts choice data better than competing models and shows near-optimal reward harvesting efficiency in simulated environments. These results suggest that choice history effects may be adaptive for natural contingencies between consumption and reward availability. This concept lends credence to a normative account of choice history effects that extends beyond its description as a bias.

Animals foraging for food in natural habitats compete to obtain better quality food patches. To achieve this goal, animals can rely on memory and choose the same patches that have provided higher quality food in the past. However, in natural habitats, simply identifying better food patches may not be sufficient to compete successfully with conspecifics, as food resources can grow over time. Therefore, it makes sense to visit, from time to time, those patches that were associated with lower food quality in the past. This requires optimally foraging animals to remember not only which food patches provided the best food quality, but also which food patches they visited recently. To see if animals track their history of visits and use it to maximize food harvesting efficiency, we subjected them to experimental conditions that mimicked natural foraging behavior. In our behavioral tasks, we replaced food foraging behavior with a two-choice task that provided rewards to mice and humans. By developing a new computational model and subjecting animals to various behavioral manipulations, we demonstrate that keeping a memory of past visits helps the animals to optimize the efficiency with which they can harvest rewards.

Numerous perceptual and decision-making tasks have shown that choices systematically depend on the history of past choices, often referred to as choice history bias [

The best strategy for maximizing the reward rate over all of the available options thus depends not only on the set reward probability of each option, but also on how recently each option has been chosen. To maximize their reward rate, optimal agents should choose the option with the highest set reward probability and switch when the reward probability of an unchosen option overtakes that of the current option. This choice strategy requires optimal agents to guide their decisions based on both past choices and set reward probabilities. However, whether animals use a choice strategy consistent with optimal models of behavior is an open question.

Evaluating whether animals follow optimal foraging principles faces several challenges. This is because, in experimental settings that approximate natural foraging, many variables such as energy, time, and opportunity costs are difficult to measure, and competing models can generate qualitatively similar predictions [

Here, we used a VI task to study foraging behavior from a normative perspective. We acquired behavioral data from both humans and mice and compared the choice history effects that we observed to those generated by optimal agents. By manipulating first- and second-order statistics of the reward outcomes, we characterized the contingency between choice and reward availability that may drive choice history effects. This result prompted us to derive a local decision rule by incorporating the choice history effects into a reinforcement learning (RL) algorithm. We call this the double trace (DT) model because it models choice history with both fast and slow decaying exponential functions. We found that choice history effects observed in species as diverse as mice and humans and including published data from monkeys [

All the experiments on mice were approved by the Danish Animal Experiments Inspectorate under the Ministry of Justice (Permit 2017-15-0201-01357) and were conducted according to institutional and national guidelines. The human behavioral task was not considered a health science research project and did not require approval by the Science Ethics Committee of Central Jutland Region, Denmark.

Water-deprived male mice (C57Bl/6J strain) were trained to initiate the trials by poking their noses into a central port equipped with sensors (infrared light emitter and phototransistor) that detected the exact time of entry and exit of animals. After poking their noses into the central port, the mice were free to poke their noses into the left or right sides of the port. Rewards were delivered inside each port via a metal tube using solenoid valves (

The VI task was implemented in the form of a computer game, and performed by 19 individuals (aged 18–60, 12 identified as females, 6 as males and 1 as other). The computer game was written on the Unity platform (

While optimal agents do not necessarily offer biologically realistic models of choice, they are still informative both conceptually, and in terms of providing upper bounds of performance given their specific constraints and assumptions. We constructed the following three optimal agents:

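The choices of an Oracle agent follow a simple rule: choose the option with the highest reward probability on the current trial, $\arg\max_{i \in A} P(R \mid i, t)$, where $a_{i,t} = 1$ when an agent chooses option $i$ at trial $t$ and $a_{i,t} = 0$ otherwise.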
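The greedy rule of the Oracle agent can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the baiting form `1 - (1 - p_set) ** (T + 1)` is the usual discrete VI update and is assumed here rather than taken from the text, and the names `baited_reward_prob` and `run_oracle` are hypothetical.

```python
import random


def baited_reward_prob(p_set, trials_since_chosen):
    """Reward probability of one option under a discrete VI (baiting) schedule.

    Assumed form: a baited reward stays available until it is collected, so
    the probability grows with the number of trials the option was unchosen.
    """
    return 1.0 - (1.0 - p_set) ** (trials_since_chosen + 1)


def run_oracle(p_set, n_trials, seed=0):
    """Greedy agent that knows the true reward probability on every trial."""
    rng = random.Random(seed)
    unchosen = [0] * len(p_set)  # trials since each option was last chosen
    choices, rewards = [], []
    for _ in range(n_trials):
        probs = [baited_reward_prob(p, k) for p, k in zip(p_set, unchosen)]
        choice = max(range(len(p_set)), key=lambda i: probs[i])
        reward = 1 if rng.random() < probs[choice] else 0
        # The chosen option resets; every other option keeps accumulating.
        unchosen = [0 if i == choice else k + 1 for i, k in enumerate(unchosen)]
        choices.append(choice)
        rewards.append(reward)
    return choices, rewards


# Set reward probabilities of 0.1 (option 0) vs. 0.4 (option 1).
choices, rewards = run_oracle(p_set=[0.1, 0.4], n_trials=1000)
```

Under these assumptions the agent mostly stays with the richer option and briefly visits the leaner one once its baited probability has grown past 0.4, which is the kind of regular switching that a choice history regression would pick up.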

The IB agent was constructed to infer the set reward probabilities _{set}(_{i} is the number of trials since the option

The IB agent followed a softmax selection rule

The multiplication of the prior probability of estimated set reward probabilities _{n}, drawn from a uniform probability distribution with boundaries [-0.025,0.025], was added to the new estimates of the set reward probabilities _{n} is added to the new estimates of the set reward probabilities, it is possible that some of these new estimates

A new iteration starts with trial

Unlike standard RL models, the LK model (named after the first and last authors of this paper) concurrently performs two interdependent approximations. One estimates the set reward probabilities, while the other approximates the “baiting state” of the environment. By combining these two, the LK model approximates the actual reward probabilities.

The reward probability observed by an agent at trial _{set}(_{i} is the number of trials since the option i was chosen. For cases where _{set}(_{set}(_{i} > 0, the reward probability can be described as

While this model is sufficient to solve a VI task, it is possible to modify it to make it adaptable to a wider number of tasks. In particular, the LK model can be adaptable to tasks where there is no baiting structure, such as in the armed bandit tasks. It can even incorporate a continuous variable that defines the influence of baiting on the upcoming reward probability. To control for these variants, a dynamic factor _{ψ}(_{ψ}(_{ψ} ∈ [0, 1] is a fixed learning rate. A condition for this algorithm is that the rewards _{ψ}(

The influence of the past rewards _{Right}(_{Left}(_{Right}(_{Right}(_{Left}(

The choice vector was defined as ^{1} and ^{2} from lasso and ridge regression [_{log−odds}), where _{log−odds}) and λ is the tuning parameter of this penalty term. The tuning parameter λ was selected after observing a minimum on the deviance in a cross-validation process with different λ values. The penalty term interpolates between the ^{1} and ^{2} norms as follows:

In simple words, the regression coefficients that had the minimum mean deviance (or, equivalently, the maximum mean log-likelihood) as a function of the tuning parameter λ in a five-fold cross-validation process were selected as the estimated coefficients
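As an illustration of how an elastic-net penalty interpolates between the lasso and ridge penalties, the following sketch uses one common parameterization. Whether the analysis used exactly this form is an assumption, and the function name `elastic_net_penalty` and the mixing parameter `alpha` are hypothetical labels.

```python
def elastic_net_penalty(coefs, alpha):
    """Elastic-net penalty for a list of regression coefficients.

    alpha = 1 gives the lasso (L1) penalty and alpha = 0 gives half the
    squared L2 norm (ridge); values in between mix the two.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared


# The penalty is scaled by the tuning parameter lambda and added to the
# deviance; lambda itself would be chosen by cross-validation.
penalty = elastic_net_penalty([1.0, -2.0], alpha=0.5)
```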

Extended regression model includes effects of unsigned rewards _{Right}(_{Left}(_{Right}(_{Left}(_{Right}(_{Left}(_{Right}(_{Left}(

The choice history _{i}(1, 2, …, _{t−1} to compute the action values and determine the probability of choosing an action in the current trial _{i,t} is determined by the particular RL model (see more details in the description of RL models). Note that the softmax selection rule is not limited to two options, but the subsequent description of the optimization steps are limited to two option setups.
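A softmax selection rule of the kind referred to above can be sketched as follows. The inverse temperature name `beta` and the function name are assumptions made for illustration; the action values themselves would come from whichever RL model is being fitted.

```python
import math


def softmax_choice_probs(values, beta):
    """Convert action values into choice probabilities.

    beta is an inverse temperature: larger values make choices more
    deterministic, beta = 0 makes all options equally likely.
    """
    scaled = [beta * v for v in values]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_choice_probs([0.2, 0.6], beta=3.0)
```

With two options, this reduces to a logistic function of the scaled value difference, which is what links the softmax rule to the logistic regression analysis.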

This probability _{t} = _{i,t} at time _{RL}(

To find the optimal parameters _{RL,1} from a uniform distribution, where the boundaries of the parameter space were _{F} ∈ [0, 1], _{M} ∈ [0, 1] and _{S} ∈ [0, 1]. Next, we took 1% of the combinations with the maximum mean log-likelihood and used the mean _{RL,1,1%}) of this subset to draw a new set of 1,000 combinations as follows:

We repeated this process several times to narrow the original parameter space until the highest log-likelihood (a negative value or zero) was less than 99.99% of its value for two iterations in a row. The optimal parameters for the prediction of each model
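The iterative narrowing of the parameter space can be sketched as a random search that resamples around the best-scoring combinations. This is a simplified illustration, not the authors' procedure: the resampling rule (uniform within three standard deviations of the elite mean, clipped to [0, 1]), the fixed number of iterations, and all names are assumptions.

```python
import random
import statistics


def narrowing_random_search(objective, n_params, n_samples=1000,
                            elite_frac=0.01, n_iters=10, seed=0):
    """Maximize objective over [0, 1]^n_params by repeated resampling."""
    rng = random.Random(seed)
    lo, hi = [0.0] * n_params, [1.0] * n_params
    best_params, best_score = None, float("-inf")
    for _ in range(n_iters):
        samples = [[rng.uniform(lo[d], hi[d]) for d in range(n_params)]
                   for _ in range(n_samples)]
        samples.sort(key=objective, reverse=True)
        elite = samples[:max(2, int(elite_frac * n_samples))]
        top_score = objective(elite[0])
        if top_score > best_score:
            best_params, best_score = elite[0], top_score
        # Shrink the search box around the elite subset (assumed rule).
        for d in range(n_params):
            column = [s[d] for s in elite]
            mean, sd = statistics.mean(column), statistics.stdev(column)
            lo[d] = max(0.0, mean - 3.0 * sd)
            hi[d] = min(1.0, mean + 3.0 * sd)
    return best_params, best_score


# Toy objective standing in for a mean log-likelihood (maximum at 0.3, 0.7).
def toy_log_likelihood(params):
    return -((params[0] - 0.3) ** 2 + (params[1] - 0.7) ** 2)


best, score = narrowing_random_search(toy_log_likelihood, n_params=2)
```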

The optimised parameters for each model and the estimated coefficients were trained and tested via five-fold cross-validation on the behavioural data in order to obtain the average of the minimum negative log-likelihood and the average AUC. These latter metrics were used for model selection and the goodness of prediction for each model, respectively. To compute the AUC, the probability _{i,t} as the label. We used MATLAB function
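For readers unfamiliar with the AUC, it can be computed directly as the probability that a randomly drawn positive trial receives a higher predicted probability than a randomly drawn negative trial. The sketch below is a plain pairwise implementation for illustration only and is not the MATLAB routine used in the analysis; the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and model scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Labels are the observed choices; scores are predicted choice probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```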

Regret _{1} uniformly 10,000 times and by simulating one session of _{bl} ∈ {50, 100, 500} with two pairs of set reward probabilities, 0.10 vs. 0.40 and 0.40 vs. 0.10, for the left and right sides, respectively. The second condition of the VI task has three different pairs of set probabilities, 0.20 vs. 0.80, 0.30 vs. 0.70 and 0.40 vs. 0.60, and one block size _{bl} = 100. The boundaries of the parameter space were _{F} ∈ [0, 1] and _{S} ∈ [0, 1]. Since the number of combinations was insufficient for a thorough search of the parameter space, we selected the 5% with the lowest regret (

We thereby realised a local search algorithm in our multidimensional parameter space, extending three standard deviations around the top 5% of the previous parameter combinations and selecting a new random set of parameters _{2}, 5000 times:

We repeated the selection of the best parameters for _{2} in the same way as described for _{1}. It is important to mention that at every iteration, a different random number generator was selected to prevent running the generator under the same probabilities during each trial and to give the model the opportunity to explore different probabilistic scenarios. However, because of this random selection, low regret was not guaranteed for these combinations. To overcome this potential problem, we took the 10% of _{2} with the lowest regret, having

The probability that a foraging animal will choose the left side in a discrete version of a VI task with two choices is _{t} = _{t} = _{l,t} of the reward _{R}, _{C}, plus a constant bias _{i,t} = 1 and the reward at time _{t} = 1.

The results of our experiments and in previous studies [_{R,m} and _{C,m} decay, as a function of the m trial history. We propose that these decays for the reward and choice history be modelled according to the following equation:
_{F} and _{S} characterise the decay rate for choices according to:

The function _{t} of the logistic regression was separated into reward and choice terms to ensure simplicity in the next analysis.

In the next section, we show how these equations were used to derive the recursive DT update rule:

Like the logistic regression model _{i,t} is the reward expectation value of a given action (or option) _{F} and _{S}, and which can generate a vast diversity of curves. The softmax function then takes the following form:

Next, we show the equivalence of the

For the reward expectation, it was already shown that the logistic regression model is equivalent to the F-Q model [

There are enough trials

Under this assumption, we define

It follows that,

In the next section, we demonstrate how to reconstruct the proposed choice history decay observed in the experimental data in a recursive manner:

We introduce a recursive function that we call a choice trace. Its equivalence in logistic regression is shown in the next three equations. We follow similar derivations as shown by Katahira in his work [

We show the equivalence of the choice trace with the logistic regression in the following derivation:
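The core of this equivalence can be checked numerically: a recursively updated trace equals an exponentially weighted sum of past choices, which is the form that history regressors take in a logistic regression. The sketch below assumes a trace update of the form `(1 - tau) * trace + tau * choice` with a zero initial value; the learning rate name `tau` and both function names are hypothetical.

```python
def recursive_choice_trace(choices, tau, initial=0.0):
    """Update a choice trace trial by trial (choices coded 0 or 1)."""
    trace = initial
    for a in choices:
        trace = (1.0 - tau) * trace + tau * a
    return trace


def weighted_choice_history(choices, tau):
    """Same quantity written as a weighted sum over trials back in history.

    The choice made m trials ago carries weight tau * (1 - tau) ** m, so
    recent choices dominate and older ones fade exponentially.
    """
    n = len(choices)
    return sum(tau * (1.0 - tau) ** m * choices[n - 1 - m] for m in range(n))


history = [1, 0, 0, 1, 1, 0, 1]
recursive_value = recursive_choice_trace(history, tau=0.3)
summed_value = weighted_choice_history(history, tau=0.3)
```

A large `tau` yields a fast trace dominated by the last few choices, while a small `tau` yields a slow trace that integrates over many trials.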

Under the same assumption as that of the reward expectation, when the number of trials is high enough

When

Finally,

Therefore, the difference between the two choice traces with the same learning rate can be represented as follows:

Assuming that

We then show how, using the fast and slow choice traces, we can recover

Thus,

The indirect actor model updates the reward expectation (or state-action value) Q only for the chosen option.

The reward expectation of the direct actor model is updated based on the probability of the chosen action and the reward outcome. This rule also affects the reward expectation value of the unchosen actions.

Here c is a parameter that we fit to the behavioral data and can be seen as average reward rate in R-learning models [_{t}), it can be considered as a policy update model [

The F-Q model is a slight modification of the indirect actor model, in which the reward expectation values of the unchosen actions are forgotten and decay toward zero.
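The difference between the two update rules can be made concrete with a short sketch. It assumes a single learning rate `alpha` that also serves as the forgetting rate for unchosen actions; that shared rate and the function names are assumptions made for illustration.

```python
def indirect_actor_update(q, chosen, reward, alpha):
    """Update the value of the chosen action only."""
    new_q = list(q)
    new_q[chosen] += alpha * (reward - q[chosen])
    return new_q


def forgetting_q_update(q, chosen, reward, alpha):
    """F-Q style update: unchosen values also decay toward zero."""
    new_q = [(1.0 - alpha) * value for value in q]  # decay every value
    new_q[chosen] += alpha * reward                 # chosen one also learns
    return new_q


q_start = [0.5, 0.5]
q_indirect = indirect_actor_update(q_start, chosen=0, reward=1, alpha=0.2)
q_forgetting = forgetting_q_update(q_start, chosen=0, reward=1, alpha=0.2)
```

Both rules treat the chosen action identically; they differ only in whether the unchosen value stays at 0.5 or shrinks.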

According to the F-Q W/C model, the probability of taking the next action will depend not only on the reward expectation but also on choice trace F updated according to the following:

While the F-Q model decreases the value of the unchosen actions, the proposed model increases their value up to _{up}, which acts as a positive counter for unchosen actions that adds to the action value.

The initial value of _{i,1} is set from a uniform random distribution constrained between [0, 1].

All notations and symbols are provided in

The following text is the instruction (verbatim) given to the subjects who participated in human-VI task:

The analyses of the behavioural and computational models were performed using MATLAB (MathWorks, Inc.).

To understand the adaptive function of choice history effects in animals, we first tested if they would emerge in optimal models of behavior. To do this, we constructed several optimal agents that maximize reward rates in the VI task. This task is a variant of the two-alternative forced-choice (2AFC) task with a discrete version of VI schedule of reinforcements [_{i,t} = 1, when an agent chooses option _{i,t} = 0. Here, _{set}(

(A) Schematic of the VI task and the equation for updating the reward probabilities of each option i (A vs. B or left vs. right). Once the agent initiates a trial at t, it can choose between options A or B. Each option has its own dynamic probability of reward delivery P(R). (B) Change in reward probability in the VI task is determined by the set reward probability and the update rule following the baiting equation depicted in (A). (C) The left-hand panel shows a scheme of how the Oracle agent solves the VI task. The Oracle agent “knows” the exact reward probabilities for each option in each trial. It chooses the option with the highest probability of reward in a greedy manner. After making a choice, it observes a reward or no reward. The second panel shows the influence of past choices on current choice when the set reward probabilities of each option are sampled from a uniform distribution ∈ [0,0.50]. The third panel shows the influence of past choices on the current choice with the set reward probabilities of 0.1 vs. 0.4 for the left and right options, respectively. In the rightmost panel, we added 0.25 vs. 0.25 reward probability trials to the previous set reward probabilities, with block transitions as described in the main text. The x-axis depicts trials back in history and the y-axis logistic regression coefficients. (D) In the left-hand panel, we show how an IB agent solves the VI task. First, at trial t, the IB agent samples the estimated set reward probabilities for each option i from prior distributions. Then, it uses the baiting equation to estimate the reward probabilities for each option. Using a softmax action selection rule, it makes a decision. After observing a reward or the absence of a reward, it updates the priors for the next trial. The rightmost panel shows the influence of past choices on the IB agent’s next decision, when the VI task had set reward probability pairs of 0.10 vs. 0.40, 0.4 vs. 0.1, and 0.25 vs. 0.25. The VI task included possible changes in the set reward probability pairs established at each block transition with a duration of 50 to 150 trials (randomly sampled from a uniform distribution). (E) The left panel shows how a reinforcement learning (RL) agent using a newly derived LK model solves the task. According to the LK model, the action values are converted into choices using the baiting update equation and the softmax selection rule. After observing a reward or the absence of a reward, the model updates the action values and the estimated baiting rate for the next trial. The right panel shows the influence of past choices on the LK model’s next decision, when the VI task had set reward probability pairs of 0.10 vs. 0.40, 0.4 vs. 0.1 and 0.25 vs. 0.25. We used the same block structure for the LK model as the one used for the IB agent.

We studied the performance of several optimal agents, increasing in their biological plausibility: 1. The Oracle agent which has full knowledge of the objective reward probabilities and task structure (

In optimal agents we focused our analysis only on the choice history effects as reward probabilities are “known” or quickly inferred by these agents and do not contribute significantly to current choices. To quantify the effect of past choices on current choices, we used the agent’s past choices to predict its current choices. To do this, we analyzed the choice dynamics using a regularized regression analysis with a cross-validated elastic net (see linear regression in the
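Predicting the current choice from past choices requires a design matrix in which each row holds the choices made one, two, and more trials back. The sketch below shows one way to build it; the +1/-1 coding of right and left choices and the function name are assumptions made for illustration.

```python
def lagged_design_matrix(choices, n_back):
    """Build history regressors from a sequence of choices.

    choices: list of +1 (right) and -1 (left), one entry per trial.
    Returns (rows, targets): rows[k][m - 1] is the choice m trials before
    the k-th predicted trial; targets is 1 for right and 0 for left.
    """
    rows, targets = [], []
    for t in range(n_back, len(choices)):
        rows.append([choices[t - m] for m in range(1, n_back + 1)])
        targets.append(1 if choices[t] == 1 else 0)
    return rows, targets


rows, targets = lagged_design_matrix([1, -1, 1, 1, -1], n_back=2)
```

A regularized logistic regression fitted to `rows` and `targets` then yields one coefficient per trial back, where negative coefficients indicate alternation and positive ones perseverance.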

We tested the performance of the Oracle agent using three different conditions. These conditions varied from more general to specific task conditions that we also used for mice and humans. 1. The set reward probabilities were drawn from a uniform distribution _{set}(

For the IB agent, we used n = 100 sessions and for the LK model we used n = 5000 sessions. The influence of past rewards and past choices for all optimal agents was analyzed using regression with LASSO regularization. The coefficients with the lowest deviance were selected in a five-fold cross-validation process (see

For the Oracle agent, the regression analysis revealed a strong alternation effect for immediate past choices and a perseverance effect for choices further back in the past (

Very similar choice history effects were observed in the IB agent (

Finally, we tested the RL agent (LK model), which learns the action values of the two options by estimating the baiting rate (LK model in

Next, we sought to elucidate choice history effects in mice and humans performing the VI task. In contrast to the optimal agents, set reward probabilities as well as the baiting structure of the task were hidden from the mice and humans. We reasoned that mice and humans could estimate reward probabilities based on their reward history. Therefore, we anticipated that our test subjects may need to keep track of the history of rewards and choices. The subjects in the VI task had to choose either the left- or right-side option after initiating the trial by pressing the space bar (

(A) Snapshot of the VI task in a computer game played by human participants. After opening a virtual fence by pressing the space bar on the keyboard, subjects had to wait from 0 to 5 seconds before making the decision to press the left or right key. The rewarded trials were indicated by a collection of virtual apples. (B) The scheme of the VI task adapted for water-deprived mice. The rodents had to poke the center port to start a trial and wait at the center port from 0.2 to 0.4 s before choosing the right or left port. In the case of rewarded trials, the mice received 2 μL of water. (C) The influence of past rewards and choices on current choice for rodents (219 sessions with 7 mice) and human subjects (25 sessions with 19 subjects) was analyzed by logistic regression with LASSO regularization. The coefficients with the lowest deviance were selected in a five-fold cross-validation process. Videos are available for this figure as follows: _{F} = 0.71, _{S} = 0.24, _{F} = 0.56, _{S} = 0.40,

Logistic regression with elastic-net regularization (Linear regression for reward and choice history effects in

One alternative interpretation of choice history effects is that they reflect some reward contingencies. We decomposed reward history effects into multiple components. Specifically, we sought to reveal the contribution of past rewards unlinked to the previous choice (unsigned rewards), the past reward contribution linked to past choice, and the contribution of past no rewards linked to past choices, to current choices (see extended linear regression model in

Based on our previous results (_{i,t} denotes reward expectation, _{i,t} is the fast choice trace, _{i,t} is the slow choice trace, and _{F} or _{S}. _{F} ≥ 0.5 and _{S} < 0.5). The reward expectation and choice traces are updated with the learning rates of _{F}, and _{S} as follows:

Here, _{i,t} = 1 when an agent chooses option _{i,t} = 0 otherwise. The outcome _{t} is equal to 1 at a time
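The structure of the DT model, one reward expectation plus a fast and a slow choice trace per option, can be sketched as below. This is a hedged illustration rather than the published update rule: the symbol names (`alpha`, `tau_f`, `tau_s`, `beta`, `phi`, `theta`), the forgetting-style value update, and the way the three terms are summed inside the softmax are all assumptions.

```python
import math


def dt_choice_probs(q, fast, slow, beta, phi, theta):
    """Softmax over a weighted sum of value, fast trace and slow trace."""
    logits = [beta * qi + phi * fi + theta * si
              for qi, fi, si in zip(q, fast, slow)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dt_update(q, fast, slow, chosen, reward, alpha, tau_f, tau_s):
    """One-trial update of reward expectations and both choice traces."""
    new_q, new_fast, new_slow = [], [], []
    for i in range(len(q)):
        a = 1 if i == chosen else 0  # indicator of the chosen option
        new_q.append((1.0 - alpha) * q[i] + alpha * a * reward)
        new_fast.append((1.0 - tau_f) * fast[i] + tau_f * a)
        new_slow.append((1.0 - tau_s) * slow[i] + tau_s * a)
    return new_q, new_fast, new_slow


# One rewarded choice of option 0, starting from zero values and traces.
q, fast, slow = dt_update([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                          chosen=0, reward=1,
                          alpha=0.5, tau_f=0.7, tau_s=0.2)
probs = dt_choice_probs(q, fast, slow, beta=1.0, phi=-2.0, theta=1.0)
```

In this toy setting, a negative weight on the fast trace pushes the agent away from the option it has just chosen, while a positive weight on the slow trace favors options chosen often over a longer horizon.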

To evaluate a variety of different RL models against the DT model, we performed a model comparison and evaluated the comparative predictive accuracies of the various models. Here, we briefly explain how the DT model compared to the other known RL models when computing choice probabilities. These models differ in how they update the value of unchosen options and whether choice history is incorporated into the value update process. The indirect actor model updates the values _{it} for each option

The model selection criteria were determined based on a test involving the negative log-likelihood. Briefly, the behavioral data (choice and reward histories) were concatenated across all sessions and all subjects and were split into five parts. The model parameters were computed on 4/5 of the data with a 5-fold cross-validation method. The different models were tested on the remaining 1/5 of the data by computing the negative log-likelihood of the choices (see _{F} and _{S}) compared to temperature (
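The cross-validated negative log-likelihood can be sketched in two small helpers: one that splits the concatenated trials into five contiguous parts, and one that scores the probabilities a fitted model assigned to the choices actually made. Contiguous splitting and the function names are assumptions; fitting the model on each training split is left out.

```python
import math


def contiguous_folds(n_trials, k=5):
    """Return k (train_indices, test_indices) pairs over contiguous blocks."""
    bounds = [round(i * n_trials / k) for i in range(k + 1)]
    folds = []
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n_trials))
        folds.append((train, test))
    return folds


def negative_log_likelihood(probs_of_observed_choices):
    """Sum of -log p over trials, where p is the model probability of the
    choice that was actually made (lower is better)."""
    floor = 1e-12  # guards against log(0)
    return -sum(math.log(max(p, floor)) for p in probs_of_observed_choices)


folds = contiguous_folds(n_trials=10, k=5)
nll = negative_log_likelihood([0.5, 0.5])
```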

Cross-validation tests for model comparison may favor more complex models [

We showed that reward history and choice history exert separable effects on current choices. First, we found that the first regression coefficients for rewards and choices were significantly different from each other, both in mice (rewards: mean(SD) = 1.34(0.42); choices: mean(SD) = -2.4(1.29); p = 0.0006, Mann-Whitney U test, n = 7 mice) and in humans (rewards: mean(SD) = 1.39(1.37); choices: mean(SD) = -1.17(1.93); p = 0.00001, Mann-Whitney U test, n = 19 human subjects). From this finding, it was not clear whether these effects reflected the same or different computational processes. To interrogate what drives choice and reward history effects, we manipulated the reward outcome statistics over three different dimensions, and performed new behavioral manipulations on a new set of mice (n = 4). Specifically, we manipulated one of three behavioral contingencies: experienced total-reward rate, difference in set reward probabilities, and volatility, while holding the other two contingencies constant. Here we define volatility in terms of the length of the block (number of trials) that maintained the same set reward probabilities. Thus, volatility was inversely proportional to the block length. In addition to reward and choice history effects, we also included reaction time (RT) as a metric of an animal’s performance. We defined RT as the time from the animal leaving the center port to the animal poking one of the side ports. It has been reported that RTs can be driven by reward expectations [

The effects of volatility on behavioral performance were analyzed by pooling the set of behavioral sessions in which block size was manipulated (4 animals, 44 sessions) and dividing the sets into three groups (terciles) of block lengths of 14–54, 55–96, or 97–136 trials per block in a session. Our results showed that the volatility of the reward outcomes affected the reward history effects. The strongest effects were observed on the first regression coefficient for past rewards (1^{st} tercile mean(SD) = 1.86(1.59), 2^{nd} tercile mean(SD) = 1.24(0.89), 3^{rd} tercile mean(SD) = 0.34(0.38), Pearson correlation coefficient (r) = -0.35, p = 0.0032, 95% CI [-0.54 -0.12]), while the regression coefficients for past immediate choices (1^{st} tercile mean(SD) = -3.66(1.9), 2^{nd} tercile mean(SD) = -2.49(1.88), 3^{rd} tercile mean(SD) = -2.95(1.47), r = 0.24, p = 0.043, 95% CI [0.01 0.45]) and RTs (1^{st} tercile mean(SD) = 0.19(0.04), 2^{nd} tercile mean(SD) = 0.2(0.05), 3^{rd} tercile mean(SD) = 0.24(0.04), r = 0.29, p = 0.016, 95% CI [0.06 0.49]) were less affected (

Sessions sorted into three groups, or terciles (14–54, 55–96 and 97–136 trials for each group), based on the number of trials per block condition (4 mice in 44 sessions). The set reward probabilities were 0.1:0.4 and 0.4: 0.1 for left vs. right options respectively. (A) We show reward (left panel) and choice history effects (right panel) for trials sorted based on block length. (B) shows the average number of rewards per trial obtained by the mice (left panel) and reaction time (right panel). The MATLAB function boxplot depicts median (circle), 25th to 75th percentile of data as edges, all extreme data points with whiskers, and outliers with red crosses. Sessions sorted into three groups based on the difference (Δ) in set reward probabilities (4 mice in 48 sessions). The set reward probabilities pair per session were 0.4:0.6, 0.3:0.7, and 0.2:0.8 for left vs. right and right vs. left options. (C) shows reward (left panel) and choice history effects (right panel) for each group. (D) Shows the average number of rewards per trial obtained by the mice (left panel) and reaction time (right panel) as in B. Sessions sorted in three groups based on the average number of rewards collected per trial (4 mice in 48 sessions). The set reward probabilities pair per session were 0.4 vs. 0.6, 0.3 vs. 0.7, and 0.2 vs. 0.8. (E) shows reward (left panel) and choice history effects (right panel). (F) shows the average number of rewards per trial obtained by the mice (left panel) and reaction time (right panel), as in B and D. The first regression coefficients were used for statistical comparisons across different conditions. The Pearson correlation coefficient of the block lengths, difference in set reward probabilities, and rewards per trial with the regression coefficients of reward and choices one trial back is reported as R within the plots, with their corresponding significance labelled as

The effects of the difference in the set reward probabilities (0.8:0.2, 0.7:0.3, 0.6:0.4) on mouse performance were analyzed by running animals on a different set of sessions (mice, n = 4; sessions, n = 48). Once more, we partitioned the behavioral sessions into terciles based on the difference in the set reward probabilities. The choice history effects, analyzed for the first regression coefficients, showed significant changes in response to the difference in the set reward probabilities (1^{st} tercile mean(SD) = -3.66(1.48), 2^{nd} tercile mean(SD) = -3.64(1.78), 3^{rd} tercile mean(SD) = -1.4(2.56), r = 0.42, p = 0.0032, 95% CI [0.15 0.62]), while reward history effects, also analyzed for the first regression coefficients (r = -0.15, p = 0.32, 95% CI [-0.41 0.15]), and RTs (r = -0.064, p = 0.66, 95% CI [-0.34 0.22]) did not show significant changes (

Next, we partitioned the same set of sessions (animals, n = 4; sessions, n = 48) used in the previous analysis into terciles based on the reward rate experienced by the animal per trial. We found that RTs changed as a function of the experienced reward rates, showing a significant and strong positive correlation (1^{st} tercile mean(SD) = 0.14(0.01), 2^{nd} tercile mean(SD) = 0.17(0.03), 3^{rd} tercile mean(SD) = 0.2(0.03), r = 0.51, p = 0.0002, 95% CI [0.26 0.69]) with the overall experienced reward rates (

Finally, we note that the choice history effects analyzed here (

The DT model is built based on the animal’s choice strategy (_{t} collected by an agent in a session of

The optimization of the DT model was slightly different than that of the other RL models. The animal-derived parameters for

We examined the reward and choice trace of the optimized DT model to see if it retained the characteristic shape of reward and choice history effects observed in animals (

First, we manipulated the difference in the set reward probabilities as we did for the mice (

(A) Regret for different RL models as a function of the difference between the set reward probabilities. The set reward probabilities of the task were 0.45:0.05, 0.4:0.1, and 0.35:0.15 for left vs. right and right vs left options. The block length was fixed to 100 trials. (B) We show the reward and choice traces of the optimized DT model for these pairs of set reward probabilities. (C) Regret for different RL virtual agents as a function of the volatility (block lengths of 50, 100, and 500). The set reward probabilities of the task were 0.1:0.4 and 0.4:0.1 for left and right options respectively. (D) The reward and choice traces of the optimized DT model under these three volatility conditions. For (A) and (C), the regret is defined as the number of rewards away from the optimal collection by the ideal observer in a virtual session with 1,000 trials as defined by

Second, we tested the optimized DT model parameters under different volatilities. Again, the RL model with choice trace achieved a near optimal performance (

In this study, we addressed the contingencies between behavior and reward availability, and how these may configure choice history effects. We examined this question from the narrow perspective of biological and synthetic agents that attempt to maximize their reward rates. We showed that similar choice history effects emerge in both synthetic and biological agents, suggesting that these effects serve an adaptive function. This is further supported by our findings and published data in mice, rats, monkeys and humans [

Here, we note a number of limitations that accompany our studies and offer counterarguments that may mitigate these concerns. Although arguably more ecologically valid than experiments in which choice history is decoupled from reward availability, our task could still be limited in its ecological validity. This issue could be remedied by employing more naturalistic foraging fields and by affording greater degrees of motoric freedom when foraging. A number of studies in rodents and humans have explored more ecologically realistic scenarios with innovative task designs [

Animal behavior must be tuned to environmental statistics to maintain its adaptive function [

If choice history effects emerge as an evolutionary adaptation to natural habitats, then the full suppression of this behavior, even in well-trained animals, in perceptual or value-based decision-making tasks might be difficult. Indeed, in many decision-making tasks that do not impose contingencies of previous choices on current choices, choice history effects persist [

In terms of choice history effects, the possibility that short-term alternation is driven by no rewards, and that long-term is driven by previous choices, raises questions regarding their neural representation. Are these slow and fast processes encoded by distinct neural populations? Or do they emerge from the same neural circuits at different time scales by virtue of mixed selectivity [

What benefits could choice history effects, and their implementation in the form of an RL model, provide to foraging agents that decide between engaging with the current option and searching for better alternatives? Two major classes of models can formalize these types of decision-making processes. Models based on the MVT state that optimal agents should leave the food patch when the reward rate of that patch drops below the average reward rate of the habitat [

Here we show that mice and humans can express similar choice history effects when searching for rewards in habitats where past choices affect future reward availability. By simulating agents in equivalent environments, we found that the same choice history effects yielded gains in the efficiency with which rewards are harvested. The double trace model that we propose, which incorporates choice history explicitly in its architecture, was advantageous over competing models, both in its performance in predicting choice behavior and in the efficiency with which it can harvest rewards. Taken together, we provide an initial explanatory and algorithmic account of choice history effects beyond their description as a bias, connecting this concept to a broader class of optimality models within behavioral ecology.

(A) Fractional difference between consecutive choices and alternation on the last alternation trial in a two-alternative task under the baiting schedule (VI task), where the set reward probability of each option is drawn from a uniform distribution between 0 and 1. The fractional difference was computed by subtracting the consecutive choices from alternation and dividing that difference by the total number of choices on each trial from the entire simulation set (n = 200000). (B) Average regret (missed rewards relative to the reward collection of the Oracle agent) per trial in 100 sessions of an agent making random choices, the IB agent and the LK model. Sessions had

(EPS)
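
The baiting schedule and the regret measure described in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code. It assumes the standard baiting rule, in which an armed reward stays available until collected, so that an option left unchosen for n trials pays out with probability 1 - (1 - p_set)^(n + 1). The function names, the greedy Oracle rule, and the session counts are hypothetical choices made for the example.

```python
import random

def bait_probability(p_set, n_unchosen):
    """Assumed baiting rule: reward probability after n_unchosen skipped trials."""
    return 1.0 - (1.0 - p_set) ** (n_unchosen + 1)

def random_policy(p_set, unchosen, rng):
    """Agent that ignores all history and picks either option at random."""
    return rng.randrange(2)

def oracle_policy(p_set, unchosen, rng):
    """Greedy agent that knows the set probabilities (hypothetical Oracle)."""
    probs = [bait_probability(p_set[i], unchosen[i]) for i in (0, 1)]
    return 0 if probs[0] >= probs[1] else 1

def run_session(policy, p_set, n_trials, seed):
    """Simulate one session of the two-option baited task; return rewards earned."""
    rng = random.Random(seed)
    baited = [False, False]   # is a reward currently waiting at each option?
    unchosen = [0, 0]         # trials since each option was last chosen
    rewards = 0
    for _ in range(n_trials):
        # Arm each empty option with its set probability; armed rewards persist.
        for i in (0, 1):
            if not baited[i] and rng.random() < p_set[i]:
                baited[i] = True
        choice = policy(p_set, unchosen, rng)
        if baited[choice]:
            rewards += 1
            baited[choice] = False
        unchosen[choice] = 0
        unchosen[1 - choice] += 1
    return rewards

def mean_regret_per_trial(policy, p_set, n_sessions=50, n_trials=1000):
    """Average rewards missed per trial relative to the greedy Oracle."""
    total = 0.0
    for s in range(n_sessions):
        oracle = run_session(oracle_policy, p_set, n_trials, seed=s)
        agent = run_session(policy, p_set, n_trials, seed=10_000 + s)
        total += (oracle - agent) / n_trials
    return total / n_sessions
```

Under these assumptions, with set probabilities of 0.4 and 0.1 the greedy agent samples the leaner option only once its baited probability exceeds 0.4, which happens after four unchosen trials. That is the kind of occasional switch to the lesser option that the choice history effects describe.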

(A) Reward and choice history effects analyzed by LASSO regression analysis for 19 human subjects. Each line indicates an individual subject. (B) The reward and choice history effects shown for individual mice (n = 7).

(EPS)
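
The regression behind these history effects starts from a matrix of lagged regressors. The sketch below shows one way to build it and is not the authors' pipeline. The signed coding (+1 for right, -1 for left, and 0 for unrewarded trials in the reward regressor) and the function name `lagged_design` are assumptions made for illustration. The resulting rows would then be passed to an L1-regularized logistic regression.

```python
def lagged_design(choices, rewards, n_back):
    """Build lagged reward and choice regressors for a history regression.

    choices: list of 0 (left) or 1 (right), one entry per trial
    rewards: list of 0 (no reward) or 1 (reward), one entry per trial
    n_back:  number of past trials to include as regressors
    Returns (X, y): each row of X holds n_back reward lags followed by
    n_back choice lags (most recent first); y is the current choice.
    """
    # Assumed coding: right = +1, left = -1.
    signed_choice = [1 if c == 1 else -1 for c in choices]
    # Rewarded right = +1, rewarded left = -1, unrewarded = 0.
    signed_reward = [sc * r for sc, r in zip(signed_choice, rewards)]

    X, y = [], []
    for t in range(n_back, len(choices)):
        reward_lags = [signed_reward[t - k] for k in range(1, n_back + 1)]
        choice_lags = [signed_choice[t - k] for k in range(1, n_back + 1)]
        X.append(reward_lags + choice_lags)
        y.append(choices[t])
    return X, y
```

The fitted weights on the reward lags would correspond to the reward history effect and those on the choice lags to the choice history effect, each plotted as a function of trials back.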

(A) LASSO regularized regression analysis (30 trials back) was carried out on past rewards unlinked to the rewarded option (denoted as unsigned rewards here), past choices, past right rewards, past left rewards, past right no rewards and past left no rewards in that order. Six panels show concatenated sessions from 7 mice as in

(EPS)

(A) The DT model with one set of randomly selected parameter values (shown with blue horizontal bars) was run 200 times in a session of 1000 trials. Each session consisted of blocks of 100 trials. Each block had a pair of set reward probabilities, 0.1:0.4 or 0.4:0.1, for left and right options respectively. Next to the true parameter values (blue bars), boxplots (median: red bar; edges: 25th to 75th percentiles; whiskers: all data points except outliers) and black dots show all 200 parameter values recovered by the DT model. (B) The same parameter recovery was done for 1000 randomly selected parameter values, except that this time we performed recovery only once. The scatter plot shows the initial parameter values on the x axis and the recovered values on the y axis. All insets in the lower 3 panels show a zoomed version of the parameter values restricted to the full range of true parameter values for the x and y axes. The initial parameter values were restricted to the following range: _{F} _{S}

(EPS)

Numbers on the x axis indicate different models: 1, Indirect model; 2, F-Q; 3, F-Q up; 4, F-Q W/C; 5, Double trace; 6, LK model; 7, Generalized linear model. (A) Humans, (B) mice.

(EPS)

(A) Sessions of the Oracle agent sorted in three groups by the block length. We used blocks of 50, 100 and 500 trials for 150 sessions with 1000 trials for each session. The pairs of set reward probabilities per session were 0.1:0.4 and 0.4:0.1 for left vs. right options. We show their reward and choice history effects. (B) Sessions of the Oracle agent sorted in three groups by the difference (Δ) in set reward probabilities (150 sessions with 1000 trials each). The pairs of set reward probabilities per session were 0.4:0.6, 0.3:0.7 and 0.2:0.8 for left vs. right options and right vs. left options. We show the reward and choice history effects. The correlations of the first regression coefficients as a function of block length or difference in set reward probability for reward and choices are reported as R (Pearson correlation coefficient) with their corresponding significance labelled as

(EPS)

Sessions of four mice sorted in three groups (terciles) by the mean of the set reward probabilities (62 sessions), in other words, the mean set probability of rewards in a session. The set reward probability of the leaner side was kept at 0.1, while that of the richer side was set to 0.4, 0.5, 0.6 or 0.8 in a session. (A) The corresponding reward history effects and choice history effects. (B) The average number of rewards collected per trial (left panel) and reaction times (right panel). The correlations of the block lengths with the coefficients of reward and choices one trial back are reported as Pearson correlation coefficient (R) on the plots with their corresponding significance labelled as

(EPS)

We thank Larry F. Abbott and Ashok L. Kumar for their suggestions on the DT model and the manuscript. We thank Sophie Seidenbecher and Madeny Belkhiri for their assistance with editing the manuscript. We thank Søren Rud Keiding for his advice and discussions, Eske Nielsen for programming the human game on the Unity platform, and Maris Sala and Daniel Kozlovski for assisting with the data collection.

Dear Dr Kvitsiani,

Thank you very much for submitting your manuscript "Choice history effects in mice and humans improve reward harvesting efficiency" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you will see, although encouraging, all reviewers point out several issues that should be fully addressed. I would like to draw your attention to the parameter recoverability issue raised by two reviewers. I would also like to add that a model recovery analysis is warranted in your case.

All reviewers pointed out several instances where the paper crucially lacks clarity or where non-standard methods are used. We encourage you to clarify these issues and justify the methods.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Stefano Palminteri

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Reviewer #1: In this study, the authors investigated the choice history effects in humans and mice during choice behavior reinforced with the variable interval (VI) schedule. They claimed that the observed choice history effects have an advantage in foraging behavior in a natural foraging environment. Given that the choice history effects have attracted growing attention, studies on their adaptive functions are of much interest. But I think there are many issues to be fixed in this paper: there are several places where the methods are not clear, and the validity of the models and analysis is somewhat doubtful. Perhaps this is due to the authors' use of their customized method without being fully aware of standard modeling methodologies. I think that the authors should make sufficient efforts to fix these issues. Below are the main issues, especially regarding the methods. But if the authors are willing to revise the manuscript, I recommend that the authors review and revise the entire manuscript.

1. Task description

The behavioral task descriptions are insufficient, so readers cannot obtain the information needed to reproduce the experiments. For example, the paper contains no information about the sex of the subjects (of both mice and humans). How many trials did mice experience per session? How long is each session? How did the authors create the program of the computer game for human experiments? Was this done online or in a lab? Why are the trial numbers different across subjects? Also, there is no information on the experimental equipment for mice.

2. Models

line 571- The Oracle agent

- I could not follow why Eq (7) looks like this. Where did “(1 – P(R|I,t))(1 - \delta_{I,t})” come from? What exactly does "during block transition" (line 575) mean? If this is an expression that depends on the task design, it should be explained in the task description instead of this section.

line 577- The Bayesian agent

- I'm not sure if the update rule of this model is a calculation that can be called "Bayesian." At first glance, it seems to perform a sampling-based approximation to Bayesian inference, like the sequential Monte Carlo method or the particle filter. However, looking closely, that is not the case. Eq. (12) certainly follows Bayes' formula, but applying it to each sample of the parameter does not necessarily mean that the samples of θ approximate a Bayesian posterior distribution. Also, I cannot figure out how adding a noise factor (line 596) will affect the distribution. If there is research using such a method, a reference is required. If it is the authors' original method, it is necessary to explain why this update rule should be considered Bayesian.

line 604- Optimal RL agent (LK model)

- It is unclear to me in what sense this LK model can be called an optimal RL. What does the LK model stand for? I could not figure out why the Qset updated by Eq (16) equals P_set. I could not follow subsequent calculations either.

line 636 “The learning rate should then be affected by the prediction error of the baiting rate.”

- It is unclear why this can be said and what kind of logic will be used in the subsequent update rule, Eq (20).

line 643- Linear regression for computing reward and choice history effects

- It is difficult to understand whether Eq (21) is an equation (How do the left and right sides correspond? Why is the parameter \lambda included on the left side but not on the right side?). It is also necessary to explain why L1 regularization and L2 regularization are used instead of the standard logistic regression.

Eq (22) is an equation that can be applied to any number of choices, A, but it should be noted that Eq (23) holds when there are only two choices.

lines 670-680

It seems that maximum likelihood estimation is performed with an original method, but why did the authors not use an established and packaged nonlinear optimization method (which can be executed by the function fmincon in Matlab)? Since it seems that the gradient of the objective function has not been estimated, it might take extra time to converge. If there is literature that examines the validity of the method used here, it should be cited.

lines 681-687

Also, I have never seen a document in which AUC is also used in this way. I don't know how to calculate it. I think it needs an explanation. Authors should cite any references available.

line 712- Derivation of the DT model from logistic regression

- The DT model consists of a combination of two choice kernels with fast decay and slow decay. I think this can be introduced quite naturally. I'm not sure why it needs to be derived from logistic regression, as the authors discussed. I think it makes sense to associate it with logistic regression for a deeper understanding of the model property, but I'm not sure if the derivation here works for that purpose.

p.46- RL models

I understand that “Indirect actor” is a natural Q-learning update, but I couldn’t understand how the update rule for the “Direct actor model” (64) was derived. An explanation is necessary.

p.48

“F-Q up model” is equivalent to the RL model with “default value”, discussed in Toyama, Katahira, & Ohira (2019, Frontiers in Human Neuroscience), albeit this paper assumes that the initial value of Q is set to the default value.

Related to this, the authors should mention how the initial value of Q was set.

3. Results

Eq (1) (p.6)

Why is “P_set, i (R)” denoted like a function of R? Isn't this a constant?

line 211 “The discrepancy in choice history effects of optimal agents and animals must stem from the fact that optimal agents can precisely infer the reward probabilities while animals can not.”

- The “discrepancy” mentioned in this sentence is not clear. I think it would be better to show the actual data rather than refer to Lau & Glimcher (2005).

line 233

How were the data split into five parts? Did the authors split sessions within each subject? Or did they ignore the subject identity and pool all the sessions?

line 270

“This finding was also evident when looking at the representative choice dynamics that the DT model, mice, and humans generated (Fig. 2E)."

- I could not figure out how this was evident from Fig.2E.

Table 1 (p.16)

Only a single simulation of parameter recovery was reported. Parameter recovery should be done many times while changing the true parameters. Please see the following paper:

Wilson, R. C., & Collins, A. G. (2019). Ten simple rules for the computational modeling of behavioral data. elife, 8, e49547.

line 380

“Due to the symmetric nature of the two choice traces, whereby different signed parameters can yield the same predictions, we restricted \varphi to negative and \vartheta to positive values.”

- I think this statement is inappropriate. Because \varphi was defined as the weight for *fast* decay and \vartheta was defined for *slow* decay, it is not symmetric.

How did the authors determine model parameters for Figure 4? (Was it selected to minimize the regret?)

4. Other points

Throughout this paper, the authors argue that VIs are more ecologically valid. But I do not think this is always the case. While this may be the case for herbivore foraging, the variable ratio (VR) schedule, in which rewards do not depend on choice history, is more appropriate for carnivore foraging (see Sakai & Fukai, 2008, Neural Computation).

Also, it would be interesting to analyze the data in terms of whether it follows the matching law (e.g., does the choice history effect promote or inhibit matching behavior?). As Sakai & Fukai demonstrated that an actor-critic model (but not a Q-learning model) can explain the matching law, it may be better to include an actor-critic model as a candidate model.

Reviewer #2: General comments:

Junior Samuel Lopez-Yepez and his colleagues are interested in the issues of choice history effects. The authors are interested in the “decoupling” of reward and choices in the sequence of trials. In the natural habitat, the choices act on the environment and change its state, which subsequently influences the reward richness of the environment. But such inter-trial dependencies of choices and rewards have not been carefully examined in previous studies. They devised a task paradigm with a variable interval (VI) reward schedule, which incorporates such inter-trial dependencies of choices and rewards. By examining the behaviors of humans and mice on this task paradigm, the authors have found that applying the double trace (DT) model, which was originally applied to the reward history effects of monkeys, to choice history effects can explain the behavior of animals well. Therefore, the value of the paper depends on how novel the VI task paradigm is in introducing the inter-trial reward-choice dependencies and on how adequate the DT model is in describing and explaining the interactions of choice and reward on long-term scales beyond a single trial. Overall, there are some puzzling aspects in this paper. The conclusion of the paper is just about the choice history effects, but the task design and the initial aim of the paper are so explicitly set out to examine the effects of the long-term contingency of choice and rewards. The authors need to address the disconnections in the design, analysis, and interpretation in this study.

Major points:

1. The first issue is how effective this new task paradigm is for elucidating the effects of long-term choice-reward contingency compared to the previous paradigms. As the authors mentioned, the task paradigms inspired by foraging theory have uncovered interesting patterns of the long-term history effects of choice and reward (Hayden et al., Nature Neuroscience, 2011; Kolling et al., Science, 2012; Kolling et al., Neuron, 2014; Wittmann et al., Nature Communications, 2016; Wittmann et al., Nature Communications, 2020). The authors should conduct more careful comparisons with this existing literature.

2. Related to the first point, the previous literature of foraging decision paradigms with long-term contingency of choice and reward elucidated mainly the history effects of rewards in the previous trials (Wittmann et al., 2016; Wittmann et al., 2020). In contrast, the authors claim that the long-term history effects are mainly due to the choice history effects. It is crucial to reconcile the difference of these interpretations between the current results and the previous results using the conceptually similar tasks.

3. The separation of choice history effects and reward history effects in the behavioral analyses is basically artificial in the current study and in the previous studies. This issue is more pronounced in this study because the authors explicitly aimed to examine the effects of the across-trial contingency between the choices and rewards. The analyses in this study, however, linearly separated history effects of choices and rewards. It might be more straightforward to examine such “interaction terms” of choices and rewards if the authors are genuinely interested in these intricate relationships between choices and rewards.

4. The DT model of “choice history” effects might contain a similar problem of the artificial separation of choices and rewards. The authors' focus on the choice history effects is understandable given the recent popularity of the issue of the choice history effects or decision inertia. But it is also puzzling that the conclusion of the paper is just about the choice history effects given that the task design and the initial aim of the paper are so explicitly set out to examine the effects of the long-term contingency of choice and rewards.

Minor points

1. The article uses concepts or variables (alternation, perseverance, volatility) without introducing them first, which is pretty confusing, even if those variables can be more or less intuitive. The results section did not give enough information for the analyzed variables to be intuitive enough. I needed to go back and forth from the results to the methods constantly, and sometimes to find that the methods did not contain the information I was looking for either.

2. Figures are also incomplete and not everything that appears is labeled or explained. The text gets particularly confusing when dealing with the particular dataset each analysis was based on. Some were based on simulations (optimal agents or their proposed model), while others are based on mouse and human data. In the case of simulated agents, there are several reward distributions tested for optimal agents, and it looks like sometimes block transitions are used and sometimes not. In the case of biological data, two mouse datasets are used, and most of the analyses are not done with the human dataset at all. The authors should make an effort to clarify the particularities of the environment data was simulated/collected on, and to justify why each kind of analysis was done on only certain datasets.

3. In the author summary (page 3), the authors make clear what process their model captures which other models do not. Namely, and in their own words, the fact that “animals track their history of visits and use it to maximize the food harvesting efficiency". To me that makes a lot of sense. However, it is difficult to find this clear message in the main text, particularly in the discussion.

Specific points

PAGE 3

• Lines 55-57: the authors suggest that “choice history bias” is synonymous with “choice inertia”, while that is not exactly true. Inertia is a particular kind of bias where the probability of repeating a choice increases the more a choice is repeated (Akaishi et al., Neuron, 2014).

PAGE 4

• Lines 71-72: I find the way the reward probability in the VI task is explained a bit confusing. Maybe it is because it starts by saying “in each trial rewards are assigned with fixed or set reward probabilities keyed to different options”, and it is not until much later that it is said that the probability also depends on the number of elapsed trials without that option being chosen. Maybe authors should start by saying that each option’s reward probability is defined by both a set (or initial) probability, unique for each option, and the number of trials without that option being chosen.

PAGE 5

• Lines 89-90: The affirmation “To maximize their reward rate, optimal agents should choose the options with the highest set reward probability” assumes that both options lead to a reward of the same magnitude, if this is delivered (e.g. 1 point). Maybe this should be stated before when describing the VI task, as many decision-making studies implement different rewards for each option.

• Lines 92-93: The authors say that an optimal strategy in this environment may “resemble choosing the best option, with occasional switches to lesser options”. This statement is not well connected to the remaining part of the paragraph, but it could make an interesting point to be developed in the discussion. It would be particularly interesting to connect it to what previous work thinks such lapses mean (i.e.

PAGE 8

• Lines 171-173: The peaks observed in simulated data should be discussed a little bit.

PAGE 16

• Lines 303-304: a short definition of volatility should be introduced here, as it is important to understand the following analyses. The authors need to explain what “reward history effects are sensitive to the volatility” means. In the original paper, it was learning rate, which was affected by the volatility (Behrens et al., Nature Neuroscience, 2007).

PAGE 20

• Figure 3: the figure and the legend need serious reworking. Why are the middle panels on the left hand side of A, B and C needed? They simply depend on experimenters’ fixed parameters. What do yellow dots mean? What kind of variability are we being shown in each panel with those error bars? What do red crosses mean? What do blue and grey circles represent on the right hand side panels? Also it should be said that the volatility analyses were made on a different dataset than the rest of the analyses described in the figure, for which new data (although with the same participants) was gathered.

PAGE 22

• Lines 410-411: why do authors think that the DT model’s choice trace reflects volatility, but they do not find that in data?

• Lines 414-436: the first paragraph of the discussion is way too long and it does not summarize the achievements of the present work. It only gives context, and it would be more appropriate for the introduction.

PAGE 23

• Lines 424-432: the authors take a convoluted path to make a simple point, which, if I understand correctly, is the following. Some experimental work whose primary aim is studying optimal behavior, but not necessarily within the domain of perception, still resorts to perceptual tasks. But assessing optimality in perceptual tasks has extra nuisances that we could disregard in simpler, non-perceptual scenarios, such as the one proposed here. I feel like this point could be made easier.

• Line 435: what property are the authors talking about?

PAGE 24

• Lines 442-444: authors say that “optimal agents show the same characteristic shape of choice history effects observed in animals when probed with uniform distribution of set reward probabilities”. I think this is quite confusing. The key information to understand this is to bear in mind that the optimal agents the authors are talking about have access to the exact set probabilities of each option. Thus, in this case their “knowledge” transparently reflects the experimental parameters. Maybe authors want to make that clear.

• Lines 445-447: I do not see how point #2 falls from what the authors have previously reasoned in that paragraph.

PAGE 25

• Lines 480-483: I feel this is an overstatement. Just because these models take reward history into account it does not mean they will offer an accurate description of behavior and neural activity.

• Line 483: why the “however”?

PAGE 26

• Lines 502-515: this is a paragraph about neural representations, and yet no neuroimaging work is cited.

• Line 507: sometimes authors spell “time scales”, sometimes “time-scales”. A unified spelling should be used.

PAGE 27

• Lines 516-534: Charnov’s article aside, this paragraph only has one citation. Authors should try to cite more sources.

• Line 536-540: the conclusion should succinctly summarize the most important points of the manuscript, yet the authors use it to suggest, for the first time, that choice history effects are phylogenetically conserved. Maybe this evolution-related bit should be moved to the discussion, and the conclusion should focus on a short wrap-up.

Reviewer #3: Lopez-Yepez and colleagues examine the role of choice history in decision making using a binary 2AFC task with a Variable Interval (VI) Reward Schedule; this introduces contingency between the time spent since an option was chosen and likelihood of payout if the option is chosen again. This is different to a lot of 2AFC tasks in which choice biases emerge despite there being no built-in dependency between past choices and current prospects. They find that the choices of animals and humans are influenced by both the history of rewards and history of choices. They also find in mice that the effects of reward history are mediated by volatility (defined as length of the block) whilst effects of choice history are mediated by differences in set reward probabilities. A learning model that incorporates the history of choices is shown to be able to recapitulate these effects, confer advantage over various other models and earn rewards that are closest to the maximum possible (indexed via regret), suggesting that incorporating this knowledge helps agents maximise returns.

I like the task and applaud the authors' attempt to provide a plausible account of choice biases that are often observed in tasks.

My main set of (related) issues are that in quite a few places, some things about the design and the analysis came across as unclear or confusing.

An example is the experiments the authors run – with mice they run 3 conditions (0.25:0.25, 0.40:0.10 and 0.10:0.40) but with humans they just run 2 conditions (0.40:0.10, and 0.10:0.40). Already it is confusing why this difference in designs exists (no explanation is provided from what I could see). But to confuse things further, the figure (Figure 2C) suggests that humans actually did also have the 0.25:0.25 condition? I put a few more examples below. Note – these may each be things that can be tidied up easily enough, but combined they gave me the impression of being a bit careless.

A few other examples (not an exhaustive list):

• Optimal Agents: The analysis of choice history for the 3 optimal agent simulations (Fig. 1) is presented in a haphazard way.

e.g., why have the number of trials/sessions be different for each combination of set reward probabilities for the Oracle Agent (P.7)? [For instance, in (1) Number of trials is 1000 and number of sessions is 10 but in (3) Number of trials is either 450 or 1350]

e.g., why have n=100 sessions for Bayesian agent, n=5000 for the LK agent and n=10 for Oracle?

e.g., why use a greedy rule for the Oracle but a softmax for the Bayesian and LK Agents?

[Maybe there are good reasons for these differences, but it wasn’t obvious to me.]

• Parameter Recovery (Table 1): Important details are missing on how this procedure was carried out. For instance:

How many simulations were run?

Why were these specific DT parameter values selected (e.g., are they the average of the parameters fit to the mice data, the human data, or both)?

How many trials/blocks were used in the simulations (was it like the human task or the animal task, for instance)?

Is it possible to get estimates of variability for the recovered parameters?

• Figure 1B – I think the legend is incorrect as the option with the higher probability (blue line) actually has lower probability of reward for each number of unchosen trials.

Other issues:

• Terminology: The authors describe their approach and task as one that studies “foraging behaviour” (ln97) and explores “foraging decisions” (ln437). Whilst incorporating a nice feature - that time since a specific option was last chosen influences likelihood of its future reward – which is more like some real situations faced by animals outside of the lab, this is definitively not a foraging task. The key premise of foraging tasks (such as patch foraging or prey selection) is to have one explicit foreground option that needs to be considered against an estimate of the background reward rate. This is just not the case here; the task used is a 2AFC binary choice task where agents are given an explicit “menu” of all the options available on each trial. See for instance:

o Hayden, B. Y., & Walton, M. E. (2014). Neuroscience of foraging. Frontiers in neuroscience, 8, 81.

o Hall-McMaster, S., & Luyckx, F. (2019). Revisiting foraging approaches in neuroscience. Cognitive, Affective, & Behavioral Neuroscience, 19(2), 225-230.

o Stephens & Krebs, Foraging Theory

• Task (humans): There are only 19 participants in the human task and 20-30 trials per block. This seems very few trials to be able to fit a learning model reliably. Did the authors conduct any checks for this?

• DT Model. It seemed a bit peculiar to me to include both a fast trace and a slow trace of choice history in the model, insofar as these quantities have the same update applied to them on every trial. I understand that they end up being different quantities all the same, owing to the learning rates being high and low respectively, and that this potentially enables the model to capture both the fast trial-to-trial oscillations in the choice effects (e.g., Fig 2c) and slower trends over time. Nonetheless, did the authors do any work to untangle whether both traces are needed in the model (e.g., is one trace doing most of the work in explaining choice history; could a single trace model beat a double trace model)? One option might be to run Model Recovery: simulate data from 3 models (No Trace, One Trace, Two Trace) and see which of the 3 models best fits the data in each case. For instance, when simulating choices from a model in which no choice history trace is incorporated, this should be the winning model (e.g., determined by Bayesian Model Selection) when you fit the 3 contenders to the choices.

• There seem to be some differences in how well the DT model recovers the history effects. For different set probabilities (Figure 4a), in the small set probabilities condition, the DT model seems to have a positive effect after a few trials back, but this is not the case in the data (Figure 3b), where the effects seem to asymptote at around 0 after a few trials. The size of the effects also seems to differ by a lot. Compare the scale of the y axis on the 4a Choice Traces plot (ranges from -1.5 to 0.5) to that of the 3B Choice effects (range -4 to 2). Similar differences emerge when you look at the reward history effects for different block lengths: the scale of the axis in 4B ranges from -0.6 to 0.4, but is in the range -4 to 2 in 3A. Do the authors have any idea why these differences arise (it could be something obvious I missed!)?
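The point raised in the DT Model comment, that two traces receiving the same update can still diverge because of their learning rates, can be illustrated with a minimal sketch. The delta-rule form, the function names and the rate values (0.8 and 0.05) are assumptions chosen for illustration; they are not the authors' fitted model or its exact equations.

```python
def update_trace(trace: float, chosen: int, learning_rate: float) -> float:
    """Move the trace toward the choice indicator (1 if chosen, else 0)."""
    return trace + learning_rate * (chosen - trace)


def run_traces(choices, fast_rate=0.8, slow_rate=0.05):
    """Apply the same update to a fast and a slow trace of one option.

    `choices` is a sequence of 0/1 indicators for whether the option was
    chosen on each trial. The rate values are hypothetical.
    """
    fast, slow = 0.0, 0.0
    history = []
    for chosen in choices:
        fast = update_trace(fast, chosen, fast_rate)
        slow = update_trace(slow, chosen, slow_rate)
        history.append((fast, slow))
    return history


if __name__ == "__main__":
    # Three consecutive choices of the option, then one switch away.
    for trial, (fast, slow) in enumerate(run_traces([1, 1, 1, 0]), start=1):
        print(f"trial {trial}: fast={fast:.4f} slow={slow:.4f}")
```

In this sketch the fast trace is dominated by the most recent choice, while the slow trace changes only gradually, so the two carry different information even though the update rule is identical. Whether both are needed to explain the data is the empirical question the model recovery suggestion addresses.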

**********

Large-scale datasets should be made available via a public repository as described in the

Reviewer #1: Yes

Reviewer #2:

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2:

Reviewer #3: No

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool,

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see


Dear Dr Kvitsiani

Thank you very much for submitting your manuscript "Choice history effects in mice and humans improve reward harvesting efficiency" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

As you will see, several important issues raised by both Reviewer 2 and Reviewer 3 remain to be fully addressed (the paper will be sent back to the original reviewers).

I would also like to point out that what is claimed in this study seems to be already known in the field of experimental analysis of behaviour. Specifically, it is known that a strategy that depends on the choice history (alternating or periodic choice) maximizes the reward in the VI schedule, and that animals and humans actually adopt such a strategy when switching choices is not costly, so it would be good to further stress the novelty of the study in this respect.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Stefano Palminteri

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************


Reviewer's Responses to Questions

Reviewer #1: I am grateful to the authors for responding to my previous suggestions. However, I am still concerned about the “Bayesian agent.” In addition, there is a further issue regarding the significance of this study, which I noticed only after I wrote the first review report. Since I was unable to point this out during the initial peer review, I would like to leave it to the editor to decide whether or not to take this into account.

Regarding the Bayesian agent:

In response to my previous comments regarding the Bayesian agent, the authors responded: “The confusion in seeing the update equation as Bayesian may stem from the misspelling. The correct equation should be P(Qn(i)|R) = P(R|Qn(i)) * P(Qn(i)). We also provide additional supplementary (Fig.1 figure suppl.1C) to show that this update correctly infers the set reward probabilities.”

However, this is not sufficient as an answer. If one calls it a Bayesian agent, the agent should represent the distribution of the variable of interest (in this case, the set reward probability). What Fig.1 suppl.1C shows is the point estimate of the set reward probability rather than its distribution (by the way, the title of figure suppl.1C includes a misspelling). Representing the posterior distribution requires at least making sure that its variance (uncertainty) can be expressed according to the Bayes formula. In the authors’ framework, the distribution of the estimates for the set reward probability, \theta^n(i), should obey the posterior distribution. However, the authors did not evaluate this. When we are interested only in a point estimator, we may call it Bayesian estimation if it provides the maximum a posteriori (MAP) estimator. However, it is unclear whether this is the case (whether the parameter corresponds to the MAP estimator rather than the maximum likelihood estimator) in the authors’ model.

Also, regarding adding the noise, the authors wrote:

“Overall this results as we understand is similar to approaches used in hierarchical Bayesian models that use noise term in parameter estimation (Mathys C. et al, Front.in Human Neurosc. 2011).”

As far as I understand, the Mathys et al. approach assumed noise in the generation process, but the estimation is done deterministically under variational approximation, which is quite different from adding noise to the estimates as the authors did.

The Bayesian inference model is a concept that is becoming well established in behavioral modeling. Calling the authors’ model a Bayesian agent may cause confusion. If the authors cannot properly construct a Bayesian inference model, then the Bayesian agent model should be removed from the manuscript, or the name should be changed.
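The distinction drawn above between a full posterior distribution, a MAP estimate and a maximum likelihood estimate can be illustrated with a generic conjugate example. This is a minimal sketch assuming a Beta prior over a fixed Bernoulli reward probability; it is not the authors' update rule, and it ignores the dependence of reward on elapsed trials in the VI task. All function names and the prior values are hypothetical.

```python
def beta_posterior(prior_a: float, prior_b: float, outcomes):
    """Conjugate update of a Beta(prior_a, prior_b) prior with 0/1 outcomes."""
    rewards = sum(outcomes)
    omissions = len(outcomes) - rewards
    return prior_a + rewards, prior_b + omissions


def posterior_summary(a: float, b: float):
    """Mean, variance and mode (MAP) of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    # The interior mode formula applies only when both parameters exceed 1.
    map_estimate = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, variance, map_estimate


def maximum_likelihood(outcomes):
    """Maximum likelihood estimate: the observed reward frequency."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    outcomes = [1, 0, 0, 1, 0]                 # hypothetical reward sequence
    a, b = beta_posterior(2.0, 2.0, outcomes)  # hypothetical Beta(2, 2) prior
    mean, variance, map_estimate = posterior_summary(a, b)
    print(f"posterior Beta({a}, {b}): mean={mean:.4f}, variance={variance:.4f}")
    print(f"MAP={map_estimate:.4f}, ML={maximum_likelihood(outcomes):.4f}")
```

In this example the MAP and maximum likelihood estimates differ because of the prior, and the posterior variance quantifies the uncertainty that a point estimate alone does not convey.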

Additional issue regarding the significance of the paper:

I noticed that Houston and McNamara (1981, Journal of the Experimental Analysis of Behavior, 35(3), 367-396), which was also cited in the manuscript, theoretically showed that periodic deterministic choice behavior gives the highest reward in the concurrent VI schedule. In addition, periodic (alternation) choice strategies have been commonly observed in VI schedules without a changeover delay (COD) procedure, which seems to be absent in the authors’ experiment, as stated in Corrado et al. (Corrado, G. S., Sugrue, L. P., Seung, H. S., & Newsome, W. T. (2005). Journal of the Experimental Analysis of Behavior, 84(3), 581-617):

“A second feature that our foraging task shares with many classical matching paradigms is the incorporation of a changeover delay (COD). The COD is a common technique for introducing a ‘cost,’ in this case a temporal delay, to switching from one choice option to another (Shahan & Lattal, 1998). Thus, although graded matching behavior can be observed without the use of a COD (see, e.g., Lau & Glimcher, 2005), its incorporation shields the data from partial contamination with competing behavioral strategies based on alternation. Without such a cost, an animal can gather rewards surprisingly efficiently by alternating between options…”

Thus, I think that it is already known that alternating behavior, which appeared as the negative history effect of the previous trial in the authors’ results, maximizes reward under certain VI schedule conditions. Given all this, what new knowledge do the results of this paper bring? I would like to see a discussion of the COD and the authors’ contribution in light of these previous studies.
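The efficiency of alternation without a switching cost can be illustrated with a small calculation. The sketch below assumes a discrete-trial schedule in which each option is baited independently with its set probability on every trial and baits persist until collected, with no changeover delay. It compares a few fixed periodic choice cycles and does not reproduce the Houston and McNamara analysis; the function name and the cycles are hypothetical.

```python
def expected_reward_rate(cycle, set_probs):
    """Steady-state expected reward per trial for a repeating choice cycle.

    `cycle` lists the option index chosen at each position of the repeating
    pattern; `set_probs` gives each option's per-trial baiting probability.
    Assumes baits persist until collected and switching carries no cost.
    """
    n = len(cycle)
    total = 0.0
    for t, option in enumerate(cycle):
        # Count back (wrapping around the cycle) to the previous choice of
        # this option; the gap is the number of baiting opportunities.
        gap = 1
        while cycle[(t - gap) % n] != option:
            gap += 1
        total += 1.0 - (1.0 - set_probs[option]) ** gap
    return total / n


if __name__ == "__main__":
    for probs in ((0.25, 0.25), (0.40, 0.10)):
        for cycle in ([0], [0, 1], [0, 0, 0, 1]):
            rate = expected_reward_rate(cycle, probs)
            print(f"set probabilities {probs}, cycle {cycle}: {rate:.4f}")
```

Under these assumptions, strict alternation with equal set probabilities of 0.25 yields an expected 0.4375 rewards per trial against 0.25 for staying on one option, which is the kind of gain a changeover delay is designed to offset.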

Minor points:

Line 282:

“The higher the inverse temperature, the more stochastic the choices are…”

- I think this is the opposite.

Line 357:

were significantly different (p = 0.0006,…

- “were significantly different from zero” or “were significant”?

Line 446

“,Here an agent takes action a duringin each trial t.”

- The sentence is broken.
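The comment on Line 282 above concerns the direction of the inverse temperature effect. A minimal sketch of a two-option softmax rule shows that choices become more deterministic, not more stochastic, as the inverse temperature increases. The function name and the example values are hypothetical.

```python
import math


def softmax_choice_probability(value_a: float, value_b: float,
                               inverse_temperature: float) -> float:
    """Probability of choosing option A under a two-option softmax rule."""
    # With two options the softmax reduces to a logistic function of the
    # value difference scaled by the inverse temperature.
    return 1.0 / (1.0 + math.exp(-inverse_temperature * (value_a - value_b)))


if __name__ == "__main__":
    for beta in (0.0, 1.0, 10.0, 100.0):
        p = softmax_choice_probability(0.4, 0.1, beta)
        print(f"inverse temperature {beta:6.1f}: P(choose A) = {p:.4f}")
```

At an inverse temperature of 0 the choice is random (probability 0.5) regardless of the values; as it grows, the probability of choosing the higher-valued option approaches 1.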

Reviewer #2: The authors have addressed the previous comments sufficiently. The manuscript has improved with the broader perspective on this issue. Because the previous manuscript was so unfocused and confusing in its methodology, I could not see some central points of the paper. With the further analyses in this version, there are some new findings that are relevant to the central claims of the paper, such as the finding of the effect of the no-reward for the immediately preceding choice. So I would like to ask the authors to address the following points:

1. Can we still call the model a "double" trace model of choice history effects given that the fast component is based on no-reward on the previous choice?

2. Whether a choice history effect is a rational or an irrational bias depends on its relevance to the task. Clearly, in this task the elapsed time (trials and number of choices on the options) is directly connected to the possible reward on the other option. In contrast, in other studies such as Akaishi et al. (2014), Fritsche et al. (2017), and Akrami et al. (2018), the choice history bias is irrelevant to task performance. I am just wondering whether the interpretation of the choice history bias in the current task may fail to generalize to other tasks in which the biases are task-irrelevant.

3. This is a relatively minor point, but it is necessary for good scholarship with sufficient documentation of the previous literature. The neural bases of the multiscale representation of past choices have been reported. For example, single unit activities in the prefrontal cortex have been found to be related to the fast and slow choice biases.

Mochol, Kiani, Moreno-Bote (2021). Prefrontal cortex represents heuristics that shape choice bias and its integration into future behavior. Current Biology.

One of the earliest studies of choice history bias in the past decade (Akaishi et al., 2014, Neuron), from which the authors borrowed the equation of the double trace model, has reported extensive evidence about the neural bases of the multiscale choice history bias. The study described the involvement not only of the prefrontal but also of the medial parietal areas, and their interaction with the prefrontal cortex.

Reviewer #3: my comments have all been addressed.

**********

The

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2:

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool,

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here:

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.


Dear Dr Kvitsiani,

Thank you very much for submitting your manuscript "Choice history effects in mice and humans improve reward harvesting efficiency" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

As you can see, Reviewer 1 is happy with the revised version of the manuscript. However, Reviewer 2 raises a few remaining minor issues that should be addressed. Considering that other readers may share the same views, it would be better to further clarify the justification of the model labelling and (also to highlight the novelty of the paper) the novelty of the model within the broader literature using the VI task or others (e.g., the references suggested by R2).

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Stefano Palminteri

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************


Reviewer's Responses to Questions

Reviewer #1: I think that the authors addressed my comments. I am now happy to recommend publication.

Reviewer #2: The authors responded only partially to the reviewers' questions in the new submission of this manuscript. In this resubmission, the authors added a critical piece of new data regarding the "double trace model of choice history effect". In the previous round of review, I asked the authors to justify the use of the term "double trace model of choice history effect". Below I reproduce the exchange:

My comment: With the further analyses in this version, there are some new findings that are relevant to the central claims of the paper such as the finding of the effect of the no-reward for the immediately preceding choice. So I would like to ask the authors to address the following points:

Major point 1. Can we still call the model a "double" trace model of choice history effects given that the fast component is based on no-reward on the previous choice?

Author's response: As long as we explain what behavioral processes double trace captures, we would prefer to keep the term “double trace” as it is.

My point was: if the alternative models incorporate both the history of the reward-choice interaction and the history of choices, which is the implication of the new results, can we still call it the "double trace model of choice history effect"? There is confusion throughout the paper about the term "double trace model" and what its components are. If it consists purely of choice history effects, please state this clearly. And if this is what the authors mean, the issue posed by the new results on "the effect of the no-reward for the immediately preceding choice" has to be resolved.

Usually the authors need to answer the reviewers' questions not with a "preference" but with a "justification". It is a long review process, but it is the responsibility of the authors to maintain a good manner of interaction.

The editor in the previous round of the review asked the authors to clarify the novelty of the paper given that VI task has been used in the past studies. The authors need to address this point sufficiently as well.

Also, within the discussion, in the main paragraph of page 27, the authors say "Indeed, in many decision-making tasks that do not impose contingencies of previous choices on current choices, choice history effects persist (Akaishi et al., 2014; Fritsche et al., 2017; Akrami et al., 2018; Hwang et al., 2017; Mochol, Kiani, & Moreno-Bote, 2021). Understanding the behavioral mechanisms that generate choice-history effects in these tasks may require the manipulation of behavioral contingencies, linking past choices to current choices in a parametric way." The authors could mention papers that discuss how experimental instructions may change choice history effects, or any review comparing the size of choice history effects across different paradigms (foraging tasks with "replenishing" patches vs tasks with static reward probabilities, for instance). In perceptual decision paradigms, the (trial) frequencies of each stimulus condition in a block can bias the decisions (and possibly choice history effects too), as in the following study:

Elapsed Decision Time Affects the Weighting of Prior Probability in a Perceptual Decision Task

A comparison with these papers would clarify the value of the current paper against the background of related studies.

Some minor/technical points

It is still not clear why the authors conclude that participants assume uniform distributions of set reward probabilities. In the current version of the manuscript, the end of the first paragraph of page 12 refers the reader to two figures. But these figures are not that self-explanatory to me. I think a short explanation in that paragraph would help. Particularly because that is brought up again in the discussion, and the information there is not sufficient either.

**********

The

Reviewer #1: None

Reviewer #2:

**********

PLOS authors have the option to publish the peer review history of their article (

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2:


Dear Dr Kvitsiani

We are pleased to inform you that your manuscript 'Choice history effects in mice and humans improve reward harvesting efficiency' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Stefano Palminteri

Associate Editor

PLOS Computational Biology

Daniele Marinazzo

Deputy Editor

PLOS Computational Biology

***********************************************************

PCOMPBIOL-D-21-00059R3

Choice history effects in mice and humans improve reward harvesting efficiency

Dear Dr Kvitsiani,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom