Introducing neuromodulation in deep neural networks to learn adaptive behaviours

Animals excel at adapting their intentions, attention, and actions to the environment, making them remarkably efficient at interacting with a rich, unpredictable and ever-changing external world, a property that intelligent machines currently lack. Such an adaptation property relies heavily on cellular neuromodulation, the biological mechanism that dynamically controls intrinsic properties of neurons and their response to external stimuli in a context-dependent manner. In this paper, we take inspiration from cellular neuromodulation to construct a new deep neural network architecture that is specifically designed to learn adaptive behaviours. The network adaptation capabilities are tested on navigation benchmarks in a meta-reinforcement learning context and compared with state-of-the-art approaches. Results show that neuromodulation is capable of adapting an agent to different tasks and that neuromodulation-based approaches provide a promising way of improving adaptation of artificial systems.


Introduction
The field of machine learning has seen tremendous progress during the past decade, predominantly owing to improvements in deep neural network (DNN) algorithms. DNNs are networks of artificial neurons whose interconnections are tuned to reach a specific goal through the use of an optimization algorithm, mimicking the role of synaptic plasticity in biological learning. This approach has led to the emergence of highly efficient algorithms that are capable of learning and solving complex problems. Despite these tremendous successes, it remains difficult to learn models that generalise or adapt efficiently to new, unforeseen problems based on past experiences. This calls for the development of novel architectures specifically designed to enhance the adaptation capabilities of current DNNs.
In biological nervous systems, adaptation capabilities have long been linked to neuromodulation, a biological mechanism that acts in concert with synaptic plasticity to tune the functional properties of neural networks. In particular, cellular neuromodulation provides the ability to continuously tune neurons' input/output behaviour and to shape their response to external stimuli in different contexts, generally in response to an external signal carried by biochemicals called neuromodulators [1,2]. Neuromodulation regulates many critical nervous system properties that cannot be achieved solely through synaptic plasticity [3,4], and has been shown to be critical to the adaptive control of continuous behaviours, such as motor control, among others [3,4]. In this paper, we introduce a new neural architecture inspired by cellular neuromodulation and specifically designed for DNNs, which we call NMN, standing for "Neuro-Modulated Network".
At its core, the NMN architecture comprises two neural networks: a main network and a neuromodulatory network. The main network is a feed-forward DNN composed of neurons equipped with a parametric activation function whose parameters are the targets of neuromodulation; tuning these parameters allows the main network to be adapted to new, unforeseen problems. The neuromodulatory network, on the other hand, dynamically controls the neuronal properties of the main network via the parameters of its activation functions. The two networks have different inputs: the neuromodulatory network processes feedback and contextual data, whereas the main network is in charge of processing the other inputs.
Our proposed architecture can be related to previous work on several fronts. In [5], the authors take inspiration from Hebbian plasticity to build networks with plastic weights, allowing them to tune their weights dynamically. In [6], the same authors extend their work by learning a neuromodulatory signal that dictates which connections should be plastic and when. Our architecture is also related to hypernetworks [7], in which a network's weights are computed by another network. Finally, other recent works have focused on learning fixed activation functions [8,9].

NMN architecture
The NMN architecture revolves around the neuromodulatory interaction between the neuromodulatory and main networks. We mimic biological cellular neuromodulation [10] in a DNN by assigning the neuromodulatory network the task of tuning the slope and bias of the main network's activation functions.
Let σ(x) : ℝ → ℝ denote any activation function, and let σ_NMN(x, z; w_s, w_b) = σ(z^T (x·w_s + w_b)) denote its neuromodulation-capable version, where z ∈ ℝ^k is a neuromodulatory signal and w_s, w_b ∈ ℝ^k are two parameter vectors of the activation function, respectively governing a scale factor and an offset. In this work, we propose to replace all the main-network neuron activation functions with their neuromodulation-capable counterparts. The neuromodulatory signal z, whose size k is a free parameter, is shared by all these neurons and computed by the neuromodulatory network as z = f(c), where c is a vector representing contextual and feedback inputs. The function f can be any DNN taking such a vector c as input. For instance, c may have a dynamic size (e.g. more information about the current task becomes available as time passes), in which case f could be parameterised as a recurrent neural network (RNN) or a conditional neural process [11], enabling refinement of the neuromodulatory signal as more data becomes available. The complete NMN architecture and the change made to the activation functions are depicted in Fig 1.
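As a concrete reading of the equation above, here is a minimal numpy sketch of a single neuromodulated neuron, together with the sReLU activation used for the main network later in the paper; all function and variable names are ours.

```python
import numpy as np

def srelu(x):
    """Saturated ReLU used for the main network later in the paper: clip to [-1, 1]."""
    return np.minimum(1.0, np.maximum(-1.0, x))

def sigma_nmn(x, z, w_s, w_b, sigma=srelu):
    """Neuromodulation-capable activation sigma(z^T (x * w_s + w_b)).

    x   : scalar pre-activation of one neuron
    z   : neuromodulatory signal of size k, shared by all neurons
    w_s : per-neuron scale parameters, shape (k,)
    w_b : per-neuron offset parameters, shape (k,)
    """
    return sigma(np.dot(z, x * w_s + w_b))
```

With k = 1, z = [1], w_s = [1] and w_b = [0], the neuron reduces to the plain activation; larger values of z^T w_s steepen the slope and z^T w_b shifts the bias, which is exactly the pair of properties the neuromodulatory network controls.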
Notably, the number of newly introduced parameters scales linearly with the number of neurons in the main network, whereas it would scale linearly with the number of connections between neurons if the neuromodulatory network affected connection weights, as is the case for instance with hypernetworks [7]. Our approach can therefore be extended more easily to very large networks.

Setting
A good biologically motivated framework in which the NMN can be applied and evaluated is meta-reinforcement learning (meta-RL), as defined in [12]. In contrast with classical reinforcement learning (RL), which is formalised as the interaction between an agent and an environment defined as a Markov decision process (MDP), the meta-RL setting considers a distribution D over simpler MDPs. Let t denote the discrete time, x_t the state of the MDP at time t, a_t the action taken at time t and r_t the reward obtained at the subsequent time-step. At the beginning of a new episode i, a new element is drawn from D to define an MDP, referred to as M, with which the meta-RL agent then interacts for T ∈ ℕ time-steps. The only information that the agent collects on M is obtained by observing the states crossed and the rewards received at each time-step. We denote by h_t = [x_0, a_0, r_0, x_1, …, a_{t−1}, r_{t−1}, x_t] the history of the interaction with M up to time-step t. As in [12], the goal of the meta-learning agent is to maximise the expected value of the discounted sum of rewards it can obtain over all time-steps and episodes.
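The interaction protocol just described can be sketched as follows; `sample_mdp` (drawing M ~ D) and `agent` (a history-conditioned policy) are hypothetical stand-ins of our own, not the authors' implementation.

```python
def meta_episode(sample_mdp, agent, T):
    """One meta-RL episode: draw an MDP M ~ D, then interact for T time-steps.

    sample_mdp() -> (reset, step) is an assumed interface: reset() returns x0,
    step(x, a) returns (r, x_next). The agent sees only states and rewards,
    accumulated in the history h_t = [x0, a0, r0, x1, ..., x_t].
    """
    reset, step = sample_mdp()
    x = reset()
    h = [x]                       # history of the interaction with M
    total_reward = 0.0
    for _ in range(T):
        a = agent(h)              # policy conditioned on the whole history
        r, x = step(x, a)
        h += [a, r, x]
        total_reward += r
    return total_reward, h
```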

Training
In [12], the authors tackle this meta-RL framework by using an advantage actor-critic (A2C) algorithm. This algorithm revolves around two distinct parametric functions: the actor and the critic. The actor represents the policy used to interact with the MDPs, while the critic is a function that rates the performance of the agent's policy. All actor-critic algorithms follow an iterative procedure consisting of the following three steps.
1. Use the policy to interact with the environment and gather data.
2. Update the actor parameters using the critic ratings.
3. Update the critic parameters to better approximate a value function.
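The three steps above can be sketched schematically; every callable here (`update_actor`, `update_critic`, the toy environment interface used by `rollout`) is a placeholder of our own, not the authors' implementation.

```python
def rollout(policy, env, T=10):
    """Step 1: use the policy to interact with one environment and gather data."""
    x, trajectory = env.reset(), []
    for _ in range(T):
        a = policy(x)
        x, r = env.step(a)
        trajectory.append((x, a, r))
    return trajectory

def actor_critic_iteration(policy, critic, envs, update_actor, update_critic):
    """One iteration of the generic actor-critic procedure."""
    trajectories = [rollout(policy, env) for env in envs]     # step 1: gather data
    policy = update_actor(policy, critic, trajectories)       # step 2: improve actor
    critic = update_critic(critic, trajectories)              # step 3: improve critic
    return policy, critic
```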
In [12], the authors chose to model the actor and the critic with RNNs taking h_t as input. In this work, we propose comparing the NMN architecture to standard RNNs by modelling both the actor and the critic with NMNs. To this end, we define the feedback and contextual inputs c (i.e. the neuromodulatory network inputs) as h_t \ x_t, while the main network input is defined as x_t. Note that h_t grows as the agent interacts with M, motivating the use of an RNN as the neuromodulatory network. A graphical comparison between the two architectures is shown in Fig 2. To stay as close as possible to the neuronal model proposed by [10], the main network is a fully-connected neural network built using saturated rectified linear unit (sReLU) activation functions σ(x) = min(1, max(−1, x)), except for the final layer (also neuromodulated), for which σ(x) = x. In Section 4, we also report results obtained with sigmoidal activation functions; these are often appreciably inferior to those obtained with sReLUs, further supporting the use of the latter.
We built our models such that the standard RNN and NMN architectures have the same number of recurrent layers/units and a relative difference between their numbers of parameters lower than 2%. Both models are trained using an A2C algorithm with generalized advantage estimation [13] and proximal policy updates [14]. Finally, no parameter is shared between the actor and the critic; we motivate this choice by noting that the neuromodulatory signal might need to be different for the actor and the critic. For completeness and reproducibility, we provide a formal description of the algorithms used as supplementary material. This material mainly describes and discusses standard RL algorithms in the context of meta-RL and, to a lesser extent, provides full implementation details. We also provide the exact neural architectures used for each benchmark as supplementary material.

Benchmarks description
We carried out our experiments on three custom benchmarks: a simple toy problem and two navigation problems with sparse rewards. These benchmarks were built to evaluate our architecture in environments with continuous action spaces. For conciseness, we only provide a mathematical definition of the first benchmark. The two other benchmarks are briefly described textually, and further details are available as supplementary material.

Benchmark 2. The second benchmark consists of navigating towards a target in a 2-D space with noisy movements. As for the first benchmark, we distinguish the different MDPs in D through a three-dimensional random vector α. The target is placed at (α[1], α[2]) in the 2-D space. At each time-step, the agent observes its relative position to the target and outputs the direction of a move vector m_t. A perturbation vector w_t is then sampled uniformly in a cone whose main direction α[3] ∼ U[−π, π[, together with the target's position, defines the current task in D. Finally, the agent is moved following m_t + w_t and receives a reward r_t = −0.2. If the agent reaches the target, it instead receives a high reward (r_t = 100) and is moved to a position sampled uniformly in the 2-D space. This benchmark is represented in Fig 4: each task is defined by the main direction α[3] of a wind cone from which a perturbation vector w_t is sampled at each time-step; this perturbation vector is then applied to the movement m_t of the agent, whose direction is given by the action a_t.
Benchmark 3. The third benchmark also involves navigating in a 2-D space, but one which contains two targets. As for the two previous benchmarks, we distinguish the different MDPs in D through a five-dimensional random vector α. The targets are placed at positions (α[1], α[2]) and (α[3], α[4]). At each time-step, the agent observes its relative position to the two targets and is moved along a direction given by its action. One target, defined by the task in D through α[5], is attributed a positive reward (100) and the other a negative reward (−50). In other words, α[5] is a Bernoulli variable that determines which target is attributed the positive reward and which the negative one. As for benchmark 2, once the agent reaches a target, it receives the corresponding reward and is moved to a position sampled uniformly in the 2-D space. This benchmark is represented in Fig 5.
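To make the dynamics of benchmark 2 concrete, one transition could be sketched as below. The cone half-angle π/4 is taken from the supplementary description; the step length (0.25) and the target radius (0.1) are assumptions about details the main text leaves out, and all names are ours.

```python
import math, random

def benchmark2_step(pos, target, a, alpha3, step_len=0.25, radius=0.1):
    """One transition of (our reading of) benchmark 2.

    pos, target : (x, y) tuples; a : action, the direction of the move m_t;
    alpha3      : main direction of the wind cone defining the task.
    step_len and radius are assumed values.
    """
    n = random.uniform(-math.pi / 4, math.pi / 4)   # noise inside the cone
    mx, my = step_len * math.sin(a), step_len * math.cos(a)                    # intended move m_t
    wx, wy = step_len * math.sin(alpha3 + n), step_len * math.cos(alpha3 + n)  # perturbation w_t
    x, y = pos[0] + mx + wx, pos[1] + my + wy
    reached = math.dist((x, y), target) < radius
    return (x, y), (100.0 if reached else -0.2), reached
```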

Learning
From a learning perspective, a comparison of the sum of rewards obtained per episode by NMNs and RNNs on the three benchmarks is shown in Fig 6. Results show that, on average, NMNs learn faster (with respect to the number of episodes) and converge towards better policies than RNNs (i.e., higher rewards in the final episodes). It is worth mentioning that NMNs show very stable results, with small variance across random seeds, as opposed to RNNs. To put the performance of the NMN in perspective, we note that an optimal Bayesian policy would achieve an expected sum of rewards of 4679 on benchmark 1 (see supplementary material for proof), whereas NMNs reach, after 20000 episodes, an expected sum of rewards of 4534. For this simple benchmark, NMNs thus manage to learn near-optimal Bayesian policies.

Adaptation
From an adaptation perspective, Fig 7 shows the temporal evolution of the neuromodulatory signal z (part A), of the scale factor of each neuron of a hidden layer (part B) and of the rewards obtained (part C), with respect to α, for 1000 episodes played on benchmark 1. For small values of t, the agent has little information on the current task, leading to non-optimal behaviour (as can be seen from the low rewards). Most interestingly, the signal z for the first time-steps exhibits little dependence on α, highlighting the agent's uncertainty about the current task and translating into noisy scale factors. Said otherwise, for small t, the agent has learned to play a (nearly) task-independent strategy. As time passes, the agent gathers further information about the current task and approaches a near-optimal policy. This is reflected in the convergence of z (and thus of the scale factors) with a clear dependency on α, and also in more widely spread values of z. For large values of t, z holding constant between time-steps shows that the neuromodulatory signal is almost state-independent and serves only for adaptation. We note that the value of z in each of its dimensions varies continuously with α, meaning that for two similar tasks, the signal converges towards similar values. Finally, it is interesting to look at the variation of the neurons' scale factors with respect to α (part B). Indeed, for some neurons, the scale factors vary between negative and positive values, effectively inverting the slope of the activation function. Furthermore, some neurons are inactive (scale factor almost equal to 0, leading to a constant activation function) for some values of α.
For benchmark 2, let us first note that z seems to code exclusively for α[3]. Indeed, z converges slowly over time with respect to α[3], whatever the values of α[1] and α[2] (Fig 8). This could be explained by the fact that one does not need the values of α[1] and α[2] to compute an optimal move. The graphs in Fig 8 are projected onto the dimension α[3], allowing the same analysis as for benchmark 1.
The results obtained for benchmark 2 (Fig 8) show similar characteristics. Despite the agent receiving only noisy information on α[3] at each time-step (as perturbation vectors are sampled uniformly in a cone centred on α[3]), z quasi-converges slowly with time (part A). The value of z in each of its dimensions also varies continuously with α[3] (as for the first benchmark), resulting in continuous scale-factor variations. This is clearly highlighted at time-step 100 in Fig 8, where the scale factors of some neurons appear highly asymmetric, but with smooth variations with respect to α[3]. Finally, let us highlight that for this benchmark the agent continues to adapt even when it is already performing well. Indeed, one can see that after 40 time-steps the agent is already achieving good results (part C), even though z has not yet converged (part A), which is due to the stochasticity of the environment. The agent only receives noisy information on α, and thus after 40 time-steps it has gathered sufficient information to act well on the environment, but insufficient information to deduce a near-exact value of α[3]. This shows that the agent can perform well even while it is still gathering relevant information on the current task.
It is harder to interpret the neuromodulatory signal for benchmark 3. For that benchmark, the signal seems to code not only for the task in D but also, in some sense, for the state of the agent. As α is five-dimensional, it would be very difficult to look at its impact on z as a whole. Rather, we fix the position of the two references in the 2-D space and look at the behaviour of z with respect to α[5]. In Fig 9, adaptation is clearly visible in the rewards obtained by the agent (part C), with very few negative rewards after 30 time-steps. We note that for later time-steps, z tends to partially converge (part A) and:
• Some dimensions of z are constant with respect to α[5], indicating that they might be coding for features related to α[1, 2, 3, 4].
• Some other dimensions are well correlated with α[5], for which similar observations can be made as for the two other benchmarks. For example, some neurons have very different scale factors for the two possible values of α[5] (part B).
• The remaining dimensions do not converge at all, implying that these are not related to α, but rather to the state of the agent.
These results suggest that, in this case, the neuromodulatory network codes more complex information than simply what is required to differentiate tasks, making z harder to interpret. Despite z not converging in some of its dimensions, we stress that freezing z after adaptation does not strongly degrade the agent's performance. That is, the features coded in z that do not depend on α are not critical to the performance of the agent. To illustrate this, we analyse the behaviour of the agent within an episode when freezing and unfreezing the neuromodulatory signal and when changing task. This behaviour is shown in Fig 10, for which: (a) Shows the behaviour of the agent when z is locked to its initial value. This plot thus shows the initial "exploration" strategy used by the agent, that is, the strategy played when it has not gathered any information on the current task.
(b) Shows the behaviour of the agent after unlocking z, that is when the agent is able to adapt freely to the current task by updating z at each time-step.
(c) Shows the behaviour of the agent when locking z at a random time-step after adaptation. z is thus fixed at a value which fits the current task well. As one can see, the agent continues to navigate towards the correct target. The performance is, however, slightly degraded, as the agent seems to lose some capacity to avoid the wrong target. This further suggests that, in this benchmark (as opposed to the two others), the neuromodulatory signal does not only code for the current task but also, in some hard-to-interpret sense, for the current state.
(d) Shows the same behaviour as in (c), with z still locked to the same value, but with the references now switched. As there is no adaptation without updating z, the agent now always moves towards the wrong target.
(e) Shows the behaviour of the agent when unlocking z once again.As one can see, the agent is now able to adapt correctly by updating z at each time-step, and thus it navigates towards the correct target once again.

Robustness study
Even though the results are quite promising for the NMN, it is interesting to see how it holds up with another type of activation function, and to analyse its robustness to different main-network architectures. Notably, the gap in performance between sReLUs and sigmoids is often far smaller for NMNs than for RNNs (especially for benchmark 2).

Architecture impact
Fig 12 shows the learning curves on benchmark 1 for different main-network architectures (0, 1 and 4 hidden layers in the main network, respectively). As one can see, RNNs can in fact reach the NMNs' performance for a given architecture (no hidden layer in this case), but seem relatively dependent on the architecture. On the contrary, NMNs are surprisingly consistent with respect to the number of hidden layers composing the main network.

Conclusions
In this work, we adopt a high-level view of a nervous system mechanism called cellular neuromodulation in order to improve the adaptive capabilities of artificial neural networks. The results obtained on three meta-RL benchmark problems showed that this new architecture performs better than classical RNNs. The work reported in this paper could be extended along several lines. First, it would make sense to explore other types of machine-learning problems where adaptation is required. Supervised meta-learning would be an interesting track to follow: for example, it is easy to see how our architecture could be applied to few-shot learning. In such a framework, the context fed to the neuromodulatory network would be a set composed of a few samples and their associated ground truths. It would be of great interest to compare the performance of our architecture to that of conditional neural processes (CNPs) [11]. Indeed, the NMN used in this few-shot setting can be seen as a CNP with a specifically designed neuromodulatory connection for conditioning the main network.
Second, research work could also be carried out to further improve the NMN introduced here. For instance, one could introduce new types of parametric activation functions that are not linear, or even spiking neurons. This would amount to designing brand-new parametric activation functions, whose parameters could be more powerful than a simple slope and bias. It would also be of interest to share activation-function parameters per layer, especially in convolutional layers, as this would essentially result in scaling the filters. One could also build a neuromodulatory signal per layer rather than for the whole network, allowing more complex forms of modulation. Furthermore, it would be interesting to see whether, with such a scheme, continuity of the neuromodulatory signal (with respect to the task) would be preserved.
Third, it would be a logical progression to tackle other benchmarks to see if the observations made here hold true. More generally, analysing the neuromodulatory signal in greater depth (and its impact on activation functions) with respect to different, more complex tasks would be worthwhile. An interesting point raised in this work is that, for some tasks, neurons have been shown to have a scale factor of zero, making their activation constant with respect to the input. Generally, any neuron that has a constant output can be pruned if the corresponding offset is added to its connected neurons. This has two interesting implications. First, some neurons have a scale factor of zero for all of the tasks; by using this information, one could prune the main network without losing performance. Second, neurons having a zero scale factor for some tasks essentially means that only a sub-network is used for the given task. It would be interesting to discover whether very different sub-networks would emerge if an NMN were trained on tasks with fewer similarities than those used in this work.
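The pruning argument can be made concrete for a linear layer: if neuron j always outputs the constant c, its contribution W[:, j]·c can be folded into the next layer's bias and its column deleted. A minimal numpy sketch (names are ours):

```python
import numpy as np

def prune_constant_neuron(W, b, j, const_out):
    """Remove input neuron j of a linear layer y = W x + b, assuming x[j] is
    always equal to const_out: W x + b == W' x' + (b + W[:, j] * const_out)."""
    b_new = b + W[:, j] * const_out
    W_new = np.delete(W, j, axis=1)
    return W_new, b_new
```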
Finally, we emphasise that even if the results obtained by our NMN are good and rather robust across a large choice of parameters, further research is certainly still needed to better characterise NMN performance.
Before defining the three benchmark problems, let us recall that, for each benchmark, the MDPs belonging to the support of D, which generates the different tasks, have transition probabilities and reward functions that differ only according to the value of the parameter α. Drawing an MDP according to D thus amounts, for all the benchmark problems, to drawing a value of α according to a probability distribution P_α(•) and determining the transition function and the reward function that correspond to this value. Let us also denote by X and A the state and action spaces, respectively.
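For benchmark 2, this draw could be sketched as follows; the ranges for the target coordinates are our assumption, only the cone direction's distribution U[−π, π[ is stated in the text.

```python
import math, random

def draw_task_benchmark2():
    """Draw the parameter vector alpha indexing one MDP of D for benchmark 2:
    target position (alpha[1], alpha[2]) and wind-cone direction alpha[3]."""
    return (random.uniform(-2.0, 2.0),          # target x (assumed range)
            random.uniform(-2.0, 2.0),          # target y (assumed range)
            random.uniform(-math.pi, math.pi))  # cone main direction
```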

Benchmark 2
State space and action space: where U[a, b] stands for a uniform distribution between a and b.

Initial state distribution:
The initial state x_0 is drawn through two auxiliary random variables that represent the x and y initial coordinates of the agent, denoted u^x_0 and u^y_0. At the beginning of an episode, those variables are drawn as follows: From those four auxiliary variables, we define x_0 as: The distribution P_{x_0}(•) is thus fully given by the distributions over the auxiliary variables.
June 5, 2020 1/26

Transition function: First, let target be the set of points (x, y) ∈ ℝ² such that

When taking action a_t in state x_t, drawing the state x_{t+1} from the transition function amounts to first computing u^x_{t+1} and u^y_{t+1} according to the following procedure:

2. If the preceding condition is not met, an auxiliary variable n_t ∼ U[−π/4, π/4] is drawn to compute u^x_{t+1} and u^y_{t+1} through the following sub-procedure:

(a) Step one: One can see that taking an action a_t moves the agent in a direction which is the vector sum of the intended move m_t of direction a_t and of a perturbation vector p_t of direction α + n_t, sampled through the distribution over n_t.
(b) Step two: In the case where the coordinates computed by step one lie outside [−2, 2]², they are corrected so as to model the fact that when the agent reaches an edge of the 2-D space, it is moved to the opposite edge, from which it continues its move. More specifically, ∀k ∈ {x, y}: Once u^x_{t+1} and u^y_{t+1} have been computed,
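The edge correction of step two is a wrap-around of each coordinate onto the opposite edge, which can be written as a single modular operation (a sketch assuming the [−2, 2] bounds above):

```python
def wrap(u, lo=-2.0, hi=2.0):
    """Wrap a coordinate to the opposite edge when it leaves [lo, hi],
    so the agent re-enters the 2-D space from the other side."""
    span = hi - lo
    return (u - lo) % span + lo
```

For example, `wrap(2.5)` gives −1.5: an agent overshooting the right edge by 0.5 reappears 0.5 inside the left edge.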

Reward function:
The reward function can be expressed as follows:

Initial state distribution:
The initial state x_0 is drawn through two auxiliary random variables that represent the x and y initial coordinates of the agent, denoted u^x_0 and u^y_0. At the beginning of an episode, those variables are drawn as follows: From those six auxiliary variables, we define x_0 as:

Transition function:
For all i ∈ {1, 2}, let target_i be the set of points (x, y) ∈ ℝ² such that . When taking action a_t in state x_t, drawing the state x_{t+1} from the transition function amounts to first computing u^x_{t+1} and u^y_{t+1} according to the following procedure:

1. If ∃i ∈ {1, 2} : (u^x_t, u^y_t) ∈ target_i, which means that the agent is in one of the two targets, then u^k_{t+1} ∼ U[−1.5, 1.5] ∀k ∈ {x, y}.

2. If the preceding condition is not met, u^x_{t+1} and u^y_{t+1} are computed by the following sub-procedure:

(a) Step one: u^x_{t+1} = u^x_t + 0.25·sin(π·a_t), u^y_{t+1} = u^y_t + 0.25·cos(π·a_t). This step moves the agent in the direction it has chosen.

(b) Step two: In the case where the coordinates computed by step one lie outside [−2, 2]², they are corrected so as to model the fact that when the agent reaches an edge of the 2-D space, it is moved to the opposite edge, from which it continues its move. More specifically, ∀k ∈ {x, y}: Once u^x_{t+1} and u^y_{t+1} have been computed,

Reward function:
In the case where (u^x_t, u^y_t) belongs to only target_1, only target_2, or neither of them, the reward function can be expressed as follows: In the case where (u^x_t, u^y_t) belongs to both target_1 and target_2, that is (u^x_t, u^y_t) ∈ target_1 ∧ (u^x_t, u^y_t) ∈ target_2, the reward function can be expressed as follows: That is, we consider that the agent belongs to the target whose centre it is closer to.
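This tie-breaking rule can be sketched as follows; the per-step reward of −0.2 outside the targets is an assumption carried over from benchmark 2, and `math.dist` measures the distances to the two centres.

```python
import math

def benchmark3_reward(pos, centre1, centre2, in1, in2,
                      r_pos=100.0, r_neg=-50.0, r_step=-0.2):
    """Reward of (our reading of) benchmark 3: the positive target pays 100,
    the negative one -50; inside the overlap of both targets, the agent is
    attributed to the target whose centre is closer."""
    if in1 and in2:
        return r_pos if math.dist(pos, centre1) <= math.dist(pos, centre2) else r_neg
    if in1:
        return r_pos
    if in2:
        return r_neg
    return r_step
```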

Advantage actor-critic with generalized advantage estimation
In our meta-RL setting, both the actor and the critic are parametric functions defined on trajectory histories. With θ ∈ Θ and ψ ∈ Ψ the parameters of the actor and critic, respectively (Θ and Ψ being the actor and critic parameter spaces), we define π_θ and c_ψ as the policy and critic functions. Let π_θk and c_ψk be the models for the policy and the critic after k updates of the parameters θ and ψ, respectively.
To update θ_k to θ_{k+1} and ψ_k to ψ_{k+1}, the actor-critic algorithm uses the policy π_θk to select actions in B MDPs drawn sequentially from D, where B ∈ ℕ_0 is a parameter of the actor-critic approach. This interaction between the actor-critic algorithm and the meta-RL problem is presented in tabular form in Algorithm 1 (2.1).
Using the L ∈ ℕ_0 first elements of each trajectory generated from the interaction with the B MDPs and the values of θ_k and ψ_k, the algorithm computes θ_{k+1} and ψ_{k+1}. To this end, it exploits the set [h_{B·k,L}, …, h_{B·(k+1)−1,L}], which we denote H_k. Note that we use a replay buffer for updating ψ; for this update we therefore also use several previous sets H_{k−1}, H_{k−2}, etc. A tabular version of the algorithm details how MDPs are drawn and played, as well as how the set H_k is built. Let J^{π_θ}_M denote the sum of discounted rewards obtained when playing policy π_θ on task M. When used in a classical RL setting, an AC algorithm should interact with its environment to find the value of θ that leads to high values of the expected return given a probability distribution over the initial states. This expected return is written as: where M denotes the Markov decision process with which the AC algorithm interacts. When working well, actor-critic algorithms produce a sequence of policies π_θ1, π_θ2, π_θ3, … whose expected returns increase as the iterative process evolves and eventually reach values that, if π_θ is flexible enough, are close to those obtained by an optimal policy π*_M defined as: where Π is the set of all admissible policies.
Let h_t = {x_0, a_0, r_0, …, x_t} be a trajectory generated by policy π_θ on M and let J^{π_θ}_M(h_t) be the expected sum of discounted rewards that can be obtained when starting from h_t and playing the policy π_θ in this environment, that is: where ρ_M(x_j, a_j, x_{j+1}) is the reward function of task M. In a classical RL setting, and again for an efficient AC algorithm, the value of the critic for h_t, c_ψ(h_t), also converges to J^{π_θ}_M(h_t). We also note that in such a setting, the critic is updated at iteration k + 1 in a direction that provides a better approximation of J^{π_θk}_M(•). Now, let us go back to our meta-RL problem and let V^π denote the expected sum of returns that policy π can obtain on this problem: Let θ* ∈ arg max_{θ∈Θ} V^{π_θ}. When interacting with our meta-RL problem, a performant AC algorithm should, in principle, converge towards a policy π_θ*, leading to a value of V^{π_θ*} close to that of what is called a Bayes-optimal policy in a Bayesian RL setting, where the distribution D is assumed to be known. If we are working with policies that are indeed able to quickly adapt to the environment, we may also expect that the policy π_θ* learned by the algorithm is such that, when applied to an M belonging to the support of D, it leads to a value of J^{π_θ*}_M(h_t) close to max_{π∈Π} J^π_M(h_t) as t increases. In other words, once the agent has gathered enough information to adapt to the current MDP, it should start behaving (almost) optimally. This is the essence of meta-RL.
We may also expect that, in such a case, the value of the critic for h_t when the budget is exhausted closely estimates the expected value of the future discounted rewards that can be obtained when using policy π_θ* after having already observed a trajectory h_t. Therefore, we may also expect that once the episode budget is exhausted, c_ψ(h_t): 1. will be close to J^{π_θ*}_M(h_t); and 2. will, as t increases, tend to get closer to max_{π∈Π} J^π_M(h_t), where M can be any environment belonging to the support of D used to generate h_t.
Existing actor-critic algorithms mainly differ from each other in the way the actor and critic are updated. While in early actor-critic algorithms the critic was directly used to compute the direction of update for the actor's parameters (see for example the REINFORCE policy updates [1]), it is now more common to use an advantage function. This function represents the advantage, in terms of return, of selecting specific actions given a trajectory history (or simply a state, when AC algorithms are used in a standard setting) over selecting them following the policy used to generate the trajectories. Here, we use generalised advantage estimation (GAE), as introduced in [2]. More recently, it has been shown that avoiding overly large policy changes between updates can greatly improve learning ([3], [4]). Therefore, while in classical AC algorithms the function used to update the actor aims at directly representing the gradient of the actor's return with respect to its parameters, we instead update the actor's parameters θ by minimising a loss function that represents a surrogate objective. We selected a surrogate function similar to the one introduced in [4], with an additional loss term that proved to improve (albeit slightly) the performance of PPO in all cases.
As our actor and critic are modelled by differentiable functions, they are both updated through gradient descent. We now explain the losses used to compute the gradients for both the actor and the critic.
Actor update First, we define the temporal-difference error term for any two consecutive time-steps of any trajectory: TD_j = r_j + γ c_{ψ_k}(h_{j+1}) − c_{ψ_k}(h_j), where ψ_k denotes the critic's parameters at the time the given trajectory was played. This temporal-difference term represents, in some sense, the (immediate) advantage obtained after having played action a_j, compared with what was expected by the critic. If c_{ψ_k}(·) were the true estimate of J^{π_{θ_k}}_M(·) and if the policy played were π_{θ_k}, the expected value of these temporal differences would be equal to zero. We now define the GAE terms that will be used later in our loss functions: GAE^j_i = Σ_{t=i}^{L−1} (γλ)^{t−i} TD^j_t, where λ ∈ [0, 1] is a discount factor used for computing GAEs, TD^j_i is the value of TD_i for trajectory j, and L′ is another hyper-parameter of the algorithm, chosen in combination with L in order to have a value of GAE^j_i that accurately approximates Σ_{t=i}^{∞} (γλ)^{t−i} TD^j_t for all i ∈ [0, …, L′] and all j. Note that the value chosen for L also has to be sufficiently large to provide the loss function with a sufficient number of GAE terms. These GAE terms, introduced in [2], represent the exponential average of the discounted future advantages observed. Because GAE terms can capture the accumulated advantages of a sequence of actions, rather than of a single action as is the case with the temporal-difference terms, they can better represent the advantage of the new policy played by the AC algorithm over the old one (in terms of future discounted rewards).
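As an illustration, the TD residuals and truncated GAE terms above can be sketched in a few lines of Python. This is a hedged sketch: `values` stands in for the critic outputs c_ψ(h_t), the `horizon` argument plays the role of the truncation discussed above, and all names are illustrative rather than taken from the authors' code.

```python
# Sketch of the temporal-difference and truncated-GAE computations.
# `values` has one more entry than `rewards` (bootstrap value at the end).

def td_residuals(rewards, values, gamma):
    """TD_j = r_j + gamma * c(h_{j+1}) - c(h_j)."""
    return [rewards[j] + gamma * values[j + 1] - values[j]
            for j in range(len(rewards))]

def truncated_gae(rewards, values, gamma, lam, horizon):
    """GAE_i ~= sum_{t=i}^{min(i+horizon, T)-1} (gamma*lam)^(t-i) * TD_t."""
    td = td_residuals(rewards, values, gamma)
    gaes = []
    for i in range(len(td)):
        acc, w = 0.0, 1.0
        for t in range(i, min(i + horizon, len(td))):
            acc += w * td[t]
            w *= gamma * lam
        gaes.append(acc)
    return gaes
```

With λ = 0 the GAE terms reduce to the one-step TD residuals, which is a convenient sanity check.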
In the loss function, we will actually not use the advantage terms as defined by Equation 5, but normalised versions, so that the advantages remain in a similar range regardless of the magnitude of the rewards. Thanks to this normalisation, the policy learning rate does not have to be tuned according to the loss magnitude. Note that this normalisation does not mask actions that have led to higher or lower returns than average. We normalise as follows, ∀k ∈ [1, …, E] with E the number of actor and critic updates that have been carried out: ĜAE^j_i = (GAE^j_i − m_k) / s_k, where m_k and s_k denote the average and the standard deviation of the GAE terms computed at iteration k. Once the advantages have been computed, the values of θ_{k+1} are computed using updates that are strongly related to PPO updates with a Kullback-Leibler (KL) divergence implementation [4]. The loss used in PPO updates is composed of two terms: a classic policy-gradient term and a penalisation term. Let us first present the standard policy-gradient loss; note that, from now on, we will refer to the values of a_t, x_t and h_t at episode i as a^i_t, x^i_t and h^i_t, respectively: L_vanilla(θ) = −(1/|B_k|) Σ_{(i,t)∈B_k} ĜAE^i_t ln π_θ(a^i_t | h^i_t), (6)
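The batch normalisation of the GAE terms can be sketched as follows; zero-mean/unit-variance normalisation over the update batch is assumed here, as the paper's exact average operator is not reproduced in this excerpt.

```python
import math

def normalise_advantages(gaes, eps=1e-8):
    """Normalise a batch of GAE terms to zero mean and unit standard
    deviation (a common scheme; assumed, not the authors' exact formula)."""
    n = len(gaes)
    mean = sum(gaes) / n
    var = sum((g - mean) ** 2 for g in gaes) / n
    return [(g - mean) / (math.sqrt(var) + eps) for g in gaes]
```

The key property is scale invariance: multiplying every reward (and hence every advantage) by a constant leaves the normalised advantages essentially unchanged.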
1 Although not explicitly written in the text for clarity, we use a normalisation technique when computing discounted sums for the AC algorithm update. Whenever rewards appear in discounted sums during an update of the AC algorithm, they are multiplied by (1 − γ). This has the effect that the discounted-sum values remain of the same magnitude regardless of γ. The implications of this normalisation are twofold: (i) the critic does not directly approximate J^π_M(·) but rather (1 − γ) * J^π_M(·); (ii) for the temporal differences to remain coherent with this normalisation, r_i must also be multiplied by (1 − γ) when computing TD_i. These two small changes are included in Algorithm 3.
where B_k is the set of all pairs (i, t) with i ∈ [B*k, …, B*(k+1)−1] and t ∈ [0, …, L′], that is, the set containing the first L′ time-steps of the B trajectories played for iteration k of the actor-critic algorithm.
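The (1 − γ) normalisation described in the footnote can be sanity-checked numerically: scaling each reward by (1 − γ) keeps the discounted sum at the magnitude of the rewards themselves, regardless of γ. A minimal sketch:

```python
def scaled_discounted_sum(rewards, gamma):
    """Discounted sum with each reward multiplied by (1 - gamma), as in the
    footnote's normalisation; for a constant reward r over a long horizon the
    result tends to r itself, whatever gamma is."""
    return sum((1.0 - gamma) * (gamma ** t) * r
               for t, r in enumerate(rewards))
```

For a constant reward of 5 the scaled sum stays close to 5 for both γ = 0.5 and γ = 0.9, whereas the un-scaled sums would differ by almost an order of magnitude.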
One can easily build intuition about Equation 6: given a history h^i_t, minimising this loss function tends to increase the probability that the policy takes actions leading to positive advantages (i.e. GAE^i_t > 0) and to decrease the probability that it takes actions leading to negative advantages (i.e. GAE^i_t < 0). It has been found that, to obtain good performance with this loss function, it is important that the policy does not change too rapidly from one iteration to the next. Before explaining how this can be achieved, let us first explain why slow policy updates may be important. Let us go back to the loss function given by Equation 6. Minimising this loss function gives a value for θ_{k+1} that leads to higher probabilities of selecting actions corresponding to high values of the advantages GAE^i_t. A potential problem is that these advantages are not really related to the advantages of the would-be new policy π_{θ_{k+1}} over π_{θ_k}, but instead to the advantages of policy π_{θ_k} over π_{θ_{k−1}}. Indeed, the advantages GAE^i_t are computed using the value function c_{ψ_k}, whose parameters have been updated from ψ_{k−1} in order to better approximate the sum of discounted rewards obtained during the episodes [B*(k−1), …, B*k−1]. It clearly appears that ψ_k has, in fact, been updated to approximate discounted rewards obtained through the policy π_{θ_{k−1}} (used to play the episodes for update k−1). A solution to this problem is to constrain the minimisation so that the new policy π_{θ_{k+1}} does not stand too far from π_{θ_k}; we may then reasonably suppose that the advantage function used in (6) still correctly reflects the real advantage function of π_{θ_{k+1}} over π_{θ_k}. To achieve this, we add a penalisation term P(θ) to the loss function. In the PPO approach, the penalisation term is P_PPO(θ) = β_k * Ê_{(i,t)∈B_k}[ KL( π_{θ_k}(·|h^i_t), π_θ(·|h^i_t) ) ], where KL is the Kullback-Leibler divergence, detailed later on, and Ê denotes the average over the pairs of B_k. This term penalises policies that are too different from π_{θ_k}.
We note that the β_k dynamical updates use a hyper-parameter d_targ called the divergence target. The update is done through the following procedure (note that, unlike the updates of β proposed in [4], we constrain β to remain in the range [β_min, β_max]; we explain why later): whenever the measured divergence between two successive policies exceeds d_targ, the coefficient β is increased, and whenever it falls below d_targ, β is decreased, with β then clipped to [β_min, β_max]. With this update strategy, the penalisation term tends to evolve so that the KL divergence between two successive policies does not go beyond d_targ, without having to add an explicit constraint on d, as was the case in Trust Region Policy Optimization (TRPO) updates [3], which are more cumbersome to implement.
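A hedged sketch of such a β update, using the 1.5/2.0 constants of the original PPO heuristic (assumptions here; the authors' exact thresholds are not given in this excerpt) together with the clipping to [β_min, β_max] described above:

```python
def update_beta(beta, kl, d_targ, beta_min, beta_max):
    """Adaptive KL-penalty coefficient in the spirit of PPO's beta update.
    The 1.5 / 2.0 constants follow the original PPO heuristic and are
    assumptions; the clipping to [beta_min, beta_max] is the variant
    described in the text."""
    if kl > 1.5 * d_targ:
        beta *= 2.0        # policies diverged too much: penalise harder
    elif kl < d_targ / 1.5:
        beta /= 2.0        # policies too conservative: relax the penalty
    return min(max(beta, beta_min), beta_max)
```

The clipping keeps β from collapsing to zero (no constraint at all) or exploding (updates effectively frozen) over many iterations.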
As suggested in [5], adding another penalisation term (a squared hinge loss) to P_PPO, further penalising the KL divergence in cases where it surpasses 2 * d_targ, improved algorithm performance. The final expression of the penalisation term is P(θ) = P_PPO(θ) + δ * Ê_{(i,t)∈B_k}[ max(0, KL(π_{θ_k}(·|h^i_t), π_θ(·|h^i_t)) − 2 d_targ)² ], where δ is a hyper-parameter that weights this third loss term. The loss function L_policy that we minimise as a surrogate objective becomes L_policy(θ) = L_vanilla(θ) + P(θ). We now detail how to compute the KL divergence. First, let us stress that we have chosen to work with multivariate Gaussian policies for the actor. This choice is particularly well suited to MDPs with continuous action spaces. The approximation architecture of the actor therefore does not directly output an action, but the means and standard deviations of an m-dimensional multivariate Gaussian from which the actor's policy can be defined in a straightforward way. For each dimension, we bound the multivariate Gaussian to the support U by playing the action clipped to the bounds of U whenever the multivariate Gaussian is sampled outside of U.
In the remainder of this paper, we will sometimes abusively use the term "output of the actor at time t of episode i" to refer to the means vector μ^{θ_k}_{i,t} and the standard-deviations vector σ^{θ_k}_{i,t} that the actor uses to define its probabilistic policy at time-step t of episode i. Note that we have chosen to work with a diagonal covariance matrix for the multivariate Gaussian distribution; its diagonal elements correspond to those of the vector σ^{θ_k}_{i,t}. We can then compute the KL divergence for each pair (i, t) following the well-established formula: KL(π_{θ_k}(·|h^i_t), π_θ(·|h^i_t)) = ½ [ tr(Σ_{θ,i,t}^{−1} Σ_{θ_k,i,t}) + (μ_{θ,i,t} − μ_{θ_k,i,t})^T Σ_{θ,i,t}^{−1} (μ_{θ,i,t} − μ_{θ_k,i,t}) − m + ln(|Σ_{θ,i,t}| / |Σ_{θ_k,i,t}|) ], where Σ_{θ_k,i,t} and Σ_{θ,i,t} are the diagonal covariance matrices of the two multivariate Gaussian distributions π_{θ_k}(·|h^i_t) and π_θ(·|h^i_t) that can be derived from σ^{θ_k}_{i,t} and σ^θ_{i,t}. The loss function L_vanilla can be expressed as a function of Σ_{θ_k,i,t}, Σ_{θ,i,t}, μ_{θ,i,t} and μ_{θ_k,i,t} when working with a multivariate Gaussian. To this end, we use the log-likelihood function ln(π_θ(a^i_t|h^i_t)), which gives the log-likelihood of having taken action a^i_t given a trajectory history h^i_t. In the case of a multivariate Gaussian, ln(π_θ(a^i_t|h^i_t)) is defined as: ln(π_θ(a^i_t|h^i_t)) = −½ [ (a^i_t − μ_{θ,i,t})^T Σ_{θ,i,t}^{−1} (a^i_t − μ_{θ,i,t}) + m ln(2π) + ln|Σ_{θ,i,t}| ], where m is the dimension of the action space and |Σ_{θ,i,t}| represents the determinant of the matrix. From this definition, one can rewrite L_vanilla as a function of these quantities. By merging equations (11), (10) and (8), one gets a loss L_policy that depends only on Σ_{θ_k,i,t}, Σ_{θ,i,t}, μ_{θ,i,t} and μ_{θ_k,i,t}.
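For diagonal covariance matrices, both quantities above reduce to sums over the action dimensions; a sketch (illustrative function names, with per-dimension standard deviations as inputs):

```python
import math

def kl_diag_gauss(mu0, sigma0, mu1, sigma1):
    """KL( N(mu0, diag(sigma0^2)) || N(mu1, diag(sigma1^2)) ): the closed
    form of the per-(i, t) divergence between old and new Gaussian policies,
    specialised to diagonal covariances."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu0, sigma0, mu1, sigma1):
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl

def gauss_log_likelihood(action, mu, sigma):
    """ln pi_theta(a | h): diagonal-Gaussian log-likelihood of an action."""
    ll = 0.0
    for a, m, s in zip(action, mu, sigma):
        ll += -0.5 * (((a - m) / s) ** 2 + math.log(2.0 * math.pi * s ** 2))
    return ll
```

Two quick checks: the KL between identical Gaussians is zero, and shifting a unit-variance mean by 1 gives KL = 0.5.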

Critic update
The critic is updated at iteration k so as to better approximate the expected return obtained when following the policy π_{θ_k}, starting from a given trajectory history. To this end, we use a mean-square error loss as a surrogate objective for optimising ψ. First, we define the empirical return R̂^i_j = Σ_{t=j}^{L} γ^{t−j} r^i_t. From this definition of R̂^i_j, we express the loss as L_critic(ψ) = (1/|B_{k−crb}|) Σ_{(i,j)∈B_{k−crb}} ( R̂^i_j − c_ψ(h^i_j) )², where (i) crb ∈ ℕ_0 is a hyper-parameter; (ii) B_{k−crb} is the set of all pairs (i, j) with i ∈ [B*(k−crb), …, B*(k+1)−1] and j ∈ [0, …, L′]. The set B_{k−crb} used in (12) contains all the pairs from the current trajectory batch and from the crb previous trajectory batches. We call this a replay buffer, whose length is controlled by crb, which stands for "critic replay buffer". Minimising L_critic does not lead to updates such that c_ψ directly approximates the average expected return of the policy π_{θ_k}; rather, the updates are such that c_ψ approximates the average expected return obtained by the last crb + 1 policies played. We found that using a replay buffer for the critic smoothed the critic's updates and improved algorithm performance.
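A minimal sketch of the critic's surrogate objective and of the replay buffer over the last crb + 1 batches (illustrative names; `returns` stands for the empirical discounted returns R̂^i_j and `values` for the critic outputs c_ψ(h^i_j)):

```python
def critic_loss(returns, values):
    """Mean-squared error between empirical discounted returns and the
    critic's predictions, averaged over the buffered pairs."""
    n = len(returns)
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / n

def replay_pairs(batches, k, crb):
    """Collect the pairs from the current trajectory batch k and from the
    crb previous batches, i.e. the set B_{k-crb} of the text."""
    pairs = []
    for b in range(max(0, k - crb), k + 1):
        pairs.extend(batches[b])
    return pairs
```

With crb = 0 the buffer reduces to the current batch only, which recovers the classical critic update.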
Note that the loss (12) is only computed on the first L′ ≪ L time-steps of each episode, as was the case for the actor. The reason behind this choice is simple. The value function c_{ψ_k} should approximate R^i_j = Σ_{t=j}^{+∞} γ^{t−j} r^i_t for every h^i_j, where R^i_j is the infinite sum of discounted rewards attainable when "starting" from h^i_j. However, this approximation becomes less accurate as j approaches L, since the tail of the sum beyond time-step L is never observed, leaving R̂^i_j guaranteed to stand only within an interval around R^i_j whose width grows with j.
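Assuming rewards bounded in [R_min, R_max] (symbols introduced here for illustration only), the unobserved tail of the discounted sum pins R^i_j to an interval whose width grows as j approaches L:

```latex
R^i_j - \hat{R}^i_j \;=\; \sum_{t=L+1}^{\infty} \gamma^{\,t-j}\, r^i_t
\;\in\; \left[\, \gamma^{\,L+1-j}\,\frac{R_{\min}}{1-\gamma},\;\;
\gamma^{\,L+1-j}\,\frac{R_{\max}}{1-\gamma} \,\right]
```

For small j the factor γ^{L+1−j} makes this interval negligible, which is why restricting the loss to the first L′ time-steps keeps the regression targets accurate.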

Gradients computation and update
The full procedure is available in tabular form in Appendix 2.1. In summary, we note that both the actor and the critic are updated using the Adam procedure [6] and back-propagation through time [7]. The main difference between the two updates is that the actor is updated following a full-batch gradient-descent paradigm, while the critic is updated following a mini-batch gradient-descent paradigm.
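The distinction between the two paradigms can be sketched as follows (plain gradient steps instead of Adam, for brevity; `grad_fn` is an illustrative stand-in for the gradient obtained by back-propagation through time):

```python
import random

def actor_update(grad_fn, theta, lr, epochs, batch):
    """Full-batch gradient descent: each epoch uses the whole batch at once."""
    for _ in range(epochs):
        g = grad_fn(theta, batch)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def critic_update(grad_fn, psi, lr, epochs, batch, mb_size):
    """Mini-batch gradient descent: each epoch shuffles the batch and sweeps
    it in mini-batches of size mb_size."""
    for _ in range(epochs):
        idx = list(range(len(batch)))
        random.shuffle(idx)
        for start in range(0, len(idx), mb_size):
            mb = [batch[i] for i in idx[start:start + mb_size]]
            g = grad_fn(psi, mb)
            psi = [p - lr * gi for p, gi in zip(psi, g)]
    return psi
```

Both routines converge on simple convex objectives; the mini-batch variant simply takes noisier, more frequent steps.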
[3] hyperparameters_0 : The set of hyper-parameters, which contains the following elements: • B : Number of episodes played between updates.
• P θ0 and P ψ0 : The distributions for initialising actor and critic's parameters.Those distributions are intrinsically tied to the models used as function approximators.
• L : Number of time steps played per episode.
• L′ : Number of time steps per episode used to compute gradients.
• e a : The number of epochs per actor update.
• e c : The number of epochs per critic update.
• d targ : The KL divergence target.
• d thresh : The threshold used for early stopping.
• β min and β max : The minimum and maximum β values.
• β 0 : The initial value of β k for penalising the KL divergence.
• a lr0 : The initial value of the policy learning rate a lr k .
• c v0 , c z0 , a v0 and a z0 : The initial value for the ADAM optimiser moments c v k , c z k , a v k and a z k .
• c lr : The critic learning rate.
• cmb : The mini-batch size used for computing the critic's gradient.
• crb : The number of previous trajectory batches used in the replay buffer for the critic.
We note that some of the hyper-parameters are adaptive: these are β_k, a_lr_k, c_v_k, c_z_k, a_v_k and a_z_k. Thus the hyper-parameter vector may change between iterations. For this reason, we introduce the notation hp_k, which represents the hyper-parameter vector with the values of the adaptive parameters at iteration k.
Algorithm 1 draws θ_0 ∼ P_{θ_0}(·) and ψ_0 ∼ P_{ψ_0}(·) at random, then loops over runs of B episodes and actor-critic updates while B*k < E.
Algorithm 2 Kth run of B episodes. [1] θ_k : The parameters of the policy at iteration k.
[2] D : The distribution from which the MDPs are sampled.
[3] hp k : In this procedure, we use as hyper-parameters: • B : The number of episodes to be played.
• L : The number of time steps played per episode.
Each episode first samples x^i_t ∼ P_{x_0}(·) and then runs while t ≤ L, drawing the next state from P(x^i_{t+1} | x^i_t, a^i_t) and the reward from ρ(x_{t+1}, x_t, a_t) of the current task M.
Algorithm 3 Kth update of the actor-critic model. [1] H_{k−crb}, …, H_k : The crb + 1 last sets of B trajectories of length L.
[2] θ k and ψ k : The parameters of the actor and critic after k updates.
• a lr k : The current policy learning rate.
[4] hp k : In this procedure, we use as hyper-parameters: • e actor : The number of epochs per actor update.
• d targ : The KL divergence target.
• d thresh : The threshold used for early stopping.
• L′ : The number of time-steps per trajectory used for computing gradients.
• a lr k : The actor learning rate.
• a v k and a z k : The last ADAM moments computed at iteration k − 1.
[2] hp k : In this procedure, we use as hyper-parameters: • d targ : The KL divergence target.
• β min and β max : The minimum and maximum β values.
• a lr k : The current actor learning rate. [2] ψ k : The critic's parameters.
[3] hp k : In this procedure, we use as hyper-parameters: • e critic : The number of epochs per critic update.
• cmb : The mini-batch size used for computing the critic's gradient.
• T : A hyper-parameter of our gradient estimate.
• crb : The replay buffer size.
• c lr : The critic learning rate.
• L′ : The number of time-steps per trajectory used for computing gradients.
• c v k and c z k : The last ADAM moments computed at iteration k − 1.
[2] Z : The set of pairs (i, j) such that v(α)_{i,j} appears in L(α). We emphasise that, from the way Z is built (see Algorithms 4 and 6), most of the time Z contains batches of T consecutive pairs. Very rarely, a batch may have fewer than T consecutive pairs (whenever it contains the last pairs of an episode whose length is not a multiple of T); the same gradient-descent algorithm can nevertheless still be applied.
[3] α : The element for which the estimate gradient of L(α) needs to be evaluated.
[4] hp k : In this procedure, we use as hyper-parameter: • T : The number of time-steps for which the gradient can propagate.
3: Output: The gradient estimate of the function L(α) evaluated in α .
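The batching of an episode's pairs into runs of T consecutive pairs, used for the truncated gradient propagation, can be sketched as:

```python
def chunk_episode(pairs, T):
    """Split one episode's (i, j) pairs into batches of T consecutive pairs.
    The final batch may be shorter whenever the episode length is not a
    multiple of T, matching the note above."""
    return [pairs[s:s + T] for s in range(0, len(pairs), T)]
```

The same gradient-descent step applies unchanged to the occasional shorter final batch.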
We refer the reader to the source code, available on GitHub (https://github.com/nvecoven/nmd_net), which contains a particular implementation of standard BPTT [7]. We note that giving a full tabular version of the algorithm here would not provide valuable information to the reader, due to its complexity and length.

Architecture details
For conciseness, let us denote by f_n a hidden layer of n neurons with activation function f, by → a connection between two fully-connected layers, and by () a neuromodulatory connection (as described in Section ??).
Benchmark 1. The architectures used for this benchmark were as follows: Benchmarks 2 and 3. The architectures used for benchmarks 2 and 3 were the same and as follows: • RNN : GRU_100 → GRU_75 → ReLU_45 → ReLU_30 → ReLU_10 → I_1

Bayes optimal policy for benchmark 1

A Bayes optimal policy is one that maximises the expected sum of rewards it obtains when playing an MDP drawn from a known distribution D. That is, a Bayes optimal policy π*_bayes belongs to the following set: π*_bayes ∈ arg max_{π∈Π} E_{M∼D} E_{P_M}[ R^π_M ], with P_M being the state-transition function of the MDP M and R^π_M the discounted sum of rewards obtained when playing policy π on M.
In the first benchmark, the MDPs differ only by a bias, which we denote α. Drawing an MDP according to D amounts to drawing a value of α according to a uniform distribution over [−α_max, α_max], denoted U_α, and determining the transition function and the reward function that correspond to this value. Therefore, we can write the previous equation as: π*_bayes ∈ arg max_{π∈Π} E_{α∼U_α}[ R^π_{M(α)} ], with M(α) being the function giving as output the MDP corresponding to α and Π the set of all possible policies. We now prove the following theorem.
Theorem 1 The policy that selects:
1. at time-step t = 0, the action a_0 = x_0 + γ(α_max + 4.5)/(1 + γ);
2. at time-step t = 1: a) if r_0 = 10, the action a_1 = x_1 + a_0 − x_0; b) if |r_0| > α_max − (a_0 − x_0) ∧ a_0 − x_0 > 0 and r_0 ≠ 10, the action a_1 = a_0 + r_0; c) if |r_0| > α_max − (x_0 − a_0) ∧ a_0 − x_0 < 0 and r_0 ≠ 10, the action a_1 = a_0 − r_0; d) and otherwise the action a_1 = a_0 + r_0 + 1;
3. for the remaining time-steps: a) if r_0 = 10, the action a_t = x_t + a_0 − x_0; b) else if r_1 = 10, the action a_t = x_t + a_1 − x_1; c) and otherwise the action a_t = x_t + i_t, where i_t is the unique element of the set {a_0 − x_0 + r_0, a_0 − x_0 − r_0} ∩ {a_1 − x_1 + r_1, a_1 − x_1 − r_1};
is Bayes optimal for benchmark 1.

Proof Let us denote by π*_theorem1 the policy in this theorem. To prove this theorem, we first prove that, in the set of all possible policies Π, there is no policy π that leads to a higher value of expression (13) than π*_theorem1. Afterwards, we prove that the policy π*_theorem1 generates, for each time-step t ≥ 2, a reward equal to R_max, which is the maximum reward achievable. By merging (14) and (15), we obtain the result, which proves the theorem.
Part 1. Let us now prove inequality (14). The first thing to notice is that, for a policy to maximise expression (13), it only needs to satisfy two conditions for all x_0. The first one: to select an action a_1 which, knowing the value of (x_0, a_0, r_0, x_1), maximises the expected value of r_1. We denote by V_1(x_0, a_0, r_0, x_1) the maximum expected value of r_1 that can be obtained knowing the value of (x_0, a_0, r_0, x_1). The second one: to select an action a_0, knowing the value of x_0, that maximises the expected value of the sum r_0 + γ V_1(x_0, a_0, r_0, x_1). We now show that the policy π_theorem1 satisfies these two conditions.
Let us start with the first condition, which we check by analysing four cases, corresponding to the four cases a), b), c), d) of policy π_theorem1 for time-step t = 1. a) If r_0 = 10, the maximum reward that can be obtained, we are in a context where a_0 belongs to the target interval. It is easy to see that, by playing a_1 = x_1 + a_0 − x_0, we will obtain r_1 equal to 10. This shows that, in case a) for time-step t = 1, π_theorem1 maximises the expected value of r_1.
b) If |r_0| > α_max − (a_0 − x_0) ∧ a_0 − x_0 > 0 and r_0 ≠ 10, it is easy to see that the value of α to which the MDP corresponds can be inferred from (x_0, a_0, r_0) and that the action a_1 = a_0 + r_0 will fall in the middle of the target interval, leading to a reward of 10. Hence, in this case also, the policy π_theorem1 maximises the expected value of r_1.
c) If |r_0| > α_max − (x_0 − a_0) ∧ a_0 − x_0 < 0 and r_0 ≠ 10, we are also in a context where the value of α can be inferred directly from (x_0, a_0, r_0), and the action a_1 = a_0 − r_0 targets the centre of the target interval, leading to a reward of 10.
Here again, π theorem1 maximises the expected value of r 1 .
d) In the remaining case, we fix a_1 to a_0 − |a_0 − x_0 − α| + 1 when a_0 − x_0 ≥ 0. Let us also observe that the expected value of r_1 is then equal to 5.5 − |a_0 − x_0 − α|. Up to now in this case d), we have considered (a_0 − x_0) > 0. When (a_0 − x_0) ≤ 0, using the same reasoning, we reach the exact same expression for the optimal action to be played and for the maximum expected value of r_1, due to the symmetry that exists between both cases. Since π_theorem1 plays the action a_1 = a_0 + r_0 + 1 = a_0 − |a_0 − x_0 − α| + 1 in case d) at time-step 1, it is straightforward to conclude that, in this case, it also plays an action that maximises the expected value of r_1.
Now that the first condition for π_theorem1 to maximise expression (13) has been proved, let us turn our attention to the second one. To this end, we will compute, for each x_0 ∈ X, the action a_0 ∈ A that maximises the expected value of r_0 + γ V_1(x_0, a_0, r_0, x_1), and show that this action coincides with the action taken by π_theorem1 at time-step t = 0. First, let us observe that, for this optimisation problem, one can reduce the search space A to [x_0 − α_max + 1, x_0 + α_max − 1] ⊂ A. Indeed, an action a_0 that does not belong to this latter interval would not give more information about α than playing a_0 = x_0 − α_max + 1 or a_0 = x_0 + α_max − 1, and would lead to a worse expected r_0. This reduction of the search space will be exploited in the developments that follow.
From here, we can compute the value of expression (19) when a_0 − x_0 ≥ 0. We note that, due to the symmetry that exists between the cases a_0 − x_0 ≥ 0 and a_0 − x_0 ≤ 0, expression (19) takes the same value in both cases. Computing the corresponding integrals yields the value of the second term of the sum in equation (18). Let us now analyse the first term of the sum in equation (18), namely E_{α∼U_α}(r_0).
Due to the reduction of the search space, we can assume that a_0 belongs to [x_0 − α_max + 1, x_0 + α_max − 1]. To find the action a_0 that maximises (16), one can differentiate (18) with respect to a_0. This derivative has a single zero, equal to a_0 = x_0 + γ(α_max + 4.5)/(1 + γ). It can easily be checked that this corresponds to a maximum of expression (16) and, since it also belongs to the reduced search space [x_0 − α_max + 1, x_0 + α_max − 1], it is indeed the solution to our optimisation problem. Since π_theorem1 plays this action at time t = 0, Part 1 of this proof is now fully completed.
Part 2. Let us now prove that the policy π*_theorem1 generates, for every t ≥ 2, rewards equal to R_max = 10. We will analyse three different cases, corresponding to the three cases a), b) and c) of policy π_theorem1 for time-steps t ≥ 2. a) If r_0 = 10, we are in a context where a_0 belongs to the target interval. It is straightforward to see that, by playing a_t = x_t + a_0 − x_0, the action played by π_theorem1 in this case, we will get a reward r_t equal to 10.
b) If r_1 = 10 and r_0 ≠ 10, one can easily see that playing action a_t = x_t + a_1 − x_1, the action played by π_theorem1, will always generate rewards equal to 10. c) If r_0 ≠ 10 and r_1 ≠ 10, it is possible to deduce from the first action a_0 that the MDP played corresponds necessarily to one of these two values for α: {a_0 − x_0 + r_0, a_0 − x_0 − r_0}. Similarly, from the second action played, one knows that α must also stand in {a_1 − x_1 + r_1, a_1 − x_1 − r_1}. It can be proved that, because a_0 ≠ a_1 (a property of our policy π_theorem1), the two sets have only one element in common. Indeed, if these two sets had all their elements in common, either this pair of equalities would hold: a_0 − x_0 + r_0 = a_1 − x_1 + r_1 and a_0 − x_0 − r_0 = a_1 − x_1 − r_1, or this pair of equalities would hold: a_0 − x_0 + r_0 = a_1 − x_1 − r_1 and a_0 − x_0 − r_0 = a_1 − x_1 + r_1.
By summing the two equations of the first pair member by member, we obtain a_0 − x_0 = a_1 − x_1. Taking into account that x_0 = x_1, because neither of the two actions yielded a positive reward, this implies a_0 = a_1, which results in a contradiction. It can be shown in a similar way that another contradiction appears with the second pair. As a result, the intersection of these two sets contains a unique element, equal to α. From here, it is straightforward to see that, in this case c), the policy π_theorem1 will always generate rewards equal to R_max.
From Theorem 1, one can easily prove the following theorem.
Theorem 2 gives the value of the expected return of a Bayes optimal policy for benchmark 1. Proof The expected return of a Bayes optimal policy can be written as the expectation, over α ∼ U_α, of the return of π*_theorem1 on M(α); from the proof of Theorem 1, each term of this expectation can be computed explicitly.

Fig 1. Sketch of the NMN architecture. A. The NMN is composed of the interaction of a neuromodulatory neural network that processes some context signal (top) and a main neural network that shapes some input-output function (bottom). B. Computation graph of the NMN activation function σ_NMN, where w_s and w_b are parameters controlling the scale factor and the offset of the activation function σ, respectively. z is a context-dependent variable computed by the neuromodulatory network. https://doi.org/10.1371/journal.pone.0227922.g001

Figs 3, 4 and 5 give a graphical representation of each of the benchmarks. Benchmark 1. We define the first benchmark (made of a 1-D state space and action space) through a random variable α, informative enough to distinguish all different MDPs in D. With this definition, α represents the current task, and drawing α at the beginning of each episode amounts to sampling a new task from D. At each time-step, the agent observes a biased version x_t = p_t + α of the exact position of a target p_t belonging to the interval [−5 − α, 5 − α], with α ∼ U[−10, 10]. The agent outputs an action a_t ∈ [−20, 20] and receives a reward r_t.
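A hedged, gym-like sketch of this first benchmark (the target-zone half-width `TOL` is an assumption — the excerpt states the +10 reward and the re-sampling of the target, but not the zone's width; class and method names are illustrative):

```python
import random

class Benchmark1Env:
    """Sketch of the first benchmark: the agent sees a biased observation
    x_t = p_t + alpha and must output an action close to the true target
    position p_t. Hitting the target zone yields +10 and re-samples p;
    otherwise the reward is -|a_t - p_t| and the target stays put."""

    TOL = 0.5  # assumed half-width of the dashed target zone

    def __init__(self, alpha_max=10.0, seed=None):
        self.rng = random.Random(seed)
        self.alpha = self.rng.uniform(-alpha_max, alpha_max)
        self.p = self.rng.uniform(-5 - self.alpha, 5 - self.alpha)

    def observe(self):
        return self.p + self.alpha  # biased observation x_t

    def step(self, action):
        action = min(max(action, -20.0), 20.0)  # action space [-20, 20]
        if abs(action - self.p) <= self.TOL:
            # hit: reward 10 and target re-sampled in [-5 - alpha, 5 - alpha]
            self.p = self.rng.uniform(-5 - self.alpha, 5 - self.alpha)
            return 10.0
        return -abs(action - self.p)  # miss: target position unchanged
```

An agent that knows α can act optimally by playing `env.observe() - alpha` at every step, which is exactly the adaptation the meta-RL agent must discover.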

Fig 3. Sketch of a time-step interaction between an agent and two different tasks M (A and B) sampled in D for the first benchmark. Each task is defined by the bias α on the target's position p_t observed by the agent. x_t is the observation made by the agent at time-step t and a_t its action. For these examples, a_t falls outside the target area (the zone delimited by the dashed lines), and thus the reward r_t received by the agent is equal to −|a_t − p_t| and p_{t+1} = p_t. If the agent had taken an action near the target, it would have received a reward equal to 10 and the position of the target would have been re-sampled uniformly in [−5 − α, 5 − α]. https://doi.org/10.1371/journal.pone.0227922.g003

Fig 5. Sketch of a time-step interaction between an agent and the two different tasks M (A and B) sampled in D for the third benchmark. Each task is defined by the attribution of a positive reward to one of the two targets (in blue) and a negative reward to the other (in red). At each time-step, the agent outputs an action a_t which drives the direction of its next move. If the agent reaches a target, it receives the corresponding reward. https://doi.org/10.1371/journal.pone.0227922.g005

Fig 6. Mean (± std, shaded) sum of rewards obtained over 15 training runs with different random seeds, with respect to the episode number. Results of benchmarks 1, 2 and 3 are displayed from left to right. The plots are smoothed with a running mean over 1000 episodes. https://doi.org/10.1371/journal.pone.0227922.g006

Fig 8. Adaptation capabilities of the NMN architecture on benchmark 2. A. Temporal evolution of the neuromodulatory signal z with respect to α[3], gathered over 1000 different episodes. As α[3] is an angle, the plot is projected in polar coordinates for better interpretability of the results. Each dimension of z corresponds to a different radius. B. The value of the scale factors with respect to α[3] for each neuron of a hidden layer in the main network. Again, the plot is projected in polar coordinates. For a given α[3], the values of the neurons' scale factors are given by the radius. C. Average reward obtained at each time-step by the agent during those episodes. Note that after an average of 40 time-steps, the agent is already achieving decent performance even though z has not yet converged. https://doi.org/10.1371/journal.pone.0227922.g008

Fig 9. Adaptation capabilities of the NMN architecture on benchmark 3. A. Temporal evolution of the neuromodulatory signal z with respect to α[5], gathered over 1000 different episodes. Note that the neuromodulatory signals go from uniform distributions over all possible α values (i.e., the different contexts) to non-uniform, adapted (w.r.t. α) distributions along with an increase of the rewards. B. The value of the scale factors with respect to α[5] for the 5 neurons of a hidden layer in the main network whose scale factors are the most correlated with α[5]. C. Average number of good and bad target hits at each time-step during those episodes. On average, after 15 time-steps, the agent starts navigating towards the correct target while avoiding the wrong one. https://doi.org/10.1371/journal.pone.0227922.g009

Fig 11 shows the comparison between using sigmoids as the main network's activation function instead of sReLUs. As one can see, sigmoid activation functions lead to results that are worse than or equivalent to those obtained with sReLUs, be it for RNNs or NMNs. In particular, the NMN architecture appears more robust to the change of activation function than RNNs, as the difference in performance between the two activation functions is smaller for NMNs.

Fig 10. Analysis of the agent's behaviour when freezing and unfreezing the neuromodulation signal and when changing the task within an episode. The green reference is attributed a reward of 100 while the red one is attributed a reward of −50. Each blue arrow represents the movement of the agent for a given time-step. (a) shows the behaviour with z fixed at its initial value. In (b) we unlock z. Then, in (c) we lock z at its current value. Finally, in (d) we switch the references before unlocking z once again in (e). https://doi.org/10.1371/journal.pone.0227922.g010

Fig 11. Mean (± std, shaded) sum of rewards obtained over 15 training runs with different random seeds, with respect to the episode number. Results of benchmarks 1, 2 and 3 are displayed from left to right. The plots are smoothed with a running mean over 1000 episodes. https://doi.org/10.1371/journal.pone.0227922.g011

R^{π_{θ_k}}_{M_i} = Σ_t γ^t r_t, where r_t are the rewards gathered at each time-step. To have a properly performing actor-critic algorithm, the value chosen for L has to be sufficiently large to produce an accurate estimation of the returns R^{π_{θ_k}}_{M_i} ∀i ∈ [B*k, …, B*(k+1)−1] obtained by the policy π_{θ_k}.
To define the loss functions used to compute θ_{k+1} and ψ_{k+1}, only the GAE terms corresponding to time-steps [0, …, L′] of episodes [B*k, …, B*(k+1)−1] are computed. A tabular version of the algorithm used to compute these terms is given in Algorithm 3 of Appendix 2.1.¹

Fig 1. Graphical representation of the 5 different cases when playing a 1 .

Table 1. Values of the hyper-parameters that are kept constant for every benchmark in this paper. Table 1 provides the values of all the hyper-parameters used for training.