
Conceived and designed the experiments: PC SW CZ. Performed the experiments: MS SW. Analyzed the data: PC MS SW. Wrote the paper: PC SW.

Current address: Diagnostic Radiology Department, Clinical Center, National Institutes of Health, Bethesda, Maryland, United States of America

The authors have declared that no competing interests exist.

Cooperation plays a key role in the evolution of complex systems. However, the level of cooperation varies extensively with the topology of agent networks in the widely used models of repeated games. Here we show that cooperation remains rather stable when the reinforcement learning strategy adoption rule, Q-learning, is applied to a variety of random, regular, small-world, scale-free and modular network models in repeated, multi-agent Prisoner's Dilemma and Hawk-Dove games. Furthermore, using the above model systems we found that other long-term learning strategy adoption rules also promote cooperation, while introducing a low level of noise (as a model of innovation) into the strategy adoption rules makes the level of cooperation less dependent on the actual network topology. Our results demonstrate that long-term learning and random elements in the strategy adoption rules, when acting together, extend the range of network topologies enabling the development of cooperation at a wider range of costs and temptations. These results suggest that a balanced duo of learning and innovation may help to preserve cooperation during the re-organization of real-world networks, and may play a prominent role in the evolution of self-organizing, complex systems.

Cooperation is necessary for the emergence of complex, hierarchical systems

Small-world, scale-free or modular network models, which all give a chance to develop the complexity of similar, yet diverse agent-neighborhoods, provide a good starting point for the modeling of the complexity of cooperative behavior in real-world networks

As an illustrative example of the sensitivity of cooperation to network topology, we show cooperating agents after the last round of a ‘repeated canonical Prisoner's Dilemma game’ (PD-game) on two almost identical versions of a modified Watts-Strogatz-type small-world model network

The modified Watts-Strogatz small-world network was built on a 15×15 lattice, where each node was connected to its eight nearest neighbors. The rewiring probabilities of the links placed originally on a regular lattice were 0.01 (left panels) and 0.04 (right panels), respectively. For the description of the canonical repeated Prisoner's Dilemma game, as well as the best-takes-over (top panels) and Q-learning (bottom panels) strategy adoption rules see

In contrast to the general sensitivity of cooperation to the topology of agent networks in PD-games using the short-term strategy adoption rule shown above, when the long-term, reinforcement learning strategy adoption rule, Q-learning, was applied, the level and configuration of cooperating agents showed a surprising stability (cf. the bottom panels of

Extending the observations shown on

Small-world (SW, filled, red symbols) networks were built as described in the legend of the figure above. For the description of the strategy adoption rules shown in the panels (p_{innovation} = 0.0002, bottom panel), see

Next we wanted to see whether other long-term strategies besides Q-learning can also promote cooperation between agents. In Q-learning, agents consider a long-term experience learned in all the past rounds of the play. Therefore, we modified the best-takes-over strategy adoption rule, allowing the agents to use the accumulative rewards of their neighbors in all past rounds instead of the reward received just in the last round. In agreement with our expectations, on both small-world and scale-free networks this long-term strategy adoption rule outperformed its short-term version, allowing a larger number of agents to cooperate, especially at high temptation values. Importantly, the differences between the cooperation levels observed in small-world and scale-free networks were even greater when we applied the long-term strategy adoption rule compared to its short-term version (middle panel of

Next we tested whether the innovative elements of the Q-learning strategy adoption rule may contribute to the stability of cooperation in various network topologies. For this, we constructed an ‘innovative’ version of the long-term, ‘non-innovative’ best-takes-over strategy adoption rule by adding a low level of randomness instructing agents to follow the opposite of the selected neighbor's strategy with a pre-set probability, p_{innovation}.

We have shown so far that long-term, learning strategy adoption rules help the development of cooperation, while ‘innovative’ strategy adoption rules make the cooperation level more independent from the actual network topology.

(Top middle panel) The small-world (spheres) and scale-free (cones) model networks were built as described in the legends of

In summary, our simulations of two different games on a large number of model network topologies showed that long-term learning strategy adoption rules promote cooperation, while innovative elements make the appearance of cooperation less dependent on the actual network topology. We must emphasize that the term ‘learning’ is used in our paper in the sense of the collection and use of information enriching and diversifying game strategy and behavior, and not in the restricted sense of imitation, or directed information flow from a dominant source (the teacher) pauperizing the diversity of game strategies. The help of learning in promoting cooperation is already implicitly involved in the folk theorem, which opens the theoretical possibility for the emergence of cooperation in infinitely repeated games

We use the term ‘innovation’ in the sense of irregularities in the selection of adoption rules of game strategy. Therefore, ‘innovation’ may be caused by errors, mutations, mistakes, noise, randomness and temperature besides the

Cooperation helps the development of complex network structures

Our current work can be extended in a number of ways. The complexity of the game-sets and network topologies offers a great opportunity for a detailed equilibrium analysis, similar to that described by Goyal and Vega-Redondo

hub-rewiring including the formation and resolution of ‘rich-clubs’, where hub-hub contacts are preferentially formed

emergence of modularity, beyond our data in ESM1 Figure S1.4;

appearance and disappearance of bridge-elements between modules;

changes of modular overlaps and module hierarchy, etc.

Tan

In both the Hawk-Dove and the Prisoner's Dilemma games, each agent had two choices: to cooperate or to defect. In the repeated, multi-agent Hawk-Dove game the benefit of defectors is higher than that of cooperators when defectors are at low abundance, but falls below the cooperators' benefit when defectors reach a critical abundance

In the Hawk-Dove games (or in the conceptually identical Snowdrift and Chicken games

In Hawk-Dove games T>R>S>P, in the extended (also called ‘weak’) Prisoner's Dilemma game T≥R>P≥S, while in the canonical Prisoner's Dilemma game T>R>P>S. This gives the following order of games from less to more stringent general conditions, allowing less and less cooperation: Hawk-Dove game > extended Prisoner's Dilemma game > canonical Prisoner's Dilemma game. Due to this general order, we showed the results of the canonical Prisoner's Dilemma game in the main text, and inserted the results of the two other games into the
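To make the three payoff orderings concrete, the following minimal Python sketch encodes example payoff matrices satisfying each ordering; the numerical values are illustrative choices only and are not the parameter ranges used in our simulations.

```python
def payoffs(T, R, P, S):
    # Payoffs from the focal player's perspective:
    # (my move, opponent's move) -> my payoff; 'C' = cooperate, 'D' = defect.
    return {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

hawk_dove    = payoffs(T=1.5, R=1.0, S=0.5, P=0.0)   # T > R > S > P
weak_pd      = payoffs(T=1.5, R=1.0, P=0.0, S=0.0)   # T >= R > P >= S
canonical_pd = payoffs(T=1.5, R=1.0, P=0.1, S=0.0)   # T > R > P > S
```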

In our simulations each node in the network was an agent, and the agent could interact only with its direct neighbors. Agents remained at the same position throughout all rounds of the repeated games, and they were neither exchanged nor allowed to migrate. If not otherwise stated, games started with an equal number of randomly mixed defectors and cooperators (hawks and doves in the Hawk-Dove game), and were run for 5,000 rounds (time steps). The payoff for each agent in each round of play was the average of the payoffs it received by playing with all its neighbors in the current round. In our long-term learning strategy adoption rules introduced below, the accumulative payoff means the accumulation of the average payoffs an agent receives in each round of play. The average payoff smooths out possible differences in the degrees of agents, and in several aspects may simulate real-world situations better than the non-averaged payoff, since in real-world situations agents usually have to pay a cost of maintaining a contact with their neighbors
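As an illustration of the per-round averaging described above, the following sketch computes every agent's average payoff for a single round; the networkx graph representation, the PAYOFF values and the function name play_round are our illustrative choices, not our actual simulation code.

```python
import networkx as nx

# Illustrative canonical Prisoner's Dilemma payoffs (T > R > P > S).
PAYOFF = {('C', 'C'): 1.0, ('C', 'D'): 0.0, ('D', 'C'): 1.5, ('D', 'D'): 0.1}

def play_round(graph, strategy, payoff=PAYOFF):
    """Return the average payoff of every agent for one round: each agent plays
    once with every direct neighbor, and its payoff for the round is the mean
    over these pairwise games."""
    avg = {}
    for i in graph.nodes():
        neighbors = list(graph.neighbors(i))
        total = sum(payoff[(strategy[i], strategy[j])] for j in neighbors)
        avg[i] = total / len(neighbors) if neighbors else 0.0
    return avg

# Example: a triangle with two cooperators and one defector.
g = nx.complete_graph(3)
print(play_round(g, {0: 'C', 1: 'C', 2: 'D'}))
```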

In Prisoner's Dilemma and Hawk-Dove games our agents followed three imitation-type, short-term strategy adoption rules: the ‘pair-wise comparison dynamics’ (also called ‘replicator dynamics’), ‘proportional updating’ and ‘best-takes-over’ (also called ‘imitation of the best’) strategy adoption rules

In the ‘pair-wise comparison dynamics’ strategy adoption rule agent i selected one of its neighbors, j, at random and adopted the strategy of neighbor j with probability W = (P_{j} − P_{i})/P_{max} if P_{j} > P_{i} (keeping its own strategy otherwise), where P_{i} and P_{j} denote the average payoffs of agents i and j in the last round, and P_{max} = max(P_{i}, P_{j}).

For the ‘proportional updating’ strategy adoption rule agent i adopted the strategy of one of the agents in its neighborhood N_{i} (including itself) with a probability proportional to the average payoffs received in the last round, i.e. the strategy of agent j was selected with probability p_{j} = P_{j}/Σ_{k ∈ N_{i} ∪ {i}} P_{k}.
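A minimal sketch of these two probabilistic imitation rules is given below; the payoff-difference normalization in the pairwise comparison rule follows our reconstruction above and should be treated as an assumption, as should the helper names.

```python
import random

def pairwise_comparison_update(i, graph, strategy, avg_payoff, rng=random):
    """Agent i picks a random neighbor j and adopts j's strategy with
    probability (P_j - P_i) / max(P_i, P_j) when P_j > P_i (an assumed
    normalization); otherwise it keeps its own strategy."""
    j = rng.choice(list(graph.neighbors(i)))
    pi, pj = avg_payoff[i], avg_payoff[j]
    denom = max(pi, pj)
    if pj > pi and denom > 0 and rng.random() < (pj - pi) / denom:
        return strategy[j]
    return strategy[i]

def proportional_update(i, graph, strategy, avg_payoff, rng=random):
    """Agent i adopts the strategy of a member of its neighborhood (itself
    included) chosen with probability proportional to that member's average
    payoff in the last round."""
    candidates = list(graph.neighbors(i)) + [i]
    weights = [avg_payoff[j] for j in candidates]
    if sum(weights) <= 0:
        return strategy[i]
    return strategy[rng.choices(candidates, weights=weights, k=1)[0]]
```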

In the ‘best-takes-over’ strategy adoption rule (also called the imitation-of-the-best strategy adoption rule) agent i adopted the strategy of the agent in its neighborhood (including itself) that received the highest average payoff in the last round.
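The best-takes-over rule can be sketched as a synchronous update in which every agent copies the highest-earning member of its neighborhood; synchronous updating and the deterministic tie-breaking of Python's max are assumptions of this sketch.

```python
def best_takes_over(graph, strategy, avg_payoff):
    """One synchronous 'best-takes-over' update: each agent adopts the strategy
    of the member of its neighborhood (itself included) that earned the highest
    average payoff in the last round."""
    new_strategy = {}
    for i in graph.nodes():
        candidates = list(graph.neighbors(i)) + [i]
        best = max(candidates, key=lambda j: avg_payoff[j])
        new_strategy[i] = strategy[best]
    return new_strategy
```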

As a reinforcement learning strategy adoption rule we used Q-learning. At each time step t the agent observed its current state s_{t} and selected an action a_{t}; the environment changed to the state s_{t+1} after the action of the agent, and the agent received the reward r_{t}, determined by the transition from s_{t} to s_{t+1} when the agent chose action a_{t}.

The task of the agent was to learn the optimal strategy to maximize the total discounted expected reward. The discounted reward meant that the rewards received by the agent in the future were worth less than those received in the current round. Under a policy π, denoting how the agent selected the action at its actual state, the value of the state s_{t} was the expected discounted sum of future rewards, V^{π}(s_{t}) = E[Σ_{k≥0} γ^{k} r_{t+k}], where 0 ≤ γ < 1 is the discount factor.

The theory of Dynamic Programming guarantees that there exists at least one optimal policy, whose value function is the optimal value function V^{*}, which can be written as V^{*}(s) = max_{a}[r(s, a) + γ Σ_{s′} P(s′|s, a) V^{*}(s′)].

The task of Q-learning was to learn the optimal policy, π, when both the reward function and the transition probabilities were initially unknown. If the environment model (the reward model and the transition probabilities of the states) is known, the above problem can be solved by using Dynamic Programming. Watkins and Dayan introduced the Q-learning algorithm, in which the state-action value function is updated iteratively after each round as Q(s_{t}, a_{t}) ← Q(s_{t}, a_{t}) + α[r_{t} + γ max_{a} Q(s_{t+1}, a) − Q(s_{t}, a_{t})], where α is the learning rate.
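A generic tabular Q-learning agent implementing this update is sketched below; the ε-greedy action selection and the values of α, γ and ε are illustrative assumptions rather than the parameters used in our simulations.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent with epsilon-greedy action selection (a sketch;
    the state encoding and the numerical parameters are illustrative)."""

    def __init__(self, actions=('C', 'D'), alpha=0.1, gamma=0.9, epsilon=0.05):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def choose(self, state):
        # With probability epsilon explore, otherwise pick the greedy action.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(state, action)])
```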

In repeated multi-agent games, the state of each agent was affected by the states of its direct neighbors. Those neighbors constituted the environment of the agent. The reward of agent i at time step t, r_{t}, was the average payoff P_{i}(t) that agent i received by playing with its direct neighbors in round t.

Long-term learning strategy adoption rules were generated by considering the accumulative average payoffs instead of the instantaneous average rewards in the update process during each round of play for all strategy adoption rules used. In both short-term and long-term innovative strategy adoption rules, the agent followed the opposite of the strategy suggested by the adoption rule with a pre-set probability, p_{innovation}.
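Combining the two modifications above, a hedged sketch of an innovative, long-term best-takes-over update could look as follows; cumulative_payoff stands for the accumulated average payoffs, and the function and variable names are illustrative rather than taken from our simulation code.

```python
import random

def long_term_innovative_update(graph, strategy, cumulative_payoff, p_innovation=0.0002):
    """Agents imitate the member of their neighborhood (themselves included)
    with the highest accumulated average payoff; with probability p_innovation
    they adopt the opposite of that strategy instead."""
    flip = {'C': 'D', 'D': 'C'}
    new_strategy = {}
    for i in graph.nodes():
        candidates = list(graph.neighbors(i)) + [i]
        best = max(candidates, key=lambda j: cumulative_payoff[j])
        choice = strategy[best]
        if random.random() < p_innovation:
            choice = flip[choice]
        new_strategy[i] = choice
    return new_strategy
```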

In our work we used a set of widely adopted model networks to simulate the complexity of real-world situations. Generation of the Watts-Strogatz-type small-world model network
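A sketch of the modified Watts-Strogatz-type construction described in the figure legend above (an n×n lattice, each node connected to its eight nearest neighbors, followed by rewiring each link with probability p) is given below; periodic boundary conditions and the rejection of self-loops and duplicate links during rewiring are assumptions of this sketch.

```python
import random
import networkx as nx

def lattice_small_world(n=15, p=0.01, seed=None):
    """Build an n x n lattice in which every node is linked to its eight nearest
    (Moore) neighbors, then rewire each link with probability p to a randomly
    chosen target node."""
    rng = random.Random(seed)
    g = nx.Graph()
    nodes = [(x, y) for x in range(n) for y in range(n)]
    g.add_nodes_from(nodes)
    for x, y in nodes:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) != (0, 0):
                    g.add_edge((x, y), ((x + dx) % n, (y + dy) % n))
    for u, v in list(g.edges()):
        if rng.random() < p:
            new_v = rng.choice(nodes)
            # Skip rewiring attempts that would create self-loops or duplicate links.
            if new_v != u and not g.has_edge(u, new_v):
                g.remove_edge(u, v)
                g.add_edge(u, new_v)
    return g

# Networks corresponding to the two rewiring probabilities discussed above.
g1, g2 = lattice_small_world(p=0.01, seed=1), lattice_small_world(p=0.04, seed=1)
```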

For the visualization, the coordinates of the small-world networks with a rewiring probability of p = 0.01 were used for the p = 0.04 networks to avoid the individual variations of the Pajek figures

This supporting information extends the major findings of the paper to two different games (the extended Prisoner's Dilemma Game and the Hawk-Dove/Snowdrift game) and a wide parameter set, and gives additional methods, discussion and references.

(0.74 MB PDF)

The useful comments of our Editor, Enrico Scalas, referees, including Michael König and Bence Toth, as well as Robert Axelrod, János Kertész, István A. Kovács, Robin Palotai, György Szabó, Attila Szolnoki, Tamás Vicsek and members of the LINK-Group (