Abstract
Tackling multi-agent environments where each agent has a local limited observation of the global state is a non-trivial task that often requires hand-tuned solutions. A team of agents coordinating in such scenarios must handle the complex underlying environment, while each agent only has partial knowledge about the environment. Deep reinforcement learning has been shown to achieve super-human performance in single-agent environments, and has since been adapted to the multi-agent paradigm. This paper proposes A3C3, a multi-agent deep learning algorithm, where agents are evaluated by a centralized referee during the learning phase, but remain independent from each other in actual execution. This referee’s neural network is augmented with a permutation invariance architecture to increase its scalability to large teams. A3C3 also allows agents to learn communication protocols with which they share relevant information with their team members, allowing them to overcome their limited knowledge and achieve coordination. A3C3 and its permutation invariant augmentation are evaluated in multiple multi-agent test-beds, which include partially-observable scenarios, swarm environments, and complex 3D soccer simulations.
Introduction
A multi-agent system (MAS) is, by definition, an environment where multiple entities (i.e., agents) are interacting. Most complex systems can be modeled as a MAS, from an ant colony to the solar system. It can also be useful to describe a problem as a MAS when a group of agents can solve it, but the problem would otherwise be difficult or impossible for a single monolithic system. Research shows, indeed, that multiple fields explore interactions of agents in MAS, including distributed target tracking [1], autonomous robots [2], and spacecraft formation [3], where a team of agents must coordinate to spatially arrange themselves; traffic monitoring [4] and distributed sensor networks [5], where sensing agents are geographically spread and coordinate to achieve meaningful information; competitive negotiation [6, 7] or decision-support systems [8], where agents interact with other, possibly human, agents; competitions or games [9, 10], where robotic and software agents compete for supremacy, demonstrating the strength and applicability of machine learning algorithms, by achieving agents that surpass the best human players in the world; and even in crowd-modeling areas [11, 12], where MAS can be used to detect anomalies, predict flow, and improve the interactions between robots and humans in real-life scenarios.
Essentially, MAS form the basis of most complex systems around us, and their agents include software agents, physical robots, and humans. Coordination is considered a key characteristic of MAS, and an agent’s capability of coordinating with others constitutes one of its major qualities. In cooperative environments, coordination consists in harmonizing the interactions of multiple agents, such that a global plan can be carried out. The global plan, composed of the sum of each agent’s individual actions, will ideally fulfill each agent’s individual goals or the global objective of the MAS, as efficiently as possible. However, for many systems, defining by hand how an agent should behave in all possible situations and with all possible interactions with others may be prohibitively difficult, time-consuming, or expensive. Ideally, agents would learn their own behaviors without human guidance, and develop their own policies to coordinate with others in the MAS and complete the required task. This is known as multi-agent learning.
Various research efforts [13, 14, 15] have shown, however, that achieving coordination in MAS remains a complex challenge with open questions. A popular technique for multi-agent learning is to apply single-agent machine learning algorithms to each independent agent and demonstrate that successful policies can be learned with implicit coordination. However, theoretical convergence guarantees offered by single-agent algorithms are lost, as the vast majority of algorithms assume a stationary environment [16]. In a MAS, agents must take into account the remaining agents’ policies, which can also adapt their behavior (known as a moving target problem [17]). In other words, each agent must handle all the complexity that exists in a single-agent environment, as well as the additional issues that arise in the multi-agent paradigm. Single-agent challenges include the underlying environment’s complexity (where a task may be very complex to learn), the partial-observability of the environment (where an agent observes only a small local section of the entire environment), or high-dimensional state- or action-spaces (video games like Starcraft II present millions of pixel screen combinations and hundreds of possible actions at any given point). Multi-agent challenges include the moving target problem [17], where multiple independent agents learning simultaneously cause the environment to appear non-stationary from each agent’s perspective, as well as the structural credit assignment problem, where the actions of agents that contribute to a task’s completion may be rewarded in an inadequate manner (e.g., in a soccer match, each agent gets a point when a goal is scored, but not all agents contributed to that goal equally).
To achieve coordination, exchanging information has been shown to be a helpful solution. However, when agents can exchange information between them, learning or defining a communication protocol that best improves the team’s performance remains an open problem. Communication allows agents to share different types of information [18, 19], which helps compensate for the partial observability of the environment, reducing the complexity of the task. However, there is no consensus on how best to determine the communication protocol for any given environment.
Another helpful solution for coordination is for a central referee to evaluate the entire team’s performance and contribution to the task, instead of each agent evaluating this locally, lessening the structural credit assignment problem. However, such solutions often suffer from poor scalability with large amounts of agents, as the referee’s complexity scales with the team size. Methods that can improve the scalability of these referees become valuable in such cases. These include, among others, pre-processing the referee’s input, simplifying information at the cost of requiring a priori domain knowledge, and permutation invariance techniques, which reduce the complexity of the referee by allowing it to ignore the order in which agents are deployed.
This article proposes Asynchronous Advantage Actor-Centralized-Critic with Communication (A3C3), previously introduced in Simões et al. [20], a multi-agent deep learning algorithm. Agents are executed independently, but a centralized critic is used in the learning phase for agents to learn implicit coordination. Agents also learn to communicate through learned communication protocols. Our contributions are three-fold. Firstly, A3C3 is described and evaluated more thoroughly, being compared in multiple complex and partially-observable environments against other state-of-the-art approaches. Secondly, the complexity of A3C3’s critic increases with the amount of agents in the environment, so a network architecture is proposed to alleviate the problem. This architecture focuses on allowing permutation invariance in a deep neural network, which A3C3 can benefit from when the amount of agents in the team is large. It allows A3C3 to be used in swarm-like scenarios with tens of agents. Finally, A3C3 agents learn a communication protocol tabula rasa, through which they share relevant information for other team members. The interpretation of each communication protocol is not trivial, and an analysis of the learned protocols in four environments is presented in this paper. Direct correlations between local observations (e.g., each agent’s location) and the sent messages can be found, demonstrating the meaning behind the protocol.
The remainder of this paper is organized as follows. Section 2 describes related work and current state of the art in the field. Section 3 states the problem, Section 4 describes our main contribution, the A3C3 algorithm, and Section 5 describes a network architecture that can be integrated with A3C3 to increase its scalability through permutation invariance. Section 6 lists the used test-beds, and Section 7 the results obtained in them, with A3C3 and multiple other state-of-the-art algorithms. Finally, Section 8 draws conclusions and lists future work directions.
Related work
Deep Neural Networks are powerful non-linear function approximators and form the basis of Deep Learning (DL) algorithms. At its core, a neural network is a directed graph where nodes represent neurons and are divided into an input layer, an output layer, and one or more hidden layers [21], as can be seen in Fig. 1. How many hidden layers are necessary to distinguish between deep and shallow neural networks is not clearly defined. Each edge of the graph between nodes
An example of a deep neural network with a fully-connected architecture and 
Neural networks and DL algorithms have recently emerged as a popular solution in a multitude of different areas. These include time series forecasting in big data scale [22], which relies on sequences of samples of information to predict events, like flash floods [23] or seismic activity [24, 25], allowing systems to warn humans in a timely manner. Other applications of DL are based on image classification [26, 27], useful for extracting information from images [28], detecting and identifying vehicles in traffic surveillance cameras [29], or structural failure risk in bridges [30, 31], roads [32], and other structures [33, 34]. Neural networks show tremendous potential across tasks where mapping an input (like an image) to a result (like a classification of “dog” or “cat”) is useful. This is partly due to the universal approximation theorem [35], which states that feed-forward neural networks with at least one hidden layer, a suitable non-linear activation function, and sufficiently many hidden units can approximate any continuous function on a compact domain to arbitrary precision.
DL is now commonly used across many different fields, and its potential has been taken advantage of for medical purposes, analyzing patient data [36, 37], predicting physical properties of materials [38], and to achieve super-human results in multi-agent learning problems based on video games [39, 40]. In the latter scenario, DL maps pixel values or game information into game controls (like moving or attacking). These multi-agent learning algorithms rely on Reinforcement Learning, a machine learning category where at least one agent, which behaves as the learner, interacts with the environment. Agents select and perform actions on the environment, which then reaches a new state, from which agents take an observation, and possibly a reward associated with that transition. This can be seen in Fig. 2.
The RL cycle for single-agent systems [41]. At time-step 
A rational agent is trained until it can output a policy
Actor-Critic algorithms merge both value- and policy-based methods. They use an actor to control the agent by learning the optimal policy, and a critic that evaluates the agent’s actions by approximating the optimal value function. DL actor critic methods use two artificial neural networks, one representing the actor, and another the critic. The actor network with weights
Asynchronous Advantage Actor-Critic (A3C) [42] is an asynchronous DL actor-critic algorithm. Global critic and actor networks are asynchronously updated by
Multi-agent deep reinforcement learning proposals introduce new techniques to enforce agent coordination, such as agent communication or centralized learning.
The Counterfactual Multi-Agent Policy Gradients (COMA) algorithm [45] is a centralized training, distributed execution algorithm where the value network is augmented with the observations and actions of the entire team. Agents still run independently, but the centralized value network now models a value function for a stationary environment. The algorithm tackles the credit assignment problem by using its value estimations to evaluate the benefits of specific actions by specific agents. However, COMA does not support heterogeneous reward functions.
The CommNet algorithm [46] is a joint-action algorithm [47], where an artificial neural network outputs the actions for all agents simultaneously, thus not supporting distributed execution. However, the algorithm allows agents to share messages during the feedforward passes of the network, allowing agents to learn policies and a continuous communication protocol simultaneously. CommNet’s joint-action approach leads to poor scalability (as the joint-action space scales exponentially with the amount of agents), and has uncommon and unintuitive hyper-parameters (agents share an arbitrary amount of messages before committing to each joint-action).
Value-Decomposition Networks (VDN) [48] allow agents to learn a factorized joint-action value function based on their independent observations, and the sum of each agent’s estimation approximates the centralized joint-action function. Agents can communicate by concatenating the output of their layers at some points. VDN disregards any additional information available from the environment, and limits the complexity of the centralized joint-action function to a simple sum. QMIX [49] is a VDN-extension, where each agent’s value function is no longer summed to approximate the centralized joint-action function. Instead, an additional mixing network is used to combine each individual value function in a more complex manner, which is also able to incorporate additional environment information.
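The difference between the two factorizations can be sketched in a few lines. This is a minimal NumPy illustration, not the reference implementation of either algorithm: the function names `vdn_mix` and `monotonic_mix`, the fixed random weights, and the single hidden layer with a ReLU are assumptions chosen for brevity. In particular, QMIX produces its mixing weights from the global state, which is omitted here.

```python
import numpy as np

def vdn_mix(agent_qs):
    """VDN-style mixing: the joint value is the plain sum of per-agent values."""
    return float(np.sum(agent_qs))

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """QMIX-style mixing (simplified): weights are forced non-negative, so the
    joint value cannot decrease when a single agent's value increases."""
    hidden = np.maximum(0.0, np.abs(w1) @ agent_qs + b1)  # ReLU is monotonic too
    return float(np.abs(w2) @ hidden + b2)

rng = np.random.default_rng(0)
agent_qs = np.array([0.5, -1.0, 2.0])          # hypothetical per-agent values
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), 0.1

q_vdn = vdn_mix(agent_qs)
q_mix = monotonic_mix(agent_qs, w1, b1, w2, b2)

# Raising one agent's value must not lower the mixed joint value.
bumped = agent_qs + np.array([0.0, 1.0, 0.0])
q_mix_bumped = monotonic_mix(bumped, w1, b1, w2, b2)
```

The sum in `vdn_mix` has no free parameters, which is the limitation noted above, while the mixer adds capacity at the cost of extra weights to learn.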
Other proposals learn communication using a pre-determined vocabulary. The Mordatch algorithm [50] uses recursive neural networks for agents to learn how to maintain a meaningful conversation in cooperative scenarios. Agents communicate through a discrete set of symbols and are optimized with a joint-reward function. REINFORCE [51] is extended with Hierarchical Recurrent Encoder-Decoder by Das et al. [52], such that agents can learn to communicate with a set of natural-language symbols while learning to complete tasks. Lazaridou et al. [53] replace agents’ action-space with an alphabet so that agents learn to communicate in a cooperative task. However, this is no different from learning a cooperative policy.
In conclusion, many deep reinforcement learning algorithms feature undesirable properties for an independent MAS. Algorithms like COMA, CommNet, VDN, and QMIX use centralized critics or output joint-actions, and scale poorly with the amount of agents in the environment. Research has shown that environments with only four agents can be too complex for joint-action approaches like CommNet [54], and centralized critics used by COMA, VDN, and QMIX incorporate all agent observations and may require additional solutions (such as a Permutation Invariant Network architecture, discussed below) when working with large teams. COMA and QMIX also do not take advantage of inter-agent communication, a flexible way for agents to share information. Communication can help compensate for local information and improve the team coordination. A3C3 features a centralized critic with a scalable architecture for large teams, and learns a communication protocol where agents share relevant information to complete the task.
The scope of our work encompasses cooperative multi-agent environments with limited observability. A team of
As an analogy, consider a group of
We consider a centralized learning, distributed execution paradigm [56], where an algorithm has access to all agents’ observations and actions during the training phase, acting as a centralized controller, but where agents behave independently during the execution phase. This allows an algorithm to take advantage of centralized architectures, and incorporate additional information from the environment or from the agents into their policies. We also assume that the amount of agents
We also model agent communication as message transmission, where agents can send vectors of continuous values to others. Messages may be limited by range (e.g., agents can only communicate with other geographically close agents), size and value (a message can range from multiple values within [
Asynchronous advantage actor centralized-critic with communication
This section describes the Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3) algorithm, a multi-agent actor-critic algorithm. A3C3 uses an actor network with weights
In addition to actor and centralized critic networks for each agent
Formally, for agent
A3C3 is also an asynchronous algorithm. This class of algorithms runs multiple parallel workers on a multi-core CPU and has been shown to outperform single-threaded GPU-based algorithms [42]. It allows algorithms to scale horizontally, and increase the amount of updates over time by increasing the amount of workers, without requiring specialized hardware. Asynchronous algorithms keep global networks which are updated by multiple workers, each worker keeping local copies of the networks, and updating the global networks with mini-batches of samples. In the case of single-agent actor-critic algorithms, each worker has local actor and critic networks, which are copies of the global actor and critic networks. In multi-agent actor-critic, there are
The pseudo-code for an A3C3 worker is described in Algorithm 1. Workers create local copies of the global networks, and sample the environment. They compute a message and an action for each agent, execute them, and re-sample the environment. After a mini-batch is collected or a terminal state is found, the loss of local networks is calculated and the corresponding gradients are applied to the global networks. This loop is repeated until agents’ policies have converged. When using inter-agent parameter sharing (i.e., homogeneous agents share the same networks), the optimization steps of each agent’s mini-batch optimize the same global network, thus making the learning phase more efficient.
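The copy, sample, and update pattern of such a worker can be sketched as follows. This is a toy illustration under strong assumptions, not the A3C3 worker itself: a single weight vector stands in for the global actor, critic, and communicator networks, `toy_gradient` replaces the real losses with a squared error towards a hypothetical target, and a lock guards the shared weights for simplicity.

```python
import threading
import numpy as np

class GlobalNetwork:
    """Shared parameters; a stand-in for the global networks."""
    def __init__(self, size):
        self.weights = np.zeros(size)
        self.lock = threading.Lock()

    def snapshot(self):
        with self.lock:
            return self.weights.copy()

    def apply_gradients(self, grads, learning_rate):
        with self.lock:
            self.weights -= learning_rate * grads

def toy_gradient(weights, batch):
    """Placeholder for the real losses: squared error to the batch mean."""
    return 2.0 * (weights - batch.mean(axis=0))

def worker(global_net, seed, updates, batch_size, learning_rate):
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0, 0.5])  # hypothetical optimum
    for _ in range(updates):
        local_weights = global_net.snapshot()             # copy global networks
        noise = 0.1 * rng.normal(size=(batch_size, target.size))
        batch = target + noise                            # collect a mini-batch
        grads = toy_gradient(local_weights, batch)        # gradients on the local copy
        global_net.apply_gradients(grads, learning_rate)  # update global networks

global_net = GlobalNetwork(size=3)
threads = [
    threading.Thread(target=worker, args=(global_net, seed, 200, 8, 0.05))
    for seed in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
final_weights = global_net.snapshot()
```

Each worker may compute gradients from a slightly stale copy of the weights, which is the trade-off asynchronous methods accept in exchange for more updates per unit of time.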
After
As an example of an optimization cycle, consider two vehicles learning to cross an intersection. Vehicles (the agents) know whether they wish to turn or not, and can either move forward, turn, or wait (three possible actions). They can signal each other by turning a light on or off (a binary message with a single value). A3C3 simulates multiple intersections (one per worker), each intersection with two agents trying to cross. Each simulation defines random goals for agents, and agents get heavily penalized for crashing or following the wrong direction, and lightly penalized for waiting at the intersection. As simulations are completed, agents learn a value function, and understand that they get a small penalty for waiting and a large penalty for going the wrong way. When they advance towards their intended destination, they are penalized only on occasion, when the other vehicle also went that direction. The centralized critic learns that agents could avoid large penalties if they did not collide, and so vehicles learn to share the direction they wish to move. By turning on their light when wishing to turn (just like humans turn on a blinker light), both vehicles understand where the other is headed, and the vehicle with priority can advance while the other awaits. Both vehicles realize that it is better to wait while the priority vehicle crosses instead of crashing both. After some thousands of simulations, the agents’ policies stabilize and the environment is successfully completed.
Permutation invariant network architecture
As previously stated, as the amount of agents in the environment increases, the scalability of A3C3 is hindered by the centralized critic. Since this network commonly aggregates observations from all agents (although it is not required to), its complexity is directly proportional to the amount of agents on the team. An additional issue, found with homogeneous teams, is that exchanging agents’ identifiers (and therefore the order in which each agent’s observation is fed to the critic) should not affect the value estimation. However, a neural network approximator is not permutation invariant with respect to each agent’s observation. In an environment with at least two homogeneous agents, the order in which each agent’s observation is aggregated into the input of the network affects its value estimation. The problem is further accentuated when the amount of agents grows to large values.
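The lack of permutation invariance is easy to reproduce. The snippet below is a minimal sketch with assumed values: the two observations, the layer sizes, and the random weights are hypothetical, and the one-hidden-layer `critic` only stands in for a real centralized critic.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_a = np.array([0.2, 0.7])  # hypothetical observation of the first agent
obs_b = np.array([0.9, 0.1])  # hypothetical observation of the second agent

# A tiny critic with one hidden layer over the concatenated observations.
w_hidden = rng.normal(size=(8, 4))
w_out = rng.normal(size=8)

def critic(joint_obs):
    return float(w_out @ np.tanh(w_hidden @ joint_obs))

# Same team state, two different orderings of the agents.
value_ab = critic(np.concatenate([obs_a, obs_b]))
value_ba = critic(np.concatenate([obs_b, obs_a]))
```

Because each input position has its own weights, `value_ab` and `value_ba` differ even though both inputs describe the same team state.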
For example, consider a team of two homogeneous agents, that observe their own positions as
In swarm scenarios, where there are usually tens of homogeneous agents [61], a critic must derive a value function
The question then becomes how to allow a critic network whose input contains a set of agent observations to become permutation invariant to the order of this set. We have demonstrated that it is possible to learn some kind of permutation invariance by feeding a network enough data that it learns to be robust to agent permutations. This is a slow method, and does not scale well.
Another possible option is to maintain the same network structure, and pre-process its input. By ordering the set of agents based on their observations, the network will output the same value estimation regardless of the original set’s order. When each agent’s observation is 1-dimensional, this is a trivial task. However, even in the simple case of a 2-dimensional observation of
This section describes a technique where the network architecture is itself permutation invariant, which has been independently researched as Deep M-Embeddings (DME) [62]. DMEs consist of an initial network
across agent observations. Other possible functions include the maximum function
or the weighted softmax function
where
The architecture of a permutation invariant Central Critic for agent 
The POC suite. (a) Agents (black) have access to their own coordinates and must locate a reward zone (red). That location should be shared with other agents to maximize the team reward. (b) At intersections, purple vehicles wish to turn and white vehicles wish to go forward. They should signal their intent to turn or not, and cross without colliding by following road rules. (c) Predators (squares) observe a small local area, and try to surround and capture prey (circles), which run from the closest predator. (d) Drones know beacon locations and their own position. They should share their positions and decide which drone covers which beacon.
Popular DL frameworks [63, 64] do not feature DME layers, and implementing them by hand can lead to unoptimized or incorrect code. Therefore, we take advantage of pre-existing architectures, specifically convolutional layers, to emulate the behavior of DME. Convolutional layers use a kernel, usually with
We can take advantage of the optimized convolutional layer implementations and shape them to act like DME layers. If each agent observation has a length of
While we integrate this network architecture within A3C3’s critic, it can be used elsewhere. For example, A3C3’s actor networks can benefit from this architecture if agents received broadcast messages from all their teammates. The architecture can also be used outside of A3C3, and algorithms with centralized critics (like COMA, QMIX, or VDN) can benefit from its incorporation.
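A permutation invariant critic of this kind can be sketched as a shared per-agent embedding, followed by an order-independent pooling step and a value head. The sizes, weights, and the name `invariant_critic` below are assumptions for illustration, and mean pooling is assumed as the default aggregation. Applying the same weights to every agent's slice of the concatenated input is also what a 1-D convolution computes when its kernel length and stride both equal the observation length, which is one way an existing convolutional layer could play this role.

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, obs_len, embed_dim = 5, 3, 16    # hypothetical sizes

w_embed = rng.normal(size=(obs_len, embed_dim))  # shared by every agent
w_head = rng.normal(size=embed_dim)

def invariant_critic(joint_obs, pool=np.mean):
    per_agent = joint_obs.reshape(n_agents, obs_len)  # one row per agent
    embedded = np.maximum(0.0, per_agent @ w_embed)   # same weights on every row
    pooled = pool(embedded, axis=0)                   # order-independent aggregation
    return float(w_head @ pooled)

observations = rng.normal(size=(n_agents, obs_len))
shuffled = observations[rng.permutation(n_agents)]    # same agents, new order

value = invariant_critic(observations.ravel())
value_shuffled = invariant_critic(shuffled.ravel())
value_max = invariant_critic(observations.ravel(), pool=np.max)
value_max_shuffled = invariant_critic(shuffled.ravel(), pool=np.max)
```

The embedding weights do not grow with the team size, so the same critic can be reused for a different amount of agents by changing only `n_agents`.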
A3C3 is tested in three environment suites with partially-observable state-spaces, represented as Dec-POMDPs, where agents can only sample incomplete, noisy, or incorrect observations of the underlying state.
Three scenarios of the KiloBots environment. Agents (grey) know their own poses and the relative polar coordinates of other agents and objects (green). (a) Agents must push the object towards the yellow area. (b) Agents must push and join the objects together at any point in the map. (c) Agents must push and separate the objects as far away from each other as possible.
We use a set of diverse multi-agent environments, the Partially-Observable with Communication (POC) suite [65], each requiring information sharing between agents in order to overcome their partial observability. The four environments are shown in Fig. 4.
The Hidden Reward is a classical robotics exploration challenge. Agents explore the map until a target area is found or the time-limit is reached. Agents observe their own location and a sensor indicating whether they’re in the reward zone. The central observation is the concatenation of all agent observations. Ideally, agents spread out to explore the map as fast as possible, and alert teammates once the reward zone is found.
The Traffic environment simulates multiple intersections for a large amount of vehicles which must adhere to priority road rules and avoid crashes or traffic jams. Each vehicle senses the cars immediately around itself, and knows its own preferred route. Vehicles can only communicate with the cars within their vision range. The central observation for each agent is its local observations concatenated with its neighbors’ preferred routes. Ideally, vehicles inform each other of their intent to turn, and the one with priority moves.
The Predator/Prey environment has two teams of agents. Predators with local partial observations must surround and capture intelligent prey, which always run from the closest predator. Predators know their location and can observe a small local area (less than 10% of the full map). Collisions are penalized, and predators are randomly placed when one occurs. The central observation is a global observation of predator and prey coordinates. Ideally, predators explore until a prey is seen, and they quickly converge on that prey, until all are caught.
The Navigation environment has multiple beacons which must be covered, and the team must decide which agent covers which beacon. Agents observe the beacons’ locations, as well as their own. Ideally, they communicate their own location with each other and coordinate to cover both as fast as possible.
The RCSSServer3d Soccer Simulator environment. (a) The Passing scenario, where agents maximize their reward by passing the ball as many times as possible. (b) The Keep-Away scenario, where agents maximize their reward by keeping the ball away from the opponent for as long as possible. (c) The 3dSSL framework, which deploys an RCSS Simulator and 
The KiloBots environment [57] is a set of local continuous observation- and action-space scenarios with simulated physics, emulating the KiloBot robot [66]. Each robot has a diameter of 3 centimeters and moves using two vibration motors. The environment is a great testbed for swarm learning algorithms, whose policies can later be applied to physical KiloBots. The scenarios, shown in Fig. 5, include:
Push – Agents push an object to a target location.
Assemble – Agents push and assemble multiple objects together.
Segregate – Agents push and separate multiple objects apart.
Each KiloBot observes its own pose, as well as the position of the obstacles in the environment (although not their shape or orientation). KiloBots can communicate with the two other closest agents. The central observation concatenates all agent positions, and thus scales linearly with the amount of agents in the environment. The swarm must coordinate to complete the task as fast as possible. In the Segregate and Assemble tasks, a divide and conquer approach is ideal, where the swarm splits into multiple groups.
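Selecting the two closest agents as communication partners can be sketched as below. The helper `closest_neighbors` and the example positions are hypothetical and only illustrate the neighbor rule described above. They are not taken from the KiloBots simulator.

```python
import numpy as np

def closest_neighbors(positions, k=2):
    """Return, for each agent, the indices of its k nearest other agents."""
    diffs = positions[:, None, :] - positions[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(distances, np.inf)  # an agent is not its own neighbor
    return np.argsort(distances, axis=1)[:, :k]

# Four hypothetical agents on a plane.
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
neighbors = closest_neighbors(positions)
# neighbors[i] lists the two agents that agent i can exchange messages with,
# e.g. agent 0 talks to agents 1 and 2.
```

Note that the relation is not symmetric: the isolated agent 3 selects agents 2 and 1, but no other agent selects agent 3.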
The 3D Soccer Simulation League is part of the RoboCup initiative [67], an annual international robotics competition, whose goal is to have a team of fully-autonomous physical robots winning a soccer match against the world-champion human team, using FIFA standard rules, by the year 2050. It is a complex multi-agent environment where two teams of humanoid simulated robots play a ten-minute soccer match using realistic rules. Each team comprises eleven NAO robots [68] of multiple different models, each with different physical characteristics. Each agent perceives the environment through local partial observations consisting of spherical coordinates of other elements in the environment. These include landmarks like the goal posts, field lines, other agents, and the ball. Agents then act upon their own local joints, by sending commands to the environment simulator.
The hyper-parameters used for the tests conducted in this section. This table lists the amount of agents, the amount of workers, the future reward discount, the learning rate, the amount of communication channels (CC), the layer size modifier, and the amount of training episodes. Critic and actor networks used two fully connected hidden layers of
and
nodes activated with a ReLU function. When using DME, the central critic used two convolutional layers of
and
nodes activated with a ReLU function. The communicator network used a single hidden layer with
nodes activated with a ReLU function, and an output layer of CC nodes, activated with a hyperbolic tangent function. The non-received message default value is all zeros.
Comparison of multiple algorithms and a heuristic baseline for the tested environments. The average reward (over 100 test runs) obtained by the team after each algorithm trained for the amount of episodes shown in Table 1. Each algorithm tries to maximize the obtained reward, and the best results are shown in bold. A3C3 is able to match or surpass the baseline heuristic in all environments
Using low-level controllers that abstract simple tasks, like kicking or walking, has been the de facto standard in the league, controllers which are then used by high-level decision-making modules. In other words, agents have a set of behaviors which are chosen according to their strategy. The behavior acts upon the agent’s joints, and the strategy defines which behavior to execute and with which parameters.
The 3D Soccer Simulation League features multiple game types, as shown in Fig. 6. One is the Passing game, where agents pass the ball between them, while receiving a reward for each successful pass. Another game type is the Keep-Away game, where three players keep the ball away from a single opponent for as long as possible, getting points for each pass and a large penalty if the opponent reaches the ball.
We propose the 3d Soccer Simulation Learning (3dSSL) framework, as shown in Fig. 6c, which models the 3D Soccer Simulation League with a Gym interface, and allows multiple environments to run simultaneously with fault-tolerance mechanisms enabled. It supports parallel learning with asynchronous DL or genetic algorithms, through socket communication implemented within the FCPortugal3D team [69], an award-winning team in the League.
Agents observe the estimated positions of all players and the ball, as well as the estimated location the ball will stop at, their orientation, and the distances of all players to the ball. All estimated information is sampled by the mechanisms used in FCPortugal3D. The action-space is no longer a low-level continuous joint-control, but instead a discrete selection of high-level behaviors used in the team. The behaviors used were: standing, where an agent simply stands in the same place; getting up, used when the agent falls on the ground; kicking to an ally, with a choice of which teammate to kick the ball towards and a kick suited to the distance to the target; and moving to position, where each agent has a specific default location based on the team’s formation. Agents may fall when changing behaviors abruptly, and all behaviors take multiple time-steps to complete. 3dSSL also provides additional information to be used by learning algorithms or in tests, but which is not normally accessible to agents in regular play. This includes the agent’s real position, orientation, and acceleration, as well as the ball’s real current position. This information can be used during the learning phase of algorithms that benefit from data that are usually inaccessible to agents.
A3C3’s performance is evaluated in the three multi-agent environment suites described in the previous section. The first suite comprises multiple partially-observable environments, and requires agents to share information in order to successfully complete the task. The second is a swarm simulator, where many simple agents must coordinate to achieve the task. The third is a humanoid soccer simulator, where a team must learn a high-level strategy based on low-level behaviors to win the game.
In the conducted tests, the initial choice of network architecture was based on prior experience in deep reinforcement learning [60], where actor and critic networks had one to three layers of width 10 to 120. A grid-parameter search [22] was then conducted to find the best width (we tested layers of 10, 20, 40, 80, and 120 nodes) and an adequate activation function (Logistic, Rectified Linear Units [70], or Exponential Linear Units [71]). We chose, for each environment, the simplest architecture we found that consistently converged to proper policies. The Glorot initializer [72] is used with default parameters to compute the networks’ initial weights, the Adam optimizer [73] with default parameters to optimize them, and an entropy weight
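The grid search described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the layer count is included in the grid, "simplest" is taken to mean fewest layers and then smallest width, and the `converges` callback is a hypothetical stand-in for training an architecture and checking that it reliably reaches a proper policy.

```python
from itertools import product
from typing import Callable, Dict, Iterator, Optional

# Values taken from the text; including depth in the grid is an assumption.
DEPTHS = [1, 2, 3]
WIDTHS = [10, 20, 40, 80, 120]
ACTIVATIONS = ["logistic", "relu", "elu"]

Config = Dict[str, object]


def candidate_architectures() -> Iterator[Config]:
    """Yield every combination of depth, width, and activation function."""
    for depth, width, activation in product(DEPTHS, WIDTHS, ACTIVATIONS):
        yield {"depth": depth, "width": width, "activation": activation}


def simplest_converging(converges: Callable[[Config], bool]) -> Optional[Config]:
    """Return the smallest candidate accepted by `converges`, or None.

    Candidates are ordered by depth and then width (an assumed notion of
    simplicity); `converges` is a hypothetical training-and-evaluation hook.
    """
    ordered = sorted(candidate_architectures(),
                     key=lambda c: (c["depth"], c["width"]))
    for config in ordered:
        if converges(config):
            return config
    return None


if __name__ == "__main__":
    # Toy stand-in for training: pretend only wide ReLU networks converge.
    toy = lambda c: c["width"] >= 40 and c["activation"] == "relu"
    print(simplest_converging(toy))
```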
This section evaluates A3C3 against other algorithms, using the environments described in the POC suite. All algorithms were executed with similar parameters, to achieve a fair basis of comparison, and were trained for the same number of steps. A3C and PPO are single-agent DL algorithms, and are run as independent agents, as if each agent were in a single-agent non-stationary environment. COMA, QMIX, and VDN are multi-agent DL algorithms that rely on a centralized critic to achieve coordination in a team. An ablation of A3C3 without a centralized critic, called A3C2 [65], is also evaluated here.
The results are shown in Table 2. For all environments in the POC suite, A3C3 is consistently able to overcome partial-observability and learn optimal policies with relevant communication protocols. It also consistently outperforms other similar algorithms without specific over-tuning. Independent algorithms like A3C and PPO are unable to make agents coordinate and underperform in all scenarios. COMA, QMIX, and VDN also perform poorly, and scale badly in the Traffic Simulator environment. These algorithms are based on Q-Learning, and often perform worse than single-agent actor-critic algorithms like A3C or PPO. Only A3C2 and A3C3 achieve successful results in these environments, being the only algorithms that learn communication protocols between agents. A3C3’s centralized critic improves the speed and quality of achieved policies.
A3C3 either outperformed or matched our heuristic baselines. These baselines were achieved with a hard-coded centralized controller that output a joint-action for the entire team, with access to all agent observations. In the Traffic Simulator and Navigation environments, which require little map exploration, A3C3’s behavior closely follows the baseline. In the Hidden Reward and Pursuit environments, A3C3 optimized a better map exploration policy than the baseline, and completed the challenges faster, thus achieving a higher score.
Communication analysis in the POC suite
This section analyses the learned communication protocols in the partially-observable suite. While the protocol may not be fully understood by humans, it is possible to extract some interesting conclusions on what agents are actually communicating and how those messages can be interpreted by teammates.
For easier analysis, we convert the protocol channels into colors (each channel representing one of three RGB values composing a pixel). For example, a
Figure 7 shows a protocol learned in the Hidden Reward environment. Four messages are shown, representing the message an agent outputs in a corresponding location on the map. For instance, the top message is the message sent by an agent to its teammates when at the top right corner of the map. The policy learned by the team consisted of agents forming a vertical line and moving across the
Figure 8 shows the evolution of two protocols learned in the Traffic environment. Both populations learn to clearly distinguish the intention of turning from that of going forward. Interestingly, different populations learn opposite protocols: population A signals its intention of going forward, while population B signals its intention of turning.
Figure 9 showcases, for the Navigation environment, the difference between the initial random protocol that agents use, and the final protocol that is used after policies converge. Initially, each channel behaves without much correlation regarding the agent’s location. In other words, agents cannot properly decide which agent covers which beacon, as they cannot interpret the other agents’ locations from their messages. However, the protocol evolves in a way that agent coordinates can be extracted from their sent messages. When convergence is achieved, some channels have very obvious correlations with the
The evolution of policies in the three tasks of the KiloBots environment. The plots represent the average reward and standard deviation (over 
Figure 10 shows a protocol learned in the Predator/Prey environment. Messages are shown and arranged geographically, representing the message an agent outputs when a prey is found at a corresponding location in its local observation, while the agent is at the center of the map. The center position is the predator itself, and no prey can be there. The policy learned by the team shows distinct patterns for each location, which alerts other predators to the prey’s location. At this point, predators then converge and surround the prey until it is captured. This effectively allows agents to handle the partial-observability of the environment.
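The color coding used throughout this analysis, where each of three channels is read as one RGB value, can be sketched as below. The value range of a message channel is not stated in this section, so the `low` and `high` bounds (defaulting to 0 and 1) are an assumption, as are the function names.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def message_to_rgb(message: Sequence[float],
                   low: float = 0.0, high: float = 1.0) -> RGB:
    """Map a three-channel message to an 8-bit RGB pixel.

    Each channel is clipped to [low, high] (an assumed range) and rescaled
    linearly to the 0..255 interval.
    """
    if len(message) != 3:
        raise ValueError("expected exactly three channels, one per RGB value")

    def scale(value: float) -> int:
        clipped = min(max(value, low), high)
        return round(255 * (clipped - low) / (high - low))

    red, green, blue = (scale(value) for value in message)
    return (red, green, blue)


def rgb_to_hex(rgb: RGB) -> str:
    """Format an RGB triple as a hexadecimal color string for plotting."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


if __name__ == "__main__":
    # A message that is active only on its first channel renders as pure red.
    print(rgb_to_hex(message_to_rgb([1.0, 0.0, 0.0])))
```

Messages that differ in a single channel then differ in a single color component, which is what makes per-location patterns visible when the colors are laid out on the map.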
This section describes the tests conducted to evaluate the scalability of A3C3 with large teams in the KiloBots environment, described in Section 6.2. The continuous action-space of the KiloBots is discretized into a simple set of actions, including rotation, stopping and moving forward. Communication is geographically limited, and agents can only broadcast their messages to the two closest agents.
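The geographically limited broadcast can be illustrated with a small routing sketch, assuming 2D positions, Euclidean distance, and ties broken by agent index. The helper names `closest_agents` and `route_messages` are hypothetical and do not come from the KiloBots simulator.

```python
import math
from typing import Dict, List, Sequence, Tuple

Position = Tuple[float, float]


def closest_agents(positions: Sequence[Position], sender: int, k: int = 2) -> List[int]:
    """Return the indices of the k agents nearest to `sender` (excluding itself)."""
    others = [i for i in range(len(positions)) if i != sender]
    # sorted() is stable, so equidistant agents keep their index order.
    others = sorted(others, key=lambda i: math.dist(positions[sender], positions[i]))
    return others[:k]


def route_messages(positions: Sequence[Position], messages: Sequence[object],
                   k: int = 2) -> Dict[int, List[object]]:
    """Deliver each agent's message to its k closest agents only."""
    inbox: Dict[int, List[object]] = {i: [] for i in range(len(positions))}
    for sender, message in enumerate(messages):
        for receiver in closest_agents(positions, sender, k):
            inbox[receiver].append(message)
    return inbox


if __name__ == "__main__":
    positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
    print(route_messages(positions, ["a", "b", "c", "d"]))
```

Note that under this scheme an isolated agent still sends to its two nearest neighbors but may receive nothing, since reception depends on being among someone else's two closest agents.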
The performance of teams is evaluated by comparing different methods to handle the large number of agents, as described in Section 5. A3C3’s centralized critic concatenates the observations of all
Figure 11 shows the evaluation conducted on all five architectures and on all three scenarios. The analysis of the performance of the teams shows that all architectures can converge to successful solutions. In other words, even with a large number of states, the centralized critic can learn an accurate value estimation for the team. Pre-processing and ordering the input helped in only two of the three scenarios, which indicates that it is not a generally applicable solution. However, the DME methods substantially accelerated the learning phase in all scenarios, and shortened the time taken for the teams to achieve optimal policies. The mean and softmax DME show the best results, and the mean DME is the fastest due to simpler optimizations.
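To illustrate why pooling over agents removes the dependence on input order, the sketch below applies a mean and a softmax-weighted aggregation to a set of per-agent feature vectors. It covers only the pooling step: the network that produces the per-agent embeddings is omitted, and the exact softmax formulation used by the softmax DME is not given in this section, so the per-feature softmax weighting shown here is an assumption.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(embeddings: Sequence[Vector]) -> Vector:
    """Average each feature over all agents; the result ignores agent order."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def softmax_pool(embeddings: Sequence[Vector]) -> Vector:
    """Weight each agent's feature by a softmax over agents, then sum.

    Assumed formulation: weights are computed independently per feature.
    """
    pooled = []
    for column in zip(*embeddings):
        peak = max(column)  # subtract the maximum for numerical stability
        weights = [math.exp(value - peak) for value in column]
        total = sum(weights)
        pooled.append(sum(w * v for w, v in zip(weights, column)) / total)
    return pooled


if __name__ == "__main__":
    team = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
    shuffled = [team[2], team[0], team[1]]
    # Both poolings give the same vector (up to rounding) for any agent order,
    # and the output size does not grow with the number of agents.
    print(mean_pool(team), mean_pool(shuffled))
    print(softmax_pool(team), softmax_pool(shuffled))
```

In contrast with concatenating all observations, the pooled vector has a fixed size regardless of team size, which is the property that lets a centralized critic scale to swarms.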
The average reward (over
test runs) obtained by three different teams in the 3dSSL framework. A3C3 was trained for the number of episodes shown in Table 1, and is compared against the hard-coded FCPortugal3D team and a team of random agents. A3C3 tries to maximize the obtained reward and surpasses the original FCPortugal3D implementation in the Passing scenario. However, it is not as effective in the Keep-Away scenario, where the heuristic baseline carefully calculates the expected time for the enemy to reach the ball and uses that information to better adjust and improve its behavior
This section describes the tests conducted in the 3dSSL framework, described in Section 6.3. As previously stated, a sub-set of FCPortugal3D’s behaviors was used, such that A3C3 learns a high-level policy that takes advantage of the low-level behaviors already existing in the team.
The team’s performance is shown in Table 3, where agents learn effective strategies to complete each scenario. In the Passing challenge, agents group together such that passes are quicker and more accurate. In the Keep-Away scenario, agents do the opposite, and remain in corners of the field, such that the opponent takes a long time to reach the ball. They successfully keep the ball away from the opposing team until the episode reaches a given time-limit. When compared with the hard-coded policies used by the team in competition, the learned policies matched the performance in the Passing scenario, but did not outperform FCPortugal3D’s Keep-Away behavior, likely because the set of available behaviors was less flexible.
Interestingly, agents initially learned to take advantage of the implementation details of the scenarios, and exploited the environments. In the Passing challenge, agents converged in the center of the map and learned to alternately move towards and away from the ball, which made the environment score a successful pass (as a different agent was now closer to the ball). In the Keep-Away challenge, agents discovered that the opponent would not move beyond the field lines, so they converged to a policy where they simply kicked the ball outside the field and no longer needed to move. Both scenarios were fixed to prevent these exploits, by correctly evaluating a successful pass, and by preventing the ball from leaving the field. After these fixes, agents learned to perform actual passes and to play a fair game of keep-away soccer.
Conclusion
This manuscript proposes A3C3, a multi-agent deep reinforcement learning algorithm. It uses a centralized critic to implicitly share information during the learning stage, and can benefit from a permutation invariant network architecture to scale to large team sizes. A3C3 agents learn a communication protocol, through which they share relevant information with team members. The protocols learned by each population are unique, and agents from different populations may be unable to coordinate with one another. All source-code, experiments, environments, and hyper-parameters have been published at
A3C3 is shown to outperform other state-of-the-art algorithms, and is able to match or outperform centralized heuristic controllers in a wide variety of environments. Agents learn high-level policies to complete the task, and their communication protocol is analyzed. Our analysis shows that agents share local information, allowing them to compensate for partial observability. A3C3 is also shown to be able to scale to swarm-size teams, and can be integrated with existing software agents to derive strategies from available behaviors. Given its robustness and generality, A3C3 represents a solid first step into multi-agent reinforcement learning, where independent agents learn to communicate with each other and coordinate to complete a cooperative task.
Future work directions include a distributed implementation where agents and worker threads can be run on different machines, which will further increase the scalability of A3C3. Recurrent neural networks can also be explored, such that agents can better handle partially-observable environments and also learn multi-step communication protocols. Exploring the effects of noise in the communication protocols can also help demonstrate the robustness of A3C3.
Acknowledgments
This work is supported by: Portuguese Foundation for Science and Technology (FCT) under grant PD/BD/113963/2015; Institute of Electronics and Informatics Engineering of Aveiro – IEETA (UIDB/00127/2020); Artificial Intelligence and Computer Science Laboratory – LIACC (UIDB/00027/2020).
