Abstract
This paper proposes a deep reinforcement learning algorithm for semi-cooperative multi-agent tasks.
Here, the word “semi-cooperative” means that agents have their own separate reward functions, yet retain some willingness to cooperate.
The algorithm proposed in the paper is called Peer Evaluation-based Dual DQN (PED-DQN); its key idea is to give peer evaluation signals to observed agents.
Here, “peer evaluation signals” quantify how agents strategically value a certain transition, and this valuation can be regarded as an expression of each agent's selfishness.
Introduction
Moving from single-agent to multi-agent settings is difficult; the authors explain this from the following aspect:
“These simultaneous policy updates make the environment’s transition probabilities non-stationary, which in turn makes learning difficult for the agents because a slight change in one’s policy may cause another agent’s policy to perform sub-optimally.”
Fully cooperative cases
Agents agree to cooperate unconditionally, e.g., when they are controlled by a single authority. In these cases, the agents usually learn to maximize a global reward, and thus the reward function itself can naturally induce a certain level of cooperation.
Here, ‘global reward’ means that each agent receives the same (scalar) reward value at each step.
So, the fully cooperative case is one in which all agents cooperate unconditionally under unified central control, and the goal is to maximize a single shared global reward.
Semi-cooperative
In other cases, agents consider their own interests but can also benefit from cooperation; this setting is called “semi-cooperative”:
Agents may each have their own separate reward function, but are willing to cooperate if an incentive for cooperation is appropriately provided.
Semi-cooperative tasks are those with separate per-agent reward functions but whose agents may nonetheless benefit from cooperation.
The key challenge in semi-cooperative tasks is that the separate reward functions represent each agent's selfishness, and this often leads agents to choose non-cooperative actions.
Next, the authors state their goal:
The goal is for the agents to find cooperative policies that will maximize the social welfare which is defined as the sum of the rewards of each agent across the entire episode.
In short, the goal of the whole paper is to maximize social welfare.
2 Model and Goal
Stochastic game: the task is modeled as a stochastic game, in which each agent has its own individual reward.
Notation:
$n$: number of agents.
$a\in A=\{1,2,\dots,n\}$: index of an agent.
$-a$: the set of all other agents, $A\setminus\{a\}$.
$U_{a}$: set of actions of $a$.
$o_{a}$: agent $a$'s partial observation at each time step; it is determined by the observation function $O(s,a)$, where $s\in S$ is the true state.
$\pi_{a}(u_{a}|o_{a})$: policy, which is conditioned on its observation to decide its action $u_{a}\in U_{a}$.
$\pi = [\pi_{a}]_{a\in A}$: the joint policy of all agents.
$O$: the set of observations from all agents at a time.
$u\in U$: joint action of all agents, where $U:=U_{1}\times \cdots \times U_{n}$ is the set of all joint actions. At each step, every agent takes an action; these actions form the joint action $u$, and then a transition occurs.
$P(s'|s,u)$: transition function. $r_{a}(s,u)$: reward function of agent $a$. $\gamma$: discount factor.
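To make the notation concrete, below is a minimal Python sketch of the interface such a partially observable stochastic game implies; the class and function names are illustrative placeholders of my own, not from the paper.

```python
from typing import Callable, Dict, List
from dataclasses import dataclass


@dataclass
class StochasticGame:
    """Minimal skeleton of the partially observable stochastic game above."""
    n: int                                # number of agents, a = 1, ..., n
    action_sets: Dict[int, List[int]]     # U_a: discrete action set of each agent a
    gamma: float = 0.99                   # discount factor

    def observe(self, s, a: int):
        """O(s, a): agent a's partial observation o_a of the true state s."""
        raise NotImplementedError

    def step(self, s, u: Dict[int, int]):
        """Sample s' ~ P(.|s, u) and return (s', {a: r_a(s, u)}) for all agents."""
        raise NotImplementedError


def joint_action(policies: Dict[int, Callable], observations: Dict[int, object]) -> Dict[int, int]:
    """Each agent a samples u_a ~ pi_a(.|o_a); together these form the joint action u."""
    return {a: policies[a](observations[a]) for a in policies}
```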
Semi-cooperative tasks: the set of tasks where each agent may have a separate reward function but may benefit from cooperative strategies.
Goal: Maximizing the social welfare.
Social welfare: the sum of the discounted rewards of all agents over the entire episode. The authors want to maximize it and obtain the socially optimal policy:
\(\pi^{*}=\arg \max_{\pi}\sum_{a\in A}E_{s\sim\rho^{\pi},u\sim\pi}[r_{a}(s,u)]\)
where $E_{s\sim \rho^{\pi}}[\cdot]$ denotes the expectation with respect to the discounted state distribution $\rho^{\pi}$ induced by the initial state distribution and the state transition dynamics.
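To make the objective concrete, here is a small sketch of how social welfare could be estimated from sampled episodes; the function and the trajectory format are my own illustration, not code from the paper.

```python
# Estimate social welfare: the discounted sum of all agents' rewards over an
# episode, averaged across sampled episodes. The trajectory format is assumed.

def estimate_social_welfare(episodes, gamma=0.99):
    """episodes: list of trajectories; each trajectory is a list of per-step
    reward dicts {agent_id: r_a(s, u)}."""
    returns = []
    for trajectory in episodes:
        g = 0.0
        for t, rewards in enumerate(trajectory):
            g += (gamma ** t) * sum(rewards.values())  # sum over agents, discount over time
        returns.append(g)
    return sum(returns) / len(returns)
```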
3 Methods
3.1 Change of Games via Reward Reshaping
Each agent only has access to its partial observation and chooses its action based on its local best response. The authors will change the game by reshaping agents’ reward functions over time, so that the agents’ actions from the local best responses become socially optimal ones that maximize the cooperation effect. These reshaped rewards are called well-coordinated rewards.
Then, run the following loop:
(Denote by $G_{t}$ and $\hat{\mathbf{r}}^{t}=(\hat{r}^{t}_{a}:a\in A)$ the game and its reward function at time step $t = 0, 1, \dots$.)
(i) Compute the optimal policy $\boldsymbol{\pi}^{t}$ from $G_{t}$.
(ii) Evaluate how well-coordinated $G_{t}$ is by evaluating $\boldsymbol{\pi}^{t}$.
(iii) Update from $G_{t}$ to $G_{t+1}$ by updating from $\hat{\mathbf{r}}^{t}$ to $\hat{\mathbf{r}}^{t+1}$, using the ‘evaluation feedback’ from (ii).
(iv) Increment $t$ and go to (i).
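As an illustration, the idealized loop can be written as the following sketch; the three callables are placeholders for the paper's components, whose concrete forms are not specified here.

```python
# Idealized loop (i)-(iv): solve G_t, evaluate coordination, reshape rewards.
# `solve_game`, `evaluate_coordination` and `reshape_rewards` are placeholders.

def idealized_reward_reshaping(game, r_hat, solve_game, evaluate_coordination,
                               reshape_rewards, num_iterations=100):
    for t in range(num_iterations):
        pi = solve_game(game, r_hat)                 # (i)   optimal policy pi^t for G_t
        feedback = evaluate_coordination(game, pi)   # (ii)  how well-coordinated G_t is
        r_hat = reshape_rewards(r_hat, feedback)     # (iii) r^_t -> r^_{t+1}, i.e. G_t -> G_{t+1}
    return pi, r_hat                                 # (iv)  incrementing t is handled by the loop
```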
The convergence time of this loop would be large, so the authors instead take the approach of running the following two updates in parallel:
Policy update: $\boldsymbol{\pi}^{t+1} = F(\boldsymbol{\pi}^{t}, \hat{\mathbf{r}}^{t})$
Reward update: $\hat{\mathbf{r}}^{t+1} = H(\hat{\mathbf{r}}^{t}, \boldsymbol{\pi}^{t})$
In the policy update $F$, each agent's policy $\boldsymbol{\pi}^{t}$ is updated to maximize the current reshaped reward $\hat{\mathbf{r}}^{t}$, which can be done with a reinforcement learning algorithm, e.g., DQN. In the reward update $H$, agents estimate the value of the current policies and update the reshaped reward appropriately.
In short, $F$ corresponds to the policy update, which improves each agent's policy by maximizing the current reshaped reward, while $H$ corresponds to the reward update, which adjusts the reshaped reward based on an evaluation of the current policies.
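The parallel scheme then looks roughly like the sketch below; again, the callables are illustrative placeholders (an actual instantiation of the policy step could be a DQN update), not the paper's concrete procedures.

```python
# Parallel scheme: interleave one policy-update step F and one reward-update
# step H per iteration instead of solving each game G_t to optimality.

def parallel_training(pi, r_hat, rl_step, evaluate_policies, reward_update,
                      num_steps=100_000):
    for t in range(num_steps):
        pi = rl_step(pi, r_hat)                  # F: one RL step (e.g. DQN) on reward r^_t
        estimates = evaluate_policies(pi)        # value estimates of the current policies
        r_hat = reward_update(r_hat, estimates)  # H: reshape the rewards accordingly
    return pi, r_hat
```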
3.2 Reward Update with Peer Evaluation
Each agent $k$ generates a counterfactual evaluation signal (CES) $z^{t}_{k}$ for the observed transition $o^{t}_{k} \to o^{t+1}_{k}$.
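As a hedged sketch only: one plausible way for an agent to quantify how it values an observed transition is the TD error of its own action-value estimate; the paper's exact definition of the CES is not reproduced here, and the function below, its arguments, and the use of a Q-network are my assumptions.

```python
import torch

def evaluation_signal(q_net: torch.nn.Module, o_t: torch.Tensor, u_t: int,
                      r_t: float, o_next: torch.Tensor, gamma: float = 0.99) -> float:
    """Illustrative z_k^t for the transition o_k^t -> o_k^{t+1}: the TD error of
    agent k's own Q-network (assumed to map an observation to a 1-D tensor of
    Q-values, one per action). This is an assumed instantiation, not the
    paper's definition."""
    with torch.no_grad():
        q_su = q_net(o_t)[u_t]                         # Q_k(o_k^t, u_k^t)
        td_target = r_t + gamma * q_net(o_next).max()  # bootstrapped target
    return float(td_target - q_su)                     # how k values this transition
```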