Secure protocols for cumulative reward maximization in stochastic multi-armed bandits

Abstract

We consider the problem of cumulative reward maximization in multi-armed bandits. We address the security concerns that occur when data and computations are outsourced to an honest-but-curious cloud i.e., that executes tasks dutifully, but tries to gain as much information as possible. We consider situations where data used in bandit algorithms is sensitive and has to be protected e.g., commercial or personal data. We rely on cryptographic schemes and propose $UCB - MS$ , a secure multi-party protocol based on the UCB algorithm. We prove that $UCB - MS$ computes the same cumulative reward as UCB while satisfying desirable security properties. In particular, cloud nodes cannot learn the cumulative reward or the sum of rewards for more than one arm. Moreover, by analyzing messages exchanged among cloud nodes, an external observer cannot learn the cumulative reward or the sum of rewards produced by some arm. We show that the overhead due to cryptographic primitives is linear in the size of the input. Our implementation confirms the linear-time behavior and the practical feasibility of our protocol, on both synthetic and real-world data.

Keywords

Security in machine learning; cumulative reward maximization; honest-but-curious cloud; UCB algorithm; AES-GCM symmetric encryption scheme; Paillier’s additive homomorphic asymmetric encryption scheme

1. Introduction

The stochastic multi-armed bandit game is a sequential learning framework where a learning agent aims at maximizing its cumulative reward while successively interacting with an uncertain environment. At each time step, the agent chooses an action (a bandit arm) from a fixed set of actions with unknown associated values. The environment responds with stochastic feedback (a reward) drawn from the distribution associated with the chosen action. The agent uses the received feedback to update its estimate of the value of the chosen action and to decide which action to choose next. The agent continuously faces the so-called exploration-exploitation dilemma: it must decide whether to explore by choosing actions with more uncertain associated values, or to exploit the information already acquired by choosing the action with the seemingly largest associated value. Cumulative reward maximization has already been extensively studied for several multi-armed bandit settings (see [7] for a survey) and for various applications, from clinical trials to online advertising and recommendation systems. In this paper, we address the security concerns that occur when outsourcing the cumulative reward maximization data and computations to the cloud.

Our scenario is inspired by the machine learning as a service cloud computing model, for which security is known as a major concern [6]. As a motivating example, assume:

∙ A data owner: a company that wants to monetize some collected data, while keeping ownership over it. The collected data may be a large quantity of surveys on customer preferences for several products. By product, we mean any type of object or service. The K bandit arms are the surveyed products and only the data owner knows their associated rewards, based on the collected surveys.

∙ A data client: a company that wants to spend some budget to use some of the data owner’s data. The data client may be a small company that cannot afford to conduct its own surveys, but wants to estimate the income that it could generate for the products surveyed by the data owner. The cumulative reward captures such information because it sums the rewards produced by each product. The budget N is the number of the data owner’s surveys used to compute the cumulative reward, and the bandit algorithm has to decide how to choose these N surveys in order to maximize the cumulative reward. A larger budget gives a higher accuracy for the largest cumulative reward. The data client only sees the cumulative reward, without knowing the values associated with each arm.

We assume that the interaction between the data owner and the data client is done using the cloud (as shown in Fig. 1), where both data and computations are outsourced. The data owner does the data outsourcing, and the data client interacts directly with the cloud, by sending the budget and receiving the obtained cumulative reward. The outsourced data may be sensitive (e.g., personal, commercial, or medical data). We want the outsourced learning algorithm to be run while protecting data against unauthorized access. The problem that we address is how to allow the data client to obtain precisely the same cumulative reward as with a standard bandit algorithm, Upper Confidence Bound (UCB) [4,7], within a reasonable computation time and while preserving the data security. Indeed, the outsourced data can be communicated over an untrustworthy network and processed on some untrustworthy machines, where malicious cloud users may learn sensitive data that belongs only to the data owner.

Fig. 1.

Outsourcing data and computations.

Privacy-preserving cumulative reward maximization is a hard problem. To solve it, the authors of [16,25,30] use differential privacy, introduced in [13]. However, in these approaches, the returned reward is not the same as the reward obtained for the data client’s budget using standard algorithms. This happens because differential privacy guarantees depend on noise being injected in the input/output. We take a complementary approach by relying on cryptography instead of differential privacy. To the best of our knowledge, our approach is original and its goal is to give security guarantees, while obtaining the same output as standard (non-secure) algorithms. Obtaining the same output securely has a price: the computation time may increase because cryptographic primitives are time-consuming in practice. More precisely, we require that the data owner (which can be seen as an oracle knowing the reward functions associated with each arm) encrypts her data before outsourcing it to the cloud. Then, the cumulative reward maximization algorithm is run directly in the encrypted domain, and the (encrypted) output should be exactly the same as for standard UCB, at the cost of an increased computation time.

From a theoretical point of view, the problem could be straightforwardly solved by using a fully homomorphic encryption scheme [17], which allows computing any function directly in the encrypted domain. However, it remains an open question how to make such a scheme work fast and be accurate in practice. Indeed, the state-of-the-art fully homomorphic systems (SEAL, https://github.com/Microsoft/SEAL, and HElib, http://homenc.github.io/HElib/) yield only approximate results when they work with real numbers, by using the CKKS scheme [8]. Hence, it is not currently possible to program an algorithm such as UCB in a fully homomorphic system and obtain exactly the same result as in the standard, non-encrypted UCB. Moreover, even by assuming that approximate computations are acceptable, the running times of state-of-the-art fully homomorphic algorithms for basic computations are quite large. We provide concrete numbers on this point at the end of Section 4.

Consequently, our challenge is to rely on simpler cryptographic schemes and design a multi-party protocol with several cloud node participants such that each of them can only learn the specific data needed for performing its task and nothing else e.g., if a participant does in clear computations on real numbers, these computations concern data of only one arm, and no other participant has access to this piece of data. Our protocol returns exactly the same cumulative reward as UCB, while satisfying desirable security properties such as: only the data client can see the cumulative reward, which cannot be learned by any cloud node participant nor by an external observer. We precisely characterize our security model and security guarantees later on in the paper. To achieve our goals, we rely on indistinguishable under chosen-plaintext attack (IND-CPA) cryptographic schemes: symmetric encryption AES-GCM [1,28] and asymmetric partially homomorphic Paillier’s scheme [27]. We formally prove the security of our protocol and we precisely characterize the number of needed cryptographic operations.

Table 1

Summary of related work and positioning of our contribution

	Differential privacy	Cryptography
Cumulative reward maximization aka cumulative regret minimization	[16,25,30]	This paper
Best arm identification aka simple regret minimization	Not yet studied to the best of our knowledge	[11]

Related work. Each line in Table 1 corresponds to a standard problem in stochastic multi-armed bandits. The most popular problem is cumulative reward maximization and UCB is a standard algorithm for solving it [4,7]. There is a recent line of research on enhancing algorithms such as UCB with differential privacy [16,25,30]. There are some fundamental differences between this line of work and our work based on cryptography. On the one hand, the running time overhead of differentially-private algorithms is negligible, whereas our approach has an overhead in computation time coming from the use of cryptographic primitives. On the other hand, the cumulative reward returned by differentially-private algorithms is different from the output of standard UCB. Indeed, to obtain differentially-private guarantees for a bandit algorithm, noise is added to the algorithm input or output. Thus, the cumulative reward obtained using a differentially-private algorithm is different from that obtained by the algorithm without privacy guarantees. This is reflected in the regret analysis of the algorithms (where the regret is given by the difference between the cumulative reward obtained by a learning agent and the best possible cumulative reward, obtained by always playing the best arm): the regret of differentially-private bandit algorithms has as overhead an additive [30] or multiplicative factor [16,25] with respect to the regret of their non-private version. In contrast, our cryptography-based algorithm is guaranteed to return exactly the same cumulative reward as standard UCB.

The second line in Table 1 corresponds to a different bandit problem that is best arm identification [3], equivalent to minimizing the simple regret, that is the difference between the values associated with the arm that is actually the best and the best arm identified by the algorithm. From the cryptography point of view, there exists a multi-party protocol [11] that enhances the Successive rejects algorithm [3] for best arm identification with security guarantees that are similar to the ones from this paper. Naturally, the algorithms that are secured (Successive rejects [3] in [11] and UCB [4] in this paper) solve different problems, thus the corresponding secure protocols are different and cannot be reduced to one another.

All related works discussed thus far are for standard stochastic bandit models. Securing cumulative reward maximization algorithms using cryptography has been recently studied for a different bandit model i.e., linear bandits [10], where the arms are vectors and the rewards are unknown linear functions of the arms. The corresponding secure protocols are again different and cannot be reduced to one another.

Regarding the secure multi-party computation literature, we are not aware of any other work considering multi-armed bandits or which could be easily reduced to be used in the context of multi-armed bandits, hence to the best of our knowledge there is no other protocol that is genuinely close to ours. According to Moti Yung’s keynote in CCS 2015 [31], an important factor that one should take into account when designing a secure multi-party computation protocol is “Generality vs. specificity: Secure computation is a general scheme; in reality one has to choose an application, starting from a very real business need, and build the solution from the problem itself choosing the right tools, tuning protocol ideas into a reasonable solution, balancing security and privacy needs vs. other constraints: legal, system setting, etc.” This is precisely our approach: we start with an important and interesting problem setting that to the best of our knowledge is not previously considered in the secure multi-party computation literature (i.e., cumulative reward maximization in stochastic multi-armed bandits), then we design secure multi-party protocols that are guaranteed to return exactly the same result as a standard algorithm, while aiming at a reasonable cryptographic overhead. Several partial or fully homomorphic encryption schemes have been designed to solve the following problem: perform computations over encrypted data [15]. For example, such cryptographic tools have applications to various problem settings from the control domain [14,22,24,29]. The main difference between this line of research and our paper is on the type of manipulated data and the subsequent needed computations, as we detail next. On the one hand, they manipulate vectors of encrypted elements on which they perform linear algebra operations. 
Their approach is to rely on some homomorphic encryption scheme (generally Paillier) that can already do some computations directly in the encrypted domain, and then find a way to securely run some other computations specific to the concrete problem that they study. On the other hand, our data is not vectors of elements that can be manipulated in the encrypted domain. Indeed, multi-armed bandits are a model of interactive learning, where data comes sequentially, based on an interaction with an unknown environment. In the case of the UCB algorithm that we secure in our paper, the unknown environment is K Bernoulli distributions associated to the bandit arms. To generate a reward for an arm i i.e., a call $pull (i)$ , we need to draw a random number and compare it with the expected value $μ_{i}$ of the distribution associated to the arm i. Hence, a challenge of our setting is that there is no practical homomorphic encryption scheme that allows such operations. This is why we chose to split the storage of the K expected values $μ_{i}$ among K nodes $R_{i}$ , each of them being able to manipulate in clear its value. Then, another challenge is how to compute an argmax among the K nodes $R_{i}$ such that these nodes do not share sensitive data with the others. In this context, we proposed a multi-party protocol (UCB-MS) that we have subsequently refined (UCB-MS2) for stronger security guarantees. The computation of the K updated arm scores (each computed by the corresponding $R_{i}$ node) and the computation of the argmax of the K arm scores are repeated for each round of the exploration-exploitation phase. We only use Paillier at the end of the protocols to sum up the local sums of rewards of each node.

Moreover, since multi-armed bandits are a reinforcement learning model hence a machine learning model, a problem setting close to ours is the one of federated learning [21]. Indeed, federated learning is a machine learning paradigm where multiple entities collaborate in solving a learning problem, under the coordination of a central orchestration server. In a very recent federated learning survey [21], secure multi-party computation and homomorphic encryption are mentioned as standard cryptographic techniques to deal with the privacy and security issues in the presence of honest-but-curious adversaries. As mentioned in [21], “to date, federated learning has primarily considered supervised learning tasks where labels are naturally available on each client. Extending FL to other ML paradigms, including reinforcement learning, semi-supervised and unsupervised learning, active learning, and online learning all present interesting and open challenges”. An example of such challenge that we address because it occurs in multi-armed bandits is that there is less data that is available and comes only by interaction with the unknown environment and under a fixed budget.

Summary of contributions and paper organization. This paper is an extension of our conference paper [12], which is to the best of our knowledge the first one to propose a secure multi-party protocol based on the UCB algorithm. We next discuss the organization and contributions of this paper.

In Section 2, we introduce some basic notions: standard UCB algorithm and some cryptographic tools. Then, Section 3 is the core of our contribution:

∙ We propose $UCB - MS$ , a secure multi-party protocol for cumulative reward maximization that guarantees the same cumulative reward as standard UCB.

∙ We show that $UCB - MS$ satisfies desirable security properties that we precisely characterize.

∙ We analyze the theoretical complexity of $UCB - MS$ , by quantifying the number of needed cryptographic primitives: $O (N K)$ AES-GCM encryptions/decryptions, K Paillier encryptions, and one Paillier decryption.

∙ We propose the $UCB - MS2$ refinement, with stronger security guarantees at the price of K more AES-GCM keys and $O (N K)$ more AES-GCM encryptions/decryptions, and the same number of Paillier encryptions/decryptions.

In Section 4, we include a proof-of-concept empirical evaluation that confirms the theoretical complexity, and shows the scalability and practical feasibility of our protocols, on synthetic and real-world data. Finally, we conclude our paper and outline directions for future work in Section 5.

The main novel content of this paper w.r.t. its conference version [12] consists of non-trivial theorems and proofs (Section 3.3.2), as well as the presentation of the $UCB - MS2$ protocol (Section 3.4). The aforementioned material represents significantly more than 30% of new material.

2. Preliminaries

We first recall the UCB algorithm [4]. Then, we briefly present two cryptographic schemes that we use to build our protocols: Paillier asymmetric encryption scheme and AES-GCM symmetric encryption, which are both IND-CPA secure. For ease of readability, we provide in Table 2 a summary of our notation, presented according to the order of appearance in the paper.

Table 2
Table of notation

Symbol	Meaning
K	number of bandit arms
N	budget i.e., number of allowed arm pulls
$[[x]]$	set $\{1, \dots, x\}$
$μ_{i}$	expected value for arm $i, \forall i \in [[K]]$
$pull (i)$	Bernoulli reward generation function; a call to $pull (i)$ returns 1 w.p. $μ_{i}$ and 0 w.p. $1 - μ_{i}$
r	current reward, obtained after a call to $pull (\cdot)$
t	current number of observed rewards, $t \in [[N]]$
$n_{i}$	current number of observed rewards from arm $i, \forall i \in [[K]]$
$s_{i}$	current sum of observed rewards from arm $i, \forall i \in [[K]]$
$B_{i}$	current score of arm $i, \forall i \in [[K]]$
$B_{m}$	current maximum arm score over all $B_{i}$ , $i \in [[K]]$
$i_{m}$	index of the arm having score $B_{m}$ (argmax, ties broken at random)
$DC$	Data Client
$E_{DC} (\cdot)$ / $D_{DC} (\cdot)$	Paillier encryption/decryption with the $pk / sk$ of $DC$
$R_{i}$	Arm node $i, \forall i \in [[K]]$
$AC$	Arm Controller i.e., the controller of $UCB - MS$
$Enc (\cdot)$ / $Dec (\cdot)$	AES-GCM encryption/decryption with symmetric key shared between data owner, $AC$ and all $R_{i}$
${Enc}_{i} (\cdot)$ / ${Dec}_{i} (\cdot)$	AES-GCM encryption/decryption with symmetric key shared between $AC$ and a single $R_{i}$
$σ : [[K]] \to [[K]]$	permutation i.e., function for which every element occurs exactly once as an image value
$σ^{- 1}$	inverse of σ
$y ‖ z$	concatenation of y and z

Upper confidence bound (UCB). UCB is a class of algorithms commonly used when facing the exploration-exploitation dilemma. Each bandit arm is associated with a distribution whose mean is unknown to the learning agent. When pulling an arm, the agent observes an independent reward drawn from the distribution associated to the chosen arm. Specifically, we consider rewards drawn from Bernoulli distributions with expected values $μ_{1}, \dots, μ_{K}$ unknown to the agent. For a chosen arm i, a call to the function $pull (i)$ randomly returns 0 or 1 according to the associated Bernoulli distribution, i.e., the probability of returning 1 is $μ_{i}$ and the probability of returning 0 is $1 - μ_{i}$ . The agent sequentially selects the N arms to be pulled with the goal of maximizing the sum of rewards.

To guide the choice of the learner, arm scores have been proposed [2] to construct upper confidence bounds (UCB) based on the empirical mean of arm-specific rewards and the number of arm pulls. In the class of UCB algorithms, an important breakthrough was the introduction of algorithms with a finite-time analysis [4]. Specifically, in the UCB algorithm [4] presented in Fig. 2, for each arm i, the score $B_{i}$ is an upper-confidence bound on $μ_{i}$ , obtained as the sum of (i) the exploitation term given by the empirical mean of rewards observed from arm i, and (ii) the exploration term, which takes into account the uncertainty. Notice that after each observed reward, scores for all arms are updated, since the exploration term $\sqrt{\frac{2 \ln (t)}{n_{i}}}$ depends on the total number of rewards observed up to the current round t. Thus, an arm i that has been pulled few times (i.e., with small $n_{i}$ ) will have a relatively large exploration term. The score $B_{i}$ is thus an optimistic estimate for the value associated to arm i, since it can be interpreted as the largest statistically plausible mean value associated to arm i, given the observed rewards. As shown in Fig. 2, UCB chooses to pull next the arm with the largest updated $B_{i}$ score, thus following the principle of optimism in the face of uncertainty. This principle suggests following what seems to be the best arm, based on the optimistically constructed scores. The same principle is employed in various sequential decision making problems (see [26] for a survey).

Fig. 2.

UCB algorithm [4].
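As an illustration of the description above, the following is a minimal, non-secure Python sketch of UCB on Bernoulli arms. It is not a transcription of Fig. 2: the function names `pull` and `ucb`, the `seed` parameter, and the exact indexing of t (taken here as the number of rewards observed so far) are assumptions made for this example. The sketch assumes $N \geq K$ and an initialization phase that pulls each arm once.

```python
import math
import random


def pull(mu_i, rng):
    """Bernoulli reward: returns 1 with probability mu_i and 0 otherwise."""
    return 1 if rng.random() < mu_i else 0


def ucb(mu, N, seed=0):
    """Run UCB for a budget of N pulls on Bernoulli arms with means mu.

    Returns the cumulative reward and the per-arm pull counts n_i.
    """
    rng = random.Random(seed)
    K = len(mu)
    n = [0] * K  # n_i: number of observed rewards from arm i
    s = [0] * K  # s_i: sum of observed rewards from arm i

    # Initialization phase: pull every arm once.
    for i in range(K):
        s[i] += pull(mu[i], rng)
        n[i] += 1

    t = K  # number of rewards observed so far (assumed indexing)
    # Exploration-exploitation phase: N - K further pulls.
    while t < N:
        # Score B_i = empirical mean + exploration term sqrt(2 ln(t) / n_i).
        B = [s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]) for i in range(K)]
        B_m = max(B)
        # Argmax with ties broken at random.
        i_m = rng.choice([i for i in range(K) if B[i] == B_m])
        s[i_m] += pull(mu[i_m], rng)
        n[i_m] += 1
        t += 1

    return sum(s), n


cumulative_reward, pulls = ucb([0.0, 1.0], N=100)
```

In this toy run the second arm always returns 1 and the first always returns 0, so the cumulative reward equals the number of pulls of the second arm.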

Next, we introduce a few security tools, while aiming to provide enough background to formally prove the security of our protocols. Before introducing the two cryptographic schemes, we point out that each of them has a security parameter λ that is input to key generation. By $1^{λ}$ we denote the unary representation of λ, which is a standard notation in cryptography. Our security theorems are always asymptotic i.e., they describe the behavior when λ becomes infinitely large. In practice, the security parameter is the length of the keys, for both Paillier and AES-GCM.

Paillier asymmetric encryption [27]. It is an asymmetric partially homomorphic encryption scheme defined by a triple of polynomial-time algorithms $(G, E, D)$ and a security parameter λ such that:

∙ $G (1^{λ})$ generates two prime numbers p and q according to λ, sets $n = p \cdot q$ and $Λ = \mathrm{lcm} (p - 1, q - 1)$ (i.e., the least common multiple), generates the group $(Z_{n^{2}}^{*}, \cdot)$ , randomly picks $g \in Z_{n^{2}}^{*}$ such that $M = {(L (g^{Λ} \bmod n^{2}))}^{- 1} \bmod n$ exists, with $L (x) = (x - 1) / n$ . It sets $sk = (Λ, M)$ , $pk = (n, g)$ , and returns $(sk, pk)$ .

∙ $E_{pk} (m)$ randomly picks $r \in Z_{n}^{*}$ , computes $c = g^{m} \cdot r^{n} \bmod n^{2}$ , and outputs c.

∙ $D_{sk} (c)$ computes $m = L (c^{Λ} \bmod n^{2}) \cdot M \bmod n$ , and outputs m.

Paillier’s cryptosystem is additive homomorphic. Let $m_{1}$ and $m_{2}$ be two plaintexts in $Z_{n}$. The product of the two associated ciphertexts with the public key $pk = (n, g)$, denoted $c_{1} = E_{pk} (m_{1}) = g^{m_{1}} \cdot r_{1}^{n} \bmod n^{2}$ and $c_{2} = E_{pk} (m_{2}) = g^{m_{2}} \cdot r_{2}^{n} \bmod n^{2}$, is the encryption of the sum of $m_{1}$ and $m_{2}$. Indeed, we have:

$$\begin{array}{rl} E_{pk} (m_{1}) \cdot E_{pk} (m_{2}) & = c_{1} \cdot c_{2} \bmod n^{2} \\ & = (g^{m_{1}} \cdot r_{1}^{n}) \cdot (g^{m_{2}} \cdot r_{2}^{n}) \bmod n^{2} \\ & = (g^{m_{1} + m_{2}} \cdot {(r_{1} \cdot r_{2})}^{n}) \bmod n^{2} \\ & = E_{pk} (m_{1} + m_{2}) . \end{array}$$
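To make the additive homomorphism concrete, here is a toy Python sketch of the Paillier scheme as defined above. It is for illustration only and is not secure: the primes 1009 and 1013 are far too small, and fixing $g = n + 1$ (a commonly used valid choice) instead of picking g at random is an assumption of this example. The function names are hypothetical, and the modular inverse via `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets


def L(x, n):
    """L(x) = (x - 1) / n, computed with exact integer division."""
    return (x - 1) // n


def keygen(p, q):
    """Key generation from two given primes (toy sizes, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1  # assumption: fixed choice of g instead of a random one
    M = pow(L(pow(g, lam, n * n), n), -1, n)  # modular inverse mod n
    return (lam, M), (n, g)  # (sk, pk)


def encrypt(pk, m):
    """c = g^m * r^n mod n^2, with a fresh random r in Z_n^*."""
    n, g = pk
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(sk, pk, c):
    """m = L(c^lam mod n^2) * M mod n."""
    lam, M = sk
    n, _ = pk
    return (L(pow(c, lam, n * n), n) * M) % n


sk, pk = keygen(1009, 1013)
c1 = encrypt(pk, 17)
c2 = encrypt(pk, 25)
# Multiplying ciphertexts adds the underlying plaintexts: 17 + 25.
product = (c1 * c2) % (pk[0] ** 2)
total = decrypt(sk, pk, product)
```

Because a fresh r is drawn at every call, encrypting the same plaintext twice yields different ciphertexts with overwhelming probability, while decryption of the product still recovers the sum of the plaintexts modulo n.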

AES-GCM symmetric encryption. AES [1] is a NIST standard for symmetric encryption that encrypts messages of 128 bits. To encrypt messages larger than 128 bits, we use AES with a symmetric encryption mode. Among all existing modes we chose GCM (Galois Counter Mode) [28], which has been recently added to TLS 1.3 (https://datatracker.ietf.org/doc/html/rfc8446). The AES-GCM cryptosystem is defined by a triple of polynomial-time algorithms $(KeyGen, Enc, Dec)$ and a security parameter λ such that $KeyGen (1^{λ})$ generates $Key$, a uniformly random symmetric key of 128, 192 or 256 bits, according to λ. We denote $c = Enc (m)$ the encryption of m and $m = Dec (c)$ the decryption of c with the same symmetric key shared between the participants.

IND-CPA (INDistinguishability under chosen-plaintext attack) [5]. Let $Π = (KeyGen, Encrypt, Decrypt)$ be a cryptographic scheme. The probabilistic polynomial-time (PPT) adversary $A$ tries to break the security of Π. The IND-CPA game, denoted by $EXP (A)$ , works as follows: the adversary $A$ chooses two messages $(m_{0}, m_{1})$ and receives a challenge $c = Encrypt ({LR}_{b} (m_{0}, m_{1}))$ from the challenger who selects a bit $b \in \{0, 1\}$ uniformly at random, and where ${LR}_{b} (m_{0}, m_{1})$ is equal to $m_{0}$ if $b = 0$ , and $m_{1}$ otherwise. The adversary, knowing $m_{0}, m_{1}$ and c, is allowed to perform a polynomial number of computations or encryptions of any messages, using the encryption oracle, in order to output a guess $b^{'}$ of which message is encrypted in the challenge c. Intuitively, Π is IND-CPA if there is no PPT adversary that can guess b with a probability significantly better than $\frac{1}{2}$ . By $α = Pr [b^{'} \leftarrow EXP (A); b = b^{'}]$ , we denote the probability that $A$ correctly outputs her guessed bit $b^{'}$ when the bit chosen by the challenger in the experiment is b. A scheme is IND-CPA secure if $α - \frac{1}{2}$ is a negligible function in λ, where a function γ is negligible in λ, denoted $negl (λ)$ , if for every positive polynomial $p (\cdot)$ and sufficiently large λ, $γ (λ) < 1 / p (λ)$ .
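The game above can be sketched in a few lines of Python. The example below is only an illustration of the experiment, not of any scheme used in the paper: `det_encrypt` is a deliberately insecure, deterministic toy cipher (XOR with a fixed key), and `OracleReplayAdversary` and `ind_cpa_game` are hypothetical names. It shows why a deterministic scheme cannot be IND-CPA: the adversary queries the encryption oracle on $m_{0}$ and compares the answer with the challenge.

```python
import secrets

KEY = secrets.token_bytes(16)  # fixed secret key of the toy scheme


def det_encrypt(m):
    """Toy deterministic 'encryption': XOR with a fixed key (NOT secure)."""
    return bytes(a ^ b for a, b in zip(m, KEY))


class OracleReplayAdversary:
    """Adversary that re-encrypts m0 with the oracle and compares to the challenge."""

    def choose(self):
        self.m0 = b"A" * 16
        self.m1 = b"B" * 16
        return self.m0, self.m1

    def guess(self, challenge, oracle):
        return 0 if oracle(self.m0) == challenge else 1


def ind_cpa_game(adversary, encrypt):
    """One run of the IND-CPA experiment; returns True if the adversary guesses b."""
    b = secrets.randbelow(2)           # challenger's uniformly random bit
    m0, m1 = adversary.choose()        # adversary picks two messages
    challenge = encrypt(m1 if b else m0)
    b_guess = adversary.guess(challenge, encrypt)  # oracle access to encryption
    return b_guess == b


runs = 200
wins = sum(ind_cpa_game(OracleReplayAdversary(), det_encrypt) for _ in range(runs))
```

Here the adversary wins every run, so α equals 1 rather than being close to $\frac{1}{2}$. Randomized schemes avoid this attack: Paillier draws a fresh r for each encryption, and AES-GCM is used with a fresh nonce for each message.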

Both cryptographic schemes mentioned earlier in this section are IND-CPA: (i) Paillier is IND-CPA under the decisional composite residuosity assumption [27], and (ii) AES-GCM is IND-CPA under the assumption that AES is a pseudo-random permutation [5]. In our theorems, the notion of “better than random” is consistent with the aforementioned IND-CPA property. We also point out an additional notation used in the proofs. Similarly to Landau Big O notation, where by convention $O (f)$ can describe any function bounded above by f, we abuse notation and denote by $negl (λ)$ any function negligible in λ. Notably, we have $negl (λ) + negl (λ) = negl (λ)$ and we may write $x + negl (λ)$ instead of $x - negl (λ)$ .
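The identity $negl (λ) + negl (λ) = negl (λ)$ follows directly from the definition of a negligible function. A short derivation, where $γ_{1}$ and $γ_{2}$ are illustrative names (not used elsewhere in the paper) for two functions negligible in λ:

```latex
% Assumption: \gamma_1 and \gamma_2 are negligible in \lambda.
% Fix any positive polynomial p. Then 2p is also a positive polynomial,
% so for sufficiently large \lambda the definition applied to 2p gives:
\gamma_1(\lambda) < \frac{1}{2 p(\lambda)}
\quad \text{and} \quad
\gamma_2(\lambda) < \frac{1}{2 p(\lambda)}
\;\Longrightarrow\;
\gamma_1(\lambda) + \gamma_2(\lambda)
  < \frac{1}{2 p(\lambda)} + \frac{1}{2 p(\lambda)}
  = \frac{1}{p(\lambda)} .
```

Since p was arbitrary, the sum $γ_{1} + γ_{2}$ is itself negligible in λ.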

All theoretical security properties of our protocols also hold if we choose any other IND-CPA symmetric scheme instead of AES-GCM, and any other additive homomorphic IND-CPA asymmetric scheme instead of Paillier. Our choice to rely on the aforementioned schemes is due to practical reasons. AES-GCM is very efficient in practice and implemented in standard libraries for modern programming languages. Paillier is also supported by a number of libraries that can be used in practice.

3. $UCB - MS$: A secure multi-party protocol based on the UCB algorithm

We define the security model in Section 3.1. We propose our secure protocol $UCB - MS$ (Section 3.2), and we analyze its correctness, security, and complexity (Section 3.3). We introduce a refinement of $UCB - MS$ in Section 3.4.

3.1. Security model

As outlined in the Introduction and in Fig. 1, we assume that the data (i.e., the reward functions associated to K bandit arms) and the computations (i.e., the cumulative reward maximization algorithm) are outsourced to an honest-but-curious cloud. This means that the cloud executes tasks dutifully, but tries to extract as much information as possible from the data that it sees. Our model follows the classical formulation in [19] (Ch. 7.5, where honest-but-curious is denoted semi-honest), in particular (i) each cloud node is trusted: it correctly does the required computations, it does not sniff the network and it does not collude with other nodes, and (ii) an external observer has access to all messages exchanged over the network.

The data client indicates to the cloud her budget N and receives the cumulative reward R that the cloud computes using the K arms outsourced by the data owner and the data client’s budget N. The data client does not have to do any computation, except for decrypting R when the data client receives this information encrypted from the cloud. We expect the following security properties:

1. No cloud node can learn the cumulative reward.

2. The data client cannot learn information about the rewards produced by each arm or which arm has been pulled at some round.

3. By analyzing the messages exchanged between different cloud nodes, an external observer cannot learn the cumulative reward, the sum of rewards produced by some arm, or which arm has been pulled at some round.

We give a brief intuition for each property. Property 1 implies that only the data client can see in clear the cumulative reward for which she spends a budget. Property 2 ensures that the data client can see only the information for which she pays, and nothing else. Otherwise, depending on the difficulty of the bandit problem, the data client could estimate the arm values based on the contribution of each arm to the cumulative reward, which would leak information that should be known only by the data owner. Property 3 states that if some curious cloud admin analyzes all messages exchanged over the network, then she should have no clue on any input, output, or intermediate data that is used by the cumulative reward maximization algorithm.

We design a multi-party protocol that satisfies the aforementioned properties by exchanging only encrypted messages, and by splitting the computations among several cloud node participants, each of them having access only to the specific data that it needs for performing its task and nothing else. The challenge is to efficiently split the computations among as few cloud participants as possible, while minimizing the time needed for cryptographic primitives.

Fig. 3.

Overview of $UCB - MS$ . The dashed rectangle is the cloud.

3.2. Overview of $UCB - MS$

In Fig. 3, we present an overview of $UCB - MS$ . There are $K + 1$ cloud participants: K arm nodes $R_{i}$ and a node $AC$ (Arm Controller) that is the controller of the protocol. We assume that the data owner and all cloud participants share the same symmetric AES-GCM key,⁴ used for the encryption function $Enc$ . The data client ( $DC$ ) generates a Paillier key pair $(pk, sk)$ and, for the sake of clarity, we write $E_{DC} (m)$ for $E_{pk} (m)$ . By $[[x]]$ , we denote the set $\{1, \dots, x\}$ , and by $y ‖ z$ we denote the concatenation of y and z.

⁴ It suffices to do the AES key exchange among the concerned nodes only once before starting the actual protocol. This can be done by relying on a standard key exchange protocol e.g., Authenticated Key Exchange. The key agreement is already done when starting the actual protocol, and has no impact on our protocol’s steps detailed in the rest of the section.

$UCB - MS$ works as follows:

Fig. 3(a) (steps 0 and 1). For $i \in [[K]]$ , the data owner outsources to arm node $R_{i}$ the reward function (encrypted with $Enc$ ) associated to arm i. The data client sends to the cloud her budget N.

Fig. 3(b) (steps 2, 3, and 4). This is the core of the protocol, being done during $1 + N - K$ iterations: once for the initialization phase of UCB and $N - K$ times for the exploration-exploitation phase of UCB cf. Fig. 2. At each iteration, $AC$ sends to each arm node $R_{i}$ some pieces of information that dictate what each $R_{i}$ should do. Among these pieces of information (detailed later on in this section) there is a bit indicating whether an arm should be pulled or not. Then, the arm nodes $R_{i}$ are the ones that determine which arm should be pulled, in a multi-party ring-like manner: each arm node $R_{i}$ stores data pertaining to a single arm, and all $R_{i}$ interact to decide which arm should be pulled next, by computing a multi-party argmax on their arm scores $B_{i}$ ; at the end of the ring computation, the last $R_{i}$ in the ring communicates with $AC$ the result of the argmax. The arm nodes communicate in a random order, which changes at each iteration. All messages exchanged between nodes are encrypted with $Enc$ . Although each arm node stores information about its rewards, it never reveals this information to other nodes.

Fig. 3(c) (steps 5 and 6). After spending the data client’s budget, each arm node sends to $AC$ the sum of rewards that it produced, encrypted with $E_{DC}$ . Due to the additive homomorphic property of Paillier cryptosystem, $AC$ is able to sum up the K partial rewards to compute the cumulative reward $E_{DC} (R)$ directly in the encrypted domain. Only the data client can decrypt this information.

We next detail each step and present pseudocode only when the step is not trivial.

Step 0. We recall (cf. Fig. 2) that the data owner knows $μ_{1}, \dots, μ_{K}$ defining K Bernoulli distributions associated to the K arms. The data owner sends to each arm node $R_{i}$ the encrypted value $Enc (μ_{i})$ , for $i \in [[K]]$ . Since the data owner and the cloud share the symmetric key, each arm node $R_{i}$ can decrypt and obtain $μ_{i}$ . Moreover, each node $R_{i}$ initializes to 0 the following two variables that it later updates during the protocol: $s_{i}$ (i.e., sum of rewards for arm i) and $n_{i}$ (i.e., number of times the arm i has been pulled). Additionally, each arm node $R_{i}$ initializes a variable $t = K - 1$ , which is later updated and needed for the computation of $B_{i}$ .

Step 1. The data client sends her budget N to $AC$ .

Pseudocodes of Steps 2, 3, and 4 are presented in Fig. 4.

Step 2. It corresponds to everything except the last two lines in Fig. 4(a) and has $1 + N - K$ iterations. At each iteration, $AC$ sends to the $R_{i}$ nodes a bit $b_{i}$ indicating whether the arm i should be pulled or not. At the first iteration (that corresponds to the initialization phase of UCB cf. Fig. 2), $AC$ sends $b_{i} = 1$ to each arm, and at the next $N - K$ iterations (that correspond to the exploration-exploitation phase of UCB cf. Fig. 2), $AC$ sends $b_{i} = 1$ only to a chosen arm $i_{m}$ and sends $b_{i} = 0$ to all other arms. Moreover, at each iteration, $AC$ generates a permutation $σ : [[K]] \to [[K]]$ (i.e., a function for which every element occurs exactly once as an image value), based on which $AC$ computes two more components that it sends to $R_{i}$ : ${first}_{i}$ , which indicates whether the arm node is the first of the ring and hence should initialize $B_{m}$ and $i_{m}$ , and ${next}_{i}$ , which indicates to which node the updated $B_{m}$ and $i_{m}$ should be sent during Step 3. The arm node that receives 0 on the $next$ component is the last one of the ring and sends $i_{m}$ to $AC$ , which thus knows which arm should be pulled next. All the information that $AC$ sends to $R_{i}$ is thus useful for the ring computation of $i_{m}$ in Step 3. The permutation changes at each $AC$ iteration because it is important to have a random order during the ring communication. Without a random order, it may happen that the last arm in the ring is much better than all others and is almost always pulled, in which case it would have a very good estimate of the cumulative reward.
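As an illustration of Step 2, the following minimal Python sketch shows one way $AC$ could derive the triple $(b_{i}, {first}_{i}, {next}_{i})$ for every arm from a fresh random ring order. The function name `ring_instructions`, the representation of σ as a shuffled list, and the use of Python's `random` module are assumptions made for this sketch; the actual pseudocode is the one in Fig. 4(a), and a real deployment would encrypt each triple with $Enc$ before sending it.

```python
import random

def ring_instructions(K, pulled, rng=random):
    """Build the per-arm messages of one AC iteration (illustrative only).

    K      : number of arms, indexed 1..K
    pulled : set of arm indices that must be pulled at this iteration
    Returns the ring order and a dict arm -> (b_i, first_i, next_i),
    where next_i = 0 marks the last node of the ring.
    """
    order = list(range(1, K + 1))
    rng.shuffle(order)                      # fresh random ring order (sigma)
    msgs = {}
    for pos, arm in enumerate(order):
        b = 1 if arm in pulled else 0       # pull bit
        first = 1 if pos == 0 else 0        # only the first node initializes (B_m, i_m)
        nxt = order[pos + 1] if pos + 1 < K else 0
        msgs[arm] = (b, first, nxt)
    return order, msgs

# Initialization phase: every arm is pulled once.
order, msgs = ring_instructions(4, {1, 2, 3, 4})
# Exploration-exploitation phase: only the chosen arm i_m (here, arm 3) is pulled.
order, msgs = ring_instructions(4, {3})
```

Following the `next` pointers from the unique node with `first = 1` visits every arm exactly once and ends at the node that reports $i_{m}$ to $AC$ .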

Fig. 4.

Pseudocode of $AC$ and $R_{i}$ during steps 2, 3, and 4 cf. Fig. 3(b).

Step 3. This step corresponds to everything except the last two lines in Fig. 4(b). Note that the variable t stores how many arm pulls have been done in total since the beginning of the protocol. As discussed for Step 0, each arm initialized $t = K - 1$ , hence $t = K$ after the first iteration of $AC$ , which makes it possible to compute the first $B_{i}$ values at the end of the initialization phase. Then, during the next $N - K$ iterations of $AC$ , the variable t is incremented, which makes it possible to compute the $B_{i}$ values during the exploration-exploitation phase. To decide which arm has the highest $B_{i}$ and should be pulled at the next iteration, the arm nodes $R_{i}$ do a multi-party ring computation, where the first arm node according to permutation σ (i.e., the only arm node that received ${first}_{i} = 1$ ) initializes the max value $B_{m}$ and the argmax $i_{m}$ . At each ring iteration (Steps 3.1, …, 3.K-1, cf. Fig. 3(b)), the current arm node sends the updated $B_{m}$ and $i_{m}$ to the next arm node cf. σ. Even when $B_{m}$ and $i_{m}$ do not change, it is important to re-encrypt $Enc (B_{m} ‖ i_{m})$ before sending it to the next node, to prevent an external observer from knowing when there is a change in the max and argmax (and hence learning information about which arms are pulled more often). Finally, once the ring computation reaches the last arm node relative to σ (i.e., the only one that received ${next}_{i} = 0$ ), we go to Step 4.
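The ring computation of Step 3 can be sketched in the clear (i.e., in the style of the unencrypted variant, without $Enc$ ) as follows. This is a minimal sketch under stated assumptions: the function name `ring_argmax` is hypothetical, the scores $B_{i}$ are taken as given rather than computed from $s_{i}$ , $n_{i}$ , and t, and the strict comparison used to update the running maximum is an assumption, since the exact test is the one specified in Fig. 4(b).

```python
def ring_argmax(scores, order):
    """Multi-party argmax over a ring, simulated sequentially.

    scores : dict arm -> B_i (each arm node only knows its own entry)
    order  : list of arm indices giving the ring order (sigma)
    Each loop iteration plays the role of one arm node, which sees only
    the running pair (B_m, i_m) and its own score.
    """
    b_m, i_m = None, None
    for pos, arm in enumerate(order):
        if pos == 0 or scores[arm] > b_m:   # the first node initializes the pair
            b_m, i_m = scores[arm], arm
        # In UCB-MS, the pair B_m || i_m would be freshly re-encrypted here
        # before being forwarded, whether or not it changed.
    return i_m, b_m

# Arm 2 has the highest score regardless of where it sits in the ring.
print(ring_argmax({1: 0.2, 2: 0.9, 3: 0.5}, [3, 1, 2]))
```

With a strict comparison, ties between maximal scores are resolved in favor of the arm that comes first in the ring, so a fresh random order at each round yields a random tie-break.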

Step 4. This step corresponds to the last two lines in Fig. 4(b) (the last arm node in the ring sends $Enc (i_{m})$ to $AC$ ), followed by the last two lines in Fig. 4(a) ( $AC$ receives and decrypts the index of the arm to be pulled at the next iteration).

Step 5. Once the budget is spent and no more arm has to be pulled, each arm node $R_{i}$ (for $i \in [[K]]$ ) encrypts with $E_{DC}$ its sum of rewards $s_{i}$ and sends the result $E_{DC} (s_{i})$ to $AC$ .

Step 6. The node $AC$ takes the K ciphertexts $E_{DC} (s_{i})$ received at Step 5, and computes $E_{DC} (R) = E_{DC} (\sum_{i = 1}^{K} s_{i}) = \prod_{i = 1}^{K} (E_{DC} (s_{i}))$ , thanks to the additive homomorphic property of Paillier cryptosystem. Then, $AC$ sends $E_{DC} (R)$ to the data client, who is able to decrypt using $sk$ and hence obtains R.
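The aggregation in Step 6 can be illustrated with a toy Paillier implementation. This is only a sketch under strong assumptions: the primes are tiny and hard-coded (hence completely insecure), the generator is fixed to $g = n + 1$ , and the function names `keygen`, `encrypt`, `decrypt`, and `aggregate` are hypothetical. A real deployment would rely on a vetted cryptographic library and on key sizes matching the security parameter λ.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key pair from two small primes (insecure; illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)        # valid decryption constant when g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    """E(m) = g^m * r^n mod n^2 with g = n + 1 and a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """D(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n

def aggregate(n, ciphertexts):
    """Role of AC: multiply ciphertexts to add the underlying plaintexts."""
    total = 1
    for c in ciphertexts:
        total = total * c % (n * n)
    return total

n, sk = keygen()                        # done by the data client
partial_sums = [3, 0, 7, 12]            # s_i held by each arm node
cts = [encrypt(n, s) for s in partial_sums]
enc_R = aggregate(n, cts)               # AC never sees any s_i in the clear
print(decrypt(n, sk, enc_R))            # 22, the cumulative reward R
```

The product of ciphertexts decrypts to the sum of plaintexts as long as that sum stays below n, which is the additive homomorphic property used by $AC$ .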

3.3. Analysis of $UCB - MS$

Next, we analyze the correctness (Section 3.3.1), security (Section 3.3.2), and complexity (Section 3.3.3) of $UCB - MS$ .

3.3.1. Correctness

We point out that $UCB - MS$ outputs exactly the same cumulative reward as UCB. The computations done in Fig. 4 to maximize the reward are the same as the ones done in Fig. 2. Indeed, if we take $UCB - MS$ , remove all encryptions/decryptions (both symmetric and asymmetric), and communicate all messages in clear between participants, then we obtain a protocol that we call UCB-M, which outputs exactly the same result as $UCB - MS$ . This happens because of the consistency property of the chosen cryptographic schemes i.e., if we encrypt a message M using $Enc$ (or $E_{DC}$ , respectively) to obtain a ciphertext C, and then decrypt C using $Dec$ (or $D_{DC}$ , respectively), we obtain exactly M. Next, to reduce UCB-M to UCB, we simply remove the splitting of tasks among participants and rewrite UCB-M as a sequential algorithm to obtain exactly UCB. In particular, the random permutation σ (that is generated at each round to decide in which order to iterate over arms) reduces to the randomness in the argmax function used in standard UCB cf. Fig. 2 where, if several arms have maximal $B_{i}$ -value, the argmax should be randomly picked among those arms.

3.3.2. Security

In Table 3, we summarize what each participant in $UCB - MS$ knows/does not know. The main properties of our protocol are:

1. No cloud node can learn the cumulative reward and additionally:

(a) Only $AC$ and the pulled arm know which arm is pulled at each round. Arms that are not pulled can guess the pulled arm with average probability $\frac{1}{2} + \frac{1}{2 K}$ .

(b) Only arm node $R_{i}$ knows the sum of rewards for arm i.

2. Only $DC$ knows the cumulative reward, and she knows nothing else.

3. An external observer cannot learn the cumulative reward, the sum of rewards for some arm, or which arm has been pulled at some round.

These properties subsume the list of desirable security properties listed in Section 3.1.

Table 3
What each participant of $UCB - MS$ knows and does not know, with pointers to the relevant theorems

| Participant | Knows | Does not know |
|---|---|---|
| $AC$ | ∙ Arm pulled at each round | ∙ Sum of rewards for some arm and cumulative reward (Th. 1) |
| $R_{i}$ | ∙ Sum of rewards for arm i ∙ Arm pulled at each round, with average probability $\frac{1}{2} + \frac{1}{2 K}$ (Th. 2) | ∙ Sum of rewards of other arm $j \neq i$ and cumulative reward (Th. 3) |
| $DC$ | ∙ Cumulative reward | ∙ Arm pulled at each round (Th. 4) ∙ Sum of rewards for some arm (Th. 5) |
| External observer | ∙ Nothing | ∙ Arm pulled at each round (Th. 6) ∙ Sum of rewards for some arm and cumulative reward (Th. 7) |

Next, we provide formal statements and proofs for the security properties of $UCB - MS$ outlined in Table 3. Before formally stating the theorems, we point out some assumptions.

We recall (cf. Section 3.1) that the participants are honest-but-curious and do not collude. By collusion we mean that cloud nodes put together all their data. If at least 2 of the $R_{i}$ nodes collude, they could learn their respective algorithm inputs (i.e., bandit arm values that only the data owner is supposed to know at the same time) and outputs (i.e., cloud nodes could sum up the partial sums of rewards known by each node), hence our protocol would not satisfy the desirable security properties. However, if at least 2 (but not all) nodes collude, they still cannot know the cumulative reward because they do not know the partial rewards of the nodes that do not collude. A natural approach to get rid of the non-collusion hypothesis is to rely on fully homomorphic encryption (FHE) and do all computations on a single cloud node. The problem is that, as briefly mentioned in the Introduction and detailed in Section 4, currently there is no FHE that works fast and is accurate, hence the FHE is not yet feasible in practice.

During the ring computation (cf. Step 3 in Section 3.2), each arm learns an intermediate max value $B_{m}$ , together with intermediate arm argmax $i_{m}$ ; we assume that the knowledge on intermediate $B_{m}$ and $i_{m}$ by each arm does not leak significant information on the sum of rewards. Our refinement $UCB - MS 2$ (cf. Section 3.4) hides $i_{m}$ during the ring computation to relax the second hypothesis.

Before discussing the security properties for each participant, we introduce some additional notation needed for the theorem statements:

$n_{i, t}$ = the number of times arm i has been pulled until round t.

$s_{i, t}$ = the sum of rewards obtained by arm i until round t.

${data}_{A}^{t}$ = the data to which participant $A$ has access until round t, where $A$ can be a participant from Fig. 3 or the external observer ( $ext$ ). If t is omitted, this denotes the data to which $A$ has access at the end of the protocol.

$A^{pb (.)} (d)$ = the answer of a Probabilistic Polynomial-Time (PPT) adversary $A$ that knows d and tries to solve the problem $pb$ . Depending on the problem, $pb$ can also take some input.

By "negligible in λ", we indicate that our security theorems are asymptotic i.e., they describe the behavior when the security parameter λ of the cryptographic schemes becomes infinitely large.

We next provide theorems that state each non-trivial property from Table 3. We first state a useful lemma, which intuitively says that guessing the cumulative reward with probability better than random is equivalent to guessing the sum of rewards of some arm with probability better than random.

Lemma 1.

Let $A$ be a PPT adversary trying to find the cumulative reward R, and let $B$ be a PPT adversary trying to find the sum of rewards of some arm. Let d be some data, $cr (.)$ be the problem of guessing the cumulative reward, and $sum (.)$ be the problem of guessing the sum of rewards of some arm. We have the following statement: $A^{cr (.)} (d)$ has a non-negligible advantage ⇔ $B^{sum (.)} (d)$ has a non-negligible advantage.

Proof.

⇐ Assume that $B$ can guess the sum of rewards of some arm with probability better than random. Then, $A$ can call $B$ , and hence get the sum of rewards of one arm with probability better than random. From this sum, $A$ can guess a lower bound on the cumulative reward, hence eliminating some possibilities, and thus guessing the cumulative reward with probability better than random.

⇒ If $A$ can guess the cumulative reward with probability better than random, then $B$ can use this cumulative reward as an upper bound on the sum of rewards of some arm, thus having a probability better than random of guessing the sum of rewards of some arm. □

Security of $AC$ . By construction of $UCB - MS$ , $AC$ knows the arm pulled at each round. We state that $AC$ cannot learn the sum of rewards produced by some arm.

Theorem 1.

For an arm $i \in [[K]]$ and a round $t \in [[N - K + 1]]$ , an honest-but-curious $AC$ cannot learn $s_{i, t}$ , given ${data}_{AC}^{t}$ , with a probability better than random. More precisely, for all PPT adversaries $A$ , $\begin{array}{l} | Pr [(i, {\hat{s}}_{i, t}) \leftarrow A^{sum (t)} ({data}_{AC}^{t}); {\hat{s}}_{i, t} = s_{i, t}] - p_{S} (n_{i, t}, s_{i, t}) | \end{array}$ is negligible in λ, where $A^{sum (t)} ({data}_{AC}^{t})$ returns $(i, {\hat{s}}_{i, t})$ in which ${\hat{s}}_{i, t}$ is $A$ ’s guess on $s_{i, t}$ for the arm i (chosen by $A$ ), and $p_{S} (n_{i, t}, s_{i, t})$ is the probability of obtaining a sum of rewards $s_{i, t}$ from $n_{i, t}$ pulls of arm i until round t.

Proof.

Before Step 5 of $UCB - MS$ , $AC$ has access at each round to the indices of the pulled arms. Thus, $AC$ knows $n_{i, t}$ i.e., the number of times the arm i has been pulled until round t. The set of all possible sums of rewards for arm i until round t is $\{0, 1, \dots, n_{i, t}\}$ . We denote by $p_{S} (n_{i, t}, s_{i, t})$ the probability of obtaining the sum of rewards $s_{i, t}$ from $n_{i, t}$ pulls of arm i until round t. Next, we show that the success probability of $AC$ based on ${data}_{AC}^{t}$ is $p_{S} (n_{i, t}, s_{i, t})$ plus an amount negligible in λ.

Since $AC$ has no knowledge on $μ_{i}$ , the property stated in the theorem is respected at each round before Step 5, i.e., for all $t < N - K + 1$ .

We next prove the property for the last round i.e., $t = N - K + 1$ . At the end of $UCB - MS$ , at Step 5, $AC$ receives the values $E_{DC} (s_{1, t}), \dots, E_{DC} (s_{K, t})$ . We prove that retrieving any information about any $s_{i, t}$ from these ciphertexts breaks the IND-CPA property of Paillier’s cryptosystem [27]. At this point of $UCB - MS$ , ${data}_{AC}^{t}$ consists of $E_{DC} (s_{1, t}), \dots, E_{DC} (s_{K, t})$ and the list of arms that have been pulled at each round. Assume there exists a PPT adversary $A$ able, from ${data}_{AC}^{t}$ , to find $s_{i, t}$ for some i with a non-negligible advantage x: $\begin{array}{l} | Pr [(i, {\hat{s}}_{i, t}) \leftarrow A^{sum (t)} ({data}_{AC}^{t}); {\hat{s}}_{i, t} = s_{i, t}] - p_{S} (n_{i, t}, s_{i, t}) | = x + negl (λ) . \end{array}$ In the worst case, each $i \in [[K]]$ has an equal probability of being chosen by $A$ . We also assume that if ${data}_{AC}^{t}$ does not correspond to the data collected by $AC$ during a run of $UCB - MS$ (for instance, if one piece of ${data}_{AC}^{t}$ has been replaced by another unrelated message), then $A$ does not give any advantage. If such an adversary $A$ exists, then we show how to construct an adversary $B$ able to break the IND-CPA property of Paillier.

Let us build an IND-CPA game, in which $B$ chooses two values $m_{0}, m_{1}$ , and sends them to the challenger. The challenger randomly selects $b \in \{0, 1\}$ and answers with $E_{DC} (m_{b})$ . $B$ wins the IND-CPA game if $B$ guesses b with a non-negligible advantage.

To do so, $B$ first creates a simulation of a $UCB - MS$ execution i.e., $B$ creates nodes ${DC}^{'}$ , ${AC}^{'}$ , $R_{i}^{'}$ , and ${DO}^{'}$ , with Bernoulli distributions defined by $μ_{i}^{'}$ of its choice. Then, $B$ runs an execution of $UCB - MS$ on these nodes. Because $B$ controls all the nodes, it knows the sums of rewards $s_{1, t}^{'}, \dots, s_{K, t}^{'}$ , as well as a list L of arms pulled at each round.

As input for the IND-CPA game, $B$ chooses $m_{1} = s_{1, t}^{'}$ and another value $m_{0}$ , different from all $s_{i, t}^{'}$ , sends both values to the challenger, and receives $E_{DC} (m_{b})$ . Then, $B$ computes $E_{DC} (s_{i, t}^{'})$ for each i, and calls $A^{sum (t)} ([E_{DC} (m_{b}), E_{DC} (s_{2, t}^{'}), \dots, E_{DC} (s_{K, t}^{'}), L])$ . The strategy of $B$ is as follows: if $A$ returns $(1, m_{1})$ , then $B$ answers 1. Otherwise, $B$ answers randomly. We next derive the probability of a correct answer by $B$ .

- If $i \neq 1$ (probability $1 - \frac{1}{K}$ ), then $B$ answers randomly and is correct with probability $\frac{1}{2}$ . Hence this branch offers a probability of success of $(1 - \frac{1}{K}) \frac{1}{2}$ .
- If $i = 1$ (probability $\frac{1}{K}$ ), let us consider the value of b.
  - If $b = 0$ (probability $\frac{1}{2}$ ), then we have two cases:
    - If the output of $A$ is $(1, m_{1})$ (probability $p_{S} (n_{1, t}, s_{1, t})$ ), then $B$ answers 1 and it is wrong, hence the probability of success is 0.
    - Otherwise (probability $1 - p_{S} (n_{1, t}, s_{1, t})$ ), $B$ answers randomly and is correct with probability $\frac{1}{2}$ . The probability of success of this branch is $\frac{1}{K} \frac{1}{2} (1 - p_{S} (n_{1, t}, s_{1, t})) \frac{1}{2}$ .
  - If $b = 1$ (probability $\frac{1}{2}$ ), then we have two cases:
    - If the output of $A$ is $(1, m_{1})$ (probability $p_{S} (n_{1, t}, s_{1, t}) + x + negl (λ)$ ), then $B$ correctly answers 1. The probability of success of this branch is $\frac{1}{K} \frac{1}{2} (p_{S} (n_{1, t}, s_{1, t}) + x + negl (λ))$ .
    - Otherwise (probability $1 - p_{S} (n_{1, t}, s_{1, t}) - x - negl (λ)$ ), $B$ answers randomly and is correct with probability $\frac{1}{2}$ . The probability of success of this branch is $\frac{1}{K} \frac{1}{2} (1 - p_{S} (n_{1, t}, s_{1, t}) - x - negl (λ)) \frac{1}{2}$ .

By aggregating the aforementioned cases, the probability α of success of $B$ is:

$$\begin{aligned} α = {} & (1 - \frac{1}{K}) \frac{1}{2} + \frac{1}{K} \frac{1}{2} (1 - p_{S} (n_{1, t}, s_{1, t})) \frac{1}{2} + \frac{1}{K} \frac{1}{2} (p_{S} (n_{1, t}, s_{1, t}) + x + negl (λ)) \\ & + \frac{1}{K} \frac{1}{2} (1 - p_{S} (n_{1, t}, s_{1, t}) - x - negl (λ)) \frac{1}{2} \\ = {} & \frac{1}{2} - \frac{1}{2 K} + \frac{1}{4 K} - \frac{p_{S} (n_{1, t}, s_{1, t})}{4 K} + \frac{p_{S} (n_{1, t}, s_{1, t})}{2 K} + \frac{x}{2 K} \\ & + \frac{1}{4 K} - \frac{p_{S} (n_{1, t}, s_{1, t})}{4 K} - \frac{x}{4 K} + negl (λ) \\ = {} & \frac{1}{2} + \frac{x}{4 K} + negl (λ) \end{aligned}$$

Hence, $B$ has an advantage of $\frac{x}{4 K}$ in the IND-CPA game, which is non-negligible. This is a contradiction with the fact that Paillier is IND-CPA secure. Consequently, there does not exist any PPT adversary $A$ that violates the property stated in the theorem. □
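The aggregation of the branches in the proof of Theorem 1 can be checked with exact rational arithmetic. The sketch below is only a sanity check of the algebra, not part of the proof: the function name `alpha_theorem1` is hypothetical, the negligible terms are dropped, and the sample values of K, $p_{S}$ , and x are arbitrary.

```python
from fractions import Fraction as F

def alpha_theorem1(K, p, x):
    """Success probability of B, summed branch by branch (negl terms dropped).

    K : number of arms
    p : baseline probability p_S(n_{1,t}, s_{1,t})
    x : assumed advantage of the adversary A
    """
    i_not_1 = (1 - F(1, K)) * F(1, 2)                      # A picks an arm other than 1
    b0_other = F(1, K) * F(1, 2) * (1 - p) * F(1, 2)       # b = 0, A does not output (1, m_1)
    b1_hit = F(1, K) * F(1, 2) * (p + x)                   # b = 1, A outputs (1, m_1)
    b1_other = F(1, K) * F(1, 2) * (1 - p - x) * F(1, 2)   # b = 1, A outputs something else
    return i_not_1 + b0_other + b1_hit + b1_other

K, p, x = 5, F(1, 3), F(1, 10)
print(alpha_theorem1(K, p, x) == F(1, 2) + x / (4 * K))    # True
```

The baseline probability $p_{S}$ cancels out, which is why the resulting advantage $\frac{x}{4 K}$ depends only on x and K.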

As a corollary, by Lemma 1 and Theorem 1, we infer that $AC$ cannot learn the cumulative reward with probability better than random.

Security of $R_{i}$ . By construction of $UCB - MS$ , each arm node $R_{i}$ knows its sum of rewards. Moreover, due to the properties of the ring computation, $R_{i}$ knows with average probability $\frac{1}{2} + \frac{1}{2 K}$ the arm to be pulled at the next round (Theorem 2), but it cannot learn the sum of rewards of any other arm (Theorem 3).

Theorem 2.

At the end of round $t \in [[N - K]]$ and before the start of round $t + 1$ , given ${data}_{R_{i}}^{t}$ , the average probability that an honest-but-curious $R_{i}$ guesses the arm to be pulled at round $t + 1$ is $\frac{1}{2} + \frac{1}{2 K}$ .

Proof.

After round t, an arm i can either guess randomly (with a success probability of $\frac{1}{K}$ ), or use the data to which it has access: the partial max $B_{m}$ , the partial argmax index $i_{m}$ , and the next arm in the ring communication. The knowledge of the next arm is useless, as it does not bring any information about any B value. Similarly, the knowledge of $B_{m}$ does not leak more information than $i_{m}$ . Hence, the only useful piece is $i_{m}$ . Based on this only useful piece of data and on the earlier assumption that any information derived from partial argmax data from the previous rounds is negligible, we infer that the best policy for the arm is to bet that arm $i_{m}$ is the arm to be pulled at the next round. Let us consider an arm at position $σ^{- 1} (i)$ , where σ is the ring permutation used at the round t. Its guess is correct if and only if the next arm to be selected, say j, has position $σ^{- 1} (j) ⩽ σ^{- 1} (i)$ . Hence, the arm at position $σ^{- 1} (i)$ has a success probability of $\frac{σ^{- 1} (i)}{K}$ . On average, an arm has a success probability of $\begin{array}{l} \frac{1}{K} \sum_{i = 1}^{K} \frac{σ^{- 1} (i)}{K} = \frac{1}{K^{2}} \frac{K (K + 1)}{2} = \frac{K + 1}{2 K} = \frac{1}{2} + \frac{1}{2 K} \end{array}$ which concludes the proof. □
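The average in the proof of Theorem 2 can be recomputed exactly for concrete values of K. In this minimal sketch, the function name `average_guess_probability` is hypothetical, and the computation assumes, as in the proof, that an arm at ring position p succeeds with probability $\frac{p}{K}$ and that all positions are equally likely.

```python
from fractions import Fraction

def average_guess_probability(K):
    """Average over ring positions 1..K of the success probability p / K."""
    return sum(Fraction(pos, K) for pos in range(1, K + 1)) / K

for K in (2, 5, 10):
    closed_form = Fraction(1, 2) + Fraction(1, 2 * K)
    print(K, average_guess_probability(K), average_guess_probability(K) == closed_form)
```

For $K = 2$ the average is $\frac{3}{4}$ , and it decreases toward $\frac{1}{2}$ as K grows.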

Theorem 3.

For an arm $i \in [[K]]$ and a round $t \in [[N - K + 1]]$ , an honest-but-curious $R_{i}$ cannot learn $s_{j, t}$ for some other arm $j \neq i$ , given ${data}_{R_{i}}^{t}$ , with a probability better than random. More precisely, for all PPT adversaries $A$ , $\begin{array}{l} | Pr [(j, {\hat{s}}_{j, t}) \leftarrow A^{sum (t)} ({data}_{R_{i}}^{t}); {\hat{s}}_{j, t} = s_{j, t}] - p_{R} (n_{i, t}, t, s_{j, t}) | \end{array}$ is negligible in λ, where $A^{sum (t)} ({data}_{R_{i}}^{t})$ returns a tuple $(j, {\hat{s}}_{j, t})$ in which $j \neq i$ is chosen by $A$ and ${\hat{s}}_{j, t}$ is $A$ ’s guess of the sum of rewards for arm j, and $p_{R} (n_{i, t}, t, s_{j, t})$ is the probability that arm j has sum of rewards $s_{j, t}$ at round t, given that arm i has been pulled $n_{i, t}$ times.

Proof.

If an arm i has been pulled $n_{i, t}$ times until round t, then another arm j has been pulled at most $t - n_{i, t}$ times. Hence, a baseline probability of $R_{i}$ to guess the sum of rewards of any other arm j is the $p_{R} (n_{i, t}, t, s_{j, t})$ defined in the theorem statement. The arm node $R_{i}$ cannot possibly guess the sum of rewards for arm j with a better probability because it does not see any useful information that it can leverage. In particular, the only information that $R_{i}$ receives about the rewards of any other arm is the partial max value $B_{m}$ (derived from the sum of arm $i_{m}$ using the number of pulls of $i_{m}$ , to which $R_{i}$ does not have access) received during Step 3. As mentioned earlier, we assume that the information that one arm can derive from one such random B value does not provide any advantage. □

As a corollary, by Lemma 1 and Theorem 3, we infer that $R_{i}$ cannot learn the cumulative reward with probability better than random.

Security of $DC$ . The data client knows the cumulative reward that she can decrypt after Step 6. Moreover, the data client cannot learn the arms selected at some round (Theorem 4) or the sum of rewards for some arm (Theorem 5).

Theorem 4.

For each round $t \in \{2, \dots, N - K + 1\}$ , the data client $DC$ cannot guess which arm is pulled at round t with probability better than random.

Proof.

The data client does not receive any message until the end of $UCB - MS$ (Step 6). By construction of $UCB - MS$ , all arms are pulled at the first round, then from round 2 and until the end of $UCB - MS$ i.e., round $N - K + 1$ , there is a single arm pulled at each round. In particular, the data client does not receive any information on which arm is pulled at some round, hence her best strategy is to answer randomly, with a probability of success of $\frac{1}{K}$ . □

Theorem 5.

For an arm $i \in [[K]]$ , the data client $DC$ cannot guess the sum $s_{i}$ of rewards for the arm i with probability better than random.

Proof.

Similarly to the previous proof, we observe that the data client $DC$ does not receive any message until the end of $UCB - MS$ (Step 6). In particular, $DC$ does not get any information about which arm is selected at some round. Because all arm probability distributions are equiprobable to $DC$ , it is also true that all partitions of the cumulative reward R are equiprobable to $DC$ , thus $DC$ has no advantage in guessing the partition of rewards. Hence, the probability of $DC$ guessing a correct partition of the rewards is equal to $\frac{1}{p (R)}$ , where $p (R)$ is the number of partitions of R. This observation also proves that $DC$ cannot guess the individual sum of rewards of some arm i. If it was the case, then $DC$ would know that some of the partitions are more likely e.g., if $DC$ can guess the sum of rewards $s_{i}$ of the arm i, then all partitions not having $s_{i}$ as the value for arm i would be discarded, which is a contradiction. □

External observer. An external observer sees all messages exchanged between nodes, from which we show that she cannot learn which arm is pulled at some round (Theorem 6) or the sum of rewards for some arm (Theorem 7).

Theorem 6.

For each round $t \in \{2, \dots, N - K + 1\}$ , an honest-but-curious external observer cannot learn which arm is pulled at round t, given ${data}_{ext}^{t}$ , with probability better than random. More precisely, for all PPT adversaries $A$ , $\begin{array}{l} | Pr [A^{pa (t)} ({data}_{ext}^{t}) = i_{m}^{t}] - \frac{1}{K} | \end{array}$ is negligible in λ, where $A^{pa (t)} ({data}_{ext}^{t})$ returns the guess of $A$ on which arm is pulled at round t, and $i_{m}^{t}$ is the true arm pulled at round t.

Proof.

By construction of $UCB - MS$ , all arms are pulled at the first round, then from round 2 and until the end of $UCB - MS$ i.e., round $N - K + 1$ , there is a single arm pulled at each round. We next show that if there exists a PPT adversary with a non negligible advantage in guessing the arm pulled at some round $2 ⩽ t ⩽ N - K + 1$ , then this would break the IND-CPA property of AES-GCM.

An external observer (denoted $ext$ in the sequel) sees all encrypted messages that are exchanged among $UCB - MS$ participants. We denote by ${data}_{ext}^{t}$ this collection of data after round t. We assume, toward a contradiction, that there exists a PPT adversary $A$ able, from ${data}_{ext}^{t}$ , to find the arm $i_{m}^{t}$ pulled at some round t with a non-negligible advantage x: $\begin{array}{l} | Pr [A^{pa (t)} ({data}_{ext}^{t}) = i_{m}^{t}] - \frac{1}{K} | = x + negl (λ) . \end{array}$ We also assume that if ${data}_{ext}^{t}$ does not correspond to an actual collection of encrypted messages that $ext$ sees, then the advantage for such an input is negligible.

We next show that by using the adversary $A$ , we can construct an adversary $B$ able to break the IND-CPA property of AES-GCM. To do so, $B$ creates a simulation of a $UCB - MS$ execution, similarly to the proof of Theorem 1. Even though the messages of such a simulation are encrypted, $B$ knows the keys hence the state of each arm. In particular, $B$ knows in plain text the message sent by $AC$ to the arm pulled at round t. This message is of the form $m_{1} = (1 ‖ {first}_{i, t} ‖ {next}_{i, t})$ , with 1 being the Boolean value saying the arm has to be pulled.

As input for the IND-CPA game, $B$ sends the aforementioned $m_{1}$ and another message $m_{0} = (0 ‖ {first}_{i, t} ‖ {next}_{i, t})$ that it generates based on $m_{1}$ . Then, $B$ receives back $Enc (m_{b})$ , where b is a random bit selected uniformly by the challenger. Next, $B$ calls $A^{pa (t)} ({data}_{ext}^{'})$ , where ${data}_{ext}^{'}$ is the collection of encrypted messages from $B$ ’s simulation, except that $Enc (m_{1})$ is replaced by $Enc (m_{b})$ . The strategy of $B$ is: if $A$ returns the correct $i_{m}^{t}$ , then $B$ returns 1; otherwise, $B$ answers randomly.

- If $b = 0$ (probability $\frac{1}{2}$ ), then $A$ does not receive a correct simulation because no arm is pulled at round t. According to our assumption, $A$ does not give any advantage.
  - If $A$ returns the correct $i_{m}^{t}$ (probability $\frac{1}{K}$ ), then $B$ answers 1 and is wrong.
  - Otherwise (probability $1 - \frac{1}{K}$ ), $B$ answers randomly and is correct with probability $\frac{1}{2}$ . This branch yields a probability of success of $\frac{1}{2} (1 - \frac{1}{K}) \frac{1}{2}$ .
- If $b = 1$ (probability $\frac{1}{2}$ ), then the advantage given by $A$ can be leveraged by $B$ .
  - If $A$ returns the correct $i_{m}^{t}$ (probability $\frac{1}{K} + x + negl (λ)$ ), then $B$ correctly answers 1. The probability of success of this branch is $\frac{1}{2} (\frac{1}{K} + x + negl (λ))$ .
  - Otherwise (probability $1 - \frac{1}{K} - x - negl (λ)$ ), $B$ answers randomly and is correct with probability $\frac{1}{2}$ . This branch yields a probability of success of $\frac{1}{2} (1 - \frac{1}{K} - x - negl (λ)) \frac{1}{2}$ .

By aggregating the aforementioned cases, the probability α of success of $B$ is:

$$\begin{aligned} α = {} & \frac{1}{2} (1 - \frac{1}{K}) \frac{1}{2} + \frac{1}{2} (\frac{1}{K} + x + negl (λ)) + \frac{1}{2} (1 - \frac{1}{K} - x - negl (λ)) \frac{1}{2} \\ = {} & \frac{1}{4} - \frac{1}{4 K} + \frac{1}{2 K} + \frac{x}{2} + \frac{1}{4} - \frac{1}{4 K} - \frac{x}{4} + negl (λ) \\ = {} & \frac{1}{2} + \frac{x}{4} + negl (λ) \end{aligned}$$

Hence, $B$ has an advantage of $\frac{x}{4}$ in the IND-CPA game, which is non-negligible. This contradicts the fact that AES-GCM is IND-CPA secure. Hence, we conclude that there does not exist any PPT adversary $A$ that violates the property stated in the theorem. □
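As for Theorem 1, the branch aggregation in the proof of Theorem 6 can be verified with exact rational arithmetic. This is a sanity check of the algebra only: the function name `alpha_theorem6` is hypothetical, the negligible terms are dropped, and the sample values of K and x are arbitrary.

```python
from fractions import Fraction as F

def alpha_theorem6(K, x):
    """Success probability of B against AES-GCM, summed over the branches.

    K : number of arms
    x : assumed advantage of the adversary A
    """
    # b = 0 and A returns the correct arm: B answers 1 and is wrong (contributes 0).
    b0_miss = F(1, 2) * (1 - F(1, K)) * F(1, 2)       # b = 0, B answers randomly
    b1_hit = F(1, 2) * (F(1, K) + x)                  # b = 1, A is correct, B answers 1
    b1_miss = F(1, 2) * (1 - F(1, K) - x) * F(1, 2)   # b = 1, B answers randomly
    return b0_miss + b1_hit + b1_miss

K, x = 5, F(1, 10)
print(alpha_theorem6(K, x) == F(1, 2) + x / 4)        # True
```

Unlike in Theorem 1, the adversary $B$ does not lose a factor $\frac{1}{K}$ here, because the challenge ciphertext always replaces the message sent to the arm pulled at round t.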

Theorem 7.

For an arm $i \in [[K]]$ and a round $t \in [[N - K + 1]]$ , an honest-but-curious external observer cannot learn $s_{i, t}$ , given ${data}_{ext}^{t}$ , with a probability better than random. More precisely, for all PPT adversaries $A$ , $\begin{array}{l} | Pr [(i, {\hat{s}}_{i, t}) \leftarrow A^{sum (t)} ({data}_{ext}^{t}); {\hat{s}}_{i, t} = s_{i, t}] - p_{Q} (t, s_{i, t}) | \end{array}$ is negligible in λ, where $A^{sum (t)} ({data}_{ext}^{t})$ returns $(i, {\hat{s}}_{i, t})$ in which ${\hat{s}}_{i, t}$ is $A$ ’s guess on $s_{i, t}$ for the arm i (chosen by $A$ ), and $p_{Q} (t, s_{i, t})$ is the probability of obtaining a sum of rewards $s_{i, t}$ from at most t pulls of arm i until round t.

Proof.

The external observer collects ${data}_{ext}^{t}$ , which consists of several encrypted messages, some encrypted with $Enc$ (AES-GCM) and others with $E_{DC}$ (Paillier). We prove that these messages do not provide an advantage bigger than the advantage of an adversary in a classical IND-CPA game on $Enc$ or $E_{DC}$ . For simplicity, we assume that ${data}_{ext}^{t}$ contains only two encrypted messages, $Enc (m)$ and $E_{DC} (n)$ . The proof adapts directly if ${data}_{ext}^{t}$ consists of more than two messages.

The goal of the adversary is to extract at least a bit of information from either m or n. The entropy of this system is minimal when $m = n$ . Hence, when $m = n$ , the adversary has the highest probability of guessing at least a bit from either m or n (which are the same in this case). As a consequence, in the general case, the advantage of an adversary having to guess a bit about m or n, knowing $Enc (m)$ or $E_{DC} (n)$ is bounded above by the advantage of an adversary having to guess a bit about m, knowing $Enc (m)$ and $E_{DC} (m)$ .

Let us prove that the advantage of a PPT adversary in this latter case (having to guess a bit about m from $Enc (m)$ and $E_{DC} (m)$ ) is negligible.

We assume, toward a contradiction, that there exists a PPT adversary $A$ able to win the game where, given $Enc (m)$ and $E_{DC} (m)$ , $A$ recovers a bit of information about m with a non-negligible advantage x: given $Enc (m)$ and $E_{DC} (m)$ , the probability that $A$ outputs a correct guess about a bit of m is equal to $\frac{1}{2} + x + negl (λ)$ .

We use this adversary to create another adversary $B$ able to break the IND-CPA property of the encryption scheme $Enc$ (or $E_{DC}$ , respectively). As usual in the IND-CPA game, $B$ chooses two messages $m_{0}$ and $m_{1}$ , and sends them to the challenger. Then, $B$ receives the challenge $Enc (m_{b})$ (or $E_{DC} (m_{b})$ , respectively), and calls $A (Enc (m_{b}), E_{DC} (m_{0}))$ (or $A (Enc (m_{0}), E_{DC} (m_{b}))$ , respectively). If $A$ returns a correct guess about $m_{0}$ , then $B$ returns 0. Otherwise, it returns 1.

If $b = 0$ (happens with probability $\frac{1}{2}$ ), then $A$ has a non negligible advantage in guessing a bit about m.

$A$ outputs a correct guess about one bit of $m_{0}$ with probability $\frac{1}{2} + x + negl (λ)$ . In this case, $B$ is correct. This branch happens with probability $\frac{1}{2} (\frac{1}{2} + x + negl (λ))$ .

If $A$ does not answer correctly (happens with probability $\frac{1}{2} - x - negl (λ)$ ), then $B$ is correct with probability $\frac{1}{2}$ . This branch happens with probability $\frac{1}{2} (\frac{1}{2} - x - negl (λ)) \frac{1}{2}$ .

If $b = 1$ (happens with probability $\frac{1}{2}$ ), then $A$ has no advantage.

If $A$ returns a correct guess about one bit of $m_{0}$ (happens with probability $\frac{1}{2}$ ), then $B$ is wrong.

If not (happens with probability $\frac{1}{2}$ ), then $B$ returns a random guess and is correct with probability $\frac{1}{2}$ . This branch of events happens with probability $\frac{1}{2^{3}}$ .

By aggregating these cases, the probability α of success of $B$ is:

$\begin{array}{rl} α = & \frac{1}{2} (\frac{1}{2} + x + negl (λ)) + \frac{1}{2} (\frac{1}{2} - x - negl (λ)) \frac{1}{2} + \frac{1}{8} \\ = & \frac{1}{4} + \frac{1}{2} x + \frac{1}{8} - \frac{1}{4} x + \frac{1}{8} + negl (λ) \\ = & \frac{1}{2} + \frac{1}{4} x + negl (λ) \end{array}$

Hence, $B$ has a non-negligible advantage of $\frac{1}{4} x$ in the IND-CPA game against $Enc$ (or $E_{DC}$ , respectively), which is a contradiction with its IND-CPA property. Guessing a bit about the encrypted message is equivalent to guessing the reward with a probability better than random (i.e., better than $p_{Q} (t, s_{i, t})$ , cf. our theorem statement), which concludes our proof. □

As a corollary, by Lemma 1 and Theorem 7, we infer that the external observer cannot learn the cumulative reward with probability better than random.

3.3.3. Complexity

We detail in Table 4 the number of cryptographic operations used in each step of $UCB - MS$ . By summing up, we obtain $O (N K)$ AES-GCM encryptions/decryptions, K Paillier encryptions, and one Paillier decryption. Hence, we have a number of AES-GCM operations linear in N, whereas the number of Paillier operations does not depend on N. These are desirable complexity properties. In particular, the number of Paillier operations (which are quite slow to evaluate in practice) depends only on K that is typically much smaller than N in bandit scenarios. Our implementation (cf. Section 4) follows the aforementioned theoretical analysis and confirms the linear time behavior and the scalability of $UCB - MS$ .

Table 4
Number of cryptographic operations used in $UCB - MS$

	Encryptions	Decryptions
AES-GCM	K (step 0), $(N - K + 1) K$ (step 2), $(N - K + 1) (K - 1)$ (step 3), $(N - K + 1)$ (step 4)	K (step 0), $(N - K + 1) K$ (step 2), $(N - K + 1) (K - 1)$ (step 3), $(N - K + 1)$ (step 4)
Paillier	K (step 5)	1 (step 6)
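
The totals behind the $O (N K)$ claim can be tallied directly from the entries of Table 4. The following is a bookkeeping sketch only: the function name `crypto_op_counts` and the dictionary keys are illustrative and do not appear in the protocol description.

```python
def crypto_op_counts(N, K):
    """Sum the per-step operation counts listed in Table 4."""
    rounds = N - K + 1
    aes_gcm = (
        K                     # step 0
        + rounds * K          # step 2
        + rounds * (K - 1)    # step 3
        + rounds              # step 4
    )
    return {
        # Table 4 lists the same per-step counts for both columns.
        "aes_gcm_encryptions": aes_gcm,
        "aes_gcm_decryptions": aes_gcm,
        "paillier_encryptions": K,   # step 5
        "paillier_decryptions": 1,   # step 6
    }


if __name__ == "__main__":
    print(crypto_op_counts(10**5, 10))
```

The AES-GCM total simplifies to $K + 2 K (N - K + 1)$ , which grows linearly in N for fixed K, whereas the Paillier counts stay at K and 1 whatever the budget.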

3.4. Refinement

We propose the $UCB - MS 2$ refinement, which adds slightly stronger security guarantees to $UCB - MS$ at the cost of a few more cryptographic operations (but with the same asymptotic behavior as $UCB - MS$ ). A property of $UCB - MS$ (cf. Table 3, stated in Theorem 2) is that an arm node $R_{i}$ knows with average probability of $\frac{1}{2} + \frac{1}{2 K}$ which arm is pulled at the next round. This happens because during the ring computation, every arm node sees the partial argmax $i_{m}$ in clear. The $UCB - MS 2$ refinement removes this leakage and hence allows relaxing the second hypothesis from Section 3.3.2.

Fig. 5.

Modifications to $UCB - MS$ pseudocode (cf. Fig. 4) to obtain $UCB - MS 2$ .

The idea of $UCB - MS 2$ is that, in addition to what $UCB - MS$ already encrypts, we also encrypt the partial argmax $i_{m}$ during the ring computation. This modification requires introducing new keys. We recall that $UCB - MS$ assumes an AES-GCM key that is shared between the data owner and all cloud participants and that is used for the functions $Enc / Dec$ . For $UCB - MS 2$ , to prevent an arm node $R_{i}$ from decrypting the partial argmax $i_{m}$ received from the previous arm node in the ring, we need to encrypt $i_{m}$ with some other key. This is why in $UCB - MS 2$ we introduce K new AES-GCM keys, each of them shared between $AC$ and a single arm node $R_{i}$ . Each such key defines functions ${Enc}_{i}$ / ${Dec}_{i}$ .

We show in Fig. 5 the modifications to Steps 3 and 4 of $UCB - MS$ (cf. Fig. 4) that yield $UCB - MS 2$ . In the worst case, these modifications cost $(N - K + 1) (K - 1)$ encryptions at Step 3 and $(N - K + 1) K$ decryptions at Step 4, which does not change the overall asymptotic behavior outlined in Section 3.3.3. All theorems from Section 3.3.2 also hold for $UCB - MS 2$ , except Theorem 2, which is replaced by the next theorem, which formally states the stronger security guarantees of $UCB - MS 2$ .

Theorem 8.

In $UCB - MS 2$ , at the end of round $t \in [[N - K]]$ and before the start of round $t + 1$ , given ${data}_{R_{i}}^{t}$ , an honest-but-curious arm node $R_{i}$ cannot learn the arm to be pulled at round $t + 1$ with probability better than random.

Proof.

At each round t, the arm node $R_{i}$ receives $Enc (B_{m} ‖ {Enc}_{i_{m}} (i_{m}))$ and decrypts it into $B_{m} ‖ {Enc}_{i_{m}} (i_{m})$ . By hypothesis, $B_{m}$ does not leak any information about the next arm to be pulled. The only way for $R_{i}$ to guess the next arm with probability better than random is to use some information contained in ${Enc}_{i_{m}} (i_{m})$ . However, since ${Enc}_{i_{m}}$ is IND-CPA secure, it is impossible to learn any information on $i_{m}$ with non-negligible advantage. Hence, the strategy of $R_{i}$ to guess the arm pulled at round $t + 1$ is not better than random. □

4. Experiments

We show that the overhead due to cryptographic primitives is reasonable, hence our protocols are feasible. More precisely, we show the scalability of our protocols with respect to both parameters N and K through an experimental study using synthetic and real data. We compare:

UCB = Standard UCB [4], outlined in Fig. 2.

UCB-M = UCB with multiple participants among which the computations are split, in the spirit of $UCB - MS$ cf. Section 3.2, but with all messages exchanged in clear (i.e., UCB-M does not use any cryptographic primitive). The only overhead w.r.t. UCB is due to splitting the computation tasks among participants.

$UCB - MS$ = Multi-party Secure UCB cf. Section 3.2.

$UCB - MS 2$ = Refinement of $UCB - MS$ cf. Section 3.4.

We implemented the algorithms in Python 3. For AES-GCM we used the Cryptography library (https://cryptography.io/en/latest/) and keys of 256 bits. For Paillier, we used the phe library (https://python-paillier.readthedocs.io/en/develop/) in the default configuration with keys of 2048 bits. We did our experiments on a laptop with CPU Intel Core i7 of 2.80 GHz and 16GB of RAM, running Ubuntu. In each run, we executed all algorithms using the same random seeds, needed for drawing arm rewards and for generating the permutation used to iterate in a random order over the arms when choosing the argmax arm to be pulled at the next round. We make available on a public GitHub repository (https://github.com/radu1/secure-ucb) our source code, together with the data that we used, the generated results from which we obtained our plots, and scripts to install the needed libraries and reproduce our plots.
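
To make the role of the Paillier operations concrete, the following toy sketch shows the additive homomorphism in pure Python. It is not the phe library and not our implementation: the function names are illustrative, the primes are two small Mersenne primes chosen only so that the example runs instantly (far too small to be secure), and the reading that each arm node encrypts its sum of rewards and that a single decryption recovers the total is an assumption consistent with the counts of Table 4 (K encryptions, one decryption).

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier key generation with fixed small primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+), valid for g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Probabilistic encryption: c = g^m * r^n mod n^2 for a random unit r."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def add(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    n, _ = pub
    return (c1 * c2) % (n * n)


def decrypt(pub, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def encrypted_total(pub, arm_sums):
    """Encrypt each per-arm sum separately, then combine the ciphertexts."""
    ciphertexts = [encrypt(pub, s) for s in arm_sums]
    total = ciphertexts[0]
    for c in ciphertexts[1:]:
        total = add(pub, total, c)
    return total


if __name__ == "__main__":
    pub, priv = keygen()
    print(decrypt(pub, priv, encrypted_total(pub, [12, 7, 30])))
```

In this sketch the party that combines the ciphertexts never sees an individual sum in clear, and only the holder of the private key obtains the total with one decryption.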

As expected, in each experiment, all four algorithms output exactly the same cumulative reward. The property that our secure algorithms return exactly the same cumulative reward as standard UCB is in contrast with differentially-private multi-armed bandit algorithms [16,25,30], where the returned cumulative rewards are different from that of standard UCB. Consequently, a shallow empirical comparison between these works and ours boils down to comparing apples and oranges: (i) on the one hand, the running time of differentially-private bandit algorithms is roughly the same as for standard UCB and is never reported in their experiments, whereas (ii) on the other hand, for our algorithms the cumulative reward is always the same as for standard UCB and consequently there is no point for us in doing any plot on the cumulative reward. Nevertheless, we carefully analyzed all experimental settings (N, K, μ) used in the related work, that we adapt for our scalability experiments, as we detail next.

Scalability with respect to N. In this experiment, we rely on scenarios from the related work [16,30] to fix K and μ, and to vary N. In Fig. 6, we show the results for two such scenarios. We repeated the same experiment for four other scenarios, which yielded very similar results that we do not include here to avoid redundancy. We vary N from $10^{2}$ to $10^{5}$ , which is also the maximum budget from [16,30]. UCB and UCB-M have very close running times, up to two orders of magnitude smaller than those of $UCB - MS$ and $UCB - MS 2$ , which are also very close to each other. All algorithms have a similar linear time behavior. The overhead between secure and non-secure algorithms comes naturally from the cryptographic primitives. Moreover, the two lines corresponding to the secure algorithms are not parallel with the other two lines because, cf. Section 3.3.3, the overhead due to Paillier encryptions depends only on K (that is fixed in the figure) and not on N (that varies in the figure), hence the Paillier overhead is more visible for small N. The running time of $UCB - MS$ / $UCB - MS 2$ for the largest considered budget $N = 10^{5}$ is ∼150 seconds, which remains practical.

In Fig. 6, we also zoom on the time taken by each participant of $UCB - MS$ for $N = 10^{5}$ . We observe that $AC$ takes the lion’s share, which is expected because at each round $AC$ sends encrypted messages to all $R_{i}$ participants, whereas each $R_{i}$ sends an encrypted message only to one other participant. The share of $AC$ is only somewhere between one half and one third of the whole computation time. In other words, more than half of the whole computation time is split among cloud nodes other than $AC$ . As expected, all $R_{i}$ take roughly the same time. The shares taken by the data owner and the data client are the smallest among all participants, which is a desirable property because we require them to do as few computations as possible, whereas the bulk of the computation is outsourced to the cloud.

Fig. 6.

Scalability with respect to N. In the zoom, we do not show $DO$ (its share is close to 0).

Scalability with respect to K. In this experiment, we fix $N = 10^{5}$ , and we vary $K \in \{5, 10, 15, 20\}$ and implicitly μ with $μ_{1} = 0.9$ and $μ_{2 ⩽ i ⩽ K} = 0.8$ . We present results in Fig. 7. We observe, as in the previous experiment, a linear time behavior and a similar zoom on the time taken by each participant.

Fig. 7.

Scalability with respect to K, for fixed $N = 10^{5}$ . In the zoom (labels not shown because they would be colliding): $AC$ takes the lion’s share, $R_{1 ⩽ i ⩽ 20}$ take the 20 equal shares, $DC$ is barely visible, and $DO$ is not shown since its share is close to 0.

Real-world data. We also stress-tested our algorithms on real-world data, using the same data and experimental setup as [23].

More precisely, we use data from Jester (http://eigentaste.berkeley.edu/dataset/) [18], a collection of ratings ranging from −10 (very not funny) to 10 (very funny), given by 25K users on 100 jokes. Exactly as [23], we pre-process this dataset by assigning the lowest score to the unrated jokes, and then we extract two bandit scenarios:

Jester-small: $K = 10$ , corresponding to the 10 most rated jokes, where $μ_{i}$ = (# of ratings ⩾ threshold 3.5 for joke i) / (# of users).

Jester-large: $K = 100$ , corresponding to all 100 jokes, where $μ_{i}$ is computed similarly as for Jester-small, except that the threshold here is set to 7.

Moreover, we use data from MovieLens (https://grouplens.org/datasets/movielens/) [20], more precisely the “MovieLens 100K Dataset” that contains ratings ranging from 1 (bad) to 5 (very good) given by 1K users on a set of movies, from which, exactly as [23], we look only at the first 100 movies and derive the following bandit scenario:

MovieLens: $K = 100$ , corresponding to the first 100 movies, where $μ_{i}$ = (# of ratings ⩾ threshold 4 for movie i) / (# of users).
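
The way the arm means $μ_{i}$ are derived from ratings in the three scenarios above can be sketched as follows. This is an illustration rather than our pre-processing script: the function name `bernoulli_means`, the list-of-lists layout and the use of `None` for unrated items are assumptions made for the example.

```python
def bernoulli_means(ratings, threshold, lowest_score):
    """Compute mu_i = (# of ratings >= threshold for item i) / (# of users).

    ratings[u][i] is the rating given by user u to item i, or None if the
    user did not rate the item. Unrated items receive the lowest score,
    so they never reach the threshold.
    """
    n_users = len(ratings)
    n_items = len(ratings[0])
    means = []
    for i in range(n_items):
        filled = [lowest_score if row[i] is None else row[i] for row in ratings]
        means.append(sum(1 for r in filled if r >= threshold) / n_users)
    return means


if __name__ == "__main__":
    toy_ratings = [
        [5.0, None, -2.0],
        [4.0, 8.0, None],
        [None, 3.5, 9.0],
        [1.0, None, None],
    ]
    # Jester-style scale: lowest score -10, threshold 3.5.
    print(bernoulli_means(toy_ratings, threshold=3.5, lowest_score=-10))
```

Each resulting $μ_{i}$ lies in [0, 1] and serves as the expected value of a Bernoulli arm.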

We ran each of these scenarios with $N = 10^{5}$ , which is the largest budget considered in [23]. Our results (cf. Fig. 8) essentially confirm the behavior observed in the synthetic experiments i.e., there are roughly two orders of magnitude between non-secure and secure algorithms. In the largest considered scenarios (Jester-large and MovieLens, both with $K = 100$ ), where standard UCB takes ∼20 seconds, both $UCB - MS$ and $UCB - MS 2$ take ∼25 minutes, which we believe is acceptable as waiting time for the data client before getting the cumulative reward result for which she pays.

Fig. 8.

Running times on three real-world data scenarios from [23].

Positioning with respect to FHE. The main computations done by the UCB algorithm are: pull a bandit arm (i.e., draw a random number and compare it with the expected value to decide whether the reward is 0 or 1), compute arm scores (i.e., sum, multiplication, sqrt, ln), and decide what arm to pull next (i.e., compute an argmax over arm scores). A fully homomorphic encryption (FHE) scheme basically supports homomorphic addition and multiplication, and all other computations require some effort to be specified since they should be encoded using additions and multiplications.

A possible FHE version of the UCB algorithm is to encrypt the input $μ_{i}$ and then do all computations on a single cloud node, as the reviewer suggests. The random draws (needed to pull a bandit arm) can be done in clear and then the results encrypted, and all other computations are done directly in the encrypted domain.

To give a rough idea on a loose lower bound of the UCB computation time in a FHE scheme, we focus on the comparisons required for arm pulls and argmax computations. Running the UCB algorithm for some K and N requires N comparisons for the pulls, and $(N - K) (K - 1)$ comparisons to decide what arm to pull next (because there are $N - K$ exploration-exploitation time steps when an argmax should be computed, each of such time steps requiring $K - 1$ comparisons). Hence, UCB requires a total number of $N + N K - N - K^{2} + K = K (N - K + 1)$ comparisons.
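
The three kinds of computation listed above, and the comparison count just derived, can be checked on a plain (non-secure) sketch of UCB. The index used here is the standard UCB1 score of [4], which is an assumption and may differ in details from the pseudocode of Fig. 2. The function name `ucb` and its `seed` parameter are illustrative.

```python
import math
import random


def ucb(mu, N, seed=0):
    """Run UCB on Bernoulli arms with means mu for a budget of N pulls.

    Returns the cumulative reward and the number of comparisons performed.
    """
    rng = random.Random(seed)
    K = len(mu)
    sums = [0] * K    # sum of rewards per arm
    pulls = [0] * K   # number of pulls per arm
    comparisons = 0

    def pull(i):
        nonlocal comparisons
        comparisons += 1  # compare a random draw with the expected value
        sums[i] += 1 if rng.random() < mu[i] else 0
        pulls[i] += 1

    for i in range(K):  # initialization: pull each arm once
        pull(i)
    for t in range(K, N):  # N - K exploration-exploitation steps
        scores = [
            sums[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
            for i in range(K)
        ]
        order = list(range(K))
        rng.shuffle(order)  # seeded random order over the arms
        best = order[0]
        for i in order[1:]:  # argmax with K - 1 comparisons
            comparisons += 1
            if scores[i] > scores[best]:
                best = i
        pull(best)
    return sum(sums), comparisons


if __name__ == "__main__":
    reward, comparisons = ucb([0.9] + [0.8] * 9, 10**4, seed=1)
    print(reward, comparisons, 10 * (10**4 - 10 + 1))
```

The counter returned by the sketch equals $K (N - K + 1)$ for any input, and running it twice with the same seed gives the same cumulative reward.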

We analyzed a state-of-the-art article for efficient homomorphic comparison methods, published at ASIACRYPT 2020 [9]. Among all algorithms and variants presented there, if we take the one that provides the best compromise between fast running time and small chance to make an error in the comparison result (i.e., NewCompG with $α = 20$ ), it needs 1.43 milliseconds per comparison, in amortized running time, on a system comparable to ours.

Take for instance $K = 10$ and $N = 10^{5}$ . For this input, UCB needs 999910 comparisons, which means that ∼1400 seconds would be required only for the (approximate) comparisons if one decides to implement UCB using state-of-the-art FHE techniques. Then, one should add the time needed for the other UCB computations, which would increase even more the total computation time.
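
The arithmetic behind this estimate is reproduced below. The function name `fhe_comparison_seconds` is illustrative. The per-comparison cost of 1.43 milliseconds is the amortized figure quoted above from [9], and the result covers comparisons only.

```python
def fhe_comparison_seconds(N, K, ms_per_comparison=1.43):
    """Loose lower bound on FHE time spent on comparisons alone."""
    comparisons = K * (N - K + 1)
    seconds = comparisons * ms_per_comparison / 1000.0
    return comparisons, seconds


if __name__ == "__main__":
    comparisons, seconds = fhe_comparison_seconds(N=10**5, K=10)
    print(comparisons, round(seconds))
```

For $K = 10$ and $N = 10^{5}$ this gives 999910 comparisons and about 1430 seconds, which is the ∼1400 seconds mentioned above and roughly an order of magnitude more than the ∼150 seconds measured for our secure protocols in the scalability experiment.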

We recall (cf. experiment Scalability with respect to N) that for $K = 10$ and $N = 10^{5}$ , our secure multi-party protocols take ∼150 seconds for securely running all UCB computations. We believe that these numbers strengthen our approach to rely on splitting the computation tasks among several nodes instead of relying on FHE systems for securely and efficiently implementing an algorithm such as UCB.

5. Conclusions and future work

We tackled the problem of cumulative reward maximization in multi-armed bandits, in a setting where data and computations are outsourced to some honest-but-curious cloud. We proposed $UCB - MS$ , a secure multi-party protocol based on UCB, which yields exactly the same cumulative reward as UCB while enjoying desirable security properties that we precisely characterize. In particular, no cloud node or external observer can learn the cumulative reward, which can be seen only by the data client who pays a budget. We rely on cryptographic schemes to achieve the security properties of $UCB - MS$ , and we characterize the overhead of cryptography from both theoretical and empirical points of view. Our experiments show the scalability and practical feasibility of $UCB - MS$ , and of its refinement $UCB - MS 2$ .

As future work, we plan to extend our scenario such that multiple data clients concurrently submit budgets to the cloud and receive corresponding cumulative rewards. In such a scenario, parallelism between nodes could be leveraged to improve the system’s throughput.

Acknowledgments

We thank the anonymous reviewers whose suggestions helped improve and clarify this manuscript. This work was mostly done while Radu Ciucanu and Marta Soare were affiliated with INSA Centre Val de Loire / Univ. Orléans / LIFO, France. This work has been partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003) and two projects funded by EU Horizon 2020 research and innovation programme (TAILOR under GA No 952215 and INODE under GA No 863410).

References

1. Advanced Encryption Standard (AES), 2001, FIPS Publication 197.

2. Agrawal, Sample mean based index policies with $O (log (n))$ regret for the multi-armed bandit problem, Advances in Applied Probability 27(4) (1995), 1054–1078. doi:10.2307/1427934.

3. Audibert, Bubeck and Munos, Best arm identification in multi-armed bandits, in: COLT, 2010, pp. 41–53.

4. Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47(2–3) (2002), 235–256.

5. Bellare, Desai, Jokipii and Rogaway, A concrete security treatment of symmetric encryption, in: FOCS, 1997, pp. 394–403.

6. Bourse, Minelli, Minihold and Paillier, Fast homomorphic evaluation of deep discretized neural networks, in: CRYPTO, 2018, pp. 483–512.

7. Bubeck and Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning 5(1) (2012), 1–122. doi:10.1561/2200000024.

8. J.H. Cheon, Kim, Kim and Y.S. Song, Homomorphic encryption for arithmetic of approximate numbers, in: ASIACRYPT, 2017, pp. 409–437.

9. J.H. Cheon, Kim and Kim, Efficient homomorphic comparison methods with optimal complexity, in: ASIACRYPT, 2020, pp. 221–256. https://eprint.iacr.org/2019/1234.pdf.

10. Ciucanu, Delabrouille, Lafourcade and Soare, Secure cumulative reward maximization in linear stochastic bandits, in: ProvSec, 2020, pp. 257–277.

11. Ciucanu, Lafourcade, Lombard-Platet and Soare, Secure best arm identification in multi-armed bandits, in: ISPEC, 2019, pp. 152–171.

12. Ciucanu, Lafourcade, Lombard-Platet and Soare, Secure outsourcing of multi-armed bandits, in: TrustCom, 2020, pp. 202–209. https://ieeexplore.ieee.org/abstract/document/9343228.

13. Dwork, Differential privacy, in: ICALP, 2006, pp. 1–12.

14. Farokhi, Shames and Batterham, Secure and private control using semi-homomorphic encryption, Control Engineering Practice 67 (2017), 13–20. doi:10.1016/j.conengprac.2017.07.004.

15. N.M. Freris and Patrinos, Distributed computing over encrypted data, in: Allerton, 2016, pp. 1116–1122.

16. Gajane, Urvoy and Kaufmann, Corrupt bandits for preserving local privacy, in: ALT, 2018, pp. 387–412.

17. Gentry, Fully homomorphic encryption using ideal lattices, in: STOC, 2009, pp. 169–178.

18. K.Y. Goldberg, Roeder, Gupta and Perkins, Eigentaste: A constant time collaborative filtering algorithm, Information Retrieval 4(2) (2001), 133–151. doi:10.1023/A:1011419012209.

19. Goldreich, The Foundations of Cryptography – Volume 2: Basic Applications, Cambridge University Press, 2004.

20. F.M. Harper and J.A. Konstan, The MovieLens datasets: History and context, ACM TiiS 5(4) (2016), 19:1–19:19.

21. Kairouz, H.B. McMahan et al., Advances and open problems in federated learning, Foundations and Trends in Machine Learning 14(1–2) (2021), 1–210. https://arxiv.org/abs/1912.04977.

22. Kogiso and Fujita, Cyber-security enhancement of networked control systems using homomorphic encryption, in: CDC, 2015, pp. 6836–6843.

23. Kohli, Salek and Stoddard, A fast bandit algorithm for recommendation to users with heterogenous tastes, in: AAAI, 2013.

24. Lu and Zhu, Privacy preserving distributed optimization using homomorphic encryption, Automatica 96 (2018), 314–325. doi:10.1016/j.automatica.2018.07.005.

25. Mishra and Thakurta, (Nearly) optimal differentially private stochastic multi-arm bandits, in: UAI, 2015, pp. 592–601.

26. Munos, From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning, Foundations and Trends in Machine Learning 7(1) (2014), 1–129. doi:10.1561/2200000038.

27. Paillier, Public-key cryptosystems based on composite degree residuosity classes, in: EUROCRYPT, 1999, pp. 223–238.

28. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC, 2007, NIST Special Publication 800-38D.

29. Shoukry, Gatsis, Al-Anwar, G.J. Pappas, S.A. Seshia, M.B. Srivastava and Tabuada, Privacy-aware quadratic optimization using partially homomorphic encryption, in: CDC, 2016, pp. 5053–5058.

30. A.C.Y. Tossou and Dimitrakakis, Algorithms for differentially private multi-armed bandits, in: AAAI, 2016, pp. 2087–2093.

31. Yung, From mental poker to core business: Why and how to deploy secure computation protocols?, in: CCS, 2015, pp. 1–2.
