Reinforcement Learning for Dynamic Pricing of Shared-Use Autonomous Mobility Systems Considering Heterogeneous Users: Model Development and Scenario Testing

Abstract

A key aspect of the success of shared-use autonomous mobility systems will be the ability to price rides in real time. As these services become more prevalent, it becomes highly important to detect shifts in behavior to quickly optimize the system and ensure system efficiency and economic viability. Therefore, (1) pricing algorithms should be able to price rides according to complex underlying demand functions with heterogeneous customers and (2) the algorithm should be able to detect nonstationary behavior (e.g., changing customers’ willingness to pay) from its previously learnt decisions and alter its pricing mechanism accordingly. We formulate a dynamic pricing and learning problem as a Markov decision process and subsequently solve it through a reinforcement learning (RL) algorithm, with heterogeneous customers accepting the trip characteristics (price, expected wait time) probabilistically. Insights from a fixed fleet operation of an autonomous private ridesourcing system in Chicago are presented. Given our formulation of the demand model, the algorithm learns in 25 days, increasing revenue by 90% and decreasing customer wait times by 90% compared to day 5. After gathering insights from the RL algorithm and applying optimal static pricing (i.e., a constant specific surge multiplier), we find that RL can achieve near 90% optimality in revenue. The RL algorithm, nevertheless, proves to be robust. Two scenarios are tested where a sudden shock occurs or customers slowly change their willingness to pay, illustrating that RL can quickly adapt its parameters to the situation.

Keywords

planning and analysis, algorithms, mathematical modeling, optimization, transportation supply

Dynamic pricing emerged as a form of revenue management in the late 1970s, after the deregulation of the airline industry ( 1 ). The main goal of dynamic pricing is to adjust the product price at specific times such that service is allocated to the right customer who is willing to pay, in an attempt to maximize total revenue (a typical objective of for-profit companies). Dynamic pricing also helps address situations in which capacity is limited, which can be resolved by raising the price to reduce demand, also known as market clearing.

Application of dynamic pricing is prevalent in the online retailing sector, airline sector, hospitality sector, and intelligent transportation systems ( 2 , 3 ). More recently, dynamic pricing has emerged in ridesourcing/ridesharing systems (Uber, Lyft) to balance supply and stochastic demand, by using “surge multipliers” that are applied to the base fares at times of high demand. This plays an important role for profit and/or user satisfaction maximization in these systems. In current-day human-driven vehicle systems, dynamic pricing affects both the supply and demand, as drivers are more likely to join the system when higher profit is expected, while customers become less likely to accept rides. Autonomous vehicles (AVs) coupled with ridesharing applications will most likely provide a fixed vehicle fleet ( 4 ) that safely and efficiently traverses the road infrastructure, picks up passengers, drops them off without the hassle of looking for parking spots, and engages in the same activity repeatedly—defined as shared-use autonomous mobility systems (SAMSs).

A key aspect of the success of these autonomous systems will be the ability to price rides in real time. Moreover, as these services become more prevalent, it becomes of high importance to detect shifts in behavior to quickly optimize the system and ensure high profitability and good system efficiency. Specifically, (1) pricing algorithms should be able to price rides according to complex underlying demand functions with heterogeneous sensitivities and (2) the algorithm should be able to detect nonstationary behavior (i.e., when the true value of actions changes over time [ 5 ], e.g., because of changing customers’ willingness to pay) from its previously learnt decisions and alter its pricing mechanism accordingly. Thus, the main objectives of this study are as follows.

Formulate a dynamic pricing and learning problem as a Markov decision process (MDP) and subsequently solve it through a reinforcement learning (RL) algorithm, with heterogeneous customers accepting the trip characteristics (price, expected wait time) probabilistically. The RL algorithm will be developed such that it supports continuous actions (i.e., surge multipliers), which is a novel methodology used for continuous problems ( 5 ). Policies will be formulated using a Gaussian parametrization that depends on the state of the system.

Derive insights from fixed fleet operation of an autonomous private ridesourcing system, with respect to the following questions.

Does dynamic pricing help with system efficiency and customer service metrics?

Does it increase profit compared to optimal static pricing?

Can the algorithm learn in nonstationary environments? This will provide evidence as to the robustness of the algorithm compared to static pricing.

Literature Review

Dynamic pricing has been tackled with a variety of methodologies. Queueing approaches and mathematical analysis have been prevalent, but recently, computer scientists have been implementing machine learning methods. Unlike mathematical analysis that attempts to find closed-form solutions (which require many assumptions), computer scientists typically use simulation to update model parameters and optimize specific objective functions. This is the case with RL, a class of algorithms that obtain optimal policies through exploration and exploitation. Three research areas/communities have contributed to modern RL, as documented by Sutton and Barto ( 5 ). The first originated in the psychology of animal behavior, concerning learning through trial-and-error. The second is the study of optimal control, mainly through dynamic programming techniques originated by Richard Bellman. Bellman also originated the discrete stochastic version of optimal control called an MDP. The MDP framework has been used for dynamic pricing by many researchers ( 6 , 7 ). Dynamic programming techniques are governed by a backwards induction formula that goes backwards in time to find an optimal solution, in a fully known environment and with known transition probabilities. This was therefore hard to consider as a learning process that moves forward in time ( 5 ). The third research area is temporal difference learning, which adjusts predictions to match later predictions about the future before the final outcome is known. Recently, there have been many advancements in the field, such as policy-gradient methods, Q-learning, average-reward formulations, deep Q-learning, and actor–critic methods. These algorithms use combined knowledge from all three research areas.

A fundamental work in ridesourcing is that done by Banerjee et al. ( 8 ), who formulate the problem of two-sided ridesharing platforms as a stochastic model that captures the dynamics of drivers and passengers. Specifically, the authors model the system as a continuous-time queuing process and analytically present equations of equilibrium that capture the likelihood of drivers joining and riders accepting a ride. The two-queue system is modeled using state-dependent (price-dependent) service rates, referred to as an M/M(k)/1 queue, where the service rate of passengers depends on the current number of available drivers. The authors impose dynamic pricing based on thresholds, whereby a high price is offered to passengers if the number of drivers is less than that threshold, and a lower price when the number of drivers is greater. Moreover, drivers do not react to instantaneous prices but rather focus on their long-term time-average earnings, which incentivizes their participation. The authors present two results, both under a single region and under large-market limits: (1) unexpectedly, dynamic pricing cannot produce better throughput or revenue than optimal static pricing, but (2) dynamic pricing is more robust to errors in model parameters (e.g., demand rate, driver availability rate, etc.). The authors, however, assume a single demand regime and not a periodically changing demand pattern. Moreover, they assume a rider quits the system if not matched right after requesting a ride. These assumptions and results are a major focus of the current paper, albeit in a SAMS.

Research by Pandit et al. ( 9 ) relaxes the above assumption that drivers only react to long-term average earnings, and instead assumes an instantaneous effect of increasing the price on the driver arrival rate. This assumption is investigated by Uber ( 10 ), which shows potential evidence of driver-partners instantaneously joining the platform because of the increased surge. The authors combine queuing models with an MDP to represent the system, with the state space being the difference between the queue length of drivers and customers. They assume the arrival rate of customers is a decreasing concave function of the price, while that of drivers is increasing. The platform’s goal is to maximize the expected time-averaged revenue, which is determined by a weighted average (by stationary distribution of the Markov chain) of the rate of matching multiplied by the price offered at a state and the distance traveled by a customer. Solving this problem through a value iteration technique, the authors show that dynamic pricing strategies outperform static pricing strategies if drivers and customers react instantaneously to price.

Haliem et al. ( 11 ) study the interconnected decisions of matching, pricing, and dispatching for ridesourcing. Firstly, a heatmap of supply and demand (as well as predicted demand) is input to the system. Ride requests at time t are then greedily matched to vehicles, with each driver notified of the price of the trip. Vehicles next perform an insertion operation, where they propose a price based on a learnt value function of expected discounted rewards at the ride’s destination using an estimated deep Q-network. Customers then have the ability to reject a ride based on their own utility function. Finally, the authors propose rebalancing the network via a deep Q-network whereby vehicles are dispatched to zones with predicted high demand. The authors’ framework incorporates pricing from the driver’s point of view (e.g., profitability, going to a low demand area), but not from the viewpoint of balancing supply and demand. Moreover, pricing is not the decision process modeled as an RL problem. Rather, the expected discounted reward associated with each move on the map is estimated using a deep Q-network, which drives the price decisions set by the drivers themselves (rather than a central controller) based on a ranking of zones. The authors do not specifically study the effect of dynamic pricing compared to static pricing, but present promising results with respect to customer wait times, profitability, and vehicle utilization.

Recently, Chen et al. ( 12 ) applied RL for spatial-temporal pricing for a ridesourcing platform. They use Didi’s explicit data on a specific day to simulate the availability of drivers and ride orders received. They differentiate between two pricing techniques: (1) a dynamic pricing scheme that offers the same price and wage applied in all zones, and (2) spatial-temporal pricing, which assigns a price and wage at each zone separately from others. The state information they consider consists of the number of waiting passengers and the number of idle vehicles in each region at time step t, and the number of occupied vehicles heading to each region. The authors argue that the above information does not reflect the environment, and suggest adding historical demand information at time step t + 1 to the state information, which is a complete information case. However, passengers are not given expected wait times when ordering and demand only reflects the current price. Moreover, they assume the customer cancels the ride if not matched within 15 min, an excessively long time for urban mobility applications. The authors also price rides based only on the trip origin location, regardless of destination, rather than applying surge multipliers to the distance and time of the trip. The authors adopt a model-free method called proximal policy optimization (PPO) with the policy defined as a deep neural network that selects actions based on given state information. This is an actor–critic policy-gradient method that supports continuous state-action spaces. The reward is equal to the total revenue minus the driver salary at time step t. The authors show that dynamic pricing could generate 25% more profit compared to static prices, while spatial-temporal pricing could attain up to 85% more profit, but do not provide improvement details for customer wait times.

Surge pricing in traditional ridesourcing systems (human drivers) is distinctly different than in SAMSs. Surge pricing in traditional systems affects both the supply and demand, as it encourages drivers to join the platform and reduces the rider’s probability of accepting a ride. This is not the case when AVs are used; therefore, extracting insights from this system is novel and important. Moreover, the envisioned fixed fleet size is another important difference that needs to be modeled and analyzed.

The closest line of research to this paper was recently conducted by Turan et al. ( 13 ), who seek to optimize a system of electric SAMSs, specifically considering joint routing (assignment to customers and rebalancing), battery charging, and pricing. The authors characterize the network as a fully connected graph where each node corresponds to a zone, which is set as the starting/ending position of rides and the location of charging stations. They model the arrival process between origin–destination (OD) pairs as a Poisson process that is dependent on the current price and consider the operational and charging costs of vehicles. In their model, customers are assigned to a queue for each OD pair and are served on a first-come-first-served basis after being matched. The authors first formulate and solve the “optimal” static problem (average arrival rates, no stochasticity, perfect knowledge of environment parameters) through a network flow model. Next, the authors formulate a real-time policy using RL. The state information that they consider consists of vehicle locations and energy levels, queue lengths at each OD pair, and electricity prices at each node. The actions are the prices for each OD pair and routing/rebalancing/charging decisions for each vehicle. The reward consists of revenue gained by arrivals, queue costs, and operational/charging costs. The authors adopt the model-free method, PPO. They conduct simulations using Manhattan and San Francisco datasets (of only 2 h of a specific day) with Manhattan split into 10 regions and San Francisco split into seven. They show that RL maximizes profit by 10% and minimizes queues by 70% compared to the optimal static problem. The authors took the approach of aggregating trips into zones. The present paper uses disaggregated data with the vehicle-to-customer assignment algorithm explicitly taking into account the distance between them using bipartite matching techniques. 
This further enables studying the wait times experienced by customers. Moreover, the areas are not divided into zones but rather the entire system is studied such that pricing (the surge multiplier) is determined based on the system state.

Karamanis et al. ( 14 ) also study utility-based dynamic pricing in a SAMS that offers private and pooled trips. The authors assume complete knowledge of the customers’ choice process of the two modes and public transit (i.e., exact probabilities of choosing each), and model it using a nested logit model. Next, the authors apply dynamic pricing by using a surge multiplier (for each customer) that maximizes the expected revenue (probability of private trip choice multiplied by fare + probability of pooled trip choice multiplied by fare). The authors create simulated OD data from data of Greater London and use it to run simulations with varying fleet sizes. However, in their formulation, the authors do not specify how waiting time is calculated for each rider, nor vehicle-to-customer matching details. Interestingly, they find that dynamic prices occur at non-peak times. They show that in a monopoly, dynamic pricing provides a higher revenue than static pricing (however, they do not compare against the optimal static surge multiplier to serve as a fair comparison).

To summarize, compared to previous work, this paper presents a complete framework using a realistic ridesourcing simulator, with details on customer wait times (and learning them), the matching algorithm, and testing of behaviorally challenging scenarios using the developed RL algorithm. Specifically, we use RL to price trips in a fixed fleet system to extract insights. Demand is exogenous, with probabilistic arrivals, and the user decides probabilistically whether to order a ride based on the price and the expected wait time. Matching is also implemented in a microscopic simulator, with the service rate determined through a Manhattan-based grid-road assuming fixed vehicle speed.

Problem Statement

SAMSs consist of a fleet of driverless vehicles serving customer requests. These fleets are likely to be centrally controlled by a single entity in the future. In this paper, the controller operates a fixed fleet, and users request immediate point-to-point service. From the customer’s standpoint, the customer opens the app, inputs the pickup and drop-off locations of the trip, and sees the expected wait time and the price. The controller accordingly needs to choose the correct price to maximize system performance (broadly defined). Next, the customer either accepts or rejects the trip; a customer who accepts joins the queue of customers to be matched to vehicles.

This paper further assumes the following:

users want AVs to come to their pickup location as soon as possible;

the fleet controller cannot reject user requests and users will not cancel trips if wait times are higher than the provided expected wait time;

no en route detours can occur to pick up or drop off other users;

AVs in the fleet are homogeneous;

the fleet size is fixed in the short term;

the fleet controller has complete control over each AV.

The fleet controller has a fleet of AVs $j \in J = \{1, 2, \dots, | J |\}$ that are required to serve a set of customers $i \in I = \{1, 2, \dots, | I |\}$ over a time horizon of $T$. The fleet is initially assumed to be at a depot that can accommodate fleet size $| J |$. The vehicles traverse a Manhattan plane defined as a rectangular area $G = \{(x, y) \mid x \in [0, L_{1}], y \in [0, L_{2}]\}$ at a fixed speed $v$, where $L_{1}, L_{2}$ are the side lengths of the rectangle. On-demand ride requests arrive from an unknown stochastic process with the following characteristics: request time $t_{r}^{i} \in T$, pickup location $l_{p - ODR}^{i} \in G$, and drop-off location $l_{d - ODR}^{i} \in G$.

In this paper, we utilize Uber’s pricing strategy in Chicago (Summer 2020). Specifically, the cost of a ride is given as follows:

\begin{matrix} \text{Cost} = \text{Surge multiplier} \times \max \left\{ \begin{matrix} \text{MinimumFare} + \text{Surcharges} + \text{BookingFee} \\ \text{BaseFare} + α_{1} \, \text{PerMinute} + α_{2} \, \text{PerMile} + \text{Surcharges} + \text{BookingFee} \end{matrix} \right\} \\ = \text{Surge multiplier} \times \max \left\{ \begin{matrix} \$4.98 + \$3.00 + \$1.85 \\ \$1.87 + \$0.29 \, \text{PerMinute} + \$0.84 \, \text{PerMile} + \$3.00 + \$1.85 \end{matrix} \right\} \end{matrix}

where “surge multiplier” is the action to be learnt online. Moreover, we assume there is a function f through which customers probabilistically accept a ride, based on trip characteristics (e.g., expected wait time and cost). This function could follow any formulation, since, as later shown, the reward function does not use any of its information, but one has to note the algorithm’s performance may change across choice models (e.g., rule-based functions are more challenging to learn). In this paper, we use a mixed logit model that inherently considers the heterogeneity of customers, whereby the systematic utility of accepting a ride is modeled as follows:

\begin{matrix} V_{Accept} = β_{0} - β_{1} (\text{Price}) - β_{2} (\text{Expected Wait Time}) \\ β_{0} \sim N (35, 0.5) \\ β_{1} = 1 \\ β_{2} \sim \text{Lognormal} (3.184, 0.465) \end{matrix}

and the systematic utility of not accepting the ride is set to zero. With this formulation, the value of wait time follows a lognormal distribution with a mean of approximately $27/h. To simulate whether a customer accepts a ride, we draw the coefficients from their respective distributions, compute the systematic utility of accepting, and add to it a random draw from a Gumbel distribution with parameters (0, 1). Because the systematic utility of not accepting is zero, the total utility of not accepting is simply another independent random draw from a Gumbel distribution with parameters (0, 1). If the total utility of accepting is greater than the total utility of not accepting, the customer accepts the ride. Otherwise, the customer rejects it.
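The fare rule and the acceptance draw described above can be sketched in a few lines of Python. This is a minimal illustration rather than the simulator's code: the function names are hypothetical, the expected wait time is assumed to be expressed in hours (so that the wait-time coefficient reads as dollars per hour), and the second parameter of each distribution is assumed to be a standard deviation, which the text does not state explicitly.

```python
import math
import random

# Fare constants quoted in the text (Uber, Chicago, Summer 2020), in USD.
MINIMUM_FARE = 4.98
BASE_FARE = 1.87
PER_MINUTE = 0.29
PER_MILE = 0.84
SURCHARGES = 3.00
BOOKING_FEE = 1.85


def ride_cost(surge, minutes, miles):
    """Surge multiplier times the larger of the minimum fare and the metered fare."""
    minimum = MINIMUM_FARE + SURCHARGES + BOOKING_FEE
    metered = (BASE_FARE + PER_MINUTE * minutes + PER_MILE * miles
               + SURCHARGES + BOOKING_FEE)
    return surge * max(minimum, metered)


def gumbel(rng):
    """Standard Gumbel(0, 1) draw by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return -math.log(-math.log(u))


def accepts_ride(price, wait_hours, rng):
    """One simulated accept/reject decision under the mixed logit model."""
    beta0 = rng.gauss(35.0, 0.5)              # assumption: 0.5 is a standard deviation
    beta1 = 1.0
    beta2 = rng.lognormvariate(3.184, 0.465)  # value of wait time, USD per hour (assumed unit)
    v_accept = beta0 - beta1 * price - beta2 * wait_hours
    # Rejecting has systematic utility 0; each alternative gets its own Gumbel error.
    return v_accept + gumbel(rng) > 0.0 + gumbel(rng)


# Mean of the lognormal wait-time coefficient, exp(m + s^2 / 2).
mean_value_of_wait_time = math.exp(3.184 + 0.465 ** 2 / 2.0)
```

Under these assumptions the mean of the wait-time coefficient evaluates to roughly 26.9, which agrees with the figure of about $27/h quoted above.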

Model Framework and Online Solution Method

We discuss below the three main algorithms used to drive the simulation. This includes the RL algorithm for pricing (the main objective of this paper), wait time estimation, and matching of vehicles to customers.

Algorithm 1: Pricing Model Using Reinforcement Learning

The controller is the agent that is trying to learn pricing decisions based on the current state of the system, making one decision per pricing interval (every 5 min in this paper). The MDP contains four elements: a set of states S, a set of actions A, a reward function $R : S \times A \to \mathbb{R}$, and transition probabilities between the states $P : S \times A \times S \to [0, 1]$. The agent selects actions at each state of the MDP according to a learnt optimal policy $π^{*} : s \to a$ that maximizes its gain.

State

In this paper, the state of the system is characterized by features $x (s) = \{U_{t - d_{p}}, U_{t - d_{p}}^{2}, NR_{t - d_{p}}, S_{t - d_{p}}, S_{t - d_{p}}^{2}, AEWT_{t - d_{p}}, DI_{t}\}$, where $U_{t - d_{p}}$ is the average driver utilization over the last pricing interval $d_{p}$ (5 min in this paper), $NR_{t - d_{p}}$ is the number of requests received over the last pricing interval, $S_{t - d_{p}}$ is the surge multiplier chosen over the last pricing interval, $AEWT_{t - d_{p}}$ is the average of the expected wait times provided to customers over the last pricing interval, and $DI_{t}$ is a binary indicator of whether demand is expected to increase; we simply check whether demand increased in the previous interval compared to the interval before it.
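As an illustration, the feature vector could be assembled as below. The function and argument names are hypothetical, and the units of utilization and wait time are left to the caller because the text does not specify them.

```python
def state_features(utilization, requests_last, requests_before_last,
                   last_surge, avg_expected_wait):
    """Build x(s) in the order listed in the text: U, U^2, NR, S, S^2, AEWT, DI."""
    # DI is 1 when demand rose over the last interval relative to the one before it.
    demand_increasing = 1.0 if requests_last > requests_before_last else 0.0
    return [
        utilization,
        utilization ** 2,
        float(requests_last),
        last_surge,
        last_surge ** 2,
        avg_expected_wait,
        demand_increasing,
    ]


x = state_features(utilization=0.5, requests_last=120, requests_before_last=100,
                   last_surge=1.5, avg_expected_wait=6.0)
```

The squared terms let a model that is linear in the features still respond nonlinearly to utilization and to the previous surge multiplier.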

Action

The action to be applied is the surge multiplier. We let the action be global, that is, the same for all OD pairs, but future work can investigate localized pricing. The surge multipliers are modeled to be continuous rather than discrete (see the Solution Methodology section).

Reward

Based on the surge multiplier chosen, the overall reward is the revenue gathered from the customers who accepted the rides. However, because of the time-changing behavior of demand, this is normalized by the number of requests received over the past horizon, that is, the demand:

r = \frac{\sum_{i = 1}^{| I_{accepted} |} Price_{i}}{Demand_{t - h}}

The reasoning behind this is that the RL algorithm does not realize demand is changing temporally and therefore the reward associated with specific actions may overestimate that action’s importance (e.g., receiving a high reward during a high demand time). This can be fixed by normalizing using the observed demand.
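A small sketch of the normalized reward follows. The function name is hypothetical, and returning zero when no requests arrive is an assumption, since the text does not say how an empty interval is handled.

```python
def normalized_reward(accepted_prices, num_requests):
    """Revenue from accepted rides divided by the number of requests received."""
    if num_requests == 0:
        return 0.0  # assumption: an interval without requests yields zero reward
    return sum(accepted_prices) / num_requests


# Same acceptance share and prices, but four times the demand in the peak interval.
off_peak = normalized_reward([12.0, 8.0], 4)
peak = normalized_reward([12.0, 8.0] * 4, 16)
```

Raw revenue is four times higher in the peak interval (80 versus 20), yet both intervals receive the same normalized reward, so the action is not credited for the demand level itself.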

Transition Function

The transition function is not modeled explicitly. After the action is applied, the next state is observed by collecting the state features $x (s)$ from the simulator.

Solution Methodology

We utilize the policy-gradient method with an average-reward formulation ( 5 ). Specifically, we learn a policy (that maps a state to an action) by directly parameterizing the policy, that is:

π (a | s, θ) = \Pr (A_{t} = a | S_{t} = s, θ_{t} = θ)

where a is the action, s is the state, θ represents the parameters of the policy, and t refers to the current time. In this paper, we follow a Gaussian policy parameterization, as follows:

π (a | s, θ) = \frac{1}{σ (s, θ) \sqrt{2 \pi}} \exp \left(- \frac{{(a - μ (s, θ))}^{2}}{2 σ {(s, θ)}^{2}}\right)

where

\begin{matrix} μ (s, θ) = \exp (θ_{μ}^{T} x (s)), \\ σ (s, θ) = \exp (θ_{σ}^{T} x (s)) \end{matrix}

where $x (s)$ are features of the state and $θ_{μ}, θ_{σ}$ are parameters. In other words, the mean and standard deviation of the Gaussian action selector are modeled based on an exponential of a linear function of state features (which therefore restricts these to be positive), with respective parameters that need to be updated in real time.
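A minimal sketch of sampling a surge multiplier from this parameterization is shown below. The function names are hypothetical. Note that although the mean is positive, a Gaussian draw can still be negative; the text does not state how such draws are handled, so none is applied here.

```python
import math
import random


def policy_mean_std(theta_mu, theta_sigma, x):
    """Mean and standard deviation as exponentials of linear functions of x(s)."""
    mu = math.exp(sum(t * f for t, f in zip(theta_mu, x)))
    sigma = math.exp(sum(t * f for t, f in zip(theta_sigma, x)))
    return mu, sigma


def sample_surge(theta_mu, theta_sigma, x, rng):
    """Draw one action a ~ N(mu(s), sigma(s)^2)."""
    mu, sigma = policy_mean_std(theta_mu, theta_sigma, x)
    return rng.gauss(mu, sigma)


rng = random.Random(0)
x = [1.0, 2.0]
# With all parameters at zero, the policy is N(1, 1) regardless of the state.
mu0, sigma0 = policy_mean_std([0.0, 0.0], [0.0, 0.0], x)
a = sample_surge([0.0, 0.0], [0.0, 0.0], x, rng)
```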

The policy attempts to maximize the average reward, defined as follows:

r (π) = \sum_{s} μ_{π} (s) \sum_{a} π (a | s) \sum_{s', r} p (s', r | s, a) r

The policy-gradient theorem gives an update rule, whereby:

\begin{matrix} \nabla r (π) = \nabla \sum_{s} μ_{π} (s) \sum_{a} π (a | s, θ) \sum_{s', r} p (s', r | s, a) r \\ = \sum_{s} μ_{π} (s) \sum_{a} \nabla π (a | s, θ) q_{π} (s, a) \end{matrix}

Since the states are generated via policy π, then the stochastic gradient ascent updates policy parameters as follows:

θ_{t + 1} = θ_{t} + α \nabla \ln π (A_{t} | S_{t}, θ_{t}) q_{π} (S_{t}, A_{t})

where $q_{π} (S_{t}, A_{t})$ is the future action-value associated with conducting action $A_{t}$ at state $S_{t}$ and afterwards following policy π.

For our selected Gaussian policy, the gradients of the log-policy with respect to $θ_{μ}$ and $θ_{σ}$ are as follows:

\nabla \ln π (a | s, θ_{μ}) = \frac{1}{σ {(s, θ_{σ})}^{2}} (a - μ (s, θ_{μ})) μ (s, θ_{μ}) x (s)

\nabla \ln π (a | s, θ_{σ}) = (\frac{{(a - μ (s, θ_{μ}))}^{2}}{σ {(s, θ_{σ})}^{2}} - 1) x (s)
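The two log-policy gradients can be checked numerically. The sketch below compares the closed-form expressions against central finite differences of the Gaussian log-density; the parameter values are arbitrary illustrative numbers, not values from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_policy(a, theta_mu, theta_sigma, x):
    """ln pi(a | s, theta) for the exponential-linear Gaussian policy."""
    mu = math.exp(dot(theta_mu, x))
    sigma = math.exp(dot(theta_sigma, x))
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - (a - mu) ** 2 / (2.0 * sigma ** 2)


def analytic_gradients(a, theta_mu, theta_sigma, x):
    """Closed-form gradients with respect to theta_mu and theta_sigma."""
    mu = math.exp(dot(theta_mu, x))
    sigma = math.exp(dot(theta_sigma, x))
    grad_mu = [(a - mu) * mu / sigma ** 2 * xi for xi in x]
    grad_sigma = [((a - mu) ** 2 / sigma ** 2 - 1.0) * xi for xi in x]
    return grad_mu, grad_sigma


def central_difference(f, theta, eps=1e-6):
    """Numerical gradient of f at theta, one coordinate at a time."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


a, x = 1.3, [1.0, 0.5]                               # illustrative values only
theta_mu, theta_sigma = [0.1, -0.2], [-0.3, 0.05]
grad_mu, grad_sigma = analytic_gradients(a, theta_mu, theta_sigma, x)
num_mu = central_difference(lambda t: log_policy(a, t, theta_sigma, x), theta_mu)
num_sigma = central_difference(lambda t: log_policy(a, theta_mu, t, x), theta_sigma)
```

The extra factor of μ in the first gradient comes from the chain rule through the exponential, and the standard deviation in its denominator is the one parameterized by $θ_{σ}$.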

We therefore need to approximate $q_{π} (S_{t}, A_{t})$ or $v_{π} (S_{t})$. To do this, we utilize recent advances in actor–critic algorithms. The actor is the policy, which we have formulated above, while the critic is the state value function approximator. Using a one-step bootstrapped return, we have the following:

θ_{t + 1} = θ_{t} + α \nabla \ln π (A_{t} | S_{t}, θ_{t}) (R_{t + 1} + \hat{v} (S_{t + 1}, w) - \bar{\bar{R}})

where $\hat{v} (s, w)$ is the value function approximator and $\bar{\bar{R}}$ is the running estimate of the average reward. Since our system has a continuous state space, the value function is approximated using a linear function (note that deep learning methods could also be used here, but with weaker convergence guarantees) as follows:

\hat{v} (s, w) = w^{T} x (s)

We then subtract off a baseline from the temporal difference error to reduce the variance of the update. This is typically chosen as $\hat{v} (S_{t}, w)$ :

θ_{t + 1} = θ_{t} + α \nabla \ln π (A_{t} | S_{t}, θ_{t}) \left(R_{t + 1} + \hat{v} (S_{t + 1}, w) - \bar{\bar{R}} - \hat{v} (S_{t}, w)\right)

This algorithm is used continuously (updates every 5 min) with the simulator running continuously as well. Note that we use “episode” and day interchangeably in this paper to refer to a full day of ridesourcing service, but the algorithm does not stop after the end of the day but rather continues on to the next day (and is therefore a continuous problem). This is unlike episodic problems, such as chess.

Algorithm 2: Wait Time Estimation

We utilize a data-based approach to estimate wait times given to customers. After each episode of training, we utilize the following variables for training a linear regression model:

Predictors.

At the time at which the customer requested the ride.

For each non-overlapping surrounding area (see Figure 1, e.g., the blue area), find the following:

the number of idle vehicles;

the number of non-idle vehicles;

the number of unassigned customers;

the number of assigned customers.

Dependent variable:

actual wait time of the customer (retrieved once the simulation of the episode is over).

Assume a linear function:

W_{i} = β_{0} + β x + ϵ, \quad ϵ \sim N (0, σ^{2})

Note that the non-overlapping areas could be set to any dimensions to achieve better predictions; in this paper, we use 1, 2, 6, and 10 mi and infinity. This approach was compared to a prediction based on the nearest vehicle to the customer and achieved lower mean absolute errors (∼3 min). Machine learning methods rather than linear regression will be tested in future work.
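The predictor construction can be sketched as follows. The shape of the surrounding areas is an assumption here: nested bands of Manhattan distance around the pickup location, with outer edges at 1, 2, 6, and 10 mi and an unbounded last band. The names `BOUNDS`, `CATEGORIES`, and `wait_time_features` are hypothetical, and the regression fit itself is omitted.

```python
import math

BOUNDS = [1.0, 2.0, 6.0, 10.0, math.inf]  # outer edge of each band, in miles
CATEGORIES = ("idle_vehicle", "busy_vehicle",
              "unassigned_customer", "assigned_customer")


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def band_index(distance):
    """Index of the first band whose outer edge contains the distance."""
    for k, bound in enumerate(BOUNDS):
        if distance <= bound:
            return k


def wait_time_features(pickup, entities):
    """Count each entity category per band; entities are (category, (x, y)) pairs."""
    counts = [0] * (len(BOUNDS) * len(CATEGORIES))
    for category, location in entities:
        k = band_index(manhattan(pickup, location))
        counts[k * len(CATEGORIES) + CATEGORIES.index(category)] += 1
    return counts


features = wait_time_features(
    (0.0, 0.0),
    [("idle_vehicle", (0.5, 0.2)),           # 0.7 mi away, band 0
     ("busy_vehicle", (1.0, 0.5)),           # 1.5 mi away, band 1
     ("unassigned_customer", (20.0, 20.0))]  # 40 mi away, last band
)
```

Each request then contributes one row of 20 counts, paired with its realized wait time once the episode ends, to the training set of the linear regression.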

Figure 1.

Non-overlapping areas for data gathering for wait time estimation (color online only).

Algorithm 3: Matching Vehicles to Customers

We utilize Hyland and Mahmassani’s ( 4 ) formulated optimization algorithm (Strategy #3) to match idle vehicles ($j \in J'$) to unassigned customers ($i \in I'$) every matching interval $d_{m}$ (30 s in this work). This is briefly explained below.

If the number of unassigned customers is greater than the number of idle vehicles:

\min \sum_{i \in I'} \sum_{j \in J'} (d^{ij} x^{ij} - γ \, ewt^{i} x^{ij})

where $d^{ij}$ is the distance between customer i and vehicle j, $ewt^{i}$ is the elapsed wait time of customer i, γ is the conversion coefficient from time (wait time) to distance, and $x^{ij}$ is a binary decision variable equal to 1 if vehicle j is assigned to customer i and 0 otherwise.

Subject to the following constraints:

\sum_{j \in J'} x^{ij} \leq 1 \quad \forall i \in I'

\sum_{i \in I'} x^{ij} = 1 \quad \forall j \in J'

x^{ij} \in \{0, 1\} \quad \forall i \in I', j \in J'

If the number of unassigned customers is less than or equal to the number of idle vehicles:

\min \sum_{i \in I'} \sum_{j \in J'} (d^{ij} x^{ij})

Subject to the following constraints:

\sum_{j \in J'} x^{ij} = 1 \quad \forall i \in I'

\sum_{i \in I'} x^{ij} \leq 1 \quad \forall j \in J'

x^{ij} \in \{0, 1\} \quad \forall i \in I', j \in J'
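The two cases of the matching problem can be illustrated on a tiny instance. The sketch below swaps the integer programming solver for brute-force enumeration, which is only practical for a handful of customers and vehicles; the function name `match` and the input layout are hypothetical.

```python
from itertools import permutations


def match(dist, elapsed_wait, gamma):
    """Return (customer, vehicle) pairs minimizing the objective of the active case.

    dist[i][j] is the distance from customer i to vehicle j, and elapsed_wait[i]
    is the elapsed wait time of customer i.
    """
    n_customers = len(dist)
    n_vehicles = len(dist[0]) if dist else 0
    if n_customers == 0 or n_vehicles == 0:
        return []
    if n_customers > n_vehicles:
        # Every vehicle serves exactly one customer; long waits lower a pair's cost.
        best = min(
            permutations(range(n_customers), n_vehicles),
            key=lambda cs: sum(dist[i][j] - gamma * elapsed_wait[i]
                               for j, i in enumerate(cs)),
        )
        return sorted((i, j) for j, i in enumerate(best))
    # Every customer gets exactly one vehicle; only distance matters.
    best = min(
        permutations(range(n_vehicles), n_customers),
        key=lambda vs: sum(dist[i][j] for i, j in enumerate(vs)),
    )
    return [(i, j) for i, j in enumerate(best)]


# Two customers, one vehicle: the farther customer has waited 60 time units.
scarce = match([[100.0], [500.0]], [0.0, 60.0], gamma=50.0)
# Two customers, two vehicles: pure distance minimization.
ample = match([[1.0, 10.0], [2.0, 3.0]], [0.0, 0.0], gamma=50.0)
```

In the scarce case the single vehicle goes to the customer who is farther away but has waited longer, because the elapsed wait term outweighs the extra distance.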

Full Simulation Framework

The full simulation framework is described both below and in Figure 2.

Figure 2.

Simulation framework.

Input: Simulation length, vehicle speed, matching parameters (Algorithm 3)

Input: Gaussian policy parametrization $π (a | s, θ) = \frac{1}{σ (s, θ) \sqrt{2 \pi}} \exp \left(- \frac{{(a - μ (s, θ))}^{2}}{2 σ {(s, θ)}^{2}}\right)$

Input: Define $μ (s, θ) = \exp (θ_{μ}^{T} x (s))$ and $σ (s, θ) = \exp (θ_{σ}^{T} x (s))$

Input: State-value function parameterization $\hat{v} (s, w) = w^{T} x (s)$

Input: Matching interval $d_{m}$, pricing interval $d_{p}$

Input: Choose step-size coefficients $α^{w}$, $α^{θ_{μ}}$, $α^{θ_{σ}}$, $α^{\bar{\bar{R}}}$

Initialize $\bar{\bar{R}}$

Initialize S

Initialize surge multiplier = 1

For every matching epoch $d_{m}$ (e.g., every 30 s):

  Collect customer information (status, elapsed wait time, pickup location, drop-off location)

  Collect vehicle information with state variables (status, location)

  If the elapsed simulation time is a multiple of $d_{p}$ (e.g., every 5 min):

    Observe state S' and reward R

    Calculate error: $δ = R + \hat{v} (S', w) - \bar{\bar{R}} - \hat{v} (S, w)$

    Update estimate of average reward: $\bar{\bar{R}} = \bar{\bar{R}} + α^{\bar{\bar{R}}} δ$

    Update value function parameters: $w = w + δ α^{w} x (S)$

    Update policy parameters:

    $θ_{μ} = θ_{μ} + α^{θ_{μ}} δ \frac{1}{σ {(S, θ_{σ})}^{2}} (a - μ (S, θ_{μ})) μ (S, θ_{μ}) x (S)$

    $θ_{σ} = θ_{σ} + α^{θ_{σ}} δ \left(\frac{{(a - μ (S, θ_{μ}))}^{2}}{σ {(S, θ_{σ})}^{2}} - 1\right) x (S)$

    Update state: S = S'

    Sample surge multiplier from Gaussian distribution: $a \sim π (\cdot | S, θ)$

    Set surge multiplier = a

    Reset reward: r = 0

  Update information on customers who accepted the offered price and ETA (predicted using Algorithm 2)

  Match vehicles to customers (using Algorithm 3)

  Update reinforcement learning metrics
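The learning step executed at each pricing epoch can be sketched as a small class. This is an illustration under assumptions, not the authors' implementation: the class name is hypothetical, all parameters are initialized to zero because the text does not state initial values, and the average-reward update uses the temporal difference error δ as in the average-reward actor–critic of Sutton and Barto ( 5 ).

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class AverageRewardActorCritic:
    """One-step actor-critic with a linear critic and an exp-linear Gaussian actor."""

    def __init__(self, n_features, alpha_w, alpha_mu, alpha_sigma, alpha_rbar):
        # Zero initialization is an assumption made for this sketch.
        self.w = [0.0] * n_features
        self.theta_mu = [0.0] * n_features
        self.theta_sigma = [0.0] * n_features
        self.r_bar = 0.0
        self.alpha_w, self.alpha_mu = alpha_w, alpha_mu
        self.alpha_sigma, self.alpha_rbar = alpha_sigma, alpha_rbar

    def mean_std(self, x):
        return math.exp(dot(self.theta_mu, x)), math.exp(dot(self.theta_sigma, x))

    def sample_action(self, x, rng):
        mu, sigma = self.mean_std(x)
        return rng.gauss(mu, sigma)

    def update(self, x, action, reward, x_next):
        """Apply one update from features x, the action taken, the reward, and x_next."""
        mu, sigma = self.mean_std(x)  # evaluated before any parameter changes
        delta = reward + dot(self.w, x_next) - self.r_bar - dot(self.w, x)
        self.r_bar += self.alpha_rbar * delta
        self.w = [wi + self.alpha_w * delta * xi for wi, xi in zip(self.w, x)]
        g_mu = (action - mu) * mu / sigma ** 2
        g_sigma = (action - mu) ** 2 / sigma ** 2 - 1.0
        self.theta_mu = [t + self.alpha_mu * delta * g_mu * xi
                         for t, xi in zip(self.theta_mu, x)]
        self.theta_sigma = [t + self.alpha_sigma * delta * g_sigma * xi
                            for t, xi in zip(self.theta_sigma, x)]
        return delta


# Illustrative step sizes, far larger than those in the paper, to make the update visible.
agent = AverageRewardActorCritic(2, alpha_w=0.1, alpha_mu=0.01,
                                 alpha_sigma=0.01, alpha_rbar=0.2)
delta = agent.update(x=[1.0, 0.0], action=3.0, reward=5.0, x_next=[0.0, 1.0])
next_action = agent.sample_action([0.0, 1.0], random.Random(0))
```

In the simulation loop, `update` would be called once per pricing interval with the accumulated reward, followed by `sample_action` on the newly observed state to set the next surge multiplier.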

Simulation Results and Discussion

A randomly selected day of a taxi dataset in Chicago was used for simulation. The same dataset is repeated over different days (episodes) and convergence is checked afterwards. This dataset has information on the pickup and drop-off locations of trips as well as the request time of rides. There are 29,981 rides in the dataset to be served by 150 AVs with an average speed of 27.2 mph.

The spatial distribution of the trips is shown in Figure 3 (those with requested pickup times from 8 to 9 a.m.). Most of the trips are located near the Loop and along Lake Michigan with some trips originating from the suburbs and the airport (at the top left). Figure 4 shows the temporal distribution of requests. Dynamic pricing is performed every 5 min and matching of vehicles to customers is performed every 30 s.

Figure 3.

Spatial distribution (8–9 a.m.) of pickup locations (left) and drop-off locations (right).

Figure 4.

Temporal distribution of requests.

Table 1 lists the simulation parameters and hyperparameters used. Note that numerous hyperparameters of the RL algorithm were tested to find the most suitable ones. As a starting point, we used Sutton and Barto’s ( 5 ) rule of thumb for setting the parameter of linear stochastic gradient descent methods as follows. Assume you want to learn in τ experiences with a reasonably similar feature vector, then:

α = {(τ E [x^{T} x])}^{- 1}

where x is a random feature vector (one can use an expected feature vector or check throughout the simulation). Moreover, we follow Sutton and Barto’s ( 5 ) recommendation whereby we use a greater step-size coefficient for the critic ( $α^{w}$ ) to allow it to learn faster than the actor and accurately “criticize” the actor.
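As a rough illustration of this rule of thumb, the expectation can be replaced by a sample mean over feature vectors logged during a simulation. The function name and the toy feature matrix below are assumptions made for the example only.

```python
import numpy as np


def rule_of_thumb_step_size(features, tau):
    """Step size alpha = (tau * E[x^T x])^(-1), with E[.] as a sample mean.

    features: array-like of shape (n_samples, n_features), one row per
              observed feature vector x (hypothetical logged data).
    tau:      number of experiences over which learning should occur.
    """
    features = np.asarray(features, dtype=float)
    mean_squared_norm = np.mean(np.sum(features**2, axis=1))
    return 1.0 / (tau * mean_squared_norm)


# Toy example: two feature vectors with squared norms 5 and 25.
alpha = rule_of_thumb_step_size([[1.0, 2.0], [3.0, 4.0]], tau=10)
```

A larger τ or larger feature magnitudes both shrink the step size, which is one reason the coefficients in Table 1 are so small.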

Table 1.

Simulation Parameters/Hyperparameters and Values

Simulation parameter	Value	Units
Simulation length	50	Days
Area size	625	Miles²
Matching interval	30	Seconds
Pricing interval	5	Minutes
Vehicle speed	27.2	mph
Drop-off time, pickup time	15, 45	Seconds
Weight of elapsed wait time $γ$	50	Feet/second
Reinforcement learning step-size coefficients $α^{w}$ , $α^{θ_{μ}}$ , $α^{θ_{σ}}$ , $α^{\bar{\bar{R}}}$	0.000002, 0.000000002, 0.000000002, 0.2	na

Note: na = not applicable.

The simulation was performed on a 24-core Intel Xeon CPU (3.0 GHz) machine with 128 GB RAM. Simulating 50 ridesharing days with the complete framework of Figure 2 took approximately 23 h. The simulation is coded in Python, with the optimization problem solved with Gurobi 9.1.0 and linear regression implemented with scikit-learn 1.1.0.

Figure 5 shows the learning process of the model’s pricing decisions. It takes nearly 20 days for the model to converge, with revenue plateauing at ∼$520,000. The figure shows the learnt means of the Gaussian parameterization of the surge multipliers (to show the patterns more clearly) on days 1 (Figure 5a), 5 (Figure 5b), 15 (Figure 5c), and 25 (Figure 5d). These are accompanied by the specific metrics that we track: (1) revenue; (2) wait time (mean and standard deviation); (3) number of unassigned customers at the end of the day (sanity check); (4) number of served customers at the end of the day; and (5) fleet utilization. Interestingly, we observe that the model learns by itself to price high when demand is high and low when demand is low, resembling the demand pattern shown in Figure 4. The plots also show the advantage of dynamic pricing over static pricing. Comparing days 5 and 25 (note that we are using day 5 as a reference as a conservative estimate that avoids arbitrary initial conditions), we see that revenue increases from $274,850 to $522,169 (90% increase) while wait time decreases substantially from 28 min to nearly 3 min (90% decrease), reaching an equilibrium that reflects the customers’ willingness to accept ride characteristics. Moreover, the number of unassigned customers at the end of the day approaches zero at convergence, which serves as a sanity check that there are no extreme wait time outliers, while the number of served customers stabilizes at nearly 72% of the demand.
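The day 5 to day 25 comparison can be reproduced with a short calculation. The revenue figures are those reported above; treating "nearly 3 min" as exactly 3 min is an assumption, so the wait time result is only approximate.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Revenue on day 5 versus day 25 (USD), as reported in the text.
revenue_change = percent_change(274_850, 522_169)

# Mean wait time in minutes; 3.0 stands in for "nearly 3 min" (assumption).
wait_change = percent_change(28.0, 3.0)
```

Both values round to the reported magnitudes of roughly 90%.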

Figure 5.

Mean of surge multiplier across days: days 1 (a), 5 (b), 15 (c), and 25 (d).

Scenario Testing

We develop two scenarios to test how robustly the RL algorithm detects and adapts to nonstationary behavior, as follows.

Scenario 1

In this scenario, a sudden shock in the system occurs whereby the customers’ willingness to pay drastically changes (e.g., COVID-19). In our framework, we do this by changing the coefficient of price $β_{1}$ from −1 to −1.6 starting on day 35.

Scenario 2

In this scenario, a slow change occurs to customers’ willingness to pay over time. In our framework, we do this by decreasing the coefficient of price $β_{1}$ by 0.05 every day starting on day 35.
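The two scenarios amount to different schedules for the price coefficient. The sketch below encodes them under two stated assumptions: the function name and signature are hypothetical, and in scenario 2 the first decrement of 0.05 is taken to apply on day 35 itself.

```python
def price_coefficient(day, scenario, base=-1.0, shock=-1.6,
                      drift=0.05, start_day=35):
    """Coefficient of price (beta_1) used on a given simulation day.

    scenario 1: sudden shock, beta_1 jumps from `base` to `shock`.
    scenario 2: slow change, beta_1 decreases by `drift` per day.
    """
    if scenario not in (1, 2):
        raise ValueError("scenario must be 1 or 2")
    if day < start_day:
        return base
    if scenario == 1:
        return shock
    # Assumption: the first decrement is applied on start_day itself.
    return base - drift * (day - start_day + 1)


# Coefficient over days 33 to 38 under each scenario.
scenario_1 = [price_coefficient(d, 1) for d in range(33, 39)]
scenario_2 = [price_coefficient(d, 2) for d in range(33, 39)]
```

A more negative coefficient makes customers more price sensitive, so both schedules lower the acceptance probability at any given surge multiplier.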

Figures 6 and 7 summarize the results. Note that the sudden drops on the first day stem from the initialization conditions of the RL algorithm and from the prediction models not yet estimating wait times well. We see that the model takes nearly 20 days to converge. Scenario 1 shows that although the shock greatly decreased revenue on the 35th day, the algorithm was able to detect changes in the reward structure and adjust its parameters quickly. The robustness of the algorithm is less clear in scenario 2, and we therefore run further simulations to compare the performance. The first is the full information case, that is, knowing the customers’ choice probabilities. The reward in this case is defined over all customers considering the choice probabilities, as follows:

r = \frac{\sum_{i = 1}^{| I |} \mathrm{Price}_{i} (p_{i})}{\mathrm{Demand}_{t - h}}

where $p_{i}$ is customer i’s probability of accepting the ride.
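A minimal sketch of this reward follows, reading the numerator as each customer's price weighted by the acceptance probability (i.e., expected revenue). That reading, the function name, and passing the lagged demand in as a plain number are assumptions made for illustration.

```python
def full_information_reward(prices, accept_probs, lagged_demand):
    """Expected revenue over all customers, normalized by lagged demand.

    prices:        quoted price for each customer i.
    accept_probs:  probability p_i that customer i accepts the ride.
    lagged_demand: demand term in the denominator (Demand_{t-h}).
    """
    if len(prices) != len(accept_probs):
        raise ValueError("prices and accept_probs must have equal length")
    if lagged_demand <= 0:
        raise ValueError("lagged_demand must be positive")
    expected_revenue = sum(price * prob
                           for price, prob in zip(prices, accept_probs))
    return expected_revenue / lagged_demand


# Toy example: two customers, lagged demand of five requests.
reward = full_information_reward([10.0, 20.0], [0.5, 0.25], lagged_demand=5)
```

Unlike a reward built from realized acceptances, this quantity does not depend on the random outcome of each customer's choice, which is why it gives a less noisy learning signal.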

Figure 6.

Scenario 1 revenue and wait time results.

Figure 7.

Scenario 2 revenue and wait time results.

The second case is using the converged RL model without any sampling of the surge multiplier after day 25 (i.e., no learning since the policy parameters do not get updated). This case also serves to show the power of sampling actions.

Figures 8 and 9 show the results. In scenario 1, the full information case performs the best as it provides accurate expected revenue even when the demand model’s parameters change; this results in updates of the RL parameters in the right direction and size. Nevertheless, our formulated version of the reward structure, which does not rely on any information about the demand, performs comparatively well. Finally, we see that RL pricing (with both our reward structure and the full information case) can learn quickly (within one day) and increase revenue compared to what the converged model can achieve. These results show that continuously sampling actions (i.e., continuous experimentation) can be a powerful tool and can increase robustness to system shocks. In scenario 2, the RL algorithm performs similarly to the full information case, with both algorithms outperforming the converged model. This further showcases the robustness of RL algorithms when optimizing under changing environments.

Figure 8.

Comparison of reinforcement learning to converged pricing and full information cases in scenario 1.

Figure 9.

Comparison of reinforcement learning to converged pricing and full information cases in scenario 2.

Finally, we use the RL model’s pricing decisions (Figure 5) to guide the selection of several static pricing multipliers (i.e., the same surge multiplier across the day). Note that these surge multipliers are decided after the fact, that is, after receiving insights from the dynamic pricing surge multipliers. The results are shown in Table 2. Compared with the (optimal) static surge multiplier of 3, RL achieves nearly 90% of the revenue achieved by optimal static pricing, which is quite high. Nevertheless, RL was able to serve more customers with a reasonable wait time. There are a few points to consider with respect to the above results. Firstly, the revenue result resembles that of Banerjee et al. ( 8 ) with respect to dynamic pricing not outperforming static pricing, but disagrees with most of the other literature. One difference to highlight is that this paper studies a fixed fleet system, which is likely to provide different insights because the vehicles are always available. Secondly, RL attempts to optimize under an uncertain environment with no knowledge of the future, whereas the static pricing surge multipliers were chosen using the guidance of the results from dynamic pricing, after knowing where to search. Moreover, since the static pricing methodology does not incorporate any online learning capability, RL’s value further increases when considering robustness. Further, note that the specific setup of the problem may affect results. For example, the customers’ choice model is the same across the entire day, which may not be entirely realistic. Lastly, note that deep learning methods should be investigated in future work with such fixed fleets, either at the actor stage or the critic, since these methods may better capture nonlinearities in the system and provide higher revenue.

Table 2.

Comparing Reinforcement Learning with (Optimal) Static Pricing

	Revenue	Mean wait time (s)	No. of customers served	Fleet utilization
Reinforcement learning	~$525,000	~215	~22,000	~0.66
Static pricing (surge multiplier = 2)	~$479,000	~480	~22,800	~0.65
Static pricing (surge multiplier = 2.5)	~$555,000	~210	~21,800	~0.53
Static pricing (surge multiplier = 3)	~$581,000	~110	~19,400	~0.38
Static pricing (surge multiplier = 3.5)	~$280,000	~90	~8,200	~0.14
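The "near 90%" comparison can be checked directly from the approximate values in Table 2. The dictionary layout and variable names below are illustrative only, and the inputs are the rounded table entries rather than exact simulation outputs.

```python
# Approximate revenue (USD) and customers served, copied from Table 2.
table_2 = {
    "Reinforcement learning": (525_000, 22_000),
    "Static, multiplier 2": (479_000, 22_800),
    "Static, multiplier 2.5": (555_000, 21_800),
    "Static, multiplier 3": (581_000, 19_400),
    "Static, multiplier 3.5": (280_000, 8_200),
}

rl_revenue, rl_served = table_2["Reinforcement learning"]

# Best static policy by revenue.
best_static = max(
    (name for name in table_2 if name.startswith("Static")),
    key=lambda name: table_2[name][0],
)
best_revenue, best_served = table_2[best_static]

revenue_share = rl_revenue / best_revenue  # RL revenue relative to best static
served_ratio = rl_served / best_served     # RL customers served relative to it
```

With these rounded inputs, RL earns about 90% of the best static revenue while serving about 13% more customers.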

Conclusion

This paper formulates a continuous pricing problem as an MDP and presents a simulation of SAMSs with algorithms to price rides, provide customers with wait time estimates, and match vehicles to customers. An average-reward RL formulation is presented to solve the MDP with a complex demand model that includes heterogeneous customers accepting the trip characteristics (price, expected wait time) probabilistically.

Insights from a fixed fleet operation of an autonomous private ridesourcing system in Chicago are presented. The converged model achieves an increase in revenue of 90% relative to day 5 of the simulation with a corresponding decrease of 90% in waiting time (note that we are using day 5 as a reference as a conservative estimate that avoids arbitrary initial conditions). However, after gathering insights from the RL algorithm and applying optimal static pricing (at a specific surge multiplier), we find that RL achieves nearly 90% of the optimal static revenue with reasonable wait times offered to customers. Considering that RL attempts to optimize under an uncertain environment with no knowledge of the future, its performance is reliable and its value further increases when considering robustness. Two robustness scenarios are tested. In the first, a sudden shock occurs to the system whereby the coefficient of price changes drastically. In the second, customers’ willingness to pay changes slowly. In both scenarios, RL can quickly adapt its parameters to increase revenue, showcasing the robustness of the algorithm.

This work raises several research questions that can be addressed in future work. Firstly, a global pricing mechanism was employed in this paper, rather than an OD-based method. The latter may improve revenue, customer wait times, or both. Secondly, one day of data was used to learn across days/episodes; however, the question remains whether the algorithm can learn using different data across days, and if so, how to assess convergence. Thirdly, future research can study demand spikes (e.g., special events) and how these should be considered in training models, that is, should the model train on such outlier events or should it learn to adapt its parameters in real time? Fourthly, the customer wait time algorithm may be improved with machine learning techniques rather than linear regression. Fifthly, future research should take into consideration customer cancellation probabilities and how to correctly design algorithms to minimize cancellations that result in system disturbances. Moreover, although this paper assumes fixed vehicle speed, the effect of congestion could readily be incorporated in optimizing such ridesourcing systems. Finally, this paper does not implement deep learning methods, which may better capture nonlinearities in the system. Other reward structures may also be studied to assess other aspects of the system (e.g., fairness, fleet utilization). A comparison of the continuous methodology with other RL-based algorithms should also be done in future work.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: H. Abkarian, H. Mahmassani; data collection: H. Abkarian; analysis and interpretation of results: H. Abkarian, H. Mahmassani; draft manuscript preparation: H. Abkarian, H. Mahmassani. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is funded through the Northwestern University Transportation Center’s Dissertation Year Fellowship.

ORCID iDs

Hoseb Abkarian

Hani Mahmassani

The authors remain solely responsible for all contents of this paper.

References

1. Morrison, Winston. The Economic Effects of Airline Deregulation. Brookings Institution Press, Washington, D.C., 2010.

2. Den Boer A. V. Dynamic Pricing and Learning: Historical Origins, Current Research, and New Directions. Surveys in Operations Research and Management Science, Vol. 20, No. 1, 2015, pp. 1–18.

3. Saharan, Bawa, Kumar. Dynamic Pricing Techniques for Intelligent Transportation System in Smart Cities: A Systematic Review. Computer Communications, Vol. 150, 2020, pp. 603–625.

4. Hyland, Mahmassani H. S. Dynamic Autonomous Vehicle Fleet Operations: Optimization-Based Strategies to Assign AVs to Immediate Traveler Demand Requests. Transportation Research Part C: Emerging Technologies, Vol. 92, 2018, pp. 278–297.

5. Sutton R. S., Barto A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2018.

6. Liu K. R. Multi-Stage Pricing Game for Collusion-Resistant Dynamic Spectrum Allocation. IEEE Journal on Selected Areas in Communications, Vol. 26, No. 1, 2008, pp. 182–191.

7. Aviv, Pazgal. A Partially Observed Markov Decision Process for Dynamic Pricing. Management Science, Vol. 51, No. 9, 2005, pp. 1400–1416.

8. Banerjee, Riquelme, Johari. Pricing in Ride-Share Platforms: A Queueing-Theoretic Approach. SSRN, 2015. http://doi.org/10.2139/ssrn.2568258.

9. Pandit V. N., Mandar, Hanawal M. K., Moharir. Pricing in Ride Sharing Platforms: Static vs Dynamic Strategies. Proc., 11th International Conference on Communication Systems & Networks (COMSNETS), Bengaluru, India, IEEE, New York, 2019, pp. 208–215.

10. Hall, Kendrick, Nosko. The Effects of Uber’s Surge Pricing: A Case Study. The University of Chicago Booth School of Business, Chicago, IL, 2015.

11. Haliem, Mani, Aggarwal, Bhargava. A Distributed Model-Free Ride-Sharing Approach for Joint Matching, Pricing, and Dispatching Using Deep Reinforcement Learning. IEEE Transactions on Intelligent Transportation Systems, Vol. 22, No. 12, 2021, pp. 7931–7942.

12. Chen, Yao, Zhu, Chen X. M. Spatial-Temporal Pricing for Ride-Sourcing Platform with Reinforcement Learning. Transportation Research Part C: Emerging Technologies, Vol. 130, 2021, p. 103272.

13. Turan, Pedarsani, Alizadeh. Dynamic Pricing and Fleet Management for Electric Autonomous Mobility on Demand Systems. Transportation Research Part C: Emerging Technologies, Vol. 121, 2020, p. 102829.

14. Karamanis, Angeloudis, Sivakumar, Stettler. Dynamic Pricing in One-Sided Autonomous Ride-Sourcing Markets. Proc., 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, IEEE, New York, 2018, pp. 3645–3650.