Abstract
The development of an origin–destination (OD) demand matrix is crucial for transit planning. With the help of automated data, it is possible to estimate a stop-level OD matrix. We propose a novel method for estimating transit route OD matrix using automatic passenger count (APC) data. The method uses
To understand the travel patterns of passengers, transit agencies require an origin–destination (OD) flow matrix of the passengers. This is a fundamental element of interest that helps in designing new routes and schedules, understanding and forecasting demand on the transit network, adjusting marketing strategy, and so forth. The transit OD flow matrix is the quantification of the flow of passengers from one transit stop to another. To evaluate such a matrix, the agencies conduct on-board surveys, which collect data about passenger boarding and alighting stops, the purpose of travel, and so on. These surveys are expensive to conduct and cover only a small sample of passengers ( 1 ). However, with recent advances in automated data collection systems (ADCS), it is possible to mine the full OD matrix. Automated data, such as automatic fare collection (AFC) and automatic passenger count (APC) data, are a rich source of information on passenger travel over a continuous period, making it possible to estimate OD matrices more frequently.
The OD estimation problem has attracted the attention of many researchers over several decades. More recently, the use of automated data such as AFC, APC, or cell phone data has become popular for OD estimation. For example, the AFC data can be used to estimate a stop-level OD matrix. It usually lacks the passenger alighting stop, which can be inferred using a trip-chaining algorithm based on several assumptions (2–5). The inference rate depends on the quality of data, the percentage of passengers using the smart card, and assumptions involved in the trip-chaining algorithm. On the other hand, APC systems collect information about the number of passengers boarding and alighting at each transit stop. OD estimation using the boarding and the alighting counts poses a classic problem, which is hard to solve. The problem requires solving an underdetermined system of equations, in which the number of unknowns to solve is far higher than the number of equations available. Usually, multiple solutions are possible for this problem, which satisfy the given equations. Other information is supplemented to produce the accurate OD flows. To deal with this underdetermined problem, various methods have been proposed in the literature, which are summarized below:
Iterative Proportional Fitting (IPF) method: This is a popular and easy to apply method to evaluate the OD matrix using count data (6, 7). The method starts with a base matrix, which is improved iteratively by multiplying the columns and rows of the matrix by a constant factor. The base matrix can be taken as a null matrix or any other seed matrix. Mishalani et al. found that using on-board survey data as a base matrix gives more accurate results than using a null base matrix ( 8 ). The method has several issues such as the problem of non-structural zeros ( 6 ), meaning that a zero entry remains zero in every iteration. The method also fails to converge if the number of zero entries becomes large in the matrix.
Bayesian inference methods: These methods use a Bayesian approach to evaluate an OD matrix by formulating the problem as a partially observed Markov chain and utilizing prior information along with current observations of count data (9–12).
Optimization methods: As there are multiple solutions possible for this system of equations, these methods try to find the one which optimizes an objective. The objective can be maximizing entropy ( 13 ) or the likelihood (14–16) function. With isotropic Gaussian noise, the maximum likelihood estimation turns into a classic least squares problem.
Another class of optimization method considers the above objectives along with a regularizer. The regularizer helps to mitigate the ill-posedness of the system of equations. The regularization can be included as a least square term between the unknown and a prior OD matrix obtained from a survey or from domain knowledge. This technique is quite popular in the literature. For example, Cascetta and Nguyen minimized generalized least square objective with a prior matrix ( 7 ), Van Zuylen and Willumsen maximized the relative entropy or minimized the Kullback-Leibler (KL) divergence of unobserved and observed flow distributions ( 13 ). This approach tries to force the solution, as close to the prior matrix as possible which may result in poor estimates if the prior or seed matrix used is not reliable.
In this research, we evaluate the transit route OD matrix using APC data. The problem is the estimation of the flow of passengers between stops for a single trip. The route matrix problem has a special structure that provides an extra piece of information to reduce the ill-posedness of the system of equations. The estimation requires the selection of the correct estimate out of the multiple solutions. We use an estimation method that encourages the sparse OD matrix using
The rest of the paper is structured as follows. The next section presents the methodology for sparse matrix recovery followed by results of the experiments in the subsequent section. We then look at the limitations of this research and discuss directions for future research. We finish with some conclusions.
Methodology
In this section, we present the method for estimating the route level OD matrix using boarding and alighting counts available from APC data. The notations used throughout the paper are shown in Table 1.
Notations
Note: OD = origin–destination.
Let

Transit route origin–destination flow.
Constraints
If we sum the values of
2. Similarly, if we sum the values of
3. The total number of passengers boarding at all the stops should be equal to the total number alighting:
4. The numbers boarding and alighting at the same stop is zero, which means the diagonal elements of the matrix
5. As the transit vehicle runs in a single direction, a passenger boarding at one stop cannot alight at the previous stops that the vehicle has already visited. This means:
6. The total load on a link between two stops is equal to the passengers boarding between those stops:
By imposing these constraints, the structure of the matrix will be as shown in Figure 2.

Origin–destination matrix for a route in a single direction.
We can express the linear constraints (1)–(6), in form of a matrix as:
where
Transit Route OD Estimation using the Compressed Sensing Technique
As discussed, Equation 7 is usually an ill-posed problem for which one can expect multiple solutions. A generic regularizer can help in mitigating the ill-posedness of the problem. One such regularizer is the generalized least square of the difference between the unkown and a prior matrix obtained from the survey data. The quality of the solution depends upon the availability of a good prior matrix as the optimal solution is forced to be as close as possible to the prior matrix. We can use other regularizers based on the domain knowledge on the space of the plausible OD flows in the network (
17
). To use one such regularizer, we make the following assumption: The planted OD matrix in the set of linear equations is sparse which means that the flow between many of the OD pairs should be equal to zero. The observed flow is only because of a small subset of
The intuition behind this assumption is that there are many OD pairs for a transit route, but the travel happens only along a few pairs. For example, during the morning peak hours, there are only a few popular origin stops such as residential locations and few destination stops such as central business areas, park and rides, and so forth. Moreover, it is unlikely that passengers boarding at the initial stops of the route will alight at all the following stops. This makes the flow between most of the OD pairs equal to zero. This is opposite to the solution evaluated using entropy maximization, which tries to achieve a solution as uniform as possible to minimize the errors. The sparsity as a regularizer has been used before for highway network OD estimation and has found promising results (17, 21–24). For example, ( 17 ) leverages sparsity in highway OD matrices to estimate a set of suitable traffic analysis zones (TAZs) and uses those zones to evaluate an OD matrix. The method proposed in ( 17 ) has a bi-level structure with sparse OD estimation on upper level and traffic assignment using user equilibrium at lower level. The use of non-negativity constraints for improving the solution is also emphasized. This paper uses similar optimization for the transit route OD estimation problem, which has a special structure as we get an extra set of constraints because of transit movement in one direction. We also describe the conditions under which sparse recovery is possible.
Using Sparsity as the Regularizer for OD Estimation
To achieve the sparsity in the solution, we minimize the number of non-zero entries in the solution, which can be done by minimizing
where,
Equation 9 tries to find the sparse vector
where,
Restricted Isometry Property
The linear map
RIP matrices are extremely common in practice and most of the random matrices satisfy this property. Based on the above definition, a theorem is proposed by Candès and Tao (
25
): If
As the passenger flow cannot be negative, we can replace the
We use the optimization program in Equation 8 with the
Results
In this section, we present two numerical examples of OD estimation using the proposed methodology. First, simulation is used to assess the consistency and accuracy of the estimation method. Second, the OD estimation of a bus route in Twin Cities, MN is presented.
OD Estimation using Simulation
We use the APC data provided by Metro Transit, which is a primary transit service provider in the Twin Cities, MN area offering an integrated network of buses, light rail, Bus Rapid Transit (BRT), and a commuter train. To prepare a synthetic OD matrix, we make some assumptions on the probability distribution of the arrival of the passengers at different stops. We consider 10 stops along a transit route to facilitate the presentation of results. Passenger arrivals at the stop are assumed to follow a Poisson distribution:
where
where
Figure 3 shows the histogram of


Box plot for the errors in estimation of origin–destination flows.
Figure 5a shows the average load profile of the passengers on the transit route. The width of the 95% confidence interval is small which shows that the method is reliable in estimating demand and therefore in deciding the adequate frequency to handle the load of the passengers. We can also observe that the errors in estimating the load of the passengers is also quite small (Figure 5b).

(a) Average load profile of the transit route; (b) box plot of error between actual load and estimated load.
To understand the effect of sparsity, we solved the problem for several levels of sparsity and calculated the root mean square error (RMSE) between the estimated and actual OD matrix. Figure 6 shows the RMSE value with respect to the sparsity in the matrix. We can observe that the RMSE value is reduced with increased sparsity. For example, when the OD matrix has only 10% non-zero values, the corresponding RMSE value was found to be less than 0.35, which is quite impressive. This shows that the accuracy of the method is improved when there is more sparsity. Comparing the results to a common least squares solution (Figure 6), the proposed method is able to recover solutions with lower RMSE values. It can also be observed that when the sparsity is low, the proposed method is more efficient than least squares as the gap between two lines is high but when there are a greater number of non-zero entries, the RMSE gap between these two methods reduces.

Root mean square error (RMSE) versus sparsity in origin–destination estimation (sparsity in relation to proportion of non-zero values).
To see how different demand patterns affect the OD estimation, we performed a similar simulation for several mean arrival rates

Comparing root mean square error (RMSE) with sparsity for different mean arrival rates (each panel represents a different mean arrival rate of passengers).
OD Estimation of A Line BRT Route in Twin Cities
We use the APC data from Twin Cities, MN to calculate the OD flow of a route. The APC data used for this research contain transit trip information, such as date and time of the operation, routeID, stopID, departure and arrival times, numbers boarding and alighting at each stop, and the geographical coordinates of the stops. We selected A Line, which is a BRT route in Twin Cities for this analysis. It serves 20 stations along Snelling Av and 46th St. We selected a trip from the data during peak hour. The numbers boarding and alighting at different stops in the northbound direction is shown in Figure 8. We can observe the popular boarding locations such as 46th Street Station, 46th & Minnehaha Station, and Snelling & Highland Station and alighting stops such as Rosedale Transit Center, Snelling & Highland Station and Snelling & Clair St. Station. We use the optimization program (Equation 10) to solve the given problem with a value of

Boarding and alighting count for A Line.
The total ridership of the trip is 16. Because of low ridership, flow along most of the OD pairs should be equal to zero. We apply the proposed method to the given data and calculate the OD flows. Figure 9 shows the OD flows between different OD pairs. We can see that the flow occurred only between 11 OD pairs out of 400 pairs (2.75%). The highest flow was observed between Snelling & Highland Av. and Rosedale Transit Center, which is the last station along this route. Other popular OD pairs are 46th St. and Snelling & St. Clair, Snelling & Minnehaha and Snelling & Highland Av. Because of the low ridership, the sparse matrix recovery seems to perform well.

Origin–destination flow for A Line, Twin Cities, MN.
Limitations and Discussion
In this section, we discuss the limitations of the proposed method and provide some recommendations for future research to address these limitations. Various studies use a prior matrix as a regularizer which can also be included in the proposed framework as follows:
The choice of parameters
The problem of OD estimation has been well studied in the literature in the context of both road and transit networks. This is an interesting problem with solution methods using both optimization and statistical techniques. Depending on the available data, the problem can be formulated as an underdetermined or overdetermined system of equations. The classic techniques such as entropy maximization, least squares, and so forth, produce some good results but may not evaluate the correct solution as there can be infinitely many or no solutions, which again depends on the quality of data and the set of equations obtained from the setup. Recent work in the field of CS has established that under suitable conditions, we can evaluate a unique sparse solution out of a given ill-posed system of linear equations. We tried to solve the problem using this approach and achieved impressive results but not an exact solution. We also did not prove that the linear mapping produced by a set of linear equations in this case satisfied the RIP, which may be the source of error in our results. The condition is hard to prove and could be a topic for future research.
The CS technique can also be used to design a linear map
This research can also be expanded in multiple directions. The method can be used to estimate a full transit network OD matrix. The problem can be formulated as a bi-level program with sparse recovery optimization at the upper level and transit assignment (28, 29) at the lower level to capture route choice behavior in the model. We believe that the network level OD will also be sparse because it is unlikely that passengers boarding at one stop can alight at all other stops in the network. The concept can also be extended to matrix sensing which will be helpful in estimating a time-dependent transit OD matrix. As the boardings and alightings follow a regular pattern during various hours of the day, data from several days can be used to learn this pattern. This means the high dimensional data for several days can be used to minimize the rank of the matrix to extract a regular pattern. This can be done by minimizing the nuclear norm of the matrix, which is a convex surrogate for the rank of the matrix. The problem is computationally challenging and needs further attention.
Conclusions
In this research, we proposed a method for estimating an OD matrix for a transit route along one direction. The problem can be formulated as an undetermined system of linear equations. The adopted strategy was to estimate a sparse OD matrix, using the
Footnotes
Acknowledgements
This research was conducted at the University of Minnesota Transit Lab (
), currently supported by, but not limited to, the following projects:
National Science Foundation, award CMMI-1637548 Minnesota Department of Transportation, Contract No. 1003325 Work Order No. 15 Minnesota Department of Transportation, Contract No. 1003325 Work Order No. 44 Transitways Research Impact Program (TIRP), Contract No. A100460 Work Order No. UM2917
The authors are grateful to Metro Transit for sharing the data. We are also grateful to the anonymous referees for their constructive input to improve the quality of this article. Any limitation of this study remains the responsibility of the authors.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: PK, AK, GAD; data collection: PK, AK, GAD; analysis and interpretation of results: PK, AK, GAD; draft manuscript preparation: PK, AK, GAD. All authors reviewed the results and approved the final version of the manuscript.
The Standing Committee on Transportation Network Modeling (ADB30) peer-reviewed this paper (19-05925).
