Building a longitudinal geospatial dataset of micro-businesses in Mexico City

Abstract

We introduce a methodology and open-source code to construct a longitudinal dataset using georeferenced business data from Mexico’s National Statistical Directory of Economic Units (DENUE). The data include each business’s coordinates and other geographic characteristics, but imprecision in the coordinates across years, the lack of business IDs before 2015, and concerns with these recently introduced IDs make longitudinal analysis of business continuity difficult. Focusing on food businesses in Mexico City, we use data from 2010 (n = 416,898) and 2020 (n = 470,363) to build a longitudinal dataset. We address key issues, including imprecise geographic coordinates, missing or incorrect values in critical variables, and discrepancies in business names across years. Our approach combines spatial and string-matching techniques to track businesses over time, achieving an 84% F1 score (a measure that balances precision and recall) when validated against official identifiers. Although our case study centers on food businesses, the methodology can be adapted to other industries, geographies, and similar snapshot-based datasets lacking unique identifiers, serving as a replicable tool for broader urban analytics. The dataset and code are publicly available on GitHub, providing researchers and practitioners with valuable resources to analyze economic and spatial dynamics in cities.

Keywords

business data; food retail; geospatial data; point-level data; open data; DENUE; Mexico

Introduction

Understanding the spatial patterns of retail has been a long-standing endeavor among economists and geographers (Christaller, 1966; Isard, 1956). These patterns reflect broader trends in urban structure (Berry, 1967) and competition over space (Alonso, 1960), consumer behavior (Glaeser et al., 2001; Thurstain-Goodwin and Unwin, 2000), urban vitality (Jacobs, 1961; Montgomery, 1998), economic growth (Birch, 1987; Neumark et al., 2011), business resilience (Credit et al., 2023; Singleton et al., 2016; Wrigley and Dolega, 2011), and social inequalities (Deener, 2007; Zukin et al., 2009). In the last three decades, the field of retail geography has evolved through advancements in spatial modeling techniques (Birkin et al., 2003; Fotheringham et al., 2000), computational methods (Comber et al., 2020; Pavlis et al., 2018), and new data sources (Lansley and Longley, 2017). Consequently, retail business spatial data has become an essential tool for interdisciplinary urban research and practical applications. In this context, we propose a methodology to construct a longitudinal dataset from a public spatial database containing retail information for all of Mexico.

The National Statistical Directory of Economic Units (DENUE, Directorio Estadístico Nacional de Unidades Económicas) is a public dataset in Mexico maintained by the National Institute of Statistics and Geography (INEGI, Instituto Nacional de Estadística y Geografía) since 2010 (INEGI, 2020). It offers data on the identification, location, economic activity, and size of over five million businesses across the country. DENUE is updated regularly: records for microbusinesses are updated every 6 months, while large businesses and key economic sectors are updated annually using administrative records and national surveys. A full update occurs every 5 years with the Economic Census, which is conducted in years ending in four and nine; the updated data are published the following year. This schedule allows DENUE to provide a reliable snapshot of Mexico’s evolving economic dynamics. Similar datasets exist in other countries, such as the U.S. National Establishment Time Series (NETS), Data Axle, or the UK Business Structure Database, but these are either proprietary or lack precise georeferencing. By contrast, DENUE’s public, spatially detailed data make it uniquely valuable for replicable urban research in the Global South.

Despite ongoing data collection, the snapshots were largely independent across different years until recently, making it difficult to track individual businesses over time. To address this limitation, we outline how to construct a longitudinal dataset from the 2010 and 2020 snapshots. We highlight challenges in data cleaning and tracking businesses over time. We validate our methodology using 2015 and 2025 data where official identifiers are available, demonstrating strong matching performance (see Appendix). Finally, we examine food retail establishments to characterize the “birth,” “death,” or “survival” of establishments to understand business demographics, as shown in previous studies (Anselin and Farah, 2023; Carroll and Hannan, 2000; Van Wissen, 2002).

We provide this methodology and code as a guide for researchers to replicate the process, enabling them to construct their own longitudinal dataset using Mexican business data. Specifically, we provide a starting point for DENUE users, highlighting data cleaning challenges and methods for creating a longitudinal dataset, which can improve dataset quality and open new research possibilities for future analyses (Arribas-Bel et al., 2021). Our method provides an alternative, flexible, and longitudinal approach to defining business continuity that does not rely on official identifiers. This is crucial for two reasons: official identifiers do not exist before 2015, and in recent DENUE datasets some businesses that share an official identifier across years differ notably in location and name. To support reproducibility and verification, we share the original 2010, 2015, 2020, and 2025 datasets¹ from INEGI along with our R code² and final longitudinal data product³ in a public GitHub repository⁴. Researchers in geography and related fields can use this dataset to compare business survival models, analyze retail distribution over time, and integrate it with other datasets at various geographical scales, further expanding research opportunities.

Challenges of building a longitudinal dataset with DENUE

The DENUE dataset faces several data collection and consistency challenges. Enumerators walk through the city to gather information and record latitude and longitude coordinates for businesses, but these coordinates vary between years, even for the same store, which makes it difficult to match businesses over time. Additionally, key variables such as postal codes and phone numbers are often missing or contain incorrect values, complicating consistent analysis. For example, both the 2010 and 2020 datasets include a variable called “edificio” (building), which indicates if a business is located inside a public market, but the content within this variable is largely missing in the 2010 dataset, making some analyses much more difficult.

Moreover, identifying the same business across years is challenging because of name inconsistencies as well as geographic inconsistencies. Previous work has documented similar difficulties and relied on manual matching of DENUE records, underscoring the need for automated and replicable methods (Reyes and Suárez, 2022). In densely populated areas like Mexico City, where many businesses are located on the same block, it becomes difficult to identify and track a business over time, since there might be several stores of the same type within one block. Additionally, the same business is often coded with different names in different years. For instance, as shown in Figure 1, an establishment selling chicken appears as “Pollería Lupita” in the 2010 dataset, while in the 2020 dataset it is listed simply as “Bachoco.” These variations in geographic coordinates and business names require nuanced matching methods to accurately track businesses across different years.

Figure 1.

Example of name inconsistencies across the years.

Building the longitudinal dataset

This section outlines the process of constructing a longitudinal dataset using the 2010 and 2020 DENUE data for Mexico City, with a focus on food businesses. We carry out multiple steps to clean and prepare the data, including handling inconsistencies in business classifications and geographic coordinates.

General data cleaning

We begin by importing the 2010 and 2020 datasets, standardizing variable names, normalizing textual data (e.g., converting entries to lowercase and removing accents), and adding unique identifiers to rows in the dataset. By looking at the NAICS (North American Industry Classification System) codes relevant to food businesses⁵, we then perform an initial descriptive analysis to understand the distribution of businesses across different categories in 2010 and 2020. Through analyzing establishment names, we also reclassify NAICS codes when appropriate; this applies most often to initially misclassified convenience and dairy stores.
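The text normalization and row-identifier step can be sketched as follows. This is an illustrative Python sketch rather than the R code in the repository; the `name` field, the `name_clean` field, and the `row_id` format are hypothetical choices made only for the example.

```python
import unicodedata


def normalize_name(raw: str) -> str:
    """Lowercase a name, strip accents, and collapse repeated whitespace."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def add_row_ids(records: list, year: int) -> list:
    """Return copies of the records with a cleaned name and a unique row id."""
    # The year-prefixed id format is an assumption, not the format of the data product.
    return [
        {**rec, "row_id": f"{year}_{i:06d}", "name_clean": normalize_name(rec["name"])}
        for i, rec in enumerate(records, start=1)
    ]


rows_2010 = add_row_ids(
    [{"name": "  Pollería  LUPITA "}, {"name": "Miscelánea El Sol"}], 2010
)
```

Note that stripping combining marks also maps “ñ” to “n”, which may or may not be desirable depending on how names are compared later.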

Inconsistent establishment names

As noted above, establishment names are particularly messy. The names that change the most across years belong to small neighborhood stores, usually called “abarrotes,” “abarroterías,” or “misceláneas.” These establishments are small grocery stores that serve locations ranging from residential neighborhoods to busy commercial streets, offering convenient access to everyday essentials. They typically sell a variety of basic goods, including packaged foods, beverages, cleaning supplies, personal care products, and sometimes fresh produce or household items. Despite their small size, these establishments are crucial to study, as they account for the largest share of food and beverage purchases among all types of establishments in Mexico (Farah et al., 2022).

The first challenge we face when cleaning data on small neighborhood stores is the typos that enumerators introduce when registering the stores, both in the store names themselves and in generic terms such as “abarroterías” or “misceláneas.” Since there is little consistency across years, we homogenize store names by re-coding all possible spellings and versions of “abarrotes.” We also homogenize the names of all fruit and vegetable stores and their alternative names (e.g., recaudería), as well as other common stores where we identify alternate names. To include other stores that might have misclassified NAICS codes, we reclassify stores whose names contain “abarrotes” as NAICS code 461110 (small neighborhood stores), but we then filter out wholesale stores (431110), restaurants with abarrotes (722511), and pharmacies explicitly without minisupers (464111). After we extensively clean the names of the establishments, we subset the data to the NAICS categories relevant to food businesses.
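One way to illustrate the homogenization of misspelled generic terms is shown below. Our cleaning re-codes spellings explicitly; this sketch instead swaps in approximate matching from Python's standard `difflib` module to keep the example short. The alias table, the canonical terms, and the 0.85 similarity cutoff are assumptions, and names are assumed to be lowercase with accents already removed.

```python
import difflib
import re

# Canonical generic terms (accent-stripped); an illustrative list only.
CANONICAL = ("abarrotes", "miscelanea", "recauderia")

# Hypothetical aliases that are recoded before any approximate matching.
ALIASES = {
    "abarroteria": "abarrotes",
    "abarroterias": "abarrotes",
    "miscelaneas": "miscelanea",
}


def homogenize(name: str, cutoff: float = 0.85) -> str:
    """Replace misspelled generic terms in a cleaned store name."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", name.lower()):
        token = ALIASES.get(token, token)
        # Keep the token unless it is very close to a canonical term.
        close = difflib.get_close_matches(token, CANONICAL, n=1, cutoff=cutoff)
        tokens.append(close[0] if close else token)
    return " ".join(tokens)


examples = [homogenize(n) for n in ("abarotes lupita", "misselanea el sol")]
```

A high cutoff keeps ordinary words in a store name from being pulled toward a generic term; any cutoff used in practice would need manual review of the recoded names.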

Deduplicating and handling public markets

We analyze duplicate coordinates corresponding to stores in the exact same location or very similar locations due to being located inside public markets. Within exact coordinate duplicates, we first find chain stores (such as convenience stores), and we deduplicate that data, retaining only one representation of each establishment.

Public markets in Mexico City contain many establishments that are of interest, yet stores in these markets operate under different business dynamics compared to standalone businesses outside the markets. We therefore eliminate establishments that are part of public markets, first by excluding stores (103 in 2010 and 20,242 in 2020) whose names or building contain specific keywords (e.g., plaza, hallway, market) that indicate their location within markets, or that have an entry in the “number of stalls” variable. It is important to note again that the variables on building type and number of stalls have high missingness in the 2010 dataset. Then, we use open data on public markets from the city’s open portal⁶ to identify establishments within a strict 100-m buffer around each public market and remove them (17,688 in 2010 and 5,532 in 2020). This initial distance approximates the core footprint of most public markets in Mexico City, including both the main structure and immediately adjacent stalls. Next, we expand the buffer to 300 m, eliminating all establishments that have exact coordinate duplicates within that catchment area (2,273 in 2010 and 579 in 2020). The expanded buffer captures the influence of larger markets that physically extend beyond 100 m from their center, and it helps remove persistent duplicates that were still present after the first round of removal. This second buffer distance was determined through visual inspection of maps around known markets where duplicates persisted. Lastly, we remove remaining duplicates by checking for redundancy across geographic (coordinates, census tract, block) and business characteristics (name, industry code), removing 1,274 stores in 2010 and 101 in 2020.
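The two buffer rules can be sketched as follows. This is a minimal Python illustration, assuming each public market is represented by a single point and that distances are computed with the Haversine formula on a sphere of radius 6,371 km; the field names `lat` and `lon` and the function names are hypothetical.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (an assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def drop_market_stores(stores, markets, inner_m=100.0, outer_m=300.0):
    """Drop stores inside the inner buffer, and coordinate duplicates inside the outer one.

    `markets` is a non-empty list of (lat, lon) tuples.
    """
    coord_counts = Counter((s["lat"], s["lon"]) for s in stores)
    kept = []
    for s in stores:
        nearest = min(
            haversine_m(s["lat"], s["lon"], m_lat, m_lon) for m_lat, m_lon in markets
        )
        if nearest <= inner_m:
            continue  # within the core footprint of a market
        if nearest <= outer_m and coord_counts[(s["lat"], s["lon"])] > 1:
            continue  # exact coordinate duplicate near a market
        kept.append(s)
    return kept


market = (19.4326, -99.1332)
stores = [
    {"id": "A", "lat": 19.4331, "lon": -99.1332},  # about 56 m away
    {"id": "B", "lat": 19.4346, "lon": -99.1332},  # about 222 m away, duplicated
    {"id": "C", "lat": 19.4346, "lon": -99.1332},
    {"id": "D", "lat": 19.4346, "lon": -99.1331},  # about 223 m away, unique
    {"id": "E", "lat": 19.4376, "lon": -99.1332},  # about 556 m away
]
kept_ids = [s["id"] for s in drop_market_stores(stores, [market])]
```

With real market polygons, the same logic would use the distance to the polygon rather than to a single point.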

Building a longitudinal dataset using fuzzy matching

At this point in the data cleaning process, there are no duplicate establishments within a given year’s data according to coordinates, and the names of the establishments are reasonably clean. We therefore join the data first using exact matches of geographic variables (municipality, census tract (AGEB), and block) as well as the NAICS codes. However, this process results in a many-to-one match, as some establishments within a block in one year may initially correspond to multiple establishments in another year. We resolve the many-to-one matches by employing a fuzzy matching technique: we calculate the string distance between establishment names and the Haversine geographic distance between coordinates. Before performing these calculations, we further clean the establishment names by removing the word “abarrotes,” which improves matching accuracy.
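The exact join on geographic and industry variables, which produces the candidate pairs that fuzzy matching then resolves, can be sketched as below. This is an illustrative Python sketch; the field names (`municipality`, `ageb`, `block`, `naics`, `id`, `name`) are hypothetical and do not reflect the DENUE column names.

```python
from collections import defaultdict

# Hypothetical field names for the exact-match keys.
BLOCK_KEYS = ("municipality", "ageb", "block", "naics")
GENERIC_WORDS = {"abarrotes"}  # generic term removed before comparing names


def strip_generic(name):
    """Remove generic words so that only the distinctive part of a name is compared."""
    return " ".join(w for w in name.split() if w not in GENERIC_WORDS)


def candidate_pairs(rows_2010, rows_2020):
    """Pair every 2010 record with every 2020 record sharing the same keys."""
    by_key = defaultdict(list)
    for row in rows_2020:
        by_key[tuple(row[k] for k in BLOCK_KEYS)].append(row)
    pairs = []
    for row in rows_2010:
        for other in by_key.get(tuple(row[k] for k in BLOCK_KEYS), []):
            pairs.append(
                (row["id"], other["id"],
                 strip_generic(row["name"]), strip_generic(other["name"]))
            )
    return pairs


key = {"municipality": "015", "ageb": "0123", "block": "007", "naics": "461110"}
rows_2010 = [{"id": "a1", "name": "abarrotes lupita", **key}]
rows_2020 = [
    {"id": "b1", "name": "abarrotes lupita", **key},
    {"id": "b2", "name": "abarrotes el sol", **key},
    {"id": "b3", "name": "abarrotes lupita", **{**key, "block": "008"}},
]
pairs = candidate_pairs(rows_2010, rows_2020)
```

In this toy example the single 2010 record has two candidates in the same block, which is the many-to-one situation that the string and geographic distances are then used to resolve.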

Sensitivity analysis

As mentioned, the determination of survival is based partially on matching the location (coordinates) and name of an establishment across the 2010 and 2020 datasets. Records are matched using joint cutoffs for the Haversine distance between two locations (in meters) and the string distance between the names of the establishments (calculated using the Jaro–Winkler method, which outperforms other methods, as we show in the Appendix). We test the sensitivity of our results to these cutoff values, as shown in Figure 2. We vary the cutoffs for string distance and Haversine distance jointly, and we find that beyond approximately 200 m, geographic distance has little additional impact on match rates. The string distance, however, is much more impactful across the range of values tested (0.015 to 0.45). In Figure 2, the combination of cutoffs used for our initial matched dataset is shown in black (0.3 string distance, 150 m Haversine distance). Since some establishments have a larger footprint (e.g., supermarkets), we also accept matches from a second cutoff combination that is stricter in terms of string distance (0.2) but less strict in terms of geographic distance (350 m). Additionally, some businesses have more moderate but acceptable differences in names (string distance less than 0.5) due to the nature of the data collection across years, but are located less than 15 m from a potential match; we include these businesses as additional matches. We choose this combination of additional cutoffs based on the validation with official IDs shown in the Appendix, where these cutoffs are shown to produce optimal match-recovery performance. We include the string and Haversine distances as variables in our matched dataset, so users can change these cutoffs based on desired precision. These thresholds can have important implications depending on the research question. For example, stricter thresholds may be appropriate for studies of firm-level behavior, while more relaxed thresholds may be suitable for broader spatial trends.
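The joint cutoffs described above can be written as a single decision rule. The sketch below is a self-contained Python illustration, not the R code in the repository. It implements the textbook Jaro–Winkler distance with a prefix weight of 0.1, which is an assumption (string-distance libraries differ in their default prefix weight), and it treats the 0.3/150 m and 0.2/350 m cutoffs as inclusive, which is also an assumption.

```python
def jaro_similarity(s, t):
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [ch for ch, flag in zip(s, s_flags) if flag]
    t_matched = [ch for ch, flag in zip(t, t_flags) if flag]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) / 2
    return (
        matches / len(s) + matches / len(t) + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_distance(s, t, p=0.1):
    """1 minus Jaro-Winkler similarity; p is the prefix weight (assumed 0.1)."""
    jaro = jaro_similarity(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return 1.0 - (jaro + prefix * p * (1.0 - jaro))


def is_match(name_dist, meters):
    """Apply the three cutoff combinations to one candidate pair."""
    return (
        (name_dist <= 0.3 and meters <= 150)    # initial combination
        or (name_dist <= 0.2 and meters <= 350)  # larger footprint, stricter names
        or (name_dist < 0.5 and meters < 15)     # near-identical location
    )


d = jaro_winkler_distance("martha", "marhta")
```

Because the rule takes the two distances as inputs, users of the matched dataset could re-apply it with different cutoffs without recomputing the distances.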

Figure 2.

Sensitivity analysis for defining string and geographic distance cutoffs.

Final business continuity analysis

Once we identify the matches, we create a final dataset that identifies businesses that “died” (ceased to exist), those that were “born” (newly established), and those that “survived” (continued to exist in a similar location and with a similar name) during the study period. This classification helps to understand the dynamics of business turnover. For the analysis of survival, we assign a unique reference number to each establishment after cleaning, and we identify 25,261 surviving establishments, 45,216 that ceased to exist, and 44,862 new establishments. We estimate the overall survival rate for food businesses from 2010 to 2020 to be approximately 36%. As shown in the Appendix, our method estimates a higher survival rate of 44% between 2015 and 2025. These results align with our expectations and extend findings from prior studies on patterns in small-retail turnover that consider shorter time periods than ours (Anselin and Farah, 2023; Coad, 2018).
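The classification and the overall survival rate can be illustrated with a short Python sketch. It assumes the matches are one-to-one pairs of identifiers and that the survival rate is the number of survivors divided by the number of 2010 establishments (survivors plus deaths); the identifiers in the toy example are hypothetical.

```python
def classify_establishments(ids_2010, ids_2020, matches):
    """Split identifiers into survived, died, and born.

    `matches` is a list of (id_2010, id_2020) pairs, assumed one-to-one.
    """
    matched_2010 = {a for a, _ in matches}
    matched_2020 = {b for _, b in matches}
    return {
        "survived": sorted(matched_2010),
        "died": sorted(set(ids_2010) - matched_2010),
        "born": sorted(set(ids_2020) - matched_2020),
    }


def survival_rate(n_survived, n_died):
    """Share of establishments from the first year that are matched in the second."""
    return n_survived / (n_survived + n_died)


status = classify_establishments(
    ["a1", "a2", "a3"], ["b1", "b2", "b3"], [("a1", "b2")]
)
overall = survival_rate(25_261, 45_216)  # counts reported in the text
```

Under this definition, the reported counts give a rate that rounds to 36%.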

We also group businesses into four categories: small neighborhood stores, specialized stores, supermarkets, and convenience stores. Table 1 shows a decline in the number of small neighborhood stores and an increase in the other types of establishments, particularly in convenience stores. The survival rates also differ by category, with 39% survival for small neighborhood stores, 31% for specialized stores, 48% for supermarkets, and 48% for convenience stores. With this dataset, future analysis can focus on retail spatial analysis to better understand the economic geography of the city.

Table 1.

Business category counts and percentage changes from 2010 to 2020.

Business category	2010 count	2020 count	% change	Births	Deaths	Survivals
Small neighborhood stores	38,119	34,746	−8.85	19,881	23,254	14,865
Specialized stores	30,700	31,765	+3.47	22,166	21,101	9,599
Supermarkets	247	451	+82.60	333	129	118
Convenience stores	1,411	3,161	+124.03	2,482	732	679
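The columns of Table 1 are related by simple arithmetic, which the following Python sketch makes explicit. It assumes that every 2010 establishment is either a death or a survival and that every 2020 establishment is either a birth or a survival, so the counts and percentage changes can be derived from the births, deaths, and survivals columns alone.

```python
# (births, deaths, survivals) per category, taken from Table 1.
TABLE_1 = {
    "Small neighborhood stores": (19_881, 23_254, 14_865),
    "Specialized stores": (22_166, 21_101, 9_599),
    "Supermarkets": (333, 129, 118),
    "Convenience stores": (2_482, 732, 679),
}


def derive(births, deaths, survivals):
    """Derive yearly counts, percentage change, and survival rate for one category."""
    count_2010 = deaths + survivals
    count_2020 = births + survivals
    return {
        "count_2010": count_2010,
        "count_2020": count_2020,
        "pct_change": 100 * (count_2020 - count_2010) / count_2010,
        "survival_pct": 100 * survivals / count_2010,
    }


derived = {name: derive(*row) for name, row in TABLE_1.items()}
total_survivals = sum(row[2] for row in TABLE_1.values())
```

Summing the survivals column across categories recovers the overall number of surviving establishments reported in the text.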

Conclusion

This article describes and builds upon the publicly available business data product which INEGI maintains. Our code matches food businesses across 2010 and 2020 and highlights the challenges of constructing longitudinal datasets from the DENUE database, particularly in terms of geographic inconsistencies and business name changes. Our method can be extended to any two time periods to aid researchers in longitudinal analysis of business trends; it is especially useful in keeping consistent definitions of business continuity before and after the introduction of official IDs. This process is important because longitudinal business data can deepen understanding of consumer behavior, economic vitality, business resilience, and social inequalities.

Beyond Mexico, our methodology offers a replicable framework for constructing longitudinal datasets from snapshot-based business registries common in many countries. Similar challenges exist in business datasets worldwide, from Points of Interest data to local business registries in Global South cities, where independent temporal snapshots lack persistent identifiers. Our string and spatial matching approach can be adapted to different languages, business naming conventions, and geographic contexts.

To improve the usability of DENUE for longitudinal research, we recommend evaluating the current business identifier system based on our analysis. DENUE introduced official IDs in 2015, which allowed matching across releases but did not provide consistent or guaranteed longitudinal continuity. It was only in 2021 that INEGI introduced the Statistical Business Key (CLEE, Clave Estadística Empresarial), a more comprehensive and permanent identifier system intended to follow businesses over time. Our validation of the 2015 IDs in the Appendix reveals limitations: 34% of businesses matched through official IDs show coordinate shifts exceeding 10 m between the 2015 and 2025 censuses, and some businesses receive new IDs despite relative spatial and name continuity. Similar issues persist even after the introduction of the CLEE between 2021 and 2025: 4% of establishments matched through CLEE and coded in the ID as not experiencing a geographic relocation are more than 100 m apart, raising questions about how continuity is operationalized within the system. Although the CLEE encodes information on changes to location, NAICS, and establishment size, details about these changes remain confidential and unclear for research purposes (INEGI, 2025). Thus, a robust approach to the analysis of business survival would incorporate the new CLEE system alongside thorough analysis of additional researcher-defined business survivals and deaths. Our methodology encourages careful examination of businesses that have changed location or name and is especially crucial when considering a longitudinal analysis that includes time periods before the introduction of CLEE identifiers.

Importantly, our validation shows that relying uncritically on official IDs can lead to misleading definitions of business survival, especially in spatial or firm demographic analyses. Our method recognizes that apparent shifts in location or name may reflect data inconsistencies rather than actual relocations or changes, but defines business continuity as maintaining relatively similar location and name. We encourage users to approach the dataset critically and consider these nuances when conducting longitudinal analyses. Our validation exercise demonstrates both the value of the current ID system and areas for improvement. Incorporating our method would enhance the reliability of longitudinal studies and expand their temporal scope, allowing for more precise analysis of business dynamics and supporting deeper insights into urban and economic development in Mexico City and beyond.

Footnotes

Acknowledgments

The authors thank the editor and anonymous reviewers for their constructive feedback, which improved the manuscript.

ORCID iDs

Irene Farah

Claire Kelling

Yihan Zhao

Ethical considerations

This study is compliant with ethical standards, as it was conducted using open access data from Mexico.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported by the National Science Foundation under Award Number 2316857.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The DENUE data product is available to download on the INEGI website here: https://www.inegi.org.mx/app/descarga/default.html. All data and code used to create the dataset can be found on GitHub at this repository: .

Yihan Zhao is now at the School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA, but was at Carleton College when the research was conducted. Xinyi Leng is now at the School of Data Science, University of Virginia, Charlottesville, VA, USA, but was also at Carleton College when the research was conducted.

Notes

Author biographies

Irene Farah is an Assistant Professor in Urban and Regional Planning at the University of Illinois Urbana-Champaign. She earned her Ph.D. in City and Regional Planning from UC Berkeley. Her research examines governance, informality, and the political economy of urban space, focusing on street vending and retail regulation in Latin America and the U.S. Using geospatial modeling and ethnography, she explores the relationship between multilevel state actors, informal workers, and urban policy. Previously, she worked as a Research Associate at the University of Chicago’s Center for Spatial Data Science, where she developed spatial accessibility metrics and studied firm dynamics.

Claire Kelling is an Assistant Professor of Statistics at Carleton College. She received her Dual PhD in Statistics and Social Data Analytics from Penn State. Her research engages statistics, sociology, and data science in order to study and develop statistical methods and analyses that inform evidence-based policy. Her focus is on the development of methods in spatial statistics for the analysis of policing data, driven by questions from and partnership with community members and organizations.

Xinyi Leng is a PhD student in Data Science at the University of Virginia. She holds a Bachelor’s degree in Mathematics and Statistics from Carleton College, with a background in data science, algorithmic fairness, and natural language processing. She has conducted research on spatial statistics, computational social science, and interpretable large language models. Xinyi has extensive experience in data analytics, machine learning, and digital humanities and is interested in applying computational methods to drive positive societal change.

Yihan Zhao is a Master’s candidate in Financial Engineering at Cornell University. He holds a Bachelor’s degree in Statistics from Carleton College. His academic and professional experiences center on data science and its applications in marketing, business, and finance. Beyond developing statistical models, he focuses on effectively communicating data-driven insights for broader accessibility and practical use. His previous research and projects include studying the impact of certain geographical features on urban property appraisals and developing predictive models for marketing analytics.

Appendix

References

Alonso (1960) A theory of the urban land market. Papers in Regional Science 6(1): 149–157. https://doi.org/10.1111/j.1435-5597.1960.tb01710.x

Anselin and Farah (2023) The life and death of retail: insights from firm demography. In: Kickert and Talen (eds) Streetlife: Urban Retail Dynamics and Prospects. University of Toronto Press, pp. 27–46.

Arribas-Bel, Green, Rowe, et al. (2021) Open data products-A framework for creating valuable analysis ready data. Journal of Geographical Systems 23(4): 497–514. https://doi.org/10.1007/s10109-021-00363-5

Berry BJL (1967) Geography of Market Centers and Retail Distribution. Prentice Hall.

Birch DGW (1987) Job Creation in America: How Our Smallest Companies Put the Most People to Work. New York: Free Press.

Birkin, Clarke, et al. (2003) Using spatial models to solve difficult retail location problems. In: Applied GIS and Spatial Analysis. John Wiley & Sons, Ltd, pp. 35–54. https://doi.org/10.1002/0470871334.ch3

Carroll and Hannan (2000) The Demography of Corporations and Industries. Princeton University Press.

Christaller (1966) Central Places in Southern Germany. Prentice Hall.

Coad (2018) Firm age: a survey. Journal of Evolutionary Economics 28: 13–43. https://doi.org/10.1007/s00191-016-0486-0

Comber, Arribas-Bel, Singleton, et al. (2020) Building hierarchies of retail centers using Bayesian multilevel models. Annals of the Association of American Geographers 110(4): 1150–1173. https://doi.org/10.1080/24694452.2019.1667219

Credit, Farah and Anselin (2023) The ups and downs of retail, 2000-2015. In: Kickert and Talen (eds) Streetlife: Urban Retail Dynamics and Prospects. University of Toronto Press, pp. 47–73.

Deener (2007) Commerce as the structure and symbol of neighborhood life: reshaping the meaning of community in Venice, California. City & Community 6(4): 291–314. https://doi.org/10.1111/j.1540-6040.2007.00229.x

Farah, Stern, Ramírez, et al. (2022) Food and beverage purchases at formal and informal outlets in Mexico. Public Health Nutrition 26: 1–28. https://doi.org/10.1017/S1368980022002324

Fotheringham, Brunsdon and Charlton (2000) Quantitative Geography: Perspectives on Spatial Data Analysis. Sage Publications.

Glaeser, Kolko and Saiz (2001) Consumer city. Journal of Economic Geography 1(1): 27–50. https://doi.org/10.1093/jeg/1.1.27

INEGI (2020) Directorio Estadístico Nacional de Unidades Económicas (DENUE). Microdatos. [Csv]. https://www.inegi.org.mx/contenidos/productos/prod_serv/contenidos/espanol/bvinegi/productos/nueva_estruc/702825194987.pdf

INEGI (2025) Directorio Estadístico Nacional de Unidades Económicas 2025, Información para la actualización e incorporación de unidades económicas al DENUE; datos a mayo de 2025. INEGI, Aguascalientes, Mexico. Available at: https://www.inegi.org.mx/rnm/index.php/catalog/1103

Isard (1956) Location and Space-Economy: A General Theory Relating to Industrial Location, Market Areas, Land Use, Trade, and Urban Structure. Cambridge, MA: MIT Press.

Jacobs (1961) The Death and Life of Great American Cities. Random House.

Lansley and Longley (2017) Opportunities for retail data and their geographic integration in social science. In: Big Data for Regional Science. Routledge.

Montgomery (1998) Making a city: urbanity, vitality and urban design. Journal of Urban Design 3(1): 93–116. https://doi.org/10.1080/13574809808724418

Neumark, Wall and Zhang (2011) Do small businesses create more jobs? New evidence for the United States from the national establishment time series. The Review of Economics and Statistics 93(1): 16–29. https://doi.org/10.1162/REST_a_00060

Pavlis, Dolega and Singleton (2018) A modified DBSCAN clustering method to estimate retail center extent. Geographical Analysis 50(2): 141–161. https://doi.org/10.1111/gean.12138

Reyes and Suárez (2022) Firm survival and location preference: evidence from Mexico city. Heliyon 8(4): 1–10. Available at: https://doi.org/10.1016/j.heliyon.2022.e09219

Singleton, Dolega, Riddlesden, et al. (2016) Measuring the spatial vulnerability of retail centres to online consumption through a framework of e-resilience. Geoforum 69: 5–18. https://doi.org/10.1016/j.geoforum.2015.11.013

Thurstain-Goodwin and Unwin (2000) Defining and delineating the central areas of towns for statistical monitoring using continuous surface representations. Transactions in GIS 4(4): 305–317. https://doi.org/10.1111/1467-9671.00058

van der Loo MPJ (2014) The stringdist package for approximate string matching. The R Journal 6(1): 111–122.

van Wissen LJG (2002) Demography of the firm: a useful metaphor? European Journal of Population / Revue Européenne de Démographie 18(3): 263–279. https://doi.org/10.1023/a:1019750727018

Wrigley and Dolega (2011) Resilience, fragility, and adaptation: new evidence on the performance of UK high streets during global economic crisis and its policy implications. Environment and Planning A: Economy and Space 43(10): 2337–2363. https://doi.org/10.1068/a44270

Zukin, Trujillo, Frase, et al. (2009) New retail capital and neighborhood change: boutiques and gentrification in New York City. City & Community 8(1): 47–64. https://doi.org/10.1111/j.1540-6040.2009.01269.x