1 Introduction
The Stata command xtset (see [XT] xtset) is a prerequisite for the xt suite of commands, which was developed for datasets that have both a cross-sectional (N) and a time-series (T) dimension, that is, panels (Cameron and Trivedi 2005, 2022; Wooldridge 2020).
A panel dataset can be xtset in five ways. One of them allows the panel dataset to be xtset via the panelvar only:
Syntax 1
xtset panelvar
The code above tells Stata that the dataset is composed of panels, but the order of the observations belonging to each panel is irrelevant. The remaining four ways to xtset the panel dataset require a timevar too, with or without some additional options, to tell Stata how frequently observations are collected (for example, every two years):
Syntax 2
xtset panelvar timevar [, tsoptions]
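As a minimal sketch of the two syntaxes, assuming the nlswork example dataset that ships with Stata (panel identifier idcode, yearly time variable year), the declarations might look as follows. The dataset and variable names are illustrative choices, not part of the syntax diagrams above.

```stata
* Sketch only: nlswork, idcode, and year are illustrative assumptions
webuse nlswork, clear

* Syntax 1: declare the panel identifier only
xtset idcode

* Syntax 2: declare the panel identifier and the time variable
xtset idcode year

* Syntax 2 with an option: observations are one time unit apart
xtset idcode year, delta(1)
```

Typing xtset without arguments afterward displays the current settings.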
2 Why does error r(451) occur?
Since 2014, the Stata forum has recorded 500 queries (keyword: “repeated time values within panel”; last checked July 6, 2022) concerning the error r(451), the description of which can be accessed by typing the following from within Stata:
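One way to do this, offered as an assumption about the intended command rather than a quotation of it, is search with the rc keyword:

```stata
* Look up the documentation entry for return code 451
search rc 451
```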
Often, the error r(451) occurs because at least one panel in the dataset has two or more observations that share the same date, because the dates are not detailed enough to tell these observations apart, or both.
In the following toy example, xtreg, fe (see [XT] xtreg) is fit to a short panel dataset (N > T) composed of six subsidiaries of the Bank of Alfa that settle their mutual transactions in foreign currencies (values in $2021) at fixed time slots during the first two weeks of November 2021 (table 1):
Legend: daily_inc = daily income of the bank subsidiary; op_type = operation type; op_amnt = operation amount.
Transactions are registered via two releases of the same software: an old-fashioned release that accounts only for day/month/year (eventdate) and an updated release that also registers hour/minute/second for each transaction (eventdate2).
As expected, with the old-fashioned release, Stata warns about repeated dates:
The error r(451) occurs because eventdate shows calendar ties that make it impossible for Stata to sort the dates unambiguously.
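The calendar ties can be located before calling xtset. The following is a minimal sketch on made-up data: the panel identifier bank_id, the string variable sdate, and all values are hypothetical, and only eventdate and op_amnt follow the variable names used in this tip.

```stata
* Hypothetical toy data: bank_id, sdate, and all values are made up
clear
input byte bank_id str10 sdate double op_amnt
1 "01nov2021" 100
1 "01nov2021" 250
1 "02nov2021"  80
2 "01nov2021"  40
2 "02nov2021"  60
end

* Daily date without time of day
generate eventdate = date(sdate, "DMY")
format eventdate %td

* Count and flag observations sharing the same panel and date
duplicates report bank_id eventdate
duplicates tag bank_id eventdate, generate(tie)
list bank_id eventdate op_amnt if tie > 0

* xtset with the daily date is expected to stop with r(451) here
capture noisily xtset bank_id eventdate
display _rc
```

In this sketch, subsidiary 1 has two transactions on 01nov2021, which is the kind of tie that a day-level timevar cannot resolve.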
Conversely, the updated software release resolves the calendar ties via a more detailed timevar (eventdate2), and consequently Stata does not issue the error message r(451) (table 2):
Legend: op_amnt = operation amount.
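A date-and-time timevar of this kind can be built with clock(). The sketch below again uses made-up data; bank_id, stamp, and all values are hypothetical, and only eventdate2 and op_amnt follow the names used in this tip.

```stata
* Hypothetical toy data: bank_id, stamp, and all values are made up
clear
input byte bank_id str18 stamp double op_amnt
1 "01nov2021 09:00:00" 100
1 "01nov2021 15:30:00" 250
1 "02nov2021 09:00:00"  80
2 "01nov2021 09:00:00"  40
2 "02nov2021 09:00:00"  60
end

* Datetime values must be stored as doubles to avoid rounding
generate double eventdate2 = clock(stamp, "DMYhms")
format eventdate2 %tc

* Check that panel and datetime jointly identify the observations
isid bank_id eventdate2

* No repeated time values within panel, so xtset goes through
xtset bank_id eventdate2
```

With a %tc variable, time-series operators count in units of delta(), so a suitable delta() would need to be chosen before lags are meaningful.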
When Stata throws the error r(451), the usual fix is to xtset the dataset as in syntax 1. However, this fix comes at the cost of making time-series operators (such as lags and leads) unavailable because they require observations within each panel to be ordered according to timevar. Therefore, if time-series operators must be included in the regression equation, the dataset should be xtset as in syntax 2.
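The trade-off can be seen in a short sketch, assuming the nlswork example dataset (idcode, year, and ln_wage are illustrative choices):

```stata
* Sketch only: nlswork and its variables are illustrative assumptions
webuse nlswork, clear

* Syntax 1: no timevar, so the lag operator is expected to fail
xtset idcode
capture noisily generate lag_wage = L.ln_wage

* Syntax 2: with a timevar, the lag operator becomes available
xtset idcode year
capture drop lag_wage
generate lag_wage = L.ln_wage
```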
3 Can timevar still be used as a predictor after error r(451)?
Provided that no variable is differenced, lagged, or led, running xtreg, fe as in syntax 1 is perfectly appropriate. It also allows the timevar to be plugged in as a categorical predictor in the regression equation despite the error r(451) (table 3):
Legend: op_amnt = operation amount
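A minimal sketch of this specification on simulated data follows. The identifier bank_id, the counter k, the seed, and the data-generating process are hypothetical; daily_inc, op_amnt, and eventdate follow the names used in this tip.

```stata
* Simulated toy data: bank_id, k, and all values are hypothetical
clear
set seed 12345
set obs 6
generate byte bank_id = _n
expand 8                               // 8 transactions per subsidiary
bysort bank_id: generate byte k = _n

* Two transactions per day create calendar ties within each subsidiary
generate eventdate = mdy(11, 1, 2021) + ceil(k/2) - 1
format eventdate %td

generate double op_amnt = 100 + 50*runiform()
generate double daily_inc = 10 + 0.3*op_amnt + rnormal()

* Syntax 1 avoids r(451); the date enters as a categorical predictor
xtset bank_id
xtreg daily_inc op_amnt i.eventdate, fe
```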
4 Switching from xtreg, fe to areg when xtset returns error r(451): A good idea?
A tempting work-around for the error r(451) is switching from xtreg, fe to areg (see [R] areg) because the latter does not require xtset.
Unfortunately, this is not a good idea even in the absence of error r(451), because of the consequences for cluster–robust SE calculation (Cameron and Miller 2015).
Let’s expand on this issue using a well-known Stata dataset (table 4):
xtreg, fe and areg produce identical point estimates but different cluster–robust estimates of the variance matrix (Cameron and Miller 2015), because they make different assumptions about whether the number of panels increases with the sample size. While xtreg, fe returns the correct cluster–robust estimates of the variance matrix, areg does not, because it uses the wrong degrees-of-freedom correction (Cameron and Miller 2015). This difference, which is particularly apparent when the number of observations per cluster is small, does not arise with default SEs.
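The comparison can be sketched as follows, assuming the nlswork example dataset and an illustrative specification; neither is necessarily the one behind table 4.

```stata
* Sketch only: dataset and regressors are illustrative assumptions
webuse nlswork, clear
xtset idcode

* Fixed-effects estimator with SEs clustered on the panel identifier
xtreg ln_wage c.age##c.age tenure, fe vce(cluster idcode)
estimates store fe_xt

* Same model via areg, absorbing the panel identifier
areg ln_wage c.age##c.age tenure, absorb(idcode) vce(cluster idcode)
estimates store fe_areg

* Coefficients and SEs side by side
estimates table fe_xt fe_areg, b(%9.4f) se(%9.4f)
```

Rerunning both commands without vce(cluster idcode) allows the default SEs to be compared in the same way.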
5 Leaving out timevar and exploiting the xt commands’ capabilities: The case of xtgee
The Stata command xtgee (see [XT] xtgee) fits both linear and nonlinear population-averaged panel-data models via generalized estimating equations (Hardin and Hilbe 2013). Being as flexible as generalized linear models (Deb, Norton, and Manning 2017; Hardin and Hilbe 2018), xtgee allows different within-panel correlation structures (via the corr() option), various link functions that relate the outcome to the linear index function in the right-hand side of the regression equation (via the link() option), and a set of theoretical probability distributions from which the regressand is generated (via the family() option). xtgee, which does not need a timevar, is asymptotically equivalent to xtreg, re and xtreg, mle (table 5).
When panel datasets are balanced, xtgee and xtreg, mle produce identical results. This equivalence does not hold when panels are unbalanced, because these two Stata commands deal with lack of panel balance differently.
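A minimal sketch of the three estimators, assuming the nlswork example dataset and an illustrative set of regressors (not necessarily those behind table 5):

```stata
* Sketch only: dataset and regressors are illustrative assumptions
webuse nlswork, clear

* A panel identifier is enough for an exchangeable correlation structure
xtset idcode

* Population-averaged linear model via GEE
xtgee ln_wage age tenure, family(gaussian) link(identity) corr(exchangeable)

* Random-effects counterparts: GLS and maximum likelihood
xtreg ln_wage age tenure, re
xtreg ln_wage age tenure, mle
```

Because nlswork is an unbalanced panel, the xtgee and xtreg, mle results in this sketch should not be expected to coincide exactly.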
6 Repeated cross-sectional studies and xt commands
In repeated cross-sectional (RCS) studies, a different sample of units per wave is measured on the same set of variables at a defined time point, as in a survey (Lebo and Weber 2015).
Provided that the regressand is continuous, RCS studies are composed of multiple waves of data to be appended (see [D] append) before running regress (see [R] regress).
According to the characteristics above, RCS studies fall outside the xtset framework.
However, their analysis can benefit from some of the xt commands that are frequently used to study panel datasets before running xt-related regressions.
A series of one-year RCS data was created by slightly tweaking the nlswork.dta file:
The RCS dataset has been xtset with panelvar only to summarize the continuous variables (table 6):
As expected, the overall and between standard deviations overlap (because N = n), whereas the within one is zero (because T = 1).
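One way to mimic such a dataset, offered as an assumption rather than as the tweak actually applied to nlswork.dta, is to treat every observation as a distinct unit through a new identifier (rcs_id is a hypothetical name):

```stata
* Sketch only: rcs_id and the chosen variables are assumptions
webuse nlswork, clear

* Each observation becomes its own unit, so every "panel" has T = 1
generate long rcs_id = _n
xtset rcs_id

* Overall, between, and within variation of continuous variables
xtsum ln_wage age tenure

* Plain summary statistics for comparison
summarize ln_wage age tenure
```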
The xtsum outcome table mirrors the wide range of the continuous variables.
In addition, RCS datasets allow time-fixed effects to account for variations over time (table 7):
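Time-fixed effects of this kind can be sketched as wave indicators in a pooled regression, assuming the nlswork variables and an illustrative specification (not necessarily the one behind table 7):

```stata
* Sketch only: dataset and specification are illustrative assumptions
webuse nlswork, clear

* Pooled OLS with one indicator per wave as time-fixed effects
regress ln_wage c.age##c.age tenure i.year, vce(robust)

* Joint test of the wave indicators
testparm i.year
```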
7 Conclusion
This tip started from the evidence of frequent complaints about the error r(451) posted on the Stata forum and then expanded to other xt-related issues.
xtset has two dimensions to be addressed: the cross-sectional one (panelvar), which is mandatory because it tells Stata that the researcher is dealing with a panel dataset, and an optional one, that is, the time-series dimension (timevar).
Therefore, how to xtset the panel dataset is strictly related to the study goals.
Unlike xtabond (see [XT] xtabond), which requires a time variable, various panel-data commands provide useful information without one; for example, xtsum for RCS studies gives the researcher more information on standard deviations than summarize.