Using Stata’s bysort command for panel data in time series analysis


Sorting information in panel data is crucial for time series analysis. For example, sorting by the time for time series analysis requires you to use the sort or bysort command to ensure that the panel is ordered correctly. However, when it comes to panel data where you may have to distinguish a patient located at two different sites or a patient with multiple events (e.g., deaths), it’s important to organize the data properly.

You can download the sample data and Stata code at the following links:





In this example, we have a data set with time (months) in the column and patients in the rows (this is called a wide format data set). For each month, there are different numbers of observations. For instance, in Month 1, there were 5 observations. But in Month 7 there were only three.

The highlighted boxes indicate a patient was observed at two different sites. There are two ways to approach this: (1) remove the patient from Site B or (2) keep the patient by distinguishing it at each sight. Removing the patient will result in a loss of information for Site B, but keeping the patient complicates the panel data when we convert from wide to long format.

Figure 1.png

Converting this from wide to long format would result in the following panel data. Review each patient, in particular, the months of observations reported for the months. Notice that not all patients have observations for all the months (Months 1 to 7). Some patients have observations for scattered months (e.g., Patient 3). Of note is Patient 2 who has observations at Sites A and B. Since we opted to keep Patient 2 data for Sites A and B, we need to distinguish a method to ensure that the panel data is ordered correctly. Interestingly, Patient 8 has an observed event  (Death) three times at Months 5, 6, and 7. Since a patient should experience death only once, this may be a coding error and should be removed. Using the Stata sort and bysort command will allow us to fix this problem.

Figure 2.png

The bysort command has the following syntax:

bysort varlist1 (varlist2): stata_cmd

Stata orders the data according to varlist1 and varlist2, but the stata_cmd only acts upon the values in varlist1. This is a handy way to make sure that your ordering involves multiple variables, but Stata will only perform the command on the first set of variables.



First, we want to make sure we eliminate the repeated deaths from Patient 8. We can do this using the bysort command and summing the values of Death. Since Death == 1, we can sum up the total Deaths a patient experiences and drop those values that are greater than 1—because a patient can only die once.

***** Identify patients with repated death events. 
bysort id site (month death): gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1 

The alternative methods use the sort command:

* Alternative Method 1:
by id site (month death), sort: gen byte repeat_deaths = sum(death==1)
drop if repeat_deaths > 1

* Alternative Method 2:
sort id site (month death)
by id site (month death): gen byte repeat_deaths = sum(death ==1)
drop if repeat_deaths > 1
Figure 3.png

Now we have a data set without the unnecessary death values for Patient 8. Therefore, Patient 8 will not be counted in months 6 and 7 because they are no longer contributing to the denominator.



Suppose we want to perform a single group time series analysis. We would want to sum up the number of deaths across the months. We can do this using the bysort command.

First, we have to think about how we want to count death. Since Death == 1, we want to add up the number of Death for each month. Initially, we were worried that Death would be counted two more times for Patient 8, but we solved this problem by removing these events from Patient 8.

Figure 4.png

The following command will yield the above results in a long format.

bysort month: egen byte total_deaths = total(death)

We use the egen command because we are using a more complex function. Detailers on when to use gen versus the egen commands are located at this site.



Next, we want to determine that number of patient observations that are contributed to each month. To do this, we can use the bysort command again.

***** Determine the denominator -- using bysort and counter variable
gen counter = 1
bysort month: egen byte total_obs = total(counter)

This should yield the following results:

Figure 5.png


Currently, the data is set up using the patient-level. We want to change this to the single-group level or the aggregate monthly level. To do this, we have to eliminate the repeated month measurements for our total deaths (numerator) and total observations (denominator).

***** Drop duplicate months
bysort month: gen dup = cond(_N==1, 0, _n)
drop if dup > 1

We can visualize this by plotting two separated lines connected at the values for each month.

****** Plot the total number of deaths and total number of observations
graph twoway (connected total_deaths month, lcol(navy)) ///
             (connected total_obs month, lcol(cranberry) ytitle("Number") ///
	      xtitle("Months") ylab(, nogrid) graphregion(color(white)))
Figure 6.png

We can take this a step further and calculate the prevalence.

***** Estimate the prevalence (per 100 population) and plot
gen prev = (total_deaths / total_obs ) * 100	

graph twoway connected prev month, ytitle("Prevalence of Death" "per 100 population") ///
	     xtitle("Months") ylab(, nogrid) graphregion(color(white))
Figure 7.png


Using the bysort command can help us fix a variety of data issues with time series analysis. In this example, we have patient-level data that contained deaths for one patient and a patient who was observed at different sites. Using the bysort command to distinguish between sites allowed us to properly identify the patient as unique to the site. Additionally, we used the bysort to identify the patient with multiple deaths and eliminated these values from the aggregate monthly values. Then we finalized out single-group data set by summing the total deaths and observations per month and removing the duplicates.

You can download the Stata code from my Github site.



I used the following references to write this blog.

Stata commands: bysort:


Stata commands: gen versus egen:

Counting and Data Manipulation for an ITSA


In time series analysis, we are interested on the impact of some exposure over a time period. Exposure can be coded as event==1. If this is time-varying, then the event can occur at any time across a time period. Time series analysis requires us to identify the time when the event first occurred. In most cases this is also considered the post period. In this example, we will label the exposure of interest as event.

Longitudinal data can come in either a wide or long format. However, it is easier to perform longitudinal data analysis in the long format. This assumes that you declare either the xtset or tsset as a panel or time-series data set, respectively.


Let’s assume that we have two subjects (A and B), who can experience an event at any time between some time variable 1 and 5, time(1:5). This is a longitudinal data set in the long format with id as the unique subject-level identifier, the exposure variable of interest event as the exposure, and time as an arbitrary time variable ranging from 1 to 5. The event for subject A occurs at time==3.

Screen Shot 2018-02-18 at 3.13.23 PM.png

Suppose you want to create a variable that counts the number of times the subject has the event. We will call this variable duration.

Screen Shot 2018-02-18 at 3.14.35 PM.png

The following Stata code will generate the duration variable.

by id (time), sort: gen byte duration = sum(event==1)

Sorting by the id and then time will nest the time sequence for each id. The sum() will add all event that is coded as 1.

It’s critical that you put time in parentheses (); otherwise, you can generate incorrect values. For instance, if you make the mistake of typing the Stata code as follows, you will generate a dataset which doesn’t provide the cumulative duration of having the event. Notice how the duration variable only has 1 instead of 1, 2, and 3.

by id time, sort: gen byte duration = sum(event==1)
Screen Shot 2018-02-18 at 3.17.26 PM.png

Similarly, if you use the following code, you will generate the incorrect values. The sum(event)==1 syntax should be sum(event==1). However, this will “flag” the time when the event first occurred, which may be useful in some cases. 

by id (time), sort: gen byte duration = sum(event)==1
Screen Shot 2018-02-18 at 3.18.48 PM.png

Let’s take our example further and generate a variable column that takes into consideration the period before the subject experiences an event. Suppose subject A experiences an event at time==3, but we want to center this as 0 and previous months as -1, -2, and so on. We need to first identify the time when the event occurs and populate that as a new variable, which we will call firstevent. We can use the following State code to generate firstevent based on the condition that the event==1 and the variable it occurs is time==3.

egen firstevent = min(cond(event == 1, time, .)), by(id)
Screen Shot 2018-02-18 at 3.19.41 PM.png

There will be missing values since not all subjects experience the event. To populate the missing values for the subjects with no events (event==0), we need to replace firstevent by identifying the max time of the entire study period using the summary command. Once we have the max time of the study period, we add 1 to this and replace the missing values from the firstevent variable.

replace firstevent = max(time) if firstevent == .
summ time
global maxtime = r(max)

replace firstevent = $maxtime + 1 if firstevent == .
Screen Shot 2018-02-18 at 3.20.31 PM.png

We can subtract the time from the first event to generate a new variable (its) that will capture the negative time before the event occurs and the positive time after the event occurs, centered on when event==1.

by id (time), sort: gen byte its = _n – firstevent
Screen Shot 2018-02-18 at 3.21.29 PM.png

The new variable its is short for interrupted time series analysis. An investigator can use the its variable to plan any interrupted times series analysis without having to go through the ordeal of generating this variable using other software.

Here is a summary of the entire Stata code, which you can also find on my Github page:

***** Declare XTSET panel dataset.
* Variable list: id event time 
* id        =   subject identifier
* event     =   exposure of interest
* time      =   time interval

**** Create the duration variable to capture time after event.
by id (time), sort: gen byte duration = sum(event==1)

**** Create a variable for the time before the event.
egen firstevent = min(cond(event == 1, time, .)), by(id)

**** Identify the maxtime.
summ time
global maxtime = r(max)

**** Replace missing data with the maxtime + 1.
replace firstevent = $maxtime + 1 if firstevent == .

**** Create its to capture time before and after event. 
by id (time), sort: gen byte its = _n - firstevent


I used several online references to develop this tutorial for Stata. Nicholas J. Cox has some excellent tutorials that was influential in developing this piece.

Cox N. First and last occurrences in panel data. From

The Statlist forum was also helpful; in particular, the following discussion threads.

The UCLA Institute for Digital Research and Education has a tutorial on using _N and _n to count in Stata.

Counting from _N to _N. UCLA: Statistical Consulting Group. From