Econometrics

Staggered difference-in-differences using R

I was interested in learning how to apply the Callaway & Sant'Anna staggered difference-in-differences framework to my work. After reading several papers and watching the video by Sant'Anna, I wrote a short tutorial on how to apply this framework to simulated data. The tutorial is located on my RPubs site.

The tutorial uses the R “did” package, which is based on the paper by Callaway & Sant’Anna.
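For readers who want a quick sense of the workflow, here is a minimal sketch of the did package's main calls, using hypothetical variable names (y, period, id, first_treat) for a simulated panel:

library(did)

## att_gt() estimates the group-time average treatment effects, ATT(g, t)
out <- att_gt(yname = "y",            # outcome variable
              tname = "period",       # time period variable
              idname = "id",          # unit identifier
              gname = "first_treat",  # period of first treatment (0 if never treated)
              data = sim_data,        # hypothetical simulated panel
              control_group = "nevertreated")

## Aggregate the ATT(g, t) estimates into an event-study (dynamic) summary
es <- aggte(out, type = "dynamic")
summary(es)
ggdid(es)  # plot the event-study estimates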

Mediation analysis using R

It’s not uncommon to see covariates in a regression model that should not be there. For example, measurements that occur after treatment assignment are sometimes included in a regression model as baseline covariates. Instead, one should consider a mediation analysis.

I wrote a tutorial on how to perform mediation analysis using R on my RPubs site (link).

I know that I make this mistake at times. This tutorial helped me to carefully consider which covariates to include in a regression model and which ones to consider for mediation analysis.
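As a rough sketch of the approach the tutorial covers, a mediation analysis with the mediation package fits a mediator model and an outcome model and then combines them. The variable names here (treat, med, y, age, df) are hypothetical:

library(mediation)

## Mediator model: does the treatment affect the mediator?
med.fit <- lm(med ~ treat + age, data = df)

## Outcome model: outcome as a function of treatment and mediator
out.fit <- lm(y ~ treat + med + age, data = df)

## Estimate the average causal mediation effect (ACME),
## the average direct effect (ADE), and the total effect
med.out <- mediate(med.fit, out.fit, treat = "treat", mediator = "med",
                   sims = 1000)
summary(med.out)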

MEPS tutorial on interrupted time series analysis in R

I wrote a short tutorial on how to perform an interrupted time series analysis (ITSA) in R. I had a challenging time working on this because I wasn’t familiar with all the nuances of the ITSA. More importantly, I wasn’t able to leverage my Stata skills to do this in R. I’m used to the Stata margins command, which is great for creating contrasts. R has its own version of the margins command, but it lacks some of Stata’s features such as pwcompare, which I use a lot in Stata. However, I found a workaround with linear splines, and I have uploaded this to my RPubs site (link). I hope you find this useful. I also saved my R Markdown code on my GitHub site (link).
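Here is a minimal sketch of the linear-spline workaround, assuming a hypothetical data frame df with a time variable and an intervention at time point 24:

## Indicator for the post-intervention period
df$post <- as.numeric(df$time >= 24)

## Linear spline: time elapsed since the intervention (0 before it)
df$since <- pmax(0, df$time - 24)

## Segmented regression:
##   time  = pre-intervention slope
##   post  = immediate level change at the index date
##   since = change in slope after the intervention
itsa.lm <- lm(y ~ time + post + since, data = df)
summary(itsa.lm)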

MEPS tutorials on linkage files and trend analysis

I created two MEPS tutorials recently. The first covers the use of condition-event linkage files to capture disease-specific costs, using migraine as a motivating example. In this tutorial, I go through the steps to identify migraine-related costs associated with office-based visits and inpatient night stays. In the second tutorial, I review how to perform a simple trend analysis with linear regression models. I pooled MEPS data from 2016 to 2021 and applied the appropriate primary sampling units and strata from the pooled file.

The first tutorial is located on my RPubs page (MEPS Tutorial 4 - Using condition-event link (CLNK) file: A case study with migraine). The R Markdown code to create the tutorial is located in my GitHub repository (link).

The second tutorial is also located on my Rpubs page (MEPS Tutorial 5 - Simple Trend Analysis with Linear Models). The R Markdown code to create the tutorial is located in my GitHub repository (link).
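For context, pooling MEPS years and applying the pooled strata and primary sampling units in R typically runs through the survey package. This is a hedged sketch with a hypothetical pooled data frame; the VARSTR, VARPSU, and POOLWT names follow the MEPS pooled linkage file conventions:

library(survey)

## Avoid errors from strata with a single PSU in pooled data
options(survey.lonely.psu = "adjust")

meps.design <- svydesign(ids = ~VARPSU,     # primary sampling units
                         strata = ~VARSTR,  # variance estimation strata
                         weights = ~POOLWT, # pooled person-level weight
                         data = pooled,     # hypothetical 2016-2021 pooled file
                         nest = TRUE)

## Linear trend in total expenditures across years
summary(svyglm(TOTEXP ~ YEAR, design = meps.design))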

Interrupted time series analysis (ITSA) with Stata

Interrupted time series analysis (ITSA) is a study design used to study the effects of an intervention across time. An important feature of the ITSA is the time when the intervention occurs. The periods before and after the intervention are of interest because we want to visualize whether the trends are similar or different. Additionally, we want to visualize the change immediately after the intervention is implemented. I call this time point the index date.

In this article, I’ll review the single-group ITSA and the multiple-group ITSA. Then I’ll review how to perform an ITSA in Stata.

You can view the complete tutorial on my RPubs site.

Tweedie GLM in R for Cost Data

I wrote a tutorial on using the Tweedie distribution in a GLM for cost data in R. Unlike Stata, R is very particular about zeroes when constructing GLMs. Hence, I opted for the Tweedie distribution, which lets me mix and match the link function with the Gamma distribution. I settled on the identity link because it doesn’t involve retransformation and is easy to interpret.
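A minimal sketch of what this looks like with the statmod package's tweedie() family, assuming hypothetical cost data and an illustrative variance power:

library(statmod)  # provides the tweedie() family for glm()

## var.power between 1 and 2 mixes a point mass at zero with a
## gamma-like right tail; link.power = 1 requests the identity link
fit <- glm(cost ~ age + sex,
           data = df,  # hypothetical cost data
           family = tweedie(var.power = 1.5, link.power = 1))
summary(fit)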

My tutorial is available on my RPubs site and GitHub site.

Two-part models in R - Application with cost data

I created a tutorial on how to use two-part models in R for cost data. I used the healthcare expenditures from the Medical Expenditure Panel Survey in 2017 as a motivating example. Normally, I use Stata when I construct two-part models, but I wanted to learn how I could do this in R. Fortunately, R has a package called twopartm that was developed by Duan and colleagues. You can find their documentation for the twopartm package here.
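As a hedged sketch of the package's interface (the argument names follow my reading of the twopartm documentation; the data frame and variables are hypothetical):

library(twopartm)

## Part 1 (logit): probability of any expenditure;
## Part 2 (gamma GLM with log link): level of positive expenditures
fit <- tpm(totexp ~ age + sex,
           data = meps,  # hypothetical MEPS 2017 extract
           link_part1 = "logit",
           family_part2 = Gamma(link = "log"))
summary(fit)

## Combined predictions: Pr(totexp > 0) * E[totexp | totexp > 0]
head(predict(fit))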

The tutorial I created is located on my GitHub page and RPubs page.

Cobb-Douglas production function and costs minimization problem

Update 2: This article was updated on 12 August 2023 when Dimanjan Dahal (Twitter account) identified a better way to present the Lagrangian functions. I updated this to better reflect the minimization problem and set the partial derivative solution to 0. Thank you, Dimanjan.

Update 1: This article was updated on 11 October 2021 when an anonymous reader identified an error with the example used at the end. The error was the negative value generated for the output elasticity of capital. In the previous example, I used R to generate a set of random numbers that were used in a regression model. The beta coefficient generated a negative value, which was used in the linear form of the Cobb-Douglas equation. Since the output elasticity should be between 0 and 1, this negative coefficient should not be possible. Hence, I’ve updated the data frame used in the example to avoid this issue. Appreciation goes out to the anonymous reader who identified this error.

INTRODUCTION

The Cobb-Douglas (CD) production function is an economic production function with two or more variables (inputs) that describes the output of a firm. Typical inputs include labor (L) and capital (K). A similar functional form is used to describe utility maximization, U(x). However, in this example, we will learn how to answer a minimization problem subject to (s.t.) the CD production function as a constraint.

The functional form of the CD production function:

 
$$Y = A L^{\alpha} K^{\beta}$$
 

where the output Y is a function of labor (L) and capital (K); A is the total factor productivity, which is a constant; alpha represents the output elasticity of labor; beta represents the output elasticity of capital; and alpha + beta = 1 represents constant returns to scale (CRS), since scaling both inputs by a factor c scales output by c^(alpha + beta) = c. The partial derivative of the CD function with respect to (w.r.t.) labor (L) is:

 
$$\frac{\partial Y}{\partial L} = \alpha A L^{\alpha - 1} K^{\beta} = \alpha \frac{Y}{L}$$
 

Recall that the quantity produced (Y) is based on the labor and capital; therefore, we can solve for alpha:

 
$$\alpha = \frac{\partial Y}{\partial L} \cdot \frac{L}{Y}$$
 

The partial derivative itself is the marginal product of labor (L), and alpha is the output elasticity of labor. If alpha = 2, then a 10% increase in labor (L) will result in a 20% increase in output (Y).

The partial derivative of the CD function with respect to (w.r.t.) capital (K) is:

 
$$\frac{\partial Y}{\partial K} = \beta A L^{\alpha} K^{\beta - 1} = \beta \frac{Y}{K}$$
 

This will yield the marginal product of capital (K).

The CD production function can be converted to a linear model by taking the logarithm of both sides of the equation:

 
$$\ln Y = \ln A + \alpha \ln L + \beta \ln K$$
 

This allows for OLS regression methods, which are commonly used in economics to understand the association between the inputs (L and K) and production (Y).

However, what happens when we are interested in the marginal cost with respect to (w.r.t.) production (Y)? This becomes a constrained (cost) minimization problem where the firm can control how much L and K it will use. In other words, we want to minimize the cost subject to (s.t.) the output:

 
$$\min_{L,\,K} \; C = wL + rK \quad \text{s.t.} \quad Y = A L^{\alpha} K^{\beta}$$
 

Cost becomes a function of the wage (w), the amount of labor (L), the price of capital (r), and the amount of capital (K). To determine the optimal amounts of the inputs (L and K), we solve this constrained minimization problem using the Lagrange multiplier method:

$$\mathcal{L}(L, K, \lambda) = wL + rK + \lambda \left( Y - A L^{\alpha} K^{\beta} \right)$$

Setting the partial derivatives of the Lagrangian to 0:

$$\frac{\partial \mathcal{L}}{\partial L} = w - \lambda \alpha A L^{\alpha - 1} K^{\beta} = 0 \qquad \frac{\partial \mathcal{L}}{\partial K} = r - \lambda \beta A L^{\alpha} K^{\beta - 1} = 0$$

Dividing the first condition by the second eliminates lambda and yields the tangency condition:

$$\frac{w}{r} = \frac{\alpha K}{\beta L}$$

Solve for L

 
$$L = \frac{\alpha r}{\beta w} K$$
 

Substitute L in the constraint term (CD production function) in order to solve for K

 
$$K = \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} \left( \frac{\beta w}{\alpha r} \right)^{\frac{\alpha}{\alpha + \beta}}$$
 

Now, we can completely solve for L (as a function of Y, A, w, and r) by substituting for K

 
$$L = \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} \left( \frac{\alpha r}{\beta w} \right)^{\frac{\beta}{\alpha + \beta}}$$
 

Substitute L and K into the cost minimization problem

 
$$C = w \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} \left( \frac{\alpha r}{\beta w} \right)^{\frac{\beta}{\alpha + \beta}} + r \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} \left( \frac{\beta w}{\alpha r} \right)^{\frac{\alpha}{\alpha + \beta}}$$
 

Simplify

 
$$C = \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} \left[ w^{\frac{\alpha}{\alpha + \beta}} r^{\frac{\beta}{\alpha + \beta}} \left( \frac{\alpha}{\beta} \right)^{\frac{\beta}{\alpha + \beta}} + w^{\frac{\alpha}{\alpha + \beta}} r^{\frac{\beta}{\alpha + \beta}} \left( \frac{\beta}{\alpha} \right)^{\frac{\alpha}{\alpha + \beta}} \right]$$
 

Final cost function

 
$$C(Y) = \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} w^{\frac{\alpha}{\alpha + \beta}} r^{\frac{\beta}{\alpha + \beta}} \left[ \left( \frac{\alpha}{\beta} \right)^{\frac{\beta}{\alpha + \beta}} + \left( \frac{\beta}{\alpha} \right)^{\frac{\alpha}{\alpha + \beta}} \right]$$
 

Let’s see how we can use the results from a regression model to give us information about the total costs w.r.t. the quantity produced.

Recall the linear form of the Cobb-Douglas production function:

 
$$\ln Y = \ln A + \alpha \ln L + \beta \ln K$$
 

I simulated some data in R containing the capital, labor, and quantity produced.

## Use the following libraries:
library(jtools)
library(broom)
library(ggstance)
library(broom.mixed)

## Generate random data for the data frame (cddata)
set.seed(1234)
production <- sample(100:600, 30, replace = TRUE)
labor <- sample(50:350, 30, replace = TRUE)
capital <- sample(6000:7000, 30, replace = TRUE)

## Cost function parameters: wage and price constants
wage <- 35.00
price <- 30.00

## Set up the data frame (cddata):
cddata <- data.frame(production = production, labor = labor,
                     capital = capital, wage = wage, price = price)

## Name rows using a timeline from 1988 to 2017 (30 years for 30 observations):
row.names(cddata) <- 1988:2017

Then I fit the regression model using OLS:

## Set up the model; log(A) enters as the intercept.
cd.lm <- lm(formula = log(production) ~ log(labor) + log(capital), data = cddata)
summary(cd.lm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96586 -0.25176  0.06148  0.37513  0.67433 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.44637   17.41733   0.255    0.800
log(labor)    0.14373    0.23595   0.609    0.548
log(capital)  0.05581    2.00672   0.028    0.978

Residual standard error: 0.5065 on 27 degrees of freedom
Multiple R-squared: 0.01414,	Adjusted R-squared: -0.05888
F-statistic: 0.1937 on 2 and 27 DF, p-value: 0.8251

After running the model, I stored the coefficients for use later in the production function.

## Store the coefficients
coeff <- coef(cd.lm)

## Assign the values to the production function parameters where Y = AL^(alpha)K^(beta)
intercept <- coeff[1]
alpha <- coeff[2]
beta <- coeff[3]

From the parameters, we can get log(A) (the intercept), alpha (the coefficient on log(labor)), and beta (the coefficient on log(capital)). Because the model is fit on the log scale, A itself is exp(intercept).

 
$$\ln \hat{Y} = 4.44637 + 0.14373 \ln L + 0.05581 \ln K$$
 

This will give us the quantity produced (Y) for given data on labor (L) and capital (K).

We can get the total costs (C) based on the quantity produced (Y) using the cost function:

 
$$C(Y) = \left( \frac{Y}{A} \right)^{\frac{1}{\alpha + \beta}} w^{\frac{\alpha}{\alpha + \beta}} r^{\frac{\beta}{\alpha + \beta}} \left[ \left( \frac{\alpha}{\beta} \right)^{\frac{\beta}{\alpha + \beta}} + \left( \frac{\beta}{\alpha} \right)^{\frac{\alpha}{\alpha + \beta}} \right]$$
 

I set up my R code so that I have the intercept, alpha, beta, labor, wage, and price of capital defined. I estimated each part of the cost function separately and then multiplied the parts at the end.

## Cost: each part corresponds to a factor of the final cost function.
## The regression intercept is log(A), so A = exp(intercept).
PartA <- (production / exp(intercept))^(1 / (alpha + beta))
PartB <- wage^(alpha / (alpha + beta))
PartC <- price^(beta / (alpha + beta))
## as.complex() guards against negative bases raised to fractional exponents
PartD <- as.complex(alpha / beta)^(beta / (alpha + beta)) + as.complex(beta / alpha)^(alpha / (alpha + beta))

costs <- PartA * PartB * PartC * PartD
Note: In R, raising a negative real number to a fractional exponent returns “NaN.” If one of the estimated coefficients turns out negative, a term like x^(alpha) will fail. The workaround is to coerce the base with as.complex(x), which forces R to use complex arithmetic and return the principal root.
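A quick illustration of this behavior:

(-8)^(1/3)            ## NaN: negative real base with a fractional exponent
as.complex(-8)^(1/3)  ## 1+1.732051i: the principal complex cube root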

I plotted the relationship between quantity produced and cost. In other words, this tells us the lowest cost needed to produce each quantity on the plot.

plot(production, Re(costs))  ## Re() drops the zero imaginary part left by as.complex()
 
[Figure: cost-production curve]
 

CONCLUSIONS

Using the Cobb-Douglas production function and the cost minimization approach, we were able to find the optimal conditions for the cost function and plot the outcome relative to the quantity produced. As production increases, the minimum cost needed increases in a non-linear, convex fashion, which makes sense given that Y (quantity produced) enters the cost function raised to the power 1/(alpha + beta), which exceeds 1 when alpha + beta < 1, and is positively related to the cost.

This was a fun exercise that made me think about the usefulness of the Cobb-Douglas production function, which I learned to optimize multiple times in my Economics courses. I was excited to find a pleasant utility for it using simulated data and will probably explore more exercises like this in the future.

REFERENCES

I used a lot of resources to write this blog, which are provided below.

A site dedicated to the discussion of economics called EconomicsDiscussion.net was a great resource.

These papers were incredibly helpful in preparing the example in R:

  • Lin CP. The application of Cobb-Douglas production cost functions to construction firms in Japan and Taiwan. Review of Pacific Basin Financial Markets and Policies. 2002;5(1):111-128.

  • Larriviere JB, Sandler R. A student friendly illustration and project: empirical testing of the Cobb-Douglas production function using major league baseball. Journal of Economics and Economic Education Research. 2012;13(3):81-92.

  • Hu ZH. Reliable optimal production control with Cobb-Douglas model. Reliable Computing. 1998;4(1):63-69.

I encountered some issues regarding complex numbers in R. Fortunately, I found some great resources about it.

  • I found a great discussion about R’s calculation of exponents and “NaN” results and why complex numbers can mess up your math in R.

  • Another good site (R Tutorial: An Introduction to Statistics) explaining complex numbers in R.

  • John Myles White wrote a nice article about complex numbers in R.

Acknowledgements: I would like to thank the user who reached out to me about the coefficient errors for the output elasticity of capital. This helps me learn from my mistakes and correct them. Without the support and guidance from the community, I would not be able to achieve my own goal of being a lifelong learner. Thank you.

Estimating marginal effects using Stata Part 1 – Linear models

BACKGROUND

Regression models provide unique opportunities to examine the impact of certain predictors on a specific outcome. These predictors’ effects are usually isolated using the model coefficients adjusting for all other predictors or covariates. A simple linear regression model with a single predictor x_i is represented as

 
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
 

where y_i denotes the outcome (dependent) variable for subject i, beta0 denotes the intercept, beta1 is the model coefficient that denotes the change in y due to a 1-unit change in x, and epsilon_i is the error term for subject i.

A 1-unit increase in x is associated with some change in the outcome y. This finding may explain predictor variable x’s impact on outcome variable y, but it doesn’t tell us the impact for a representative or prototypical case.

The marginal effect allows us to examine the impact of variable x on outcome y for representative or prototypical cases. For example, Stata’s margins command can tell us the marginal effect on body mass index (BMI) for a 50-year old versus a 25-year old subject.

There are three types of marginal effects of interest:

1. Marginal effect at the means (MEM)

2. Average marginal effect (AME)

3. Marginal effect at representative values (MER)

Each of these marginal effects has a unique interpretation that will impact how you examine the regression results. (We will focus on the first two, since the third one is an extension of the AME.) The objective of this tutorial is to review these marginal effects and understand their interpretations through examples using Stata.

 

MOTIVATING EXAMPLE

We will use the Second National Health and Nutrition Examination Survey (NHANES) data from the 1980s, which can be found in Stata’s library using the following command:

use http://www.stata-press.com/data/r15/nhanes2.dta

Table 1 summarizes the characteristics of the NHANES population.

[Table 1. Characteristics of the NHANES population.]

 

ADJUSTED PREDICTIONS

An adjusted prediction from a regression model provides the expected value of the outcome y conditioned on x, all other things being equal. In other words, this is the effect of the predictor variable x on the outcome variable y, adjusting or controlling for the other covariates. Therefore, if you were comparing the effect of a 1-unit increase in age on BMI, you could make this comparison holding race (White, Black, or Other) constant.

Example 1

A simple linear regression model can capture the incremental effect of age on body mass index. For example, the impact of age on body mass index (BMI) can be represented as a linear regression:

 
$$BMI_i = \beta_0 + \beta_1 AGE_i + \epsilon_i$$
 

where BMI_i is the body mass index for individual i, beta0 denotes the intercept (or BMI when AGE = 0), beta1 denotes the change in BMI for each 1-unit increase in AGE for individual i, and epsilon_i denotes the error term for individual i. (The unit of BMI is kg/m^2.)

The Stata command to perform a simple linear regression:

regress bmi age

The corresponding regression output is:

[Stata output: regress bmi age]

In this regression output, the predictor of interest is AGE. The _cons parameter denotes the coefficient beta0, otherwise known as the intercept; therefore, a subject with AGE = 0 has a BMI of 23.2 kg/m^2. (Although this is unrealistic, we will ignore it for now.) The impact AGE has on BMI is denoted by the slope parameter beta1, which is the change in BMI due to a 1-unit change in AGE. In this example, a 1-unit increase in AGE is associated with a 0.05 kg/m^2 increase in BMI.

If we wanted to know the difference in BMI between a 50-year old and 25-year old, we need to estimate the adjusted prediction, which estimates the difference in the outcome based on some user-defined values for the x variables.

To estimate the adjusted predicted BMI for a 50-year old, we use the following equation:

 
$$\widehat{BMI}_{AGE=50} = \hat{\beta}_0 + \hat{\beta}_1 \times 50$$
 

which is 25.7 kg/m^2. We can do this using the following Stata command:

di _b[_cons] + 50*_b[age]
25.655896

Similarly, we can estimate the adjusted predicted BMI for a 25-year old:

 
$$\widehat{BMI}_{AGE=25} = \hat{\beta}_0 + \hat{\beta}_1 \times 25$$
 

which is 24.4 kg/m^2.

The difference between these two is:

25.655896 - 24.433991 = 1.2 kg/m^2. 

Therefore, the difference in BMI between a 50-year old and 25-year old is on average 1.2 kg/m^2. This seems like a tedious process, but let’s see how we can make this exercise simpler using Stata’s margins command.

We can use Stata’s margins command to estimate the adjusted predicted BMI for a 50-year old and 25-year old:

margins, at(age=(25 50))

Figure 2. Stata’s margins command output for adjusted prediction of BMI for a 50-year old and 25-year old.


Example 2

We now use a linear regression with additional independent variables to illustrate how adjusted predictions work when other covariates are in the model.

The regression model has the structure:

 
$$BMI_i = \beta_0 + \beta_1 AGE_i + \beta_2 FEMALE_i + \beta_3 RACE_i + \epsilon_i$$
 

where BMI_i is the body mass index for individual i, beta0 is the intercept (or BMI when AGE = 0), beta1 is the change in BMI for each 1-unit increase in AGE for individual i, beta2 denotes the change in BMI for a female relative to a male, beta3 denotes the change in BMI due to contrasts in the race categories (White, Black, and Other), and epsilon_i is the error term for individual i. (The unit of BMI is kg/m^2.)

For this example, RACE and SEX will be included in the regression model as indicator (dummy) variables using the following Stata command:

regress bmi age i.race i.sex

The corresponding regression output is:

[Stata output: regress bmi age i.race i.sex]

The following are interpretations of the regression output.

A 1-unit increase in age is associated with a BMI increase of 0.05 kg/m^2 adjusting for race and sex or all things being equal.

Blacks are associated with a BMI increase of 1.4 kg/m^2 adjusting for age and sex compared to Whites.

Others are associated with a BMI decrease of 1.2 kg/m^2 adjusting for age and sex compared to Whites.

Females are associated with a BMI increase of 0.03 kg/m^2 adjusting for age and race.

If we wanted to know the adjusted prediction for a 50-year old and 25-year old, we can use the margins command:

margins, at(age=(25 50)) atmeans vsquish

The output is similar to Example 1 but there are some differences.

[Stata output: margins, at(age=(25 50)) atmeans vsquish]

The atmeans option sets the other covariates at their sample means. In our example, the mean proportion of females is 0.525, males 0.475, Whites 0.876, Blacks 0.105, and Others 0.019. Therefore, the adjusted predictions of BMI for a 50-year old and a 25-year old are conditioned on these “average” covariate values. This may not make sense, because an individual subject can’t be 0.525 female and 0.475 male. Fortunately, we have other ways to address this with the marginal effect.

 

MARGINAL EFFECT

The marginal effect from the margins command is the change in the conditional mean of outcome y with respect to a single predictor variable x. In other words, it is the partial effect of x on the outcome y for some representative or prototypical case. Usually this is obtained by taking the first-order derivative of the regression expression:

 
$$\frac{\partial E[y \mid x]}{\partial x}$$
 

where the partial effect is the first-order derivative of the expected value of y conditional on x, taken with respect to x. For the simple linear model above, this derivative is simply beta1.

The representative or prototypical case can be set at the means, at the observed values, or at user-defined values.

 

MARGINAL EFFECT AT THE MEANS (MEM)

The MEM is the partial effect of a regressor (x) on the dependent variable (y) after setting all the other covariates (w) at their means. In other words, the MEM is x’s effect on y when all the other covariates (RACE and FEMALE) are at their means.

Let’s revisit the linear regression model but with the dummy variables included:

 
$$BMI_i = \beta_0 + \beta_1 AGE_i + \beta_2 BLACK_i + \beta_3 OTHER_i + \beta_4 FEMALE_i + \epsilon_i$$
 

In the output, beta1 = 0.0493881.

To get the partial effect of AGE with the other covariates at their means, we use the following commands:

regress bmi age i.race i.sex

[Stata output: regress bmi age i.race i.sex]

margins, dydx(age) atmeans vsquish

[Stata output: margins, dydx(age) atmeans vsquish]

Interpretation: For a subject who is average on all characteristics, the marginal change of a 1-unit increase in age is a 0.049 kg/m^2 increase in BMI.

We can also look at the MEM at different ages (e.g., 25 and 50 years):

margins, dydx(age) at(age=(25 50)) atmeans vsquish

This command performs the MEM for 25- and 50-year old subjects with their covariates set at the population mean. We interpret the results as the effect of age, evaluated at different ages, with the other covariates at their average values.

The MEM should be:

[Stata output: margins, dydx(age) at(age=(25 50)) atmeans vsquish]

The effect of age at 25 and at 50 years old is an increase of 0.05 kg/m^2 in BMI. Notice that the MEM for 25- and 50-year olds is the same (MEM = 0.0493881). This is because the model is a linear regression. For every incremental increase in age, the incremental increase in BMI is 0.0493881, given that the other covariates are set at their means.

To illustrate, we can manually perform this operation using the information above. Recall that the linear regression model with the dummy variables is represented as:

 
$$BMI_i = \beta_0 + \beta_1 AGE_i + \beta_2 BLACK_i + \beta_3 OTHER_i + \beta_4 FEMALE_i + \epsilon_i$$
 

BMI for a 25-year old subject at the mean = intercept + 25*(beta1) + (mean of Black)*(beta2) + (mean of Other)*(beta3) + (mean of Female)*(beta4) = 24.42243 kg/m^2.

BMI for a 25-year old subject at the mean = 23.0528 + 25*(0.0493881) + .1049174*(1.382849) + .0193218*(-1.2243) + .5251667*(.025702) = 24.42243 kg/m^2, which is the same as the value presented in the adjusted prediction output.

Why are these the same? The linear regression is predictable in terms of its slope coefficients: an incremental increase in predictor variable x has the same incremental marginal effect on outcome variable y everywhere. When you apply the MEM to non-linear models, the slopes are no longer constant and will change at varying levels of the continuous predictor x.

 

AVERAGE MARGINAL EFFECT (AME)

Unlike the MEM, the average marginal effect (AME) doesn’t set the covariates at their means when estimating the partial effect of the predictor variable x on the outcome variable y. Rather, the AME estimates the partial effect of x on y using the observed values of the covariates for each subject, and then averages those partial effects. In other words, the partial derivative with respect to x is estimated at the observed values of the other covariates (RACE and FEMALE) for each subject, and these first-order derivatives are averaged over the entire population to yield the AME. This is represented as:

 
$$AME = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial E[y_i \mid x_i, w_i]}{\partial x}$$
 

where the partial derivative of the expected value of the outcome y with respect to x is evaluated at subject i’s observed covariate values (w_i) and then averaged over the entire population (N). In a linear model, each of these derivatives reduces to beta_k, the parameter of interest.

To get the partial effect of AGE with the other covariates at their observed values, we use the following commands:

regress bmi age i.race i.sex

margins, dydx(age) asobserved vsquish

[Stata output: margins, dydx(age) asobserved vsquish]

Interpretation: The average marginal effect of a 1-unit increase in age is a 0.049 kg/m^2 increase in BMI.

We can also look at the AME at different ages (e.g., 25 and 50 years):

margins, dydx(age) at(age=(25 50)) asobserved vsquish

This command performs the AME for 25- and 50-year old subjects with their covariates set at the observed values. We interpret the results as the effect of age, evaluated at different ages, with the other covariates at their observed values.

The AME should be:

[Stata output: margins, dydx(age) at(age=(25 50)) asobserved vsquish]

The effect of age at 25 and at 50 years old is an increase of 0.05 kg/m^2 in BMI. Notice that the AME for 25- and 50-year olds is the same (AME = 0.0493881). Similar to the MEM, this is because the model is a linear regression. For every incremental increase in age, the incremental increase in BMI is 0.0493881, given that the other covariates are set at their observed values.

 

CONCLUSIONS

We see that the MEM and AME are exactly the same because the model is linear. The marginal effect of a 1-unit increase in age is a 0.05 kg/m^2 increase in BMI. In the next part, non-linear models will be used to demonstrate that the MEM and AME are not equal.
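For readers working in R, Thomas Leeper’s margins package (listed in the references below) mimics much of Stata’s margins command. A hedged sketch using the same NHANES data:

library(haven)    # reads Stata .dta files
library(margins)

nhanes2 <- read_dta("http://www.stata-press.com/data/r15/nhanes2.dta")

fit <- lm(bmi ~ age + factor(race) + factor(sex), data = nhanes2)

## Average marginal effect of age over the observed covariate values
summary(margins(fit, variables = "age"))

## AME of age at age = 25 and age = 50 (identical in a linear model)
summary(margins(fit, variables = "age", at = list(age = c(25, 50))))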

 

REFERENCES

I used the following websites to help create this tutorial:

https://thomasleeper.com/margins/articles/Introduction.html

https://support.sas.com/rnd/app/ets/examples/margeff/index.html

https://www.ssc.wisc.edu/sscc/pubs/stata_margins.htm

 

I also used the following paper by Richard Williams:

Williams R. Using the margins command to estimate and interpret adjusted predictions and marginal effects. The Stata Journal. 2012;12(2):308-331.

https://www.stata-journal.com/article.html?article=st0260