Tutorial
Installation
In most cases, running pip install regpyhdfe should be enough.
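To confirm that the package is visible to the Python interpreter you intend to use, a quick check along these lines may help. It is only a sketch that relies on the standard library's importlib; the variable name installed is illustrative.

```python
import importlib.util

# find_spec returns None when the package cannot be located on the import path,
# so this check works whether or not regpyhdfe is actually installed.
installed = importlib.util.find_spec("regpyhdfe") is not None
print("regpyhdfe installed:", installed)
```

If this prints False, pip may have installed the package into a different environment than the one running the script.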
Load in data
We need a pandas DataFrame. For the purposes of this example, you can go to https://github.com/lod531/regPyHDFE/blob/main/data/cleaned_nlswork.dta and download the cleaned nlswork dataset. It contains the same entries you would get in Stata by typing use nlswork, except that rows containing NA values have already been dropped (hence cleaned_nlswork.dta rather than nlswork.dta).
Once you have the file, importing the data is as simple as:
import pandas as pd
# load dataframe
df = pd.read_stata('path/to/cleaned_nlswork.dta')
pandas has other import functions if your file is in a different format, e.g. pd.read_csv.
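As a rough illustration of both points (a different input format, and dropping rows with NA values the way the cleaned dataset was prepared), the sketch below reads a small CSV and removes incomplete rows. The CSV content is made up and held in memory so the snippet is self-contained; the column names only mirror the tutorial's variables.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk; the values are invented.
csv_text = """idcode,year,ln_wage,hours
1,70,1.45,20
1,71,1.03,
2,70,1.36,40
2,71,,35
"""

# read_csv accepts a path or any file-like object.
df_raw = pd.read_csv(io.StringIO(csv_text))

# Drop every row that has at least one missing value.
df_clean = df_raw.dropna().reset_index(drop=True)

print(len(df_raw), "rows before,", len(df_clean), "rows after dropping NA")
```

With a real file you would pass its path to pd.read_csv instead of the in-memory buffer.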
Regress
target is the name of the dependent variable (the outcome being modeled).
predictors are the names of the regressors, i.e. the independent variables whose coefficients are estimated.
absorb_ids are the names of the variables to be absorbed as high-dimensional fixed effects.
cluster_ids are the names of the variables containing cluster information, i.e. if there are N clusters, each row of a cluster variable contains one of N distinct values.
target = "ln_wage"
predictors = ["hours", "tenure", "ttl_exp"]
absorb_ids = ["year", "idcode"]
cluster_ids = ["year"]
from regpyhdfe import Regpyhdfe
model = Regpyhdfe(df=df, target=target, predictors=predictors,
                  absorb_ids=absorb_ids,
                  cluster_ids=cluster_ids)
results = model.fit()
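To build some intuition for what absorbing a fixed effect means, here is a minimal sketch of the textbook within transformation for a single grouping variable: subtract each group's mean from the target and the predictor, then run least squares without a constant. This illustrates the general idea only and is not necessarily how regpyhdfe works internally; the toy data and all names in it are made up.

```python
import numpy as np
import pandas as pd

# Tiny invented panel: two groups with different intercepts and a common slope of 2.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
})
intercepts = toy["group"].map({"a": 10.0, "b": -5.0})
toy["y"] = intercepts + 2.0 * toy["x"]

# Within transformation: subtract each group's mean from y and x.
group_means = toy.groupby("group")[["y", "x"]].transform("mean")
demeaned = toy[["y", "x"]] - group_means

# Least squares without a constant on the demeaned data recovers the slope,
# because the group-specific intercepts have been removed.
slope, *_ = np.linalg.lstsq(
    demeaned[["x"]].to_numpy(), demeaned["y"].to_numpy(), rcond=None
)
print(slope)
```

With more than one absorbed variable, as in the example above with year and idcode, a single pass of demeaning is generally not enough and iterative procedures are typically used; the sketch does not attempt that.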
Examine results
At the time of writing, the results object is of type statsmodels.regression.linear_model.RegressionResults, which is documented in the statsmodels API reference.
The statsmodels.regression.linear_model.RegressionResults object has a variety of statistics, but chances are all you're looking for is a summary, like so:
print(results.summary())
The output of that looks like
                                OLS Regression Results
=======================================================================================
Dep. Variable:                ln_wage   R-squared (uncentered):                   0.059
Model:                            OLS   Adj. R-squared (uncentered):          -1313.428
Method:                 Least Squares   F-statistic:                              185.2
Date:                Thu, 14 Jan 2021   Prob (F-statistic):                    2.09e-08
Time:                        13:21:24   Log-Likelihood:                          766.62
No. Observations:               12568   AIC:                                     -1527.
Df Residuals:                       9   BIC:                                     -1505.
Df Model:                           3
Covariance Type:              cluster
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
hours         -0.0017      0.001     -3.371      0.001      -0.003      -0.001
tenure         0.0109      0.003      3.858      0.000       0.005       0.016
ttl_exp        0.0348      0.003     12.650      0.000       0.029       0.040
==============================================================================
Omnibus:                     1709.175   Durbin-Watson:                   2.171
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            21109.707
Skew:                          -0.174   Prob(JB):                         0.00
Kurtosis:                       9.340   Cond. No.                         6.87
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors are robust to cluster correlation (cluster)
For convenience, the whole script is:
import pandas as pd
from regpyhdfe import Regpyhdfe

# load dataframe
df = pd.read_stata('/path/to/cleaned_nlswork.dta')

target = "ln_wage"
predictors = ["hours", "tenure", "ttl_exp"]
absorb_ids = ["year", "idcode"]
cluster_ids = ["year"]

model = Regpyhdfe(df=df, target=target,
                  predictors=predictors,
                  absorb_ids=absorb_ids,
                  cluster_ids=cluster_ids)
results = model.fit()
print(results.summary())