Tutorial

Installation

Installing with pip should be all that is needed: pip install regpyhdfe
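To confirm the install succeeded before going further, a quick check along the following lines may help. This is only a sketch: is_installed is a hypothetical helper written for this tutorial, not part of regpyhdfe, and it relies solely on the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in this environment.

    Hypothetical helper for illustration; not part of regpyhdfe.
    """
    return importlib.util.find_spec(module_name) is not None


# "regpyhdfe" is the import name used later in this tutorial.
if is_installed("regpyhdfe"):
    print("regpyhdfe is available")
else:
    print("regpyhdfe is not installed; try: pip install regpyhdfe")
```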

Load in data

We need a pandas dataframe. For the purposes of this example, you can go to https://github.com/lod531/regPyHDFE/blob/main/data/cleaned_nlswork.dta and download the cleaned nlswork dataset. This dataset contains the entries that can be acquired in Stata by typing use nlswork, except that rows containing NA values have already been dropped (hence cleaned_nlswork.dta rather than nlswork.dta).

Once you have the file, importing the data is as simple as:

import pandas as pd
# load dataframe
df = pd.read_stata('path/to/cleaned_nlswork.dta')

Pandas has other import functions, e.g. pd.read_csv, if you have a file in a different format.
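As a rough illustration of loading a different format, the sketch below reads CSV text and drops rows with missing values. The data is made up, and using dropna() is an assumption about how such cleaning could be done, not a record of how cleaned_nlswork.dta was produced.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk. The column names mirror
# those used in this tutorial, but the values are made up.
csv_text = """idcode,year,ln_wage,hours
1,70,1.45,20
1,71,1.03,
2,70,1.36,40
"""

# pd.read_csv accepts a path or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))

# Drop every row that has at least one missing value.
df_clean = raw.dropna().reset_index(drop=True)
print(len(raw), "rows before,", len(df_clean), "rows after dropna")
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV.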

Regress

target is the name of the dependent (target) variable.

predictors are the names of the predictor (independent) variables.

absorb_ids are the names of the variables to be absorbed as high-dimensional fixed effects.

cluster_ids are the names of the variables containing cluster information (i.e. if there are N clusters, then each row of a cluster variable contains one of N distinct values).

target = "ln_wage"
predictors = ["hours", "tenure", "ttl_exp"]
absorb_ids = ["year", "idcode"]
cluster_ids = ["year"]

from regpyhdfe import Regpyhdfe
model = Regpyhdfe(df=df, target=target, predictors=predictors,
                    absorb_ids=absorb_ids,
                    cluster_ids=cluster_ids)
results = model.fit()
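To give some intuition for what "absorbing" a fixed effect means, here is a minimal sketch on synthetic data with a single fixed effect. It uses the textbook within transformation (subtracting group means) and checks that the slope matches a regression with one dummy column per group. This is not necessarily how regpyhdfe works internally; all names and numbers below are made up for illustration, and with several fixed effects (as in absorb_ids above) a single pass of demeaning is generally not sufficient.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic panel: 5 groups with 20 rows each (made-up data).
toy = pd.DataFrame({"group": np.repeat(np.arange(5), 20)})
toy["x"] = rng.normal(size=len(toy))
effects = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
group_effect = effects[toy["group"].to_numpy()]
toy["y"] = 0.7 * toy["x"] + group_effect + 0.1 * rng.normal(size=len(toy))

# (a) Absorb the group effect: subtract each group's mean from y and x,
#     then regress demeaned y on demeaned x (no constant needed).
y_dm = (toy["y"] - toy.groupby("group")["y"].transform("mean")).to_numpy()
x_dm = (toy["x"] - toy.groupby("group")["x"].transform("mean")).to_numpy()
slope_absorbed = float(np.dot(x_dm, y_dm) / np.dot(x_dm, x_dm))

# (b) Equivalent regression with one dummy column per group.
dummies = pd.get_dummies(toy["group"], dtype=float).to_numpy()
design = np.column_stack([toy["x"].to_numpy(), dummies])
coefs, *_ = np.linalg.lstsq(design, toy["y"].to_numpy(), rcond=None)
slope_dummies = float(coefs[0])

# The two slopes agree up to floating-point error.
print(slope_absorbed, slope_dummies)
```

The practical appeal of absorbing is that approach (b) needs one column per group, which becomes unwieldy when a variable such as idcode takes thousands of distinct values.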

Examine results

At the time of writing, the results object is of type statsmodels.regression.linear_model.RegressionResults; see the statsmodels documentation for that class.

The statsmodels.regression.linear_model.RegressionResults object has a variety of statistics, but chances are all you're looking for is a summary, like so:

print(results.summary())

The output looks like:

                            OLS Regression Results
=======================================================================================
Dep. Variable:                ln_wage   R-squared (uncentered):                   0.059
Model:                            OLS   Adj. R-squared (uncentered):          -1313.428
Method:                 Least Squares   F-statistic:                              185.2
Date:                Thu, 14 Jan 2021   Prob (F-statistic):                    2.09e-08
Time:                        13:21:24   Log-Likelihood:                          766.62
No. Observations:               12568   AIC:                                     -1527.
Df Residuals:                       9   BIC:                                     -1505.
Df Model:                           3
Covariance Type:              cluster
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
hours         -0.0017      0.001     -3.371      0.001      -0.003      -0.001
tenure         0.0109      0.003      3.858      0.000       0.005       0.016
ttl_exp        0.0348      0.003     12.650      0.000       0.029       0.040
==============================================================================
Omnibus:                     1709.175   Durbin-Watson:                   2.171
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            21109.707
Skew:                          -0.174   Prob(JB):                         0.00
Kurtosis:                       9.340   Cond. No.                         6.87
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors are robust to cluster correlation (cluster)

For convenience, the whole script is:

import pandas as pd
# load dataframe
df = pd.read_stata('/path/to/cleaned_nlswork.dta')

target = "ln_wage"
predictors = ["hours", "tenure", "ttl_exp"]
absorb_ids = ["year", "idcode"]
cluster_ids = ["year"]

from regpyhdfe import Regpyhdfe
model = Regpyhdfe(df=df, target=target,
    predictors=predictors,
    absorb_ids=absorb_ids,
    cluster_ids=cluster_ids)
results = model.fit()
print(results.summary())