Introduction
This package provides a semi-convenient way of performing regressions with high-dimensional fixed effects in Python.
To use it, your data must be in a pandas DataFrame. Please see the Examples and Tutorial sections for instructions.
Limitations
Does not work with weighting (yet).
Addressing this would require changing the underlying model used for the regressions. This is currently standard OLS from statsmodels, which does not support weighting. statsmodels has other linear regression objects that do support weighting, but those have their own issues (Weighted Least Squares does not support all kinds of weighting, Generalized Linear Models lack appropriate summary statistics, etc.).
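For reference, here is a minimal sketch of the difference in statsmodels itself, on toy data (everything in it is illustrative and not part of this package's API):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=200)
w = rng.uniform(0.5, 2.0, size=200)  # hypothetical observation weights

# What this package currently builds on: plain OLS, which accepts no weights.
ols_res = sm.OLS(y, X).fit()

# The nearest weighted alternative: WLS, whose weights are interpreted as
# inverse-variance (analytic) weights only -- not frequency or probability
# weights, hence "does not support all kinds of weighting".
wls_res = sm.WLS(y, X, weights=w).fit()
print(ols_res.params)
print(wls_res.params)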
Replication of Stata is not _quite_ identical when using clustering
By "not quite" I mean an F-statistic of 114.8 vs. 114.58, and the like (a sketch of roughly how each side can be produced follows the two outputs below):
# OLS Regression Results
# =======================================================================================
# Dep. Variable:                ttl_exp   R-squared (uncentered):              0.454
# Model:                            OLS   Adj. R-squared (uncentered):      -623.342
# Method:                 Least Squares   F-statistic:                         114.8
# Date:                Sun, 07 Feb 2021   Prob (F-statistic):               4.28e-08
# Time:                        19:42:56   Log-Likelihood:                    -29361.
# No. Observations:               12568   AIC:                             5.873e+04
# Df Residuals:                      11   BIC:                             5.874e+04
# Df Model:                           2
# Covariance Type:              cluster
# ==============================================================================
#                  coef    std err          t      P>|t|      [0.025      0.975]
# ------------------------------------------------------------------------------
# wks_ue         0.0307      0.016      1.975      0.074      -0.004       0.065
# tenure         0.8514      0.066     12.831      0.000       0.705       0.997
# ==============================================================================
# Omnibus:                     2467.595   Durbin-Watson:                   1.819
# Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8034.980
# Skew:                           0.993   Prob(JB):                         0.00
# Kurtosis:                       6.376
# ==============================================================================
#
# vs.
#
# ==============================================================================
# HDFE Linear regression                            Number of obs   =     12,568
# Absorbing 1 HDFE group                            F(   2,     11) =     114.58
# Statistics robust to heteroskedasticity           Prob > F        =     0.0000
#                                                   R-squared       =     0.6708
#                                                   Adj R-squared   =     0.5628
# Number of clusters (idcode)  =      3,102         Within R-sq.    =     0.4536
# Number of clusters (year)    =         12         Root MSE        =     2.8836
#
# (Std. Err. adjusted for 12 clusters in idcode year)
# ------------------------------------------------------------------------------
# | Robust
# ttl_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
# -------------+----------------------------------------------------------------
# wks_ue | .0306653 .0155436 1.97 0.074 -.0035459 .0648765
# tenure | .8513953 .0663892 12.82 0.000 .7052737 .9975169
# _cons | 3.784107 .4974451 7.61 0.000 2.689238 4.878976
# ------------------------------------------------------------------------------
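For reference, a minimal sketch of roughly how each side of this comparison can be produced. The Stata side corresponds to something like reghdfe ttl_exp wks_ue tenure, absorb(idcode) vce(cluster idcode year); the Python side below assumes nlswork-style data in a DataFrame df and absorbs the fixed effect by demeaning before fitting a clustered OLS. This illustrates the setup, not this package's exact internals:

import statsmodels.api as sm

cols = ["ttl_exp", "wks_ue", "tenure"]
df = df.dropna(subset=cols)

# Absorb the idcode fixed effect via the within transformation
# (the constant is absorbed along with it).
demeaned = df[cols] - df.groupby("idcode")[cols].transform("mean")

# OLS on the demeaned data with a two-way clustered covariance;
# statsmodels takes a two-column groups array for two-way clustering.
res = sm.OLS(demeaned["ttl_exp"], demeaned[["wks_ue", "tenure"]]).fit(
    cov_type="cluster",
    cov_kwds={"groups": df[["idcode", "year"]].to_numpy()},
)
print(res.summary())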
A note of interest: the fewer degrees of freedom involved, the worse this discrepancy becomes. Here is an example where clustering leaves only a single residual degree of freedom (two clusters), so the discrepancy is about as bad as it can get:
# OLS Regression Results
# =======================================================================================
# Dep. Variable:                ttl_exp   R-squared (uncentered):              0.454
# Model:                            OLS   Adj. R-squared (uncentered):     -6866.766
# Method:                 Least Squares   F-statistic:                     1.354e+04
# Date:                Sun, 07 Feb 2021   Prob (F-statistic):                0.00608
# Time:                        20:32:41   Log-Likelihood:                    -29361.
# No. Observations:               12568   AIC:                             5.873e+04
# Df Residuals:                       1   BIC:                             5.874e+04
# Df Model:                           2
# Covariance Type:              cluster
# ==============================================================================
#                  coef    std err          t      P>|t|      [0.025      0.975]
# ------------------------------------------------------------------------------
# wks_ue         0.0307      0.004      7.483      0.085      -0.021       0.083
# tenure         0.8514      0.013     66.321      0.010       0.688       1.015
# ==============================================================================
# Omnibus:                     2467.595   Durbin-Watson:                   1.819
# Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8034.980
# Skew:                           0.993   Prob(JB):                         0.00
# Kurtosis:                       6.376
# ==============================================================================
#
# vs.
#
# ==============================================================================
# HDFE Linear regression                            Number of obs   =     12,568
# Absorbing 1 HDFE group                            F(   2,      1) =    7212.19
# Statistics robust to heteroskedasticity           Prob > F        =     0.0083
#                                                   R-squared       =     0.6708
#                                                   Adj R-squared   =     0.5628
# Number of clusters (union)   =          2         Within R-sq.    =     0.4536
# Number of clusters (idcode)  =      3,102         Root MSE        =     2.8836
#
# (Std. Err. adjusted for 2 clusters in union idcode)
# ------------------------------------------------------------------------------
# | Robust
# ttl_exp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
# -------------+----------------------------------------------------------------
# wks_ue | .0306653 .0043113 7.11 0.089 -.024115 .0854455
# tenure | .8513953 .0132713 64.15 0.010 .6827674 1.020023
# _cons | 3.784107 .0531894 71.14 0.009 3.108272 4.459942
# ------------------------------------------------------------------------------
#
# Absorbed degrees of freedom:
# -----------------------------------------------------+
# Absorbed FE | Categories - Redundant = Num. Coefs |
# -------------+---------------------------------------|
# idcode | 3102 3102 0 *|
# -----------------------------------------------------+
The t-values are still quite similar, but the F-statistic is completely off. It is my understanding that one is not really supposed to use clustering with fewer than 30 clusters anyway. Re-running with a synthetic clustering variable that has exactly 30 clusters yields:
# HDFE Linear regression                            Number of obs   =     12,568
# Absorbing 1 HDFE group                            F(   2,     29) =    5953.94
# Statistics robust to heteroskedasticity           Prob > F        =     0.0000
#                                                   R-squared       =     0.6708
#                                                   Adj R-squared   =     0.5628
#                                                   Within R-sq.    =     0.4536
# Number of clusters (delete_me) =       30         Root MSE        =     2.8836
#
#                               (Std. Err. adjusted for 30 clusters in delete_me)
# ------------------------------------------------------------------------------
#              |               Robust
#      ttl_exp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
# -------------+----------------------------------------------------------------
#       wks_ue |   .0306653   .0054123     5.67   0.000     .0195959    .0417346
#       tenure |   .8513953   .0078384   108.62   0.000     .8353639    .8674267
#        _cons |   3.784107   .0467015    81.03   0.000     3.688592    3.879622
# ------------------------------------------------------------------------------
#
# Absorbed degrees of freedom:
# -----------------------------------------------------+
#  Absorbed FE | Categories  - Redundant  = Num. Coefs |
# -------------+---------------------------------------|
#       idcode |      3102           0         3102    |
# -----------------------------------------------------+
#
# vs.
#
# ==============================================================================
# OLS Regression Results
# =======================================================================================
# Dep. Variable:                ttl_exp   R-squared (uncentered):              0.454
# Model:                            OLS   Adj. R-squared (uncentered):      -235.820
# Method:                 Least Squares   F-statistic:                         7905.
# Date:                Sun, 07 Feb 2021   Prob (F-statistic):               2.03e-40
# Time:                        22:00:43   Log-Likelihood:                    -29361.
# No. Observations:               12568   AIC:                             5.873e+04
# Df Residuals:                      29   BIC:                             5.874e+04
# Df Model:                           2
# Covariance Type:              cluster
# ==============================================================================
#                  coef    std err          t      P>|t|      [0.025      0.975]
# ------------------------------------------------------------------------------
# wks_ue         0.0307      0.005      6.529      0.000       0.021       0.040
# tenure         0.8514      0.007    125.159      0.000       0.837       0.865
# ==============================================================================
# Omnibus:                     2467.595   Durbin-Watson:                   1.819
# Prob(Omnibus):                  0.000   Jarque-Bera (JB):             8034.980
# Skew:                           0.993   Prob(JB):                         0.00
# Kurtosis:                       6.376
# ==============================================================================
Oh dear: even with 30 clusters the F-statistics still disagree substantially (5953.94 vs. 7905). What about 100 clusters?
The simplest fix would be to calculate these metrics manually: the residuals and coefficients are all correct, which should give us all the information needed to compute appropriately adjusted statistics. This would involve finding the method that the Stata packages use, locating a reference for it, and implementing it as specified.
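As a starting point, here is a minimal sketch of such a manual calculation: a joint Wald F-test assembled from the coefficient vector and a sandwich covariance, using the G/(G-1) * (N-1)/(N-k) small-sample multiplier and the F(k, G-1) reference distribution that Stata documents for clustered VCEs (both of which would still need to be verified against the references and against reghdfe's own degrees-of-freedom accounting):

import numpy as np
from scipy import stats

def clustered_wald_f(beta, cov_raw, n_obs, n_params, n_clusters):
    """Joint F-test that all coefficients are zero.

    beta:    estimated coefficients, shape (k,)
    cov_raw: cluster sandwich covariance *without* finite-sample
             correction, shape (k, k)

    Applies the q = G/(G-1) * (N-1)/(N-k) multiplier (assumed here, to
    be checked against the manuals) and tests against F(k, G-1).
    """
    q = (n_clusters / (n_clusters - 1)) * ((n_obs - 1) / (n_obs - n_params))
    V = q * cov_raw
    wald = beta @ np.linalg.solve(V, beta)
    f_stat = wald / n_params
    p_value = stats.f.sf(f_stat, n_params, n_clusters - 1)
    return f_stat, p_value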