PriceDelayTstat do file winsorizing bug #177

@chenandrewy

Long story short, we should remove the winsorizing in this do file. But I spent way too much time tracking this down, because of OCD, and thought we should document how OCD this repo is.

So there's a bug in

https://github.com/OpenSourceAP/CrossSection/blob/d81c696d283d62b61260f223eeac0e90511a4e77/Signals/Code/Predictors/ZZ2_PriceDelaySlope_PriceDelayRsq_PriceDelayTstat.do

line 84 has

gstats winsor PriceDelayTstat, by(time_avail_m) trim cuts(10 90) replace  // Trim very aggressively because coefficient/se not very well-behaved

which should mean that all extreme values are forced to the same value for a given time_avail_m. But instead, we get these weird missing values when I run

gstats winsor PriceDelayTstat, by(time_avail_m) trim cuts(10 90) gen(TstatWin)
list permno time_avail_m PriceDelayTstat TstatWin if time_avail_m == tm(1954m7) & permno >= 20677 & permno <= 20800
[Image: output of the list command above]

The value of 14.03342 should be trimmed to a smaller value of around 6, based on the summary stats:
[Image: summary stats for PriceDelayTstat]

but instead it's made missing. Then the missing value is filled later on in the code, in the "Fill to Monthly" step. This weirdness might happen because the underlying data is actually daily, and we don't sort by daily date, but I'm honestly not sure.
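
For comparison on the Python side, here is a minimal pandas sketch of the two behaviours: winsorizing (clamping to the 10th/90th percentile within each month) versus trimming (setting values outside those cuts to missing). The toy data and the helper names `winsorize_by_group` and `trim_by_group` are made up for illustration, and pandas' default percentile interpolation is an assumption, so the cutoffs need not match gstats exactly. It may be worth checking the gtools docs on whether the `trim` option is meant to produce the second behaviour. If so, the missing values above would be expected output rather than a sorting problem.

```python
import pandas as pd


def _group_cuts(df, col, by, lo, hi):
    """Per-row lower/upper cutoffs, computed within each group."""
    g = df.groupby(by)[col]
    lower = g.transform(lambda s: s.quantile(lo))
    upper = g.transform(lambda s: s.quantile(hi))
    return lower, upper


def winsorize_by_group(df, col, by, lo=0.10, hi=0.90):
    """Clamp values outside the cutoffs to the cutoffs (no missing values created)."""
    lower, upper = _group_cuts(df, col, by, lo, hi)
    return df[col].clip(lower=lower, upper=upper)


def trim_by_group(df, col, by, lo=0.10, hi=0.90):
    """Set values outside the cutoffs to missing."""
    lower, upper = _group_cuts(df, col, by, lo, hi)
    return df[col].where(df[col].between(lower, upper))


# Toy data (hypothetical, not from the repo): two months, ten obs each.
toy = pd.DataFrame({
    "time_avail_m": ["1954-07"] * 10 + ["1954-08"] * 10,
    "PriceDelayTstat": [-3.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 14.03342]
                       + [float(x) for x in range(10)],
})

toy["TstatWinsor"] = winsorize_by_group(toy, "PriceDelayTstat", "time_avail_m")
toy["TstatTrim"] = trim_by_group(toy, "PriceDelayTstat", "time_avail_m")

print(toy[toy["time_avail_m"] == "1954-07"])
```

In the toy month, the 14.03342 observation is pulled in to the 90th percentile under winsorizing but becomes missing under trimming, which looks like the pattern in the list output.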

It doesn't matter either way, because the OP (Hou and Moskowitz) doesn't mention any winsorizing. It also doesn't make sense to me why we should winsorize t-stats but not the slopes (if anything the slopes would be noisier). Lastly, the winsorizing should not affect any portfolio sorts anyway.
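
The point about portfolio sorts can be checked quickly: clamping at the 10th/90th percentile only moves observations that are already in the extreme buckets, so quintile assignments should not change. A small sketch with simulated data follows. The random draw, the sample size, and forming quintiles with `pd.qcut` are all assumptions here, not what the repo's portfolio code does. This applies to true winsorizing only; observations that are set to missing behave differently.

```python
import numpy as np
import pandas as pd

# Simulated fat-tailed "t-stats" for a single month (hypothetical data).
rng = np.random.default_rng(0)
tstat = pd.Series(rng.standard_t(df=3, size=200))

# Winsorize at the 10th and 90th percentiles.
lo, hi = tstat.quantile(0.10), tstat.quantile(0.90)
tstat_win = tstat.clip(lower=lo, upper=hi)

# Quintile portfolio assignment before and after winsorizing.
q_raw = pd.qcut(tstat, 5, labels=False)
q_win = pd.qcut(tstat_win, 5, labels=False)

same_assignment = bool((q_raw == q_win).all())
print("quintile assignments identical:", same_assignment)
```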

I found this bug while comparing the python and Stata outputs. For the Stata liberation.
