This repository contains the code and documentation needed to harmonize prior and current Business Master File (BMF) extracts from NCCS and the IRS, respectively. It also contains the code needed to create the unified BMF, which aggregates unique records from all prior BMFs into a consolidated list of active and inactive nonprofit organizations.
NCCS previously processed each BMF released by the IRS by creating a separate data dictionary and schema for each release.
Our current data engineering methodology harmonizes columns across years, ensuring that variables are consistently named and have standardized data types.
Additionally, we discard metadata variables that cannot be reconstructed, either because their documentation is missing or because the IRS removed them from later BMF releases.
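As an illustration of what harmonizing columns can look like, here is a minimal Python sketch. It is not the repository's actual code: the `COLUMN_MAP` contents, the `ORG_NAME` target name, and the `harmonize_rows` helper are hypothetical, and the sketch assumes that every retained value is kept as a string and that unmapped columns are simply dropped.

```python
import csv
import io

# Hypothetical mapping from release-specific headers to harmonized names.
# The real mapping depends on each BMF release's layout.
COLUMN_MAP = {
    "ein": "EIN",
    "EIN": "EIN",
    "name": "ORG_NAME",
    "NAME": "ORG_NAME",
    "ntee_cd": "NTEE_IRS",
    "NTEE_CD": "NTEE_IRS",
}


def harmonize_rows(csv_text):
    """Rename known columns and drop everything that is not in COLUMN_MAP."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {}
        for col, value in raw.items():
            if col in COLUMN_MAP:
                # Keep values as stripped strings so that identifiers such as
                # the EIN never lose leading zeros.
                row[COLUMN_MAP[col]] = (value or "").strip()
        rows.append(row)
    return rows
```

Because the `csv` module never converts types, every value stays a string, which is one simple way to keep data types consistent across releases.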
Currently released IRS BMF extracts are also harmonized and given the following metadata variables, which are derived from the harmonized variables:
- `EIN2` is created from `EIN` by adding a prefix and a hyphen to the 9-digit EIN, in the format `EIN-XX-XXXXXXX`. This ensures that the EIN is treated as a string and not an integer when it is saved to or read in from `.csv` files without explicit data type handling.
- `NTEEV2` is created from the `NTEE_IRS` column, which contains the NTEE codes reported by nonprofits on Form 990.
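The `EIN-XX-XXXXXXX` format can be sketched in a few lines of Python. The `make_ein2` helper below is hypothetical and not taken from this repository. It assumes that an EIN which lost its leading zeros (for example, after being read as an integer) should be left-padded back to 9 digits, and that the input is a string or an integer.

```python
def make_ein2(ein):
    """Format a 9-digit EIN as 'EIN-XX-XXXXXXX'."""
    # Keep only the digits, so inputs like '12-3456789' are also accepted.
    digits = "".join(ch for ch in str(ein) if ch.isdigit())
    # Assumption: shorter values lost leading zeros and are padded back.
    digits = digits.zfill(9)
    if len(digits) != 9:
        raise ValueError(f"expected at most 9 digits, got {ein!r}")
    return f"EIN-{digits[:2]}-{digits[2:]}"
```

For example, `make_ein2("123456789")` returns `"EIN-12-3456789"`. Because the result contains letters and hyphens, a CSV reader cannot infer it as a number.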
Updating the BMF requires the following steps:
- Download the latest BMF data from the IRS.
- Create and clean the latest columns.
- Geocode the BMF and append various FIPS codes alongside the Latitude and Longitude columns.
- Update the financial columns with the latest e-filed data.
- Update the existing unified BMF with the 2025 data.
These steps need to be rerun whenever a new BMF is published.
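The last step, folding releases into the unified BMF, could be sketched as follows. This is an illustration under stated assumptions rather than the repository's method: the `unify` function and the `LAST_SEEN` and `ACTIVE` fields are hypothetical, records are assumed to be unique by `EIN`, and an organization is assumed to be active only if it appears in the most recent release.

```python
def unify(releases):
    """Merge releases into one record per EIN.

    `releases` maps a sortable release label (e.g. '2024') to a list of
    row dicts that each contain an 'EIN' key.
    """
    latest = max(releases)
    unified = {}
    # Process releases oldest first so that newer rows overwrite older ones.
    for label in sorted(releases):
        for row in releases[label]:
            record = dict(row)
            record["LAST_SEEN"] = label
            unified[row["EIN"]] = record
    # Assumption: "active" means present in the most recent release.
    for record in unified.values():
        record["ACTIVE"] = record["LAST_SEEN"] == latest
    return unified
```

Keying the result by `EIN` keeps the merge idempotent: rerunning it after adding a new release only updates the affected records and their flags.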