This project develops tools to semi-automate a specific style of philanthropic evaluation. The current system has two distinct components for processing different data:
- IRS Form 990s, which must be filed by 501(c)(3) organizations
- Department of Labor Scholarship Directory
The Form 990 code uses Excel VBA to score IRS Form 990s submitted by 501(c)(3) organizations. The scoring highlights entities that have endowments, award scholarships, and emphasize science education or research. This framework is at an early stage of development, and future versions are expected to be implemented in Python with relational database support and a web-based interface.
The Scholarship Directory information is scraped from the Labor Department website using Python and associated tools.
To get the system running:
- Download Code.xlsm and place it in your working directory (e.g., x/).
- Download the .txt files and place them in the same directory as the Excel file:
  - nodenames.txt
  - stopwords.txt
  - punctuation.txt
  - rule.txt
- Download IRS Form 990 XML files from IRS Form 990 Series Downloads.
- Unzip those forms into x/testforms/, and create subdirectories:
  - x/testforms/990 for standard 990 files
  - x/testforms/errant for nonstandard or filtered-out files
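The directory layout above can also be created programmatically. The following is a minimal Python sketch, not part of the project's VBA code; the function name `make_form_dirs` is hypothetical, and only the folder names `testforms`, `990`, and `errant` come from the steps above.

```python
from pathlib import Path


def make_form_dirs(root):
    """Create <root>/testforms/990 and <root>/testforms/errant.

    Returns the two directory paths. Safe to call repeatedly.
    """
    base = Path(root) / "testforms"
    created = []
    for name in ("990", "errant"):
        target = base / name
        # parents=True also creates <root>/testforms if it is missing
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `make_form_dirs("x")` would set up the folders under the example working directory `x/`.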
- Microsoft Excel with VBA enabled (or compatible environment)
- Basic familiarity with running macros and editing file paths
- The system is modular and can be customized to suit different evaluation criteria.
- You may edit any of the supporting .txt files to change what data is parsed or scored.
Each line of nodenames.txt defines:
- Data type (String, Date, Integer, AbsInt)
- Field length
- XML path to the node

Example:
Date;10;Return/ReturnHeader/TaxPeriodBeginDt
Integer;4;Return/ReturnHeader/TaxYr
AbsInt;15;Return/ReturnData/IRS990/CYInvestmentIncomeAmt
String;600;Return/ReturnData/IRS990/ActivityOrMissionDesc
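To make the line format concrete, here is a rough Python sketch of how such a line could be parsed and applied to a Form 990 XML file. It is an illustration only and not the project's VBA parser. The names `NodeSpec`, `parse_node_spec`, `find_by_path`, and `extract` are hypothetical, and two behaviors are assumptions: `AbsInt` is taken to mean the absolute value of an integer, and the field length is treated as a maximum character count for text values only.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class NodeSpec(NamedTuple):
    dtype: str   # String, Date, Integer, or AbsInt
    length: int  # field length from the second column
    path: str    # slash-separated XML path, starting at the root element


def parse_node_spec(line: str) -> NodeSpec:
    """Split one 'Type;Length;Path' line into its three parts."""
    dtype, length, path = line.strip().split(";", 2)
    return NodeSpec(dtype, int(length), path)


def local_name(tag: str) -> str:
    """Drop a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_by_path(root: ET.Element, path: str) -> Optional[str]:
    """Walk the path by local element names, ignoring XML namespaces."""
    parts = path.split("/")
    if local_name(root.tag) != parts[0]:
        return None
    node = root
    for part in parts[1:]:
        node = next((c for c in node if local_name(c.tag) == part), None)
        if node is None:
            return None
    return node.text


def extract(root: ET.Element, spec: NodeSpec):
    """Return the typed value for one spec, or None if the node is absent."""
    raw = find_by_path(root, spec.path)
    if raw is None:
        return None
    raw = raw.strip()
    if spec.dtype == "Integer":
        return int(raw)
    if spec.dtype == "AbsInt":
        return abs(int(raw))  # assumption: AbsInt means absolute value
    return raw[: spec.length]  # assumption: length caps String and Date text
```

Matching on local names avoids hard-coding the XML namespace, which keeps the sketch independent of the schema version of the downloaded files.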
stopwords.txt and punctuation.txt are used to clean and tokenize text fields; feel free to modify them.
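As an illustration of this kind of cleaning, the Python sketch below strips punctuation and drops stopwords from a text field. It is not the project's VBA routine. The helper names are hypothetical, and it assumes each supporting file lists one entry per line, which the source does not confirm.

```python
def load_lines(path):
    """Read a file into a set of non-empty, stripped lines.

    Assumption: one stopword or punctuation mark per line.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def clean_tokens(text, stopwords, punctuation):
    """Replace punctuation with spaces, lowercase, and drop stopwords."""
    for mark in punctuation:
        text = text.replace(mark, " ")
    stop = {word.lower() for word in stopwords}
    return [token for token in text.lower().split() if token not in stop]
```

For example, `clean_tokens(desc, load_lines("stopwords.txt"), load_lines("punctuation.txt"))` would return the remaining lowercase tokens of a mission description.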
rule.txt defines the scoring logic for each rule. Users can modify or add rules.
Parsed990Data contains extracted data:
- Headers: nodenames
- Rows: form unique IDs and their values

Scored990Data contains rule evaluations:
- Headers: rule names
- Rows: binary scores (1 or 0)
There are four rule types. Each uses a semicolon-delimited format:
Substring;RuleName;Nodename;Present;token1,token2,...
- Checks if tokens are present (or absent) in the specified text node.
Trend;RuleName;Nodename1,Nodename2,...
- Compares values across nodes for an upward/downward trend.
Percentile;RuleName;Nodename;Cutoff
- Scores 1 if a value is above the given percentile cutoff.
Eval;RuleName;Nodename;NumOrTxt;Expression
- Evaluates logical expressions involving the node's value.
Eval;Age;IRS990_FormationYr;Num;Year(Now()) - IRS990_FormationYr > 15
Substring;Web;IRS990_WebsiteAddressTxt;T;academy,edu
Percentile;EndYrBal;CYEndwmtFundGrp_EndYearBalanceAmt;0.50
Trend;YrNet;IRS990_NetAssetsOrFundBalancesBOYAmt,IRS990_NetAssetsOrFundBalancesEOYAmt
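The sketch below shows one way the Substring, Percentile, and Trend rule lines could be interpreted in Python, using a list of dictionaries keyed by nodename as a stand-in for Parsed990Data. It is not the project's VBA scoring code, and several details are assumptions: a Present flag starting with `T` means any listed token must appear, the percentile threshold uses a nearest-rank cutoff, and a trend scores 1 only when values strictly increase across the listed nodes. Eval rules are left out because their expressions use VBA syntax.

```python
import math


def score_substring(rows, node, present, tokens):
    """1 if (any token found in the text) matches the Present flag."""
    want = present.upper().startswith("T")  # assumption: T means "present"
    scores = []
    for row in rows:
        text = str(row.get(node) or "").lower()
        found = any(token.lower() in text for token in tokens)
        scores.append(int(found == want))
    return scores


def score_percentile(rows, node, cutoff):
    """1 if the value is above the nearest-rank percentile threshold."""
    values = sorted(row[node] for row in rows if row.get(node) is not None)
    if not values:
        return [0] * len(rows)
    index = max(0, math.ceil(cutoff * len(values)) - 1)
    threshold = values[index]
    return [int(row.get(node) is not None and row[node] > threshold)
            for row in rows]


def score_trend(rows, nodes):
    """1 if values strictly increase across the listed nodes (assumption)."""
    scores = []
    for row in rows:
        vals = [row.get(name) for name in nodes]
        rising = None not in vals and all(a < b for a, b in zip(vals, vals[1:]))
        scores.append(int(rising))
    return scores


def score_rule(line, rows):
    """Parse one semicolon-delimited rule line and return (name, scores)."""
    kind, name, rest = line.strip().split(";", 2)
    args = rest.split(";")
    if kind == "Substring":
        node, present, tokens = args
        return name, score_substring(rows, node, present, tokens.split(","))
    if kind == "Percentile":
        node, cutoff = args
        return name, score_percentile(rows, node, float(cutoff))
    if kind == "Trend":
        (nodes,) = args
        return name, score_trend(rows, nodes.split(","))
    raise ValueError(f"unsupported rule type in this sketch: {kind}")
```

Each call returns the rule name and one binary score per row, which mirrors the Scored990Data layout of rule names as headers and 1 or 0 as values.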
- Move Files: In VBA module move990, run Move990Files. Moves Form 990 files to /990/, skips Form 990EZ and others.
- Parse XMLs: In module Parse, run ParseXML990Files. Extracts nodename data into Parsed990Data.
- Clean Text: In module Strip, run Master. Cleans descriptions and web addresses; populates DescFiltered.
- Score Data: In module Score, run Score. Evaluates rules and outputs to Scored990Data.
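For readers who want to see the idea behind the first step outside Excel, here is a hedged Python sketch of sorting downloaded XML files by form type. It is not the VBA macro. It assumes the form type is recorded in a `ReturnTypeCd` element in the return header, and it moves every non-990 or unreadable file into `errant/`, which may differ from how the macro treats skipped files. The function names are hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def return_type(xml_path):
    """Return the text of the first ReturnTypeCd element, or None."""
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError:
        return None  # not well-formed XML
    for elem in root.iter():
        # compare local names so the XML namespace does not matter
        if elem.tag.rsplit("}", 1)[-1] == "ReturnTypeCd":
            return (elem.text or "").strip()
    return None


def sort_forms(testforms):
    """Move *.xml files in testforms into 990/ or errant/; return counts."""
    base = Path(testforms)
    moved = {"990": 0, "errant": 0}
    for xml_path in sorted(base.glob("*.xml")):
        bucket = "990" if return_type(xml_path) == "990" else "errant"
        dest = base / bucket
        dest.mkdir(exist_ok=True)
        shutil.move(str(xml_path), str(dest / xml_path.name))
        moved[bucket] += 1
    return moved
```

Running `sort_forms("x/testforms")` would leave only standard Form 990 files in `x/testforms/990`.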
Activate ChromeDriver.exe
If you don't have these Python packages installed, run:
- pip install requests
- pip install beautifulsoup4
- pip install selenium
- pip install pandas
Download scraper.py from the GitHub repository and place it in your working directory.
The output is written to a CSV file with 8 columns and one row per scholarship. The 8 columns are labeled:
- ID
- Award Name
- Organization
- Purpose
- Level of Study
- Award Type
- Award Amount
- Deadline
In the output uploaded to GitHub, the CSV file (called scholarships.csv) has 10,000 rows, which was the entirety of what was available from the Labor Department's CareerOneStop website on July 30, 2025.
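A quick way to sanity-check the scraper output is to load it and confirm the header. The Python sketch below uses only the standard library; it assumes the file has a header row with exactly the eight labels listed above, in that order, and the function name `read_scholarships` is hypothetical.

```python
import csv

EXPECTED_COLUMNS = [
    "ID", "Award Name", "Organization", "Purpose",
    "Level of Study", "Award Type", "Award Amount", "Deadline",
]


def read_scholarships(path):
    """Load the CSV as a list of dicts after checking its header row."""
    # utf-8-sig reads plain UTF-8 and also tolerates a leading BOM
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        return list(reader)
```

For instance, `len(read_scholarships("scholarships.csv"))` gives the number of scholarship rows in the file.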
- Roy Rada: Project lead, architecture, Excel VBA development, Python coding, testing, refining parameters, maintaining
- Microsoft Copilot: Collaborative assistance in coding and system design
This project is licensed under the GNU General Public License v3.0.