System Design

Architecture

The system is a sequential pipeline orchestrated by main.py. Each stage is a separate script that main.py launches through subprocess, so every large data-loading task runs in its own process and its memory is released when that process exits.
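As a rough illustration of this orchestration pattern, the sketch below runs each stage in its own interpreter process and stops at the first failure. The stage order, the run_pipeline function, and its runner parameter are assumptions made for the example; only the script names come from this document, and the real main.py may differ.

```python
import subprocess
import sys

# Assumed stage order; only the script names themselves appear in this document.
STAGES = ["bmf_loader.py", "irs_teos_ingest_fast.py", "scoring_engine.py"]


def run_pipeline(stages=STAGES, runner=subprocess.run):
    """Run each stage in a fresh Python process, one after another.

    `runner` is a hypothetical injection point so the loop can be exercised
    without launching real scripts; by default it is subprocess.run.
    """
    completed = []
    for script in stages:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later stages never run against a half-built database.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling run_pipeline() with no arguments would launch the three scripts in order. Because each child process exits before the next one starts, whatever it loaded into memory is returned to the operating system between stages.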

Data Schema

Data is centralized in db/philanthropy.db.

BMF Table: Stores core identity data. While subsection and classification are tracked for historical context, filtering is driven by ntee_cd logic in bmf_filter.py.
Form990gem Table: Stores extracted XML data, using the EIN as a primary key to join with BMF records.
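The relationship between the two tables can be sketched with an in-memory SQLite database. The lowercase table names, every column other than ein, subsection, classification, and ntee_cd, and the sample row are all assumptions made for illustration; the actual schema in db/philanthropy.db may differ.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for db/philanthropy.db (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE bmf (
            ein TEXT PRIMARY KEY,      -- TEXT keeps leading zeros intact
            name TEXT,                 -- assumed column
            subsection TEXT,
            classification TEXT,
            ntee_cd TEXT
        );
        CREATE TABLE form990gem (
            ein TEXT PRIMARY KEY,      -- same key, used for the join
            return_type_cd TEXT,       -- assumed column
            website_address TEXT       -- assumed column
        );
        """
    )
    # Made-up sample data, not a real organization.
    conn.execute(
        "INSERT INTO bmf VALUES ('123456789', 'Example Org', '03', '1000', 'B20')"
    )
    conn.execute(
        "INSERT INTO form990gem VALUES ('123456789', '990', 'https://example.org')"
    )
    return conn


def joined_rows(conn):
    """Join identity data with extracted filing data on EIN."""
    return conn.execute(
        """
        SELECT b.ein, b.name, b.ntee_cd, f.return_type_cd
        FROM bmf AS b
        JOIN form990gem AS f ON f.ein = b.ein
        ORDER BY b.ein
        """
    ).fetchall()
```

Storing the EIN as text in both tables is a design choice of this sketch: it avoids losing leading zeros and keeps the join a plain equality comparison.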

Scoring Logic

The system uses a "Top-Half" scoring model. For every metric (e.g., Asset Growth), the median is calculated within each category, and organizations at or above their category's median receive +1 point.
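A minimal sketch of this scoring rule follows. The record layout (dicts with ein, category, and metric keys), the function name top_half_scores, and the choice to skip missing metric values are assumptions for the example rather than the behavior of scoring_engine.py.

```python
from collections import defaultdict
from statistics import median


def top_half_scores(orgs, metrics):
    """Award +1 per metric to organizations at or above their category median.

    `orgs` is a list of dicts with 'ein', 'category', and one key per metric.
    Missing (None) values earn no point and are left out of the median;
    that handling is an assumption of this sketch.
    """
    by_category = defaultdict(list)
    for org in orgs:
        by_category[org["category"]].append(org)

    scores = {org["ein"]: 0 for org in orgs}
    for members in by_category.values():
        for metric in metrics:
            values = [o[metric] for o in members if o.get(metric) is not None]
            if not values:
                continue  # nothing to compare against in this category
            cutoff = median(values)
            for o in members:
                value = o.get(metric)
                if value is not None and value >= cutoff:
                    scores[o["ein"]] += 1
    return scores
```

Because ties at the median count as "at or above", more than half of a category can earn the point for a given metric when several organizations share the median value.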

Key Scripts

bmf_loader.py: Sets up the initial database and performs broad 501(c)(3) filtering.
irs_teos_ingest_fast.py: The high-performance parser that handles XML extraction and schema mapping.
scoring_engine.py: Performs the final ranking and generates CSV outputs in the output/ folder.

Refinement Rules

Entities are rejected (Category = 'Reject') if:

The ReturnTypeCd is anything other than 990 (for example, 990-PF and 990-EZ filings are excluded).
The WebsiteAddress is missing, or is still malformed after regex normalization.
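The two rejection rules above can be sketched as follows. The regular expression, the normalize_website and is_rejected helpers, and the exact string compared against ReturnTypeCd are hypothetical; the document does not give the real pattern used for normalization.

```python
import re

# Hypothetical pattern: optional scheme, optional "www.", a dotted host name,
# then an optional path. The pipeline's real regex is not shown in this document.
_SITE_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?:/.*)?$"
)


def normalize_website(raw):
    """Return a lowercase host name, or None if the value is missing or malformed."""
    if raw is None:
        return None
    match = _SITE_RE.match(raw.strip().lower())
    return match.group(1) if match else None


def is_rejected(return_type_cd, website_address):
    """Apply both rules: wrong return type, or no usable website address."""
    if return_type_cd != "990":
        return True
    return normalize_website(website_address) is None
```

Under these assumptions, a placeholder such as "N/A" fails normalization and is rejected in the same way as an empty field, while "HTTP://WWW.Example.org/about" normalizes to "example.org" and passes.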
