The system follows a Sequential Pipeline Architecture managed by main.py, which uses subprocess to run each script as a separate process. This keeps memory isolated between large data-loading tasks.
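A minimal sketch of how such an orchestrator might look. The stage order, the `PIPELINE` list, and the `build_command` and `run_pipeline` helpers are assumptions for illustration, not the actual contents of main.py; only the script names come from this document.

```python
import subprocess
import sys

# Assumed stage order; the real main.py may run other scripts or a different order.
PIPELINE = ["bmf_loader.py", "irs_teos_ingest_fast.py", "scoring_engine.py"]


def build_command(script):
    """Build the argv for one stage, reusing the current Python interpreter."""
    return [sys.executable, script]


def run_pipeline(scripts, runner=subprocess.run):
    """Run each script in its own process, one after another.

    Each stage's memory is returned to the OS when its process exits,
    so the next stage starts from a clean slate. check=True makes
    subprocess.run raise CalledProcessError on a non-zero exit code,
    which stops the pipeline at the first failing stage.
    """
    for script in scripts:
        runner(build_command(script), check=True)
    return len(scripts)
```

Calling `run_pipeline(PIPELINE)` would run the three stages in order. The `runner` parameter is a hypothetical hook that lets the loop be exercised without launching real processes.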
Data is centralized in db/philanthropy.db.
The subsection and classification fields are tracked for historical context; filtering is driven by the ntee_cd logic in bmf_filter.py.
The system uses a "Top-Half" scoring model. For every metric (e.g., Asset Growth), the median is calculated within each category. Organizations at or above their category's median receive +1 point.
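The "Top-Half" rule can be sketched as follows for a single category. This is an illustration under assumptions, not the code in scoring_engine.py: the function names and the dictionary layout are hypothetical, and missing values are not handled.

```python
from statistics import median


def top_half_points(values_by_org):
    """Return {org: 1 or 0}; 1 when the org's value is at or above the median.

    values_by_org maps an organization id to one metric value and must
    contain only organizations from the same category.
    """
    cutoff = median(values_by_org.values())
    return {org: int(value >= cutoff) for org, value in values_by_org.items()}


def score_category(metrics):
    """Sum the per-metric points for every organization in one category.

    metrics maps a metric name (e.g. "asset_growth") to a
    {org: value} dictionary as accepted by top_half_points.
    """
    totals = {}
    for values in metrics.values():
        for org, point in top_half_points(values).items():
            totals[org] = totals.get(org, 0) + point
    return totals
```

With an even number of organizations, `statistics.median` averages the two middle values, so exactly the upper half scores unless values tie at the cutoff.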
bmf_loader.py: Sets up the initial database and performs broad 501(c)(3) filtering.
irs_teos_ingest_fast.py: The high-performance parser that handles XML extraction and schema mapping.
scoring_engine.py: Performs the final ranking and generates CSV outputs in the output/ folder.
Entities are rejected (Category = 'Reject') if:
ReturnTypeCd is not 990 (e.g., 990-PF or 990-EZ are excluded).
WebsiteAddress is missing or malformed (normalized via regex).
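The two rejection rules above could be expressed roughly like this. The regular expression, the function names, and the exact string compared against ReturnTypeCd are assumptions; the project's real pattern and code values are not shown in this document.

```python
import re

# Assumed pattern: optional scheme, optional "www.", a dotted host name,
# and an optional path. The project's actual regex may differ.
_URL_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?:/.*)?$"
)


def normalize_website(raw):
    """Return a lowercase host name, or None if the value is missing or malformed."""
    if not raw:
        return None
    match = _URL_RE.match(raw.strip().lower())
    return match.group(1) if match else None


def is_rejected(return_type_cd, website_address):
    """Apply both rules: the return type must be 990 and the website must be usable."""
    if return_type_cd != "990":  # assumed literal; excludes 990-PF, 990-EZ, etc.
        return True
    return normalize_website(website_address) is None
```

For example, `is_rejected("990", "HTTPS://WWW.Example.org/about")` is false because the address normalizes to `example.org`, while a placeholder such as `"N/A"` fails the pattern and the entity is rejected.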