The objective of the System is to determine, for each EntityURL provided, the minimum dollar amount required to endow a named, targeted fund. The system has been tested on URLs for which the correct webpage and quote were manually identified in advance; you can also feed the System hundreds of URLs of other 501(c)(3) organizations whose minimum endowment amount you do not know. The system is part of an overall project which includes programs that:
- Use a single, modern, hardcoded user-agent for all requests: session.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- Introduce a delay between requests to mimic human browsing.
All string matching is case-insensitive. This includes URLs, quotes, and any comparison of an element from a set of strings in parameters.txt against a URL or quote.
CandidateURLs: list of dicts
{ "url": str, "source": "sitemap" | "homepage" | "crawl" }
Duplicate prevention is handled instead by a normalized CandidateURLs tracking set.
RedFlagURLs: list of dicts
{ "url": str, "criterion": str }
VisitedURLs: list of dicts
{ "url": str, "score": int, "http_status": int }
OpenURLs: priority queue of tuples
( -score, len(url), url )
CandidateQuotes: list of dicts
OpenQuotes: list of dicts
{ "quote": str, "url": str, "creationScore": int, "fundScore": int, "minScore": int, "sumScore": int }
RejectedQuotes: list of dicts
{ "quote": str, "url": str, "reason": str }
BestQuotes: list of dicts
same structure as OpenQuotes
PathSentence: string
For RedFlag tests and URL scoring, the URL is processed as follows:
1. Remove the protocol (e.g., "https://") and the "://".
2. Remove the "y" part of the domain (the entity-identifying part, which is the second label in a standard x.y.z domain). For example, for "giving.wayne.edu", remove "wayne" → "giving.edu".
3. The remaining string, including subdomains and path, is processed so that each non-alphabetic character (such as slash, period, or dash) is replaced with a blank space.
4. Collapse multiple spaces into a single space and trim leading/trailing spaces.
5. The result is called pathSentence.
Example:
https://giving.wayne.edu/endowments/minimum
→ giving.wayne.edu/endowments/minimum (protocol removed)
→ giving.edu/endowments/minimum (entity label "wayne" removed)
→ giving edu endowments minimum (non-alphabetic characters replaced with spaces, spaces collapsed)

In a set of strings, any two elements of the set are delimited by a comma (,). Thus { events, financial aid, history } has 3 elements. In subsequent instructions of the form "PathSentence contains any URL_Bad string → RedFlag", the meaning is that an element of URL_Bad is a substring of PathSentence. Thus, for the element "financial aid" to cause a URL to be moved to RedFlag, PathSentence must contain "financial aid" and not just "financial".
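The normalization steps above can be sketched as a small helper. The function name is illustrative, and treating the entity label as the second label from the right is an assumption about how the rule generalizes beyond three-label domains:

```python
import re

def make_path_sentence(url: str) -> str:
    """Normalize a URL into a pathSentence (sketch of the spec's steps)."""
    # 1. Strip the protocol and "://".
    s = re.sub(r'^[a-z]+://', '', url.lower())
    # 2. Remove the entity-identifying "y" label of the domain.
    host, sep, path = s.partition('/')
    labels = host.split('.')
    if len(labels) >= 3:
        del labels[-2]          # e.g. giving.wayne.edu -> giving.edu
    s = '.'.join(labels) + (sep + path if sep else '')
    # 3. Replace each non-alphabetic character with a space.
    s = re.sub(r'[^a-z]', ' ', s)
    # 4. Collapse runs of spaces and trim.
    return re.sub(r' +', ' ', s).strip()
```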
The following constants are loaded from inputs/parameters.txt
For each EntityURL:
Set these data structures to empty:
URLs are first collected from
Only URLs whose normalized form is not already present in the internal CandidateURLs tracking set are placed on CandidateURLs.
In the current code, CandidateURLs is not a transient queue. It is a list of all URLs that have been discovered as candidates during the run. URLs copied from CandidateURLs to RedFlagURLs or OpenURLs remain on CandidateURLs as historical records.
Next, CandidateURLs are processed against the RedFlag criteria; if a URL meets any RedFlag criterion, then a copy of it, together with the criterion met, is placed on RedFlagURLs.
The original entry remains on CandidateURLs.
If a URL on CandidateURLs does not meet any RedFlag conditions, then it is copied to OpenURLs.
For URLs (meeting certain conditions) found when scraping a visited URL: their normalized forms are checked against the CandidateURLs tracking set. If not already present there, they are copied to CandidateURLs and then immediately filtered for placement on either RedFlagURLs or OpenURLs.
Sitemap Index Discovery: If the EntityURL provides a Sitemap Index (a file containing links to other .xml sitemaps), the crawler must visit each child sitemap.
Recursion Limit: The crawler shall follow sitemap links one level deep only. If a child sitemap is discovered, the crawler should extract the final page URLs from it but must not follow any further .xml links found within that child (no second-level indices).
Filtering Exceptions: Rules that filter URLs based on patterns (e.g., "no digits in path") must be ignored for .xml files found within a sitemap index to ensure pagination and section-based sitemaps are not blocked.
Capacity: Stop collecting after reaching 3,000 unique page URLs.
Priority: All URLs discovered via sitemaps should be added to the crawl queue with a "sitemap" source tag for prioritization.
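The sitemap rules above (one-level recursion, the 3,000-URL cap) might be sketched as follows. The function name and the injected `fetch` callable are illustrative, not the system's actual code; the real crawler would fetch with the shared requests session and tag each URL with source "sitemap":

```python
import re

def collect_sitemap_urls(root_sitemap_url, fetch, cap=3000):
    """Collect page URLs from a sitemap, following child .xml sitemaps
    one level deep only. `fetch` is any callable returning the raw XML
    text for a URL (injected so the sketch stays self-contained)."""
    pages, seen = [], set()
    for loc in re.findall(r'<loc>\s*(.*?)\s*</loc>', fetch(root_sitemap_url)):
        if len(pages) >= cap:
            break
        if loc.lower().endswith('.xml'):
            # Child sitemap: extract its page URLs, but do not follow
            # any further .xml links inside it (no second-level indices).
            for child in re.findall(r'<loc>\s*(.*?)\s*</loc>', fetch(loc)):
                if len(pages) >= cap:
                    break
                if not child.lower().endswith('.xml') and child not in seen:
                    seen.add(child)
                    pages.append(child)
        elif loc not in seen:
            seen.add(loc)
            pages.append(loc)
    return pages
```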
If no sitemap exists or no URLs were moved from the sitemap CandidateURLs to OpenURLs, then the homepage is visited and its URLs scraped and put on CandidateURLs.
<a href="..."> links. The general model is that URLs from the 'Collect First Set of URLs' stage:
For each CandidateURL (call it X), apply criteria in order:
Further RedFlag tests operate on the normalized URL: normalize the URL and call the result PathSentence.
PathSentence contains any URL_Bad string → RedFlag
PathSentence contains a personal name of more than 3 characters in length → RedFlag
Names are matched as whole words, using the pattern (^|[^a-z])NAME([^a-z]|$).
For each URL remaining on CandidateURLs after RedFlag screening, move those URLs along with their PathSentence to OpenURLs.
For each (URL, PathSentence), upon its addition to OpenURLs, it is assigned a Score:
- Score +=1 for each string in URL_Good found in PathSentence.
- Store score with URL.
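The scoring and the OpenURLs tuple layout could look like the sketch below. The helper names and the example URL_Good set are assumptions; the ( -score, len(url), url ) tuple makes a min-heap pop the highest-scoring URL first, with shorter URLs winning ties:

```python
import heapq

def score_path_sentence(path_sentence, url_good):
    """+1 for each URL_Good string appearing in the pathSentence
    (all comparisons lower-cased, per the case-insensitivity rule)."""
    ps = path_sentence.lower()
    return sum(1 for good in url_good if good.lower() in ps)

# OpenURLs as a min-heap of (-score, len(url), url).
open_urls = []

def push_open_url(url, score):
    heapq.heappush(open_urls, (-score, len(url), url))
```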
While OpenURLs is not empty and len(VisitedURLs) < ThrottleMax:
Take the highest-scoring URL from OpenURLs and move it to VisitedURLs. If more than one URL has the highest score, apply tie-breakers:
Call this URL to now be processed the StudyURL
Extract URLs from StudyURL (using the same procedure as for extracting URLs from a home page) and add these newly found URLs to CandidateURLs, where they will be processed with RedFlag and other routines, as were the original URLs from sitemaps or the home page.
Pass StudyURL to QuoteExtract.
Extract visible text (strip HTML/JS) from StudyURL to create VisibleTextPage.
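Visible-text extraction can be sketched with the standard library's html.parser; the real scraper may use a different HTML library, so treat this as an illustration of the idea rather than the system's implementation:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text that appears outside <script>/<style>/<noscript>."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside skipped tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html_doc: str) -> str:
    """Build a VisibleTextPage-style string from raw HTML."""
    parser = VisibleTextExtractor()
    parser.feed(html_doc)
    return " ".join(parser.parts)
```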
The program parses VisibleTextPage to find each occurrence (call it MinInstance) of a string from the set AllowedMins
For each MinInstance, extract the 120 characters (including spaces) to the left of MinInstance and the 120 characters to the right of MinInstance (including spaces). If the left-most (or right-most) cutoff splits a word (with word defined as alphabetic characters bounded by either spaces or punctuation), then delete that word fragment. Store the resultant string with MinInstance inside it, as QuoteString in CandidateQuotes.
Example: If "$10,000" and "$20,000" are found within 75 characters, extract a single quote window covering both, not two overlapping quotes.
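The window extraction around a single MinInstance might look like the sketch below (the merging of nearby MinInstances shown in the example above is omitted). The function name and the `radius` parameter are illustrative:

```python
import re

def quote_window(text, start, end, radius=120):
    """Extract the quote window around a MinInstance spanning
    text[start:end]: `radius` characters on each side, dropping any
    word fragment produced where the cutoff splits a word."""
    left = text[max(0, start - radius):start]
    right = text[end:end + radius]
    # If the character just before the left cutoff is alphabetic,
    # the cutoff split a word: drop the leading fragment.
    if start - radius > 0 and text[start - radius - 1].isalpha():
        left = re.sub(r'^[A-Za-z]+', '', left)
    # Same check on the right edge.
    if end + radius < len(text) and text[end + radius].isalpha():
        right = re.sub(r'[A-Za-z]+$', '', right)
    return (left + text[start:end] + right).strip()
```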
identify in QuoteString each string of alphabetic characters of length greater than 3 bounded by spaces, and put each such string in a set called possibleNames
if any element in possibleNames is in names.txt, then move QuoteString from CandidateQuotes to RejectedQuotes
if QuoteString contains an element from badSet, then move QuoteString from CandidateQuotes to RejectedQuotes
Define aNum as a string of characters, bounded by spaces, which has:
4 or more digits
at least one comma but not more than two which demarcate thousands, as in 100,000
optionally begins with $
If QuoteString contains three consecutive aNum without any intervening non-empty words, then move QuoteString from CandidateQuotes to RejectedQuotes.
Example:
A quote like "...with a $2,000 gift ($5,000 for scholarship) with up
to five years to build to a fund minimum of $10,000 ($25,000 for
scholarship funds)..." should NOT be rejected for table_aNum, since
there are only two consecutive aNum at a time.
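A hedged sketch of the aNum test and the three-in-a-row rejection check; the regex is one plausible reading of the rule above, not the system's exact definition:

```python
import re

# A token counts as aNum if it optionally begins with $, has 1-2 commas
# demarcating thousands (as in 100,000), and contains 4 or more digits.
ANUM = re.compile(r'^\$?\d{1,3}(,\d{3}){1,2}$')

def is_anum(token: str) -> bool:
    return bool(ANUM.match(token)) and sum(c.isdigit() for c in token) >= 4

def has_three_consecutive_anums(quote: str) -> bool:
    """Reject-check: three aNum tokens in a row with no intervening
    non-empty word (a sign the text is a fee table, not prose)."""
    run = 0
    for token in quote.split():
        if is_anum(token):
            run += 1
            if run == 3:
                return True
        else:
            run = 0
    return False
```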
After a QuoteString has passed all RedFlag criteria and not been moved to RejectedQuotes, Score it as follows
For each remaining QuoteString on CandidateQuotes
Set creationScore = 0, fundScore = 0, minScore = 0
(In the following, the meaning of the expression "for each string in CreationSet, if in QuoteString" means that if an element in CreationSet is a substring of any string in QuoteString, then the condition is satisfied. By way of illustration, the string 'charit' in creationSet satisfies the string 'charitable' in QuoteString)
For each string in creationSet, if in QuoteString, then creationScore = creationScore +1
For each string in fundSet, if in QuoteString, then fundScore = fundScore +1
For each string in minSet, if in QuoteString, then minScore = minScore +1
Create sumScore = creationScore + fundScore + minScore
Move this QuoteString from CandidateQuotes (along with its scores, element of AllowedMins, and URL) to OpenQuotes.
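The substring-based scoring above could be implemented as in this sketch; the set contents shown in the test are hypothetical, not values from parameters.txt:

```python
def score_quote(quote, creation_set, fund_set, min_set):
    """Each set element that appears anywhere in the quote
    (case-insensitively) adds 1 to its category score; e.g. 'charit'
    in creationSet matches 'charitable' in the quote."""
    q = quote.lower()
    creation_score = sum(1 for s in creation_set if s.lower() in q)
    fund_score = sum(1 for s in fund_set if s.lower() in q)
    min_score = sum(1 for s in min_set if s.lower() in q)
    return {
        "creationScore": creation_score,
        "fundScore": fund_score,
        "minScore": min_score,
        "sumScore": creation_score + fund_score + min_score,
    }
```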
Stop when:
len(VisitedURLs) >= ThrottleMax, or
OpenURLs is empty.
Review all entries in OpenQuotes:
If ≤5 entries: keep all.
If >5: keep the 5 entries with the highest sumScore.
BestQuotes.md should be formatted with hierarchy and bullets as follows:
For each of the 5 best quotes:
- header level 2: Best Quote #k (where k goes from 1 to 5)
- header level 3: URL on which quote was found
- header level 4: minimal dollar amount
- bullet item: Quote verbatim
- bullet item: creationScore
- bullet item: fundScore
- bullet item: minScore
- bullet item: sumScore
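A sketch of a writer producing that hierarchy. Field names follow the OpenQuotes schema; the `minAmount` key (holding the AllowedMins element stored with the quote) is an assumed field name, not one specified above:

```python
def best_quotes_md(best_quotes):
    """Render BestQuotes.md: H2 per quote, H3 URL, H4 amount, bullets."""
    lines = []
    for k, q in enumerate(best_quotes[:5], start=1):
        lines.append(f"## Best Quote #{k}")
        lines.append(f"### {q['url']}")
        lines.append(f"#### {q['minAmount']}")
        lines.append(f"- {q['quote']}")
        lines.append(f"- creationScore: {q['creationScore']}")
        lines.append(f"- fundScore: {q['fundScore']}")
        lines.append(f"- minScore: {q['minScore']}")
        lines.append(f"- sumScore: {q['sumScore']}")
    return "\n".join(lines) + "\n"
```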
For each EntityURL, create a subdirectory named after the Entity and produce each output file:
Files:
Ordering to follow discovery order.
The speed of our EndowmentScraper system depends heavily on the number of URLs visited and on the wait time between a request from this system and the response from the target web server. The system exploits Python's multi-threading capability to help with this. The 4 parameters users can set for speed impact, each with a brief annotation, follow:
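One plausible shape for the threaded fetching with the per-request delay is sketched below. The function names are illustrative; `max_workers` and `delay_range` stand in for two of the user-tunable speed parameters, and `fetch_one` for the single-URL fetch routine:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_one, max_workers=4, delay_range=(1.0, 3.0)):
    """Fetch pages on a small thread pool, sleeping a random interval
    before each request to mimic human browsing."""
    def polite_fetch(url):
        time.sleep(random.uniform(*delay_range))
        return url, fetch_one(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(polite_fetch, urls))
```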
Sitemap overlap: compare CandidateURLs to testing/sitemapURLs/*.txt
Homepage URLs: compare homepage scraper output to YaleURLs.txt
Scoring of URLs: confirm for URL https://giving.wayne.edu/about/faq that the PathSentence score is 3.
Quote Extraction: compare extracted quotes to testing/QuoteExtraction/*.csv
Wayne Quote Extract Page test:
confirm that what the program produces as VisibleTextPage from https://giving.wayne.edu/about/faq is the same as the text at testing/QuoteExtraction/wayne.txt
confirm that the program's two extracted quotes and their scores for that VisibleTextPage are the same as wayneOpenQuotes.csv whose first row is labels.
For catlin, pacfwv, swe, and yale, get the precise target URL from testing/testURLs.txt. Then confirm that the text that scraper.py extracts matches the relevant text file in testing/QuoteExtraction. Then check that the quotes and minimum dollar amounts in the relevant CSV file match those produced by scraper.py. The columns in the catlin, pacfwv, swe, and yale CSV files have not been updated to correspond to the latest changes in requirements, so you should not expect an exact match between those four CSV files and the scoring of scraper.py.