min-endow-scraper: Design and Implementation Plan

Objective

The objective of the System is to determine, for each EntityURL provided, the minimum dollar amount required to endow a named, targeted fund. The System has been tested on URLs for which the correct webpage and quote were manually identified in advance. You can feed the System hundreds of URLs of other 501(c)(3) organizations whose minimum endowment amount you do not know. The System is part of an overall project which includes programs that

General Conditions

Markdown and Formatting

Modularity

Test Not Log

Mimic Human Behavior

Error Handling

Minimize Overhead

Naming Conventions

Case-Insensitive Pattern Matching

All string matching is to be case-insensitive. This includes URLs and quotes, and applies whenever any element of any set of strings from parameters.txt is compared to a URL or quote.

Data Structure Specifications

Duplicate prevention is handled instead by a normalized CandidateURLs tracking set.

URL Normalization Process

For RedFlag tests and URL scoring, the URL is processed as follows:

1. Remove the protocol and the "://" separator (e.g., "https://").

2. Remove the y part of the domain (the entity-identifying part, which is the second label in a standard x.y.z domain). For example, for "giving.wayne.edu", remove "wayne" → "giving.edu".

3. The remaining string, including subdomains and path, is processed so that each non-alphabetic character (such as slash, period, dash, etc.) is replaced with a blank space.

4. Collapse multiple spaces into a single space and trim leading/trailing spaces.

5. The result is called PathSentence.
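The five steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name `path_sentence` and the sample path are hypothetical, and the sketch assumes the y part is the second-to-last label of the host, which matches the "giving.wayne.edu" case but may not hold for every domain shape.

```python
import re

def path_sentence(url: str) -> str:
    """Turn a URL into a space-separated string of alphabetic words."""
    # Step 1: drop the protocol together with "://".
    rest = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url, flags=re.IGNORECASE)
    # Separate the host from the path so only host labels are touched.
    host, sep, path = rest.partition("/")
    labels = host.split(".")
    # Step 2 (assumption): the y part is the second-to-last host label.
    if len(labels) >= 2:
        del labels[-2]
    rest = ".".join(labels) + sep + path
    # Step 3: every non-alphabetic character becomes a blank space.
    rest = re.sub(r"[^A-Za-z]", " ", rest)
    # Step 4: collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", rest).strip()

print(path_sentence("https://giving.wayne.edu/ways-to-give/endowment"))
```

For the sample URL the sketch prints `giving edu ways to give endowment`.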

Example:

User-Defined Constants

Preamble

In a set of strings, elements are delimited by commas (,). Thus {events, financial aid, history} has 3 elements. In subsequent instructions of the form "PathSentence contains any URL_Bad string → RedFlag", the meaning is that at least one element of URL_Bad is a substring of PathSentence. Thus, for the element "financial aid" to move a URL to RedFlag, PathSentence must contain "financial aid" and not just "financial".
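A short sketch of these set semantics, combined with the case-insensitive rule stated earlier, might look like this. The helper names `parse_string_set` and `contains_any` are hypothetical, and the brace-wrapped input format is assumed from the example in the text rather than from parameters.txt itself.

```python
def parse_string_set(raw: str) -> list:
    """Split a comma-delimited set such as "{ events, financial aid }"."""
    inner = raw.strip().strip("{}")
    # Elements may contain inner spaces; only the commas delimit them.
    return [item.strip() for item in inner.split(",") if item.strip()]

def contains_any(sentence: str, strings) -> bool:
    """True if any element is a case-insensitive substring of sentence."""
    folded = sentence.casefold()
    return any(s.casefold() in folded for s in strings)

url_bad = parse_string_set("{ events, financial aid, history}")
print(url_bad)
print(contains_any("giving edu Financial Aid office", url_bad))
print(contains_any("giving edu financial reports", url_bad))
```

The second call matches because the whole element "financial aid" is present; the third does not, because "financial" alone is not an element of the set.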

Constants

The following constants are loaded from inputs/parameters.txt

Inputs

Process for URLs

Initialize

For each EntityURL:

Set these data structures to empty:

Populating CandidateURLs

URLs are first collected from

Only URLs whose normalized form is not already present in the internal CandidateURLs tracking set are placed on CandidateURLs.

In the current code, CandidateURLs is not a transient queue. It is a list of all URLs that have been discovered as candidates during the run. URLs copied from CandidateURLs to RedFlagURLs or OpenURLs remain on CandidateURLs as historical records.
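The relationship between the CandidateURLs list and its tracking set can be illustrated with the sketch below. The class name `CandidateTracker` is hypothetical, and the normalization used for tracking (case-folding and dropping a trailing slash) is an assumption; the source does not spell out that rule here.

```python
def normalize_for_tracking(url: str) -> str:
    # Assumption: tracking compares case-folded URLs without a trailing slash.
    return url.strip().casefold().rstrip("/")

class CandidateTracker:
    def __init__(self):
        self.candidate_urls = []   # historical record, in discovery order
        self._seen = set()         # normalized forms already recorded

    def add(self, url: str) -> bool:
        """Record url unless its normalized form was seen; report success."""
        key = normalize_for_tracking(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.candidate_urls.append(url)
        return True

tracker = CandidateTracker()
tracker.add("https://Example.org/Give/")
tracker.add("https://example.org/give")   # duplicate after normalization
print(tracker.candidate_urls)
```

Entries are never removed from `candidate_urls`, which mirrors the description of CandidateURLs as a historical list rather than a transient queue.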

RedFlag

Next CandidateURLs are processed for RedFlag criteria and if they meet RedFlag Criteria, then a

The original entry remains on CandidateURLs.

OpenURLs

If a URL on CandidateURLs does not meet any RedFlag conditions, then

VisitedURLs

URLs found while scraping a visited URL (and meeting certain conditions) have their normalized forms checked against the CandidateURLs tracking set. Those not already present are copied to CandidateURLs and then immediately filtered for placement on either RedFlagURLs or OpenURLs.

Collect First Set of URLs

Sitemap Parsing Rules

Homepage URL Extraction Rules

RedFlag Filtering

The general model is that URLs from the 'Collect First Set of URLs' stage:

First 4 Red Flags

For each CandidateURL (call it X), apply criteria in order:

  1. Protocol not http/https → RedFlag
  2. Path contains #, ?, or digit → RedFlag
  3. File extension not in AllowedURLfileExtension → RedFlag
  4. Domain mismatch (y part does not equal y part of EntityURL) → RedFlag
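The four ordered checks could be sketched as below. The function names, the returned labels, and the sample extension set are hypothetical. The sketch also assumes that a path with no file extension passes check 3 and that the y part is the second-to-last host label; neither assumption is confirmed by the text above.

```python
import posixpath
from urllib.parse import urlsplit

def y_part(host: str) -> str:
    # Assumption: the y part is the second-to-last label of the host.
    labels = host.casefold().split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]

def first_red_flag(url, entity_url, allowed_extensions):
    """Return the name of the first failed check, or None if all four pass."""
    parts = urlsplit(url)
    # 1. Protocol must be http or https.
    if parts.scheme.casefold() not in ("http", "https"):
        return "protocol"
    # 2. No "#", "?", or digit in the path.
    if "#" in url or "?" in url or any(ch.isdigit() for ch in parts.path):
        return "path"
    # 3. File extension, when present, must be in the allowed set.
    ext = posixpath.splitext(parts.path)[1].casefold()
    if ext and ext not in allowed_extensions:
        return "extension"
    # 4. The y part must equal the EntityURL's y part.
    entity_host = urlsplit(entity_url).hostname or ""
    if y_part(parts.hostname or "") != y_part(entity_host):
        return "domain"
    return None

allowed = {".html", ".htm"}   # hypothetical AllowedURLfileExtension value
print(first_red_flag("https://giving.wayne.edu/news/2023",
                     "https://www.wayne.edu", allowed))
```

Returning at the first failure keeps the criteria in the stated order, so a URL that fails several checks is reported under the earliest one.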

Normalize and Continue RedFlag

Further RedFlag tests operate on the normalized URL: normalize the URL and call the resulting string PathSentence.

URL_Bad

PathSentence contains any URL_Bad string → RedFlag

Personal Name

PathSentence contains a personal name of more than 3 characters in length → RedFlag

RedFlag Enforcement

Move to OpenURLs

For each URL remaining on CandidateURLs after RedFlag screening, copy the URL along with its PathSentence to OpenURLs (the original entry remains on CandidateURLs).

Heuristic Scoring

Each (URL, PathSentence) pair is assigned a Score when it is added to OpenURLs:

- Score +=1 for each string in URL_Good found in PathSentence.

- Store score with URL.
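As a rough illustration, the scoring rule amounts to counting how many URL_Good strings occur in PathSentence. The function name `url_score` and the sample URL_Good values are hypothetical; the real values come from parameters.txt.

```python
def url_score(path_sentence: str, url_good) -> int:
    """Add 1 for each URL_Good string found (case-insensitively)."""
    folded = path_sentence.casefold()
    return sum(1 for s in url_good if s.casefold() in folded)

url_good = ["endow", "giving", "scholarship"]   # hypothetical values
open_urls = {}
url = "https://giving.wayne.edu/ways-to-give/endowment"
open_urls[url] = url_score("giving edu ways to give endowment", url_good)
print(open_urls[url])
```

Here the score is 2, since "endow" and "giving" are found and "scholarship" is not. Each string counts at most once, however many times it appears.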

Visit URLs

While OpenURLs is not empty and len(VisitedURLs) < ThrottleMax:

Take the highest-scoring URL from OpenURLs and move it to VisitedURLs. If more than one URL has the highest score, apply these tie-breakers in order:

  1. Shorter URL length
  2. Lexicographic order

Call this URL, which is now to be processed, the StudyURL.
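The selection loop and its tie-breakers can be expressed with a single composite sort key, as in this sketch. It assumes OpenURLs is held as a dict mapping URL to score, which is an illustrative choice rather than the documented data structure, and the function names are hypothetical.

```python
def pop_best(open_urls: dict) -> str:
    """Remove and return the best URL: highest score, then shortest, then
    lexicographically first."""
    best = min(open_urls, key=lambda u: (-open_urls[u], len(u), u))
    del open_urls[best]
    return best

def visit_order(open_urls: dict, throttle_max: int) -> list:
    visited_urls = []
    while open_urls and len(visited_urls) < throttle_max:
        study_url = pop_best(open_urls)
        visited_urls.append(study_url)
        # Fetching and scraping study_url would happen here and may add
        # newly discovered URLs to open_urls before the next iteration.
    return visited_urls

scores = {
    "https://a.org/give": 2,
    "https://a.org/endowment": 2,
    "https://a.org/about": 1,
    "https://a.org/fund": 2,
}
print(visit_order(dict(scores), throttle_max=3))
```

With these sample scores the three score-2 URLs are visited first: the two 18-character URLs precede the longer one, and ".../fund" precedes ".../give" lexicographically.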

Quote Extraction

Extract the visible text (stripping HTML/JS) from StudyURL to create VisibleTextPage.

Find MinInstance and Create QuoteString

The program parses VisibleTextPage to find each occurrence (call it MinInstance) of a string from the set AllowedMins.

For each MinInstance, extract the 120 characters (including spaces) to the left of MinInstance and the 120 characters to the right of MinInstance (including spaces). If the left-most (or right-most) cutoff splits a word (with word defined as alphabetic characters bounded by either spaces or punctuation), then delete that word fragment. Store the resultant string with MinInstance inside it, as QuoteString in CandidateQuotes.
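The windowing rule above can be sketched as follows. The function name `quote_strings` is hypothetical, and the `window` parameter exists only so the example can use a short text; the specification fixes it at 120. A cutoff is treated as splitting a word when the characters on both sides of it are alphabetic.

```python
import re

def quote_strings(text: str, allowed_mins, window: int = 120):
    """Return (MinInstance, QuoteString) pairs for every match in text."""
    quotes = []
    for amount in allowed_mins:
        for m in re.finditer(re.escape(amount), text, flags=re.IGNORECASE):
            left = max(0, m.start() - window)
            right = min(len(text), m.end() + window)
            # Left cutoff splits a word: skip forward past the fragment.
            if left > 0 and text[left - 1].isalpha():
                while left < m.start() and text[left].isalpha():
                    left += 1
            # Right cutoff splits a word: back up before the fragment.
            if right < len(text) and text[right].isalpha():
                while right > m.end() and text[right - 1].isalpha():
                    right -= 1
            quotes.append((amount, text[left:right].strip()))
    return quotes

sample = "please establish a fund with $10,000 minimum contribution today"
print(quote_strings(sample, ["$10,000"], window=8))
```

With `window=8` the left cutoff lands inside "fund", so that fragment is dropped and the sketch yields `with $10,000 minimum`.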

Post-Processing of QuoteStrings for Multiple Minimums in Proximity

Example: If "$10,000" and "$20,000" are found within 75 characters, extract a single quote window covering both, not two overlapping quotes.
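One way to realize this grouping is to merge the character spans of nearby MinInstances before cutting quote windows. The sketch below is an assumption-laden illustration: the function name `group_nearby` is hypothetical, and "within 75 characters" is interpreted as the gap between the end of one instance and the start of the next.

```python
def group_nearby(spans, max_gap: int = 75):
    """Merge (start, end) spans whose gap is at most max_gap characters."""
    groups = []
    for start, end in sorted(spans):
        if groups and start - groups[-1][1] <= max_gap:
            # Close enough to the previous group: extend it.
            groups[-1] = (groups[-1][0], max(groups[-1][1], end))
        else:
            groups.append((start, end))
    return groups

# Two amounts 43 characters apart, and a third far away.
print(group_nearby([(100, 107), (150, 157), (400, 407)]))
```

A single quote window would then be cut around each merged span, extending left from the group's start and right from its end, so the first two amounts share one QuoteString.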

Moving QuoteStrings from Candidates to Open or Rejected

Identify Proper Names

badSet

If QuoteString contains an element of badSet, move QuoteString from CandidateQuotes to RejectedQuotes.

Numbers Near Each Other

Define aNum as a string of characters bounded by spaces which has

If QuoteString contains three consecutive aNum without any intervening non-empty words, then move QuoteString from CandidateQuotes to RejectedQuotes.

Example:
A quote like "...with a $2,000 gift ($5,000 for scholarship) with up to five years to build to a fund minimum of $10,000 ($25,000 for scholarship funds)..." should NOT be rejected for table_aNum, since there are only two consecutive aNum at a time.
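The run-detection logic can be sketched independently of the exact aNum definition by passing the test in as a function. The names `has_anum_run` and `contains_digit` are hypothetical, and `contains_digit` is only a stand-in predicate for illustration; it is not the document's definition of aNum.

```python
def has_anum_run(quote: str, is_anum, run_length: int = 3) -> bool:
    """True if run_length space-separated tokens in a row satisfy is_anum."""
    run = 0
    for token in quote.split():
        # Any non-aNum word between numbers resets the run.
        run = run + 1 if is_anum(token) else 0
        if run >= run_length:
            return True
    return False

def contains_digit(token: str) -> bool:
    # Stand-in assumption only: treat any token with a digit as an aNum.
    return any(ch.isdigit() for ch in token)

example = ("...with a $2,000 gift ($5,000 for scholarship) with up to five "
           "years to build to a fund minimum of $10,000 ($25,000 for "
           "scholarship funds)...")
print(has_anum_run(example, contains_digit))
print(has_anum_run("Level $1,000 $5,000 $10,000 benefits", contains_digit))
```

Under the stand-in predicate, the example quote is kept (at most two aNum in a row), while the table-like second string is flagged.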

Scoring: From CandidateQuotes to OpenQuotes

After a QuoteString has passed all RedFlag criteria and has not been moved to RejectedQuotes, score it as follows.

Scoring QuoteStrings

For each remaining QuoteString on CandidateQuotes

Set creationScore = 0, fundScore = 0, minScore = 0

(In the following, the expression "for each string in creationSet, if in QuoteString" means that the condition is satisfied when an element of creationSet is a substring of QuoteString. By way of illustration, the string 'charit' in creationSet is satisfied by the string 'charitable' in QuoteString.)

For each string in creationSet, if in QuoteString, then creationScore = creationScore +1

For each string in fundSet, if in QuoteString, then fundScore = fundScore +1

For each string in minSet, if in QuoteString, then minScore = minScore +1

Create sumScore = creationScore + fundScore + minScore

Move this QuoteString from CandidateQuotes (along with its scores, element of AllowedMins, and URL) to OpenQuotes.
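The three counters and their sum can be sketched as below. The function names and the sample set contents are hypothetical, apart from 'charit', which the text itself gives as an element of creationSet; the real sets are loaded from parameters.txt.

```python
def count_hits(quote: str, strings) -> int:
    """Count the set elements that occur as substrings of quote."""
    folded = quote.casefold()
    return sum(1 for s in strings if s.casefold() in folded)

def score_quote(quote, creation_set, fund_set, min_set) -> dict:
    scores = {
        "creationScore": count_hits(quote, creation_set),
        "fundScore": count_hits(quote, fund_set),
        "minScore": count_hits(quote, min_set),
    }
    scores["sumScore"] = sum(scores.values())
    return scores

quote = ("A charitable gift of at least $25,000 will establish "
         "a named endowed fund.")
print(score_quote(quote,
                  ["charit", "establish", "create"],   # hypothetical creationSet
                  ["endow", "fund", "scholarship"],    # hypothetical fundSet
                  ["minimum", "at least"]))            # hypothetical minSet
```

For this sample quote the scores are 2, 2, and 1, giving a sumScore of 5.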

Termination

Stop when:

BestQuotes Selection and Output Format

Best Five

Review all entries in OpenQuotes

BestQuotes.md Format

BestQuotes.md should be formatted with hierarchy and bullets as follows:

For each of the 5 best quotes:

    -   header level 2: Best Quote #k (where k goes from 1 to 5)

    -   header level 3: URL on which quote was found

    -   header level 4: minimal dollar amount

    -   bullet item: Quote verbatim

    -   bullet item: creationScore

    -   bullet item: fundScore

    -   bullet item: minScore

    -   bullet item: sumScore
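A sketch of rendering this hierarchy as Markdown text follows. The function name and the record keys (`url`, `minimum`, `quote`) are hypothetical, and prefixing each score bullet with its name is an assumption about the desired layout.

```python
def format_best_quotes(best_quotes) -> str:
    """Render up to five quote records in the BestQuotes.md layout."""
    lines = []
    for k, q in enumerate(best_quotes[:5], start=1):
        lines.extend([
            f"## Best Quote #{k}",      # header level 2
            f"### {q['url']}",          # header level 3: source URL
            f"#### {q['minimum']}",     # header level 4: dollar amount
            f"- {q['quote']}",          # quote verbatim
            f"- creationScore: {q['creationScore']}",
            f"- fundScore: {q['fundScore']}",
            f"- minScore: {q['minScore']}",
            f"- sumScore: {q['sumScore']}",
            "",
        ])
    return "\n".join(lines)

record = {
    "url": "https://giving.example.edu/endowment",
    "minimum": "$25,000",
    "quote": "A gift of at least $25,000 will establish a named fund.",
    "creationScore": 1, "fundScore": 1, "minScore": 1, "sumScore": 3,
}
print(format_best_quotes([record]))
```

The slice `best_quotes[:5]` caps the output at five entries even if more records are passed in.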

Output for EntityURL

For each EntityURL, create a subdirectory named after the Entity and produce each output file:

Files:

Ordering follows discovery order.

Speed

The speed of the EndowmentScraper system depends heavily on the number of URLs visited and on the wait time between this system's request for a webpage and the target server's response. The system uses Python's multi-threading capability to mitigate this. The 4 parameters which users can set to affect speed, with a brief annotation of each, follow:

Testing

Syntax and Execution

Functional Testing

Subsidiary Tests