scrapd.core package¶
scrapd.core.apd module¶
Define the module containing the functions used to scrape data from the APD website.
-
async
scrapd.core.apd.
async_retrieve
(pages=-1, from_=None, to=None, attempts=1, backoff=1, dump=False)[source]¶ Retrieve fatality data.
- Parameters
pages (int) – number of pages to retrieve or -1 for all
from_ (str) – the start date
to (str) – the end date
attempts (int) – number of attempts per report
backoff (int) – initial backoff time, in seconds
dump (bool) – dump reports with parsing issues
- Returns
the list of fatalities and the number of pages that were read.
- Return type
tuple
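The attempts and backoff parameters suggest a retry loop around each report fetch. The following is a minimal sketch of that pattern, not scrapd's implementation: the helper names retry_with_backoff and make_flaky are hypothetical, and doubling the delay after each failure is an assumption.

```python
import asyncio


async def retry_with_backoff(make_call, attempts=1, backoff=1):
    """Await make_call() up to `attempts` times, sleeping between failures.

    Hypothetical helper: the delay starts at `backoff` seconds and doubles
    after each failed attempt (an assumption, not a documented policy).
    """
    delay = backoff
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)
            delay *= 2


def make_flaky(failures):
    """Build a coroutine function that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("temporary failure")
        return state["calls"]

    return call
```

With attempts=3 the flaky call built by make_flaky(2) succeeds on its third try; with attempts=2 the last ConnectionError is re-raised to the caller.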
-
scrapd.core.apd.
extract_traffic_fatalities_page_details_link
(news_page)[source]¶ Extract the fatality detail page links from the news page.
- Parameters
news_page (str) – HTML content of the news page
- Returns
a list of links.
- Return type
list or None
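As an illustration of the "list or None" contract, here is a small sketch that pulls detail links out of a page with a regular expression. The link shape (/news/traffic-fatality-...) and the name extract_links are assumptions made for the example; the real APD markup and the pattern scrapd uses may differ.

```python
import re

# Assumed link shape, for illustration only.
DETAIL_LINK = re.compile(r'<a href="(/news/traffic-fatality-[\w-]+)"')


def extract_links(news_page):
    """Return the matching hrefs in document order, or None if there are none."""
    if not news_page:
        return None
    links = DETAIL_LINK.findall(news_page)
    return links or None
```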
-
scrapd.core.apd.
fetch_and_parse
(session, url, dump=False)[source]¶ Parse a fatality page from a URL.
- Parameters
session (aiohttp.ClientSession) – aiohttp session
url (str) – detail page URL
- Returns
a dictionary representing a fatality.
- Return type
dict
-
async
scrapd.core.apd.
fetch_detail_page
(session, url)[source]¶ Fetch the content of a detail page.
- Parameters
session (aiohttp.ClientSession) – aiohttp session
url (str) – request URL
- Returns
the page content.
- Return type
str
-
async
scrapd.core.apd.
fetch_news_page
(session, page=1)[source]¶ Fetch the content of a specific news page from the APD website.
The page number starts at 1.
- Parameters
session (aiohttp.ClientSession) – aiohttp session
page (int) – page number to fetch, defaults to 1
- Returns
the page content.
- Return type
str
-
scrapd.core.apd.
fetch_text
(session, url, params=None)[source]¶ Fetch the data from a URL as text.
- Parameters
session (aiohttp.ClientSession) – aiohttp session
url (str) – request URL
params (dict) – request parameters, defaults to None
- Returns
the data from a URL as text.
- Return type
str
-
scrapd.core.apd.
generate_detail_page_urls
(titles)[source]¶ Generate the full URLs of the fatality detail pages.
- Parameters
titles (list) – a list of partial links
- Returns
a list of full links to the fatality detail pages.
- Return type
list
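Turning partial links into full URLs can be sketched with urllib.parse.urljoin. The base URL below is a placeholder and build_detail_urls is a hypothetical name; neither is taken from scrapd.

```python
from urllib.parse import urljoin

# Placeholder host used for illustration only.
BASE_URL = "https://example.org"


def build_detail_urls(partial_links, base_url=BASE_URL):
    """Join each partial link to the base URL."""
    return [urljoin(base_url, link) for link in partial_links]
```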
scrapd.core.constant module¶
Define the scrapd constants.
-
class
scrapd.core.constant.
Fields
[source]¶ Bases:
object
Define the resource constants.
-
AGE
= 'age'¶
-
CASE
= 'case'¶
-
CRASH
= 'crash'¶
-
DATE
= 'date'¶
-
DECEASED
= 'deceased'¶
-
DOB
= 'dob'¶
-
ETHNICITY
= 'ethnicity'¶
-
FATALITIES
= 'fatalities'¶
-
FIRST_NAME
= 'first'¶
-
GENDER
= 'gender'¶
-
GENERATION
= 'generation'¶
-
LAST_NAME
= 'last'¶
-
LATITUDE
= 'latitude'¶
-
LINK
= 'link'¶
-
LOCATION
= 'location'¶
-
LONGITUDE
= 'longitude'¶
-
MIDDLE_NAME
= 'middle'¶
-
NOTES
= 'notes'¶
-
TIME
= 'time'¶
-
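A short sketch of how such constants are typically used as dictionary keys. The class below is a local stand-in that mirrors a few of the values listed above so the example runs without scrapd installed, and the record values are made up.

```python
class Fields:
    """Local stand-in mirroring a few of the constants listed above."""
    AGE = 'age'
    CASE = 'case'
    DATE = 'date'
    FIRST_NAME = 'first'
    LAST_NAME = 'last'


# Hypothetical record; the values are invented for the example.
record = {
    Fields.CASE: '19-0000000',
    Fields.DATE: '01/15/2019',
    Fields.FIRST_NAME: 'Jane',
    Fields.LAST_NAME: 'Doe',
    Fields.AGE: 40,
}

# Referring to keys through the constants avoids typos in string literals.
full_name = f"{record[Fields.FIRST_NAME]} {record[Fields.LAST_NAME]}"
```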
scrapd.core.date_utils module¶
Define a module to manipulate dates.
-
scrapd.core.date_utils.
check_dob
(dob)[source]¶ Determine the century when a date of birth only contains a 2-digit year.
- Parameters
dob (datetime.date) – DOB
- Returns
DOB with 19xx or 20xx as appropriate
- Return type
datetime.date
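One way to resolve the century, sketched under the assumption that a date of birth can never lie in the future: Python's strptime maps the 2-digit years 00 to 68 to 20xx, so a birth year such as "68" parses as 2068 and needs to be moved back 100 years. The name fix_century and the explicit today argument are hypothetical.

```python
import datetime


def fix_century(dob, today):
    """Move a date of birth that falls after `today` back by 100 years."""
    if dob > today:
        return dob.replace(year=dob.year - 100)
    return dob


# strptime's %y maps "68" to 2068, which cannot be a valid birth year here.
parsed = datetime.datetime.strptime("03/15/68", "%m/%d/%y").date()
fixed = fix_century(parsed, today=datetime.date(2019, 1, 1))
```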
-
scrapd.core.date_utils.
compute_age
(date, dob)[source]¶ Compute a victim’s age.
- Parameters
date (datetime.date) – crash date
dob (datetime.date) – date of birth
- Returns
the victim’s age.
- Return type
int
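A common way to compute an age in full years is to subtract the birth year and take one year off when the birthday has not yet occurred by the crash date. This is a sketch of that calculation, not necessarily how compute_age is written; age_on is a hypothetical name.

```python
import datetime


def age_on(date, dob):
    """Return the number of full years between dob and date."""
    had_birthday = (date.month, date.day) >= (dob.month, dob.day)
    return date.year - dob.year - (0 if had_birthday else 1)
```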
-
scrapd.core.date_utils.
from_date
(date)[source]¶ Parse the date from a human readable format, with options for the from date.
If the date cannot be parsed, datetime.date.min is returned.
If the day of the month is not specified, the first day is used.
- Parameters
date (str) – date
- Returns
a date object representing the date.
- Return type
datetime.date
-
scrapd.core.date_utils.
is_before
(d1, d2)[source]¶ Return True if d1 is strictly before d2.
- Parameters
d1 (datetime.date) – date 1
d2 (datetime.date) – date 2
- Returns
True if d1 is strictly before d2.
- Return type
bool
-
scrapd.core.date_utils.
is_between
(date, from_=None, to=None)[source]¶ Check whether a date falls between two others.
- Parameters
date (datetime.date) – date to check
from_ (datetime.date) – start date, defaults to None
to (datetime.date) – end date, defaults to None
- Returns
True if the date is between from_ and to
- Return type
bool
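A sketch of a range check with optional bounds, where a missing bound leaves that side of the range open. Treating both bounds as inclusive is an assumption, since the description above does not say, and in_range is a hypothetical name.

```python
import datetime


def in_range(date, from_=None, to=None):
    """Return True if date lies within [from_, to]; None means unbounded."""
    start = from_ or datetime.date.min
    end = to or datetime.date.max
    return start <= date <= end
```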
-
scrapd.core.date_utils.
parse_date
(date, default=None, settings=None)[source]¶ Parse the date from a human readable format.
If no default value is specified and there is an error, an exception is raised. Otherwise the default value is returned.
- Parameters
date (str) – date
default (datetime.date) – default value in case the date cannot be parsed.
settings (dict) – a dictionary containing the parsing options. All the available options are defined here: https://dateparser.readthedocs.io/en/latest/dateparser.html#dateparser.conf.Settings.
- Returns
a date object representing the date.
- Return type
datetime.date
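The default-value contract described above can be sketched as follows. The example substitutes datetime.strptime with a fixed format for the real parser, so parse_or_default and its fmt argument are hypothetical; only the raise-or-return-default behavior is the point.

```python
import datetime


def parse_or_default(text, default=None, fmt="%m/%d/%Y"):
    """Parse text with strptime; fall back to `default` when one is given."""
    try:
        return datetime.datetime.strptime(text, fmt).date()
    except ValueError:
        if default is None:
            raise  # no default supplied: propagate the parsing error
        return default
```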
-
scrapd.core.date_utils.
parse_time
(time)[source]¶ Parse the time from a human readable format.
- Parameters
time (str) – time
- Returns
a time object representing the time.
- Return type
datetime.time
-
scrapd.core.date_utils.
to_date
(date)[source]¶ Parse the date from a human readable format, with options for the to date.
If the date cannot be parsed, datetime.date.max is returned.
If the day of the month is not specified, the last day is used.
- Parameters
date (str) – date
- Returns
a date object representing the date.
- Return type
datetime.date
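The difference between a lower and an upper range bound can be illustrated with the standard library alone. This sketch assumes only two input formats, MM/DD/YYYY and MM/YYYY, whereas the real functions accept human-readable dates; parse_bound is a hypothetical name.

```python
import calendar
import datetime


def parse_bound(text, upper=False):
    """Parse 'MM/DD/YYYY' or 'MM/YYYY' into a date usable as a range bound."""
    try:
        return datetime.datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        pass
    try:
        month_start = datetime.datetime.strptime(text, "%m/%Y").date()
    except ValueError:
        # Unparseable input leaves the range open on this side.
        return datetime.date.max if upper else datetime.date.min
    if upper:
        # No day given: an upper bound takes the last day of the month.
        last_day = calendar.monthrange(month_start.year, month_start.month)[1]
        return month_start.replace(day=last_day)
    # No day given: a lower bound takes the first day of the month.
    return month_start
```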
scrapd.core.formatter module¶
Define the formatter module.
This module contains all the classes with the ability to print the results. Their destination depends on the custom formatter used to print the results and can be stdout, stderr, a file or even a remote storage if the formatter allows it.
-
class
scrapd.core.formatter.
CSVFormatter
(format_='json', output=None)[source]¶ Bases:
scrapd.core.formatter.Formatter
Define the CSV formatter.
Displays the results as a CSV.
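A minimal sketch of rendering a list of result dictionaries as CSV with the standard csv module. The column order (sorted keys) and the name to_csv are assumptions; the actual formatter may choose its columns differently.

```python
import csv
import io


def to_csv(results):
    """Serialize a list of dicts as CSV text, one column per key."""
    if not results:
        return ""
    # Collect every key that appears in any row, in a stable order.
    fieldnames = sorted({key for row in results for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(results)  # missing keys are written as empty cells
    return buffer.getvalue()
```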
-
class
scrapd.core.formatter.
CountFormatter
(format_='json', output=None)[source]¶ Bases:
scrapd.core.formatter.Formatter
Define the Count formatter.
Simply displays the number of results matching the search criteria.
-
class
scrapd.core.formatter.
Formatter
(format_='json', output=None)[source]¶ Bases:
object
Define the Formatter base class.
The default printer method simply uses the print() function.
-
formatters
= {'count': <class 'scrapd.core.formatter.CountFormatter'>, 'csv': <class 'scrapd.core.formatter.CSVFormatter'>, 'json': <class 'scrapd.core.formatter.JSONFormatter'>, 'python': <class 'scrapd.core.formatter.PythonFormatter'>}¶
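A name-to-class mapping like this one is often populated automatically as subclasses are defined. The sketch below shows one way to do that with __init_subclass__; it is an illustration of the pattern, not scrapd's code, and BaseFormatter, CountOnly, render and get_formatter are all hypothetical names.

```python
class BaseFormatter:
    """Hypothetical base class keeping a name-to-class registry."""

    formatters = {}
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every subclass that declares a name.
        if cls.name:
            BaseFormatter.formatters[cls.name] = cls

    def render(self, results):
        return str(results)


class CountOnly(BaseFormatter):
    name = "count"

    def render(self, results):
        return str(len(results))


def get_formatter(name):
    """Instantiate the formatter registered under `name`."""
    return BaseFormatter.formatters[name]()
```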
-
-
class
scrapd.core.formatter.
JSONFormatter
(format_='json', output=None)[source]¶ Bases:
scrapd.core.formatter.Formatter
Define the JSON formatter.
Displays the results as JSON. The keys are sorted and an indentation of 2 spaces is set.
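Sorted keys and a 2-space indentation correspond to the sort_keys and indent arguments of json.dumps. A minimal sketch, with to_json as a hypothetical wrapper name:

```python
import json


def to_json(results):
    """Render results as JSON with sorted keys and a 2-space indent."""
    return json.dumps(results, sort_keys=True, indent=2)
```

Sorting the keys keeps the output stable between runs, which makes it easier to diff two result sets.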
-
class
scrapd.core.formatter.
PythonFormatter
(format_='json', output=None)[source]¶ Bases:
scrapd.core.formatter.Formatter
Define the Python formatter.
Displays the results using PrettyPrinter with an indentation of 2 spaces.
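The equivalent with the standard pprint module is sketched below; to_python is a hypothetical wrapper name. The output is a Python literal, so it can be read back with ast.literal_eval.

```python
import pprint


def to_python(results):
    """Render results as a Python literal using a 2-space indent."""
    return pprint.PrettyPrinter(indent=2).pformat(results)
```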
scrapd.core.version module¶
Define a set of utility functions for managing versions.