Architecture¶
The primary components are listed below (by Python module path) and described:
spade.controller.management.commands¶
Contains the scrape and useragents management commands.
spade.model.models¶
Contains the database models:
A UserAgent stores a user-agent string that will be used to scrape sites
the next time the scrape management command is run.
A Batch represents a single run of the scrape management command.
A BatchUserAgent stores a user-agent string that actually was used when
scraping a particular batch. This is copied from a UserAgent when
scrape is run; the separation prevents future changes to the user-agent
list from modifying or corrupting data from past runs.
A SiteScan object is created for each top-level URL in the list of URLs
given to the scrape management command.
A URLScan object is created for each URL scanned; this includes the initial
top-level URLs, and all linked pages one level deep.
A URLContent object stores the scraped contents of a single URL for a
particular user agent. In other words, for every URLScan there will be N
URLContent objects, if there are N UserAgent records at the time the
scrape is initiated.
A LinkedCSS contains information about a single linked CSS file. Every CSS
file at a distinct URL has only one LinkedCSS record, even if it was linked
from multiple scraped HTML pages (thus LinkedCSS has a many-to-many
relationship with URLContent).
Similarly, a LinkedJS contains information about a single linked JS file.
When the contents of a LinkedCSS file are parsed by
spade.utils.cssparser.CSSParser, a CSSRule object is created for every
CSS rule in the file, and a CSSProperty object for every property in every
rule.
The various *Data models contain aggregated data about issues detected in
the scan.
spade.scraper¶
A Scrapy scraper that scrapes a list of given URLs with all user-agent strings listed in the database, following links one level deep, and saving all response contents (including linked JS and CSS) in the database.
spade.settings¶
Contains the Django project settings.
spade.tests¶
Contains the tests.
spade.utils.data_aggregator¶
Contains a DataAggregator class that populates the BatchData,
SiteScanData, URLScanData, URLContentData and LinkedCSSData
models with summary aggregate data about the scan.
spade.utils.css_parser¶
Contains a CSSParser class that can take raw CSS, parse it, and store it
into the CSSRule and CSSProperty database models.
spade.utils.html_diff¶
Contains a HTMLDiff class that can compare the tag structure of two chunks
of HTML, ignoring differences in tag content and attributes, and return a
measure of their similarity (0.0 if they have nothing in common, 1.0 if they
are identical).
spade.view.urls¶
The URL configuration for the site.
Run python manage.py runserver to fire up a development web server and view
the app in your browser at http://localhost:8000/.
spade.view.views¶
Contains the Django view functions.