Creating a Spider

This document describes how to define a spider, a class that implements the logic to fetch version information for one or more software products.

Creating a Simple Spider

To define a simple spider, one that provides version data for a specific software product and hence does not require parameters:

  1. Add your entry (e.g. myentry) to the versiontracker/data.json file with an empty JSON object as value:

    "myentry": {}
    
  2. Create a new file, versiontracker/spiders/myentry.py, which extends the Spider class from versiontracker.spiders:

    from . import Spider
    
    
    class MyEntry(Spider):
        name = 'myentry'
    

    If the entry ID contains a hyphen, use an underscore in the name class variable instead (e.g. entry-id → entry_id).

  3. In your spider class, first define the first_request() method, which must return the URL of the first page to parse:

    def first_request(self, data):
        return 'https://www.example.com'
    
  4. When the URL returned by first_request() is downloaded, the parse() method of your class is called, so you must implement it:

    def parse(self, response):
        # …
    

    The parse() method may return an instance of versiontracker.items.Item or a Scrapy Request object to fetch another URL.

    To learn how to implement the parse method, you can:

    • Read the Scrapy tutorial.
    • Check the source of other simple spiders, such as: apt, cuda, freedroid, grass, mozjs, nvidia, openbsd, thunderbird.
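
    A minimal parse() implementation often just extracts a version string with an XPath expression and returns an Item (imported from ..items, as in the example below). The XPath and page layout here are made up for illustration:

    def parse(self, response):
        # Hypothetical XPath for a page that displays the latest version
        # inside a dedicated element; adjust it to the actual markup.
        version = response.xpath(
            '//span[@id="latest-version"]/text()').extract_first()
        return Item(version=version, response=response)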

For example, for mozjs:

  • data.json:

    "mozjs": {},
    
  • spiders/mozjs.py:

    import re
    
    from scrapy import Request
    
    from ..items import Item
    
    from . import Spider
    
    
    class MozJS(Spider):
        name = 'mozjs'
    
        def parse(self, response):
            xpath = '//h2[@id=\'Current_release\']/following-sibling::ul//a/@href'
            response.meta['url'] = response.url
            return Request(response.urljoin(response.xpath(xpath).extract_first()),
                           callback=self.parse_article,
                           meta=response.meta)
    
        def parse_article(self, response):
            base_xpath = '//article[@id=\'wikiArticle\']' \
                         '//div[re:test(@class, \'\\bnote\\b\')]'
            xpath = base_xpath + '//a/@href'
            download_link = response.xpath(xpath).extract_first()
            version = re.search('mozjs-(\\d+(\\.\\d+)+)', download_link).group(1)
            xpath = base_xpath + '//span[re:test(@class, \'\\bgI\\b\')]/text()'
            date = response.xpath(xpath).extract_first()
            return Item(date=date, response=response, version=version)
    
        def first_request(self, data):
            return 'https://developer.mozilla.org/en-US/docs/Mozilla/Projects' \
                   '/SpiderMonkey/Releases'
    

Creating a Parametrized Spider

To extend your spider so that it supports multiple software products:

  • Remember to specify the spider name in the data.json file, at least in those entries whose name does not match the spider name:

    "myentry": {
        "spider": {
            "name": "myspider"
        }
    }
    
  • In the first_request() method, data is a dictionary that contains the spider parameters defined for the current software product, as well as its id (see the combined sketch after this list).

  • In the parse() method, response.meta is a dictionary that contains the spider parameters defined for the current software product, as well as its id.

    If you return or yield an Item object, note that you can include the response object in the item:

    return Item(version='1.2.3', response=response)
    

    If a response object is passed, the contents of response.meta are used as fallback, as well as the response.url value.

    If you return or yield new Request objects, remember to set their meta property to that of the response:

    return Request('http://www.example.com', meta=response.meta)
    
  • You can define a dictionary with default values for spider parameters, so that those values are used when not defined in a software product.

    To do that, override the start_requests() method as follows:

    def start_requests(self):
        return super().iter_start_requests(
            params={'key1': 'value1', 'key2': 'value2'})
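
For example, a parametrized spider combining the points above might look like the following sketch. The project parameter, the URL and the XPath are assumptions for illustration, and the sketch assumes that the default values passed to iter_start_requests() end up in data together with the per-product parameters:

    from ..items import Item

    from . import Spider


    class MySpider(Spider):
        name = 'myspider'

        def start_requests(self):
            # Hypothetical default value for the 'project' parameter.
            return super().iter_start_requests(
                params={'project': 'default'})

        def first_request(self, data):
            # data contains the spider parameters of the current software
            # product, as well as its id.
            return 'https://www.example.com/{}/releases'.format(
                data['project'])

        def parse(self, response):
            # response.meta contains the same spider parameters and the id.
            # Hypothetical XPath; adjust it to the actual page markup.
            version = response.xpath(
                '//span[@id="latest-version"]/text()').extract_first()
            return Item(version=version, response=response)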
    

Creating a Path Spider

To implement a spider that navigates paths and supports path placeholders, your Spider class must:

  1. Subclass versiontracker.spiders.PathSpider.

  2. Use the first_request() method to define the target path in the dictionary of spider parameters if it is not manually specified in data.json:

    def first_request(self, data):
        data['path'] = '/path'
        return super().first_request(data)
    
  3. Implement the path_url() method so that it returns the URL for a given path:

    def path_url(self, data):
        return 'http://www.example.com' + data['new_path']
    

    As you can see in the example above, the new_path key of the received data dictionary contains the target path.

    The base URL is usually different for each software product. A common way to handle this is to define a base_url key in the implementation of first_request(), so that it can be read in path_url():

    def path_url(self, data):
        return data['base_url'] + data['new_path']
    
  4. Implement the iter_entries() method so that it yields dictionaries for each entry containing the entry name (file or folder name) and its last modification date:

    def iter_entries(self, response):
        for name, date in items_found_in_response:
            yield {'date': date, 'name': name}
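
Putting these steps together, a complete path spider might look like the following sketch. The base URL, the path and the XPath expressions are assumptions for illustration, and the import assumes that PathSpider is exposed by the same package as Spider:

    from . import PathSpider


    class MyEntry(PathSpider):
        name = 'myentry'

        def first_request(self, data):
            # Hypothetical base URL and target path; real spiders may read
            # them from the spider parameters instead.
            data['base_url'] = 'https://downloads.example.com'
            data['path'] = '/myentry'
            return super().first_request(data)

        def path_url(self, data):
            # new_path contains the path to fetch next.
            return data['base_url'] + data['new_path']

        def iter_entries(self, response):
            # Hypothetical XPaths for a typical directory listing table.
            for row in response.xpath('//table//tr[position() > 1]'):
                name = row.xpath('./td[1]/a/text()').extract_first()
                date = row.xpath('./td[2]/text()').extract_first()
                yield {'date': date, 'name': name}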
    

Creating a Git Spider

To implement a spider that can use either commits or tags to determine the version information of a software product:

  1. Subclass versiontracker.spiders.GitSpider.

  2. Implement the first_request() method. You must call the parent implementation first, and then return the right URL based on whether the search is for commits (data['commit'] is not None) or for tags:

    def first_request(self, data):
        super().first_request(data)
        return commits_url if data.get('commit', None) else tags_url
    

    Tip

    If you are building a spider for a Git hosting service (as opposed to a self-hosted Git server), you may want to use the git_service_url() method here. See the implementation of the github spider as an example.

  3. Implement the parse() method so that it iterates through the received commits or tags.

    You can use self.searching_commits(response) to determine whether you are iterating commits or tags.

    For each commit or tag, you must extract its commit message or tag name and call self.item(response, string) with the received response and the commit message or tag name as the second parameter.

    self.item() may return an Item object with a version already defined based on the specified commit message or tag name. In that case, you must fill the date and url fields of the item, and return that item.

    If self.item() returns None, just continue iterating through the commits or tags until it does not.
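
A parse() implementation following these steps might look like the sketch below. The XPath expressions and page layout are assumptions for illustration; only the searching_commits() and item() flow reflects the behaviour described above:

    def parse(self, response):
        if self.searching_commits(response):
            # Hypothetical markup for a list of commits.
            rows = response.xpath('//li[@class="commit"]')
            string_xpath = './/*[@class="message"]/text()'
        else:
            # Hypothetical markup for a list of tags.
            rows = response.xpath('//li[@class="tag"]')
            string_xpath = './/*[@class="name"]/text()'
        for row in rows:
            string = row.xpath(string_xpath).extract_first()
            item = self.item(response, string)
            if item is None:
                continue
            # Fill in the date and url fields of the matching item; the
            # date XPath is, again, an assumption.
            item['date'] = row.xpath('.//time/text()').extract_first()
            item['url'] = response.url
            return item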

Using Scrapy to Speed Up Development

Version Tracker is built on top of Scrapy, and many of the tools that Scrapy provides to speed up the development of spiders can be used for Version Tracker as well.

The Scrapy shell, for example, can come in really handy. It allows you to run XPath expressions against any URL interactively, and even to load URLs in your browser the way Version Tracker sees them, which is very useful for sites that present different content to spiders.
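
For instance, a typical shell session might look like this (the URL and the XPath expression are placeholders):

    $ scrapy shell 'https://www.example.com'
    >>> response.xpath('//span[@id="latest-version"]/text()').extract_first()
    >>> view(response)  # open the downloaded page in your browser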

Another useful feature of Scrapy is HTTP caching. Version Tracker is configured to use a standards-compliant caching policy by default. However, it is possible to switch to a different policy (Scrapy's default) that caches pages indefinitely; that policy should never be used in production, but it speeds up development a lot. To use it, comment out the HTTPCACHE_POLICY line in the settings.py file:

#HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'