Extending Software Tracking Data

The data used to determine how to fetch version information of a supported product is defined in a JSON file, versiontracker/data.json.

The data file contains a JSON object where keys are software IDs, and their value is a JSON object with settings:

{
    "0ad": { … },
    "4kslideshowmaker": { … },
    …
}

The object of a few entries may be empty because there is a spider built specifically for that entry:

"nvidia": {}

However, most entries contain a spider key, which determines how to retrieve the version data for that software entry:

"calamares": {
    "spider": {
        "name": "github"
    }
}

Some software products, however, may also need a formatter key, which defines additional changes to apply to the retrieved data:

"c-ares": {
    "spider": {
        "name": "github",
        "tag": "^cares-\\d+(_\\d+)*$"
    },
    "formatter": "n_n"
}
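As a rough illustration of the structure described above, the following Python sketch parses a small inlined excerpt in the same shape as data.json and lists the spider name, the spider parameters and the formatter of each entry. The inlined excerpt and the summarize_entry helper are assumptions made for this example; they are not part of Version Tracker.

```python
import json

# Hypothetical excerpt with the same shape as versiontracker/data.json,
# inlined so that the sketch runs on its own.
DATA = r'''
{
    "calamares": {"spider": {"name": "github"}},
    "c-ares": {
        "spider": {"name": "github", "tag": "^cares-\\d+(_\\d+)*$"},
        "formatter": "n_n"
    },
    "nvidia": {}
}
'''


def summarize_entry(software_id, settings):
    """Return (ID, spider name, spider parameters, formatter) for an entry."""
    spider = settings.get("spider", {})
    # An empty entry means that a dedicated spider exists for that ID.
    name = spider.get("name", "<dedicated spider>")
    parameters = {key: value for key, value in spider.items() if key != "name"}
    return software_id, name, parameters, settings.get("formatter")


entries = json.loads(DATA)
summary = [summarize_entry(key, value) for key, value in entries.items()]
for row in summary:
    print(row)
```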

Spiders

A spider is a way to fetch version information for one or more software products based on some product-specific parameters.

To use a spider, you must define the name of the spider that you want to use, and the parameters that you want to pass to it:

"spider": {
    "name": "<spider name>",
    "<key1>": "<value1>",
    "<key2>": "<value2>",
    …
}

Each spider may accept different parameters, and some parameters may be optional while other parameters are mandatory. You must learn how to use the parameters of each spider.

That said, most spiders try to provide sane defaults that cover the most common scenarios. It is possible to write an entry for a project hosted on GitHub or SourceForge that simply defines the name of the spider, without extra parameters:

"flacon": {
    "spider": {
        "name": "github"
    }
}

Version Tracker implements many spiders, all of which are described in detail below.

Path Spiders

The following spiders are based on paths. You must specify the path to a file or a folder in one of the parameters of the spider, and the version information is based on the information of that file or folder (its name, its date and its location).

Warning

It is often best to point to a file instead of a folder, as the date of a folder is usually either the date when the folder itself was created or the date when the contents of the folder last changed.

More often than not, those dates do not match the actual release date of the target software.

Whole path items may be replaced by one of the following placeholders:

  • ${highest} is replaced by the last file or folder found in alphabetical order.
  • ${latest} is replaced by the newest file or folder found.

These placeholders may include a Python regular expression, separated from the placeholder keyword by a colon (e.g. ${latest:<regular expression>}), so that only the latest or the highest item that also matches the specified pattern is selected.

The last item in a path should be a placeholder. Otherwise, the spider automatically appends the following placeholder to the path:

${latest:(\d+\.\d+(\.\d+(\.\d+)?)?)}

On the last item of a path, the first capturing group (content between parentheses) of a regular expression is considered the version. If there are no capturing groups in the regular expression, or if there is no regular expression in the placeholder, the whole path item is considered the version.

Tip

If you need to use parentheses in your regular expressions but you do not want them to act as a capturing group, add ?: right after the opening parenthesis: (?:…).
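The placeholder rules above can be pictured with a short Python sketch. It is not the code of the path spiders: the select_item and version_of helpers, the file listing and its dates are all hypothetical, and the sketch assumes that patterns are searched anywhere in an item name and that "alphabetical order" means plain string comparison.

```python
import re
from datetime import date

# Default placeholder pattern quoted in the text above.
DEFAULT_PATTERN = r"(\d+\.\d+(\.\d+(\.\d+)?)?)"

# Hypothetical directory listing: (item name, item date).
LISTING = [
    ("ed-1.9.tar.gz", date(2013, 6, 1)),
    ("ed-1.13.tar.lz", date(2016, 1, 1)),
    ("ed-1.14.2.tar.lz", date(2017, 2, 22)),
    ("README", date(2018, 1, 1)),
]


def select_item(items, keyword, pattern=None):
    """Pick the 'highest' or 'latest' item, optionally filtered by pattern."""
    if pattern is not None:
        items = [item for item in items if re.search(pattern, item[0])]
    if not items:
        return None
    if keyword == "highest":
        return max(items, key=lambda item: item[0])  # last in string order
    return max(items, key=lambda item: item[1])  # newest by date


def version_of(name, pattern=None):
    """First capturing group if the pattern has one, else the whole match."""
    if pattern is None:
        return name
    match = re.search(pattern, name)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


latest = select_item(LISTING, "latest", DEFAULT_PATTERN)
highest = select_item(LISTING, "highest", DEFAULT_PATTERN)
print(latest[0], version_of(latest[0], DEFAULT_PATTERN))
print(highest[0], version_of(highest[0], DEFAULT_PATTERN))
```

With this made-up listing, ${latest} and ${highest} select different files, because string order places ed-1.9 after ed-1.14.2.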

http

This is a very useful spider. It should be able to retrieve version information from any HTTP server (e.g. Apache or nginx) that provides a file-browsing web interface that has not been customized too much.

These are just some examples of supported file-serving web interfaces:

The http spider accepts a single, mandatory parameter: url. It must contain the URL of the target file or folder, including placeholders as needed.

The following are some examples of usage of the http spider:

"bitlbee": {
    "spider": {
        "name": "http",
        "url": "https://get.bitlbee.org/src/"
    }
},
"ed": {
    "spider": {
        "name": "http",
        "url": "http://ftp.gnu.org/gnu/ed"
    }
},
"kamoso": {
    "spider": {
        "name": "http",
        "url": "http://download.kde.org/stable/kamoso/${latest}/src/"
    }
}

sourceforge

Handles software hosted on SourceForge.

It accepts the following optional parameters:

  • project is the name of the project, as it appears in the SourceForge URL. If no project is specified, the entry ID is used as project.

  • path is the path to the target file or folder from the ‘Files’ section of the target SourceForge project. If no path is specified, the spider uses the default placeholder to find a match, as described above.

    If no path is specified or the specified path contains no folders (there is no ‘/’ character in it), the file that SourceForge reports as the latest file of the project is considered first. The rest of the files of the project are considered only if that file does not match the specified regular expression.

    If you do not trust that the file that SourceForge reports as the latest is really the latest (we have seen some projects where it was not), and the target file is not within any folder, prepend ‘/’ to your path (e.g. /${latest}).

Git Spiders

The following spiders are based on web interfaces for Git repositories.

They provide different parameters to define the location of the web interface for the repository, and some of them support special features, but their basic behavior is always as follows.

You may specify one of two parameters to determine where to take the version information from: commit or tag. These parameters are mutually exclusive, so you should use only one of them. If both are specified, tag is ignored.

If specified, the commit parameter should contain a regular expression to match a version definition in a commit message. You may use a capture group to indicate, within the regular expression, which part constitutes the version. Otherwise, whatever the regular expression matches is considered the version.

The commit parameter should be avoided when possible, though. It is only there to deal with projects that do not use (version) tags but that at least indicate version changes in the commit where they happen, and in a way that allows extracting them with a regular expression.
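A minimal Python sketch of the commit rule follows. The commit messages are invented, the version_from_commits helper is hypothetical, and the sketch assumes that messages are visited newest first and that the pattern is searched in each message.

```python
import re


def version_from_commits(messages, pattern):
    """Return the version from the first commit message matching pattern."""
    for message in messages:
        match = re.search(pattern, message)
        if match:
            # Use the first capture group if there is one, else the whole match.
            return match.group(1) if match.groups() else match.group(0)
    return None


# Invented commit messages, newest first.
MESSAGES = ["Fix memory leak", "Version 1.23", "Update docs", "Version 1.22"]

commit_version = version_from_commits(MESSAGES, r"^Version (\d+(\.\d+)+)$")
print(commit_version)
```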

If no commit is specified, the version is extracted from the tags of the repository.

By default, the first tag is taken. If that tag starts with v, which is a common prefix for version tags, that prefix is removed from the resulting version. It is also common to prepend the package name to the version tag name (e.g. mysoftware-1.0.0). If that prefix matches the ID of your entry, it is removed as well.

If you need to be picky about the tag that you select, define the tag parameter with a regular expression. Only the latest tag that matches your regular expression will be selected, and the resulting version will be the first capture group in your regular expression or, if there are no capture groups, the whole match.
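The tag rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the Git spiders: the version_from_tags helper is hypothetical, tags are assumed to arrive newest first, and the entry ID prefix is assumed to be followed by a hyphen and stripped before the v prefix.

```python
import re


def version_from_tags(tags, entry_id, tag_pattern=None):
    """Derive a version from a list of tag names ordered newest first."""
    if tag_pattern is None:
        if not tags:
            return None
        tag = tags[0]  # default: take the first tag
        prefix = entry_id + "-"
        if tag.startswith(prefix):  # e.g. mysoftware-1.0.0
            tag = tag[len(prefix):]
        if tag.startswith("v"):  # e.g. v1.0.0
            tag = tag[1:]
        return tag
    for tag in tags:
        match = re.search(tag_pattern, tag)
        if match:
            # First capture group if any, otherwise the whole match.
            return match.group(1) if match.groups() else match.group(0)
    return None


print(version_from_tags(["v2.1.0", "v2.0.0"], "flacon"))
print(version_from_tags(["mysoftware-1.0.0"], "mysoftware"))
print(version_from_tags(
    ["nightly", "release-3.4", "release-3.3"],
    "example",
    r"^release-(\d+\.\d+)$",
))
```

When the pattern has no capture group, the whole tag match is returned as is, and a formatter can then be used to clean it up.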

github

Handles projects hosted on GitHub.

In addition to common Git parameters described above, this spider accepts the following parameters:

  • project, which is the name of the project that contains the repository. If not specified, the entry ID is used as project.
  • repository, which is the name of the target repository. If not specified, the entry ID is used as repository.

GitHub repositories may mark any given tag as the ‘latest release’, even if it is not really the latest tag. They usually use this feature to mark the latest stable release, which is the one that Version Tracker should find. By default, the spider evaluates this latest release tag first, and only if it does not match the specified tag parameter does the spider evaluate the rest of the tags.

If you would rather have the spider iterate tags normally by date, regardless of which tag is marked as the latest release, set the ignore-latest parameter of the spider to true.

kdegit

Handles projects hosted on KDE’s Git repositories.

In addition to common Git parameters described above, this spider accepts the project parameter, which is the name of the target project. If not specified, the entry ID is used as project.

gitlab

Handles projects hosted on gitlab.com or on a self-hosted installation of GitLab.

In addition to common Git parameters described above, this spider accepts some additional parameters to indicate the location of the target repository.

  • If the repository is hosted on gitlab.com, use the following parameters:

    • project, which is the name of the project that contains the repository. If not specified, the entry ID is used as project.
    • repository, which is the name of the target repository. If not specified, the entry ID is used as repository.

    For example:

    "libaccounts-glib": {
        "spider": {
            "name": "gitlab",
            "project": "accounts-sso",
            "commit": "^Version (\\d+(\\.\\d+)+)$"
        }
    }
    
  • If the repository is on a self-hosted installation of GitLab, use the url parameter, which is the absolute URL of the target repository.

    For example:

    "telepathy-accounts-signon": {
        "spider": {
            "name": "gitlab",
            "url": "https://git.merproject.org/mer-core/telepathy-accounts-signon"
        }
    }
    

bitbucket

Handles projects hosted on Bitbucket.

In addition to common Git parameters described above, this spider accepts the following parameters:

  • project, which is the name of the project that contains the repository. If not specified, the entry ID is used as project.
  • repository, which is the name of the target repository. If not specified, the entry ID is used as repository.

cgit

Handles cgit servers.

In addition to common Git parameters described above, this spider accepts the url parameter, which is the URL to the home page of the target Git repository.

gitweb

Handles gitweb servers.

In addition to common Git parameters described above, this spider accepts the url parameter, which is the URL to the home page of the target Git repository.

gitserver

If you are unsure of whether a self-hosted Git server web interface is provided by cgit or gitweb, you may use the special gitserver spider name in the data.json file, and run the lint tool.

The lint tool checks the specified URL and changes gitserver to the right value.

Project Hosting Service Spiders

In addition to SourceForge and Git-based services, we provide built-in spiders for some other project hosting services:

alioth

Handles software hosted on alioth.debian.org.

Supported parameters are:

  • project: ID of the project, as it appears on the URL of the project page. If not specified, the entry ID is used.
  • package: Name of the target package, as it appears in the ‘Latest File Releases’ table of the project page. If not specified, the entry ID is used.

Example:

"ccid": {
    "spider": {
        "name": "alioth",
        "project": "pcsclite"
    }
}

hackage

Handles Haskell packages published on Hackage.

You can use the package parameter to specify the name of the package. Otherwise, the entry ID is used as package name.

Example:

"sha": {
    "spider": {
        "name": "hackage",
        "package": "SHA"
    }
}

launchpad

Handles projects hosted on Launchpad.

You can use the project parameter to specify the name of the project. Otherwise, the entry ID is used as project name.

Example:

"cuneiform": {
    "spider": {
        "name": "launchpad",
        "project": "cuneiform-linux"
    }
}

opendesktop

Handles projects hosted on any of the openDesktop.org sites.

Use the site field to indicate the mid-level domain (between www and .org, e.g. linux-games or kde-look) of the target site. You may omit the field altogether if the target software is hosted on linux-apps.com.

You must use the project parameter to specify the numeric project code, which is the one displayed in the URL of a project.

For example, the following entry tracks the software hosted at https://www.kde-look.org/p/998890/ :

"bumblebee-indicator": {
    "spider": {
        "name": "opendesktop",
        "site": "kde-look",
        "project": "998890"
    }
}

pypi

Handles Python packages published on PyPI.

You can use the package parameter to specify the name of the package. Otherwise, the entry ID is used as package name.

Example:

"python-bugzilla": {
    "spider": {
        "name": "pypi"
    }
}

Software Collection Spiders

We also provide spiders for software collections hosted on their own sites.

4k

Handles software from 4K Download.

No parameters are required. You simply need to ensure that the entry ID matches the package name as it appears in the filename of any download.

Example:

"4kslideshowmaker": {
    "spider": {
        "name": "4k"
    }
}

aqbanking

Handles software from AqBanking.

It requires a package parameter, which must contain the code of the target package. For example:

"gwenhywfar": {
    "spider": {
        "name": "aqbanking",
        "package": "01"
    }
}

To find out the code of a package, go to the downloads page and hover over the header with the name of the target package, which is a link. Your browser should show, usually in the bottom-left corner, the URL where the link is pointing:

[Screenshot: browser status bar showing the URL of the hovered link]

In the URL you should be able to see the code. For example, the following URL points to a package whose code is 03:

http://www.aquamaniac.de/sites/download/packages.php?package=03&showall=1

Generic Spiders

These spiders allow you to generate version information for very different software products using generic web-scraping techniques.

text

Allows you to read the version and the release date from a plain text file.

Use the url parameter to indicate the location of the target plain text file.

You should then use a formatter to indicate how to capture the version and the release date in the text.

If the date cannot be extracted from the text, your spider configuration must include a no-date field set to true.

Example:

"graphviz": {
    "spider": {
        "name": "text",
        "url": "https://raw.githubusercontent.com/ellson/graphviz/master/ChangeLog"
    },
    "formatter": [
        "(?:^|\n)\\w+ \\d+, \\d{4}\n\\s+- Release (.*)\n",
        "(?:^|\n)(\\w+ \\d+, \\d{4})\n\\s+- Release .*\n"
    ]
}

xpath

Allows you to read the version, the release date and the URL from an XML or HTML file using XPath expressions.

Because Version Tracker is built on top of Scrapy, your XPath expressions can use any of the EXSLT extensions that Scrapy supports.

Use the url parameter to indicate the location of the target document.

Use the version parameter to indicate an XPath expression to obtain the latest stable version of the target software.

Use the date parameter to indicate an XPath expression to obtain the release date of the captured version.

In both cases, focus on capturing text containing the required data. You can use a formatter later to extract the actual version or date.

To capture the release date, you may replace the date field with a date-url field, which must be an XPath expression that captures a URL. In that case, the headers of that URL are requested and the value of the Last-Modified header is used as release date.
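The conversion from a Last-Modified header to a release date can be sketched with the Python standard library. The release_date_from_headers helper and the header value are assumptions made for this example; the sketch does not perform the actual HTTP request.

```python
from email.utils import parsedate_to_datetime


def release_date_from_headers(headers):
    """Return the date in a Last-Modified header, or None if it is missing."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    # HTTP dates use the RFC 5322 style format that email.utils can parse.
    return parsedate_to_datetime(value).date()


# Invented response headers.
HEADERS = {"Last-Modified": "Sat, 27 Aug 2016 10:15:00 GMT"}

header_date = release_date_from_headers(HEADERS)
print(header_date.isoformat())
```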

The url field is used as the reference URL of the captured software version by default. Alternatively, you may include a url-xpath field in your spider configuration that contains an XPath expression pointing to a different URL to use as reference URL.

URLs captured by XPath expressions in the date-url or url-xpath fields may be relative URLs. The xpath spider automatically resolves them to absolute URLs.
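Resolving a captured URL against the URL of the document behaves like urljoin from the Python standard library. The captured values below are invented, and whether the xpath spider uses urljoin itself is an assumption.

```python
from urllib.parse import urljoin

# URL of the document, as set in the url parameter of the spider.
PAGE_URL = "http://boinc.berkeley.edu/download_all.php"

# Invented values that an XPath expression might capture.
relative_to_page = urljoin(PAGE_URL, "dl/boinc.tar.gz")
relative_to_host = urljoin(PAGE_URL, "/dl/boinc.tar.gz")
already_absolute = urljoin(PAGE_URL, "https://example.org/boinc.tar.gz")

print(relative_to_page)
print(relative_to_host)
print(already_absolute)
```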

If all XPath expressions have a common root (i.e. start with the same text), you may define a base field containing that text, so that it is automatically prepended to any other configured XPath expression.

If the base XPath expression does not need any additional text for the version or date fields, you may omit those fields altogether. If date-url or url-xpath are to be used, though, you need to define them with an empty string as value, otherwise url and date are used instead.

If the date cannot be extracted from the document and you are using base (e.g. for both version and url-xpath), your spider configuration must include a no-date field set to true.

All XPath expressions must be written using XPath 1.0. However, an additional function is supported (re:test()) to capture only elements matching a given regular expression.

Example:

"boinc": {
    "spider": {
        "name": "xpath",
        "url": "http://boinc.berkeley.edu/download_all.php",
        "base": "//*[contains(text(),'Recommended version')]/parent::*/td",
        "version": "[1]",
        "date": "[4]"
    }
}
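To make the effect of the base field concrete, the following Python sketch computes the full XPath expressions that the boinc entry above would result in. The effective_xpaths helper is hypothetical and only mirrors the rules described in this section; it is not the code of the xpath spider.

```python
def effective_xpaths(spider):
    """Combine the base field with the other XPath fields of an xpath spider."""
    base = spider.get("base", "")
    # version may be omitted when base alone captures the version.
    xpaths = {"version": base + spider.get("version", "")}
    if "date-url" in spider:
        xpaths["date-url"] = base + spider["date-url"]
    elif not spider.get("no-date", False):
        xpaths["date"] = base + spider.get("date", "")
    if "url-xpath" in spider:
        xpaths["url-xpath"] = base + spider["url-xpath"]
    return xpaths


# Spider configuration of the boinc entry.
BOINC = {
    "name": "xpath",
    "url": "http://boinc.berkeley.edu/download_all.php",
    "base": "//*[contains(text(),'Recommended version')]/parent::*/td",
    "version": "[1]",
    "date": "[4]",
}

boinc_xpaths = effective_xpaths(BOINC)
for field, expression in boinc_xpaths.items():
    print(field, expression)
```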

Beyond Built-in Spiders

If none of the built-in spiders works for the entry you are defining:

Formatters

In addition to a spider, every entry may define a formatter, which is used to further process the retrieved version and date fields.

The value of the formatter field may be either a string, to format the version field, or an array containing two strings, the first string to format the version field and the second string to format the date field.

A formatter string defines a regular expression pattern that is searched in the target field (version or date) returned by the spider.

Version

To format the version field you must specify a regular expression pattern that does one of the following:

  • Defines no capture groups and matches the software version.
  • Defines a single capture group which captures the software version.
  • Defines several (usually 2-4) capture groups. In that case, matched capture groups are joined together separated by dots.

Some named patterns exist for common scenarios, so that you can use them as value for the formatter string instead of the corresponding regular expression pattern:

Name   Pattern                                   Example
n      \d+
n.n    \d+(?:\.\d+)+
n_n    (\d+)(?:_(\d+))?(?:_(\d+))?(?:_(\d+))?    1_2_3 → 1.2.3

There is also a special name that you can use, date, which is not replaced by a regular expression pattern. It is replaced by the captured date in YYYYMMDD format.
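The version formatting rules can be sketched in Python. The format_version helper is hypothetical, the sample inputs are invented, and the sketch assumes that the pattern is searched anywhere in the retrieved text. The special date name is not covered.

```python
import re

# Named patterns from the table above.
NAMED_PATTERNS = {
    "n": r"\d+",
    "n.n": r"\d+(?:\.\d+)+",
    "n_n": r"(\d+)(?:_(\d+))?(?:_(\d+))?(?:_(\d+))?",
}


def format_version(raw, formatter):
    """Apply a version formatter string to the text returned by a spider."""
    pattern = NAMED_PATTERNS.get(formatter, formatter)
    match = re.search(pattern, raw)
    if match is None:
        return None
    # Keep only the groups that took part in the match, joined by dots.
    groups = [group for group in match.groups() if group is not None]
    return ".".join(groups) if groups else match.group(0)


print(format_version("cares-1_2_3", "n_n"))
print(format_version("v1.10.2-beta", "n.n"))
print(format_version("Version 4.4 Update 5", r"(\d+(?:\.\d+)*)(?: Update (\d+))?"))
```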

Date

To format the date field you must specify a regular expression pattern that does one of the following:

  • Defines no capture groups and matches the release date.
  • Defines a single capture group which captures the release date.
  • Defines 3 capture groups with the following names:
    • y: year
    • m: month
    • d: day

Some named patterns exist for common scenarios, so that you can use them as value for the formatter string instead of the corresponding regular expression pattern:

Name       Pattern                    Example
d mmm y    \d+\s+\w{3}\s+\d{4}        27 Aug 2016
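The date formatting rules can be sketched in a similar way. The format_date helper and the sample inputs are assumptions made for this example. When the pattern does not define the y, m and d groups, the sketch returns the matched text and leaves out the step that would turn that text into a date.

```python
import re
from datetime import date


def format_date(raw, pattern):
    """Apply a date formatter pattern to the text returned by a spider."""
    match = re.search(pattern, raw)
    if match is None:
        return None
    named = match.groupdict()
    if {"y", "m", "d"} <= named.keys():
        # Named groups give the year, month and day directly.
        return date(int(named["y"]), int(named["m"]), int(named["d"]))
    # Otherwise return the first capture group or the whole match as text.
    return match.group(1) if match.groups() else match.group(0)


# Invented link target containing a date in YYYYMMDD form.
HREF = "software_releases/source/tbb44_20160526oss_src.tgz"

named_date = format_date(HREF, r"(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})")
text_date = format_date("Released on 27 Aug 2016", r"\d+\s+\w{3}\s+\d{4}")
print(named_date, text_date)
```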

Example

"tbb": {
    "spider": {
        "name": "xpath",
        "url": "https://www.threadingbuildingblocks.org/download",
        "version": "//div[@id='download-now-version']",
        "date": "//a/@href[re:test(., 'software_releases/\\w+/tbb\\d+_\\d{8}')]"
    },
    "formatter": [
        "(\\d+(?:\\.\\d+)*)(?: Update (\\d+))?",
        "(?P<y>\\d{4})(?P<m>\\d{2})(?P<d>\\d{2})"
    ]
}

Lint Tool

You may run the tools/lint.py script to format versiontracker/data.json according to our own rules, which enforce a specific order of fields (e.g. the spider field always goes before the formatter field).

The tool does not remove unexpected data, but it does move any unexpected field after expected fields, and may arbitrarily alter the order of unexpected fields. If you are creating a new spider with new field names, you should update this tool so that it can handle your new fields as well.
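The ordering rule can be pictured with a few lines of Python. This is not the code of tools/lint.py: the sort_entry helper is hypothetical, the list of expected fields only contains the two fields named in this document, and sorting unexpected fields alphabetically is an arbitrary choice made for the sketch.

```python
# Expected top-level fields of an entry, in the enforced order.
EXPECTED_ORDER = ["spider", "formatter"]


def sort_entry(entry):
    """Return a copy of entry with expected fields first, in a fixed order."""
    expected = {key: entry[key] for key in EXPECTED_ORDER if key in entry}
    # Unexpected fields are kept, but moved after the expected ones.
    unexpected = {
        key: entry[key] for key in sorted(entry) if key not in EXPECTED_ORDER
    }
    return {**expected, **unexpected}


# Invented entry with fields out of order and two unexpected fields.
ENTRY = {
    "formatter": "n_n",
    "zeta": 1,
    "spider": {"name": "github"},
    "alpha": 2,
}

sorted_entry = sort_entry(ENTRY)
print(list(sorted_entry))
```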