Data files, APIs and discovery

Different users have different needs when it comes to accessing OCDS data.

Data on the Web Best Practices suggests that “Data should be available in multiple data formats” in order to increase the number of different users, tools and applications that can process the data.

Which formats are most important will depend on the priority use cases for each OCDS implementation, but you are encouraged to consider:

  • Bulk downloads - packaging together multiple releases or multiple records in one or more files for users to download and import into local tools.
  • Individual release and record downloads - providing a URI at which each release or record can be obtained. This is crucial for 4 ☆ data publication.
  • CSV and spreadsheet serializations - providing multiple releases or compiled records for download, enabling users to work with data directly in spreadsheet software or other tools.
  • API access - enabling interactive access to your data.

Bulk downloads

The release package and record package can provide bulk access to releases and records, respectively.

However, very large files can be difficult for users to download and process. The following section provides suggested good practices which will assist users in accessing data. These are not requirements of the standard, but are based on experiences of maximizing the number of users able to work with datasets with their existing hardware and software.

File size limits

When generating packages for bulk download, apply the following limits:

  • Unzipped OCDS packages ought to not exceed 1 GB each;
  • Zipped OCDS packages ought to not exceed 10 MB each;
  • A single OCDS package ought to contain a maximum of 250,000 awards or contracts.

When a file is likely to exceed one of these limits, releases or records ought to be split across multiple files. Dynamically generated bulk downloads do not have to apply these limits, though publishers ought to consider ways to advise users when a query is likely to generate a very large file.
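As an illustration of the item-count limit, the following is a minimal sketch of splitting a sequence of releases into chunks before packaging. It assumes that awards and contracts are counted together, which is a conservative reading of the limit, and the function names are hypothetical rather than part of any OCDS tooling.

```python
from typing import Iterable, Iterator

# Suggested maximum number of awards or contracts per package.
MAX_ITEMS = 250_000


def count_items(release: dict) -> int:
    """Count the awards and contracts in one release (assumption: counted together)."""
    return len(release.get("awards", [])) + len(release.get("contracts", []))


def chunk_releases(releases: Iterable[dict], max_items: int = MAX_ITEMS) -> Iterator[list]:
    """Yield lists of releases, starting a new list before the limit would be passed."""
    chunk, total = [], 0
    for release in releases:
        n = count_items(release)
        # Close the current chunk if adding this release would exceed the limit.
        if chunk and total + n > max_items:
            yield chunk
            chunk, total = [], 0
        chunk.append(release)
        total += n
    if chunk:
        yield chunk
```

A single release that on its own exceeds the limit still ends up in a chunk by itself, since a release cannot be split. Byte-size limits would need a separate check on the serialized output.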

Segmenting files

When the suggested limits entail publication of multiple files, publishers ought to consider ways to split data across available files.

For releases, publishers can:

  1. Segment by release date - placing all the releases from a given day, month or year in the same file;
  2. Segment by contracting process identifier - placing all the releases related to a given set of contracting process identifiers together in the same package.

For records, publishers can segment by the first release date associated with a contracting process, or by contracting process identifier.

Following these approaches will avoid releases and records ‘jumping’ between files when the bulk files are updated.
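Segmenting by release date can be sketched as follows. This example groups releases by month, using the first seven characters (YYYY-MM) of each release's ISO 8601 date field. The file naming pattern is a hypothetical choice, not something the standard prescribes.

```python
from collections import defaultdict


def segment_by_month(releases):
    """Group releases by the YYYY-MM prefix of their 'date' field."""
    segments = defaultdict(list)
    for release in releases:
        month = release["date"][:7]  # e.g. "2024-03" from "2024-03-15T10:00:00Z"
        segments[month].append(release)
    return dict(segments)


def filename_for(month):
    """Hypothetical naming pattern for one monthly file."""
    return f"releases-{month}.json"
```

Because a release's date does not change after publication, a release stays in the same monthly file however often the bulk files are regenerated.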

Compression

OCDS packages can be compressed in order to save on disk space and bandwidth.

If compressing packages, publishers ought to use the ZIP file format.
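A minimal sketch of writing a package to a ZIP archive with Python's standard zipfile module is shown here. The function name and the default member name package.json are illustrative assumptions.

```python
import json
import zipfile


def write_zipped_package(package: dict, zip_path: str, member_name: str = "package.json") -> None:
    """Serialize a package as JSON and store it as one member of a ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, json.dumps(package))
```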

Serving files

The web server providing access to bulk files ought to correctly report the HTTP Last-Modified header so that consuming applications only need to download updated files.
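From the consumer's side, a correct Last-Modified header allows conditional requests. The sketch below, using only the Python standard library, sends the previously seen Last-Modified value back in an If-Modified-Since header and treats a 304 response as "no change". The function names are hypothetical, and the approach assumes the server honors conditional requests.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    """Build a GET request that asks for the file only if it changed."""
    headers = {}
    if last_modified:
        # Echo back the Last-Modified value from the previous download.
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_updated(url, last_modified=None):
    """Return (content, new_last_modified), or (None, last_modified) if unchanged."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep the local copy
            return None, last_modified
        raise
```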

Individual releases and records

Achieving 4 ☆ OCDS publication requires each release and record to be accessible at a permanent URI. This can be achieved by:

(a) Archiving a single-release release package for each release to a web-accessible file system as it is created, and then regularly merging these releases to compile individual record files in the same file system. One approach might be to have a folder for each ocid and to place the releases and record related to that process into that folder.

(b) Providing access to releases and records through an API.

Note that approach (b) requires the API to maintain a full revision history of each contracting process, so that releases from each stage of a contracting process can be provided.

Publishers ought to consider how to ensure URIs can remain stable, even across a change of systems.
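For the file-system approach (a), one possible folder-per-ocid layout is sketched below. The directory names (releases, record.json) are hypothetical choices. Identifiers are percent-encoded so that characters such as "/" cannot alter the path.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def release_path(root, ocid, release_id):
    """Path of the single-release package for one release (hypothetical layout)."""
    folder = PurePosixPath(root) / quote(ocid, safe="")
    return folder / "releases" / f"{quote(release_id, safe='')}.json"


def record_path(root, ocid):
    """Path of the compiled record for one contracting process (hypothetical layout)."""
    return PurePosixPath(root) / quote(ocid, safe="") / "record.json"
```

Deriving paths only from the ocid and release id, rather than from details of the source system, is one way to help URIs stay stable across a change of systems.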

Flattened serializations

The serialization page provides details of how to generate ‘flat’ versions of OCDS data for use in spreadsheet software.

The same principles discussed for bulk files above ought to be applied to CSV or Excel downloads of data.

Discovery and APIs

There are many thousands of organizations that ought to be able to publish open contracting data. As a result, maintaining a central registry of all published data is impractical. Instead, OCDS proposes a common pattern for the discovery of releases and records.

For the discovery of bulk datasets, and the location of any data feeds, we propose use of a data.json file.

For the discovery of individual releases and records, we propose use of Atom feeds.

Dataset and feed discovery

Publishers can provide a data.json document describing the location of all of the bulk OCDS files available for download.

This ought to follow the structure proposed by the US Project Open Data with:

  • Each record containing a distribution block with at least one accessURL pointing to OCDS data.
  • Each record containing ‘open-contracting-data’, and either ‘open-contracting-release’ or ‘open-contracting-record’ in the keyword array.
  • Each record having accessLevel set appropriately.

In addition, the data.json document can contain one or more records with the keyword ‘open-contracting-feed’ and either ‘open-contracting-release’ or ‘open-contracting-record’. Each such record points to an Atom feed document via an accessURL in its distribution block.
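To show how a consuming application might use these keywords, here is a minimal sketch that scans dataset entries and collects the accessURL values for bulk release or record data. It assumes the entries have already been loaded from a data.json document as a list of dictionaries. The function name is hypothetical, and other fields in the Project Open Data schema are ignored.

```python
def ocds_access_urls(entries, kind):
    """Collect accessURLs of entries tagged as OCDS data of the given kind.

    kind is "release" or "record".
    """
    wanted = {"open-contracting-data", f"open-contracting-{kind}"}
    urls = []
    for entry in entries:
        # An entry qualifies only if it carries both keywords.
        if wanted.issubset(entry.get("keyword", [])):
            for distribution in entry.get("distribution", []):
                if "accessURL" in distribution:
                    urls.append(distribution["accessURL"])
    return urls
```

For example, with placeholder URLs:

```python
entries = [
    {
        "title": "Releases 2024",
        "keyword": ["open-contracting-data", "open-contracting-release"],
        "accessLevel": "public",
        "distribution": [{"accessURL": "https://example.org/releases-2024.json"}],
    },
]
urls = ocds_access_urls(entries, "release")
```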

Feeds

Publishers exposing individual records and releases, or regularly updated packages in small sets, ought to provide one or more Atom feeds that index these, and provide an easy mechanism for users to discover recently published or updated releases and records.

The link to the release or record ought to be provided via a <link> tag, and the updated date of the entry ought to reflect the updated date of that release or record. The <id> ought to reflect the release id for releases, or the ocid for records.

The release.tag ought to be contained within a <category> element of the feed.

Feeds requiring pagination ought to follow the approach set out in RFC 5005.
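The mapping from a release to an Atom entry can be sketched with Python's standard xml.etree.ElementTree module. This sketch makes several assumptions: each value in release.tag becomes the term attribute of its own <category> element, the release date is used as the updated date, and the release id doubles as the entry title, since Atom entries need a title and this guidance does not specify one. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def release_entry(release, url):
    """Build an Atom <entry> element describing one release."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = release["id"]
    # Assumption: the release id is reused as the title.
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = release["id"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = release["date"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=url)
    for tag in release.get("tag", []):
        ET.SubElement(entry, f"{{{ATOM_NS}}}category", term=tag)
    return entry
```

Calling ET.tostring(entry, encoding="unicode") on the result gives the serialized entry for inclusion in a feed document.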

Well Known

Future implementations of OCDS will explore use of the /.well-known/ protocol to declare a location for storing a data.json file.

At present, such files can be hosted anywhere, and consuming applications pointed towards them manually.

The data.json structure has been chosen for two reasons: it lets organizations already following this approach include datasets tagged ‘open-contracting-data’ within their existing data discovery mechanisms, and a plugin is available for the widely used CKAN platform that supports the exposure of data.json files.

Linking data

For 5 ☆ publication of OCDS data, publishers should seek to use URIs in their datasets, linking to other machine-readable data sources at an entity-by-entity level.