Data files and APIs

Tip

Did you arrive at this page looking to download OCDS data? Check out the OCP Data Registry.

Different users have different needs when it comes to accessing OCDS data. A common need is to automate the download of all the OCDS data, in some form.

The W3C Data on the Web Best Practices describe ways to increase the number of users, tools and applications that can access and make effective use of a given dataset.

With respect to your OCDS publication, which best practices are most important will depend on the particular needs of your users, but you are encouraged to consider:

To meet the basic criteria for data quality, it ought to be possible for a user to automate the download of all the data, either using an HTML page listing bulk download URLs, or using only machine-readable data as input.

Bulk downloads

The release package and record package can provide bulk access to releases and records, respectively.

However, very large files can be difficult for users to download and process. The following section suggests good practices to assist users in accessing data. These are not requirements of the standard, but are based on experiences of maximizing the number of users able to work with datasets with their existing hardware and software.

File size limits

When generating packages for bulk download, apply the following limits:

  • Unzipped OCDS packages ought not to exceed 1 GB each.

  • Zipped OCDS packages ought not to exceed 10 MB each.

  • A single OCDS package ought to contain a maximum of 250,000 awards or contracts.

When a file is likely to exceed one of these limits, releases or records ought to be split across multiple files. Dynamically generated bulk downloads do not have to apply these limits, though publishers ought to consider ways to advise users when a query is likely to generate a very large file.
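As a rough illustration of applying a size limit, the sketch below groups releases into chunks whose serialized size stays under a byte budget. The function name `split_by_size` is hypothetical, and the size estimate ignores package metadata, so treat it as an approximation. In practice, a size check like this would be combined with a stable segmentation rule, such as those described under "Segmenting files".

```python
import json


def split_by_size(releases, max_bytes):
    """Group releases into chunks whose serialized size stays under max_bytes.

    A release that is larger than max_bytes on its own gets its own chunk.
    """
    chunks, current, current_size = [], [], 0
    for release in releases:
        size = len(json.dumps(release).encode("utf-8"))
        # Start a new chunk if adding this release would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(release)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be wrapped in its own release package.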

Segmenting files

When the suggested limits entail publication of multiple files, publishers ought to consider ways to split data across available files.

For releases, publishers can:

  1. Segment by release date - placing all the releases from a given day, month or year in the same file;

  2. Segment by contracting process identifier - placing all the releases related to a given set of contracting process identifiers together in the same package.

For records, publishers can segment by the first release date associated with a contracting process, or by contracting process identifier.

Following these approaches will avoid releases and records 'jumping' between files when the bulk files are updated.
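A minimal sketch of segmenting by release date, assuming each release carries an ISO 8601 `date` string (as OCDS releases do) and that grouping by the year-month prefix of that string is acceptable. The function name `segment_by_month` is hypothetical.

```python
from collections import defaultdict


def segment_by_month(releases):
    """Group releases by the YYYY-MM prefix of their ISO 8601 `date` field."""
    segments = defaultdict(list)
    for release in releases:
        # "2023-01-15T00:00:00Z"[:7] == "2023-01"
        segments[release["date"][:7]].append(release)
    return dict(segments)
```

Because a release's date does not change after publication, a release stays in the same monthly file each time the bulk files are regenerated. Note that the prefix is taken from the date as written, without converting between time zones.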

Compression

OCDS packages can be compressed in order to save on disk space and bandwidth.

When compressing packages, use ZIP or GZIP, as these are commonly available, often without additional software. Avoid RAR, which requires additional software.
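For illustration, a package can be compressed with GZIP using only the Python standard library. The helper names below are hypothetical. The sketch works on bytes in memory, and the result could be written to a `.json.gz` file.

```python
import gzip
import json


def compress_package(package):
    """Serialize a package to UTF-8 JSON and compress it with GZIP."""
    return gzip.compress(json.dumps(package).encode("utf-8"))


def decompress_package(data):
    """Reverse compress_package: decompress and parse the JSON."""
    return json.loads(gzip.decompress(data).decode("utf-8"))
```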

Serving files

The web server providing access to bulk files ought to report the HTTP Last-Modified header correctly, so that consuming applications only need to download updated files.

Also, publishers ought to ensure that the data export is completed successfully, i.e. that no files were truncated.
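From the consumer's side, these two points might be checked as sketched below. The function names are hypothetical. The first compares the value of a Last-Modified header with the time of the last successful download, and the second treats a file that fails to parse as JSON as incomplete.

```python
import json
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_download(last_modified_header, last_downloaded):
    """Return True if the server's copy is newer than the local copy."""
    return parsedate_to_datetime(last_modified_header) > last_downloaded


def is_complete_json(text):
    """Return False if the text is not valid JSON, e.g. because it was truncated."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Example: the local copy was fetched at the start of 2024 (UTC).
last_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
```

A publisher can run the same parse check on each exported file before making it available.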

Individual releases and records

Each release and record can be made accessible at a permanent URL. This can be achieved by:

  • Archiving a ‘single-release’ release package for each release to a web-accessible file system as it is created, and then regularly merging these releases into records in the same file system. One approach might be to have a directory for each ocid and to put the releases and the record related to that process in that directory.

  • Providing access to each release and the merged records via an API.

Publishers ought to consider how to ensure that URLs remain stable, even across a change of systems.

API access

API design is a deep topic. As such, the following guidance is not intended to be comprehensive or prescriptive. Wherever possible, publishers ought to carry out their own user research.

Discoverability

Ensure that the API endpoints and documentation are discoverable. For example, add a link to the footer of your procurement portal, and list the API endpoints in your open data portal.

Documentation

Provide API documentation, with at least the lists of endpoints, methods and parameters. Many publishers use Swagger to document their APIs.

Access control and rate limiting

Avoid adding access controls (like user registration or API keys), in order to maximize the ease of access to the publication.

If access controls are necessary, avoid access tokens that need to be refreshed frequently. For example, requiring a refresh every two hours is too frequent.

If the API implements rate limits (throttling):

Completeness

Ensure that all the OCDS data can be accessed via the API. For example, if the data source is an Elasticsearch index, either implement pagination using the search_after parameter, or ensure that index.max_result_window is large enough to return all results.
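The search_after approach can be sketched as follows. This is not tied to a particular client library: `search` is a hypothetical callable that takes a request body and returns the parsed Elasticsearch response, and the `date` and `id` sort fields are assumptions about the index mapping.

```python
def iterate_hits(search, page_size=100):
    """Yield every hit by paging through results with search_after."""
    body = {
        "size": page_size,
        # search_after needs a deterministic sort; the second field is
        # an assumed unique tiebreaker.
        "sort": [{"date": "asc"}, {"id": "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            break
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]
```

Unlike from/size pagination, this loop is not capped by index.max_result_window, so it can reach every document in the index.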

Endpoints

Your design choices in this area ought to be informed by user research. That said, you can consider providing:

  • A release package endpoint with pagination, for retrieving multiple releases without providing a specific OCID and release ID

  • A record package endpoint with pagination, for retrieving multiple records without providing a specific OCID

  • A release endpoint, for retrieving an individual release by OCID and release ID

  • A record endpoint, for retrieving an individual record by OCID

For package endpoints, you can also provide filtering and ordering options. In particular, consider date filters and/or a reverse chronological order option, so that users can retrieve only new or updated data.

If you choose to provide endpoints for retrieving individual records and/or releases but not endpoints for paginating through records and/or releases, then you need to provide a machine-readable list of OCIDs and/or release IDs. Otherwise, it will not be possible to automate the download of all the data, which is a basic criterion for data quality.

Response format

  • Put the release, record or package at the top level of the JSON data. For example, do not embed it under a results array.

  • Use a JSON library instead of implementing JSON serialization yourself. A library produces valid JSON and handles string escaping, which makes it straightforward to ensure that the response is encoded as UTF-8.

  • Remove NULL characters (\u0000) from the JSON response. These characters cannot be imported by users into some SQL databases.

  • If results cannot be returned, use an appropriate HTTP error code (400-599); do not return a JSON object with an error message and a 200 HTTP status code. That said, if a search request returns no results, it is appropriate to use a 200 HTTP status code, with an empty result set.
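A minimal sketch of the first three points, using Python's standard json module. The helper names `strip_nulls` and `render_package` are hypothetical, and the HTTP status handling is left to whichever web framework is in use.

```python
import json


def strip_nulls(value):
    """Recursively remove NULL characters (U+0000) from all strings."""
    if isinstance(value, str):
        return value.replace("\u0000", "")
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    if isinstance(value, dict):
        return {strip_nulls(k): strip_nulls(v) for k, v in value.items()}
    return value


def render_package(package):
    """Serialize the package itself as the top-level JSON value, in UTF-8."""
    return json.dumps(strip_nulls(package), ensure_ascii=False).encode("utf-8")
```

The bytes returned by `render_package` form the whole response body, with no wrapper object around the package.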

Pagination

To support pagination, the top-level links object in release packages and record packages has two fields:

  • next: A URL to the next sequential package

  • prev: A URL to the previous sequential package
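From a user's perspective, the links object allows a simple download loop, sketched below. `fetch` is a hypothetical callable that takes a URL and returns the parsed JSON package. It is injected so that the sketch does not depend on a particular HTTP library.

```python
def iterate_packages(fetch, url):
    """Yield each package, following links.next until it is absent."""
    while url:
        package = fetch(url)
        yield package
        # Stop when the package has no links object or no next URL.
        url = package.get("links", {}).get("next")
```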

Ensure that the API's performance does not degrade on deep pages. For example, if the data source is a SQL database, use the seek method (also known as keyset pagination) rather than an OFFSET clause.

When using the seek method, you can use either of these query string parameters to construct the next and/or prev URLs:

  • cursor=CURSOR, to return a page of results that are positioned after the cursor, in sequential order. The cursor might be a row ID or similar.

  • since=TIMESTAMP, to return a page of results that are modified after the since timestamp, in chronological order.
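The cursor variant of the seek method can be sketched with SQLite. The table name `release` and its columns are assumptions for illustration only. The point is that the query filters on an indexed column instead of skipping rows with OFFSET.

```python
import sqlite3


def fetch_page(conn, cursor=0, limit=100):
    """Return (rows, next_cursor) for the rows whose id is greater than cursor.

    next_cursor is None when the page is not full, i.e. there is no more data.
    """
    rows = conn.execute(
        "SELECT id, data FROM release WHERE id > ? ORDER BY id LIMIT ?",
        (cursor, limit),
    ).fetchall()
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor
```

The returned `next_cursor` is the value that would go in the cursor=CURSOR parameter of the next URL. With this design, a final page that happens to be exactly full is followed by one empty page.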

When using the offset method, you can use either of these query string parameters to construct the next and/or prev URLs:

  • offset=NUMBER, to return a page of results that are positioned after the offset number, in sequential order. Use offset=0 for the first page.

  • page=NUMBER, to return a page of results that are positioned at the page number, in sequential order. Use page=1 for the first page, not page=0.

In either case:

  • Use limit=NUMBER, to limit the number of results returned on each page.

  • Include the total number of results across all pages.

In addition to performance reasons, the seek method is preferred to the offset method when results are in reverse chronological order because, with the offset method:

  • A given page won't return the same results over time. page=1 will return different results today, next week, and next year.

  • Users can receive duplicate results while paginating. For example, if a new release is published to page 1 while users are paginating, then the result at the bottom of each page will be moved to the top of the following page.

  • It is harder for users to synchronize with the API. With the seek method, users can retrieve new results by submitting the timestamp or ID of their last request. With the page parameter, users need to determine which results are new or old.
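The duplicate-results problem can be reproduced with a small in-memory simulation. The release IDs and the helper `offset_page` are made up for illustration.

```python
def offset_page(items, page, limit):
    """Return one page of items using page-number (offset) pagination."""
    start = (page - 1) * limit
    return items[start:start + limit]


# Results in reverse chronological order (newest first).
results = ["r5", "r4", "r3", "r2", "r1"]

first = offset_page(results, 1, 2)   # the user fetches page 1
results.insert(0, "r6")              # a new release is published meanwhile
second = offset_page(results, 2, 2)  # the user fetches page 2

# The last result of page 1 reappears at the top of page 2.
duplicates = set(first) & set(second)
```

With a since=TIMESTAMP or cursor=CURSOR parameter, the second request would be anchored to the last result already received, so the newly published release would not shift the page boundaries.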

Monitoring

Set up error monitoring, so that if a request causes an HTTP 500 Internal Server Error, you can investigate.

CSV and spreadsheet serializations

The serialization page provides details of how to generate 'flat' versions of OCDS data for use in spreadsheet software.

The same principles discussed for bulk files above ought to be applied to CSV or Excel downloads of data.