Data files and APIs

Hint

Did you arrive at this page looking to download OCDS data? Check out the OCP Data Registry.

Different users have different needs when it comes to accessing OCDS data. A common need is to automate the download of all the OCDS data, in some form.

The W3C Data on the Web Best Practices describe ways to increase the number of users, tools and applications that can access and make effective use of a given dataset.

With respect to your OCDS publication, which best practices are most important will depend on the particular needs of your users, but you are encouraged to consider:

To meet the basic criteria for data quality, it ought to be possible for a user to automate the download of all the data, either using an HTML page listing bulk download URLs, or using only machine-readable data as input.

Bulk downloads

The release package and record package can provide bulk access to releases and records, respectively.

However, very large files can be difficult for users to download and process. The following section suggests good practices to assist users in accessing data. These are not requirements of the standard, but are based on experiences of maximizing the number of users able to work with datasets with their existing hardware and software.

File size limits

When generating packages for bulk download, apply the following limits:

  • Unzipped OCDS packages ought not to exceed 1 GB each.

  • Zipped OCDS packages ought not to exceed 10 MB each.

  • A single OCDS package ought to contain a maximum of 250,000 awards or contracts.

When a file is likely to exceed one of these limits, the releases or records ought to be split across multiple files. Dynamically generated bulk downloads do not need to apply these limits, although publishers ought to consider ways to warn users when a query is likely to generate a very large file.
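As an illustration of the last limit, the following is a minimal sketch of how releases might be batched so that no package exceeds a maximum number of awards or contracts. The function name `split_releases` is hypothetical, and counting awards and contracts separately is one possible reading of the limit, not something the standard prescribes.

```python
def split_releases(releases, max_items=250_000):
    """Yield batches of releases for separate packages.

    Assumption: a batch is closed as soon as adding the next release would
    push either its award count or its contract count above max_items.
    """
    batch, awards, contracts = [], 0, 0
    for release in releases:
        a = len(release.get("awards", []))
        c = len(release.get("contracts", []))
        # Start a new batch if this release would exceed either limit.
        if batch and (awards + a > max_items or contracts + c > max_items):
            yield batch
            batch, awards, contracts = [], 0, 0
        batch.append(release)
        awards += a
        contracts += c
    if batch:
        yield batch
```

Each batch can then be wrapped in its own release package. A similar check on the serialized size in bytes would be needed to respect the size limits.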

Segmenting files

When the suggested limits lead to publishing multiple files, publishers ought to consider how to segment the data across those files.

For releases, publishers can choose to:

  1. Segment by releaseDate - placing all the releases from a given day, month or year in the same file;

  2. Segment by contracting process identifier - placing all the releases related to a given set of contracting process identifiers in the same package.

For records, it is preferable for publishers to segment the data either by the first releaseDate associated with a contracting process, or by contracting process identifier.

Following these approaches will avoid releases and records 'jumping' between files when the bulk files are updated.
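The segmentation approaches above can be sketched as follows. This is a minimal illustration, assuming each release carries its release date in a `date` field as an ISO 8601 string, that all dates use the same format and timezone so they compare correctly as text, and that each record lists its releases with their dates. The function names are hypothetical.

```python
from collections import defaultdict


def segment_releases_by_month(releases):
    """Group releases by the YYYY-MM prefix of their release date."""
    segments = defaultdict(list)
    for release in releases:
        segments[release["date"][:7]].append(release)
    return dict(segments)


def segment_records_by_first_month(records):
    """Group records by the month of their earliest release.

    The earliest release date never changes, so a record stays in the
    same file when later releases are added to it.
    """
    segments = defaultdict(list)
    for record in records:
        first_date = min(r["date"] for r in record["releases"])
        segments[first_date[:7]].append(record)
    return dict(segments)
```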

Compression

OCDS packages can be compressed in order to save storage space and bandwidth.

When compressing packages, use ZIP or GZIP, as these are commonly available, often without additional software. Avoid RAR, which requires additional software.
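For example, a package can be written and read as GZIP using only the Python standard library. This is a minimal sketch. The function names and file path are illustrative.

```python
import gzip
import json


def write_gzipped_package(package, path):
    """Serialize a package as UTF-8 JSON and compress it with GZIP."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(package, f, ensure_ascii=False)


def read_gzipped_package(path):
    """Decompress a GZIP file and parse it as JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```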

Serving files

The web server providing access to bulk files ought to report the HTTP Last-Modified header correctly, so that consuming applications only need to download updated files.

Also, publishers ought to ensure that the data export is completed successfully, i.e. that no files were truncated.
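From the consumer's side, these two points might look like the sketch below. The client sends an If-Modified-Since header built from the time of its last download, so that a server reporting Last-Modified correctly can answer 304 Not Modified for unchanged files. It then checks that a downloaded file parses as JSON, since a truncated export will not. The function names are hypothetical, and the parse check assumes the file is a single JSON document.

```python
import json
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_download_timestamp):
    """Build a GET request that lets the server skip unchanged files.

    last_download_timestamp is seconds since the Unix epoch.
    """
    request = urllib.request.Request(url)
    request.add_header(
        "If-Modified-Since", formatdate(last_download_timestamp, usegmt=True)
    )
    return request


def is_complete_json(text):
    """Return False if the text cannot be parsed, e.g. because it was truncated."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Publishers can run the same parse check on each file at the end of an export, before making it available.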

Individual releases and records

Each release and record can be made accessible at a permanent URL. This can be achieved by:

  • Archiving a ‘single-release’ release package for each release to a web accessible file system as it is created, and then regularly merging these releases into records in the same file system. One approach might be to have a directory for each ocid and to put the releases and the record related to that process in that directory.

  • Providing access to each release and the merged records via an API.

Publishers ought to consider how to ensure that URLs remain stable, even across a change of systems.

API access

API design is a deep topic. As such, the following guidance is not intended to be comprehensive or prescriptive. Wherever possible, publishers ought to carry out their own user research.

Discoverability

Ensure that the API endpoints and documentation are discoverable. For example, add a link to the footer of your procurement portal, and list the API endpoints in your open data portal.

Documentation

Provide API documentation, with at least the lists of endpoints, methods and parameters. Many publishers use Swagger to document their APIs.

Access control and rate limiting

Avoid adding access controls (like user registration or API keys), in order to maximize the ease of access to the publication.

If access controls are necessary, do not use access tokens that need to be refreshed regularly. For example, every two hours is too frequent.

If the API implements rate limits (throttling):

Completeness

Ensure that all the OCDS data can be accessed via the API. For example, if the data source is an Elasticsearch index, either implement pagination using the search_after parameter, or ensure that index.max_result_window is large enough to return all results.
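As an illustration of the search_after approach, the sketch below builds Elasticsearch search request bodies and reads the sort values of the last hit to request the following page. It assumes, hypothetically, that the index has sortable `date` and `id` fields which together order the documents uniquely. The helper names are not part of any library.

```python
def build_search_body(page_size, sort_values=None):
    """Build a search request body, resuming after sort_values if given."""
    body = {
        "size": page_size,
        # Assumption: these two fields give a stable, unique ordering.
        "sort": [{"date": "asc"}, {"id": "asc"}],
    }
    if sort_values is not None:
        body["search_after"] = sort_values
    return body


def next_sort_values(response):
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None
```

The API can keep requesting pages this way until a response contains no hits, regardless of the value of index.max_result_window.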

Endpoints

Your design choices in this area ought to be informed by user research. That said, you can consider providing:

  • A release package endpoint with pagination, for retrieving multiple releases without providing a specific OCID and release ID

  • A record package endpoint with pagination, for retrieving multiple records without providing a specific OCID

  • A release endpoint, for retrieving an individual release by OCID and release ID

  • A record endpoint, for retrieving an individual record by OCID

For package endpoints, you can also provide filtering and ordering options. In particular, consider date filters and/or a reverse chronological order option, so that users can retrieve only new or updated data.

If you choose to provide endpoints for retrieving individual records and/or releases but not endpoints for paginating through records and/or releases, then you need to provide a machine-readable list of OCIDs and/or release IDs. Otherwise, it will not be possible to automate the download of all the data, which is a basic criterion for data quality.

Response format

  • Put the release, record or package at the top-level of the JSON data. For example, do not embed it under a results array.

  • Use a JSON library instead of implementing JSON serialization yourself. This also guarantees that the encoding is UTF-8.

  • Remove NULL characters (\u0000) from the JSON response. These characters cannot be imported by users into some SQL databases.

  • If results cannot be returned, use an appropriate HTTP error code (400-599); do not return a JSON object with an error message and a 200 HTTP status code. That said, if a search request returns no results, it is appropriate to use a 200 HTTP status code, with an empty result set.
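The points above can be sketched without reference to any particular web framework. In this minimal example, the function names are hypothetical and each function returns a status code and a response body for the framework to send. The shape of the error body is an assumption.

```python
import json


def strip_nul(value):
    """Recursively remove NULL characters from every string in a JSON-like value."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value


def render_package(package):
    """Return the package itself as the top-level JSON value, encoded as UTF-8."""
    body = json.dumps(strip_nul(package), ensure_ascii=False).encode("utf-8")
    return 200, body


def render_error(status, message):
    """Return an error with a matching HTTP status code, never 200."""
    if not 400 <= status <= 599:
        raise ValueError("error responses need a 4xx or 5xx status code")
    return status, json.dumps({"error": message}).encode("utf-8")
```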

Pagination

To support pagination, the top-level links object in release packages and record packages has two fields:

  • next: A URL to the next sequential package

  • prev: A URL to the previous sequential package
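A consumer can download all the data by following the next links until none remains. The following is a minimal sketch, assuming a hypothetical `fetch` callable that takes a URL and returns the parsed JSON release package.

```python
def iter_releases(first_url, fetch):
    """Yield every release, following links.next from package to package."""
    url = first_url
    while url:
        package = fetch(url)
        yield from package.get("releases", [])
        # Stop when the package has no links object or no next URL.
        url = package.get("links", {}).get("next")
```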

Ensure that the API's performance does not degrade on deep pages. For example, if the data source is a SQL database, use the seek method (also known as keyset pagination) rather than an OFFSET clause.

When using the seek method, you can use either of these query string parameters to construct the next and/or prev URLs:

  • cursor=CURSOR, to return a page of results that are positioned after the cursor, in sequential order. The cursor might be a row ID or similar.

  • since=TIMESTAMP, to return a page of results that are modified after the since timestamp, in chronological order.
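As an illustration of the seek method with a cursor parameter, the sketch below uses SQLite from the Python standard library. The `release` table, its columns and the base URL are hypothetical. The query filters on the row ID rather than using an OFFSET clause, and fetches one extra row to learn whether a next page exists.

```python
import sqlite3
from urllib.parse import urlencode

BASE_URL = "https://example.com/api/releases"  # hypothetical endpoint


def fetch_page(connection, cursor=0, limit=100):
    """Return the rows after the cursor and a links object for the package."""
    rows = connection.execute(
        "SELECT id, ocid FROM release WHERE id > ? ORDER BY id LIMIT ?",
        (cursor, limit + 1),  # one extra row signals that a next page exists
    ).fetchall()
    has_next = len(rows) > limit
    rows = rows[:limit]
    links = {}
    if has_next:
        # The cursor for the next page is the ID of the last row returned.
        query = urlencode({"cursor": rows[-1][0], "limit": limit})
        links["next"] = f"{BASE_URL}?{query}"
    return rows, links
```

Because the database can use the index on `id` to jump straight to the cursor, the cost of a page does not grow with its depth.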

When using the offset method, you can use either of these query string parameters to construct the next and/or prev URLs:

  • offset=NUMBER, to return a page of results that are positioned after the offset number, in sequential order. Use offset=0 for the first page.

  • page=NUMBER, to return a page of results that are positioned at the page number, in sequential order. Use page=1 for the first page, not page=0.

In either case:

  • Use limit=NUMBER, to limit the number of results returned on each page.

  • Include the total number of results across all pages.

In addition to performance reasons, the seek method is preferred to the offset method when results are ordered in reverse chronology, because:

  • A given page won't return the same results over time. page=1 will return different results today, next week, and next year.

  • Users can receive duplicate results while paginating. For example, if a new release is published to page 1 while users are paginating, then the result at the bottom of each page will be moved to the top of the following page.

  • It is harder for users to synchronize with the API. With the seek method, users can retrieve new results by submitting the timestamp or ID of their last request. With page, users need to determine which results are new or old.
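The duplicate-result problem can be reproduced with a small simulation. In this sketch, releases are represented by integer IDs listed newest first, and both function names are hypothetical. A new release is published between the user's first and second requests.

```python
def page_by_offset(items, offset, limit):
    """Offset method: skip a fixed number of results."""
    return items[offset:offset + limit]


def page_by_seek(items, before_id, limit):
    """Seek method: return results older than the last one the user saw."""
    older = [i for i in items if before_id is None or i < before_id]
    return older[:limit]


releases = [5, 4, 3, 2, 1]  # newest first
first_offset_page = page_by_offset(releases, 0, 2)   # [5, 4]
first_seek_page = page_by_seek(releases, None, 2)    # [5, 4]

releases = [6] + releases  # a new release is published

# The offset method repeats release 4; the seek method does not.
second_offset_page = page_by_offset(releases, 2, 2)                # [4, 3]
second_seek_page = page_by_seek(releases, first_seek_page[-1], 2)  # [3, 2]
```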

Monitoring

Set up error monitoring, so that if a request causes an HTTP 500 Internal Server Error, you can investigate.

CSV and spreadsheet serializations

The serialization page provides details on how to generate "flattened" versions of OCDS data for use in spreadsheets.

The same principles discussed for bulk files above ought to be applied to downloads of data in CSV or Excel format.