Intro

The datahub stores datapackages (datasets). A datapackage consists of:

  • the Data: one or more files containing the data. These are usually tabular (csv, xls) but can be of any file type.
  • the Metadata: a descriptor file, named datapackage.json, that contains all the information you need to find and use the data

A detailed description of the datapackage format is available here: http://datahub.io/docs/data-packages

GET API

Get files via /r/ endpoint

If you know the name of the file you need in the dataset, use our /r/ endpoint, e.g. for data.csv:

GET https://datahub.io/<owner>/<dataset_name>/r/data.csv
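As a minimal sketch, the request above could be made from Python with the standard library alone. The helper names resource_url and fetch_resource, and the owner and dataset values in the usage line, are hypothetical placeholders for illustration and are not part of the datahub API.

```python
from urllib.request import urlopen

BASE_URL = "https://datahub.io"


def resource_url(owner, dataset_name, filename):
    """Build the /r/ URL for a named file in a dataset."""
    return f"{BASE_URL}/{owner}/{dataset_name}/r/{filename}"


def fetch_resource(owner, dataset_name, filename):
    """Send a GET request to the /r/ URL and return the raw bytes."""
    with urlopen(resource_url(owner, dataset_name, filename)) as response:
        return response.read()


# Hypothetical usage (placeholder owner and dataset names):
# csv_bytes = fetch_resource("example-owner", "example-dataset", "data.csv")
```

Building the URL in its own function keeps the path layout in one place, so it can be checked without touching the network.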

If you don't know the filename, or there are many files in the dataset, you can address the files by index instead, starting at 0:

GET https://datahub.io/<owner>/<dataset_name>/r/0.csv
GET https://datahub.io/<owner>/<dataset_name>/r/1.csv
GET https://datahub.io/<owner>/<dataset_name>/r/2.csv
...
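The index-based URLs can be generated in a loop. The sketch below assumes you already know how many files you want and that they are all requested as csv; the function name indexed_resource_urls and its count and extension parameters are hypothetical.

```python
def indexed_resource_urls(owner, dataset_name, count, extension="csv"):
    """Return the /r/ URLs for the files at indexes 0 .. count - 1."""
    base = f"https://datahub.io/{owner}/{dataset_name}/r"
    return [f"{base}/{index}.{extension}" for index in range(count)]


# Placeholder owner and dataset names, for illustration only.
for url in indexed_resource_urls("example-owner", "example-dataset", 3):
    print(url)
```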

The datahub.io path logic is described here: getting-data#perma-urls-for-data

Get the descriptor

To see the list of files in the dataset, or other metadata, get the descriptor (datapackage.json):

GET https://datahub.io/<owner>/<dataset_name>/datapackage.json
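A minimal sketch of fetching the descriptor and parsing it into a Python dict, using only the standard library. The helper names descriptor_url and fetch_descriptor and the placeholder owner and dataset values are assumptions made for illustration.

```python
import json
from urllib.request import urlopen


def descriptor_url(owner, dataset_name):
    """Build the URL of the dataset's datapackage.json."""
    return f"https://datahub.io/{owner}/{dataset_name}/datapackage.json"


def fetch_descriptor(owner, dataset_name):
    """Send a GET request for the descriptor and parse the JSON body."""
    with urlopen(descriptor_url(owner, dataset_name)) as response:
        return json.load(response)


# Hypothetical usage (placeholder owner and dataset names):
# descriptor = fetch_descriptor("example-owner", "example-dataset")
```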

You can then parse the descriptor to get the path of each resource (data file):

import json

with open('datapackage.json') as f:  # descriptor file saved locally
    descriptor = json.load(f)
for resource in descriptor['resources']:
    print(resource['name'], resource['path'])

With these paths you can then download every file in the dataset.
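One way to turn the resource paths into downloadable URLs is sketched below. It assumes that each resource has a single string path, and that a relative path is resolved against the location of datapackage.json while an absolute URL is used as-is. The function name resource_urls and the sample descriptor are hypothetical.

```python
from urllib.parse import urljoin


def resource_urls(descriptor, descriptor_url):
    """Map each resource name to a full URL for its data file."""
    urls = {}
    for resource in descriptor.get("resources", []):
        # urljoin leaves absolute URLs unchanged and resolves relative
        # paths against the directory that holds the descriptor.
        urls[resource["name"]] = urljoin(descriptor_url, resource["path"])
    return urls


# Hypothetical descriptor with one relative and one absolute path.
sample = {
    "resources": [
        {"name": "a", "path": "data/a.csv"},
        {"name": "b", "path": "https://example.com/b.csv"},
    ]
}
base = "https://datahub.io/example-owner/example-dataset/datapackage.json"
for name, url in resource_urls(sample, base).items():
    print(name, url)
```

Each resulting URL can then be fetched with an ordinary GET request.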

POST API

The easiest way to post data to the datahub is to use our CLI tool; instructions are here: publishing_data.

Here is the way to POST data from your application: