Ingest

`download_resource_data(resource, into, api_key)`

Downloads the file specified in the resource's url field to the given path and returns its SHA1 hash. If the URL is an upload URL (i.e. the URL of a file which is stored on this CKAN instance) then the API key will be used to ensure we have access. This allows private datasets to have resources ingested in the datastore before they are made public.

Parameters:

Name	Type	Description	Default
`resource`	`dict`	the resource dict	required
`into`	`Path`	the path to the file where the data should be put	required
`api_key`	`str`	the user's API key	required

Returns:

Type	Description
`str`	the hash of the downloaded file

Source code in ckanext/versioned_datastore/lib/importing/ingest.py

def download_resource_data(resource: dict, into: Path, api_key: str) -> str:
    """
    Downloads the file specified in the resource's url field to the given path and
    returns the SHA1 hash of it. If the url is an upload url (i.e. the URL of a file
    which is stored on this CKAN instance) then the API key will be used to ensure we
    have access. This allows private datasets to have resources ingested in the
    datastore before they are made public.

    :param resource: the resource dict
    :param into: the path to the file where the data should be put
    :param api_key: the user's API key
    :returns: the hash of the downloaded file
    """
    hasher = hashlib.sha1()
    # grab the resource's data via this URL
    url = toolkit.url_for(
        'resource.download',
        id=resource['package_id'],
        resource_id=resource['id'],
        qualified=True,
    )
    # include the auth header regardless of whether the resource URL is of a file hosted
    # by this CKAN instance, or another website. We need to do this as the
    # resource.download route is protected by auth. This is safe and won't leak creds
    # because if the resource's URL is another website, the resource.download route will
    # respond with a redirect to the other site and requests won't copy those headers on
    # to the request to the redirected URL
    headers = {'Authorization': api_key}
    with closing(requests.get(url, stream=True, headers=headers)) as r:
        r.raise_for_status()
        with into.open('wb') as f:
            for chunk in r.iter_content(chunk_size=8192, decode_unicode=False):
                if chunk:
                    f.write(chunk)
                    hasher.update(chunk)

    return hasher.hexdigest()
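
The returned digest can be checked against the file on disk. The following is a minimal sketch and not part of the extension: `sha1_of_file` is a hypothetical helper that re-hashes a file in chunks of the same size used above, so its result should match the value returned by `download_resource_data` for the same `into` path.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 8192) -> str:
    """Hypothetical helper: SHA1 hex digest of a file, read in chunks."""
    hasher = hashlib.sha1()
    with path.open('rb') as f:
        # read fixed-size chunks so large files never sit fully in memory
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because SHA1 is computed incrementally, the chunk size does not affect the resulting digest; it only bounds memory use.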

`iter_records(data, stats)`

Iterate over the dicts in the given data iterable, converting each to a Record object for Splitgill to ingest. The stats object will be updated periodically during the operation to show progress (by updating the count value).

For each dict in the data stream, the _id key is checked to see if it exists. If it does exist, the associated value is used as the record ID for the record created from that dict. If it does not exist, a new _id value is added to the dict and used as the new record's ID.
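
The `_id` handling described above can be sketched roughly as follows. This is an illustration only and not the extension's code: `with_ids` and its `new_id` callable are hypothetical names, and plain `(id, dict)` pairs stand in for Splitgill `Record` objects.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple


def with_ids(
    data: Iterable[dict], new_id: Callable[[int], str]
) -> Iterator[Tuple[Any, dict]]:
    """Hypothetical sketch: yield (record ID, row) pairs for each row."""
    for index, row in enumerate(data):
        if '_id' not in row:
            # no ID supplied, so generate one and store it on the row
            row['_id'] = new_id(index)
        yield row['_id'], row


# the first row keeps its own ID, the second is given a generated one
rows = [{'_id': 'a1', 'name': 'x'}, {'name': 'y'}]
pairs = list(with_ids(rows, lambda i: f'gen{i}'))
```

In this sketch the generated ID depends on the row's position in the stream, which is what keeps generated IDs in insertion order.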

New record IDs are generated sequentially to maintain insertion order of these records within this data stream. Unless there are more than 1 billion records in the stream, the resulting generated IDs will always be 12 characters long. If there are more than 1 billion records in the stream, the resulting generated IDs may be longer than 12 characters.

IDs take the form of a 3-letter prefix concatenated with the hex representation of the sum of a constant value and the record's index in the stream (i.e. the first record is at position 0, the 10th record is at position 9, etc.). The hex representation is padded with 0s to ensure it is at least 9 characters long, hence achieving a 12-character total ID length.

The constant value is a random number between 0 (inclusive) and 3294967296 (exclusive). This bound is 2^32 - 10^9, i.e. 1 billion less than 4294967296, the smallest integer that needs 9 hex characters (0x100000000). This is where the 1 billion soft limit on IDs of length 12 comes from: while the stream holds no more than 1 billion records, the sum stays below 2^32 and is guaranteed to fit in the padded width. For example, if the constant is chosen as 3294967295 and a billion record IDs are generated, the constant plus the last index in the stream is 4294967294 (0xfffffffe), which still pads to 9 characters. Going by the padding described above, an ID only reaches 13 characters once the sum needs 10 hex characters, i.e. at 68719476736 (0x1000000000) or more. Is this overcomplicated? Perhaps.
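
The ID format can be sketched as below. This is a minimal illustration of the scheme as described, not the extension's implementation: `make_id` and `CONSTANT_LIMIT` are hypothetical names, and the use of lowercase ASCII letters for the prefix and lowercase hex digits are assumptions.

```python
import random
import string

# exclusive upper bound for the random constant (2**32 - 10**9)
CONSTANT_LIMIT = 3294967296


def make_id(prefix: str, constant: int, index: int) -> str:
    """Hypothetical sketch: prefix + hex of (constant + index), zero-padded to 9."""
    return f'{prefix}{constant + index:09x}'


# assumption: the prefix is three random lowercase ASCII letters
prefix = ''.join(random.choices(string.ascii_lowercase, k=3))
constant = random.randrange(CONSTANT_LIMIT)
first_id = make_id(prefix, constant, 0)
```

With any constant below `CONSTANT_LIMIT` and any index below 1 billion, the sum is below 2^32, so the padded hex part is exactly 9 characters and the ID is 12 characters long.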