
Utils

find_version(data_dict)

Retrieve the version from the data_dict. The version can be specified as a parameter in its own right or as a special filter in the filters dict using the key __version__. Using the version parameter is preferred and will override any filter version value. The filter method is provided because of limitations in the CKAN recline.js framework used by the NHM on CKAN 2.3, where no additional parameters can be passed other than q, filters etc.

Parameters:

Name Type Description Default
data_dict dict

the data dict; this might be modified if the __version__ filter key is used (it will be removed from the filters if present)

required

Returns:

Type Description
Optional[int]

the version found as an integer, or None if no version was found

Source code in ckanext/versioned_datastore/logic/basic/utils.py
def find_version(data_dict: dict) -> Optional[int]:
    """
    Retrieve the version from the data_dict. The version can be specified as a parameter
    in its own right or as a special filter in the filters dict using the key
    __version__. Using the version parameter is preferred and will override any filter
    version value. The filter method is provided because of limitations in the CKAN
    recline.js framework used by the NHM on CKAN 2.3 where no additional parameters can
    be passed other than q, filters etc.

    :param data_dict: the data dict, this might be modified if the __version__ key is
        used (it will be removed if present)
    :returns: the version found as an integer, or None if no version was found
    """
    version = data_dict.get('version')
    # pop the __version__ to avoid including it in the normal search filters, even if we
    # don't use it
    filter_version = data_dict.get('filters', {}).pop('__version__', None)

    if version is not None:
        return int(version)

    if filter_version is not None:
        # it'll probably be a list because it's a normal filter as far as the frontend
        # is concerned
        if isinstance(filter_version, list):
            # just use the first value
            filter_version = filter_version[0]
        if filter_version is not None:
            return int(filter_version)

    # no version found, return None
    return None
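
The precedence rules described above can be checked with a small, self-contained sketch. The function body is a condensed copy of the listing so that the snippet runs outside CKAN; the sample data_dict values (the version numbers and the genus filter) are made up for illustration.

```python
from typing import Optional


def find_version(data_dict: dict) -> Optional[int]:
    # condensed copy of the listing above, repeated so this snippet runs on its own
    version = data_dict.get('version')
    # always pop __version__ so it never reaches the normal search filters
    filter_version = data_dict.get('filters', {}).pop('__version__', None)
    if version is not None:
        return int(version)
    if isinstance(filter_version, list):
        filter_version = filter_version[0]
    if filter_version is not None:
        return int(filter_version)
    return None


# the explicit parameter wins over the filter value
both = {'version': '1700000000000', 'filters': {'__version__': ['1600000000000']}}
explicit_result = find_version(both)

# the filter value is the fallback and typically arrives as a list
only_filter = {'filters': {'__version__': ['1600000000000'], 'genus': ['Rosa']}}
filter_result = find_version(only_filter)

# no version anywhere
none_result = find_version({'q': 'rosa'})
```

Note that the __version__ entry is removed from the filters dict in every case, including when the version parameter takes precedence, so callers should not rely on it still being present afterwards.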

format_facets(aggs)

Formats the facet aggregation result into the format we require. Specifically we expand the buckets out into a dict that looks like this:

{
    "facet1": {
        "details": {
            "sum_other_doc_count": 34,
            "doc_count_error_upper_bound": 3
        },
        "values": {
            "value1": 1,
            "value2": 4,
            "value3": 1,
            "value4": 2,
        }
    },
    etc
}

etc.

Parameters:

Name Type Description Default
aggs dict

the aggregation dict returned from splitgill/elasticsearch

required

Returns:

Type Description
dict

the facet information as a dict

Source code in ckanext/versioned_datastore/logic/basic/utils.py
def format_facets(aggs: dict) -> dict:
    """
    Formats the facet aggregation result into the format we require. Specifically we
    expand the buckets out into a dict that looks like this:

        {
            "facet1": {
                "details": {
                    "sum_other_doc_count": 34,
                    "doc_count_error_upper_bound": 3
                },
                "values": {
                    "value1": 1,
                    "value2": 4,
                    "value3": 1,
                    "value4": 2,
                }
            },
            etc
        }

    etc.

    :param aggs: the aggregation dict returned from splitgill/elasticsearch
    :returns: the facet information as a dict
    """
    facets = {}
    for facet, details in aggs.items():
        facets[facet] = {
            'details': {
                'sum_other_doc_count': details['sum_other_doc_count'],
                'doc_count_error_upper_bound': details['doc_count_error_upper_bound'],
            },
            'values': {
                value_details['key']: value_details['doc_count']
                for value_details in details['buckets']
            },
        }

    return facets
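
As an illustration of the transformation, the sketch below runs a copy of the function over a hand-written aggregation dict. The input follows the shape of an Elasticsearch terms aggregation response (doc_count_error_upper_bound, sum_other_doc_count, and a list of buckets); the facet name and counts are invented for the example.

```python
def format_facets(aggs: dict) -> dict:
    # copy of the listing above, repeated so this snippet runs on its own
    facets = {}
    for facet, details in aggs.items():
        facets[facet] = {
            'details': {
                'sum_other_doc_count': details['sum_other_doc_count'],
                'doc_count_error_upper_bound': details['doc_count_error_upper_bound'],
            },
            # flatten the list of buckets into a value -> count mapping
            'values': {
                value_details['key']: value_details['doc_count']
                for value_details in details['buckets']
            },
        }
    return facets


# invented sample input in the shape of a terms aggregation result
aggs = {
    'genus': {
        'doc_count_error_upper_bound': 0,
        'sum_other_doc_count': 12,
        'buckets': [
            {'key': 'Rosa', 'doc_count': 5},
            {'key': 'Quercus', 'doc_count': 3},
        ],
    }
}

facets = format_facets(aggs)
```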

get_fields(resource_id, version=None)

Given a resource id, returns the fields that existed at the given version. If the version is None then the fields for the latest version are returned.

The response format is important as it must match the requirements of reclineJS's field definitions. See http://okfnlabs.org/recline/docs/models.html#field for more details.

All fields are returned by default as string or array types. Searchers can specify whether to treat a field as another type when searching, so there is no need to guess the type here. Leaving that choice to the user avoids problems such as interpreting a field as a number when it shouldn't be (for example a barcode like '013655395').

If the ingestion details are available for the resource at the required version, the fields are returned in the order they appear in the original source. Otherwise, they are returned in alphabetical order.

Parameters:

Name Type Description Default
resource_id str

the resource's id

required
version Optional[int]

the version of the data we're querying (default: None, which means latest)

None

Returns:

Type Description
List[dict]

a list of dicts containing the field data

Source code in ckanext/versioned_datastore/logic/basic/utils.py
def get_fields(resource_id: str, version: Optional[int] = None) -> List[dict]:
    """
    Given a resource id, returns the fields that existed at the given version. If the
    version is None then the fields for the latest version are returned.

    The response format is important as it must match the requirements of reclineJS's
    field definitions. See http://okfnlabs.org/recline/docs/models.html#field for more
    details.

    All fields are returned by default as string or array types. This is because we have
    the capability to allow searchers to specify whether to treat a field as other types
    when searching, and therefore we don't need to try and guess the type, and we can
    leave it to the user to know the type which won't cause problems like interpreting a
    field as a number when it shouldn't be (for example a barcode like '013655395').

    The fields are returned in either alphabetical order, or if we have the ingestion
    details for the resource at the required version then the order of the fields will
    match the order of the fields in the original source.

    :param resource_id: the resource's id
    :param version: the version of the data we're querying (default: None, which means
        latest)
    :returns: a list of dicts containing the field data
    """
    database = get_database(resource_id)
    data_fields = database.get_data_fields(version)
    parsed_fields = {field.path: field for field in database.get_parsed_fields(version)}

    fields = []
    seen = {'_id'}

    for field in data_fields:
        if field.parsed_path in seen:
            continue
        field_repr = {
            'id': field.parsed_path,
            'type': infer_type(field, parsed_fields),
            'sortable': True,
        }
        if field.children:
            field_repr['sortable'] = False
            if any(child.is_list_element for child in field.children):
                field_repr['type'] = 'array'
        seen.add(field.parsed_path)
        fields.append(field_repr)

    details = get_details_at(resource_id, version)
    if details is None:
        # no details, just order by alphabetical field name
        fields.sort(key=itemgetter('id'))
    else:
        # we have details, order the fields using the order of the columns in the
        # original source
        column_order = details.get_columns(skip_empty=False)

        def key(f: dict) -> int:
            try:
                return column_order.index(f['id'])
            except ValueError:
                return len(column_order)

        fields.sort(key=key)

    # add the _id field to the start of the field list
    fields.insert(0, {'id': '_id', 'type': 'string'})

    return fields
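
The ordering step at the end of the function can be sketched in isolation. The helper name order_fields and the sample field ids below are hypothetical; the sort key mirrors the one in the listing, where any field missing from the source columns sorts after all known columns.

```python
from operator import itemgetter
from typing import List, Optional


def order_fields(fields: List[dict], column_order: Optional[List[str]]) -> List[dict]:
    # hypothetical helper isolating the ordering logic from get_fields
    if column_order is None:
        # no ingestion details, fall back to alphabetical order by field id
        return sorted(fields, key=itemgetter('id'))

    def key(f: dict) -> int:
        try:
            return column_order.index(f['id'])
        except ValueError:
            # unknown fields share the same key and so end up last
            return len(column_order)

    return sorted(fields, key=key)


fields = [{'id': 'genus'}, {'id': 'barcode'}, {'id': 'extra'}, {'id': 'collector'}]

alphabetical = [f['id'] for f in order_fields(fields, None)]
by_source = [f['id'] for f in order_fields(fields, ['collector', 'barcode', 'genus'])]
```

Because Python's sort is stable, fields that are missing from the column order keep their relative order at the end of the list.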

infer_type(data_field, parsed_fields, threshold=0.8)

Infers the type of the given data field, using the parsed fields to determine what type to assign. A type is assigned to the field if at least the threshold proportion of the values in the field are of the parsed type.

This function uses the parsed types instead of the data types for three reasons:

- often the field data types are just strings, so we'd need to use the parsed types anyway to get a non-string type
- the response to the get_fields function where this function is called is typically used in a search result and therefore the fields response may be used to create another search, in which case the parsed types would have to be used as you can't search using the data types
- there is no date field data type, whereas there is a date field parsed type

In most cases, the parsed type returned by this function will match the data type of the field's values.

Parameters:

Name Type Description Default
data_field DataField

the data field to infer for

required
parsed_fields Dict[str, ParsedField]

a dict of parsed paths to parsed fields

required
threshold float

a threshold determining the percentage of the values in the field which must be of the parsed type to be inferred as it (default: 0.8, i.e., 80%)

0.8

Returns:

Type Description
str

the name of the inferred type (one of string, number, date, boolean, or object if the field has no parsed field)

Source code in ckanext/versioned_datastore/logic/basic/utils.py
def infer_type(
    data_field: DataField, parsed_fields: Dict[str, ParsedField], threshold: float = 0.8
) -> str:
    """
    Infers the type of the given data field, using the parsed fields to determine what
    type to assign. A type is assigned to the field if at least the threshold proportion
    of the values in the field are of the parsed type.

    This function uses the parsed types instead of the data types for three reasons:
      - often the field data types are just strings, so we'd need to use the parsed
        types anyway to get a non-string type
      - the response to the get_fields function where this function is called is
        typically used in a search result and therefore the fields response may be used
        to create another search in which case the parsed types would have to be used
        as you can't search using the data types
      - there is no date field data type, whereas there is a date field parsed type

    In most cases, the parsed type returned by this function will match the data type of
    the field's values.

    :param data_field: the data field to infer for
    :param parsed_fields: a dict of parsed paths to parsed fields
    :param threshold: a threshold determining the percentage of the values in the field
                      which must be of the parsed type to be inferred as it
                      (default: 0.8, i.e., 80%)
    :returns: the name of the inferred type (one of string, number, date, boolean, or
        object if the field has no parsed field)
    """
    parsed_field = parsed_fields.get(data_field.parsed_path)

    # this should only ever happen for fields which are always nested objects
    if not parsed_field:
        return 'object'

    # check for dates first, then booleans, then numbers, and then default to string
    for parsed_type in (ParsedType.DATE, ParsedType.BOOLEAN, ParsedType.NUMBER):
        if parsed_field.type_counts[parsed_type] / parsed_field.count >= threshold:
            # just so happens that the recline types match the ParsedType names
            return parsed_type.name.lower()
    return 'string'
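
The threshold check can be tried out with stand-in types. The ParsedType enum and ParsedField dataclass below are simplified, hypothetical substitutes for the real classes, and the helper uses .get(..., 0) for missing counts, which is an assumption about how absent types behave rather than something the listing confirms.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ParsedType(Enum):
    # hypothetical stand-in for the real ParsedType enum
    DATE = 'date'
    BOOLEAN = 'boolean'
    NUMBER = 'number'


@dataclass
class ParsedField:
    # hypothetical stand-in exposing only the attributes the check needs
    count: int
    type_counts: Dict[ParsedType, int]


def infer(parsed_field: ParsedField, threshold: float = 0.8) -> str:
    # dates first, then booleans, then numbers, defaulting to string
    for parsed_type in (ParsedType.DATE, ParsedType.BOOLEAN, ParsedType.NUMBER):
        ratio = parsed_field.type_counts.get(parsed_type, 0) / parsed_field.count
        if ratio >= threshold:
            return parsed_type.name.lower()
    return 'string'


# 8 of 10 values parse as numbers: exactly on the threshold, so it counts
mostly_numbers = infer(ParsedField(count=10, type_counts={ParsedType.NUMBER: 8}))

# 7 of 10 is below the threshold, so the field falls back to string
mixed = infer(
    ParsedField(count=10, type_counts={ParsedType.NUMBER: 7, ParsedType.DATE: 1})
)

# a stricter threshold changes the outcome for the first field
strict = infer(ParsedField(count=10, type_counts={ParsedType.NUMBER: 8}), threshold=0.9)
```

The comparison is inclusive (>=), so a field whose matching proportion equals the threshold is assigned the parsed type.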

make_request(data_dict)

Creates a SearchRequest from the given data_dict and returns it. The SearchRequest object will use a BasicQuery as its query which will be made by this function from the given data_dict. This function should be used by all basic actions which need to use basic queries for searches.

Parameters:

Name Type Description Default
data_dict dict

the data_dict passed to the action

required

Returns:

Type Description
SearchRequest

a SearchRequest object

Source code in ckanext/versioned_datastore/logic/basic/utils.py
def make_request(data_dict: dict) -> SearchRequest:
    """
    Creates a SearchRequest from the given data_dict and returns it. The SearchRequest
    object will use a BasicQuery as its query which will be made by this function from
    the given data_dict. This function should be used by all basic actions which need to
    use basic queries for searches.

    :param data_dict: the data_dict passed to the action
    :returns: a SearchRequest object
    """
    query = BasicQuery(
        data_dict['resource_id'],
        find_version(data_dict),
        data_dict.get('q'),
        data_dict.get('filters'),
    )
    request = SearchRequest(
        query,
        size=data_dict.get('limit'),
        offset=data_dict.get('offset'),
        after=data_dict.get('after'),
        sorts=list(map(Sort.from_basic, data_dict.get('sort', []))),
        fields=data_dict.get('fields', []),
        data_dict=data_dict,
    )
    if 'facets' in data_dict:
        facet_limits = data_dict.get('facet_limits', {})
        for facet in data_dict['facets']:
            request.add_agg(
                facet,
                'terms',
                field=keyword(facet),
                size=facet_limits.get(facet, 10),
            )
    return request
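
To show which keys of the data_dict the function reads, here is an example input together with a sketch of the facet sizing rule. The helper facet_sizes is hypothetical and only mirrors the facet loop in the listing (it does not build a SearchRequest); the resource id and field names are invented.

```python
def facet_sizes(data_dict: dict) -> dict:
    # hypothetical helper: mirrors the facet loop above but returns plain data
    facet_limits = data_dict.get('facet_limits', {})
    # each requested facet gets its own limit, or 10 if none was given
    return {
        facet: facet_limits.get(facet, 10) for facet in data_dict.get('facets', [])
    }


# invented example of the keys make_request reads from the data_dict
data_dict = {
    'resource_id': 'example-resource-id',
    'q': 'rosa',
    'filters': {'genus': ['Rosa']},
    'limit': 50,
    'offset': 0,
    'facets': ['genus', 'collector'],
    'facet_limits': {'genus': 25},
}

sizes = facet_sizes(data_dict)
```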