Skip to content

Action

vds_multi_autocomplete_field(resource_ids, text='', lowercase=False, version=None)

Queries the field names in the given resources, filtering by checking if the field name contains the text parameter.

The fields key in the response dict contains the field's full path, and then for each resource ID the field path appears in: - the field's name - the field's path - the number of records which have this field - the number of records which have a text parsed value in this field - the number of records which have a keyword parsed value in this field - the number of records which have a boolean parsed value in this field - the number of records which have a date parsed value in this field - the number of records which have a number parsed value in this field - the number of records which have a geo parsed value in this field

Note that for large numbers of resources, this action is quite expensive to run.

Parameters:

Name Type Description Default
resource_ids List[str]

the resources match fields on (if no resource IDs are passed, all resources are searched)

required
text str

the text to search for

''
lowercase bool

whether to compare the text to the field names in lowercase (default is False)

False
version Optional[int]

the version to search at (default is None, which means latest)

None

Returns:

Type Description

the total number of fields matched and details about the fields that were matched

Source code in ckanext/versioned_datastore/logic/multi/action.py
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
@action(
    schema.vds_multi_autocomplete_field(),
    helptext.vds_multi_autocomplete_field,
    get=True,
)
def vds_multi_autocomplete_field(
    resource_ids: List[str],
    text: str = '',
    lowercase: bool = False,
    version: Optional[int] = None,
):
    """
    Queries the field names in the given resources, filtering by checking if the field
    name contains the text parameter.

    The fields key in the response dict contains the field's full path, and then for
    each resource ID the field path appears in:
        - the field's name
        - the field's path
        - the number of records which have this field
        - the number of records which have a text parsed value in this field
        - the number of records which have a keyword parsed value in this field
        - the number of records which have a boolean parsed value in this field
        - the number of records which have a date parsed value in this field
        - the number of records which have a number parsed value in this field
        - the number of records which have a geo parsed value in this field

    Note that for large numbers of resources, this action is quite expensive to run.

    :param resource_ids: the resources match fields on (if no resource IDs are passed,
                         all resources are searched)
    :param text: the text to search for
    :param lowercase: whether to compare the text to the field names in lowercase
                      (default is False)
    :param version: the version to search at (default is None, which means latest)
    :returns: the total number of fields matched and details about the fields that were
             matched
    """
    fields = defaultdict(dict)

    # if no resource IDs were provided, use all resources available to the user
    if not resource_ids:
        resource_ids = sorted(get_available_datastore_resources())

    for resource_id in resource_ids:
        database = get_database(resource_id)

        try:
            parsed_fields = database.get_parsed_fields(version=version)
        except NotFoundError:
            # temporary fix for splitgill#38 (so we can ignore unavailable resources)
            continue

        for field in parsed_fields:
            if text in (field.path.lower() if lowercase else field.path):
                fields[field.path][resource_id] = {
                    'name': field.name,
                    'path': field.path,
                    'count': field.count,
                    'text': field.count_text,
                    'keyword': field.count_keyword,
                    'boolean': field.count_boolean,
                    'date': field.count_date,
                    'number': field.count_number,
                    'geo': field.count_geo,
                }

    return {'count': len(fields), 'fields': fields}

vds_multi_autocomplete_field_latest(resource_ids, text='', lowercase=False)

Returns fields from the latest version of the given resources.

This action is significantly faster than vds_multi_autocomplete_field for larger and/or more resources, but it does not allow filtering by version and does not return field or type counts.

Parameters:

Name Type Description Default
resource_ids List[str]

the resources match fields on (if no resource IDs are passed, all resources are searched)

required
text str

the text to search for

''
lowercase bool

whether to compare the text to the field names in lowercase (default is False)

False

Returns:

Type Description

the total number of fields matched and details about the fields that were matched

Source code in ckanext/versioned_datastore/logic/multi/action.py
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
@action(
    schema.vds_multi_autocomplete_field_latest(),
    helptext.vds_multi_autocomplete_field_latest,
    get=True,
)
def vds_multi_autocomplete_field_latest(
    resource_ids: List[str],
    text: str = '',
    lowercase: bool = False,
):
    """
    Returns fields from the latest version of the given resources.

    This action is significantly faster than vds_multi_autocomplete_field for larger
    and/or more resources, but it does not allow filtering by version and does not
    return field or type counts.

    :param resource_ids: the resources match fields on (if no resource IDs are passed,
        all resources are searched)
    :param text: the text to search for
    :param lowercase: whether to compare the text to the field names in lowercase
        (default is False)
    :returns: the total number of fields matched and details about the fields that were
        matched
    """

    # if no resource IDs were provided, use all resources available to the user
    if not resource_ids:
        resource_ids = sorted(get_available_datastore_resources())

    resource_fields = get_latest_resource_fields(resource_ids)
    if text:
        fields = {
            k: v
            for k, v in resource_fields.items()
            if text in (k.lower() if lowercase else k)
        }
    else:
        fields = resource_fields

    return {'count': len(fields), 'fields': fields}

vds_multi_autocomplete_value(data_dict, field, prefix=None)

Queries the given resources using the given query and at an optional version, and then returns the values that match the given prefix for the given field. The list of values is returned.

Parameters:

Name Type Description Default
data_dict dict

the action options

required
field str

the field to get the values from

required
prefix Optional[str]

the prefix to search for

None

Returns:

Type Description

a dict containing the list of values and an after key if there are more values available. If the after key is not present, there are no more values available

Source code in ckanext/versioned_datastore/logic/multi/action.py
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
@action(
    schema.vds_multi_autocomplete_value(),
    helptext.vds_multi_autocomplete_value,
    get=True,
)
def vds_multi_autocomplete_value(
    data_dict: dict,
    field: str,
    prefix: Optional[str] = None,
):
    """
    Queries the given resources using the given query and at an optional version, and
    then returns the values that match the given prefix for the given field. The list of
    values is returned.

    :param data_dict: the action options
    :param field: the field to get the values from
    :param prefix: the prefix to search for
    :returns: a dict containing the list of values and an after key if there are more
        values available. If the after key is not present, there are no more values
        available
    """
    # extract the limit but default to a size of 20 if it's not present
    size = data_dict.pop('size', 20)
    # grab the after as we need to use it for the agg, not the query
    after = data_dict.pop('after', None)

    request = make_request(data_dict)
    request.set_no_results()

    # create the full path to the parsed field type we're going to filter and agg over
    field_path = keyword(field)

    if prefix:
        request.extra_filter &= Q('prefix', **{field_path: prefix})

    # add the aggregation which gets the field values
    request.add_agg(
        'field_values',
        'composite',
        # get one more than the requested size so that we can work out the after
        size=size + 1,
        sources={field: A('terms', field=field_path, order='asc')},
        # only include the after key if there is one
        **({'after': {field: after}} if after is not None else {}),
    )

    response = request.run()

    agg_result = response.aggs['field_values']
    values = [bucket['key'][field] for bucket in agg_result['buckets']]
    result = {'values': values[:size]}
    if 'after_key' in agg_result and len(values) > size:
        result['after'] = agg_result['after_key'][field]
    return result

vds_multi_count(data_dict)

Queries the given resources using the given query and at an optional version, and then returns the total number of records which matched the query and the counts per resource. Compared to the basic query count, this action provides a richer query language and the ability to query multiple resources at the same time.

Parameters:

Name Type Description Default
data_dict dict

the data dict of options

required

Returns:

Type Description

a dict which contains the total count and a breakdown of the hits per resource

Source code in ckanext/versioned_datastore/logic/multi/action.py
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
@action(schema.vds_multi_count(), helptext.vds_multi_count, get=True)
def vds_multi_count(data_dict: dict):
    """
    Queries the given resources using the given query and at an optional version, and
    then returns the total number of records which matched the query and the counts per
    resource. Compared to the basic query count, this action provides a richer query
    language and the ability to query multiple resources at the same time.

    :param data_dict: the data dict of options
    :returns: a dict which contains the total count and a breakdown of the hits per
        resource
    """
    request = make_request(data_dict)
    request.set_no_results()
    # use an aggregation to get the hit count of each resource, set the size to the
    # number of resources we're querying to ensure we get all counts in one go and don't
    # have to paginate with a composite agg
    request.add_agg(
        'counts', 'terms', field='_index', size=len(request.query.resource_ids)
    )
    response = request.run()

    # default the counts to 0 for all resources
    counts = {resource_id: 0 for resource_id in request.query.resource_ids}
    # then update with the counts from the resources that matched the query
    for bucket in response.aggs['counts']['buckets']:
        counts[unprefix_index_name(bucket['key'])] += bucket['doc_count']
    return {'total': response.count, 'counts': counts}

vds_multi_direct(data_dict)

Allows users to run Elasticsearch queries directly against the cluster. The raw response from Elasticsearch is returned directly as well. This is locked down to admin users only to avoid misuse, but can be used from other extensions easily with the ignore_auth context option.

To control the version that is searched, either include it in the search object directly, or: - pass "latest", None, or don't include the version key in the data dict to search the latest version - pass "all" to search all versions - pass a version timestamp value to search at a specific version

A list of resource IDs must be passed and if no resource IDs are passed an error is raised during validation.

Parameters:

Name Type Description Default
data_dict dict

the action data dict

required

Returns:

Type Description

the raw response from Elasticsearch

Source code in ckanext/versioned_datastore/logic/multi/action.py
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
@action(schema.vds_multi_direct(), helptext.vds_multi_direct)
def vds_multi_direct(data_dict: dict):
    """
    Allows users to run Elasticsearch queries directly against the cluster. The raw
    response from Elasticsearch is returned directly as well. This is locked down to
    admin users only to avoid misuse, but can be used from other extensions easily with
    the ignore_auth context option.

    To control the version that is searched, either include it in the search object
    directly, or:
        - pass "latest", None, or don't include the version key in the data dict to
          search the latest version
        - pass "all" to search all versions
        - pass a version timestamp value to search at a specific version

    A list of resource IDs must be passed and if no resource IDs are passed an error is
    raised during validation.

    :param data_dict: the action data dict
    :returns: the raw response from Elasticsearch
    """
    resource_ids = data_dict['resource_ids']
    version = data_dict.get('version', 'latest')
    search = Search.from_dict(data_dict.get('search', {}))

    databases = map(get_database, resource_ids)
    if version == 'latest':
        indices = [database.indices.latest for database in databases]
    else:
        indices = [database.indices.wildcard for database in databases]
        if version is not None and version != 'all':
            search = search.filter(version_query(int(version)))

    # call search.index() empty first to reset the indices thus avoiding them being set
    # in the passed search dict
    search = search.index().index(indices)

    multi_search = MultiSearch(using=es_client()).add(search)
    result = next(iter(multi_search.execute()))

    return result.to_dict()

vds_multi_fields(data_dict, size=10, ignore_groups=None, sample=1.0)

Groups the fields available on the given resources at the optional given version and returns as many of them as requested in the size parameter. The groups are created by matching field names across resources and then sorted in descending order with the most common groups at the top (both most common in terms of resources containing the field, but also most common as in with the most records that have the field in them in a resource).

This aggregates across all records in a resource, so it can be very slow.

Parameters:

Name Type Description Default
data_dict dict

the action options

required
size int

the number of field groups to return (defaults to the top 10)

10
ignore_groups Optional[List[str]]

an optional list of fields to ignore

None
sample Optional[float]

set to <1 to enable random sampling of records (increases query speed); this is the probability that a given record will be included

1.0

Returns:

Type Description

a list of field groups represented as dicts

Source code in ckanext/versioned_datastore/logic/multi/action.py
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
@action(schema.vds_multi_fields(), helptext.vds_multi_fields, get=True)
def vds_multi_fields(
    data_dict: dict,
    size: int = 10,
    ignore_groups: Optional[List[str]] = None,
    sample: Optional[float] = 1.0,
):
    """
    Groups the fields available on the given resources at the optional given version and
    returns as many of them as requested in the size parameter. The groups are created
    by matching field names across resources and then sorted in descending order with
    the most common groups at the top (both most common in terms of resources containing
    the field, but also most common as in with the most records that have the field in
    them in a resource).

    This aggregates across all records in a resource, so it can be very slow.

    :param data_dict: the action options
    :param size: the number of field groups to return (defaults to the top 10)
    :param ignore_groups: an optional list of fields to ignore
    :param sample: set to <1 to enable random sampling of records (increases query
        speed); this is the probability that a given record will be included
    :returns: a list of field groups represented as dicts
    """
    request = make_request(data_dict)
    request.set_no_results()

    query = request.query.to_dsl()

    field_groups = FieldGroups(skip_ids=True)
    if ignore_groups:
        for ignore in ignore_groups:
            field_groups.ignore(ignore)

    for plugin in ivds_implementations():
        plugin.vds_modify_field_groups(request.query.resource_ids, field_groups)

    for resource_id in request.query.resource_ids:
        database = get_database(resource_id)
        try:
            fields = database.get_parsed_fields(
                version=request.query.version,
                query=query,
                sample_probability=sample,
                chunk_size=100,
            )
        except NotFoundError:
            # temporary fix for splitgill#38 (so we can ignore unavailable resources)
            continue
        field_groups.add(resource_id, fields)

    return field_groups.select(size)

vds_multi_hash(query, query_version=None)

Given a query and a query schema version, hash them and return it.

Parameters:

Name Type Description Default
query dict

the query

required
query_version Optional[str]

the version of the query schema

None

Returns:

Type Description

the hash

Source code in ckanext/versioned_datastore/logic/multi/action.py
260
261
262
263
264
265
266
267
268
269
270
271
272
@action(schema.vds_multi_hash(), helptext.vds_multi_hash, get=True)
def vds_multi_hash(query: dict, query_version: Optional[str] = None):
    """
    Given a query and a query schema version, hash them and return it.

    :param query: the query
    :param query_version: the version of the query schema
    :returns: the hash
    """
    if query_version is None:
        query_version = get_latest_query_version()
    validate_query(query, query_version)
    return hash_query(query, query_version)

vds_multi_query(data_dict)

Queries the given resources using the given query and at an optional version, and then return the results. Compared to the basic query, this action provides a richer query language and the ability to query multiple resources at the same time.

Parameters:

Name Type Description Default
data_dict dict

the data dict of options

required

Returns:

Type Description

a dict which contains the total number of records found, an after value for pagination, and a list of dicts of record data

Source code in ckanext/versioned_datastore/logic/multi/action.py
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
@action(schema.vds_multi_query(), helptext.vds_multi_query, get=True)
def vds_multi_query(data_dict: dict):
    """
    Queries the given resources using the given query and at an optional version, and
    then return the results. Compared to the basic query, this action provides a richer
    query language and the ability to query multiple resources at the same time.

    :param data_dict: the data dict of options
    :returns: a dict which contains the total number of records found, an after value
        for pagination, and a list of dicts of record data
    """
    request = make_request(data_dict)
    response = request.run()
    result = {
        'total': response.count,
        'after': response.next_after,
        'records': [
            {
                'data': hit.data,
                'resource': hit.resource_id,
            }
            for hit in response.hits
        ],
    }

    for plugin in ivds_implementations():
        plugin.vds_after_multi_query(response, result)

    return result

vds_multi_stats(data_dict, field, missing=None)

Retrieves a simple set of numerical stats about the given field and returns them in a dict. The stats provided are the same as the Elasticsearch stats aggregation (because that is what is used!), so you'll get the following keys in the dict response:

  • count
  • min
  • max
  • avg
  • sum

The missing parameter defines how documents that are missing a value should be treated. By default, they will be ignored, but it is also possible to treat them as if they had a value by providing one here.

Parameters:

Name Type Description Default
data_dict dict

the data dict passed to this action

required
field str

the field to get stats for

required
missing Optional[float]

value to use for records missing this field, or None to ignore them

None

Returns:

Type Description

a dict of statistical data

Source code in ckanext/versioned_datastore/logic/multi/action.py
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
@action(schema.vds_multi_stats(), helptext.vds_multi_stats, get=True)
def vds_multi_stats(data_dict: dict, field: str, missing: Optional[float] = None):
    """
    Retrieves a simple set of numerical stats about the given field and returns them in
    a dict. The stats provided are the same as the Elasticsearch stats aggregation
    (because that is what is used!), so you'll get the following keys in the dict
    response:

      - count
      - min
      - max
      - avg
      - sum

    The missing parameter defines how documents that are missing a value should be
    treated. By default, they will be ignored, but it is also possible to treat them as
    if they had a value by providing one here.

    :param data_dict: the data dict passed to this action
    :param field: the field to get stats for
    :param missing: value to use for records missing this field, or None to ignore them
    :returns: a dict of statistical data
    """
    request = make_request(data_dict)
    request.set_no_results()
    agg_options = {'field': number(field)}
    if missing is not None:
        agg_options['missing'] = missing
    request.add_agg('field_stats', 'stats', **agg_options)
    response = request.run()
    return response.aggs['field_stats']