search-index rebuild fails: Row object leaking into pkg_dict (JSON serialization) #201

@cooper667

Summary

ckan search-index rebuild on staging fails for several datasets with:

TypeError: Object of type Row is not JSON serializable

Without -i, the first such dataset aborts the whole rebuild, leaving a partial index. With -i the errors are skipped but those datasets are missing from Solr and will not appear in search.

Traceback

File "/usr/lib/adx/submodules/ckan/ckan/lib/search/__init__.py", line 251, in rebuild
    package_index.update_dict(
File "/usr/lib/adx/submodules/ckan/ckan/lib/search/index.py", line 105, in update_dict
    self.index_package(pkg_dict, defer_commit)
File "/usr/lib/adx/submodules/ckan/ckan/lib/search/index.py", line 124, in index_package
    data_dict_json = json.dumps(pkg_dict)
  ...
TypeError: Object of type Row is not JSON serializable

A SQLAlchemy Row object is leaking into pkg_dict before json.dumps at ckan/lib/search/index.py:124. This is almost certainly an extension (before_dataset_index / before_index hook, or a package-dict modifier) attaching a raw query result instead of converting it to a plain dict/list.
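As a rough illustration of the conversion that seems to be missing, the sketch below replaces Row-like values with plain dicts before serialization. `to_plain` and `FakeRow` are hypothetical names invented for this example; `FakeRow` only mimics the `._asdict()` accessor that a SQLAlchemy Row provides, and the real fix belongs in whichever extension attaches the Row.

```python
import json


class FakeRow:
    """Stand-in for a query result row; only mimics Row._asdict()."""

    def __init__(self, **cols):
        self._cols = cols

    def _asdict(self):
        return dict(self._cols)


def to_plain(value):
    """Recursively replace Row-like objects with plain dicts and lists."""
    if hasattr(value, "_asdict"):
        # Row-like object: convert to a dict, then clean its values too
        return {k: to_plain(v) for k, v in value._asdict().items()}
    if isinstance(value, dict):
        return {k: to_plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    return value


# Hypothetical package dict with a raw row attached by some hook
pkg_dict = {"id": "example", "related": [FakeRow(org_id="o1", title="Org One")]}

try:
    json.dumps(pkg_dict)
    failed = False
except TypeError:
    failed = True  # same class of error as in the traceback above

clean = to_plain(pkg_dict)
print(json.dumps(clean))
```

Converting at the point where the query result is attached is preferable to cleaning the whole dict just before indexing, since it keeps the package dict serializable for every other consumer as well.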

Known affected dataset IDs (staging)

  • e6426824-6985-4d33-91e0-a785ce9634b9
  • a705c2a4-f498-450c-aa41-c28061e11dfa
  • d55c9e6f-c051-4c2d-95dd-41e21b70a020

Separate but related

During the same rebuild, one dataset also fails Solr-side with an immense-term error on its config field:

Exception writing document id ba3bbd4e59033c155d1755a6d2f52075 ...
Document contains at least one immense term in field="config"
(whose UTF8 encoding is longer than the max length 32766);
bytes can be at most 32766 in length; got 38485.

This has a separate root cause in the Solr schema: config is declared as StrField, which is not analyzed/tokenized, so the whole value is indexed as a single term and exceeds the 32766-byte term limit. Either truncate/normalize the value upstream, or change the schema field type to TextField.
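To spot such values before they reach Solr, a check along these lines could be run against a document. This is a minimal sketch: `oversized_fields` is a hypothetical helper, it only looks at top-level string values, and it assumes every such field is subject to the 32766-byte limit quoted in the Solr error above (which fields actually are depends on the schema).

```python
MAX_TERM_BYTES = 32766  # limit quoted in the Solr error message


def oversized_fields(doc, limit=MAX_TERM_BYTES):
    """Return {field: utf8_byte_length} for string values over the limit."""
    too_big = {}
    for field, value in doc.items():
        if isinstance(value, str):
            # Solr reports the UTF-8 encoded length, not the character count
            size = len(value.encode("utf-8"))
            if size > limit:
                too_big[field] = size
    return too_big


# Hypothetical document with a config value as large as the one reported
doc = {"title": "small", "config": "x" * 38485}
print(oversized_fields(doc))
```

The length is measured in encoded bytes because a value with multi-byte characters can exceed the limit while having fewer than 32766 characters.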

Repro

kubectl exec deployment/ckan -n adr-s -- \
  ckan -c /tmp/production.ini search-index rebuild

Suggested investigation path

  1. Grep the codebase and submodules for before_dataset_index / before_index implementations and other functions that mutate pkg_dict, and look for ones that assign the result of a SQLAlchemy query directly (i.e., without converting each row via ._asdict() or dict(row._mapping), or iterating into a list of dicts).
  2. Load one of the affected datasets via ckan dataset show <id> and compare to an unaffected dataset to narrow down which field holds the Row.
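  3. For the config field issue, identify the dataset behind Solr document ba3bbd4e59033c155d1755a6d2f52075 (this looks like the Solr document id rather than a dataset UUID like the ones listed above) and decide whether the giant value is legitimate.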
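Step 2 can be partly automated. The sketch below walks a package dict and reports the path of every value that json.dumps rejects. `find_unserializable` is a hypothetical helper, the `Row` class is only a stand-in for the SQLAlchemy type, and how the real pkg_dict is obtained (for example in a Python shell on the CKAN pod) is left out.

```python
import json


def find_unserializable(value, path="pkg_dict"):
    """Yield (path, type_name) for every value json.dumps cannot encode."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from find_unserializable(item, f"{path}[{key!r}]")
    elif isinstance(value, (list, tuple)):
        for i, item in enumerate(value):
            yield from find_unserializable(item, f"{path}[{i}]")
    else:
        try:
            json.dumps(value)
        except TypeError:
            yield path, type(value).__name__


class Row:
    """Stand-in only; not the SQLAlchemy class."""


# Hypothetical package dict with a raw row buried in a nested field
sample = {"id": "example", "extras": [{"key": "owner", "value": Row()}]}
problems = list(find_unserializable(sample))
print(problems)
```

Each reported path points at the field that holds the offending object, which narrows the grep in step 1 to the code that populates that field.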

Metadata

Assignees: none

Labels: bug