## Summary

`ckan search-index rebuild` on staging fails for several datasets with:

```
TypeError: Object of type Row is not JSON serializable
```

Without `-i`, the first such dataset aborts the whole rebuild, leaving a partial index. With `-i` the errors are skipped, but those datasets are then missing from Solr and will not appear in search results.
## Traceback

```
  File "/usr/lib/adx/submodules/ckan/ckan/lib/search/__init__.py", line 251, in rebuild
    package_index.update_dict(
  File "/usr/lib/adx/submodules/ckan/ckan/lib/search/index.py", line 105, in update_dict
    self.index_package(pkg_dict, defer_commit)
  File "/usr/lib/adx/submodules/ckan/ckan/lib/search/index.py", line 124, in index_package
    data_dict_json = json.dumps(pkg_dict)
  ...
TypeError: Object of type Row is not JSON serializable
```
A SQLAlchemy `Row` object is leaking into `pkg_dict` before the `json.dumps` call at `ckan/lib/search/index.py:124`. This is almost certainly an extension (a `before_dataset_index` / `before_index` hook, or a package-dict modifier) attaching a raw query result instead of converting it to a plain dict/list first.
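The suspected bug pattern, and its fix, look roughly like the sketch below. `Row` here is a minimal stand-in for a SQLAlchemy `Row`, and both hook bodies are hypothetical, not the actual extension code:

```python
import json


class Row:
    """Minimal stand-in for a SQLAlchemy Row: not JSON serializable,
    but exposes _asdict() like real Rows do (SQLAlchemy 1.4+)."""

    def __init__(self, **fields):
        self._data = fields

    def _asdict(self):
        return dict(self._data)


def buggy_hook(pkg_dict):
    # Suspected pattern: the raw query result attached to pkg_dict,
    # which later blows up inside index_package's json.dumps().
    pkg_dict["extras"] = [Row(key="license", value="CC-BY")]
    return pkg_dict


def fixed_hook(pkg_dict):
    # Fix: convert each Row to a plain dict before it reaches the indexer.
    rows = [Row(key="license", value="CC-BY")]
    pkg_dict["extras"] = [r._asdict() for r in rows]
    return pkg_dict


try:
    json.dumps(buggy_hook({}))
except TypeError as exc:
    print(exc)  # Object of type Row is not JSON serializable
print(json.dumps(fixed_hook({})))
```

On SQLAlchemy versions where `._asdict()` is unavailable, `dict(row._mapping)` (1.4+) or `dict(row)` (1.3) serve the same purpose.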
## Known affected dataset IDs (staging)

- `e6426824-6985-4d33-91e0-a785ce9634b9`
- `a705c2a4-f498-450c-aa41-c28061e11dfa`
- `d55c9e6f-c051-4c2d-95dd-41e21b70a020`
## Separate but related

During the same rebuild, one dataset also fails Solr-side with an immense-term error on its `config` field:

```
Exception writing document id ba3bbd4e59033c155d1755a6d2f52075 ...
Document contains at least one immense term in field="config"
(whose UTF8 encoding is longer than the max length 32766);
bytes can be at most 32766 in length; got 38485.
```

This has a separate root cause: in the Solr schema, `config` is declared as a `StrField`, which is not analyzed/tokenized, so the whole value is indexed as a single term subject to the 32766-byte limit. Either truncate/normalize the value upstream, or change the schema field type to a `TextField` (analyzed fields split the value into tokens, so no single term hits the limit).
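If upstream truncation is the chosen route, the cut must be made on the UTF-8 byte length rather than the character count, since the 32766 limit is in bytes. A minimal sketch; the helper name and its placement (e.g. in an indexing hook) are assumptions:

```python
# Solr's single-term byte limit for non-analyzed StrFields,
# taken from the error message above.
MAX_TERM_BYTES = 32766


def truncate_to_utf8_bytes(value: str, limit: int = MAX_TERM_BYTES) -> str:
    """Return value unchanged if it fits, otherwise cut it so its UTF-8
    encoding is at most `limit` bytes, without splitting a multi-byte
    character at the boundary."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops any partial multi-byte sequence left at the cut.
    return encoded[:limit].decode("utf-8", errors="ignore")
```

A hook could apply this to the `config` value before indexing, though if the giant value is not actually needed for search, dropping the field from the indexed document entirely may be cleaner.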
## Repro

```shell
kubectl exec deployment/ckan -n adr-s -- \
  ckan -c /tmp/production.ini search-index rebuild
```
## Suggested investigation path

- Grep the codebase and submodules for `before_dataset_index` / `before_index` hooks and other functions that mutate `pkg_dict`, and look for ones that assign the result of a SQLAlchemy query directly (i.e., without `._asdict()`, `dict(row)`, or iterating into a list of dicts).
- Load one of the affected datasets via `ckan dataset show <id>` and compare it to an unaffected dataset to narrow down which field holds the `Row`.
- For the `config` field issue, inspect dataset `ba3bbd4e59033c155d1755a6d2f52075` and decide whether the giant value is legitimate.
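To narrow down which field holds the `Row` without eyeballing the whole dict, a small recursive probe can be run over an affected dataset's dict. The helper is illustrative, not part of CKAN:

```python
import json


def find_unserializable(obj, path="pkg_dict"):
    """Return a list of (path, type name) pairs for values that
    json.dumps cannot handle anywhere inside obj."""
    try:
        json.dumps(obj)
        return []  # this subtree is fine
    except TypeError:
        pass
    if isinstance(obj, dict):
        bad = []
        for key, value in obj.items():
            bad += find_unserializable(value, f"{path}[{key!r}]")
        return bad
    if isinstance(obj, (list, tuple)):
        bad = []
        for i, value in enumerate(obj):
            bad += find_unserializable(value, f"{path}[{i}]")
        return bad
    # A leaf value that json.dumps rejects -- the culprit.
    return [(path, type(obj).__name__)]
```

Running it on the parsed output of `ckan dataset show <id>` for one of the affected IDs should print the exact field path holding the `Row`.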