Skip to content

Removing invalid docs from persistence.#5206

Merged
fulmicoton merged 1 commit intomainfrom
5205/ditch-invalid-docs
Jul 16, 2024
Merged

Removing invalid docs from persistence.#5206
fulmicoton merged 1 commit intomainfrom
5205/ditch-invalid-docs

Conversation

@fulmicoton
Copy link
Collaborator

@fulmicoton fulmicoton commented Jul 9, 2024

Removing invalid docs from persistence.

This PR filters out invalid docs from batches before persisting the
batch.
That way, they won't be persisted in the mrecordlog and the
indexing pipeline will not have to parse them again (and log an error).

This PR also improves a little bit on the metric for ingested docs.
The num_docs and bytes metric now have a validity `label`, and the
point at which they are measured is documented in the metric description.

Closes #5205

@fulmicoton fulmicoton force-pushed the 5205/ditch-invalid-docs branch 4 times, most recently from c83ae27 to c6fe1f2 Compare July 9, 2024 09:57
@fulmicoton fulmicoton requested a review from trinity-1686a July 9, 2024 09:57
This PR filters out invalid docs from batches before persisting the
batch.
That way, they won't be persisted in the mrecordlog and the
indexing pipeline will not have to parse them again (and log an error).

This PR also improves a little bit on the metric for ingested docs.
The num_docs and bytes metric now have a validity `label`, and the
point at which they are measured is documented in the metric description.

Closes #5205
@fulmicoton fulmicoton force-pushed the 5205/ditch-invalid-docs branch from c6fe1f2 to 0a57d53 Compare July 9, 2024 10:12
@trinity-1686a trinity-1686a linked an issue Jul 10, 2024 that may be closed by this pull request
@@ -122,7 +148,7 @@ pub(super) async fn validate_doc_batch(
doc_mapper: Arc<dyn DocMapper>,
) -> IngestV2Result<(DocBatchV2, Vec<ParseFailure>)> {
if is_document_validation_enabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of scope for this PR: it might be worth thinking about disabling validation when VRL is present, or to do VRL validation early. I can imagine a document-doc_mapper-VRL combo where not having VRL here causes a document to be rejected, even though it would have been accepted during actual indexation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent point! I'll create an issue.

@fulmicoton fulmicoton merged commit a3f074e into main Jul 16, 2024
@fulmicoton fulmicoton deleted the 5205/ditch-invalid-docs branch July 16, 2024 08:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ditching invalid documents upon ingestion

2 participants