fix(ds-identify): parse multi-line YAML datasource_list#6883
Open
cjp256 wants to merge 2 commits into
Open
Conversation
Multi-line YAML lists are valid YAML. The prior code assumed that
multi-line datasource_list values could not be parsed, but both
multi-line flow sequences and block sequences are standard YAML:
# multi-line flow sequence
datasource_list: [
Azure
]
# block sequence
datasource_list:
- Azure
- None
The existing read_datasource_list() only handled single-line flow
sequences (e.g. 'datasource_list: [Azure, None]'). When a config
file contains a multi-line flow sequence, the parser saw only '['
without the closing ']', produced an empty dslist, and fell back to
the full default datasource list. On Azure VMs this meant
ds-identify checked every datasource unnecessarily.
Add read_datasource_list_multiline() to handle both multi-line flow
sequences and block sequences as a fallback when single-line parsing
produces no result. Also tighten get_single_line_flow_sequence() to
reject values that lack a closing ']', so incomplete flow sequences
fall through to the multi-line parser correctly.
An alternative approach would be to shell out to python3 for YAML
parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a
hard dependency of cloud-init and is guaranteed to be present.
This would handle every valid YAML construct correctly with zero
edge cases and much less code. However, ds-identify is designed as
a pure POSIX shell script for speed — it runs as a systemd generator
early in boot where startup latency matters, and on systems where
cloud-init should be disabled it avoids the cost of a python3
interpreter entirely. The shell-based approach preserves that
property while covering the formats seen in practice. We could
maybe do one and attempt to fall back to the other?
When datasource_list is present in config but cannot be parsed by
any method, ds-identify now emits a distinct error indicating the
key was found but unparsable. I think maybe this error needs to go
to console or otherwise be detected by cloud-init later and escalated
as part of status, etc.
663ee0f to
5800d1d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Multi-line YAML lists are valid YAML. The prior code assumed that multi-line datasource_list values could not be parsed, but both multi-line flow sequences and block sequences are standard YAML:
multi-line flow sequence
datasource_list: [
Azure
]
block sequence
datasource_list:
- Azure
- None
The existing read_datasource_list() only handled single-line flow sequences (e.g. 'datasource_list: [Azure, None]'). When a config file contains a multi-line flow sequence, the parser saw only '[' without the closing ']', produced an empty dslist, and fell back to the full default datasource list. On Azure VMs this meant ds-identify checked every datasource unnecessarily.
Add read_datasource_list_multiline() to handle both multi-line flow sequences and block sequences as a fallback when single-line parsing produces no result. Also tighten get_single_line_flow_sequence() to reject values that lack a closing ']', so incomplete flow sequences fall through to the multi-line parser correctly.
An alternative approach would be to shell out to python3 for YAML parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a hard dependency of cloud-init and is guaranteed to be present. This would handle every valid YAML construct correctly with zero edge cases and much less code. However, ds-identify is designed as a pure POSIX shell script for speed — it runs as a systemd generator early in boot where startup latency matters, and on systems where cloud-init should be disabled it avoids the cost of a python3 interpreter entirely. The shell-based approach preserves that property while covering the formats seen in practice. We could maybe do one and attempt to fall back to the other?
When datasource_list is present in config but cannot be parsed by any method, ds-identify now emits a distinct error indicating the key was found but unparsable. I think maybe this error needs to go to console or otherwise be detected by cloud-init later and escalated as part of status, etc.