Skip to content

fix(ds-identify): parse multi-line YAML datasource_list#6883

Open
cjp256 wants to merge 2 commits into
canonical:mainfrom
cjp256:ds-identify-multiline
Open

fix(ds-identify): parse multi-line YAML datasource_list#6883
cjp256 wants to merge 2 commits into
canonical:mainfrom
cjp256:ds-identify-multiline

Conversation

@cjp256
Copy link
Copy Markdown
Contributor

@cjp256 cjp256 commented May 14, 2026

Multi-line YAML lists are valid YAML. The prior code assumed that multi-line datasource_list values could not be parsed, but both multi-line flow sequences and block sequences are standard YAML:

multi-line flow sequence

datasource_list: [
Azure
]

block sequence

datasource_list:
- Azure
- None

The existing read_datasource_list() only handled single-line flow sequences (e.g. 'datasource_list: [Azure, None]'). When a config file contains a multi-line flow sequence, the parser saw only '[' without the closing ']', produced an empty dslist, and fell back to the full default datasource list. On Azure VMs this meant ds-identify checked every datasource unnecessarily.

Add read_datasource_list_multiline() to handle both multi-line flow sequences and block sequences as a fallback when single-line parsing produces no result. Also tighten get_single_line_flow_sequence() to reject values that lack a closing ']', so incomplete flow sequences fall through to the multi-line parser correctly.

An alternative approach would be to shell out to python3 for YAML parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a hard dependency of cloud-init and is guaranteed to be present. This would handle every valid YAML construct correctly with zero edge cases and much less code. However, ds-identify is designed as a pure POSIX shell script for speed — it runs as a systemd generator early in boot where startup latency matters, and on systems where cloud-init should be disabled it avoids the cost of a python3 interpreter entirely. The shell-based approach preserves that property while covering the formats seen in practice. We could maybe do one and attempt to fall back to the other?

When datasource_list is present in config but cannot be parsed by any method, ds-identify now emits a distinct error indicating the key was found but unparsable. I think maybe this error needs to go to console or otherwise be detected by cloud-init later and escalated as part of status, etc.

Multi-line YAML lists are valid YAML. The prior code assumed that
multi-line datasource_list values could not be parsed, but both
multi-line flow sequences and block sequences are standard YAML:

  # multi-line flow sequence
  datasource_list: [
                   Azure
                   ]

  # block sequence
  datasource_list:
    - Azure
    - None

The existing read_datasource_list() only handled single-line flow
sequences (e.g. 'datasource_list: [Azure, None]'). When a config
file contains a multi-line flow sequence, the parser saw only '['
without the closing ']', produced an empty dslist, and fell back to
the full default datasource list. On Azure VMs this meant
ds-identify checked every datasource unnecessarily.

Add read_datasource_list_multiline() to handle both multi-line flow
sequences and block sequences as a fallback when single-line parsing
produces no result. Also tighten get_single_line_flow_sequence() to
reject values that lack a closing ']', so incomplete flow sequences
fall through to the multi-line parser correctly.

An alternative approach would be to shell out to python3 for YAML
parsing (e.g. python3 -c 'import yaml; ...'), since python3 is a
hard dependency of cloud-init and is guaranteed to be present.
This would handle every valid YAML construct correctly with zero
edge cases and much less code. However, ds-identify is designed as
a pure POSIX shell script for speed — it runs as a systemd generator
early in boot where startup latency matters, and on systems where
cloud-init should be disabled it avoids the cost of a python3
interpreter entirely. The shell-based approach preserves that
property while covering the formats seen in practice.  We could
maybe do one and attempt to fall back to the other?

When datasource_list is present in config but cannot be parsed by
any method, ds-identify now emits a distinct error indicating the
key was found but unparsable.  I think maybe this error needs to go
to console or otherwise be detected by cloud-init later and escalated
as part of status, etc.
@cjp256 cjp256 force-pushed the ds-identify-multiline branch from 663ee0f to 5800d1d Compare May 14, 2026 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant