fix: handle GUIDs with XML entities in them like feedparser-py#60

Merged
bug-ops merged 2 commits into bug-ops:main from fazalmajid:fix/guid-entities
Feb 20, 2026

Conversation

@fazalmajid
Contributor

Summary

Handle XML entities in attributes like GUIDs

Motivation

See #59 for details
Fixes #59

Changes

Handle Quick-XML Event::GeneralRef when handling element text.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test addition or update

Test Plan

  • Ran cargo make test (all tests pass)
  • Ran cargo make lint (no warnings)
  • Added new tests for the changes
  • Tested manually with:

Note that cargo make ci-all is failing due to an unrelated issue in pyo3 (https://rustsec.org/advisories/RUSTSEC-2026-0013), and there is another one in cargo-nextest, but I didn't think you'd want to commingle those security fixes with this regression fix.

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code where necessary
  • [N/A] I have updated the documentation accordingly
  • My changes generate no new warnings
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally

Additional Notes

Hyrum's law strikes again. This would fix false positives in code that is switching from feedparser-py to feedparser-rs, but code that was using the incorrect GUIDs generated by older versions of feedparser-rs will experience false positives.
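To make the compatibility concern concrete, here is a minimal sketch of a consumer that deduplicates items by GUID. The function name `count_unseen` and the stored GUID value are hypothetical; the stored value assumes the older behavior simply dropped the entity, as the description of the bug suggests.

```rust
use std::collections::HashSet;

/// Hypothetical dedup step: count GUIDs that are not yet in the stored set.
fn count_unseen(seen: &HashSet<String>, guids: &[&str]) -> usize {
    guids.iter().filter(|g| !seen.contains(**g)).count()
}

/// Illustration only: a store populated by an older parser version.
fn demo() -> usize {
    let mut seen = HashSet::new();
    // Assumed GUID as an older feedparser-rs might have produced it (entity lost).
    seen.insert("https://example.com/?a=1b=2".to_string());
    // The same item after the fix now carries the decoded '&'.
    count_unseen(&seen, &["https://example.com/?a=1&b=2"])
}
```

Under these assumptions the already-seen item is counted as new once, which is the kind of false positive the note warns about.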

@github-actions github-actions bot added type: tooling Development tools, CI/CD, or infrastructure component: core feedparser-rs-core Rust library component: python Python bindings (PyO3) component: node Node.js bindings (napi-rs) component: tests Test suite or test infrastructure area: parser Feed parsing logic lang: rust Rust code lang: python Python code size: M Medium PR (<200 lines changed) labels Feb 19, 2026
@bug-ops bug-ops requested a review from Copilot February 20, 2026 01:50

Copilot AI left a comment

Pull request overview

This pull request fixes a bug where XML entities in GUID elements (and other text fields) were not being properly decoded, causing character loss and breaking compatibility with Python's feedparser library. The issue occurred because quick-xml 0.39's Event::GeneralRef events (emitted for entity references like &#038;) were being silently ignored.

Changes:

  • Added entity resolution logic to handle numeric character references (&#038;, &#x26;), predefined XML entities (&amp;, &lt;, etc.), and unknown entities
  • Updated read_text function to process Event::GeneralRef events
  • Added comprehensive test coverage for entity decoding in both Rust and Python binding tests
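The kind of entity resolution summarized above can be sketched with the standard library alone. This is not the code from the pull request: `decode_entity`, `parse_digits`, and `is_xml_char` are hypothetical names, and the input is assumed to be the reference name without the surrounding `&` and `;`.

```rust
/// XML 1.0 `Char` production: code points allowed in a well-formed document.
fn is_xml_char(c: char) -> bool {
    matches!(c,
        '\u{9}' | '\u{A}' | '\u{D}'
        | '\u{20}'..='\u{D7FF}'
        | '\u{E000}'..='\u{FFFD}'
        | '\u{10000}'..='\u{10FFFF}')
}

/// Parse digits in the given radix, rejecting empty input and signs
/// (Rust's integer parsing would otherwise accept a leading '+').
fn parse_digits(digits: &str, radix: u32) -> Option<u32> {
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    u32::from_str_radix(digits, radix).ok()
}

/// Hypothetical helper: decode the name found between `&` and `;`.
/// Returns `None` for unknown entities and invalid character references.
fn decode_entity(name: &str) -> Option<char> {
    if let Some(num) = name.strip_prefix('#') {
        // Numeric character reference: `#x26` (hex) or `#038` (decimal).
        let code = match num.strip_prefix('x') {
            Some(hex) => parse_digits(hex, 16)?,
            None => parse_digits(num, 10)?,
        };
        return char::from_u32(code).filter(|c| is_xml_char(*c));
    }
    // The five entities predefined by the XML specification.
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        _ => None,
    }
}
```

With this sketch, `decode_entity("#038")` and `decode_entity("#x26")` both yield `'&'`, while `decode_entity("unknown")` yields `None` and leaves the fallback policy to the caller.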

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File | Description
crates/feedparser-rs-core/src/parser/common.rs | Added Event::GeneralRef handling in read_text() and new resolve_entity() helper function to decode character references and named entities
crates/feedparser-rs-py/tests/test_guid_entities.py | Added regression tests for entity decoding in GUIDs covering numeric references, hex references, predefined entities, and multiple entities

Comment on lines 376 to 379
Ok(Event::GeneralRef(e)) => {
    let resolved = resolve_entity(&e)?;
    append_bytes(&mut text, resolved.as_bytes(), limits.max_text_length)?;
}

Copilot AI Feb 20, 2026


This implementation violates the bozo pattern that is fundamental to this project. According to the coding guidelines (CodingGuidelineID: 1000002), the parser must NEVER panic or return errors for malformed feeds. Instead, it should set the bozo flag and continue parsing.

When resolve_entity returns an error (e.g., for an invalid character reference), this will propagate up as a Result error, causing the entire parsing operation to fail or set bozo at a higher level. However, the bozo pattern requires continuing to parse and extract as much data as possible even when encountering entity resolution errors.

The fix should:

  1. Change resolve_entity to return String instead of Result<String>
  2. When an invalid character reference is encountered, preserve it as-is (e.g., "&#xFFFF;") rather than returning an error
  3. Optionally, the caller (read_text) could detect when entity resolution fails and set bozo on the feed object if that context is available

This is particularly important because real-world feeds may contain invalid character references, and Python feedparser would handle these gracefully. For API compatibility (the #1 priority), we must match that behavior.
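The suggested change from a fallible to an infallible resolver might look roughly like the following sketch. Both function names are hypothetical, and the strict resolver is deliberately simplified (numeric references only, validated with `char::from_u32`), so it stands in for the project's real `resolve_entity` rather than reproducing it.

```rust
/// Simplified strict resolver (hypothetical): numeric references only,
/// returning an error for anything it cannot decode.
fn resolve_strict(name: &str) -> Result<String, String> {
    let num = name
        .strip_prefix('#')
        .ok_or_else(|| format!("unknown entity: {name}"))?;
    let code = match num.strip_prefix('x') {
        Some(hex) => u32::from_str_radix(hex, 16),
        None => num.parse::<u32>(),
    }
    .map_err(|e| format!("bad character reference {name}: {e}"))?;
    // Rejects surrogates and values above U+10FFFF.
    char::from_u32(code)
        .map(String::from)
        .ok_or_else(|| format!("invalid code point in {name}"))
}

/// Lenient wrapper: never fails. An unresolvable reference is kept as written,
/// and the boolean tells the caller the input was malformed, which a caller
/// with access to the feed object could use to set bozo.
fn resolve_lenient(name: &str) -> (String, bool) {
    match resolve_strict(name) {
        Ok(text) => (text, false),
        Err(_) => (format!("&{name};"), true),
    }
}
```

Returning the flag alongside the text is one possible design; it keeps the helper free of parser state while still letting a higher layer record that something was wrong.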

Copilot generated this review using guidance from repository custom instructions.
Owner

agree

assert_eq!(text, "https://example.com/?a=1&b=2&c=3");
}

#[test]

Copilot AI Feb 20, 2026

Missing test coverage for edge cases in entity handling. According to the testing guidelines (CodingGuidelineID: 1000004), tests should cover both success and failure cases, including edge cases.

Recommended additional tests:

  1. Invalid character reference (e.g., &#xFFFF; or &#999999;) - should set bozo but continue parsing
  2. Unknown/custom entity references (e.g., &customEntity;) - should preserve as-is
  3. Malformed entity syntax (e.g., &#x; or &) - should handle gracefully
  4. Mix of valid and invalid entities in same text

These tests are particularly important for verifying the bozo pattern works correctly when entity resolution encounters errors.

Suggested change
#[test]
#[test]
fn test_read_text_invalid_numeric_char_ref() {
    let xml = b"<guid>invalid &#999999; ref</guid>";
    let mut reader = Reader::from_reader(&xml[..]);
    reader.config_mut().trim_text(true);
    let mut buf = Vec::new();
    let limits = ParserLimits::default();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(_)) => break,
            Ok(Event::Eof) => panic!("Unexpected EOF"),
            _ => {}
        }
        buf.clear();
    }
    buf.clear();
    let result = read_text(&mut reader, &mut buf, &limits);
    assert!(result.is_err(), "Invalid numeric character reference should result in an error");
}
#[test]
fn test_read_text_unknown_named_entity() {
    let xml = b"<guid>value &customEntity; here</guid>";
    let mut reader = Reader::from_reader(&xml[..]);
    reader.config_mut().trim_text(true);
    let mut buf = Vec::new();
    let limits = ParserLimits::default();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(_)) => break,
            Ok(Event::Eof) => panic!("Unexpected EOF"),
            _ => {}
        }
        buf.clear();
    }
    buf.clear();
    let result = read_text(&mut reader, &mut buf, &limits);
    assert!(result.is_err(), "Unknown/custom entity should be surfaced as an error at this level");
}
#[test]
fn test_read_text_malformed_entity_syntax() {
    // Malformed entity: missing code after &#x and bare '&'
    let xml = b"<guid>value &#x; and &</guid>";
    let mut reader = Reader::from_reader(&xml[..]);
    reader.config_mut().trim_text(true);
    let mut buf = Vec::new();
    let limits = ParserLimits::default();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(_)) => break,
            Ok(Event::Eof) => panic!("Unexpected EOF"),
            _ => {}
        }
        buf.clear();
    }
    buf.clear();
    let result = read_text(&mut reader, &mut buf, &limits);
    assert!(result.is_err(), "Malformed entity syntax should not cause a panic and must be reported as an error");
}
#[test]
fn test_read_text_mixed_valid_and_invalid_entities() {
    let xml = b"<guid>ok &amp; bad &#999999; mix</guid>";
    let mut reader = Reader::from_reader(&xml[..]);
    reader.config_mut().trim_text(true);
    let mut buf = Vec::new();
    let limits = ParserLimits::default();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(_)) => break,
            Ok(Event::Eof) => panic!("Unexpected EOF"),
            _ => {}
        }
        buf.clear();
    }
    buf.clear();
    let result = read_text(&mut reader, &mut buf, &limits);
    assert!(result.is_err(), "Mixed valid and invalid entities should be handled without panicking");
}
#[test]

@bug-ops
Owner

bug-ops commented Feb 20, 2026

Hey, @fazalmajid! Thanks for the fix! Good catch.

I've merged #62 which updates dependencies, so the Security Audit check should pass now.
Could you rebase your branch on top of main to pick up those changes?

@bug-ops bug-ops mentioned this pull request Feb 20, 2026
4 tasks
@github-actions github-actions bot added size: L Large PR (<500 lines changed) and removed size: M Medium PR (<200 lines changed) labels Feb 20, 2026
@fazalmajid
Contributor Author

@bug-ops I rebased and implemented the Copilot recommendations and associated tests. Three things I found in doing so:

  1. The code is now bozo-tolerant, but the read_text function has no way to set the bozo flag. Bubbling this up would require invasive refactoring so I have not implemented it, but you should be aware of this

  2. When handling text that has a mix of valid and invalid entities, since quick-xml treats each entity independently, the good ones will be converted, and the bad ones will not. This differs from the feedparser-py behavior, where they are either all converted, or all left as-is atomically. If you look at the test test_guid_entities.py:test_guid_with_mixed_valid_and_unknown_entities feedparser-rs-py returns AT&T&unknown; but feedparser-py's behavior is to return AT&amp;T&unknown;:

zanzibar ~>p
Python 3.14.3 (main, Feb 11 2026, 10:07:14) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> feedparser.parse(b"""<?xml version="1.0"?>
...     <rss version="2.0">
...         <channel>
...             <item>
...                 <guid>AT&amp;T&unknown;</guid>
...             </item>
...         </channel>
...     </rss>""")
{'bozo': 1, 'entries': [{'id': 'AT&amp;T&unknown;', 'guidislink': True, 'link': 'AT&amp;T&unknown;'}], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': 'rss20', 'bozo_exception': SAXParseException("Entity 'unknown' not defined\n"), 'namespaces': {}}

(and note it also sets the bozo flag with SAXParseException("Entity 'unknown' not defined\n"))

  3. Running cargo make test mutates crates/feedparser-rs-node/index.js, changing the bindingPackageVersion checks from 0.3.0 to 0.4.3 along with other things I do not understand, so I left well enough alone and did not include those changes in the commit.
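The difference described in the second finding (per-entity decoding versus all-or-nothing decoding) can be illustrated with a small std-only sketch. The `Part` enum is a hypothetical stand-in for the parser's text and entity-reference events, the decoder covers only the predefined entities, and neither function claims to reproduce how either library is implemented internally.

```rust
/// Stand-in for the parser's event stream: literal text or an entity name.
enum Part<'a> {
    Text(&'a str),
    Ref(&'a str),
}

/// Simplified decoder: predefined XML entities only.
fn decode(name: &str) -> Option<char> {
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        _ => None,
    }
}

/// Per-entity strategy: decode what can be decoded, keep the rest verbatim.
fn per_entity(parts: &[Part<'_>]) -> String {
    let mut out = String::new();
    for part in parts {
        match part {
            Part::Text(t) => out.push_str(t),
            Part::Ref(name) => match decode(name) {
                Some(c) => out.push(c),
                None => out.push_str(&format!("&{name};")),
            },
        }
    }
    out
}

/// All-or-nothing strategy: if any reference fails, keep every reference verbatim.
fn atomic(parts: &[Part<'_>]) -> String {
    let all_valid = parts.iter().all(|p| match p {
        Part::Text(_) => true,
        Part::Ref(name) => decode(name).is_some(),
    });
    if all_valid {
        return per_entity(parts);
    }
    parts
        .iter()
        .map(|p| match p {
            Part::Text(t) => t.to_string(),
            Part::Ref(name) => format!("&{name};"),
        })
        .collect()
}
```

For the parts `Text("AT"), Ref("amp"), Text("T"), Ref("unknown")`, `per_entity` produces `AT&T&unknown;` and `atomic` produces `AT&amp;T&unknown;`, matching the two outputs quoted in the finding.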

Thanks for the work on feedparser-rs! I was a contributor to feedparser-py, and looking forward to incorporating feedparser-rs in my WIP Rust rewrite of https://github.com/fazalmajid/temboz after a false start with brittle feed-rs.

@fazalmajid
Contributor Author

Also, just for the sake of clarity in the absence of a CLA: I agree to the terms in CONTRIBUTING.md and agree that this code will be licensed under the terms of the project (dual MIT or Apache 2.0 license).

@bug-ops
Owner

bug-ops commented Feb 20, 2026

Thanks for the detailed analysis and the rebase!
All three findings are valid. The index.js mutation is expected from the napi-rs build pipeline.
I've created #64 to track the follow-up work — this keeps your PR focused on the core fix.
Thanks for the contribution!

@bug-ops bug-ops merged commit 9c71d41 into bug-ops:main Feb 20, 2026
27 checks passed
bug-ops added a commit that referenced this pull request Feb 20, 2026
bug-ops added a commit that referenced this pull request Feb 20, 2026
* release: prepare v0.4.4

* release: update changelog with PR #60 reference

Labels

area: parser Feed parsing logic component: core feedparser-rs-core Rust library component: node Node.js bindings (napi-rs) component: python Python bindings (PyO3) component: tests Test suite or test infrastructure lang: python Python code lang: rust Rust code size: L Large PR (<500 lines changed) type: tooling Development tools, CI/CD, or infrastructure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feedparser-rs handles entities in GUIDs differently from feedparser-py

2 participants