fix: handle GUIDs with XML entities in them like feedparser-py#60
fix: handle GUIDs with XML entities in them like feedparser-py#60bug-ops merged 2 commits intobug-ops:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request fixes a bug where XML entities in GUID elements (and other text fields) were not being properly decoded, causing character loss and breaking compatibility with Python's feedparser library. The issue occurred because quick-xml 0.39's Event::GeneralRef events (emitted for entity references like &) were being silently ignored.
Changes:
- Added entity resolution logic to handle numeric character references (&, &), predefined XML entities (&, <, etc.), and unknown entities
- Updated
read_textfunction to processEvent::GeneralRefevents - Added comprehensive test coverage for entity decoding in both Rust and Python binding tests
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| crates/feedparser-rs-core/src/parser/common.rs | Added Event::GeneralRef handling in read_text() and new resolve_entity() helper function to decode character references and named entities |
| crates/feedparser-rs-py/tests/test_guid_entities.py | Added regression tests for entity decoding in GUIDs covering numeric references, hex references, predefined entities, and multiple entities |
| Ok(Event::GeneralRef(e)) => { | ||
| let resolved = resolve_entity(&e)?; | ||
| append_bytes(&mut text, resolved.as_bytes(), limits.max_text_length)?; | ||
| } |
There was a problem hiding this comment.
This implementation violates the bozo pattern that is fundamental to this project. According to the coding guidelines (CodingGuidelineID: 1000002), the parser must NEVER panic or return errors for malformed feeds. Instead, it should set the bozo flag and continue parsing.
When resolve_entity returns an error (e.g., for an invalid character reference), this will propagate up as a Result error, causing the entire parsing operation to fail or set bozo at a higher level. However, the bozo pattern requires continuing to parse and extract as much data as possible even when encountering entity resolution errors.
The fix should:
- Change
resolve_entityto returnStringinstead ofResult<String> - When an invalid character reference is encountered, preserve it as-is (e.g., "") rather than returning an error
- Optionally, the caller (
read_text) could detect when entity resolution fails and set bozo on the feed object if that context is available
This is particularly important because real-world feeds may contain invalid character references, and Python feedparser would handle these gracefully. For API compatibility (the #1 priority), we must match that behavior.
| assert_eq!(text, "https://example.com/?a=1&b=2&c=3"); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Missing test coverage for edge cases in entity handling. According to the testing guidelines (CodingGuidelineID: 1000004), tests should cover both success and failure cases, including edge cases.
Recommended additional tests:
- Invalid character reference (e.g.,
&#xFFFF;or&#999999;) - should set bozo but continue parsing - Unknown/custom entity references (e.g.,
&customEntity;) - should preserve as-is - Malformed entity syntax (e.g.,
&#x;or&) - should handle gracefully - Mix of valid and invalid entities in same text
These tests are particularly important for verifying the bozo pattern works correctly when entity resolution encounters errors.
| #[test] | |
| #[test] | |
| fn test_read_text_invalid_numeric_char_ref() { | |
| let xml = b"<guid>invalid 󴈿 ref</guid>"; | |
| let mut reader = Reader::from_reader(&xml[..]); | |
| reader.config_mut().trim_text(true); | |
| let mut buf = Vec::new(); | |
| let limits = ParserLimits::default(); | |
| loop { | |
| match reader.read_event_into(&mut buf) { | |
| Ok(Event::Start(_)) => break, | |
| Ok(Event::Eof) => panic!("Unexpected EOF"), | |
| _ => {} | |
| } | |
| buf.clear(); | |
| } | |
| buf.clear(); | |
| let result = read_text(&mut reader, &mut buf, &limits); | |
| assert!(result.is_err(), "Invalid numeric character reference should result in an error"); | |
| } | |
| #[test] | |
| fn test_read_text_unknown_named_entity() { | |
| let xml = b"<guid>value &customEntity; here</guid>"; | |
| let mut reader = Reader::from_reader(&xml[..]); | |
| reader.config_mut().trim_text(true); | |
| let mut buf = Vec::new(); | |
| let limits = ParserLimits::default(); | |
| loop { | |
| match reader.read_event_into(&mut buf) { | |
| Ok(Event::Start(_)) => break, | |
| Ok(Event::Eof) => panic!("Unexpected EOF"), | |
| _ => {} | |
| } | |
| buf.clear(); | |
| } | |
| buf.clear(); | |
| let result = read_text(&mut reader, &mut buf, &limits); | |
| assert!(result.is_err(), "Unknown/custom entity should be surfaced as an error at this level"); | |
| } | |
| #[test] | |
| fn test_read_text_malformed_entity_syntax() { | |
| // Malformed entity: missing code after &#x and bare '&' | |
| let xml = b"<guid>value &#x; and &</guid>"; | |
| let mut reader = Reader::from_reader(&xml[..]); | |
| reader.config_mut().trim_text(true); | |
| let mut buf = Vec::new(); | |
| let limits = ParserLimits::default(); | |
| loop { | |
| match reader.read_event_into(&mut buf) { | |
| Ok(Event::Start(_)) => break, | |
| Ok(Event::Eof) => panic!("Unexpected EOF"), | |
| _ => {} | |
| } | |
| buf.clear(); | |
| } | |
| buf.clear(); | |
| let result = read_text(&mut reader, &mut buf, &limits); | |
| assert!(result.is_err(), "Malformed entity syntax should not cause a panic and must be reported as an error"); | |
| } | |
| #[test] | |
| fn test_read_text_mixed_valid_and_invalid_entities() { | |
| let xml = b"<guid>ok & bad 󴈿 mix</guid>"; | |
| let mut reader = Reader::from_reader(&xml[..]); | |
| reader.config_mut().trim_text(true); | |
| let mut buf = Vec::new(); | |
| let limits = ParserLimits::default(); | |
| loop { | |
| match reader.read_event_into(&mut buf) { | |
| Ok(Event::Start(_)) => break, | |
| Ok(Event::Eof) => panic!("Unexpected EOF"), | |
| _ => {} | |
| } | |
| buf.clear(); | |
| } | |
| buf.clear(); | |
| let result = read_text(&mut reader, &mut buf, &limits); | |
| assert!(result.is_err(), "Mixed valid and invalid entities should be handled without panicking"); | |
| } | |
| #[test] |
|
Hey, @fazalmajid ! Thanks for the fix! Good catch. I've merged #62 which updates dependencies, so the Security Audit check should pass now. |
3c9ec33 to
d87df00
Compare
|
@bug-ops I rebased and implemented the Copilot recommendations and associated tests. Three things I found in doing so:
(and note it also sets the bozo flag with
Thanks for the work on feedparser-rs! I was a contributor to feedparser-py, and looking forward to incorporating feedparser-rs in my WIP Rust rewrite of https://github.com/fazalmajid/temboz after a false start with brittle feed-rs. |
|
Also, just for the sake of clarity in the absence of a CLA: I agree to the terms in CONTRUBUTING.ms and agree this code will be licensed under the terms of the project (dual MIT or Apache 2.0 license). |
|
Thanks for the detailed analysis and the rebase! |
* release: prepare v0.4.4 * release: update changelog with PR #60 reference
Summary
Handle XML entities in attributes like GUIDs
Motivation
See #59 for details
Fixes #59
Changes
Handle Quick-XML Event::GeneralRef when handling element text.
Type of Change
Test Plan
cargo make test(all tests pass)cargo make lint(no warnings)Note that cargo make ci-all is failing due to an unrelated issue https://rustsec.org/advisories/RUSTSEC-2026-0013 in pyo3 and there is another one in cargo-nexttest, but I didn't think you'd want to commingle those security fixes with this regression fix
Checklist
Additional Notes
Hyrum's law strikes again. This would fix false positives in code that is switching from feedparser-py to feedparser-rs, but code that was using the incorrect GUIDs generated by older versions of feedparser-rs will experience false positives.