Support self-closing XML tags #345
Open
Issue
The current version of wikiextractor can't handle the self-closing <text /> tag, which occurs in the wiki dump when a page is completely empty. The current parser doesn't recognize a self-closing tag and treats it as an opening tag. It then sets the inText flag and continues until the next explicit closing </text> tag.
For instance, the following two articles are merged into one with incorrect text and metadata.
The parsed object has the following properties:
Proposal
This PR adjusts the regexp used to parse XML tags so that it recognizes self-closing tags and handles them accordingly.
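The idea can be illustrated with a minimal sketch. This is not the regexp or parser from this PR; the names TAG_RE, classify and collect_texts are hypothetical, and the sample dump lines in the usage note are invented for illustration. It assumes the extra capture group for a trailing "/" is enough to tell a self-closing tag from an opening one.

```python
import re

# Hypothetical pattern (not the project's actual regexp):
#   group 1: "/" if this is a closing tag such as </text>
#   group 2: the tag name
#   group 3: "/" if the tag is self-closing such as <text />
TAG_RE = re.compile(r"<(/?)(\w+)[^>]*?(/?)>")


def classify(match):
    """Return 'open', 'close' or 'self-closing' for a TAG_RE match."""
    closing, _name, self_closing = match.groups()
    if self_closing:
        return "self-closing"
    return "close" if closing else "open"


def collect_texts(lines):
    """Return one string per <text> element; '' for self-closing ones."""
    texts = []
    buffer = None  # None means we are outside a <text> element
    for line in lines:
        pos = 0
        for m in TAG_RE.finditer(line):
            if buffer is not None:
                # Keep the text that precedes this tag.
                buffer.append(line[pos:m.start()])
            pos = m.end()
            if m.group(2) != "text":
                continue
            kind = classify(m)
            if kind == "self-closing":
                # Empty page: emit an empty text, never enter text mode.
                texts.append("")
            elif kind == "open":
                buffer = []
            elif buffer is not None:
                texts.append("".join(buffer))
                buffer = None
        if buffer is not None:
            # The element continues on the next line.
            buffer.append(line[pos:])
    return texts
```

With invented input such as a page containing `<text bytes="0" />` followed by a page containing `<text xml:space="preserve">Hello</text>`, calling collect_texts on the lines yields two separate entries, so the empty page no longer swallows the text of the page after it.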