Support self-closing XML tags #345

ffuuugor · 2025-07-12T21:30:03Z

Issue

The current version of wikiextractor can't handle the self-closing <text /> tag, which occurs in the wiki dump when a page is completely empty.

The current parser doesn't recognize the self-closing tag and treats it as an opening tag. It then sets the inText flag and keeps consuming lines until the next explicit closing </text> tag.

For instance, the following two articles are merged into one with incorrect text and metadata.

<page>
    <title>Portal:Morocco/Morocco news</title>
    <id>2501281</id>
    <revision>
      ...
      <text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />
    </revision>
</page>
<page>
    <title>Obsidian (disambiguation)</title>
    <id>2501285</id>
    <revision>
      ...
      <text bytes="2085" sha1="kjwe64q274yupcbbfvlokltnyrcyz21" xml:space="preserve">
           {{wiktionary|obsidian}} '''[[Obsidian]]''' is a type of volcanic glass.
            ...
      </text>
    </revision>
</page>

The parsed object has the following properties:

id: 2501281 # id from the first article
title: Obsidian (disambiguation) # title from the second article
text: <revision> ... # a bunch of XML because it's mistakenly treated as text
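
The failure mode can be reproduced with a toy line-based parser. This is only a sketch and not wikiextractor's actual code: the names NAIVE_TAG, SAMPLE and naive_pages are hypothetical, the sample dump is abridged from the example above, and the state machine is simplified to the parts that matter here (an id that is only set once per page, a title that is always overwritten, and an in_text flag).

```python
import re

# Hypothetical, simplified pattern: it captures only the tag name and the
# text that follows, so "<text ... />" looks exactly like "<text ...>".
NAIVE_TAG = re.compile(r"<(/?\w+)[^>]*>([^<]*)")

# Abridged version of the two pages from the example above.
SAMPLE = """\
<page>
    <title>Portal:Morocco/Morocco news</title>
    <id>2501281</id>
    <revision>
      <text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />
    </revision>
</page>
<page>
    <title>Obsidian (disambiguation)</title>
    <id>2501285</id>
    <revision>
      <text bytes="2085" sha1="kjwe64q274yupcbbfvlokltnyrcyz21" xml:space="preserve">
           '''[[Obsidian]]''' is a type of volcanic glass.
      </text>
    </revision>
</page>
"""


def naive_pages(lines):
    """Yield (id, title, text_lines) using a naive line-based state machine."""
    page_id = title = None
    text, in_text = [], False
    for line in lines:
        m = NAIVE_TAG.search(line)
        tag = m.group(1) if m else None
        if tag == "id" and page_id is None:
            page_id = m.group(2)        # only the first <id> of a page is kept
        elif tag == "title":
            title = m.group(2)          # a later <title> overwrites the earlier one
        elif tag == "text":
            in_text = True              # bug: also triggered by <text ... />
        elif tag == "/text":
            in_text = False
        elif in_text:
            text.append(line.strip())   # swallows XML up to the next </text>
        elif tag == "/page":
            yield page_id, title, text
            page_id = title = None
            text, in_text = [], False


pages = list(naive_pages(SAMPLE.splitlines()))
for page_id, title, text in pages:
    print(page_id, title, text[:3])
```

With this sketch the two pages come out as a single record: the id is taken from the first page, the title from the second, and the text starts with stray XML such as the closing revision and page tags.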

Proposal

This PR adjusts the regexp used to parse XML tags so that it recognizes self-closing tags and handles them accordingly.
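
As an illustration of the idea, and not the actual regexp from this PR, a tag pattern can capture an optional trailing slash so the parser can tell a self-closing tag from an opening one. The names TAG, classify and opens_text_block below are hypothetical.

```python
import re

# Hypothetical pattern (an assumption, not the PR's regexp):
#   group 1 = tag name, with a leading "/" for closing tags
#   group 2 = "/" when the tag is self-closing, otherwise ""
TAG = re.compile(r"<(/?\w+)[^>]*?(/?)>")


def classify(line):
    """Return ("open" | "close" | "self-closing", name) for the first tag, or None."""
    m = TAG.search(line)
    if m is None:
        return None
    name, slash = m.group(1), m.group(2)
    if name.startswith("/"):
        return ("close", name[1:])
    if slash:
        return ("self-closing", name)
    return ("open", name)


def opens_text_block(line):
    """True only for a real opening <text ...> tag, so in_text is not set for <text ... />."""
    return classify(line) == ("open", "text")


print(classify('<text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />'))
print(classify('<text bytes="2085" xml:space="preserve">'))
print(classify("</text>"))
```

In a line-based parser, a self-closing text tag would then be treated as an empty text body: the page is emitted with no text and the in-text flag is never set, so the following page is parsed on its own.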

Commit 8955fdf: Support self-closing XML tags
