@ffuuugor
Issue

The current version of wikiextractor can't handle the self-closing <text /> tag, which occurs in the wiki dump when a page is completely empty.

The current parser doesn't recognize the self-closing tag and treats it as an opening tag. It then sets the inText flag and keeps consuming lines as text until the next explicit closing </text> tag.

For instance, the following two articles are merged into one with incorrect text and metadata.

<page>
    <title>Portal:Morocco/Morocco news</title>
    <id>2501281</id>
    <revision>
      ...
      <text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />
    </revision>
</page>
<page>
    <title>Obsidian (disambiguation)</title>
    <id>2501285</id>
    <revision>
      ...
      <text bytes="2085" sha1="kjwe64q274yupcbbfvlokltnyrcyz21" xml:space="preserve">
           {{wiktionary|obsidian}} '''[[Obsidian]]''' is a type of volcanic glass.
            ...
      </text>
    </revision>
</page>

The parsed object has the following properties:

id: 2501281 # id from the first article
title: Obsidian (disambiguation) # title from the second article
text: <revision> ... # a bunch of XML because it's mistakenly treated as text
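
The failure mode can be reproduced with a simplified sketch. This is not the extractor's actual code: `NAIVE_TAG_RE`, `naive_text_lines`, and the trimmed-down `dump` are hypothetical stand-ins that only mimic the behaviour described above (a line-based scan that sets a flag on any `<text ...>` tag and clears it on `</text>`).

```python
import re

# Hypothetical, simplified tag pattern: it captures the tag name only,
# so "<text ... />" looks exactly like an opening "<text ...>".
NAIVE_TAG_RE = re.compile(r'<(/?\w+)[^>]*>')

def naive_text_lines(lines):
    """Return the lines a naive parser would collect as article text."""
    in_text = False
    collected = []
    for line in lines:
        m = NAIVE_TAG_RE.search(line)
        tag = m.group(1) if m else None
        if in_text:
            if tag == '/text':
                in_text = False
            else:
                collected.append(line.strip())
        elif tag == 'text':
            in_text = True  # bug: also triggered by the self-closing form
    return collected

# Trimmed-down version of the two pages from the example above.
dump = [
    '<page>',
    '<title>Portal:Morocco/Morocco news</title>',
    '<text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />',
    '</revision>',
    '</page>',
    '<page>',
    '<title>Obsidian (disambiguation)</title>',
    '<text bytes="2085" xml:space="preserve">',
    "'''[[Obsidian]]''' is a type of volcanic glass.",
    '</text>',
]

leaked = naive_text_lines(dump)
for line in leaked:
    print(line)
```

In this sketch the markup between the empty page and the next `</text>` (including the second page's `<page>` and `<title>` lines) ends up in the collected text, which matches the merged result shown above.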

Proposal

This PR adjusts the regexp used to parse XML tags so that it recognizes self-closing tags and handles them accordingly.
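
As a rough illustration of the idea (not the exact regexp in this PR), a tag pattern can capture an optional `/` immediately before the closing `>` and use it to tell the three cases apart. The names `TAG_RE` and `classify` below are hypothetical.

```python
import re

# Hypothetical pattern, for illustration only:
#   group 1: tag name, with a leading "/" for closing tags
#   group 2: "/" if the tag is self-closing, otherwise empty
TAG_RE = re.compile(r'<(/?\w+)[^>]*?(/?)>')

def classify(line):
    """Return (tag_name, kind) for the first tag in line, or None.

    kind is one of 'open', 'close', or 'self-closing'.
    """
    m = TAG_RE.search(line)
    if m is None:
        return None
    name, slash = m.group(1), m.group(2)
    if name.startswith('/'):
        return name[1:], 'close'
    if slash:
        return name, 'self-closing'
    return name, 'open'

empty = classify('<text bytes="0" sha1="phoiac9h4m842xq45sp7s6u21eteeq1" />')
opened = classify('<text bytes="2085" xml:space="preserve">')
closed = classify('</text>')
print(empty, opened, closed)
```

With a distinction like this, a parser can treat a self-closing `<text />` as an empty article body and leave the inText flag unset, so the following page is parsed on its own.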
