extension for extract hyperlinks #10

Closed
badbye wants to merge 12 commits into ankushshah89:master from badbye:master

badbye commented Mar 2, 2017

Hi, this pull request adds support for extracting the hyperlinks in docx files.
Related issue: #9

I have to admit that I made a lot of changes. However, everything still works as before. I would really appreciate it if you merged it.

Usage

I created a DOCReader class that stores all the information from a docx file. After the process method runs, the data attribute holds the text of the header, footer, links, and document.

obj = DOCReader(docx_file)
text = obj.process()
print(obj.data['links'])  # a list of (text, hyperlink) tuples

How it works

The hyperlinks are stored in the word/_rels/document.xml.rels file. Every link has an Id attribute, for example:

<Relationship Id="rId14" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://www.google.com" TargetMode="External"/>
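To make the lookup concrete, here is a minimal sketch (not the code from this pull request) of how such a .rels file could be turned into an Id-to-target mapping using only the standard library. The function name `hyperlink_targets` and the sample XML are illustrative assumptions. The namespace and relationship type URIs are the standard Open Packaging Conventions values.

```python
import xml.etree.ElementTree as ET

# Namespace of the <Relationships>/<Relationship> elements in a .rels part.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type that marks a hyperlink, as in the example above.
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)


def hyperlink_targets(rels_xml):
    """Return {relationship Id: target URL} for hyperlink relationships only.

    Hypothetical helper for illustration; not part of docx2txt.
    """
    root = ET.fromstring(rels_xml)
    targets = {}
    for rel in root.iter("{%s}Relationship" % RELS_NS):
        if rel.get("Type") == HYPERLINK_TYPE:
            targets[rel.get("Id")] = rel.get("Target")
    return targets


# Made-up sample with one hyperlink and one unrelated relationship.
sample = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId14" Type="%s" '
    'Target="https://www.google.com" TargetMode="External"/>'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" '
    'Target="styles.xml"/>'
    "</Relationships>" % (RELS_NS, HYPERLINK_TYPE)
)

print(hyperlink_targets(sample))
```

In a real docx file the XML string would come from reading word/_rels/document.xml.rels out of the zip archive, for example with the standard `zipfile` module.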

The corresponding text is stored in the other documents, which also carry the Id attributes.
While parsing those documents (in the xml2text method), I added some code to look up the corresponding text.
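The matching step can be sketched roughly as follows. This is an illustration under assumptions, not the xml2text changes from this pull request. It assumes a dict mapping relationship Ids to targets has already been built from the .rels file, and that link text sits in `w:t` elements inside a `w:hyperlink` element whose `r:id` attribute holds the relationship Id, as WordprocessingML lays it out. The names `link_pairs`, `sample_doc`, and `sample_targets` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and relationship-attribute namespaces.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def link_pairs(document_xml, targets):
    """Return a list of (text, hyperlink) tuples found in one XML part.

    `targets` maps relationship Ids (e.g. "rId14") to URLs.
    Hypothetical helper for illustration; not part of docx2txt.
    """
    root = ET.fromstring(document_xml)
    pairs = []
    for link in root.iter("{%s}hyperlink" % W_NS):
        rid = link.get("{%s}id" % R_NS)
        if rid not in targets:
            continue  # e.g. internal anchors, which carry no r:id
        # Link text may be split across several runs, so join every w:t.
        text = "".join(t.text or "" for t in link.iter("{%s}t" % W_NS))
        pairs.append((text, targets[rid]))
    return pairs


# Made-up document fragment: plain text followed by a two-run hyperlink.
sample_doc = (
    '<w:document xmlns:w="%s" xmlns:r="%s"><w:body><w:p>'
    "<w:r><w:t>Search with </w:t></w:r>"
    '<w:hyperlink r:id="rId14">'
    "<w:r><w:t>Goo</w:t></w:r><w:r><w:t>gle</w:t></w:r>"
    "</w:hyperlink>"
    "</w:p></w:body></w:document>" % (W_NS, R_NS)
)
sample_targets = {"rId14": "https://www.google.com"}

print(link_pairs(sample_doc, sample_targets))
```

Joining all `w:t` descendants matters because Word often splits a single visible link across multiple runs when formatting or spell-check state changes mid-word.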

badbye closed this Mar 7, 2017