Chapter05 #25
19bsr16054
started this conversation in
General
Replies: 1 comment
Added a new method for the kw extraction file, to just extract the
What is semanticClimate?
semanticClimate is a project that aims to convert climate-related IPCC documents into a semantic form that both humans and computers can understand.
Why are such conversions required?
The goal is to extract useful information from the IPCC documents and make it accessible for everyone to read and understand. This is achieved by creating dictionaries of different types for different purposes.
What is chapter 05 about?
Summary of the Chapter
Environment and Origin
Tools used: py4ami and docanalysis
How we set up the software:
pip install docanalysis
This will install both py4ami and docanalysis.
How to create dictionaries?
We create three dictionaries: an abbreviation dictionary, a manual dictionary, and a keyword dictionary.
The chapter PDF is converted to HTML (fulltext.html) with the py4ami program. This can be achieved by following the steps below:
cd: navigate to the desired location.
git clone https://github.com/petermr/semanticClimate
This clones the repo to the desired location (ideally a location that doesn't require admin privileges). Alternatively, clone using GitHub Desktop.
cd: again, navigate into semanticClimate and to your chapter directory, in this case:
cd E:\git\semanticClimate\ipcc\ar6\wg3\Chapter05
Place fulltext.pdf inside the chapter folder, in this case E:\git\semanticClimate\ipcc\ar6\wg3\Chapter05\fulltext.pdf
python -m py4ami.ami_pdf --inpath fulltext.pdf --outdir E:\git\semanticClimate\ipcc\ar6\wg3\Chapter05 --maxpage 190
--maxpage is used to indicate the number of pages that need to be converted.
--outdir provides the path for the extra files generated (fulltext.flow_.html, fulltext_.html, TOC, Executive Summary, References and FAQs), while the final fulltext.html is output to the directory the command is run from, regardless of this argument.
PS: The log message provided by py4ami for py4ami --help is incorrect and has not been fixed yet, so do not use the example commands/arguments provided. The PDF argument doesn't have clear documentation, so I have not been able to use it as it's supposed to be used.
The abbreviation dictionary is created by docanalysis using the spaCy method, via the Google Colab Notebook.
The docanalysis documentation could not be replicated in the venv without running into issues with respect to the C-Tree structure. An argument to convert documents without using the C-Tree structure would be preferable.
As an alternative, use the Google Colab Notebook created by one of our previous members, which can run docanalysis without any issues and is well documented.
git push: cd to the root of the repository (E:\git\semanticClimate in this case) and run:
git commit -a
git push
The keywords dictionary is created using the keyword extraction program, which uses the gensim method.
The manual dictionary is manually created by the chapter champions from reading the chapter and picking out words or bi-grams that are less frequently used or are difficult to understand in the context of the report.
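For anyone who prefers to script the PDF-to-HTML step described above, here is a minimal Python sketch. It only reassembles the ami_pdf command quoted in this post (same flags, nothing new) and runs it from inside the chapter folder, since the final fulltext.html is written to the directory the command is run from. The function names build_pdf_command and convert_chapter are my own and hypothetical, not part of py4ami, and sys.executable stands in for the plain python call.

```python
import subprocess
import sys
from pathlib import Path


def build_pdf_command(chapter_dir: Path, max_page: int) -> list[str]:
    """Assemble the ami_pdf call quoted in the post (flags copied verbatim)."""
    return [
        sys.executable, "-m", "py4ami.ami_pdf",
        "--inpath", "fulltext.pdf",
        "--outdir", str(chapter_dir),
        "--maxpage", str(max_page),
    ]


def convert_chapter(chapter_dir: Path, max_page: int) -> None:
    """Run the conversion from inside the chapter folder (hypothetical helper)."""
    # Use an absolute path so --outdir still points at the chapter folder
    # after the working directory changes.
    chapter_dir = chapter_dir.resolve()
    pdf = chapter_dir / "fulltext.pdf"
    if not pdf.is_file():
        raise FileNotFoundError(f"expected the chapter PDF at {pdf}")
    # cwd matters: per the post, fulltext.html lands in the directory
    # the command is run from, whatever --outdir says.
    subprocess.run(build_pdf_command(chapter_dir, max_page), cwd=chapter_dir, check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_chapter(...) to actually run it.
    print(build_pdf_command(Path("ipcc/ar6/wg3/Chapter05"), 190))
```

Calling convert_chapter requires py4ami to be installed in the same environment; build_pdf_command on its own has no such requirement and is handy for checking the arguments first.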
Chapter Annotation:
The complete chapter .html file is annotated with the abbreviations from the abbreviations dictionary to make it easier to read.
Colab notebook: https://colab.research.google.com/github/petermr/semanticClimate/blob/main/outreach/climate_knowledge_hunt_hackathon/Hackathon_Notebook/Chapter_Analysis_Notebook.ipynb
The annotation is done using the following command:
py4ami HTML --annotate --dict {insert abbreviation dictionary path here} --inpath {insert the html path here} --outpath {insert output directory path here} --color YELLOW
where,
--dict – the path to the dictionary used for annotation, including the dictionary file name.
--inpath – the path to the full-text html, including the name of the html file.
--outpath – the path of the output, including the name of the output html file.
--color – the colour in which the annotated words should be highlighted in the annotated html.
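To give a feel for what the annotation produces, here is a small Python sketch of the general idea: wrap every dictionary term found in a piece of text in a coloured span. This is not py4ami's implementation. It is a simplified stand-in that works on plain text instead of a parsed HTML file, and the function name annotate and its parameters are hypothetical.

```python
import html
import re


def annotate(text: str, terms: list[str], color: str = "yellow") -> str:
    """Wrap whole-word occurrences of each term in a highlighted <span>."""
    if not terms:
        return html.escape(text)
    # Longest terms first, so a longer entry wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    # With one capture group, re.split alternates: plain text, term, plain text, ...
    parts = pattern.split(text)
    out = []
    for i, part in enumerate(parts):
        escaped = html.escape(part)
        if i % 2 == 1:  # odd positions hold the matched terms
            out.append(f'<span style="background-color: {color}">{escaped}</span>')
        else:
            out.append(escaped)
    return "".join(out)


if __name__ == "__main__":
    print(annotate("GHG emissions rose", ["GHG"]))
```

Matching is case-sensitive and on whole words only, so "GHGs" is left alone when the dictionary contains just "GHG"; whether py4ami behaves the same way is something I have not checked.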