An IPCC Chapter in PDF is analysed and converted to semantic form
PDF2HTML
raw HTML
pdfplumber and pdfminer convert the text to character streams with coordinates and styles, and these are converted to HTML line by line as divs (output: raw.html). This is largely automatic, with some heuristics to detect non-text items such as footnotes, tables, etc.
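To illustrate the first step, here is a minimal sketch of grouping character records into text lines by vertical position. The field names (text, x0, top, fontname, size) follow pdfplumber's page.chars convention, but the input here is mock data and chars_to_lines is a hypothetical helper, not py4ami's actual code:

```python
# Sketch: group pdfplumber-style character dicts into lines by 'top' coordinate.
# Mock data stands in for page.chars from a real PDF.
from itertools import groupby

chars = [
    {"text": "I", "x0": 10, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "P", "x0": 16, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "C", "x0": 22, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "C", "x0": 28, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "A", "x0": 10, "top": 115, "fontname": "Helvetica", "size": 10},
    {"text": "R", "x0": 16, "top": 115, "fontname": "Helvetica", "size": 10},
    {"text": "6", "x0": 22, "top": 115, "fontname": "Helvetica", "size": 10},
]

def chars_to_lines(chars, y_tolerance=2):
    """Bucket characters whose 'top' values fall within y_tolerance into lines,
    then order each line left to right by x0."""
    lines = []
    for _, line_chars in groupby(sorted(chars, key=lambda c: c["top"]),
                                 key=lambda c: round(c["top"] / y_tolerance)):
        line_chars = sorted(line_chars, key=lambda c: c["x0"])
        lines.append("".join(c["text"] for c in line_chars))
    return lines

print(chars_to_lines(chars))  # -> ['IPCC', 'AR6']
```

The style fields (fontname, size) carried along with each character are what later drive the span-merging step.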
raw HTML2 spans
Absolute coordinates are converted to structured HTML without explicit pages: divs for paragraphs and spans for continuous chunks of the same style. This is probably good enough for word/phrase extraction and for creating nodes in a knowledge graph.
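The style-run merge described above can be sketched as follows. The (text, style) chunks and the line_to_div helper are illustrative assumptions, not py4ami's API; the point is only that consecutive chunks sharing a style fuse into one span, and a line becomes a div:

```python
# Sketch: fuse consecutive same-style chunks into <span>s, one <div> per line.
from itertools import groupby
from xml.sax.saxutils import escape

# (text, style) chunks for one line, as might come from the coordinate pass
chunks = [
    ("Climate ", "font-family:Helvetica;font-size:10"),
    ("change ", "font-family:Helvetica;font-size:10"),
    ("mitigation", "font-family:Helvetica-Bold;font-size:10"),
]

def line_to_div(chunks):
    spans = []
    for style, run in groupby(chunks, key=lambda c: c[1]):
        text = "".join(t for t, _ in run)
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return "<div>" + "".join(spans) + "</div>"

html_line = line_to_div(chunks)
print(html_line)  # two spans: the bold run stays separate
```

Here the first two chunks share a style and collapse into a single span, which is exactly the "continuous chunks of the same style" behaviour described above.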
per-document semantics
Recognition of components specific to the document, with individual conversion for each.
Example: conversion of a typical chapter (WG3/Chapter15).
All input and output will be included in the py4ami repository.
Code
Run from tests at present; will move to the command line.
def test_make_ipcc_html_spans(self):
    """
    Read some/all PDF pages in a chapter.
    Parse with pdfplumber into raw_html
    (uses the PDFMiner library; no coordinates, I think), then
    use flow_tidy() to determine pages, wrap lines (span/div) in pages,
    join pages, and trim headers, footers and sides; then
    create an HtmlTidy to remove or edit unwanted span/div/br.
    USED
    MODEL
    """
    chapters = {
        "Chapter04": {
            "pages": "107"
        },
        "Chapter15": {
            "pages": "103"
        }
    }
    # chapter = "Chapter04"
    chapter = "Chapter15"
    chapters_dir = Path(Resources.TEST_IPCC_DIR)
    unwanteds = {
        "chapter": {
            "xpath": ".//div/span",
            "regex": "^Chapter\\s+\\d+\\s*$"
        },
        "final_gov": {
            "xpath": ".//div/span",
            "regex": "^\\s*Final Government Distribution\\s*$"
        },
        "page": {
            "xpath": ".//div/a",
            "regex": "^\\s*Page\\s*\\d+\\s*$",
        },
        "wg3": {
            "xpath": ".//div/span",
            # group the alternation so the ^...$ anchors apply to both forms
            "regex": "^\\s*(IPCC AR6 WGIII|IPCC WGIII AR6)\\s*$",
        }
    }
    print(f"Converting chapter: {chapter}")
    chapter_dir = Path(chapters_dir, chapter)
    pdf_args = self.create_pdf_args_for_chapters(chapter, chapter_dir, chapters, unwanteds=unwanteds)
    _, _ = pdf_args.convert_write()  # TODO: refactor
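As a hedged sketch of what the unwanteds table drives, the following uses the stdlib ElementTree on a toy document: for each rule, find elements by XPath and delete those whose text matches the regex. py4ami's real implementation (inside HtmlTidy) may differ in library and detail:

```python
# Sketch: apply unwanteds-style {xpath, regex} rules to strip boilerplate spans.
import re
import xml.etree.ElementTree as ET

HTML = """<html><body>
<div><span>Chapter 15</span></div>
<div><span>Mitigation options exist in every sector.</span></div>
<div><span>Final Government Distribution</span></div>
</body></html>"""

unwanteds = {
    "chapter": {"xpath": ".//div/span", "regex": r"^Chapter\s+\d+\s*$"},
    "final_gov": {"xpath": ".//div/span",
                  "regex": r"^\s*Final Government Distribution\s*$"},
}

def remove_unwanteds(root, unwanteds):
    # ElementTree elements have no parent pointers, so build a parent map first
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for rule in unwanteds.values():
        pattern = re.compile(rule["regex"])
        for elem in root.findall(rule["xpath"]):
            if elem.text and pattern.match(elem.text):
                parent_of[elem].remove(elem)

root = ET.fromstring(HTML)
remove_unwanteds(root, unwanteds)
texts = [s.text for s in root.findall(".//span")]
print(texts)  # only the real content line survives
```

Note that this only removes the matching span and leaves its now-empty div behind; a fuller tidy pass would also prune empty containers.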
input
  dir: test/resources/ipcc/Chapter15
  file: fulltext.pdf
output
  to: temp/html/ipcc/Chapter15
  files created:
comments
  Most of the text is faithfully converted.
  problems:
  enhancements required: TODO