An IPCC Chapter in PDF is analysed and converted to semantic form
PDF2HTML
raw HTML
pdfplumber and pdfminer convert the text to character streams with coordinates and styles, and these are converted to HTML line by line as divs (output: raw.html). This is largely automatic, with some heuristics to detect non-text items such as footnotes, tables, etc.
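To illustrate the first step, here is a minimal sketch of grouping character records into text lines by vertical position. The field names (text, x0, top, fontname, size) follow pdfplumber's page.chars convention, but the input here is mock data and chars_to_lines is a hypothetical helper, not py4ami's actual code:

```python
# Sketch: group pdfplumber-style character dicts into lines by 'top' coordinate.
# Mock data stands in for page.chars from a real PDF.
from itertools import groupby

chars = [
    {"text": "I", "x0": 10, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "P", "x0": 16, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "C", "x0": 22, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "C", "x0": 28, "top": 100, "fontname": "Helvetica-Bold", "size": 12},
    {"text": "A", "x0": 10, "top": 115, "fontname": "Helvetica", "size": 10},
    {"text": "R", "x0": 16, "top": 115, "fontname": "Helvetica", "size": 10},
    {"text": "6", "x0": 22, "top": 115, "fontname": "Helvetica", "size": 10},
]

def chars_to_lines(chars, y_tolerance=2):
    """Bucket characters whose 'top' values fall within y_tolerance into lines,
    then order each line left to right by x0."""
    lines = []
    for _, line_chars in groupby(sorted(chars, key=lambda c: c["top"]),
                                 key=lambda c: round(c["top"] / y_tolerance)):
        line_chars = sorted(line_chars, key=lambda c: c["x0"])
        lines.append("".join(c["text"] for c in line_chars))
    return lines

print(chars_to_lines(chars))  # -> ['IPCC', 'AR6']
```

The style fields (fontname, size) carried along with each character are what later drive the span-merging step.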
raw HTML2 spans
Absolute coordinates are converted to structured HTML without explicit pages: divs for paragraphs and spans for continuous chunks of the same style. This is probably good enough for word/phrase extraction and for creating nodes in a knowledge graph.
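The style-run merge described above can be sketched as follows. The (text, style) chunks and the line_to_div helper are illustrative assumptions, not py4ami's API; the point is only that consecutive chunks sharing a style fuse into one span, and a line becomes a div:

```python
# Sketch: fuse consecutive same-style chunks into <span>s, one <div> per line.
from itertools import groupby
from xml.sax.saxutils import escape

# (text, style) chunks for one line, as might come from the coordinate pass
chunks = [
    ("Climate ", "font-family:Helvetica;font-size:10"),
    ("change ", "font-family:Helvetica;font-size:10"),
    ("mitigation", "font-family:Helvetica-Bold;font-size:10"),
]

def line_to_div(chunks):
    spans = []
    for style, run in groupby(chunks, key=lambda c: c[1]):
        text = "".join(t for t, _ in run)
        spans.append(f'<span style="{style}">{escape(text)}</span>')
    return "<div>" + "".join(spans) + "</div>"

html_line = line_to_div(chunks)
print(html_line)  # two spans: the bold run stays separate
```

Here the first two chunks share a style and collapse into a single span, which is exactly the "continuous chunks of the same style" behaviour described above.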
per-document semantics
Recognition of components specific to the document, with individual conversion for each.
Example: conversion of a typical chapter (WG3/Chapter15).
All input and output will be included in the py4ami repository.
Code
Run from tests at present; will move to the command line.
def test_make_ipcc_html_spans(self):
    """
    Read some/all PDF pages in a chapter.
    Parse with pdfplumber into raw_html
    (uses the PDFMiner library; no coordinates, I think), then
    use flow_tidy() to determine pages, wrap lines (span/div) in pages,
    join pages, and trim headers, footers and sides; then
    create an HtmlTidy to remove or edit unwanted span/div/br.
    USED
    MODEL
    """
    chapters = {
        "Chapter04": {
            "pages": "107"
        },
        "Chapter15": {
            "pages": "103"
        }
    }
    # chapter = "Chapter04"
    chapter = "Chapter15"
    chapters_dir = Path(Resources.TEST_IPCC_DIR)
    unwanteds = {
        "chapter": {
            "xpath": ".//div/span",
            "regex": "^Chapter\\s+\\d+\\s*$"
        },
        "final_gov": {
            "xpath": ".//div/span",
            "regex": "^\\s*Final Government Distribution\\s*$"
        },
        "page": {
            "xpath": ".//div/a",
            "regex": "^\\s*Page\\s*\\d+\\s*$",
        },
        "wg3": {
            "xpath": ".//div/span",
            # group the alternation so the ^...$ anchors apply to both forms
            "regex": "^\\s*(IPCC AR6 WGIII|IPCC WGIII AR6)\\s*$",
        }
    }
    print(f"Converting chapter: {chapter}")
    chapter_dir = Path(chapters_dir, chapter)
    pdf_args = self.create_pdf_args_for_chapters(chapter, chapter_dir, chapters, unwanteds=unwanteds)
    _, _ = pdf_args.convert_write()  # TODO: refactor
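As a hedged sketch of what the unwanteds table drives, the following uses the stdlib ElementTree on a toy document: for each rule, find elements by XPath and delete those whose text matches the regex. py4ami's real implementation (inside HtmlTidy) may differ in library and detail:

```python
# Sketch: apply unwanteds-style {xpath, regex} rules to strip boilerplate spans.
import re
import xml.etree.ElementTree as ET

HTML = """<html><body>
<div><span>Chapter 15</span></div>
<div><span>Mitigation options exist in every sector.</span></div>
<div><span>Final Government Distribution</span></div>
</body></html>"""

unwanteds = {
    "chapter": {"xpath": ".//div/span", "regex": r"^Chapter\s+\d+\s*$"},
    "final_gov": {"xpath": ".//div/span",
                  "regex": r"^\s*Final Government Distribution\s*$"},
}

def remove_unwanteds(root, unwanteds):
    # ElementTree elements have no parent pointers, so build a parent map first
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for rule in unwanteds.values():
        pattern = re.compile(rule["regex"])
        for elem in root.findall(rule["xpath"]):
            if elem.text and pattern.match(elem.text):
                parent_of[elem].remove(elem)

root = ET.fromstring(HTML)
remove_unwanteds(root, unwanteds)
texts = [s.text for s in root.findall(".//span")]
print(texts)  # only the real content line survives
```

Note that this only removes the matching span and leaves its now-empty div behind; a fuller tidy pass would also prune empty containers.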
input
  dir: test/resources/ipcc/Chapter15
  file: fulltext.pdf
output
  to: temp/html/ipcc/Chapter15
  files created:
comments
  Most of the text is faithfully converted.
  problems:
  enhancements required: TODO