LaTeX conversion to intermediary format instead of HTML #2

rorybyrne · 2021-04-14T07:45:38Z

It might be worthwhile to convert LaTeX to some useful intermediate representation, instead of directly to HTML. That would allow this tool to behave like a compiler, with a front-end and a back-end.

There is a tool called LaTeXML which would allow conversion to XML as an intermediate representation.
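To make the front-end/back-end split concrete, here is a minimal sketch. Everything in it is hypothetical: the `Block` and `Document` types, the toy `parse_latex` front-end (which only understands `\title{...}`, `\section{...}` and plain text lines), and the two renderers. A real front-end would delegate to something like LaTeXML or plastex rather than matching strings.

```python
import html
from dataclasses import dataclass, field


@dataclass
class Block:
    """One unit of the intermediate representation (hypothetical shape)."""
    kind: str  # "heading" or "paragraph"
    text: str


@dataclass
class Document:
    title: str = ""
    blocks: list[Block] = field(default_factory=list)


def parse_latex(source: str) -> Document:
    """Toy front-end: handles only title, section and plain text lines."""
    doc = Document()
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\title{") and line.endswith("}"):
            doc.title = line[len("\\title{"):-1]
        elif line.startswith("\\section{") and line.endswith("}"):
            doc.blocks.append(Block("heading", line[len("\\section{"):-1]))
        else:
            doc.blocks.append(Block("paragraph", line))
    return doc


def render_html(doc: Document) -> str:
    """Back-end 1: intermediate representation to HTML."""
    parts = [f"<h1>{html.escape(doc.title)}</h1>"]
    for block in doc.blocks:
        tag = "h2" if block.kind == "heading" else "p"
        parts.append(f"<{tag}>{html.escape(block.text)}</{tag}>")
    return "\n".join(parts)


def render_text(doc: Document) -> str:
    """Back-end 2: the same representation rendered as plain text."""
    lines = [doc.title]
    for block in doc.blocks:
        prefix = "## " if block.kind == "heading" else ""
        lines.append(prefix + block.text)
    return "\n".join(lines)


doc = parse_latex("\\title{Some Title}\n\\section{Introduction}\nblah blah blah")
print(render_html(doc))
print(render_text(doc))
```

The point of the split is that swapping the parser or adding a new output format only touches one side of `Document`.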

rorybyrne · 2021-04-14T07:47:45Z

Another option written in Python: plastex

thesamovar · 2021-04-14T11:42:40Z

An intermediate format makes sense. That way, we could experiment with different tools.

rorybyrne · 2021-04-17T08:06:42Z

I think an abstract representation of a piece of research would be really valuable. It would need to be portable (i.e. dumpable to a file that can be parsed by many languages), and friendlier than XML (imo).

JSON sounds like a good candidate to me, esp. if rendering it as a webpage is the first use-case.

rorybyrne · 2021-04-18T18:17:39Z

One idea I'm playing with is to separate the "content" of a paper from the "narrative" of a paper. In the example below, figures, tables, and paragraphs are indexed by ID under the content key, and each entry under the narratives key is a list of references to those resources. This allows for multiple narratives, which represent different "views" on the research.

{
    title: "Some Title",
    abstract: "Abstract",
    authors: [ ... ],
    narratives: {
        full: [
            { type: "heading", content: "Introduction" },
            { type: "paragraph", id: "par01" },
            { type: "paragraph", id: "par02" },
            { type: "figure", id: "fig01" },
            ...
        ],
        short: [ ... ]
    },
    content: {
        paragraphsById: {
            "par01": "blah blah blah"
        },
        figuresById: {
            "fig01": "base64_image_data"
        }
    }
}
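A rough sketch of how a renderer might resolve one narrative against the content tables, written in Python for brevity. The `resolve` function and the `CONTENT_TABLES` mapping are hypothetical names, the elided parts of the example are left out rather than filled in, and the key spelling `figuresById` is assumed.

```python
paper = {
    "title": "Some Title",
    "narratives": {
        "full": [
            {"type": "heading", "content": "Introduction"},
            {"type": "paragraph", "id": "par01"},
            {"type": "figure", "id": "fig01"},
        ],
        "short": [
            {"type": "paragraph", "id": "par01"},
        ],
    },
    "content": {
        "paragraphsById": {"par01": "blah blah blah"},
        "figuresById": {"fig01": "base64_image_data"},
    },
}

# Maps an item type to the content table that stores it (assumed naming).
CONTENT_TABLES = {"paragraph": "paragraphsById", "figure": "figuresById"}


def resolve(paper, narrative):
    """Return (type, payload) pairs for one narrative, with IDs looked up."""
    resolved = []
    for item in paper["narratives"][narrative]:
        table = CONTENT_TABLES.get(item["type"])
        if table is None:
            # Inline items such as headings carry their own content.
            resolved.append((item["type"], item["content"]))
        else:
            resolved.append((item["type"], paper["content"][table][item["id"]]))
    return resolved


for kind, payload in resolve(paper, "full"):
    print(kind, payload)
```

Because both narratives point at the same `par01` entry, editing that paragraph once updates every view that references it.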

thesamovar · 2021-04-19T09:06:30Z

An issue that might come up is that there is a LOT of detail to get right if you want to handle all possible papers. Might be better to use an existing standard like JATS (XML-based) and build on top of that?

rorybyrne · 2021-04-19T10:14:38Z

Oh nice, JATS sounds like it's exactly what we need. I'm not too familiar with existing publishing formats so thanks for pointing it out.

I think it sounds reasonable to use JATS as the target format for LaTeX/PDF conversion, and make the app take that as input. Modern (i.e. React) webapps maintain internal state in JSON format, so we'll have to convert it to some sort of JSON representation inside the app.

As long as there's a clean interface that takes JATS as input, I don't think we should necessarily use fully-compliant JATS for the app's internal state. We can try to stick to it whenever possible, but it's more important to design state that suits the functionality requirements and keep it as lean as possible. Adding features and maintaining the code will be a nightmare otherwise (React code gets especially confusing if you're not careful).

I personally think it's fine if we don't support all the details of papers off the bat. There's a saying in the startup world that you shouldn't try to "boil the ocean". We can expand the internal state of the app as we add new functionality, and at some point it will probably reach parity with JATS.

What do you think?

thesamovar · 2021-04-19T10:31:36Z

I would assume it's very straightforward to switch between an XML and JSON representation? So it hardly matters much. The main thing would be to use JATS to make sure we haven't missed something that will come back and bite us later.

And yes absolutely, there's no way we'll support all features immediately.

rorybyrne · 2021-04-19T10:37:43Z

Yeah I assume XML -> JSON is a solved problem, and looking at the Pandoc docs it seems JSON is a supported --to option. Haven't had a chance to test it yet.
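For reference, a generic XML to JSON conversion can be sketched with only the Python standard library. This is an illustration under assumptions, not a tested pipeline: the JATS fragment is hand-written, and `element_to_dict` is a hypothetical helper. It also drops text that follows a child element (the `tail` in ElementTree terms), so inline markup inside paragraphs would need extra handling.

```python
import json
import xml.etree.ElementTree as ET

# Tiny hand-written fragment using JATS element names, for illustration only.
JATS_SNIPPET = """
<article>
  <front>
    <article-meta>
      <title-group><article-title>Some Title</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec id="s1">
      <title>Introduction</title>
      <p>blah blah blah</p>
    </sec>
  </body>
</article>
"""


def element_to_dict(elem):
    """Recursively convert an element into a JSON-friendly dict."""
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


root = ET.fromstring(JATS_SNIPPET.strip())
as_json = json.dumps(element_to_dict(root), indent=2)
print(as_json)
```

The output mirrors the XML tree one-to-one, so it is a starting point for mapping into a leaner app state rather than a replacement for that mapping.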

"use JATS to make sure we haven't missed something that will come back and bite us later"

I suppose a good rule of thumb would be "if it's not in JATS, it shouldn't be in our internal representation".

thesamovar · 2021-04-19T10:39:54Z

Well, except that we potentially want to add fine-grained metadata that won't be in JATS, but that's fine. Stuff on top can be added later. What we don't want to do is miss something fundamental that can't easily be added later on.

rorybyrne · 2021-04-19T12:16:46Z

Okay, noted. My React version isn't far off parity with your implementation, so once it's ready you can sanity check the data model I'm using and see if there are any major problems with it.
