LaTeX conversion to intermediary format instead of HTML #2

Open
rorybyrne opened this issue Apr 14, 2021 · 10 comments
Comments

@rorybyrne
Collaborator

It might be worthwhile to convert LaTeX to some useful intermediate representation, instead of directly to HTML. That would allow this tool to behave like a compiler, with a front-end and a back-end.
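To make the front-end/back-end split concrete, here is a minimal sketch. It assumes a toy intermediate representation (the `Block` class) and a stand-in front end that only recognises `\section{...}` lines. None of these names come from this project, and a real front end would use a proper LaTeX parser.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    """One node of a hypothetical intermediate representation (IR)."""
    kind: str  # "heading" or "paragraph"
    text: str


def parse_source(source: str) -> list[Block]:
    """Toy front end: \\section{...} lines become headings, other lines paragraphs."""
    prefix = "\\section{"
    blocks = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(prefix) and line.endswith("}"):
            blocks.append(Block("heading", line[len(prefix):-1]))
        else:
            blocks.append(Block("paragraph", line))
    return blocks


def render_html(blocks: list[Block]) -> str:
    """Back end 1: emit HTML from the IR."""
    tags = {"heading": "h2", "paragraph": "p"}
    return "\n".join(
        f"<{tags[b.kind]}>{escape(b.text)}</{tags[b.kind]}>" for b in blocks
    )


def render_text(blocks: list[Block]) -> str:
    """Back end 2: emit plain text from the same IR."""
    return "\n\n".join(
        b.text.upper() if b.kind == "heading" else b.text for b in blocks
    )
```

Both back ends consume the same list of `Block` objects, so swapping the front end (or adding another output format) would not require touching the other side.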

There is a tool called LaTeXML which would allow conversion to XML as an intermediate representation.

@rorybyrne
Collaborator Author

Another option written in Python: plastex

@thesamovar
Owner

An intermediate format makes sense. That way, we could experiment with different tools.

@rorybyrne
Collaborator Author

rorybyrne commented Apr 17, 2021

I think an abstract representation of a piece of research would be really valuable. It would need to be portable (i.e. dumpable to a file that can be parsed by many languages), and friendlier than XML (imo).

JSON sounds like a good candidate to me, esp. if rendering it as a webpage is the first use-case.

@rorybyrne
Collaborator Author

rorybyrne commented Apr 18, 2021

One idea I'm playing with is to separate the "content" of a paper from the "narrative" of a paper. In the example below, figures, tables, and paragraphs are indexed by ID under the content key, and each entry under the narratives key is a list of references to those resources. This allows for multiple narratives, which represent different "views" on the research.

{
    title: "Some Title",
    abstract: "Abstract",
    authors: [ ... ],
    narratives: {
        full: [
            { type: "heading", content: "Introduction" },
            { type: "paragraph", id: "par01" },
            { type: "paragraph", id: "par02" },
            { type: "figure", id: "fig01" },
            ...
        ],
        short: [ ... ]
    },
    content: {
        paragraphsById: {
            "par01": "blah blah blah"
        },
        figuresById: {
            "fig01": "base64_image_data"
        }
    }
}
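As a rough illustration of how a renderer might consume this structure, the sketch below resolves a named narrative into a flat list of blocks. The `resolve` function and the `TABLES` mapping are hypothetical, and the entries in the `short` narrative are made up for the example, since the original leaves them elided.

```python
paper = {
    "title": "Some Title",
    "narratives": {
        "full": [
            {"type": "heading", "content": "Introduction"},
            {"type": "paragraph", "id": "par01"},
            {"type": "figure", "id": "fig01"},
        ],
        # Illustrative only: the proposal does not say what "short" contains.
        "short": [
            {"type": "paragraph", "id": "par01"},
        ],
    },
    "content": {
        "paragraphsById": {"par01": "blah blah blah"},
        "figuresById": {"fig01": "base64_image_data"},
    },
}

# Assumed mapping from an entry's type to the content table that holds it.
TABLES = {"paragraph": "paragraphsById", "figure": "figuresById"}


def resolve(paper: dict, narrative: str) -> list[dict]:
    """Replace each ID reference in a narrative with the content it points to."""
    resolved = []
    for entry in paper["narratives"][narrative]:
        if "id" in entry:
            table = paper["content"][TABLES[entry["type"]]]
            resolved.append({"type": entry["type"], "content": table[entry["id"]]})
        else:
            # Inline entries such as headings already carry their content.
            resolved.append(dict(entry))
    return resolved
```

Calling `resolve(paper, "short")` and `resolve(paper, "full")` yields two different views built from the same underlying paragraphs and figures, which is the point of keeping content and narrative separate.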

@thesamovar
Owner

An issue that might come up is that there is a LOT of detail to get right if you want to handle all possible papers. Might be better to use an existing standard like JATS (XML-based) and build on top of that?

@rorybyrne
Collaborator Author

rorybyrne commented Apr 19, 2021

Oh nice, JATS sounds like it's exactly what we need. I'm not too familiar with existing publishing formats so thanks for pointing it out.

I think it sounds reasonable to use JATS as the target format for LaTeX/PDF conversion, and make the app take that as input. Modern (e.g. React) web apps keep their internal state as JSON-like objects, so we'll have to convert it to some sort of JSON representation inside the app.

As long as there's a clean interface that takes JATS as input, I don't think we should necessarily use fully-compliant JATS for the app's internal state. We can try to stick to it whenever possible, but it's more important to design state that suits the functionality requirements and keep it as lean as possible. Adding features and maintaining the code will be a nightmare otherwise (React code gets especially confusing if you're not careful).
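One way such an interface could look is sketched below: a single function that reads a small subset of JATS and returns lean state. The `load_jats` name, the sample document, and the shape of the returned state are assumptions for illustration only; the element names (`article-title`, `sec`, `title`, `p`) are standard JATS.

```python
import xml.etree.ElementTree as ET

JATS_SAMPLE = """\
<article>
  <front>
    <article-meta>
      <title-group><article-title>Some Title</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </sec>
  </body>
</article>
"""


def load_jats(xml_text: str) -> dict:
    """Hypothetical adapter: read a small JATS subset into app state."""
    root = ET.fromstring(xml_text)
    state = {
        "title": root.findtext(
            "front/article-meta/title-group/article-title", default=""
        ),
        "narratives": {"full": []},
        "content": {"paragraphsById": {}},
    }
    full = state["narratives"]["full"]
    paragraphs = state["content"]["paragraphsById"]
    for sec in root.iterfind("body/sec"):
        for child in sec:
            text = "".join(child.itertext()).strip()
            if child.tag == "title":
                full.append({"type": "heading", "content": text})
            elif child.tag == "p":
                par_id = f"par{len(paragraphs) + 1:02d}"
                paragraphs[par_id] = text
                full.append({"type": "paragraph", "id": par_id})
            # Unsupported elements are skipped for now.
    return state
```

Everything JATS-specific stays inside the adapter, so the internal state can remain small and grow only as new elements are handled.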

I personally think it's fine if we don't support all the details of papers off the bat. There's a saying in the startup world that you shouldn't try to "boil the ocean". We can expand the internal state of the app as we add new functionality, and at some point it will probably reach parity with JATS.

What do you think?

@thesamovar
Owner

I would assume it's very straightforward to switch between an XML and a JSON representation, so it hardly matters. The main thing would be to use JATS to make sure we haven't missed something that will come back and bite us later.
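For simple element trees the XML-to-JSON direction can be sketched in a few lines with the standard library. The `element_to_dict` helper and the `tag`/`attrs`/`text`/`children` keys are an assumed convention, not an established mapping.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem: ET.Element) -> dict:
    """Convert an element to a JSON-friendly dict, keeping child order."""
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


xml_text = '<sec id="s1"><title>Introduction</title><p>Hello</p></sec>'
as_json = json.dumps(element_to_dict(ET.fromstring(xml_text)), indent=2)
```

This sketch drops any text that follows a child element (ElementTree's `tail`), so mixed content such as a paragraph with inline emphasis would need more care than this conversion gives it.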

And yes absolutely, there's no way we'll support all features immediately.

@rorybyrne
Collaborator Author

Yeah I assume XML -> JSON is a solved problem, and looking at the Pandoc docs it seems JSON is a supported --to option. Haven't had a chance to test it yet.

> use JATS to make sure we haven't missed something that will come back and bite us later

I suppose a good rule of thumb would be "if it's not in JATS, it shouldn't be in our internal representation".

@thesamovar
Owner

Well, except that we potentially want to add fine-grained metadata that won't be in JATS, but that's fine. Stuff on top can be added later. What we don't want to do is miss something fundamental that can't easily be added later on.

@rorybyrne
Collaborator Author

Okay, noted. My React version isn't far off parity with your implementation, so once it's ready you can sanity check the data model I'm using and see if there are any major problems with it.


2 participants