LaTeX conversion to intermediary format instead of HTML #2
Comments
Another option written in Python: plastex |
An intermediate format makes sense. That way, we could experiment with different tools. |
I think an abstract representation of a piece of research would be really valuable. It would need to be portable (i.e. dumpable to a file that can be parsed by many languages) and friendlier than XML (imo). JSON sounds like a good candidate to me, especially if rendering it as a webpage is the first use case. |
One idea I'm playing with is to separate the "content" of a paper from the "narrative" of a paper. In the example below, figures, tables, and paragraphs are indexed by ID under the
|
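To make the content/narrative split concrete, here is a minimal sketch in Python. The key names (`content`, `narrative`), the IDs, and the `resolve` helper are all hypothetical and only illustrate the idea; they are not the schema from the comment above, whose example did not survive.

```python
import json

# Hypothetical schema: "content" stores every item once, keyed by ID;
# "narrative" is just an ordered list of IDs describing how the paper reads.
paper = {
    "content": {
        "p1": {"type": "paragraph", "text": "We study X."},
        "fig1": {"type": "figure", "caption": "Overview of X."},
        "tab1": {"type": "table", "caption": "Results.", "rows": [["a", 1], ["b", 2]]},
    },
    "narrative": ["p1", "fig1", "tab1"],
}


def resolve(paper):
    """Return content items in narrative order (raises KeyError on a dangling ID)."""
    return [paper["content"][item_id] for item_id in paper["narrative"]]


# Round-trip through JSON to check that the structure is portable as-is.
restored = json.loads(json.dumps(paper))
```

Because the narrative only holds IDs, the same figure or table could be referenced from several places, or reordered, without duplicating its data.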
An issue that might come up is that there is a LOT of detail to get right if you want to handle all possible papers. Might be better to use an existing standard like JATS (XML-based) and build on top of that? |
Oh nice, JATS sounds like it's exactly what we need. I'm not too familiar with existing publishing formats, so thanks for pointing it out. I think it sounds reasonable to use JATS as the target format for LaTeX/PDF conversion and make the app take that as input. Modern (i.e. React) webapps maintain internal state in JSON format, so we'll have to convert it to some sort of JSON representation inside the app. As long as there's a clean interface that takes JATS as input, I don't think we necessarily need fully compliant JATS for the app's internal state. We can try to stick to it whenever possible, but it's more important to design state that suits the functionality requirements and to keep it as lean as possible. Adding features and maintaining the code will be a nightmare otherwise (React code gets especially confusing if you're not careful). I personally think it's fine if we don't support all the details of papers off the bat. There's a saying in the startup world that you shouldn't try to "boil the ocean". We can expand the internal state of the app as we add new functionality, and at some point it will probably reach parity with JATS. What do you think? |
I would assume it's very straightforward to switch between an XML and JSON representation? So it hardly matters much. The main thing would be to use JATS to make sure we haven't missed something that will come back and bite us later. And yes absolutely, there's no way we'll support all features immediately. |
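As a rough illustration of the XML-to-JSON step, the sketch below walks a small JATS-like fragment with Python's standard `xml.etree.ElementTree` and emits nested dicts. The fragment is made up for this example, and the output shape (`tag`, `attrs`, `text`, `children`) is an assumption, not a standard mapping. Note that this naive version drops "tail" text, i.e. text that follows an inline element inside a paragraph, so mixed content would need extra handling.

```python
import json
import xml.etree.ElementTree as ET

# Made-up fragment using JATS element names (article, body, sec, title, p).
JATS_SNIPPET = """
<article>
  <body>
    <sec id="s1">
      <title>Introduction</title>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </sec>
  </body>
</article>
"""


def element_to_dict(elem):
    """Recursively convert an element into a JSON-serialisable dict.

    Assumption: tail text (mixed content) is ignored to keep the sketch short.
    """
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attrs"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


root = ET.fromstring(JATS_SNIPPET)
as_json = json.dumps(element_to_dict(root), indent=2)
```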
Yeah I assume XML -> JSON is a solved problem, and looking at the Pandoc docs it seems JSON is a supported
I suppose a good rule of thumb would be "if it's not in JATS, it shouldn't be in our internal representation". |
Well, except that we potentially want to add fine-grained metadata that won't be in JATS, but that's fine. Stuff on top can be added later. What we don't want to do is miss something fundamental that can't easily be added later on. |
Okay, noted. My React version isn't far off parity with your implementation, so once it's ready you can sanity check the data model I'm using and see if there are any major problems with it. |
It might be worthwhile to convert LaTeX to some useful intermediate representation, instead of directly to HTML. That would allow this tool to behave like a compiler, with a front-end and a back-end.
There is a tool called LaTeXML which would allow conversion to XML as an intermediate representation.
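The compiler analogy can be sketched as one front-end that produces an intermediate representation (IR) and several back-ends that consume it. The code below is a toy under stated assumptions: `parse_latex` only understands `\section{...}` and blank-line-separated paragraphs, and the IR node shape is hypothetical. It does not use LaTeXML or plastex; a real front-end would delegate to one of those tools.

```python
import html
import json
import re


def parse_latex(source):
    """Toy front-end: map \\section{...} lines and plain paragraphs to IR nodes."""
    nodes = []
    for block in re.split(r"\n\s*\n", source.strip()):
        block = block.strip()
        if not block:
            continue
        match = re.fullmatch(r"\\section\{(.+)\}", block)
        if match:
            nodes.append({"type": "section", "title": match.group(1)})
        else:
            # Collapse line breaks inside a paragraph into single spaces.
            nodes.append({"type": "paragraph", "text": " ".join(block.split())})
    return nodes


def to_html(nodes):
    """Back-end 1: render the IR as HTML."""
    parts = []
    for node in nodes:
        if node["type"] == "section":
            parts.append(f"<h2>{html.escape(node['title'])}</h2>")
        else:
            parts.append(f"<p>{html.escape(node['text'])}</p>")
    return "\n".join(parts)


def to_json(nodes):
    """Back-end 2: dump the IR as JSON for other tools to consume."""
    return json.dumps(nodes, indent=2)


SOURCE = r"""
\section{Introduction}

LaTeX is converted to an
intermediate representation first.
"""
ir = parse_latex(SOURCE)
```

Adding another output format then means writing one more function over `ir`, without touching the parser.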