Mammoth is a Python application that converts .docx documents into semantic HTML. This Google Colab implementation uses Mammoth and adds a responsive header and footer to each exported HTML file. The process requires some basic knowledge of Word (or a similar open-source alternative such as OnlyOffice) and HTML/CSS, plus a free Google Drive account.
The structures of .docx and HTML are quite different, so only particular Word paragraph styles can be converted by default. The demo file mammoth-demo.docx included in this repository uses some core Word styles and a number of custom styles for block quotes, captions and so forth. Refer to the full Mammoth documentation for details on how Mammoth can be further customised.
The following default Word styles are automatically converted into HTML equivalents: Headings 1 to 6, Paragraphs, Line breaks, Lists, Hyperlinks, Tables, Footnotes and Text styles including Bold, Italic, Underline, Strikethrough, Superscript, Subscript.
Blockquotes require that the custom Word paragraph style Quote is applied to text. This is then converted into the following HTML: <blockquote class="quote">Text Here</blockquote>
Captions require that the custom Word paragraph style imageCaption is applied to text. This is then converted into the following HTML: <figcaption class="imageCaption">Text Here</figcaption>
Bibliographic elements require the custom Word paragraph style bibloReference to be applied to the appropriate footer text. This is converted into the following HTML: <p class="bibloReference">Reference Text</p>, which creates an indented effect in the final document. Note that the reference links themselves can convert unreliably within Mammoth, so they may need to be checked and added manually.
The copyright notice requires that the custom Word paragraph style copyrightMeta is applied to the appropriate text. JavaScript within the HTML file footer.htm automatically moves this text to the base of the exported HTML file. The converted HTML appears in the following format: <div class="copyrightMetaFooter"><p class="copyrightMeta">Text Here</p></div>
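The four custom styles above could be expressed as a Mammoth style map along the following lines. This is a minimal sketch rather than the notebook's actual code: the mappings are inferred from the HTML shown above, the names STYLE_MAP and convert_docx are hypothetical, and the mammoth package is assumed to be installed (pip install mammoth).

```python
# Hypothetical style map inferred from the HTML shown above.
# The mappings actually used in the Colab notebook may differ.
STYLE_MAP = "\n".join([
    "p[style-name='Quote'] => blockquote.quote:fresh",
    "p[style-name='imageCaption'] => figcaption.imageCaption:fresh",
    "p[style-name='bibloReference'] => p.bibloReference:fresh",
    "p[style-name='copyrightMeta'] => div.copyrightMetaFooter > p.copyrightMeta:fresh",
])


def convert_docx(docx_path, style_map=STYLE_MAP):
    """Convert a .docx file to an HTML fragment using Mammoth."""
    import mammoth  # third-party package, imported here so the map can be reused without it

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, style_map=style_map)

    # Mammoth reports unmapped styles and other warnings as messages.
    for message in result.messages:
        print(message)

    return result.value  # the generated HTML, without header or footer
```

Calling convert_docx("mammoth-demo.docx") would return the HTML body only; any paragraph style without a mapping falls back to Mammoth's defaults and is listed in the printed messages.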
Mammoth supports basic tables and merged table cells along both the horizontal and vertical axes. Table headings should use the standard Word heading styles. Table captions should use the imageCaption style described above.
This repository contains two files, header.htm and footer.htm, that can be adapted to alter the presentation of the output HTML files. The header file header.htm contains inline CSS that controls the layout, fonts, colors and text sizes of all elements, alongside some custom styles that can be applied manually to the exported files to improve the layouts.
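As a rough illustration of how the two files relate to the converted document, the sketch below places the Mammoth output between the contents of header.htm and footer.htm. It assumes that header.htm opens the page (including the inline CSS) and that footer.htm closes it; build_page is a hypothetical helper and not necessarily how the notebook assembles its output.

```python
from pathlib import Path


def build_page(body_html, header_path="header.htm", footer_path="footer.htm"):
    """Return a full HTML page: header file + converted body + footer file.

    Assumes header.htm opens the document and footer.htm closes it.
    """
    header = Path(header_path).read_text(encoding="utf-8")
    footer = Path(footer_path).read_text(encoding="utf-8")
    return header + "\n" + body_html + "\n" + footer
```

Editing the CSS in header.htm and re-running this step would restyle every exported page without touching the converted body.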
The output HTML uses free, open-source Google Fonts.
A wrapper tag is automatically added around tables, which facilitates horizontal scrolling on small devices: <div class="tableWrap"><table>Table Content</table></div>
The CSS also includes styles for table headers and striped rows. These must be added manually to the output HTML files:
- To add a table header, wrap the header row in the following HTML: <thead><tr>Table header content here</tr></thead>
- To add a striped row effect, wrap the table body in the following HTML: <tbody class="zebraTable"><tr>Table body content here</tr></tbody>
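For files with many tables, the two manual edits above could be scripted. The following is a minimal sketch under several assumptions: the table fragment contains only plain rows (no existing thead or tbody, no nested tables), the first row is the header, and the helper name add_header_and_stripes is hypothetical. A regular expression is used for brevity; a proper HTML parser would be more robust.

```python
import re

ROW_PATTERN = re.compile(r"<tr\b.*?</tr>", re.DOTALL)
TABLE_OPEN_PATTERN = re.compile(r"\s*<table\b[^>]*>")


def add_header_and_stripes(table_html):
    """Wrap the first row in <thead> and the rest in <tbody class="zebraTable">.

    Returns the input unchanged if it already has a thead/tbody, has fewer
    than two rows, or does not start with a <table> tag.
    """
    if "<thead" in table_html or "<tbody" in table_html:
        return table_html

    rows = ROW_PATTERN.findall(table_html)
    opening = TABLE_OPEN_PATTERN.match(table_html)
    if opening is None or len(rows) < 2:
        return table_html

    head = "<thead>" + rows[0] + "</thead>"
    body = '<tbody class="zebraTable">' + "".join(rows[1:]) + "</tbody>"
    return opening.group(0).strip() + head + body + "</table>"
```

The function expects a single table fragment, so it would be applied to each table separately and the result checked by eye before publishing.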
Images are embedded directly into the HTML document. Mammoth can also save them externally; refer to the full documentation for details.
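A sketch of the external-image option, assuming the Python version of Mammoth and its convert_image argument. The helper make_image_saver, the file naming scheme and the output folder are all hypothetical; the returned src is relative, so the HTML file is assumed to be saved in the same folder as the images.

```python
import os


def make_image_saver(output_dir, prefix="image"):
    """Return a function that writes each image to output_dir and
    returns the attributes for its <img> tag."""
    os.makedirs(output_dir, exist_ok=True)
    counter = 0

    def save_image(image):
        nonlocal counter
        counter += 1
        # content_type looks like "image/png"; use the part after the slash.
        extension = image.content_type.partition("/")[2] or "bin"
        filename = f"{prefix}-{counter}.{extension}"
        with image.open() as source:
            data = source.read()
        with open(os.path.join(output_dir, filename), "wb") as target:
            target.write(data)
        return {"src": filename}

    return save_image


# Usage with Mammoth (requires the third-party mammoth package):
#   import mammoth
#   with open("mammoth-demo.docx", "rb") as docx_file:
#       result = mammoth.convert_to_html(
#           docx_file,
#           convert_image=mammoth.images.img_element(make_image_saver("output")),
#       )
```

Content types such as image/svg+xml would produce an awkward extension with this simple rule, so a lookup table may be preferable for documents with mixed image formats.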
Footnotes automatically appear at the base of the HTML file.
Ordered lists that need a special numbering format require one of the following CSS classes to be added manually to the list's <ol> tag:
1. 2. 3. 4. is the default behavior of <ol><li>List Here</li></ol>
i. ii. iii. = <ol class="listLowerRoman"><li>List Here</li></ol>
I. II. III. = <ol class="listUpperRoman"><li>List Here</li></ol>
a. b. c. = <ol class="listLowerLatin"><li>List Here</li></ol>
A. B. C. = <ol class="listUpperLatin"><li>List Here</li></ol>
No bullets = <ol class="listNone"><li>List Here</li></ol>
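If the same edit has to be repeated across several files, the class could be added with a small script. The sketch below is only an illustration: the keys of LIST_CLASSES and the helper set_list_class are hypothetical, lists are counted in document order (nested lists included), and only bare <ol> tags without attributes are considered.

```python
import re

# Hypothetical lookup from a numbering label to the CSS class in header.htm.
LIST_CLASSES = {
    "lower-roman": "listLowerRoman",
    "upper-roman": "listUpperRoman",
    "lower-latin": "listLowerLatin",
    "upper-latin": "listUpperLatin",
    "none": "listNone",
}


def set_list_class(html, list_index, numbering):
    """Add a numbering class to the list_index-th (0-based) bare <ol> tag."""
    class_name = LIST_CLASSES[numbering]
    seen = -1

    def replace(match):
        nonlocal seen
        seen += 1
        if seen == list_index:
            return f'<ol class="{class_name}">'
        return match.group(0)

    return re.sub(r"<ol>", replace, html)
```

For example, set_list_class(html, 0, "lower-roman") would switch the first ordered list in the file to i. ii. iii. numbering and leave the others untouched.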
Open the link below or copy the file Mammoth.word-docx-to-html-github.ipynb from the repository to your Google Drive and open it in Google Colab. Follow the notebook instructions to complete the conversion process.
Follow this process to map new custom styles:
- In the .docx file, create a custom paragraph style, name it with a unique customName and apply the style to text.
- Open the Colab file and select the code immediately below 4. Convert .docx to .html. Within the editor, new styles can be declared below the comment # Map custom styles here. Refer to the Mammoth documentation for full details.
- Add CSS styles for the new element in the header.htm file.
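The mapping step in this process amounts to adding one line in Mammoth's style map syntax. The helper below is a hypothetical sketch (add_style_mapping is not part of Mammoth or of the notebook) that builds such a line for a paragraph style and appends it to an existing style map string. It assumes the style name contains no apostrophes and is also usable as a CSS class name.

```python
def add_style_mapping(style_map, style_name, element="p", css_class=None):
    """Append a paragraph-style mapping to a Mammoth style map string.

    By default the Word style name is reused as the CSS class, matching
    the convention of the existing custom styles.
    """
    css_class = css_class or style_name
    new_line = f"p[style-name='{style_name}'] => {element}.{css_class}:fresh"
    if not style_map.strip():
        return new_line
    return style_map.rstrip("\n") + "\n" + new_line
```

With customName as the Word style, add_style_mapping("", "customName") yields a mapping to <p class="customName">, which then needs a matching .customName rule in the CSS of header.htm.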