Skip to content
/ catdoc Public
forked from petewarden/catdoc

Command-line utility for converting Microsoft Word documents to text

License

Notifications You must be signed in to change notification settings

ggeenn/catdoc

 
 

Repository files navigation

Forked from petewarden/catdoc,
AND :
1. CMakeLists.txt was added.
2. It is buildable on windows and mac.
3. It is a library - there is no main() anymore.
4. Charsets were hardcoded in charsets_bin.c, you haven't read files from charsets/ folder.
5. A lot of global variables, to use in multithreading environment please use mutex.
6. To build, define callbacks in your code (see usage bellow)

USAGE :

#define _CRT_SECURE_NO_WARNINGS
extern "C" { #include <catdoc.h> }
std::mutex gCatdocMtx;
std::string gCatdocContent;
extern "C" void output_paragraph(unsigned short int* buffer)// MS Doc handler
{
    std::string utf8 = boost::locale::conv::utf_to_utf<char, wchar_t>(
        reinterpret_cast<const wchar_t*>(buffer));
    gCatdocContent += utf8;
}
extern "C" void print_value(unsigned char* value)// MS Xls & Ppt handler
{
    gCatdocContent += reinterpret_cast<const char*>(value);
}
extern "C" void raise_error(const char* reason)
{
    throw std::runtime_error(reason);
}
std::string getTextContent(const fs::path& path)
{
    FILE* file = std::fopen(path.string().c_str(), "rb");
    // check file handle and guard it somehow you like to call std::fclose(file)

    std::scoped_lock<std::mutex> lck(gCatdocMtx);
    gCatdocContent.clear();

    set_std_func();
    if (!source_charset)
        source_charset = read_charset(source_csname);

    int res = analyze_format(file, (char*)path.string().c_str());
    // check error here

    return gCatdocContent;
}

Original description is bellow :

CATDOC version 0.93

CATDOC is program which reads MS-Word file and prints readable 
ASCII text to stdout, just like Unix cat command.
It also able to produce correct escape sequences if some UNICODE
charachers have to be represented specially in your typesetting system
such as (La)TeX.

This is completely new version of catdoc, rewritten from scratch. 
It features runtime configuration, proper charset handling,
user-definable output formats and support
for Word97 files, which contain UNICODE internally.

Since 0.93.0 catdoc parses OLE structure and extracts WordDocment
stream, but doesn't parse internal structure of it.

This rough approach inevitable results in some garbage in output file,
especially near the end of file and if file contains embedded OLE objects,
such as pictures or equations.

So, if you are looking for purely authomatic way to convert Word to LaTeX,
you can better investigate word2x, wvware or LAOLA.


Catdoc is distributed under GNU Public License version 2 or above.


Your bug reports and suggestions are welcome.

There is also major work to do - define correct TeX commands
for accented latin letters into tex.specchars file and commands
for mathematical symbols (unicode 20xx-25xx). 


Contributions are welcome.

See files INSTALL and INSTALL.dos for information about  compiling and
installing catdoc.

Catdoc is documented in its UNIX-style manual page. For those who don't
have man command (i.e. MS-DOS users) plain text and postscript versions
of manual are provided in doc directory
                    Victor Wagner <[email protected]>


About

Command-line utility for converting Microsoft Word documents to text

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C 90.4%
  • Shell 4.5%
  • Tcl 3.2%
  • Makefile 1.6%
  • Other 0.3%