Saturday, September 20, 2014

LaTeX to Word conversion


Translating LaTeX documents into Word files is an ordeal we would gladly be spared. However, few publishers accept LaTeX documents.

One may simply take the compiled pdf and reconstruct a Word file manually through copy-and-paste steps, but this is error-prone and time-consuming.
I tried converting the .tex document to HTML, hoping to then import the HTML into OpenOffice. For this purpose, I tried LaTeX2HTML and TeX4ht. In both cases, I found that the use of XeTeX, and of unusual Unicode ranges as in my case, made things more complicated. Scared by the lack of support, I turned to the ifxetex package, so that the same document would compile with both xelatex and pdflatex.
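A preamble of this kind might look as follows. This is only a minimal sketch: the article class, the fontspec package in the XeTeX branch, and the fontenc/inputenc options in the other branch are assumptions for illustration, not the actual preamble used here.

```latex
\documentclass{article}      % placeholder class, chosen only for the sketch

\usepackage{ifxetex}         % provides the \ifxetex conditional

\ifxetex
  % Compiling with xelatex: native Unicode input, system fonts via fontspec
  \usepackage{fontspec}
\else
  % Compiling with latex or pdflatex (the engines behind htlatex and oolatex)
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi

\begin{document}
Text with Latin Extended characters goes here.
\end{document}
```

Everything engine-specific stays inside the two branches, so the body of the document does not need to change between the PDF build and the conversion run.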

TeX4ht is advertised as the best candidate. It has an htxelatex command, which implies a first run through XeLaTeX (and BibTeX). Here I got several errors, apparently related to the substitution of Latin Extended characters (e.g., Command `\acute' already defined in `'.), and no usable HTML output.

The same errors occurred with the ooxelatex option. The resulting .odt file did not have any text, just empty pages with the main heading and numbers generated by counters such as footnotes.

TeX4ht also has the better-supported htlatex command. With this, after a pdflatex and bibtex (actually biber) run, I could get complete HTML output, but the Latin Extended characters were not reproduced; they were replaced by .png images of the glyphs. This meant that I could not import clean Unicode text into OpenOffice.

I also tried LaTeX2HTML, which yielded very similar results, i.e., Latin Extended characters with diacritics rendered as images and thus useless in a Word file.

I also tried to use a Perl script by Radhakrishnan, who actively maintains the TeX4ht package, but could not get it to work.

I also had problems with the footnotes in these HTML outputs. The bibliography was there, but its formatting was no longer reliable.

I also tried another package, latex2rtf, which produced a complete .rtf with Unicode characters and footnotes, but without the list of references at the end.

Back to TeX4ht, I tried to go directly from .tex to .odt, via the mk4ht command with the oolatex option. Here, after initial struggles with some packages and personal macros that I normally use (notably, hyperref and the memoir class) and that I had to disable, I partially succeeded. Once the document was cleansed of conflicting macros, in fact, mk4ht oolatex produced a legible .odt.
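One way to disable such packages without editing the preamble by hand each time is a manual switch. The sketch below is an assumption, not what I actually did: the switch name \ifforconversion is hypothetical, and the book class is only a stand-in for whatever plain class replaces memoir during the conversion run.

```latex
% Hypothetical manual switch: true for the .odt run, false for the PDF build.
\newif\ifforconversion
\forconversiontrue

\ifforconversion
  \documentclass{book}       % stand-in plain class for the conversion run
\else
  \documentclass{memoir}     % normal class for the PDF build
\fi

\ifforconversion
  % hyperref is left out of the conversion run
\else
  \usepackage{hyperref}
\fi
```

Any memoir-specific commands and personal macros used in the document body would need the same kind of guard, or plain replacements, for the conversion run to go through.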

One must first run (1) "latex filename" (since at this point I had ifxetex ... else in the preamble), then (2) "biber filename", and then (3) "mk4ht oolatex filename".

With (1) I got a single but major error about the incompatibility of ucs with biblatex. This seems to be an unsolved problem. Replacing \usepackage[utf8x]{inputenc} with \usepackage[utf8]{inputenc}, as advised in the documentation, causes problems with the Latin Extended range of Unicode.
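A possible workaround, which I have not tested on this document, is to keep the utf8 option and declare the missing characters by hand with inputenc's \DeclareUnicodeCharacter. The two characters below are only examples, and the lines belong in the non-XeTeX branch of the preamble.

```latex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}   % instead of [utf8x], which loads the ucs package

% Declare each character that inputenc reports as not set up.
% The two code points below are example characters, not a complete list.
\DeclareUnicodeCharacter{1E0D}{\d{d}}   % U+1E0D, d with dot below
\DeclareUnicodeCharacter{1E6D}{\d{t}}   % U+1E6D, t with dot below
```

Each declaration maps one code point to an ordinary LaTeX accent command, so the list grows with the number of distinct unusual characters in the text.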

With (2) no errors were produced. Notably, the bibtex backend in the biblatex options did not work, so I am now using the biber backend. The .bib database, which had worked fine with the bibtex backend, was breaking the process, so I had to clean it of several stray characters that had crept into it through the years, mostly in hastily written longer notes and abstracts.
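For reference, selecting the biber backend is a matter of one package option. In this sketch the file name references.bib is a placeholder, not the name of my database.

```latex
\usepackage[backend=biber]{biblatex}   % was backend=bibtex
\addbibresource{references.bib}        % placeholder file name
```

With this option, step (2) must be run with biber rather than bibtex.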

With (3), which performs three runs (perhaps (1) and (2) can be skipped? But I did not get any list of references at the end without them), there were several errors (about 30 per run). These were all related to the list of references at the end, so they might come from problems in the .bib file.

The result is complete, with proper formatting (italics, etc.), proper document structure (sections with a hierarchy of heading styles, indented quotations), a table of contents, and synchronized footnotes.

The final list of references was also properly formatted, except that its first part is a list of primary references that I order by shorthands through the biblatex command \printshorthands. This shorthands list was lost altogether in the .odt document, something that remains to be solved. I took the shortcut of having a single bibliography with both primary and secondary sources, and no list of abbreviations.
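In biblatex terms, the two layouts differ only in what is printed at the end of the document. The following is a sketch without the headings and filters of my actual document.

```latex
% Layout for the PDF: shorthand list first, then the bibliography.
% Entries appear in the first list when their .bib record has a shorthand field.
\printshorthands
\printbibliography

% Shortcut taken for the .odt conversion: drop \printshorthands above and
% keep a single \printbibliography for primary and secondary sources alike.
```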


All these experiments were done on Ubuntu 14.04.