Słomkowski's technical musings

Playing with software, hardware and touching the sky with a paraglider.

How to convert MS Word for DOS documents to modern format


From an ancient MiniScribe 2012 MFM hard drive, I recovered over 200 MS Word for DOS 4.0 documents. Converting them manually to a modern format was not feasible, so I began searching for alternatives.

More recent versions of Microsoft Word (designed for Windows 32-bit, that is) include the so-called WinWord Converter API, which is available for MS Word 97 to Word 2013 (although I’m not sure if it is available for other versions). The converter library installs itself as a DLL, misleadingly named with the .cnv file extension, along with an associated registry entry. It enables MS Word to load and save files in previously unknown formats. The API is open, and Microsoft publishes the SDK with the manual (known as Application Note GC1039) along with code samples.

Fortunately, a converter library that supports Word for DOS exists and is available for download here as well as in other places on the web.

These libraries are made for MS Word, which I don’t have access to. Besides, converting over 200 files one by one in Word would be cumbersome. However, there is a tool, WinCvt, that loads the plugin and performs batch conversion to the standardized and well-supported Rich Text Format (RTF). Unfortunately, WinCvt kept failing for me during the RTF conversion.

That led me to write the wrapper program myself. In fact, I had started before stumbling upon WinCvt: I examined the .cnv file, searched for the function names it exports, and that is how I found GC1039.

Examining the Doswrd32.cnv file

The specific converter that supports Word for DOS is named Doswrd32.cnv. When analyzed using the Linux file command, it is identified as a regular DLL:

Doswrd32.cnv: PE32 executable (DLL) (GUI) Intel 80386, for MS Windows

Now, let’s display the list of functions available in this DLL:

winedump -j export Doswrd32.cnv

Executing this command yields a lot of text. The section of particular interest is the list of exports:

  Entry Pt  Ordn  Name
  00001020     1 DllMain
  000062F0     2 ForeignToRtf32
  00005F40     3 GetReadNames
  00005F80     4 GetWriteNames
  00001000     5 InitConverter32
  000061D0     6 IsFormatCorrect32
  00006130     7 RegisterApp
  000064B0     8 RtfToForeign32
  00007C70     9 my_PasswordDlg

Once the function names have been extracted, the next step is to search for distinctive ones, like ForeignToRtf32, on Google and GitHub. GitHub is especially significant as someone may have already solved the problem and shared helpful code snippets.

Googling these names reveals a Microsoft document that describes the Converter API and even provides code samples. Note that the document is a DOC file packaged in an EXE. To make it more accessible, I provide it in PDF format. On GitHub, I found code samples for a Wordpad-like example application and the WinCvt source code.

Upon reading the documentation, it becomes evident that the API is straightforward and high-level. Only three functions are needed to perform the conversion: InitConverter32, IsFormatCorrect32 and ForeignToRtf32.
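As a rough sketch of that call order, the snippet below models the three entry points as plain callables. ConverterApi and convertOne are hypothetical names, and the std::string parameters stand in for the HANDLE arguments of the real functions. The success codes (nonzero, 1 and 0 respectively) follow the checks used in the wrapper code later in this post.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical, simplified view of the three converter entry points.
struct ConverterApi {
    std::function<long()> initConverter;                        // nonzero on success
    std::function<short(const std::string &)> isFormatCorrect;  // 1 when the file is recognized
    std::function<short(const std::string &)> foreignToRtf;     // 0 on success
};

// Runs the three steps in order and stops at the first failure.
void convertOne(const ConverterApi &api, const std::string &inputPath) {
    if (api.initConverter() == 0) {
        throw std::runtime_error("InitConverter32() failed");
    }
    const short recognized = api.isFormatCorrect(inputPath);
    if (recognized != 1) {
        throw std::runtime_error("IsFormatCorrect32 failed, return code " +
                                 std::to_string(recognized));
    }
    const short converted = api.foreignToRtf(inputPath);
    if (converted != 0) {
        throw std::runtime_error("ForeignToRtf32 failed, return code " +
                                 std::to_string(converted));
    }
}
```

Keeping the steps behind callables like this also makes it possible to exercise the error handling without the DLL being present.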

Writing the converter wrapper application

For this investigation, I set up a CLion CMake project targeting Windows.

I extracted the header convapi.h from the SDK, which provided me with the declarations of the relevant functions. Now, let’s define the types for function pointers that we need to load from the Converter DLL:

#include <windows.h>

typedef short FCE;  // converter return code
// Callback invoked by the converter: (number of bytes in the buffer, percent done).
extern "C" typedef long (PASCAL *lib_PFN_RTF)(long, long);
extern "C" typedef long (PASCAL *lib_InitConverter32)(HANDLE hWnd, char *szModule);
extern "C" typedef FCE  (PASCAL *lib_IsFormatCorrect32)(HANDLE ghszFile, HANDLE ghszClass);
extern "C" typedef FCE  (PASCAL *lib_ForeignToRtf32)(HANDLE ghszFile, void *pstgForeign, HANDLE ghBuff,
                                                     HANDLE ghszClass, HANDLE ghszSubset, lib_PFN_RTF lpfnOut);

The PASCAL macro is used in convapi.h; it is an alias for __stdcall. If you are unfamiliar with what that means, I recommend reading the Wikipedia article about calling conventions.

We can proceed to load the DLL using the WinAPI LoadLibrary function:

const char *libraryName = "Doswrd32.cnv";
// Use the ANSI variant explicitly, since the name is a narrow string.
HMODULE hLib = LoadLibraryA(libraryName);
if (hLib == nullptr) {
    throw runtime_error(string("Error: load library ") + libraryName);
}

The code above loads the DLL at runtime. Now, let’s create function pointers for the library functions:

auto GetProcAddressAndCheck = [&](LPCSTR procName) {
    auto farProc = GetProcAddress(hLib, procName);
    if (farProc == nullptr) {
        throw runtime_error(string("Error: cannot find procedure ") + procName);
    }
    return farProc;
};
auto fInitConverter32 = (lib_InitConverter32) GetProcAddressAndCheck("InitConverter32");
auto fIsFormatCorrect32 = (lib_IsFormatCorrect32) GetProcAddressAndCheck("IsFormatCorrect32");
auto fForeignToRtf32 = (lib_ForeignToRtf32) GetProcAddressAndCheck("ForeignToRtf32");

Now we are ready to use them like any other function. First, initialize the converter library as instructed in the documentation. Since we don’t have an application window, we will pass nullptr as the Window HANDLE:

if (fInitConverter32(nullptr, nullptr) == 0) {
    throw runtime_error("InitConverter32() failed");
}

It’s good to perform a sanity check on the document we are going to convert, so we call IsFormatCorrect32. It is important to note that the function arguments are of type HANDLE. This indicates that the function does not accept regular pointers but rather WinAPI-allocated memory blocks, such as those created by the HeapAlloc function. The allocated memory is freed using the HeapFree function.

Let’s allocate memory for the input file path and fill it with the path to the test document:

auto inputFilePathHandle = HeapAlloc(GetProcessHeap(), HEAP_GENERATE_EXCEPTIONS, _MAX_PATH + 1);
lstrcpyA((char *) inputFilePathHandle, "Z:\\test.txt");
auto ret = fIsFormatCorrect32(inputFilePathHandle, nullptr);
if (ret != 1) {
    throw runtime_error(string("IsFormatCorrect32 failed, return code ") + to_string(ret));
}

If this check passes, the converter recognizes the file and should be able to convert it.

The ForeignToRtf32 function performs the conversion. You need to provide the buffer handle and a callback function. The callback is called after each chunk of the file is processed. I simply write the content of the buffer to a file:

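char *buffer;            // filled by the converter before each callback
ofstream rtfFileHandle;  // output RTF file

// Called by the converter after each chunk; cchBuff is the number of valid bytes in buffer.
long PASCAL callback(long cchBuff, long nPercent) {
    rtfFileHandle.write(buffer, cchBuff);
    return 0;
}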

We allocate the buffer and call the function. For good measure, I allocated 10 KB. There is a way to determine the size of the allocated buffer from its HANDLE, so ForeignToRtf32 won’t overflow it.

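rtfFileHandle.open("Z:\\test.rtf", ios::binary | ios::trunc | ios::out);

buffer = (char *) HeapAlloc(GetProcessHeap(), HEAP_GENERATE_EXCEPTIONS, 10 * 1024);
// Named differently from the earlier `ret` so both snippets can share one scope.
auto convertRet = fForeignToRtf32(inputFilePathHandle, nullptr, buffer, nullptr, nullptr, callback);

// Release the resources before reporting a failure.
HeapFree(GetProcessHeap(), 0, buffer);
HeapFree(GetProcessHeap(), 0, inputFilePathHandle);
rtfFileHandle.close();

if (convertRet != 0) {
    throw runtime_error(string("ForeignToRtf32 failed, return code ") + to_string(convertRet));
}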
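To see the chunked-callback pattern in isolation, here is a portable sketch that runs without the DLL. fakeForeignToRtf is a hypothetical stand-in for ForeignToRtf32, not the real implementation: it merely copies a string into a shared buffer piece by piece and invokes the callback after each piece. The names chunkBuffer, rtfOutput and onChunk are also made up for illustration, and the 16-byte buffer is deliberately tiny so that several callbacks occur.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>

// Shared state, mirroring the globals used with the real callback.
static char chunkBuffer[16];          // small on purpose, to force several chunks
static std::ostringstream rtfOutput;  // stands in for the output file stream

// Same shape as the converter callback: the first argument is the number of valid bytes.
long onChunk(long cchBuff, long /*nPercent*/) {
    rtfOutput.write(chunkBuffer, cchBuff);
    return 0;  // zero lets the stand-in converter continue
}

// Hypothetical stand-in for ForeignToRtf32: feeds `rtf` to the callback chunk by chunk.
long fakeForeignToRtf(const std::string &rtf, long (*lpfnOut)(long, long)) {
    for (std::size_t pos = 0; pos < rtf.size(); pos += sizeof(chunkBuffer)) {
        const std::size_t n = std::min(sizeof(chunkBuffer), rtf.size() - pos);
        std::memcpy(chunkBuffer, rtf.data() + pos, n);
        const long percent = static_cast<long>((pos + n) * 100 / rtf.size());
        const long rc = lpfnOut(static_cast<long>(n), percent);
        if (rc != 0) {
            return rc;  // a nonzero result from the callback aborts the loop
        }
    }
    return 0;
}
```

The point of the sketch is that the callback never receives the data directly. It only learns how many bytes of the shared buffer are valid, which is why the buffer has to be reachable from the callback.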

I have written a wrapper that takes the arguments from the command line, and I have published it on GitHub.

Converting all MS Word for DOS files in the directory tree

First, recreate the directory tree of the current directory under pdf/, so that the converted files have somewhere to go:

find . -type d -not -path '.' -not -name 'pdf' -exec mkdir -vp pdf/{} \;

Word for DOS documents require a style sheet, typically named STANDARD.DFV. Let’s determine which styles are used in my documents:

find . -iname '*.txt' -exec strings {} \; | grep STANDARD.DFV | sort | uniq

For me, this command returned:

A:\STANDARD.DFV
C:\W4\STANDARD.DFV
C:\WORD\STANDARD.DFV
STANDARD.DFV

I placed these files in their respective directories within the ~/.wine/drive_c tree to prevent the converter from prompting for them.

Now run the conversion command for each file found:

find . -iname '*.txt' -exec wine ~/projects/word-converter/cmake-build-debug/word-converter.exe \
    ~/projects/word-converter/Doswrd32.cnv {} pdf/{}.rtf \;

This command successfully converted 238 out of 248 documents. The remaining 10 TXT files turned out to be plain text.
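Plain text files could also be filtered out before conversion by looking at the first bytes of each file. The sketch below assumes that Word for DOS documents begin with the bytes 0x31 0xBE 0x00 0x00, which, as far as I know, is the signature the file utility associates with old Microsoft Word documents. Treat that signature as an assumption and verify it against your own files. looksLikeWordForDos is a hypothetical helper, not part of the published wrapper.

```cpp
#include <fstream>
#include <string>

// Returns true when the file starts with the assumed Word for DOS signature
// (0x31 0xBE 0x00 0x00). Missing or shorter files yield false.
bool looksLikeWordForDos(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char header[4] = {0, 0, 0, 0};
    in.read(reinterpret_cast<char *>(header), sizeof(header));
    return in.gcount() == 4 &&
           header[0] == 0x31 && header[1] == 0xBE &&
           header[2] == 0x00 && header[3] == 0x00;
}
```

This is only a quick pre-check. IsFormatCorrect32 remains the converter's own verdict on whether it can handle a file.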

Now I have RTF files. A quick check of a couple of them shows that the layout is well preserved. To make them easier to read, I converted them to PDF:

find . -iname '*.rtf' -exec libreoffice --headless --invisible --norestore --convert-to pdf {} \;

Finally, we have the documents that are suitable for reading on modern computing platforms!