Słomkowski's technical musings

Playing with software, hardware and touching the sky with paraglider.

How to convert MS Word for DOS documents to a modern format


I recovered over 200 MS Word for DOS 4.0 documents from an ancient MFM hard drive, a MiniScribe 2012. Converting them to a modern format by hand was out of the question, so I looked for available solutions.

More recent versions of Microsoft Word (that is, the 32-bit Windows ones) have the so-called WinWord Converter API, available (I am not sure about the exact range) from MS Word 97 to Word 2013. A converter library is installed as a DLL (misleadingly given the file extension .cnv) together with an associated registry entry. It allows MS Word to load and save files in formats it does not support natively. The API is open: Microsoft publishes an SDK with a manual (Application Note GC1039) and code samples.

Fortunately, there is a converter that supports Word for DOS, available for download here and in other places on the web.

These libraries are meant for MS Word. I don’t have this particular editor, and converting over 200 files this way would be cumbersome anyway. Fortunately, there is a tool called WinCvt which loads the plugin and allows batch conversion to Rich Text Format (RTF), which is standardized and well supported. Unfortunately, WinCvt didn’t work for me: it kept failing during the RTF conversion.

I decided to examine the provided .cnv file and eventually write the wrapper program myself. In fact, I had done so before stumbling upon WinCvt: I simply started examining the .cnv file and found GC1039 after searching for the function names exported by the library.

Examining the Doswrd32.cnv file

The converter which supports Word for DOS is named Doswrd32.cnv. The Linux file command shows that it is an ordinary DLL:

Doswrd32.cnv: PE32 executable (DLL) (GUI) Intel 80386, for MS Windows

Let’s list the functions exported by this DLL:

winedump -j export Doswrd32.cnv

This command returns a lot of text. The interesting section is a list of exports:

  Entry Pt  Ordn  Name
  00001020     1 DllMain
  000062F0     2 ForeignToRtf32
  00005F40     3 GetReadNames
  00005F80     4 GetWriteNames
  00001000     5 InitConverter32
  000061D0     6 IsFormatCorrect32
  00006130     7 RegisterApp
  000064B0     8 RtfToForeign32
  00007C70     9 my_PasswordDlg

Once you have the function names, the first thing to do is to search for the characteristic ones, like ForeignToRtf32, on Google and on GitHub. GitHub is especially important because someone might have already figured it out and published code snippets.

Googling these names reveals that there is a document from Microsoft which describes the Converter API and even provides samples. Because this document is a DOC file packaged in an EXE, I provide it here in PDF format. On GitHub I found code samples for a WordPad-like example application and the WinCvt source code.

The API document shows that the API is quite straightforward and high-level. Three functions are required to perform the conversion: InitConverter32, IsFormatCorrect32 and ForeignToRtf32.

Writing the converter wrapper application

I set up a CLion CMake project with a Windows target for this investigation.

I took the header convapi.h from the SDK. It gave me the declarations of the relevant functions. Let’s define types for the function pointers which we are going to load from the converter DLL:

typedef short FCE;
extern "C" typedef long (PASCAL *lib_PFN_RTF)(long, long);
extern "C" typedef long (PASCAL *lib_InitConverter32)(HANDLE hWnd, char *szModule);
extern "C" typedef FCE (PASCAL *lib_IsFormatCorrect32)(HANDLE ghszFile, HANDLE ghszClass);
extern "C" typedef FCE (PASCAL *lib_ForeignToRtf32)(HANDLE ghszFile, void *pstgForeign, HANDLE ghBuff,
                                                    HANDLE ghszClass, HANDLE ghszSubset, lib_PFN_RTF lpfnOut);

The PASCAL macro is used in convapi.h and it is in fact an alias for __stdcall. If you don’t know what that means, read the Wikipedia article about calling conventions.

Then we can load the DLL using the WinAPI LoadLibrary function:

const char *libraryName = "Doswrd32.cnv";
HMODULE hLib = LoadLibrary(libraryName);
if (hLib == nullptr) {
    throw runtime_error(string("Error: load library ") + libraryName);
}

The code above loads the DLL at runtime. Let’s create function pointers for the library functions:

auto GetProcAddressAndCheck = [&](LPCSTR procName) {
    auto farProc = GetProcAddress(hLib, procName);
    if (farProc == nullptr) {
        throw runtime_error(string("Error: cannot find procedure ") + procName);
    }
    return farProc;
};
auto fInitConverter32 = (lib_InitConverter32) GetProcAddressAndCheck("InitConverter32");
auto fIsFormatCorrect32 = (lib_IsFormatCorrect32) GetProcAddressAndCheck("IsFormatCorrect32");
auto fForeignToRtf32 = (lib_ForeignToRtf32) GetProcAddressAndCheck("ForeignToRtf32");

Now we are ready to use them like any other function. Initialize the library first, as the manual instructs. We don’t have an application window, so we pass nullptr as the window HANDLE.

if (fInitConverter32(nullptr, nullptr) == 0) {
    throw runtime_error("InitConverter32() failed");
}

It’s good to do a sanity check on the document we’re going to convert, so we call IsFormatCorrect32. It’s important to note that the function arguments are of type HANDLE. This indicates that the function doesn’t accept ordinary pointers but memory blocks allocated through WinAPI, here with the HeapAlloc function. The memory is freed with the HeapFree function.

Let’s allocate memory for the input file path and fill it with the path to a test document:

auto inputFilePathHandle = HeapAlloc(GetProcessHeap(), HEAP_GENERATE_EXCEPTIONS, _MAX_PATH + 1);
lstrcpyA((char *) inputFilePathHandle, "Z:\\test.txt");
auto ret = fIsFormatCorrect32(inputFilePathHandle, nullptr);
if (ret != 1) {
    throw runtime_error(string("IsFormatCorrect32 failed, return code ") + to_string(ret));
}

If this function executes successfully, the converter should work.

The ForeignToRtf32 function does the actual job of converting the file to RTF. You have to provide a buffer and a callback function. The callback is called each time a chunk of the file has been processed. I simply write the content of the buffer to a file:

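char *buffer;            // output buffer shared with the converter
ofstream rtfFileHandle;  // destination RTF file

// Called by the converter each time it has put cchBuff bytes of RTF into the buffer.
long PASCAL callback(long cchBuff, long nPercent) {
    rtfFileHandle.write(buffer, cchBuff);
    return 0;
}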

Next I allocate the buffer and call the function. I allocated 10 KB for good measure. The size of an allocated block can be determined from its HANDLE, so ForeignToRtf32 won’t overflow it.

rtfFileHandle.open("Z:\\test.rtf", ios::binary | ios::trunc | ios::out);

buffer = (char *) HeapAlloc(GetProcessHeap(), HEAP_GENERATE_EXCEPTIONS, 10 * 1024);
auto ret = fForeignToRtf32(inputFilePathHandle, nullptr, buffer, nullptr, nullptr, callback);

HeapFree(GetProcessHeap(), 0, buffer);
rtfFileHandle.close();

if (ret != 0) {
    throw runtime_error(string("ForeignToRtf32 failed, return code ") + to_string(ret));
}
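The buffer-plus-callback contract can be tried out without the converter DLL. The following is a portable stand-in, not the Converter API itself: fakeConvert, collect, g_buffer and g_output are hypothetical names, and the rule that a negative callback result aborts the run is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-ins for the shared buffer and the output file.
static char g_buffer[8];
static std::string g_output;

// Same shape as the RTF output callback: the number of valid bytes in the
// shared buffer and a progress percentage.
using ChunkCallback = long (*)(long cchBuff, long nPercent);

// Appends the valid part of the shared buffer to the output.
long collect(long cchBuff, long /*nPercent*/) {
    g_output.append(g_buffer, static_cast<std::size_t>(cchBuff));
    return 0;
}

// Stand-in for a converter: copies `text` into the shared buffer in pieces of
// at most sizeof(g_buffer) bytes and calls `cb` after each piece.
// Assumption of this sketch: a negative callback result aborts the run.
long fakeConvert(const std::string &text, ChunkCallback cb) {
    assert(cb != nullptr);
    std::size_t done = 0;
    while (done < text.size()) {
        const std::size_t n = std::min(sizeof(g_buffer), text.size() - done);
        std::memcpy(g_buffer, text.data() + done, n);
        done += n;
        const long percent = static_cast<long>(done * 100 / text.size());
        const long rc = cb(static_cast<long>(n), percent);
        if (rc < 0) {
            return rc;
        }
    }
    return 0;
}
```

The point of the sketch is that the callback never receives the data itself, only its length, so the buffer must be reachable from the callback, which is why the snippets above keep it in a global variable.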

I have written a wrapper which takes its arguments from the command line and published it on GitHub.
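For illustration, command-line handling for such a wrapper might look roughly like this. It is a minimal sketch and not the published code: ConverterArgs and parseArgs are hypothetical names, and the argument order (converter library, input document, output RTF file) is taken from the conversion command used later in this post.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical container for the three arguments the wrapper is called with.
struct ConverterArgs {
    std::string converterPath;  // e.g. Doswrd32.cnv
    std::string inputPath;      // Word for DOS document
    std::string outputPath;     // RTF file to create
};

// Expects argv[0] to be the program name, followed by exactly three arguments.
ConverterArgs parseArgs(int argc, const char *const argv[]) {
    assert(argv != nullptr);
    if (argc != 4) {
        throw std::invalid_argument(
            "usage: word-converter <converter.cnv> <input> <output.rtf>");
    }
    return ConverterArgs{argv[1], argv[2], argv[3]};
}
```

Throwing on a wrong argument count keeps the error handling in the same style as the runtime_error checks in the snippets above.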

Converting all MS Word for DOS files in the directory tree

First, replicate the directory tree of the current directory under pdf/:

find . -type d -not -path '.' -not -name 'pdf' -exec mkdir -vp pdf/{} \;

Word for DOS documents need a style sheet, usually called STANDARD.DFV. Let’s find out which style sheets are referenced by my documents:

find . -iname '*.txt' -exec strings {} \; | grep STANDARD.DFV | sort | uniq

For me this command returned:

A:\STANDARD.DFV
C:\W4\STANDARD.DFV
C:\WORD\STANDARD.DFV
STANDARD.DFV
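The same lookup can be sketched in C++ for the bytes of a single file. This is only an approximation of the strings | grep | sort | uniq pipeline above (it has no minimum run length, for example), and findReferences is a hypothetical helper name.

```cpp
#include <cassert>
#include <set>
#include <string>

// Collects the printable ASCII runs in `bytes` that contain `needle`.
// A std::set keeps the results sorted and unique.
std::set<std::string> findReferences(const std::string &bytes,
                                     const std::string &needle) {
    assert(!needle.empty());
    std::set<std::string> found;
    std::string run;
    // Stores the current run if it mentions the needle, then starts a new one.
    auto flush = [&]() {
        if (run.find(needle) != std::string::npos) {
            found.insert(run);
        }
        run.clear();
    };
    for (unsigned char c : bytes) {
        if (c >= 0x20 && c < 0x7F) {
            run.push_back(static_cast<char>(c));  // printable: extend the run
        } else {
            flush();                              // anything else ends the run
        }
    }
    flush();  // the file may end in the middle of a run
    return found;
}
```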

I put these files in the respective directories under ~/.wine/drive_c so that the converter doesn’t ask for them.

Now the conversion command itself. It runs the converter on each TXT file:

find . -iname '*.txt' -exec wine ~/projects/word-converter/cmake-build-debug/word-converter.exe \
    ~/projects/word-converter/Doswrd32.cnv {} pdf/{}.rtf \;

This command successfully converted 238 of 248 documents; the remaining TXT files were plain text files.

Now I have RTF files. A quick check of a couple of them shows that the layout is well preserved. For reading, I converted them to PDF:
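The output naming used by the command above (pdf/{}.rtf) can be written down as a small helper. This is a sketch only: rtfPathFor is a hypothetical name, and unlike find it drops the leading "./" so that the result reads more cleanly.

```cpp
#include <cassert>
#include <string>

// Maps a path printed by `find .` (for example "./letters/a.txt") to the RTF
// path used for its conversion result ("pdf/letters/a.txt.rtf").
std::string rtfPathFor(const std::string &foundPath,
                       const std::string &outputRoot = "pdf") {
    assert(!foundPath.empty() && foundPath[0] != '/');  // relative paths only
    std::string relative = foundPath;
    if (relative.compare(0, 2, "./") == 0) {
        relative.erase(0, 2);  // drop the "./" prefix that find adds
    }
    return outputRoot + "/" + relative + ".rtf";
}
```

The original extension is kept in front of .rtf, so a.txt becomes a.txt.rtf, exactly as in the find command.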

find . -iname '*.rtf' -exec libreoffice --headless --invisible --norestore --convert-to pdf {} \;

We finally have documents that are suitable for reading on a modern computing platform!