Apr 21


Download Source Code: HelpFileExtractor.zip - 125.1KB

 

---
  A user-defined .NET object model based on the IStorage and IStream COM interfaces for compound documents. It extracts all parts of a CFBF document into a hierarchy of the original folders and files. The use case shown here is CHM Help files.  
---

Compound File Binary Format

Disassembled CHM file, with IStorage-folders and IStream-files

CFBF (Compound File Binary Format) is a storage format originally developed by Microsoft, for on-disk storage of data, using the IStorage and IStream COM interfaces. Microsoft has opened the format for use by others and it is now used in a variety of programs.

CHM (Compiled Windows HTML Help) is a well-known type of document saved in CFBF. Microsoft and most software vendors use it to ship Help information with their applications.

This article presents a clean and simple .NET solution, in C#, to disassemble a file saved as CFBF into its constituent low-level data parts. Our demo uses only CHM Help files, but this generic method should work on any other file or repository that uses the same type of storage.

Compound documents save their data in hierarchically organized structures, similar to file system folders. These folders are accessed through the IStorage interface. The data itself is stored in streams of bytes, similar to file system files, which are read through the IStream interface. In other words, the IStorage-IStream internal structure of a compound CFBF document looks like a tiny file system saved inside the file itself.

Fortunately, we don't need to know the details of CFBF at the file level, as long as we never read the file with direct I/O operations and go through the specialized COM interfaces instead. What we need is to open a top-level IStorage-based document, such as a CHM file, walk through its "subfolders", find the actual IStream-based bytes of data and extract them.

As in a file system, each block of data is identified by a Name and a Path, determined by its position in the folder hierarchy.

Specialized compound documents, like CHM, establish high-level rules for stream naming, but we are not concerned with them here. We treat every piece of IStream-based data extracted from the document the same way. A future article may look at the extracted data at the logical level and deal with semantic issues.

For our purposes, it is enough to disassemble the document into its original constituent pieces. We'll see that most of these IStream-based pieces are valid parts, such as images, HTML documents, CSS files, string tables, tables of contents and indexes. Most are neither encoded nor compressed, so they can be used immediately, as they are.

Custom IStorage-based Object Model

We declared standard CFBF-related COM interfaces and other required OLE32 elements in COM.cs.

Custom CFBF Object Model

To open a compound document and extract its data, we need a COM class that we can instantiate. An interface such as IStorage cannot be instantiated directly, so we use ITStorageClass, whose ITStorage interface can open a file and return its top-level IStorage.

We get access to the top-level IStorage interface of a compound document by calling the StgOpenStorage method with a file name and a read-only mode.

We make this call from the public constructor of a user-defined CompoundStorage class, and associate the new instance with the returned IStorage value. ITStorageClass is needed only for this first call:

private IStorage _storage;

// called from outside, to get access to the top IStorage
// for a compound document
public CompoundStorage(string name)
{
    _name = name;
    _path = _name + "_files";
    Debug.WriteLine(System.IO.Path.GetFileName(_name));
    
    // 0x20 = STGM_READ | STGM_SHARE_DENY_WRITE (read-only access)
    _storage = ((ITStorage)new ITStorageClass())
        .StgOpenStorage(_name, IntPtr.Zero, 0x20, IntPtr.Zero, 0);
    Load();
}
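The hexadecimal mode values in these listings are the standard STGM access and sharing flags from the Windows SDK. As a minimal sketch, they could be given names like this. The Stgm class and its RootOpenMode and ChildOpenMode members are hypothetical names chosen for this illustration and are not part of the downloadable source:

```csharp
using System;

// Hypothetical helper, not part of the article's COM.cs: named versions
// of the STGM flags that appear as raw hexadecimal numbers in the listings.
public static class Stgm
{
    public const int Read           = 0x00; // STGM_READ
    public const int Write          = 0x01; // STGM_WRITE
    public const int ReadWrite      = 0x02; // STGM_READWRITE
    public const int ShareExclusive = 0x10; // STGM_SHARE_EXCLUSIVE
    public const int ShareDenyWrite = 0x20; // STGM_SHARE_DENY_WRITE
    public const int ShareDenyRead  = 0x30; // STGM_SHARE_DENY_READ
    public const int ShareDenyNone  = 0x40; // STGM_SHARE_DENY_NONE

    // Value passed to StgOpenStorage in the constructor (0x20).
    public const int RootOpenMode = Read | ShareDenyWrite;

    // Value passed to OpenStorage and OpenStream for child elements (0x10).
    public const int ChildOpenMode = Read | ShareExclusive;
}
```

Because STGM_READ is zero, 0x20 alone already means "read access, deny write sharing", which is why the constructor can pass it as a read-only mode.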

A private Load method enumerates the stored elements, and we process only the IStorage-based and IStream-based parts. For child IStorage-based folders we already have a parent IStorage instance, so we call the OpenStorage method on it and pass the result to a second, private constructor.

For each IStream-based piece of data, we create an instance of a user-defined CompoundStream class, which derives from Stream. To get the IStream, we call OpenStream on the parent IStorage instance.

Each Load call collects lists of child IStorage-based and IStream-based instances, which we expose through the Storages and Streams properties:

/// <summary>
/// Recursively enumerate through and load all
/// IStorage and IStream parts
/// </summary>
private void Load()
{
    System.Runtime.InteropServices.ComTypes.STATSTG stats;
    int i;
    IEnumSTATSTG enumStats;
    _storage.EnumElements(0, IntPtr.Zero, 0, out enumStats);
    enumStats.Reset();

    while (enumStats.Next(1, out stats, out i) == 0)
    {
        string name = stats.pwcsName;

        // if inner compound document (1 = STGTY_STORAGE):
        // open it by its own name on the parent IStorage
        if (stats.type == 1)
            _storages.Add(new CompoundStorage(name,
                _path + "\\" + name,
                _storage.OpenStorage(name,
                    IntPtr.Zero, 0x10, IntPtr.Zero, 0)));

        // if stream (2 = STGTY_STREAM)
        else if (stats.type == 2)
            _streams.Add(new CompoundStream(name,
                _path + "\\" + name,
                _storage.OpenStream(name,
                    IntPtr.Zero, 0x10, 0)));
    }
}

// All CompoundStorage and CompoundStream parts of this compound document
private List<CompoundStorage> _storages
    = new List<CompoundStorage>();
public List<CompoundStorage> Storages
{ get { return _storages; } }

private List<CompoundStream> _streams
    = new List<CompoundStream>();
public List<CompoundStream> Streams
{ get { return _streams; } }
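Once Load has run, the Storages and Streams lists form a tree that can be walked recursively. The sketch below shows such a walk on a stand-in type, so that it runs without any COM object. StorageNode and StorageWalker are hypothetical names for this illustration; the real CompoundStorage class holds CompoundStream instances rather than plain strings:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for CompoundStorage: it keeps only a name and the
// two child lists, so the traversal can be tried without a compound file.
public class StorageNode
{
    public string Name;
    public List<StorageNode> Storages = new List<StorageNode>();
    public List<string> Streams = new List<string>();

    public StorageNode(string name) { Name = name; }
}

public static class StorageWalker
{
    // Depth-first listing of the path of every stream below the given node.
    // Streams of a node are listed before the streams of its child storages.
    public static List<string> ListStreamPaths(StorageNode node, string parentPath)
    {
        List<string> result = new List<string>();
        string path = parentPath.Length == 0
            ? node.Name
            : parentPath + "\\" + node.Name;

        foreach (string stream in node.Streams)
            result.Add(path + "\\" + stream);

        foreach (StorageNode child in node.Storages)
            result.AddRange(ListStreamPaths(child, path));

        return result;
    }
}
```

Calling ListStreamPaths(root, "") on the top node returns one path per stream, built the same way the Load method builds _path for each child.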

Because System.IO.Stream is an abstract class, our derived CompoundStream class must implement all of its abstract members: CanRead, CanWrite, CanSeek, Length, Position, Read, Write, Seek, SetLength and Flush. Close is virtual rather than abstract, so overriding it is optional.

CompoundStream must expose a stream of data backed by a COM IStream interface. The .NET Framework does not offer such a class, so we first implement an intermediate ComStream class that does just that: it adapts an IStream COM interface to a .NET Stream. Its implementation is not very complex, and this solution has already been presented in books and on web sites.
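For readers who have not seen such an adapter, here is a minimal read-only sketch. It is not the ComStream class from the downloadable source: the names ReadOnlyComStream and MemoryComStream are hypothetical, writing is deliberately left out, and the in-memory IStream exists only so the adapter can be exercised without a real compound file:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using IStream = System.Runtime.InteropServices.ComTypes.IStream;
using STATSTG = System.Runtime.InteropServices.ComTypes.STATSTG;

// Read-only adapter from a COM IStream to a .NET Stream (illustrative name).
public class ReadOnlyComStream : Stream
{
    private readonly IStream _stream;

    public ReadOnlyComStream(IStream stream)
    {
        if (stream == null) throw new ArgumentNullException("stream");
        _stream = stream;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }
    public override bool CanWrite { get { return false; } }

    public override long Length
    {
        get
        {
            STATSTG stat;
            _stream.Stat(out stat, 1); // 1 = STATFLAG_NONAME, the name is not needed
            return stat.cbSize;
        }
    }

    public override long Position
    {
        get { return Seek(0, SeekOrigin.Current); }
        set { Seek(value, SeekOrigin.Begin); }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // IStream.Read fills its array from index 0, so read into a
        // temporary array and copy when the caller asked for an offset.
        byte[] target = offset == 0 ? buffer : new byte[count];
        IntPtr bytesRead = Marshal.AllocHGlobal(sizeof(int));
        try
        {
            Marshal.WriteInt32(bytesRead, 0);
            _stream.Read(target, count, bytesRead);
            int read = Marshal.ReadInt32(bytesRead);
            if (offset != 0) Buffer.BlockCopy(target, 0, buffer, offset, read);
            return read;
        }
        finally { Marshal.FreeHGlobal(bytesRead); }
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        // SeekOrigin.Begin/Current/End share the values 0, 1, 2
        // with STREAM_SEEK_SET/CUR/END.
        IntPtr newPosition = Marshal.AllocHGlobal(sizeof(long));
        try
        {
            _stream.Seek(offset, (int)origin, newPosition);
            return Marshal.ReadInt64(newPosition);
        }
        finally { Marshal.FreeHGlobal(newPosition); }
    }

    public override void Flush() { }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count)
    { throw new NotSupportedException(); }
}

// In-memory IStream, used only to try the adapter without a COM object.
public class MemoryComStream : IStream
{
    private readonly MemoryStream _inner;
    public MemoryComStream(byte[] data) { _inner = new MemoryStream(data); }

    public void Read(byte[] pv, int cb, IntPtr pcbRead)
    {
        int n = _inner.Read(pv, 0, cb);
        if (pcbRead != IntPtr.Zero) Marshal.WriteInt32(pcbRead, n);
    }
    public void Seek(long dlibMove, int dwOrigin, IntPtr plibNewPosition)
    {
        long p = _inner.Seek(dlibMove, (SeekOrigin)dwOrigin);
        if (plibNewPosition != IntPtr.Zero) Marshal.WriteInt64(plibNewPosition, p);
    }
    public void Stat(out STATSTG pstatstg, int grfStatFlag)
    { pstatstg = new STATSTG(); pstatstg.cbSize = _inner.Length; pstatstg.type = 2; }
    public void Write(byte[] pv, int cb, IntPtr pcbWritten) { throw new NotSupportedException(); }
    public void SetSize(long libNewSize) { throw new NotSupportedException(); }
    public void CopyTo(IStream pstm, long cb, IntPtr pcbRead, IntPtr pcbWritten) { throw new NotSupportedException(); }
    public void Commit(int grfCommitFlags) { }
    public void Revert() { }
    public void LockRegion(long libOffset, long cb, int dwLockType) { throw new NotSupportedException(); }
    public void UnlockRegion(long libOffset, long cb, int dwLockType) { throw new NotSupportedException(); }
    public void Clone(out IStream ppstm) { throw new NotSupportedException(); }
}
```

The two type aliases at the top avoid a name clash with the obsolete STATSTG type that older .NET Framework versions also define in System.Runtime.InteropServices.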

Extraction Process

In our demo, we dump the constituent elements of each CHM file found in the execution directory into a new top folder, named after the help file plus a "_files" suffix. This folder is associated with the top IStorage-based CompoundStorage instance for the file. A new subfolder is created for every other IStorage-based instance.

Both CompoundStorage and CompoundStream implement a Save method. CompoundStorage.Save recursively dumps the document's structure, creating folders and subfolders, while CompoundStream.Save dumps the actual data into files:

/// <summary>
/// Recursively save all IStorage and IStream parts
/// of the compound document as folders and files
/// </summary>
public void Save()
{
    // (re)create path
    if (Directory.Exists(_path))
        Directory.Delete(_path, true);
    Directory.CreateDirectory(_path);

    foreach (CompoundStream stream in _streams)
        stream.Save();

    foreach (CompoundStorage storage in _storages)
        storage.Save();
}

/// <summary>
/// Save the IStream as data file
/// </summary>
public void Save()
{
    if (base.IStream == null)
        throw new ObjectDisposedException("IStream");

    Debug.WriteLine('\t' + _path);
    byte[] buffer = new byte[1000];
    int i;

    using (Stream stream = File.OpenWrite(_path))
        while ((i = base.Read(buffer, 0, 1000)) > 0)
            stream.Write(buffer, 0, i);
}
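The read-and-write loop in CompoundStream.Save works for any .NET Stream. As a hedged sketch of the same technique, outside the article's classes, the hypothetical helper below copies a stream into a file and returns the number of bytes written. StreamDump and SaveToFile are names invented for this example:

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the article's source.
public static class StreamDump
{
    // Copies everything from the current position of source into a new
    // file at path, and returns the number of bytes written.
    public static long SaveToFile(Stream source, string path)
    {
        string folder = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(folder))
            Directory.CreateDirectory(folder); // does nothing if it already exists

        byte[] buffer = new byte[4096];
        long total = 0;
        int read;

        // File.Create truncates an existing file; File.OpenWrite does not.
        using (Stream target = File.Create(path))
        {
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                target.Write(buffer, 0, read);
                total += read;
            }
        }
        return total;
    }
}
```

The article's listing can safely use File.OpenWrite because CompoundStorage.Save deletes and recreates the target folder first. A helper that may overwrite existing files is better served by File.Create, since File.OpenWrite would leave the tail of a longer old file in place.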

Our demo simply looks for all CHM files you added to the execution directory (bin/Debug), instantiates a CompoundStorage object for each, and calls the Save method to extract all of its constituent parts. Extraction is always non-destructive: the original CHM file is never altered, deleted or moved. The extraction folder, however, is recreated on every run:

// Disassemble all CHM files found in the execution directory.
// The constructor derives the output folder: file.FullName + "_files"
FileInfo[] files = new DirectoryInfo(
    Directory.GetCurrentDirectory()).GetFiles("*.chm");
foreach (FileInfo file in files)
    using (CompoundStorage storage = new CompoundStorage(
        file.FullName))
        storage.Save();

Once a compound file has been used, it is important to dispose of the object properly. This is why the CompoundStorage class is disposable and the loop uses the using keyword: CompoundStorage.Dispose is called automatically at the end of each using block. It recursively disposes of all child IStorage-based parts and explicitly closes the IStream-based streams:

public void Dispose()
{
    // close all IStorage
    foreach (CompoundStorage storage in _storages)
        storage.Dispose();
    _storages.Clear();

    // close all IStream
    foreach (CompoundStream stream in _streams)
        stream.Close();
    _streams.Clear();
}
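The listing does not show how the underlying IStorage reference itself is released. One common option, offered here as an assumption rather than as what the downloadable source does, is to call Marshal.ReleaseComObject once the children have been disposed. The ComRelease helper below is a hypothetical name for a guarded version of that call:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the article's source.
public static class ComRelease
{
    // Releases the COM reference behind a runtime callable wrapper.
    // Returns true when a COM reference was actually released, and
    // false for null or for an ordinary managed object.
    public static bool Release(object candidate)
    {
        if (candidate == null || !Marshal.IsComObject(candidate))
            return false;

        Marshal.ReleaseComObject(candidate);
        return true;
    }
}
```

Under that assumption, Dispose could end with ComRelease.Release(_storage) followed by _storage = null, so the file handle is freed deterministically instead of waiting for garbage collection.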

Conclusions

This simple open-source method can be used to automate low-level data extraction from CHM Help files and other kinds of IStorage-based compound documents.

The Compound File Binary Format is an open format, and CHM files can already be decompiled with tools such as HTML Help Workshop, so the extraction itself should not present legal issues. Note, however, that the constituent parts of a proprietary CHM file are covered by the same kind of license as the product as a whole. They may be copyrighted, and you should not use the extracted parts to build other similar applications.

 


1 Comment

1. Michael from N.J. Says:
Just wondering if anyone knows about other file format beside CHM using CFBF. So far I don't think this format was popular enough.
Was it abandoned? Is anybody else still using it?
Thanks,
Michael
 
