Apr 21


Download Source Code: PdfFileParser.zip - 205.25KB

---
  Custom object model for primitive types loaded from a PDF document. Use of simple sequential parsing techniques on text data. Tree-based visual PDF viewer, to inspect loaded primitive structures at this low data level.  
---

Overview

PDF (Portable Document Format) is a widely used file format. It originated as a proprietary Adobe format, but Adobe has published its specification, so in practice it is an open format. The specification, the PDF Reference, is available from Adobe as a single, huge 30MB+ PDF file.

There are not many open-source solutions for PDF files, especially in managed .NET code. The few C# projects we found deal with PDFs through regular expressions, and if you're not familiar with their complex syntax, that code is hard to follow.

Besides Adobe's Acrobat Reader, most PDF utilities are not free. Some are even unreasonably expensive, in the range of $500 and up. Given that PDF is an open format, it's hard to understand why there are so few open-source solutions, especially for .NET.

This article is a first step toward what is intended to become a foundation for multiple PDF utilities. We start with a simple implementation of a sequential parser that loads a PDF file into a custom data model, made up of a minimal set of classes that supports all types of stored PDF structures. While the current prototype has mostly tutorial value and may not be of great practical use, it gives developers a visual look at what PDF files look like inside.

PDF Data Model

PDF documents have one or more sections, each with a body, xref and trailer

In Adobe's PDF specification, the file structure is described as a file header, a body, a cross-reference table and a trailer, optionally followed by one or more additional sections, one for each incremental update. Since each update section has its own body, cross-reference table and trailer, we chose to group these three blocks into a PdfSection class, leaving the one-line header, with the file version information, to the PdfDocument class. A PdfDocument, once instantiated with a file name, sequentially reads each line of the PDF file and parses its logical elements into one or more sections. Each Body is a list of object declarations (object changes or additions, in an incremental update section) and the Trailer is a dictionary.
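
The grouping described above can be illustrated with a short Python sketch. It is not the article's C# code: the field names, the split_sections helper and the choice to cut sections at %%EOF markers are assumptions made for illustration, and the sketch only handles text-only input with no binary streams.

```python
from dataclasses import dataclass, field


@dataclass
class PdfSection:
    """One body / cross-reference table / trailer group (hypothetical layout)."""
    body: list = field(default_factory=list)     # raw lines of object declarations
    xref: list = field(default_factory=list)     # raw cross-reference lines
    trailer: list = field(default_factory=list)  # raw trailer lines, incl. startxref


@dataclass
class PdfDocument:
    header: str = ""                             # one-line header, e.g. "%PDF-1.4"
    sections: list = field(default_factory=list)


def split_sections(text: str) -> PdfDocument:
    """Read the lines sequentially and group them into header and sections."""
    lines = text.splitlines()
    doc = PdfDocument(header=lines[0] if lines else "")
    section, target = PdfSection(), "body"
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "xref":
            target = "xref"                      # following lines belong to the table
        elif stripped == "trailer":
            target = "trailer"                   # following lines belong to the trailer
        elif stripped == "%%EOF":
            doc.sections.append(section)         # assumed end of the current section
            section, target = PdfSection(), "body"
        else:
            getattr(section, target).append(line)
    return doc
```

Calling split_sections on a file with one incremental update should yield a PdfDocument with two PdfSection entries, each still holding raw lines that a later step would parse into objects.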

At the raw data level, there are just a few types of primary objects a PDF document can describe and store. Simple (non-composite) objects are represented by the base class PdfObject, which exposes the actual data through its Object property. This can be:

  1. null - for Null objects - and the IsNull property returns true.
  2. true or false - for Boolean types - and the IsBoolean property returns true.
  3. an integer or double value - for Numeric types - and the IsNumber property returns true.
  4. a Name string value, starting with / - when IsName returns true. A name is unique and is used either as a key in a dictionary or to label other objects, which can be Name objects as well.
  5. an end-of-line Comment, starting with % - and the IsComment property returns true.
  6. a String value surrounded by either (...) or <...> (for hexadecimal representations) - and the IsString property returns true. Note that the first character of a string value is always a specific delimiter or identifier. An empty String is represented as ().
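
As a rough illustration of the six cases listed above, the following Python sketch classifies a single token by its first character or keyword. The kind field, the snake_case properties and the parse_simple function are hypothetical stand-ins for the article's IsNull, IsBoolean and related properties; escape sequences and nested parentheses inside strings are not handled.

```python
class PdfObject:
    """Wraps one simple PDF value, mirroring the Object property in spirit."""

    def __init__(self, kind, value):
        self.kind = kind     # "null", "boolean", "number", "name", "comment" or "string"
        self.object = value  # the actual data

    is_null = property(lambda self: self.kind == "null")
    is_boolean = property(lambda self: self.kind == "boolean")
    is_number = property(lambda self: self.kind == "number")
    is_name = property(lambda self: self.kind == "name")
    is_comment = property(lambda self: self.kind == "comment")
    is_string = property(lambda self: self.kind == "string")


def parse_simple(token: str) -> PdfObject:
    """Classify one simple token; composite objects must be handled elsewhere."""
    if token == "null":
        return PdfObject("null", None)
    if token in ("true", "false"):
        return PdfObject("boolean", token == "true")
    if token.startswith("/"):
        return PdfObject("name", token[1:])
    if token.startswith("%"):
        return PdfObject("comment", token[1:].strip())
    if token.startswith("(") and token.endswith(")"):
        return PdfObject("string", token[1:-1])
    if token.startswith("<") and token.endswith(">"):
        digits = "".join(token[1:-1].split())
        if len(digits) % 2:          # a missing final hex digit is read as 0
            digits += "0"
        return PdfObject("string", bytes.fromhex(digits).decode("latin-1"))
    try:
        return PdfObject("number", int(token))
    except ValueError:
        # raises ValueError if the token is not a number either
        return PdfObject("number", float(token))
```

For instance, parse_simple("/Type") gives an object whose is_name is true and whose object is "Type", while parse_simple("()") gives an empty string.
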
PDF objects can specialize in composite structures and references, arrays and dictionaries

Data-level PDF structures are all declared in a body section as indirect objects, in obj...endobj blocks. Each indirect object is identified by an object number and a generation number. Indirect objects are represented in our model by the class PdfIndirectObject, a composite object whose Object property holds a List of constituent PdfObject elements. References to indirect objects are represented by PdfIndirectReference, which is a simple object with no aggregates. Its Object property should point to the PdfIndirectObject with the same object number and generation number.
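
To make the link between references and indirect objects concrete, here is a minimal Python sketch. It assumes every token in the body is separated by whitespace (so "[1 0 R]" would need spaces around the brackets), and the names load_indirect_objects and resolve_references are hypothetical, not taken from the downloadable source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdfIndirectObject:
    number: int
    generation: int
    tokens: list            # raw tokens between obj and endobj


@dataclass
class PdfIndirectReference:
    number: int
    generation: int
    object: Optional[PdfIndirectObject] = None  # target, once resolved


def load_indirect_objects(body: str) -> dict:
    """Scan tokens sequentially and collect every 'N G obj ... endobj' block."""
    tokens = body.split()
    table, i = {}, 0
    while i < len(tokens):
        if tokens[i] == "obj" and i >= 2:
            end = tokens.index("endobj", i)     # ValueError if the block is unterminated
            obj = PdfIndirectObject(int(tokens[i - 2]), int(tokens[i - 1]),
                                    tokens[i + 1:end])
            table[(obj.number, obj.generation)] = obj
            i = end
        i += 1
    return table


def resolve_references(table: dict) -> list:
    """Find every 'N G R' triple and point it at the matching indirect object."""
    refs = []
    for obj in table.values():
        toks = obj.tokens
        for i in range(2, len(toks)):
            if toks[i] == "R":
                key = (int(toks[i - 2]), int(toks[i - 1]))
                refs.append(PdfIndirectReference(key[0], key[1], table.get(key)))
    return refs
```

Keying the table by the (object number, generation number) pair is one simple way to make the lookup direct; a reference to a missing object is left with object set to None.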

The other two composite types of PDF objects are arrays and dictionaries. Arrays are represented by the PdfArray class, whose Object property holds a List of constituent PdfObject elements. Dictionaries are represented by the PdfDictionary class, whose Object property holds a Dictionary of constituent PdfObject elements, each located by a Name string key.

A PdfDictionary can also hold a Stream, declared in the file as a sequence of bytes between stream...endstream, immediately after the dictionary declaration.
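
A hedged Python sketch of pulling out the bytes between the two keywords follows. The extract_stream name is hypothetical, and searching for endstream is a simplification: a robust reader would rely on the /Length entry of the dictionary instead, since the payload itself may contain arbitrary bytes.

```python
def extract_stream(data: bytes, start: int = 0):
    """Return the bytes between stream and endstream, or None if no stream follows."""
    begin = data.find(b"stream", start)
    if begin == -1:
        return None
    begin += len(b"stream")
    # the stream keyword is followed by CRLF or a single LF
    if data[begin:begin + 2] == b"\r\n":
        begin += 2
    elif data[begin:begin + 1] == b"\n":
        begin += 1
    end = data.find(b"endstream", begin)
    if end == -1:
        raise ValueError("stream without matching endstream")
    payload = data[begin:end]
    # drop the single end-of-line that conventionally precedes endstream
    if payload.endswith(b"\r\n"):
        payload = payload[:-2]
    elif payload.endswith((b"\n", b"\r")):
        payload = payload[:-1]
    return payload
```

The data is read as bytes rather than text because stream contents are usually compressed or otherwise binary.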

All composite object types (indirect objects, arrays and dictionaries) can contain nested objects of other types, including other composite objects.
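
Nesting of this kind maps naturally onto a recursive parser. The Python sketch below uses plain lists and dicts in place of PdfArray and PdfDictionary, and its tokenize, parse_atom and parse_value helpers are hypothetical. It assumes the input contains no string literals or streams, and it leaves indirect references such as "2 0 R" as three separate values.

```python
def tokenize(text: str) -> list:
    """Split on whitespace, treating <<, >>, [ and ] as tokens of their own."""
    for delimiter in ("<<", ">>", "[", "]"):
        text = text.replace(delimiter, f" {delimiter} ")
    return text.split()


def parse_atom(token: str):
    """Convert a simple token; names keep their leading slash as plain strings."""
    if token == "null":
        return None
    if token in ("true", "false"):
        return token == "true"
    try:
        return int(token)
    except ValueError:
        pass
    try:
        return float(token)
    except ValueError:
        return token


def parse_value(tokens: list, pos: int = 0):
    """Return (value, next_pos) for the object starting at tokens[pos]."""
    token = tokens[pos]
    if token == "[":                              # array: values until the closing ]
        items, pos = [], pos + 1
        while tokens[pos] != "]":
            item, pos = parse_value(tokens, pos)
            items.append(item)
        return items, pos + 1
    if token == "<<":                             # dictionary: name/value pairs until >>
        entries, pos = {}, pos + 1
        while tokens[pos] != ">>":
            key = tokens[pos]
            if not key.startswith("/"):
                raise ValueError(f"dictionary key must be a name, got {key!r}")
            entries[key], pos = parse_value(tokens, pos + 1)
        return entries, pos + 1
    return parse_atom(token), pos + 1
```

Because parse_value calls itself for every element, an array inside a dictionary inside an array needs no special handling.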

This minimal set of low-level data classes makes it possible to represent any data stored in a PDF file. At higher logical levels, specific interpretation must be given to dictionary objects in particular, which may represent specialized elements such as the table of contents, pages and so on. The section trailer is also such a specialized type of dictionary.



3 Comments

1. guillaume Says:
Hello,
does anyone know about such a thing being done in PHP ?
 

2. jithugrg Says:
How can we implement this in ASP.NET with VB?
 

3. Cristian Says:
@jithugrg - try loading the compiled assembly in Reflector and generate a VB.NET project with its addon File Disassembler.
 
