Sunday, July 27, 2008

PDFMiner

What's It?

PDFMiner is a suite of programs that aims to help analyze text data from PDF documents. It includes a PDF parser, a PDF renderer (only text rendering is supported for now), and a couple of handy tools for extracting text. Unlike other PDF-related tools, it lets you obtain the exact location of text on a page, as well as other layout information such as font size or font name, which can be useful for analyzing the document.
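To make the value of location and font information concrete, here is a minimal sketch of how such data might be used once it has been extracted. The `TextItem` record and the `find_headings` helper are hypothetical names invented for this illustration and are not PDFMiner's own classes; the sample page data is made up as well.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextItem:
    """Hypothetical record for one piece of extracted text (not a PDFMiner class)."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units
    font_name: str
    font_size: float


def find_headings(items: List[TextItem], min_size: float) -> List[TextItem]:
    """Return items whose font size is at least min_size, ordered top to bottom.

    Assumes the usual PDF convention that y grows upward, so a larger
    y1 means the text sits higher on the page.
    """
    big = [item for item in items if item.font_size >= min_size]
    return sorted(big, key=lambda item: -item.bbox[3])


# Made-up sample data standing in for one page of extracted text.
page = [
    TextItem("Body paragraph...", (72.0, 600.0, 540.0, 612.0), "Times-Roman", 10.0),
    TextItem("2. Methods", (72.0, 400.0, 200.0, 416.0), "Helvetica-Bold", 14.0),
    TextItem("1. Introduction", (72.0, 700.0, 240.0, 716.0), "Helvetica-Bold", 14.0),
]

for heading in find_headings(page, min_size=12.0):
    print(heading.text, heading.bbox)
```

With only plain extracted text, the headings here would be indistinguishable from body text; the font size and bounding box are what make this kind of analysis possible.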

Features:

  • Written entirely in Python (version 2.5 or newer).
  • Supports the PDF specification up to version 1.7.
  • Supports non-ASCII languages and vertical writing scripts.
  • Supports various font types (Type1, TrueType, Type3, and CID).
  • Supports basic encryption (RC4).
  • Supports PDF to HTML conversion.
  • Supports outline (TOC) extraction.
  • Supports tagged content extraction.
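
For readers unfamiliar with the RC4 cipher named in the feature list, the following is a minimal sketch of the standard RC4 algorithm in plain Python. It is not PDFMiner's own code, and it covers only the stream cipher itself; how a PDF's encryption key is derived from the document's password is a separate step that is not shown here.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data with RC4 (the operation is symmetric)."""
    # Key-scheduling: permute the values 0..255 under control of the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Keystream generation: XOR each data byte with one keystream byte.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))  # applying RC4 again restores the input
```

Because encryption and decryption are the same operation, a single function is enough for both directions.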
