Without the right toolkit, extracting content from PDF documents can be extremely difficult. PDFlib TET (Text and Image Extraction Toolkit), one of many software developer components offered by Greatstone International, is built for exactly this task. In this post, we discuss what PDFlib TET is and some of its primary uses.
What is PDFlib TET?
PDFlib TET, from the PDFlib suite, is a toolkit that allows developers to extract text, images, and metadata from PDF documents. Text is extracted as Unicode strings, together with detailed colour, glyph and font information and its position on the page, whilst raster images are extracted in popular image formats, including TIFF and JPEG.
How is it used?
Some of the core uses of PDFlib TET include:
- Conversion – It can convert PDF documents to TETML, an XML-based format containing text, metadata, and resource information.
- Processing – When used alongside PDFlib PDI, TET can process PDF documents based on their contents, for example, splitting a PDF at its document headings.
- Indexing – It allows developers to implement a PDF indexer for a search engine.
- Repurposing – It extracts the text and images in PDFs so they can be reused in other documents and applications.
- Inspection – It enables developers to dynamically check if a target location on the page is empty before inserting a barcode or stamp.
- Analysis – With sophisticated content analysis algorithms, TET can determine word boundaries, group text into columns, detect table structures and remove redundant text.
- Querying – Developers can query PDF document details such as XMP metadata, font lists, page size and document information fields.
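
To make the inspection use case more concrete, here is a minimal Python sketch of the check itself. It does not call the TET API. The `Box` type, the `region_is_empty` function and the sample coordinates are hypothetical, and the word boxes simply stand in for the page positions an extraction step would report. Coordinates are assumed to be in PDF points, with the origin at the bottom-left of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in PDF points (hypothetical helper type)."""
    llx: float  # lower-left x
    lly: float  # lower-left y
    urx: float  # upper-right x
    ury: float  # upper-right y

    def overlaps(self, other: "Box") -> bool:
        # Two rectangles overlap unless one lies entirely beside, above,
        # or below the other. Boxes that only share an edge do not count.
        return (self.llx < other.urx and other.llx < self.urx
                and self.lly < other.ury and other.lly < self.ury)


def region_is_empty(word_boxes: list[Box], target: Box) -> bool:
    """Return True if no extracted word box intersects the target region."""
    return not any(box.overlaps(target) for box in word_boxes)


# Hypothetical word boxes for one page, standing in for extracted positions.
page_words = [
    Box(72, 700, 300, 714),  # a heading near the top of the page
    Box(72, 650, 520, 662),  # a line of body text
]

stamp_area = Box(400, 40, 560, 100)  # bottom-right corner of the page
print(region_is_empty(page_words, stamp_area))
```

This sketch only considers text. A real check before placing a barcode or stamp would probably also need to account for images and vector graphics in the target area.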
Speak to the PDFlib experts
It is one thing to understand PDFlib TET and what it can do, but how can you begin using it? Speak to the team at Greatstone International and we will help you get started.