PDF Connector

PDF (Portable Document Format) is a file format developed by Adobe for presenting documents independently of software, hardware, or operating systems. The PDF component can find text in PDF documents, list page numbers, extract specific pages, and extract text, tables, and layout information from a document.

Actions

Extract All Text

Extracts all text from the specified PDF document and returns it as an array of text strings.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |

Extract Page

Extracts the specified page from the PDF document and returns it as a new separate PDF document.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Page Number | The page number to extract from the PDF. | |

Extract Page Text

Extracts text from the specified page range in the PDF document.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Page Start | The starting page number for extraction. | |
| Page End | The ending page number for extraction. If not provided, only the start page is extracted. | |
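
The Page Start and Page End behavior described above can be sketched as follows. This is only an illustration, not the component's implementation: `resolve_page_range` is a hypothetical helper, and it assumes that page numbers are 1-based (consistent with the Page Numbers action) and that Page End is inclusive.

```python
from typing import List, Optional


def resolve_page_range(page_start: int, page_end: Optional[int] = None) -> List[int]:
    """Return the 1-based page numbers covered by a range (hypothetical helper).

    Assumption: page_end is inclusive. When page_end is omitted, only the
    start page is selected, as described for Extract Page Text.
    """
    if page_start < 1:
        raise ValueError("page_start must be 1 or greater")
    last = page_start if page_end is None else page_end
    if last < page_start:
        raise ValueError("page_end must not be less than page_start")
    return list(range(page_start, last + 1))


single = resolve_page_range(3)       # only the start page
several = resolve_page_range(2, 4)   # pages 2 through 4
```

Other actions on this page use different defaults for a missing Page Start or Page End, so the fallback shown here applies only to Extract Page Text.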

Extract Structured Text

Extracts all text items from the PDF with their position coordinates, dimensions, font metadata, and layout flags for custom parsing.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | |
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | |
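
As an illustration of the kind of custom parsing this action is meant to support, the sketch below picks out likely headings by font size. The `TextItem` shape and all of its field names are hypothetical; the action's actual output schema is not documented here, so map the fields to whatever the step returns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextItem:
    """Hypothetical shape of one structured text item (field names are assumptions)."""
    text: str
    x: float
    y: float
    width: float
    height: float
    font_name: str
    font_size: float


def find_headings(items: List[TextItem], min_font_size: float = 14.0) -> List[str]:
    """Return the text of items whose font size suggests a heading."""
    return [item.text for item in items if item.font_size >= min_font_size]


items = [
    TextItem("Quarterly Report", 72, 720, 200, 18, "Helvetica-Bold", 18.0),
    TextItem("Revenue grew in the third quarter.", 72, 690, 300, 11, "Helvetica", 11.0),
]
headings = find_headings(items)
```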

Extract Table Data

Detects and extracts tabular structures from the PDF using coordinate-based row and column clustering, returning two-dimensional string arrays.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Row Tolerance | Y-coordinate tolerance in PDF points for grouping text items into table rows. | 3 |
| Column Tolerance | X-coordinate tolerance in PDF points for detecting table column boundaries. Decrease for dense tables, increase for tables with wider spacing. | 10 |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | |
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | |
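
To show how the two tolerances interact, here is a minimal sketch of coordinate-based row and column clustering. It is one simple way to do this and is not necessarily the algorithm the component uses. The `(x, y, text)` item shape and the function names are hypothetical, and the sketch assumes PDF user-space coordinates in which y grows upward, so larger y values are nearer the top of the page.

```python
from typing import List, Tuple

Item = Tuple[float, float, str]  # (x, y, text): hypothetical item shape


def cluster(values: List[float], tolerance: float) -> List[float]:
    """Collapse values into cluster anchors.

    A value starts a new cluster when it lies more than `tolerance`
    beyond the most recent anchor.
    """
    anchors: List[float] = []
    for value in sorted(values):
        if not anchors or value - anchors[-1] > tolerance:
            anchors.append(value)
    return anchors


def nearest(anchors: List[float], value: float) -> int:
    """Index of the anchor closest to `value`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def extract_table(items: List[Item],
                  row_tolerance: float = 3.0,
                  column_tolerance: float = 10.0) -> List[List[str]]:
    """Build a two-dimensional string array from positioned text items."""
    # Reverse the row anchors so the first row is the topmost one.
    rows = cluster([y for _, y, _ in items], row_tolerance)[::-1]
    cols = cluster([x for x, _, _ in items], column_tolerance)
    table = [["" for _ in cols] for _ in rows]
    for x, y, text in items:
        r, c = nearest(rows, y), nearest(cols, x)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


items = [(50, 700, "Name"), (200, 701, "Qty"), (50, 680, "Apple"), (203, 679, "3")]
table = extract_table(items)
```

In this sketch a smaller column tolerance splits nearby x positions into more columns, which is why dense tables tend to need a lower value.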

Extract Text by Pattern

Extracts text from the specified PDF document that matches the search text.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Search Pattern | The text to search for in the PDF document. | |
| Characters After | The number of characters to extract after the search pattern. If not provided, the entire page is returned. | |
| Case Sensitive | When true, the search is case-sensitive. | false |
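
The interplay of Search Pattern, Characters After, and Case Sensitive can be sketched on a single page of text. This is an assumption-laden illustration, not the component's code: `extract_after` is a hypothetical function, the pattern is treated as literal text, only the first match is used, and the matched text itself is excluded from the result.

```python
import re
from typing import Optional


def extract_after(page_text: str,
                  pattern: str,
                  characters_after: Optional[int] = None,
                  case_sensitive: bool = False) -> Optional[str]:
    """Return the text following the first match of `pattern` (hypothetical helper).

    Returns None when the pattern is not found. When characters_after is
    omitted, the whole page text is returned, mirroring the documented default.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    match = re.search(re.escape(pattern), page_text, flags)
    if match is None:
        return None
    if characters_after is None:
        return page_text
    return page_text[match.end():match.end() + characters_after]


page = "Invoice No: 12345 Date: 2024-01-31"
number = extract_after(page, "invoice no: ", characters_after=5)
```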

Extract Text with Layout

Extracts text from the PDF with line breaks and paragraph spacing preserved from the original document layout.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Line Tolerance | Y-coordinate tolerance in PDF points for grouping text items into lines. Items within this vertical distance are considered to be on the same line. Increase for PDFs with inconsistent text positioning. | 2 |
| Page Start | The starting page number for extraction. If not provided, extraction starts from the first page. | |
| Page End | The ending page number for extraction. If not provided, extraction continues to the last page. | |
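
The effect of Line Tolerance can be sketched as a single pass over text items sorted from top to bottom. The `(x, y, text)` item shape and `group_into_lines` are hypothetical, the sketch assumes y grows upward as in PDF user space, and the component's actual grouping logic may differ.

```python
from typing import List, Tuple

Item = Tuple[float, float, str]  # (x, y, text): hypothetical item shape


def group_into_lines(items: List[Item], line_tolerance: float = 2.0) -> List[str]:
    """Group items whose y values lie within `line_tolerance` of a line's first item."""
    lines: List[Tuple[float, List[Tuple[float, str]]]] = []
    # Visit items from the top of the page downward (largest y first).
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the pieces left to right before joining.
    return [
        " ".join(text for _, text in sorted(parts, key=lambda part: part[0]))
        for _, parts in lines
    ]


items = [(72, 700.0, "Hello"), (110, 701.2, "world"), (72, 686.0, "Second"), (120, 685.5, "line")]
lines = group_into_lines(items)
```

With a tolerance below the 1.2-point offset between "Hello" and "world", the two words would land on separate lines, which is the situation the Increase guidance above addresses.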

Find Pattern

Searches the PDF document and returns page numbers containing text that matches the search criteria.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Search Pattern | The text pattern to search for in the PDF document. | |
| Case Sensitive | When true, the search is case-sensitive. | false |
| Use Regex | When true, treats the search pattern as a regular expression. | false |
| Contains | When true, returns pages containing the pattern; when false, returns pages without the pattern. | true |
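
The three flags combine as in the sketch below, which works on a list of per-page text strings. `find_pattern` is a hypothetical stand-in written with Python's `re` module; the regular expression dialect the component itself accepts is not stated here and may differ.

```python
import re
from typing import List


def find_pattern(pages: List[str],
                 pattern: str,
                 case_sensitive: bool = False,
                 use_regex: bool = False,
                 contains: bool = True) -> List[int]:
    """Return 1-based page numbers that match (or, with contains=False, do not match)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Without use_regex, escape the pattern so it is matched as literal text.
    regex = re.compile(pattern if use_regex else re.escape(pattern), flags)
    return [
        number
        for number, text in enumerate(pages, start=1)
        if (regex.search(text) is not None) == contains
    ]


pages = ["Total: $10.00", "No amounts here", "TOTAL: $5.50"]
with_total = find_pattern(pages, "total")
without_total = find_pattern(pages, "total", contains=False)
with_amount = find_pattern(pages, r"\$\d+\.\d{2}", use_regex=True)
```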

Find Text Position

Searches the PDF document and returns the position coordinates of all occurrences of the specified text.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |
| Search Text | The text to search for in the PDF document. | |
| Case Sensitive | When true, the search is case-sensitive. | false |
| Page Number | Limit the search to a specific page number. If not provided, all pages are searched. | |

Page Numbers

Returns a sequence of page numbers for the PDF document, from 1 to the last page.

| Input | Comments | Default |
| --- | --- | --- |
| PDF Data | The PDF file data to process. This can be a file reference from a previous step. | |