Partitioning strategies
For certain document types, such as images and PDFs, Unstructured products offer several ways to preprocess them, controlled by the strategy parameter.
PDF documents, for example, vary in quality and complexity. In simple cases, traditional NLP extraction techniques may
be enough to extract all the text from a document. In other cases, advanced image-to-text models are required
to process a PDF. You can think of the strategies as either “rule-based” workflows (quick, hence “fast”) or
“model-based” workflows (slower because they require model inference, but offering “higher resolution”, hence “hi_res”).
When choosing a partitioning strategy for your files, be mindful of the quality/speed trade-off.
For example, the fast strategy is roughly 100x faster than leading image-to-text models.
Available options:
auto (default strategy): The “auto” strategy chooses the partitioning strategy based on document characteristics and the function kwargs.
fast: The “rule-based” strategy leverages traditional NLP extraction techniques to quickly pull all the text elements. The “fast” strategy is not recommended for image-based file types.
hi_res: The “model-based” strategy identifies the layout of the document. The advantage of “hi_res” is that it uses the document layout to gain additional information about document elements. We recommend this strategy if your use case is highly sensitive to correct classification of document elements.
ocr_only: Another “model-based” strategy that leverages Optical Character Recognition to extract text from image-based files.
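The option descriptions above can be read as a simple decision rule. The sketch below is one possible reading, not how the “auto” strategy actually decides; the helper name pick_strategy and its two boolean inputs are hypothetical and exist only for illustration.

```python
def pick_strategy(image_based: bool, needs_layout: bool) -> str:
    """Hypothetical heuristic derived from the option descriptions.

    image_based:  the file has no embedded text (for example, a scan).
    needs_layout: the use case is sensitive to correct element classification.
    """
    if needs_layout:
        # Layout-aware, model-based partitioning.
        return "hi_res"
    if image_based:
        # "fast" is not recommended for image-based files, so fall back to OCR.
        return "ocr_only"
    # Text is extractable with rule-based techniques.
    return "fast"


for image_based, needs_layout in [(False, False), (True, False), (False, True)]:
    print(image_based, needs_layout, "->", pick_strategy(image_based, needs_layout))
```

In practice, leaving the strategy at “auto” delegates this choice to the library; an explicit rule like this is only useful when you already know your documents and want predictable cost.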
These strategies are available on the following partition functions:
Document Type | Partition Function | Strategies | Table Support | Options |
---|---|---|---|---|
Images (.png/.jpg/.heic) | partition_image | “auto”, “hi_res”, “ocr_only” | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy |
PDFs (.pdf) | partition_pdf | “auto”, “fast”, “hi_res”, “ocr_only” | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy |
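As a minimal sketch of how the table might be used in code, the example below checks a requested strategy against the partition function for a file's extension before calling it. The helpers resolve and partition_file are hypothetical names, and the file names in the usage lines are placeholders. The calls to partition_pdf and partition_image assume the open source unstructured Python library is installed with its PDF and image extras; the imports are deferred so the validation logic runs without it.

```python
from pathlib import Path

# Strategies accepted by each partition function, copied from the table.
SUPPORTED = {
    "partition_image": ("auto", "hi_res", "ocr_only"),
    "partition_pdf": ("auto", "fast", "hi_res", "ocr_only"),
}

# File extensions from the table's "Document Type" column.
EXTENSION_TO_FUNCTION = {
    ".png": "partition_image",
    ".jpg": "partition_image",
    ".heic": "partition_image",
    ".pdf": "partition_pdf",
}


def resolve(filename: str, strategy: str = "auto") -> tuple[str, str]:
    """Hypothetical helper: return (function name, strategy) or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSION_TO_FUNCTION:
        raise ValueError(f"no strategy-aware partition function for {suffix!r}")
    func_name = EXTENSION_TO_FUNCTION[suffix]
    if strategy not in SUPPORTED[func_name]:
        raise ValueError(f"{func_name} does not support strategy {strategy!r}")
    return func_name, strategy


def partition_file(filename: str, strategy: str = "auto"):
    """Hypothetical wrapper: validate the strategy, then partition the file."""
    func_name, strategy = resolve(filename, strategy)
    # Deferred imports: these need the unstructured package to be installed.
    if func_name == "partition_pdf":
        from unstructured.partition.pdf import partition_pdf as partition
    else:
        from unstructured.partition.image import partition_image as partition
    return partition(filename=filename, strategy=strategy)


print(resolve("report.pdf", "fast"))
print(resolve("scan.PNG"))
```

Validating up front gives a clear error for combinations the table rules out, such as “fast” with an image file, before any model is loaded.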