Pipeline

POST

general

Authorizations

unstructured-api-key

string

header

required

Headers

unstructured-api-key

string | null

Body

multipart/form-data

files

file

required

The file to extract

coordinates

boolean

default: false

If True, return coordinates for each element extracted via OCR. Default: False

content_type

string | null

A hint about the content type to use (such as text/markdown), when there are problems processing a specific file. This value is a MIME type in the format type/subtype.
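As a rough illustration of producing a `type/subtype` hint for this field, the sketch below guesses a MIME type from the file extension using Python's standard `mimetypes` module. The helper name `content_type_hint` is hypothetical and not part of the API; whether the service accepts every value this returns is an assumption.

```python
import mimetypes
from typing import Optional


def content_type_hint(filename: str, fallback: Optional[str] = None) -> Optional[str]:
    """Guess a MIME type (type/subtype) from the file extension.

    Hypothetical helper: returns `fallback` when the extension is unknown.
    """
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or fallback


# The result could be sent as the `content_type` form field.
print(content_type_hint("report.pdf"))
```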

encoding

string | null

The encoding method used to decode the text input. Default: utf-8

extract_image_block_types

string[]

The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields.

gz_uncompressed_content_type

string | null

If the file is gzipped, use this content type after unzipping.
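A minimal sketch of preparing a gzipped upload, assuming the compressed bytes are sent as `files` and the original MIME type is passed in `gz_uncompressed_content_type`. The helper `gzip_payload` is hypothetical; only the field name comes from this page.

```python
import gzip
from typing import Dict, Tuple


def gzip_payload(raw: bytes, uncompressed_content_type: str) -> Tuple[bytes, Dict[str, str]]:
    """Compress the file bytes and return them with the matching form field."""
    compressed = gzip.compress(raw)
    fields = {"gz_uncompressed_content_type": uncompressed_content_type}
    return compressed, fields


compressed, fields = gzip_payload(b"# Title\n\nSome markdown text.\n", "text/markdown")
print(len(compressed), fields)
```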

hi_res_model_name

string | null

The name of the inference model used when strategy is hi_res.

include_page_breaks

boolean

default: false

If true, the output will include page breaks if the filetype supports it. Default: false

languages

string[]

The languages present in the document, for use in partitioning and/or OCR. See the Tesseract documentation for a full list of languages.

ocr_languages

string[]

Deprecated! The languages present in the document, for use in partitioning and/or OCR.

output_format

enum<string>

default: application/json

The format of the response. Supported formats are application/json and text/csv. Default: application/json.

Available options:

application/json,

text/csv

pdf_infer_table_structure

boolean

default: true

Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents.

skip_infer_table_types

string[]

The document types for which table extraction should be skipped. Default: []

starting_page_number

integer | null

When a PDF is split into pages before being sent to the API, providing this value allows page numbers to be assigned correctly. Introduced in 1.0.27.
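The arithmetic behind this field can be sketched as follows. The example assumes 1-based page numbers and fixed-size splits, neither of which is stated on this page, and the helper name `starting_page_numbers` is hypothetical.

```python
from typing import List


def starting_page_numbers(total_pages: int, pages_per_split: int) -> List[int]:
    """Return the `starting_page_number` to send with each split of a PDF.

    Assumes pages are numbered from 1 and every split (except possibly
    the last) holds `pages_per_split` pages.
    """
    if total_pages < 1 or pages_per_split < 1:
        raise ValueError("total_pages and pages_per_split must be positive")
    return list(range(1, total_pages + 1, pages_per_split))


# A 10-page PDF split into 4-page pieces: the pieces start at pages 1, 5, and 9.
print(starting_page_numbers(10, 4))  # [1, 5, 9]
```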

strategy

enum<string>

default: hi_res

The strategy to use for partitioning PDF and image files. Options are fast, hi_res, auto, and ocr_only. Default: hi_res

Available options:

fast,

hi_res,

auto,

ocr_only

unique_element_ids

boolean

default: false

When True, assign UUIDs as element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise, a SHA-256 hash of the element text is used. Default: False
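To illustrate the difference between the two ID schemes, here is a small sketch. It is not the service's implementation: the exact hash input, and whether the digest is truncated, are not specified on this page, so the full hex digest of the UTF-8 text is an assumption.

```python
import hashlib
import uuid


def element_id(text: str, unique: bool = False) -> str:
    """Illustrative ID scheme: random UUID, or a SHA-256 digest of the text."""
    if unique:
        # Random UUIDs differ even for elements with identical text.
        return str(uuid.uuid4())
    # Hash-based IDs are deterministic, so identical text yields identical IDs.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


same_a = element_id("Page 1 of 3")
same_b = element_id("Page 1 of 3")
print(same_a == same_b)  # True: repeated text collides under hashing
print(element_id("Page 1 of 3", unique=True) == element_id("Page 1 of 3", unique=True))  # False
```

The collision in the hash-based case is why UUIDs are the safer choice for primary keys when a document repeats text such as headers or footers.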

xml_keep_tags

boolean

default: false

If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to XML documents.

chunking_strategy

enum<string> | null

Use one of the supported strategies to chunk the returned elements after partitioning. When 'chunking_strategy' is not specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: 'basic', 'by_page', 'by_similarity', or 'by_title'

Available options:

basic,

by_page,

by_similarity,

by_title
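Because chunking parameters are ignored unless `chunking_strategy` is set, a client may prefer to drop them before sending. The sketch below does that; the helper `chunking_fields` is hypothetical, and encoding booleans as lowercase `true`/`false` form values is an assumption about what the service accepts.

```python
from typing import Dict, Optional

# Strategy names as listed under "Available options".
CHUNKING_STRATEGIES = {"basic", "by_page", "by_similarity", "by_title"}


def _encode(value: object) -> str:
    """Render a Python value as a form-field string (booleans lowercased)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def chunking_fields(strategy: Optional[str], **options: object) -> Dict[str, str]:
    """Build chunking form fields, omitting them all when no strategy is set."""
    if strategy is None:
        return {}  # the service would ignore the other chunking options anyway
    if strategy not in CHUNKING_STRATEGIES:
        raise ValueError(f"unsupported chunking_strategy: {strategy!r}")
    fields = {"chunking_strategy": strategy}
    for name, value in options.items():
        if value is not None:
            fields[name] = _encode(value)
    return fields


print(chunking_fields("by_title", max_characters=1000, multipage_sections=False))
print(chunking_fields(None, max_characters=1000))  # {}
```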

combine_under_n_chars

integer | null

If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500

include_orig_elements

boolean | null

When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as .metadata.orig_elements. Default: true.

max_characters

integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 500

multipage_sections

boolean

default: true

If chunking strategy is set, determines whether sections can span multiple pages. Default: true

new_after_n_chars

integer | null

If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500

overlap

integer

default: 0

Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks where an oversized element is divided into multiple chunks by text-splitting. Default: 0

overlap_all

boolean

default: false

When True, apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: False
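The idea of prefixing a tail from one chunk onto the next can be shown with a toy text splitter. This is a character-based sketch only: it does not reproduce the service's actual splitting rules (for example, any handling of word boundaries), and `split_with_overlap` is a hypothetical name.

```python
from typing import List


def split_with_overlap(text: str, max_characters: int, overlap: int) -> List[str]:
    """Split oversized text into chunks, repeating the last `overlap`
    characters of each chunk at the start of the next one."""
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    chunks = []
    step = max_characters - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

With `overlap=0` the same call yields non-overlapping chunks, which corresponds to the default behavior described above.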

similarity_threshold

number | null

A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value can only guarantee that two elements with similarity below the threshold will appear in separate chunks.

include_slide_notes

boolean

default: true

When True, slide notes from .ppt and .pptx files will be included in the response. Default: True
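Putting the header and body parameters together, the following sketch builds (but does not send) a `multipart/form-data` POST request using only the standard library. The URL is a placeholder, since this page does not state the endpoint address. The helper `build_partition_request`, the generic `application/octet-stream` part type, and the choice to send array parameters as repeated form fields are all assumptions.

```python
import uuid
from typing import Dict, List, Union
from urllib.request import Request

FieldValue = Union[str, List[str]]


def build_partition_request(url: str, api_key: str, filename: str,
                            file_bytes: bytes,
                            fields: Dict[str, FieldValue]) -> Request:
    """Assemble a multipart/form-data POST carrying one file plus form fields."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # Assumption: array parameters are sent as repeated fields.
        values = value if isinstance(value, list) else [value]
        for item in values:
            parts.append(
                (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                 f"{item}\r\n").encode("utf-8"))
    # The file itself goes in the required `files` field.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    headers = {
        "unstructured-api-key": api_key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
    }
    return Request(url, data=b"".join(parts), headers=headers, method="POST")


request = build_partition_request(
    "https://example.invalid/general",  # placeholder URL, not the real endpoint
    "YOUR_API_KEY", "notes.txt", b"Hello, world.",
    {"strategy": "fast", "languages": ["eng", "deu"]})
print(request.get_method(), len(request.data))
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access or a valid key.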

Response

200 - application/json

A list of element dictionaries extracted from the file
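A short sketch of handling the JSON response, assuming each element dictionary carries `type` and `text` keys. That schema is not spelled out on this page, and the sample payload below is made up for illustration. `summarize_elements` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_elements(payload: str) -> Counter:
    """Count elements by their `type` key in a JSON list of element dicts."""
    elements = json.loads(payload)
    if not isinstance(elements, list):
        raise ValueError("expected a JSON list of element dictionaries")
    # Elements without a `type` key are counted as "unknown".
    return Counter(element.get("type", "unknown") for element in elements)


# Made-up sample body standing in for a 200 application/json response.
sample = json.dumps([
    {"type": "Title", "text": "Quarterly report"},
    {"type": "NarrativeText", "text": "Revenue grew."},
    {"type": "NarrativeText", "text": "Costs fell."},
])
print(summarize_elements(sample))
```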
