The Unstructured Python SDK client allows you to send an individual file for processing by Unstructured API services. Whether you’re using the Free Unstructured API, the Unstructured Serverless API, the Unstructured API on Azure/AWS, or your local deployment of the Unstructured API, you can access the API using the Python SDK.

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

  • You need to work with documents in cloud storage.
  • You want to cache the results of processing multiple files in batches.
  • You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

To use the Python SDK, you’ll need:

These environment variables:

  • UNSTRUCTURED_API_KEY - Your Unstructured API key value.
  • UNSTRUCTURED_API_URL - Your Unstructured API URL.

If you do not specify the API URL, your Unstructured Serverless API pay-as-you-go account will be used by default. You must always specify your Serverless API key.

To use the Free Unstructured API, you must always specify your Free API key and the Free API URL, which is https://api.unstructured.io/general/v0/general

To use the pay-as-you-go Unstructured API on Azure or AWS with the SDKs, you must always specify the corresponding API URL. See the Azure or AWS instructions.

Installation

Before using the SDK to interact with Unstructured API services, install the library:

Python
pip install unstructured-client

The SDK uses semantic versioning, so major version bumps may introduce breaking changes. Pin the version you install. See the migration guide, later on this page, for breaking change announcements.

Basics

Let’s start with a simple example in which you send a PDF document to be partitioned with the Unstructured API. Note that the Python SDK gives you a choice of two interfaces. The functionality is identical, but the Pydantic model syntax may provide completion hints in your IDE, for instance.
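As a minimal sketch of such a request, the function below uses the Pydantic model interface, with the same classes that appear in the page-splitting example later on this page. The function name partition_file and the element keys mentioned in the usage comment are illustrative assumptions, and the SDK imports are placed inside the function only so the snippet can be loaded where the SDK is not installed.

```python
import os


def partition_file(path):
    """Send one local file to the Unstructured API and return its elements."""
    # Imported lazily so this module loads even without the SDK installed.
    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations, shared

    client = UnstructuredClient(
        api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
        server_url=os.getenv("UNSTRUCTURED_API_URL"),
    )

    # Read the file as bytes and wrap it in the request models.
    with open(path, "rb") as f:
        req = operations.PartitionRequest(
            partition_parameters=shared.PartitionParameters(
                files=shared.Files(
                    content=f.read(),
                    file_name=os.path.basename(path),
                ),
                strategy=shared.Strategy.HI_RES,
            )
        )

    # The request is passed by keyword (see the migration guide on this page).
    res = client.general.partition(request=req)
    return res.elements or []


# Usage (hypothetical path; requires the environment variables above):
#     for element in partition_file("sample.pdf"):
#         print(element["type"], element["text"][:80])
```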

Async partitioning

The Python SDK also provides partition_async. This call is equivalent to partition, except that it can be used in a non-blocking context. For instance, asyncio.gather can be used to concurrently process multiple files inside a directory hierarchy.
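One way to combine partition_async with asyncio.gather is sketched below. The helper names partition_tree and make_sdk_partitioner, and the max_concurrency limit, are assumptions made for illustration and are not part of the SDK. The gathering logic is kept independent of the SDK so that it can be exercised with any coroutine.

```python
import asyncio
import os
from pathlib import Path


async def partition_tree(root, partition_one, max_concurrency=5):
    """Run partition_one(path) for every file under root; return {path: result}."""
    paths = sorted(p for p in Path(root).rglob("*") if p.is_file())
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        # Limit how many requests are in flight at once.
        async with semaphore:
            return await partition_one(path)

    # return_exceptions=True keeps one failed file from cancelling the rest;
    # a failed file maps to its exception object in the result.
    results = await asyncio.gather(
        *(guarded(p) for p in paths), return_exceptions=True
    )
    return dict(zip(paths, results))


def make_sdk_partitioner():
    """Build a coroutine function that partitions one file with the SDK."""
    # Imported lazily so the helper above can be used without the SDK.
    from unstructured_client import UnstructuredClient
    from unstructured_client.models import operations, shared

    client = UnstructuredClient(
        api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
        server_url=os.getenv("UNSTRUCTURED_API_URL"),
    )

    async def partition_one(path):
        req = operations.PartitionRequest(
            partition_parameters=shared.PartitionParameters(
                files=shared.Files(content=path.read_bytes(), file_name=path.name),
            )
        )
        res = await client.general.partition_async(request=req)
        return res.elements

    return partition_one


# Usage (hypothetical directory name):
#     results = asyncio.run(partition_tree("docs/", make_sdk_partitioner()))
```

The semaphore is a design choice rather than a requirement: asyncio.gather alone would start every request at once, which may be undesirable for large directories.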

Page splitting

To speed up processing of large PDF files, the split_pdf_page parameter is True by default. This causes the PDF to be split into small batches of pages before requests are sent to the API. The client awaits all parallel requests and combines the responses into a single response object. Splitting applies only to PDF files; the parameter is ignored for other file types.

The number of parallel requests is controlled by split_pdf_concurrency_level. The default is 8, and the maximum is 15 to avoid high resource usage and costs.

If at least one request is successful, the responses are combined into a single response object. An error is returned only if all requests failed or there was an error during splitting.

This feature may lead to unexpected results when chunking because the server does not see the entire document context at once. If you’d like to chunk across the whole document and still get the speedup from parallel processing, you can:

  • Partition the PDF with split_pdf_page set to True, without any chunking parameters.
  • Store the returned elements in results.json.
  • Partition this JSON file with the desired chunking parameters.
Python
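import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "PATH_TO_INPUT_FILE"  # Placeholder: path to a local PDF file.

with open(filename, "rb") as file:
    req = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(
                content=file.read(),
                file_name=filename,
            ),
            strategy=shared.Strategy.HI_RES,
            split_pdf_page=True,             # If True, splits the PDF file into smaller chunks of pages.
            split_pdf_allow_failed=True,     # If True, the partitioning continues even if some pages fail.
            split_pdf_concurrency_level=15,  # Set the number of concurrent requests to the maximum value: 15.
        )
    )

# Pass the request by name; positional arguments are not accepted as of 0.26.0.
res = client.general.partition(request=req)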
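The three-step workaround for chunking across the whole document can be sketched as follows. The helper names save_elements and chunk_saved_elements are hypothetical, and the chunking values ("by_title", 1000) are illustrative only; refer to the API parameters page for the chunking parameters your SDK version accepts.

```python
import json
import os


def save_elements(elements, path="results.json"):
    """Step 2: store the elements returned by the split-PDF request."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(elements, f)
    return path


def chunk_saved_elements(client, path="results.json"):
    """Step 3: partition the saved JSON file, this time with chunking parameters."""
    # Imported lazily so save_elements can be used without the SDK installed.
    from unstructured_client.models import operations, shared

    with open(path, "rb") as f:
        req = operations.PartitionRequest(
            partition_parameters=shared.PartitionParameters(
                files=shared.Files(
                    content=f.read(),
                    file_name=os.path.basename(path),
                ),
                chunking_strategy="by_title",  # Illustrative value.
                max_characters=1000,           # Illustrative value.
            )
        )
    return client.general.partition(request=req).elements


# Usage, given `client` and `res` from a split-PDF request made without
# chunking parameters:
#     save_elements(res.elements)
#     chunks = chunk_saved_elements(client)
```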

Customizing the client

Retries

You can also change the retry defaults through the retry_config parameter when initializing the client. If a request to the API fails, the client retries it with an exponential backoff strategy, up to a maximum interval of one minute between attempts. The client keeps retrying until the total elapsed time exceeds max_elapsed_time, which defaults to one hour:

Python
import os

from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
    retry_config=RetryConfig(
        strategy="backoff",
        retry_connection_errors=True,
        backoff=BackoffStrategy(
            # Time intervals are defined in milliseconds.
            initial_interval=500,
            max_interval=60000,
            exponent=1.5,
            max_elapsed_time=900000,  # 15 min * 60 s * 1000 ms = 15 minutes
        ),
    )
)
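To get a feel for what these settings imply, the sketch below computes the sequence of wait times for the values used above. It is an illustration only, not the SDK's retry code: it assumes each wait is the previous one multiplied by exponent and capped at max_interval, it counts only the waits toward the elapsed time, and it ignores any random jitter the client may add. The function name backoff_intervals is hypothetical.

```python
def backoff_intervals(initial_interval, max_interval, exponent, max_elapsed_time):
    """Yield wait times in milliseconds until their sum would exceed max_elapsed_time."""
    elapsed = 0.0
    attempt = 0
    while True:
        # Grow geometrically, but never wait longer than max_interval.
        wait = min(initial_interval * exponent ** attempt, max_interval)
        if elapsed + wait > max_elapsed_time:
            return
        yield wait
        elapsed += wait
        attempt += 1


# Same values as the BackoffStrategy above.
waits = list(
    backoff_intervals(
        initial_interval=500,
        max_interval=60000,
        exponent=1.5,
        max_elapsed_time=900000,
    )
)
```

Under these assumptions the first waits are 500 ms, 750 ms, and 1125 ms, and every later wait is capped at 60000 ms.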

Disabling SSL validation

If you disable SSL validation, requests will accept any TLS certificate presented by the server and ignore hostname mismatches and expired certificates, which makes your application vulnerable to man-in-the-middle (MitM) attacks. Only set verify to False for testing.

Python
import requests

http_client = requests.Session()
http_client.verify = False  # Disables TLS certificate checks; for testing only.

client = UnstructuredClient(
    client=http_client,
    ...
)

Handling the response

The partition response defaults to a dict format that can be converted to Unstructured elements with the elements_from_dicts utility function as seen below. Otherwise, the API response can be sent directly to your vector store or another destination.

Python
from unstructured.staging.base import elements_from_dicts

# ...

if res.elements is not None:
    elements = elements_from_dicts(res.elements)
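If you work with the dict format directly instead, a small helper can organize the output before it goes to a destination. The sketch below groups element text by element type; the helper name text_by_type is hypothetical, and it assumes each element dict carries "type" and "text" keys.

```python
from collections import defaultdict


def text_by_type(elements):
    """Group the text of each element dict under its element type."""
    grouped = defaultdict(list)
    for element in elements or []:
        # Fall back to placeholders if a key is missing.
        grouped[element.get("type", "Unknown")].append(element.get("text", ""))
    return dict(grouped)


# Usage, given `res` from a partition call:
#     titles = text_by_type(res.elements).get("Title", [])
```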

Parameters & examples

The parameter names used in this document are for the Python SDK, which follows the snake_case convention. The JavaScript/TypeScript SDK follows the camelCase convention. Other than this difference in naming convention, the names used in the SDKs are the same across all methods.

  • Refer to the API parameters page for the full list of available parameters.
  • Refer to the Examples page for some inspiration on using the parameters.

Migration guide

There are minor breaking changes in 0.26.0. If you encounter any errors when upgrading, please find the solution below.

If you see the error: AttributeError: 'PartitionParameters' object has no attribute 'partition_parameters'

Previously, the SDK accepted a PartitionParameters object as input to the sdk.general.partition function. Now, this object must be wrapped in a PartitionRequest object. The old behavior was deprecated in 0.23.0 and removed in 0.26.0.

# Instead of:
from unstructured_client.models import shared

req = shared.PartitionParameters(
    files=files,
)

resp = s.general.partition(request=req)


# Switch to:
from unstructured_client.models import shared, operations

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
    )
)

resp = s.general.partition(request=req)

If you see the error: TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given

In 0.26.0, the PartitionRequest constructor no longer allows for positional arguments. You must specify partition_parameters by name.

# Instead of:
req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
    )
)

# Switch to:
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
    )
)

If you see the error: TypeError: General.partition() takes 1 positional argument but 2 were given

In 0.26.0, the partition function no longer allows for positional arguments. You must specify request by name.

# Instead of:
resp = s.general.partition(req)

# Switch to:
resp = s.general.partition(request=req)