Examples

For each of these examples, you’ll need:

These environment variables:

UNSTRUCTURED_API_KEY - Your Unstructured API key value.
UNSTRUCTURED_API_URL - Your Unstructured API URL.

If you do not specify the API URL, your Unstructured Serverless API pay-as-you-go account will be used by default. You must always specify your Serverless API key.

To use the Free Unstructured API, you must always specify your Free API key, and the Free API URL which is https://api.unstructured.io/general/v0/general

To use the pay-as-you-go Unstructured API on Azure or AWS with the SDKs, you must always specify the corresponding API URL. See the Azure or AWS instructions.

Changing partition strategy for a PDF

Here’s how you can modify partition strategy for a PDF file, and select an alternative model to use with Unstructured API. The hi_res strategy supports different models, and the default is layout_v1.1.0.

CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --strategy hi_res \
    --hi-res-model-name layout_v1.1.0 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            strategy="hi_res",
            hi_res_model_name="layout_v1.0.0",
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'strategy=hi_res' \
-F 'hi_res_model_name=layout_v1.1.0'

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.HI_RES,
            "hi_res_model_name": "layout_v1.1.0",
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
        with open(output_filename, "w") as file:
            file.write(json_elements)

    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
    client: UnstructuredClient,
    sourcePath: string,
    destinationPath: string
): void {

    // If an output directory does not exist for the corresponding input
    // directory, then create it.
    if (!fs.existsSync(destinationPath)) {
        fs.mkdirSync(destinationPath, { recursive: true });
    }

    // Get all folders and files at the current level of the input directory.
    const items = fs.readdirSync(sourcePath);

    // For each folder and file in the input directory...
    for (const item of items) {
        const inputPath = path.join(sourcePath, item);
        const outputPath = path.join(destinationPath, item)

        // If it's a folder, call this function recursively.
        if (fs.statSync(inputPath).isDirectory()) {
            processFiles(client, inputPath, outputPath);
        } else {
            // If it's a file, send it to Unstructured for processing.
            const data = fs.readFileSync(inputPath);

            client.general.partition({
                partitionParameters: {
                    files: {
                        content: data,
                        fileName: inputPath
                    },
                    strategy: Strategy.HiRes,
                    hiResModelName: "layout_v1.1.0", 
                    splitPdfPage: true,
                    splitPdfConcurrencyLevel: 15,
                    splitPdfAllowFailed: true
                }
            }).then((res: PartitionResponse) => {
                // If successfully processed, write the processed data to
                // the destination directory.
                if (res.statusCode == 200) {
                    const jsonElements = JSON.stringify(res.elements, null, 2)
                    fs.writeFileSync(outputPath + ".json", jsonElements)
                }
            }).catch((e) => {
                if (e.statusCode) {
                    console.log(e.statusCode);
                    console.log(e.body);
                } else {
                    console.log(e);
                }
            });
        }
    }
}

const client = new UnstructuredClient({
    security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY },
    serverURL: process.env.UNSTRUCTURED_API_URL
});

processFiles(
    client,
    process.env.LOCAL_FILE_INPUT_DIR,
    process.env.LOCAL_FILE_OUTPUT_DIR
);

If you have a local deployment of the Unstructured API, you can use other supported models, such as yolox.

Specifying the language of a document for better OCR results

For better OCR results, you can specify what languages your document is in using the languages parameter. View the list of available languages.

CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --strategy ocr_only \
    --ocr-languages kor \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            strategy="ocr_only",
            ocr_languages=["kor"],
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/korean.png' \
-F 'strategy=ocr_only' \
-F 'languages=kor'

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.OCR_ONLY,
            "languages": ["kor"],
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
        with open(output_filename, "w") as file:
            file.write(json_elements)

    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
    client: UnstructuredClient,
    sourcePath: string,
    destinationPath: string
): void {

    // If an output directory does not exist for the corresponding input
    // directory, then create it.
    if (!fs.existsSync(destinationPath)) {
        fs.mkdirSync(destinationPath, { recursive: true });
    }

    // Get all folders and files at the current level of the input directory.
    const items = fs.readdirSync(sourcePath);

    // For each folder and file in the input directory...
    for (const item of items) {
        const inputPath = path.join(sourcePath, item);
        const outputPath = path.join(destinationPath, item)

        // If it's a folder, call this function recursively.
        if (fs.statSync(inputPath).isDirectory()) {
            processFiles(client, inputPath, outputPath);
        } else {
            // If it's a file, send it to Unstructured for processing.
            const data = fs.readFileSync(inputPath);

            client.general.partition({
                partitionParameters: {
                    files: {
                        content: data,
                        fileName: inputPath
                    },
                    strategy: Strategy.OcrOnly,
                    languages: ["kor"],
                    splitPdfPage: true,
                    splitPdfConcurrencyLevel: 15,
                    splitPdfAllowFailed: true
                }
            }).then((res: PartitionResponse) => {
                // If successfully processed, write the processed data to
                // the destination directory.
                if (res.statusCode == 200) {
                    const jsonElements = JSON.stringify(res.elements, null, 2)
                    fs.writeFileSync(outputPath + ".json", jsonElements)
                }
            }).catch((e) => {
                if (e.statusCode) {
                    console.log(e.statusCode);
                    console.log(e.body);
                } else {
                    console.log(e);
                }
            });
        }
    }
}

const client = new UnstructuredClient({
    security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY },
    serverURL: process.env.UNSTRUCTURED_API_URL
});

processFiles(
    client,
    process.env.LOCAL_FILE_INPUT_DIR,
    process.env.LOCAL_FILE_OUTPUT_DIR
);

Saving bounding box coordinates

When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the coordinates parameter to true to add this field to the elements in the response.

CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"coordinates\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "coordinates": True,
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'coordinates=true' \
-F 'strategy=hi_res'

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.HI_RES,
            "coordinates": True,
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
        with open(output_filename, "w") as file:
            file.write(json_elements)

    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
    client: UnstructuredClient,
    sourcePath: string,
    destinationPath: string
): void {

    // If an output directory does not exist for the corresponding input
    // directory, then create it.
    if (!fs.existsSync(destinationPath)) {
        fs.mkdirSync(destinationPath, { recursive: true });
    }

    // Get all folders and files at the current level of the input directory.
    const items = fs.readdirSync(sourcePath);

    // For each folder and file in the input directory...
    for (const item of items) {
        const inputPath = path.join(sourcePath, item);
        const outputPath = path.join(destinationPath, item)

        // If it's a folder, call this function recursively.
        if (fs.statSync(inputPath).isDirectory()) {
            processFiles(client, inputPath, outputPath);
        } else {
            // If it's a file, send it to Unstructured for processing.
            const data = fs.readFileSync(inputPath);

            client.general.partition({
                partitionParameters: {
                    files: {
                        content: data,
                        fileName: inputPath
                    },
                    strategy: Strategy.HiRes,
                    coordinates: true,
                    splitPdfPage: true,
                    splitPdfConcurrencyLevel: 15,
                    splitPdfAllowFailed: true
                }
            }).then((res: PartitionResponse) => {
                // If successfully processed, write the processed data to
                // the destination directory.
                if (res.statusCode == 200) {
                    const jsonElements = JSON.stringify(res.elements, null, 2)
                    fs.writeFileSync(outputPath + ".json", jsonElements)
                }
            }).catch((e) => {
                if (e.statusCode) {
                    console.log(e.statusCode);
                    console.log(e.body);
                } else {
                    console.log(e);
                }
            });
        }
    }
}

const client = new UnstructuredClient({
    security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY },
    serverURL: process.env.UNSTRUCTURED_API_URL
});

processFiles(
    client,
    process.env.LOCAL_FILE_INPUT_DIR,
    process.env.LOCAL_FILE_OUTPUT_DIR
);

Returning unique element IDs

By default, the element ID is a SHA-256 hash of the element text. This is to ensure that the ID is deterministic. One downside is that the ID is not guaranteed to be unique. Different elements with the same text will have the same ID, and there could also be hash collisions. To use UUIDs in the output instead, set unique_element_ids=true. Note: this means that the element IDs will be random, so with every partition of the same file, you will get different IDs. This can be helpful if you’d like to use the IDs as a primary key in a database, for example.

CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"unique_element_ids\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "unique_element_ids": True,
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json'  \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
-F 'unique_element_ids=true'

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.HI_RES,
            "unique_element_ids": True,
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
        with open(output_filename, "w") as file:
            file.write(json_elements)

    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
    client: UnstructuredClient,
    sourcePath: string,
    destinationPath: string
): void {

    // If an output directory does not exist for the corresponding input
    // directory, then create it.
    if (!fs.existsSync(destinationPath)) {
        fs.mkdirSync(destinationPath, { recursive: true });
    }

    // Get all folders and files at the current level of the input directory.
    const items = fs.readdirSync(sourcePath);

    // For each folder and file in the input directory...
    for (const item of items) {
        const inputPath = path.join(sourcePath, item);
        const outputPath = path.join(destinationPath, item)

        // If it's a folder, call this function recursively.
        if (fs.statSync(inputPath).isDirectory()) {
            processFiles(client, inputPath, outputPath);
        } else {
            // If it's a file, send it to Unstructured for processing.
            const data = fs.readFileSync(inputPath);

            client.general.partition({
                partitionParameters: {
                    files: {
                        content: data,
                        fileName: inputPath
                    },
                    uniqueElementIds: true,
                    strategy: Strategy.HiRes,
                    splitPdfPage: true,
                    splitPdfConcurrencyLevel: 15,
                    splitPdfAllowFailed: true
                }
            }).then((res: PartitionResponse) => {
                // If successfully processed, write the processed data to
                // the destination directory.
                if (res.statusCode == 200) {
                    const jsonElements = JSON.stringify(res.elements, null, 2)
                    fs.writeFileSync(outputPath + ".json", jsonElements)
                }
            }).catch((e) => {
                if (e.statusCode) {
                    console.log(e.statusCode);
                    console.log(e.body);
                } else {
                    console.log(e);
                }
            });
        }
    }
}

const client = new UnstructuredClient({
    security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY },
    serverURL: process.env.UNSTRUCTURED_API_URL
});

processFiles(
    client,
    process.env.LOCAL_FILE_INPUT_DIR,
    process.env.LOCAL_FILE_OUTPUT_DIR
);

Adding the chunking step after partitioning

You can combine partitioning and subsequent chunking in a single request by setting the chunking_strategy parameter. By default, the chunking_strategy is set to None, and no chunking is performed.

CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --chunking-strategy by_title \
    --chunk-max-characters 1024 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"

Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(
            chunking_strategy="by_title",
            chunk_max_characters=1024
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json'  \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
-F 'chunking_strategy=by_title' \
-F 'max_characters=1024' \
-F 'strategy=hi_res'

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "chunking_strategy": "by_title",
            "max_characters": 1024,
            "strategy": shared.Strategy.HI_RES,
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
        with open(output_filename, "w") as file:
            file.write(json_elements)

    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))

Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:

You need to work with documents in cloud storage.
You want to cache the results of processing multiple files in batches.
You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.

TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { ChunkingStrategy, Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
    client: UnstructuredClient,
    sourcePath: string,
    destinationPath: string
): void {

    // If an output directory does not exist for the corresponding input
    // directory, then create it.
    if (!fs.existsSync(destinationPath)) {
        fs.mkdirSync(destinationPath, { recursive: true });
    }

    // Get all folders and files at the current level of the input directory.
    const items = fs.readdirSync(sourcePath);

    // For each folder and file in the input directory...
    for (const item of items) {
        const inputPath = path.join(sourcePath, item);
        const outputPath = path.join(destinationPath, item)

        // If it's a folder, call this function recursively.
        if (fs.statSync(inputPath).isDirectory()) {
            processFiles(client, inputPath, outputPath);
        } else {
            // If it's a file, send it to Unstructured for processing.
            const data = fs.readFileSync(inputPath);

            client.general.partition({
                partitionParameters: {
                    files: {
                        content: data,
                        fileName: inputPath
                    },
                    strategy: Strategy.HiRes,
                    chunkingStrategy: ChunkingStrategy.ByTitle,
                    maxCharacters: 1024,
                    splitPdfPage: true,
                    splitPdfConcurrencyLevel: 15,
                    splitPdfAllowFailed: true
                }
            }).then((res: PartitionResponse) => {
                // If successfully processed, write the processed data to
                // the destination directory.
                if (res.statusCode == 200) {
                    const jsonElements = JSON.stringify(res.elements, null, 2)
                    fs.writeFileSync(outputPath + ".json", jsonElements)
                }
            }).catch((e) => {
                if (e.statusCode) {
                    console.log(e.statusCode);
                    console.log(e.body);
                } else {
                    console.log(e);
                }
            });
        }
    }
}

const client = new UnstructuredClient({
    security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY },
    serverURL: process.env.UNSTRUCTURED_API_URL
});

processFiles(
    client,
    process.env.LOCAL_FILE_INPUT_DIR,
    process.env.LOCAL_FILE_OUTPUT_DIR
);

Was this page helpful?

API validation errors Overview

On this page

Changing partition strategy for a PDF
Specifying the language of a document for better OCR results
Saving bounding box coordinates
Returning unique element IDs
Adding the chunking step after partitioning

Unstructured API services

Getting started with API services

Process individual files

Batch processing and ingestion

How to

Best practices

Troubleshooting

Concepts

Endpoints

Changing partition strategy for a PDF

Specifying the language of a document for better OCR results

Saving bounding box coordinates

Returning unique element IDs

Adding the chunking step after partitioning

Unstructured API services

Getting started with API services

Process individual files

Batch processing and ingestion

How to

Best practices

Troubleshooting

Concepts

Endpoints

​Changing partition strategy for a PDF

​Specifying the language of a document for better OCR results

​Saving bounding box coordinates

​Returning unique element IDs

​Adding the chunking step after partitioning

Changing partition strategy for a PDF

Specifying the language of a document for better OCR results

Saving bounding box coordinates

Returning unique element IDs

Adding the chunking step after partitioning