Table extraction from PDF

This section describes two methods for extracting tables from PDF files.

Method 1: Using partition_pdf
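
As an illustration of this method, the sketch below partitions a PDF with `partition_pdf` and collects the HTML rendering of each detected table. It assumes the `hi_res` strategy together with `infer_table_structure=True`, and that table elements expose `category == "Table"` and `metadata.text_as_html`; check these against the library version you have installed. The helper names `tables_as_html` and `extract_pdf_tables` and the stand-in elements are hypothetical and exist only for this example.

```python
from types import SimpleNamespace


def tables_as_html(elements):
    """Return the HTML rendering of every element categorized as a table."""
    html = []
    for el in elements:
        if getattr(el, "category", None) != "Table":
            continue
        # text_as_html may be missing when table structure was not inferred.
        rendered = getattr(el.metadata, "text_as_html", None)
        if rendered:
            html.append(rendered)
    return html


def extract_pdf_tables(pdf_path):
    """Partition a PDF with table structure inference and return table HTML."""
    # Imported lazily so tables_as_html can be tried without the library installed.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",           # assumption: layout-model strategy is required
        infer_table_structure=True,  # assumption: fills metadata.text_as_html
    )
    return tables_as_html(elements)


# Stand-in elements, used only to show the filtering step without a real PDF.
sample = [
    SimpleNamespace(category="Title", metadata=SimpleNamespace(text_as_html=None)),
    SimpleNamespace(
        category="Table",
        metadata=SimpleNamespace(text_as_html="<table><tr><td>1</td></tr></table>"),
    ),
]
sample_tables = tables_as_html(sample)
```

With the library installed, a call such as `extract_pdf_tables("example.pdf")` (a placeholder path) would return one HTML string per detected table.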

Method 2: Using Auto Partition or Unstructured API
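
A minimal sketch of the auto partition route follows. It assumes that `partition` from `unstructured.partition.auto` accepts `strategy` and `skip_infer_table_types`, and that passing an empty list for the latter leaves table inference enabled for PDFs; parameter names and defaults have changed between releases, so treat them as assumptions to verify. The helper names `table_extraction_params` and `extract_tables_auto` are hypothetical. The Unstructured API is understood to take the same two settings as request parameters, which should be confirmed against the API parameters reference.

```python
def table_extraction_params():
    """Settings assumed to request table extraction from PDFs."""
    return {
        "strategy": "hi_res",          # assumption: layout-model strategy
        "skip_infer_table_types": [],  # assumption: skip table inference for no file type
    }


def extract_tables_auto(path):
    """Partition any supported file and return the HTML of its tables."""
    # Imported lazily so the parameter helper works without the library installed.
    from unstructured.partition.auto import partition

    elements = partition(filename=path, **table_extraction_params())
    return [
        el.metadata.text_as_html
        for el in elements
        if el.category == "Table" and el.metadata.text_as_html
    ]
```

Keeping the settings in one function makes it straightforward to reuse the same values when building an API request instead of calling the library locally.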