Preprocess

Super Simple Ingestion Pipelines

Your documents will be automatically processed according to the file type.
Get the output quality of a custom preprocessing pipeline in a simple API call.

cURL
curl --location \
--request POST 'https://chunk.ing' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: your_api_key' \
--form 'file=@"/your_file.ext"'
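The same request can be sketched in Python using only the standard library. The endpoint, the x-api-key header and the file form field come from the cURL example above; the build_request helper name and the application/octet-stream part type are assumptions made for illustration, and the response format is not described here, so the sketch only builds the request and leaves sending it to you.

```python
import uuid
import urllib.request
from pathlib import Path

API_URL = "https://chunk.ing"  # endpoint taken from the cURL example


def build_request(file_path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the multipart POST shown in the cURL example."""
    path = Path(file_path)
    boundary = uuid.uuid4().hex
    # One multipart part named "file", mirroring --form 'file=@...'.
    # The per-part content type is an assumption, not documented by the service.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    body = head + path.read_bytes() + tail
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "x-api-key": api_key,
        },
    )


# To send it, pass the request to urllib.request.urlopen:
# with urllib.request.urlopen(build_request("/your_file.ext", "your_api_key")) as resp:
#     print(resp.status)
```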

Word-like

.DOCX, .PDF, .ODT, .DOC

During conversion, Preprocess takes into account every element of the content and its semantics.
It first divides the text along the hierarchical structure of its sections, then divides each section further into optimal chunks.
It keeps short lists together and splits lists whose items are long.
It divides the text into paragraphs, taking care to keep semantically linked content together, just as you would.
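The list rule above can be pictured with a small sketch. This is not how Preprocess is implemented; the chunk_list name and the 200-character threshold are hypothetical, chosen only to show the idea of keeping short lists whole and splitting long ones.

```python
def chunk_list(items: list[str], max_item_chars: int = 200) -> list[str]:
    """Keep a list in one chunk if every item is short; otherwise one chunk per item."""
    # max_item_chars is an illustrative threshold, not a documented value.
    if all(len(item) <= max_item_chars for item in items):
        return ["\n".join(f"- {item}" for item in items)]
    return [f"- {item}" for item in items]
```

A list of brief bullet points comes back as a single chunk, while a list containing a paragraph-length point is returned one item per chunk.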

Excel-like

.XLSX, .CSV, .ODS, .XLS

Files of this type are converted taking into account the writing orientation, the headings and the location of the elements.
Preprocess distinguishes data tables from textual ones and treats each kind differently.
By setting the table_output_format parameter you can choose whether tables are returned as text, markdown or html.
By setting repeat_table_header = true, the table header is included in each chunk.
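To make the effect of repeat_table_header concrete, here is a local sketch of a table split into chunks with and without the header repeated. Only the parameter name comes from the text above; the chunk_table function, the rows_per_chunk value and the pipe-separated output are assumptions for illustration and do not reflect the service's actual output.

```python
def chunk_table(header, rows, rows_per_chunk=2, repeat_table_header=True):
    """Split table rows into chunks, optionally prefixing each chunk with the header."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        # The first chunk always carries the header; later chunks only if requested.
        if repeat_table_header or start == 0:
            block = [header] + block
        chunks.append("\n".join(" | ".join(cells) for cells in block))
    return chunks
```

With the header repeated, each chunk can be read on its own, which matters when chunks are retrieved independently of one another.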

PowerPoint-like

.PPTX, .PDF, .ODP, .PPT

Presentations are a graphic-visual format that contains concepts in slides.
Preprocess recognizes which PDFs were originally presentations.
The content is divided by slide and, where a slide contains long text, divided further.
The order of the text is important: each element on the slide is converted into consecutive text.
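A rough sketch of slide-based chunking under these rules follows. It assumes each slide has already been reduced to a list of text elements in reading order; the chunk_slides name and the 500-character limit are hypothetical, and the service's real splitting logic is not documented here.

```python
def chunk_slides(slides: list[list[str]], max_chars: int = 500) -> list[str]:
    """One chunk per slide; a slide whose joined text exceeds max_chars is split further."""
    chunks = []
    for elements in slides:
        current = ""
        for text in elements:  # elements are consumed in reading order
            candidate = f"{current}\n{text}" if current else text
            if current and len(candidate) > max_chars:
                # Adding this element would overflow: close the chunk, start a new one.
                chunks.append(current)
                current = text
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Chunks never cross a slide boundary, and a single element longer than max_chars is kept intact rather than cut mid-text.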

HTML & Text

.HTML, .EML, .TXT

These are the most widely used formats, but also the least consistent.
Cleaning HTML files of unwanted elements automatically is essential to obtain processable data.
Recognizing titles and graphic elements is not always easy, especially when complex UX elements come into play.
Similarly, for plain texts, identifying the titles semantically is essential to divide the text coherently.
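As a minimal illustration of HTML cleaning, the sketch below uses Python's standard html.parser to drop script, style and nav content and to label headings. The set of skipped tags and the extract_blocks helper are assumptions; real pages need far more handling than this, which is the difficulty the paragraph above describes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav and marking headings."""

    SKIP = {"script", "style", "nav"}  # illustrative choice of unwanted elements
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_heading = False
        self.blocks = []  # list of (kind, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.blocks.append(("heading" if self.in_heading else "text", text))


def extract_blocks(html: str):
    """Return (kind, text) pairs for the visible content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Headings found this way can then serve as split points, in the same way section titles do for Word-like documents.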

PREPROCESS

Document preprocessing for LLMs

The API to convert and split any kind of document into optimal chunks of text without the hassle of building an in-house solution.

The problem

Basic chunking: Garbage in, Garbage out

The solution

Unleash the True Potential of Data

Super Simple Ingestion Pipelines

Already Ready

Try it now
