Splitting text based on a fixed word count leads to
Each element has its own characteristics:
Your documents will be automatically processed according to the file type.
Get the output quality of a custom preprocessing pipeline in a simple API call.
During conversion, Preprocess takes into account all the elements and semantics of the content.
It divides the text following the hierarchical structure of the sections and then further divides the text into optimal chunks.
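To make the idea concrete, here is a minimal sketch of hierarchy-first chunking. It is not Preprocess's implementation: it assumes Markdown-style `#` headings, a plain word budget per chunk, and the hypothetical helper names `split_by_headings` and `chunk_section`.

```python
import re


def split_by_headings(text):
    """Return (heading_path, body) pairs for a Markdown-style document."""
    path, body, sections = [], [], []

    def flush():
        # Only keep sections that actually contain text.
        if any(line.strip() for line in body):
            sections.append((" > ".join(path), "\n".join(body).strip()))

    for line in text.splitlines():
        match = re.match(r"(#+)\s+(.*)", line)
        if match:
            flush()
            body = []
            level = len(match.group(1))
            # Keep the ancestors of this heading, then append its title.
            path = path[:level - 1] + [match.group(2).strip()]
        else:
            body.append(line)
    flush()
    return sections


def chunk_section(body, max_words=50):
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept as its own chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in re.split(r"\n\s*\n", body):
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then carry its heading path (for example "Guide > Setup") so it keeps its place in the document's structure.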
Keeps lists together if they are short, splits them if they contain long points.
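The list rule can be sketched as follows. This is an illustration only: the `chunk_list` helper and the 40-word threshold are assumptions, not Preprocess's actual logic.

```python
def chunk_list(items, max_words=40):
    """Keep a short list as one chunk; split a long one item by item."""
    total_words = sum(len(item.split()) for item in items)
    if total_words <= max_words:
        # Short list: all bullet points stay together in one chunk.
        return ["\n".join(f"- {item}" for item in items)]
    # Long list: each bullet point becomes its own chunk.
    return [f"- {item}" for item in items]
```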
Divides the text into paragraphs, taking care to keep together what is semantically linked, just as you would.
This type of file is converted taking into account the writing orientation, the headings and the location of the elements.
Preprocess distinguishes data tables from textual ones and treats each kind differently.
By setting the table_output_format parameter you can decide whether to receive tables as text, Markdown or HTML.
By setting repeat_table_header = true you will find the header included in each chunk.
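The effect of repeating the header can be illustrated with a small sketch. The `chunk_table` function below is hypothetical and does not call the Preprocess API; it only borrows the parameter name `repeat_table_header` from the text and assumes Markdown output with a fixed number of rows per chunk.

```python
def chunk_table(header, rows, rows_per_chunk=2, repeat_table_header=True):
    """Split a table into Markdown chunks, optionally repeating the header."""

    def render(cells):
        return "| " + " | ".join(cells) + " |"

    separator = render(["---"] * len(header))
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        lines = [render(row) for row in rows[start:start + rows_per_chunk]]
        # The first chunk always gets the header; later ones only if requested.
        if repeat_table_header or start == 0:
            lines = [render(header), separator] + lines
        chunks.append("\n".join(lines))
    return chunks
```

With the header repeated, every chunk can be read on its own, which matters when chunks are retrieved independently of one another.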
Presentations are a graphic-visual format that contains concepts in slides.
Preprocess recognizes which PDFs were originally presentations.
The content is divided by slide and if necessary further divided in the case of long texts.
The order of the text is important: each element on the slide is converted into consecutive text.
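As a rough sketch of what this involves, the snippet below turns the elements of one slide into consecutive text and splits the result when it gets long. It assumes each element is a dict with `text`, `top` and `left` keys and that reading order is top to bottom, then left to right; both the data shape and the ordering rule are assumptions, not a description of how Preprocess works.

```python
def slide_to_chunks(elements, max_words=60):
    """Order a slide's elements and pack them into word-limited chunks."""
    # Assumed reading order: top to bottom, then left to right.
    ordered = sorted(elements, key=lambda e: (e["top"], e["left"]))
    chunks, current, count = [], [], 0
    for element in ordered:
        words = len(element["text"].split())
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(element["text"])
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks
```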
They are the most used formats but also the least consistent ones.
Cleaning HTML files of unwanted elements automatically is essential to obtain processable data.
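A minimal version of such cleaning can be written with Python's standard `html.parser` module. The set of tags treated as unwanted below is an assumption chosen for illustration; real pages usually need more careful rules.

```python
from html.parser import HTMLParser

# Assumed set of tags whose content is not part of the main text.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring content inside skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html):
    """Return the text of an HTML document, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```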
Recognizing titles and graphic elements is not always easy, especially when complex UX elements come into play.
Similarly, for plain texts, identifying the titles semantically is essential to divide the text coherently.
Integrate Preprocess into your data pipeline with a few lines of code. Check the repositories.
Request an API key and start using our chunking API for free. You’ll receive 1000 pages in free credits.
If you need a specific SLA and support, please reach out at support@preprocess.co.