Documentation

TextSplitterService

Service for splitting large texts into smaller chunks.

Splits text at paragraph boundaries, keeping chunks under a maximum byte size suitable for storage in the database.

Tags
since
3.0.0

Table of Contents

Constants

ABSOLUTE_MAX_BYTES  = 65000
Absolute maximum bytes per chunk, kept just under the 65,535-byte limit of a MySQL TEXT column.
DEFAULT_MAX_BYTES  = 60000
Default maximum bytes per chunk (60,000 bytes, leaving headroom for database overhead).

Methods

estimateChunkCount()  : int
Estimate how many chunks a text will be split into.
getByteSize()  : int
Get the byte size of text.
needsSplit()  : bool
Check if text needs to be split.
split()  : array<string|int, array{num: int, title: string, content: string}>
Split text into chunks at paragraph boundaries.
combineUnitsIntoChunks()  : array<string|int, string>
Combine text units into chunks under the byte limit.
generateChunkTitle()  : string
Generate a title for a chunk based on its content.
normalizeText()  : string
Normalize text for consistent splitting.
splitAtParagraphs()  : array<string|int, array{num: int, title: string, content: string}>
Split text at paragraph boundaries.
splitIntoSentences()  : array<string|int, string>
Split text into sentences.
splitLongParagraph()  : array<string|int, string>
Split a single paragraph that exceeds the limit.

Constants

ABSOLUTE_MAX_BYTES

Absolute maximum bytes per chunk, kept just under the 65,535-byte limit of a MySQL TEXT column.

public mixed ABSOLUTE_MAX_BYTES = 65000

DEFAULT_MAX_BYTES

Default maximum bytes per chunk (60,000 bytes, leaving headroom for database overhead).

public mixed DEFAULT_MAX_BYTES = 60000

Methods

estimateChunkCount()

Estimate how many chunks a text will be split into.

public estimateChunkCount(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : int
Parameters
$text : string

The text whose chunk count is estimated

$maxBytes : int = self::DEFAULT_MAX_BYTES

Maximum bytes per chunk

Return values
int

Estimated number of chunks
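
The estimate can be pictured as a simple ratio of total bytes to the per-chunk limit. The following is a minimal sketch under that assumption; the source does not state how the estimate is computed, and sketchEstimateChunkCount is a hypothetical stand-in, not the service method. Because real splitting happens at paragraph boundaries, chunks are rarely filled completely, so the actual count may be higher.

```php
<?php
// Hypothetical stand-in for estimateChunkCount(); assumes a ceil(bytes / limit) estimate.
// 60000 mirrors DEFAULT_MAX_BYTES from this page.
function sketchEstimateChunkCount(string $text, int $maxBytes = 60000): int
{
    if ($maxBytes < 1) {
        throw new InvalidArgumentException('maxBytes must be positive');
    }

    // Round up so that a partially filled final chunk is counted.
    return (int) ceil(strlen($text) / $maxBytes);
}

// 150,000 bytes at 60,000 bytes per chunk needs at least three chunks.
echo sketchEstimateChunkCount(str_repeat('a', 150000)), PHP_EOL; // 3
```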

getByteSize()

Get the byte size of text.

public getByteSize(string $text) : int
Parameters
$text : string

The text to measure

Return values
int

Size in bytes
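
Byte size and character count differ for multibyte text, which is why the limits on this page are expressed in bytes. A minimal sketch, assuming the measurement is a plain strlen() call (the source does not confirm this); sketchByteSize is a hypothetical name:

```php
<?php
// Hypothetical stand-in for getByteSize(); assumes a plain strlen() measurement.
function sketchByteSize(string $text): int
{
    // strlen() counts bytes, not characters.
    return strlen($text);
}

// U+00E9 (e with acute accent) occupies two bytes in UTF-8,
// so this five-character string takes six bytes.
echo sketchByteSize("h\u{00E9}llo"), PHP_EOL; // 6
```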

needsSplit()

Check if text needs to be split.

public needsSplit(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : bool
Parameters
$text : string

The text to check

$maxBytes : int = self::DEFAULT_MAX_BYTES

Maximum bytes threshold

Return values
bool

True if text exceeds maxBytes
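
Since the method returns true only when the text exceeds the limit, text that is exactly at the limit does not need splitting. A sketch of that check, assuming a byte-length comparison; sketchNeedsSplit is a hypothetical name:

```php
<?php
// Hypothetical stand-in for needsSplit(); 60000 mirrors DEFAULT_MAX_BYTES.
function sketchNeedsSplit(string $text, int $maxBytes = 60000): bool
{
    // Strictly greater: text exactly at the limit still fits in one chunk.
    return strlen($text) > $maxBytes;
}

var_dump(sketchNeedsSplit(str_repeat('a', 60000))); // bool(false)
var_dump(sketchNeedsSplit(str_repeat('a', 60001))); // bool(true)
```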

split()

Split text into chunks at paragraph boundaries.

public split(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : array<string|int, array{num: int, title: string, content: string}>
Parameters
$text : string

The text to split

$maxBytes : int = self::DEFAULT_MAX_BYTES

Maximum bytes per chunk (default 60,000 bytes)

Return values
array<string|int, array{num: int, title: string, content: string}>

List of chunks, each with a sequence number (num), a title, and the chunk content
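
Paragraph-boundary splitting can be sketched as a greedy loop: keep appending paragraphs to the current chunk until the next one would push it over the limit. This is an illustration of the idea, not the service's code. The blank-line paragraph rule, the "Part N" titles, and the name sketchSplit are all assumptions, and the sketch leaves a single oversized paragraph as its own chunk rather than splitting it further.

```php
<?php
/**
 * Hypothetical sketch of paragraph-boundary splitting; not the service's code.
 *
 * @return array<int, array{num: int, title: string, content: string}>
 */
function sketchSplit(string $text, int $maxBytes = 60000): array
{
    // Assumption: one or more blank lines mark a paragraph boundary.
    $paragraphs = preg_split('/\n\s*\n/', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $contents = [];
    $current = '';
    foreach ($paragraphs as $paragraph) {
        $candidate = $current === '' ? $paragraph : $current . "\n\n" . $paragraph;
        if ($current !== '' && strlen($candidate) > $maxBytes) {
            // Adding this paragraph would overflow, so close the current chunk.
            $contents[] = $current;
            $current = $paragraph;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $contents[] = $current;
    }

    $chunks = [];
    foreach ($contents as $index => $content) {
        $num = $index + 1;
        // "Part N" is a placeholder; the real title format is not documented here.
        $chunks[] = ['num' => $num, 'title' => 'Part ' . $num, 'content' => $content];
    }

    return $chunks;
}

// A tiny limit makes the behaviour visible on a short input.
$chunks = sketchSplit("First paragraph.\n\nSecond paragraph.\n\nThird.", 40);
foreach ($chunks as $chunk) {
    echo $chunk['num'], ': ', strlen($chunk['content']), ' bytes', PHP_EOL;
}
// 1: 35 bytes
// 2: 6 bytes
```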

combineUnitsIntoChunks()

Combine text units into chunks under the byte limit.

private combineUnitsIntoChunks(array<string|int, string> $units, int $maxBytes[, string $separator = ' ' ]) : array<string|int, string>
Parameters
$units : array<string|int, string>

Array of text units (sentences or words)

$maxBytes : int

Maximum bytes per chunk

$separator : string = ' '

Separator between units

Return values
array<string|int, string>

Array of chunks

generateChunkTitle()

Generate a title for a chunk based on its content.

private generateChunkTitle(int $num, string $content) : string
Parameters
$num : int

Chapter number

$content : string

The chunk content

Return values
string

Generated title

normalizeText()

Normalize text for consistent splitting.

private normalizeText(string $text) : string
Parameters
$text : string

The text to normalize

Return values
string

Normalized text
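
The normalization rules are not documented on this page. As an illustration only, a splitter that relies on blank lines typically unifies line endings and collapses extra blank lines first; the sketch below assumes exactly those two rules, and sketchNormalizeText is a hypothetical name.

```php
<?php
// Hypothetical normalization; the actual rules of normalizeText() are not documented.
function sketchNormalizeText(string $text): string
{
    // Unify Windows (\r\n) and bare \r line endings to \n.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Collapse runs of three or more newlines into a single blank line.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    // Drop leading and trailing whitespace.
    return trim($text);
}

var_dump(sketchNormalizeText("One\r\n\r\n\r\nTwo\r") === "One\n\nTwo"); // bool(true)
```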

splitAtParagraphs()

Split text at paragraph boundaries.

private splitAtParagraphs(string $text, int $maxBytes) : array<string|int, array{num: int, title: string, content: string}>
Parameters
$text : string

The text to split

$maxBytes : int

Maximum bytes per chunk

Return values
array<string|int, array{num: int, title: string, content: string}>

List of chunks, each with a sequence number (num), a title, and the chunk content

splitIntoSentences()

Split text into sentences.

private splitIntoSentences(string $text) : array<string|int, string>
Parameters
$text : string

The text to split

Return values
array<string|int, string>

Array of sentences
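
One common way to split sentences is a regular expression that breaks on whitespace following terminal punctuation. The sketch below assumes that approach; the service's actual rule is not documented here, and sketchSplitIntoSentences is a hypothetical name. A pattern this simple also breaks after abbreviations such as "e.g.", which a production splitter may need to handle.

```php
<?php
/**
 * Hypothetical sentence splitter; the real pattern is not documented.
 *
 * @return array<int, string>
 */
function sketchSplitIntoSentences(string $text): array
{
    // Break on whitespace that follows ., ! or ?; the lookbehind keeps
    // the punctuation attached to its sentence.
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

print_r(sketchSplitIntoSentences('It works. Does it? Yes!'));
// Yields three elements: "It works.", "Does it?", "Yes!"
```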

splitLongParagraph()

Split a single paragraph that exceeds the limit.

private splitLongParagraph(string $paragraph, int $maxBytes) : array<string|int, string>

Falls back to sentence-level splitting, then word-level if needed.

Parameters
$paragraph : string

The long paragraph

$maxBytes : int

Maximum bytes per chunk

Return values
array<string|int, string>

Array of sub-chunks
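
The sentence-then-word fallback described above can be sketched as follows. This is an assumption-laden illustration, not the service's implementation: sketchPack plays the role the page describes for combineUnitsIntoChunks(), the sentence pattern is a guess, and both function names are hypothetical. In this sketch a single word longer than the limit is left oversized.

```php
<?php
/**
 * Greedily pack units into strings of at most $maxBytes (hypothetical helper).
 *
 * @param array<int, string> $units
 * @return array<int, string>
 */
function sketchPack(array $units, int $maxBytes): array
{
    $chunks = [];
    $current = '';
    foreach ($units as $unit) {
        $candidate = $current === '' ? $unit : $current . ' ' . $unit;
        if ($current !== '' && strlen($candidate) > $maxBytes) {
            // The next unit would overflow, so close the current chunk.
            $chunks[] = $current;
            $current = $unit;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

/** @return array<int, string> */
function sketchSplitLongParagraph(string $paragraph, int $maxBytes): array
{
    // First level: sentences.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $units = [];
    foreach ($sentences as $sentence) {
        if (strlen($sentence) <= $maxBytes) {
            $units[] = $sentence;
            continue;
        }
        // Second level: a sentence that is too long on its own is packed word by word.
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach (sketchPack($words, $maxBytes) as $piece) {
            $units[] = $piece;
        }
    }

    return sketchPack($units, $maxBytes);
}

print_r(sketchSplitLongParagraph('Short one. This sentence is far too long. End.', 20));
// Yields: "Short one.", "This sentence is far", "too long. End."
```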
