TextSplitterService
Service for splitting large texts into smaller chunks.
Splits text at paragraph boundaries, keeping chunks under a maximum byte size suitable for storage in the database.
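The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the function name `sketchSplitAtParagraphs` is hypothetical, paragraphs are assumed to be separated by blank lines, and the title generation and oversized-paragraph fallback that the real class documents are left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for paragraph-boundary splitting.
 * Assumes paragraphs are separated by one or more blank lines.
 * A single paragraph larger than $maxBytes is kept whole in this sketch.
 *
 * @return list<string>
 */
function sketchSplitAtParagraphs(string $text, int $maxBytes = 60000): array
{
    $paragraphs = preg_split('/\n\s*\n/', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $chunks = [];
    $current = '';

    foreach ($paragraphs as $paragraph) {
        // Try to append the paragraph to the chunk being built.
        $candidate = $current === '' ? $paragraph : $current . "\n\n" . $paragraph;

        if ($current !== '' && strlen($candidate) > $maxBytes) {
            // Appending would exceed the byte limit: close the chunk, start a new one.
            $chunks[] = $current;
            $current = $paragraph;
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

$text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.";
print_r(sketchSplitAtParagraphs($text, 40)); // two chunks: paragraphs 1+2, then paragraph 3
```

The limit is checked with `strlen()`, which counts bytes rather than characters, matching the byte-based limits this class documents.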
Table of Contents
Constants
- ABSOLUTE_MAX_BYTES = 65000
- Absolute maximum bytes per chunk, set just below the 65,535-byte limit of a MySQL TEXT column.
- DEFAULT_MAX_BYTES = 60000
- Default maximum bytes per chunk (60 KB, which leaves room for DB overhead).
Methods
- estimateChunkCount() : int
- Estimate how many chunks a text will be split into.
- getByteSize() : int
- Get the byte size of text.
- needsSplit() : bool
- Check if text needs to be split.
- split() : array<string|int, array{num: int, title: string, content: string}>
- Split text into chunks at paragraph boundaries.
- combineUnitsIntoChunks() : array<string|int, string>
- Combine text units into chunks under the byte limit.
- generateChunkTitle() : string
- Generate a title for a chunk based on its content.
- normalizeText() : string
- Normalize text for consistent splitting.
- splitAtParagraphs() : array<string|int, array{num: int, title: string, content: string}>
- Split text at paragraph boundaries.
- splitIntoSentences() : array<string|int, string>
- Split text into sentences.
- splitLongParagraph() : array<string|int, string>
- Split a single paragraph that exceeds the limit.
Constants
ABSOLUTE_MAX_BYTES
Absolute maximum bytes per chunk, set just below the 65,535-byte limit of a MySQL TEXT column.
public
mixed
ABSOLUTE_MAX_BYTES
= 65000
DEFAULT_MAX_BYTES
Default maximum bytes per chunk (60 KB, which leaves room for DB overhead).
public
mixed
DEFAULT_MAX_BYTES
= 60000
Methods
estimateChunkCount()
Estimate how many chunks a text will be split into.
public
estimateChunkCount(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : int
Parameters
- $text : string
  The text
- $maxBytes : int = self::DEFAULT_MAX_BYTES
  Maximum bytes per chunk
Return values
int: Estimated number of chunks
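One plausible way to produce such an estimate is to divide the byte size by the limit and round up. The sketch below assumes exactly that; the function name `sketchEstimateChunkCount` is hypothetical, and the real method may account for paragraph boundaries or treat empty text differently.

```php
<?php
declare(strict_types=1);

// Hypothetical estimate: byte size divided by the limit, rounded up.
// Assumption: empty text yields 0 chunks; the real service may differ.
function sketchEstimateChunkCount(string $text, int $maxBytes = 60000): int
{
    if ($maxBytes <= 0) {
        throw new InvalidArgumentException('maxBytes must be positive');
    }

    return (int) ceil(strlen($text) / $maxBytes);
}

echo sketchEstimateChunkCount(str_repeat('a', 150000)), "\n"; // 3
```

Because chunks end at paragraph boundaries, an actual split can yield more chunks than this simple division suggests, which is why the result is only an estimate.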
getByteSize()
Get the byte size of text.
public
getByteSize(string $text) : int
Parameters
- $text : string
  The text
Return values
int: Size in bytes
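Byte size and character count differ for multibyte text, which matters because the limits in this class are expressed in bytes. The sketch below assumes the size is measured with `strlen()`; the function name `sketchGetByteSize` is hypothetical, and the `mb_strlen()` comparison only runs when the mbstring extension is available.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for getByteSize(): strlen() counts bytes, not characters.
function sketchGetByteSize(string $text): int
{
    return strlen($text);
}

// 10 characters; the two accented letters take 2 bytes each in UTF-8.
$text = "na\u{00EF}ve caf\u{00E9}";

echo sketchGetByteSize($text), " bytes\n"; // 12 bytes

if (function_exists('mb_strlen')) {
    echo mb_strlen($text, 'UTF-8'), " characters\n"; // 10 characters
}
```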
needsSplit()
Check if text needs to be split.
public
needsSplit(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : bool
Parameters
- $text : string
  The text to check
- $maxBytes : int = self::DEFAULT_MAX_BYTES
  Maximum bytes threshold
Return values
bool: True if text exceeds maxBytes
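Since the documented result is true only when the text exceeds the threshold, text that is exactly at the limit should not need splitting. A minimal sketch of that comparison, using the hypothetical name `sketchNeedsSplit` and assuming a plain byte-length check:

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for needsSplit(): strict "greater than" on the byte length.
function sketchNeedsSplit(string $text, int $maxBytes = 60000): bool
{
    return strlen($text) > $maxBytes;
}

var_dump(sketchNeedsSplit(str_repeat('a', 60000))); // bool(false): exactly at the limit
var_dump(sketchNeedsSplit(str_repeat('a', 60001))); // bool(true): one byte over
```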
split()
Split text into chunks at paragraph boundaries.
public
split(string $text[, int $maxBytes = self::DEFAULT_MAX_BYTES ]) : array<string|int, array{num: int, title: string, content: string}>
Parameters
- $text : string
  The text to split
- $maxBytes : int = self::DEFAULT_MAX_BYTES
  Maximum bytes per chunk (default 60KB)
Return values
array<string|int, array{num: int, title: string, content: string}>
combineUnitsIntoChunks()
Combine text units into chunks under the byte limit.
private
combineUnitsIntoChunks(array<string|int, string> $units, int $maxBytes[, string $separator = ' ' ]) : array<string|int, string>
Parameters
- $units : array<string|int, string>
  Array of text units (sentences or words)
- $maxBytes : int
  Maximum bytes per chunk
- $separator : string = ' '
  Separator between units
Return values
array<string|int, string>: Array of chunks
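Combining units under a byte limit is commonly done with a greedy pass: keep appending units to the current chunk until the next one would push it over the limit. The sketch below assumes that approach; the function name `sketchCombineUnits` is hypothetical, and a single unit larger than the limit is emitted as-is rather than split further.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical greedy packing of text units into chunks of at most $maxBytes.
 *
 * @param  list<string> $units
 * @return list<string>
 */
function sketchCombineUnits(array $units, int $maxBytes, string $separator = ' '): array
{
    $chunks = [];
    $current = '';

    foreach ($units as $unit) {
        $candidate = $current === '' ? $unit : $current . $separator . $unit;

        if ($current !== '' && strlen($candidate) > $maxBytes) {
            // The unit does not fit: close the current chunk and start a new one.
            $chunks[] = $current;
            $current = $unit;
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

print_r(sketchCombineUnits(['alpha', 'beta', 'gamma', 'delta'], 11));
// ['alpha beta', 'gamma delta']
```

The separator's bytes count toward the limit, so "gamma delta" (11 bytes) just fits while "alpha beta gamma" (16 bytes) does not.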
generateChunkTitle()
Generate a title for a chunk based on its content.
private
generateChunkTitle(int $num, string $content) : string
Parameters
- $num : int
  Chapter number
- $content : string
  The chunk content
Return values
string: Generated title
normalizeText()
Normalize text for consistent splitting.
private
normalizeText(string $text) : string
Parameters
- $text : string
  The text to normalize
Return values
string: Normalized text
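The documentation does not say which normalizations are applied. As an illustration only, the sketch below assumes three common ones: unifying line endings, stripping trailing whitespace on each line, and collapsing runs of blank lines so that paragraph breaks are always exactly two newlines. The function name `sketchNormalizeText` is hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical normalization; the real method's rules are not documented.
function sketchNormalizeText(string $text): string
{
    // Unify Windows (CRLF) and old Mac (CR) line endings to LF.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Strip trailing spaces and tabs on every line.
    $text = preg_replace('/[ \t]+$/m', '', $text) ?? $text;

    // Collapse three or more newlines into one paragraph break.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}

var_dump(sketchNormalizeText("One.\r\n\r\n\r\nTwo.  \r\n") === "One.\n\nTwo."); // bool(true)
```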
splitAtParagraphs()
Split text at paragraph boundaries.
private
splitAtParagraphs(string $text, int $maxBytes) : array<string|int, array{num: int, title: string, content: string}>
Parameters
- $text : string
  The text to split
- $maxBytes : int
  Maximum bytes per chunk
Return values
array<string|int, array{num: int, title: string, content: string}>
splitIntoSentences()
Split text into sentences.
private
splitIntoSentences(string $text) : array<string|int, string>
Parameters
- $text : string
  The text to split
Return values
array<string|int, string>: Array of sentences
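A simple way to split on sentence boundaries is a regular expression that breaks on whitespace following terminal punctuation. The sketch below assumes that heuristic; the function name `sketchSplitIntoSentences` is hypothetical, and the pattern will also break after abbreviations such as "e.g.", which the real method may handle differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sentence splitter: breaks on whitespace that follows '.', '!' or '?'.
 *
 * @return list<string>
 */
function sketchSplitIntoSentences(string $text): array
{
    $parts = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

print_r(sketchSplitIntoSentences('It works. Does it? Yes!'));
// ['It works.', 'Does it?', 'Yes!']
```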
splitLongParagraph()
Split a single paragraph that exceeds the limit.
private
splitLongParagraph(string $paragraph, int $maxBytes) : array<string|int, string>
Falls back to sentence-level splitting, then word-level if needed.
Parameters
- $paragraph : string
  The long paragraph
- $maxBytes : int
  Maximum bytes per chunk
Return values
array<string|int, string>: Array of sub-chunks
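The sentence-then-word fallback described for this method can be sketched as below. All names (`sketchPack`, `sketchSplitLongParagraph`) are hypothetical, the sentence regex is a simple heuristic, and a single word longer than the limit is left oversized in this sketch.

```php
<?php
declare(strict_types=1);

/** Hypothetical greedy packer: joins units with a separator up to $maxBytes. */
function sketchPack(array $units, int $maxBytes, string $separator = ' '): array
{
    $chunks = [];
    $current = '';
    foreach ($units as $unit) {
        $candidate = $current === '' ? $unit : $current . $separator . $unit;
        if ($current !== '' && strlen($candidate) > $maxBytes) {
            $chunks[] = $current;
            $current = $unit;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }
    return $chunks;
}

/** Hypothetical fallback: sentences first, words for sentences that are still too long. */
function sketchSplitLongParagraph(string $paragraph, int $maxBytes): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $units = [];

    foreach ($sentences as $sentence) {
        if (strlen($sentence) <= $maxBytes) {
            $units[] = $sentence;
            continue;
        }
        // The sentence alone exceeds the limit: fall back to word-level pieces.
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach (sketchPack($words, $maxBytes) as $piece) {
            $units[] = $piece;
        }
    }

    // Re-pack sentences and word-level pieces into sub-chunks under the limit.
    return sketchPack($units, $maxBytes);
}

print_r(sketchSplitLongParagraph('Short one. This sentence is far too long to fit. End.', 20));
// ['Short one.', 'This sentence is far', 'too long to fit.', 'End.']
```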