SimilarityCalculator
Service class for calculating term similarity.
Contains algorithms for phonetic normalization and similarity ranking using the Sørensen–Dice coefficient.
Table of Contents
Constants
- STATUS_WEIGHT_IGNORED = 0.5
  Weight multiplier for ignored words (status 98).
- STATUS_WEIGHT_IN_PROGRESS = 1.15
  Weight multiplier for words in progress (status 2-4).
- STATUS_WEIGHT_LEARNED = 1.3
  Weight multiplier for learned words (status 5).
- STATUS_WEIGHT_NEW = 1.0
  Weight multiplier for new words (status 1).
- STATUS_WEIGHT_WELL_KNOWN = 1.25
  Weight multiplier for well-known words (status 99).
Properties
- $phoneticMap : array<string, string>
  Phonetic character mapping for normalization.
Methods
- getCombinedSimilarityRanking() : float
  Combined similarity ranking using character pairs and phonetic matching.
- getSimilarityRanking() : float
  Similarity ranking of two UTF-8 strings using Sørensen–Dice coefficient.
- getStatusWeight() : float
  Get weight multiplier based on word status.
- letterPairs() : array<string|int, string>
  Get letter pairs from string.
- phoneticNormalize() : string
  Normalize a string for phonetic comparison.
- wordLetterPairs() : array<string|int, string>
  Get word letter pairs from string.
Constants
STATUS_WEIGHT_IGNORED
Weight multiplier for ignored words (status 98).
public mixed STATUS_WEIGHT_IGNORED = 0.5
STATUS_WEIGHT_IN_PROGRESS
Weight multiplier for words in progress (status 2-4).
public mixed STATUS_WEIGHT_IN_PROGRESS = 1.15
STATUS_WEIGHT_LEARNED
Weight multiplier for learned words (status 5).
public mixed STATUS_WEIGHT_LEARNED = 1.3
STATUS_WEIGHT_NEW
Weight multiplier for new words (status 1).
public mixed STATUS_WEIGHT_NEW = 1.0
STATUS_WEIGHT_WELL_KNOWN
Weight multiplier for well-known words (status 99).
public mixed STATUS_WEIGHT_WELL_KNOWN = 1.25
Properties
$phoneticMap
Phonetic character mapping for normalization.
private static array<string, string> $phoneticMap = [
// Vowel groups
'a' => 'a',
'à' => 'a',
'á' => 'a',
'â' => 'a',
'ã' => 'a',
'ä' => 'a',
'å' => 'a',
'ā' => 'a',
'ă' => 'a',
'ą' => 'a',
'æ' => 'ae',
'e' => 'e',
'è' => 'e',
'é' => 'e',
'ê' => 'e',
'ë' => 'e',
'ē' => 'e',
'ĕ' => 'e',
'ė' => 'e',
'ę' => 'e',
'ě' => 'e',
'i' => 'i',
'ì' => 'i',
'í' => 'i',
'î' => 'i',
'ï' => 'i',
'ĩ' => 'i',
'ī' => 'i',
'ĭ' => 'i',
'į' => 'i',
'ı' => 'i',
'y' => 'i',
'o' => 'o',
'ò' => 'o',
'ó' => 'o',
'ô' => 'o',
'õ' => 'o',
'ö' => 'o',
'ō' => 'o',
'ŏ' => 'o',
'ő' => 'o',
'ø' => 'o',
'œ' => 'oe',
'u' => 'u',
'ù' => 'u',
'ú' => 'u',
'û' => 'u',
'ü' => 'u',
'ũ' => 'u',
'ū' => 'u',
'ŭ' => 'u',
'ů' => 'u',
'ű' => 'u',
'ų' => 'u',
// Consonant groups - similar sounds
'b' => 'b',
'p' => 'p',
'c' => 'k',
'k' => 'k',
'q' => 'k',
'ç' => 's',
'ć' => 'c',
'č' => 'c',
'd' => 'd',
't' => 't',
'ð' => 'd',
'þ' => 't',
'f' => 'f',
'v' => 'v',
'ph' => 'f',
'g' => 'g',
'ğ' => 'g',
'ģ' => 'g',
'j' => 'j',
'h' => 'h',
'l' => 'l',
'ł' => 'l',
'ľ' => 'l',
'ĺ' => 'l',
'ļ' => 'l',
'm' => 'm',
'n' => 'n',
'ñ' => 'n',
'ń' => 'n',
'ň' => 'n',
'ņ' => 'n',
'r' => 'r',
'ŕ' => 'r',
'ř' => 'r',
'ŗ' => 'r',
's' => 's',
'z' => 's',
'ś' => 's',
'š' => 's',
'ş' => 's',
'ź' => 's',
'ż' => 's',
'ž' => 's',
'ß' => 'ss',
'w' => 'w',
'x' => 'ks',
]
Maps similar-sounding characters to a common representation.
Methods
getCombinedSimilarityRanking()
Combined similarity ranking using character pairs and phonetic matching.
public getCombinedSimilarityRanking(string $str1, string $str2[, float $phoneticWeight = 0.3 ]) : float
Parameters
- $str1 : string
  First string (lowercase)
- $str2 : string
  Second string (lowercase)
- $phoneticWeight : float = 0.3
  Weight for phonetic similarity (0-1)
Return values
float: Combined similarity ranking (0-1)
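This page does not say how the character-pair score and the phonetic score are combined. One plausible reading, given here as a sketch and not as the class's confirmed formula, is a linear blend in which `$phoneticWeight` is the share given to the phonetic score. The function name `combineScoresSketch`, its two score parameters, and the clamping of the weight are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: blend two similarity scores that are each in 0-1.
 * Assumes a linear blend; the real method may combine them differently.
 */
function combineScoresSketch(
    float $charScore,
    float $phoneticScore,
    float $phoneticWeight = 0.3
): float {
    // Assumption: weights outside the documented 0-1 range are clamped.
    $w = max(0.0, min(1.0, $phoneticWeight));

    // The character-pair score receives the remaining share (1 - w).
    return (1.0 - $w) * $charScore + $w * $phoneticScore;
}
```

Under this reading, a weight of 0 reduces the result to the plain character-pair score, and a weight of 1 uses only the phonetic score.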
getSimilarityRanking()
Similarity ranking of two UTF-8 strings using Sørensen–Dice coefficient.
public getSimilarityRanking(string $str1, string $str2) : float
Source http://www.catalysoft.com/articles/StrikeAMatch.html
Source http://stackoverflow.com/questions/653157
Parameters
- $str1 : string
  First string
- $str2 : string
  Second string
Return values
float: Similarity ranking (0-1)
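For orientation, the Sørensen–Dice coefficient over character pairs is twice the number of shared pairs divided by the total number of pairs in both strings. The sketch below is a minimal stand-alone version of that idea, not the class's actual code: the names `bigramsSketch` and `diceSketch` are hypothetical, pairs are taken over the whole string without case folding, and returning 0.0 when neither string has any pairs is an assumption.

```php
<?php
/** Hypothetical helper: adjacent character pairs of a UTF-8 string. */
function bigramsSketch(string $str): array
{
    // Split into UTF-8 characters rather than bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $pairs = [];
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $pairs[] = $chars[$i] . $chars[$i + 1];
    }
    return $pairs;
}

/** Hypothetical sketch: Dice coefficient = 2 * shared pairs / total pairs. */
function diceSketch(string $str1, string $str2): float
{
    $pairs1 = bigramsSketch($str1);
    $pairs2 = bigramsSketch($str2);
    $total = count($pairs1) + count($pairs2);
    if ($total === 0) {
        return 0.0; // Assumption: strings without any pair score 0.
    }

    $shared = 0;
    foreach ($pairs1 as $pair) {
        $pos = array_search($pair, $pairs2, true);
        if ($pos !== false) {
            $shared++;
            unset($pairs2[$pos]); // Each pair in $pairs2 may match only once.
        }
    }
    return 2.0 * $shared / $total;
}
```

With this sketch, `night` and `nacht` share only the pair `ht` out of eight pairs in total, so the score is 2 × 1 / 8 = 0.25.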
getStatusWeight()
Get weight multiplier based on word status.
public getStatusWeight(int $status) : float
Parameters
- $status : int
  Word status (1-5, 98=ignored, 99=well-known)
Return values
float: Weight multiplier
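The status-to-weight mapping implied by the constants can be sketched as follows. The values and status codes come from the constant descriptions on this page; the function name `statusWeightSketch` is hypothetical, and falling back to 1.0 for an unrecognized status is an assumption, since the page does not document that case.

```php
<?php
/** Hypothetical stand-in for the status-to-weight lookup. */
function statusWeightSketch(int $status): float
{
    if ($status === 1) {
        return 1.0;   // STATUS_WEIGHT_NEW
    }
    if ($status >= 2 && $status <= 4) {
        return 1.15;  // STATUS_WEIGHT_IN_PROGRESS
    }
    if ($status === 5) {
        return 1.3;   // STATUS_WEIGHT_LEARNED
    }
    if ($status === 98) {
        return 0.5;   // STATUS_WEIGHT_IGNORED
    }
    if ($status === 99) {
        return 1.25;  // STATUS_WEIGHT_WELL_KNOWN
    }
    return 1.0;       // Assumption: unknown statuses are treated as neutral.
}
```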
letterPairs()
Get letter pairs from string.
public letterPairs(string $str) : array<string|int, string>
Parameters
- $str : string
  Input string
Return values
array<string|int, string>
phoneticNormalize()
Normalize a string for phonetic comparison.
public phoneticNormalize(string $str) : string
Applies phonetic transformations to make similar-sounding words more likely to match.
Parameters
- $str : string
  Input string (should be lowercase)
Return values
string: Phonetically normalized string
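A map such as `$phoneticMap` could be applied with PHP's `strtr()`, which tries the longest key first, so a digraph entry like `'ph'` wins over the single letters `'p'` and `'h'`. The sketch below uses only a small excerpt of the map and a hypothetical function name; whether the real method uses `strtr()` or performs additional steps is not stated on this page.

```php
<?php
/** Hypothetical sketch: apply a subset of the phonetic map to a lowercase string. */
function phoneticNormalizeSketch(string $str): string
{
    // Excerpt of $phoneticMap as documented above; the full map is much larger.
    $map = [
        'é'  => 'e',
        'ö'  => 'o',
        'y'  => 'i',
        'c'  => 'k',
        'q'  => 'k',
        'z'  => 's',
        'ß'  => 'ss',
        'x'  => 'ks',
        'ph' => 'f',
    ];

    // strtr() with an array prefers the longest matching key and never
    // re-scans text it has already replaced.
    return strtr($str, $map);
}
```

With this excerpt, `philosophy` becomes `filosofi` and `straße` becomes `strasse`, so spellings that differ only in these characters normalize to the same string.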
wordLetterPairs()
Get word letter pairs from string.
public wordLetterPairs(string $str) : array<string|int, string>
Parameters
- $str : string
  Input string
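The difference between pairs of a whole string and pairs taken word by word can be illustrated with a short sketch. It assumes that words are separated by whitespace and that no case folding is applied; both function names are hypothetical, and the real methods may behave differently.

```php
<?php
/** Hypothetical sketch: adjacent character pairs of one UTF-8 string. */
function letterPairsSketch(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $pairs = [];
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $pairs[] = $chars[$i] . $chars[$i + 1];
    }
    return $pairs;
}

/** Hypothetical sketch: pairs per whitespace-separated word, concatenated. */
function wordLetterPairsSketch(string $str): array
{
    $pairs = [];
    // Assumption: any run of whitespace separates two words.
    foreach (preg_split('/\s+/u', $str, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        foreach (letterPairsSketch($word) as $pair) {
            $pairs[] = $pair;
        }
    }
    return $pairs;
}
```

Splitting by word first keeps pairs from spanning a space: for `ab cd`, the word-based version yields only `ab` and `cd`, while the whole-string version also produces `b ` and ` c`.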