Keep accented characters while highlighting text (wrapping in <span> tags)

Question

I am using the following code to search for and highlight accented text. The problem is that it strips the accents from the text while highlighting. Is there any way to keep the accents?

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

function highlightTerm($text, $keyword) {
    $text = iconv('utf-8', 'ISO-8859-1//IGNORE', Normalizer::normalize($text, Normalizer::FORM_D));
    $words = explode(" ", $keyword);
    $p = implode('|', array_map('preg_quote', $words));
    return preg_replace(
        "/($p)/ui", 
        '<span style="background:yellow;">$1</span>', 
        $text
    );
}

preg_quote() does not escape forward slashes by default. If the values contain forward slashes, then your pattern will break because of your choice of pattern delimiters. — mickmackusa
– mickmackusa ♦, Commented Nov 12, 2022 at 7:19
So, you are normalizing the text, then wondering why it is normalized? Should $keyword be normalized too, or are those words 100% developer controlled? — mickmackusa
– mickmackusa ♦, Commented Nov 12, 2022 at 7:21
What I do not know is how to highlight without normalizing or how to match from normalized string and highlight from original string. — user934820
– user934820, Commented Nov 12, 2022 at 7:24
Warning: iconv(): Wrong encoding, conversion from "utf-8" to "ISO-8859-1//IGNORE" is not allowed — mickmackusa
– mickmackusa ♦, Commented Nov 12, 2022 at 12:59
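
To illustrate the delimiter point from the comments above, here is a minimal sketch that passes the delimiter as the second argument of preg_quote(). The keyword string is an assumption, chosen only because it contains a slash.

```php
<?php
// Hypothetical keywords; one of them contains the "/" used as pattern delimiter.
$words = explode(' ', 'and/or caf');

// The second argument tells preg_quote() to escape the delimiter as well.
$p = implode('|', array_map(fn($w) => preg_quote($w, '/'), $words));

echo $p, PHP_EOL;
// and\/or|caf

echo preg_replace("/($p)/ui", '[$1]', 'coffee and/or a café'), PHP_EOL;
// coffee [and/or] a [caf]é
```

Without the second argument the slash would stay unescaped and end the pattern early.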

ThW · Accepted Answer · 2022-11-12 18:37:09Z

A simple replace will not work for this. You have to split the text into words and compare the normalized words. You should use DOM to iterate and replace the text nodes. This avoids replacing the terms inside other node types (attributes, comments, ...) and takes care of escaping.

Splitting could be done with a regular expression; however, there is a specific tool for it in the ext/intl extension called IntlBreakIterator. The extension also has a Collator for string comparison.

Here is an example for whole words:

$html = <<<'HTML'
<div>
Would you like a café, Mister Kàpêk?
</div>
HTML;

// prepare the text breaker
$breaker = IntlBreakIterator::createWordInstance('en_US');
// prepare the compare
$collator = new Collator('en_US');
$collator->setStrength(Collator::PRIMARY);

// wrap terms for easy use
$terms = new Terms(
    function($word) use ($collator) {
        return $collator->getSortKey($word);
    },
    'cafe',
    'kapek'
);

// load HTML fragment into DOM
$document = new DOMDocument();
$document->loadHTML(
    "<?xml encoding='UTF-8'?>\n$html"
);
$xpath = new DOMXpath($document); 

// iterate text nodes
foreach ($xpath->evaluate('//text()') as $textNode) {
    // feed text into word breaker
    $breaker->setText($textNode->textContent);
    // prepare a fragment for new nodes
    $fragment = $document->createDocumentFragment();
    $replace = false; 
    // iterate words
    foreach ($breaker->getPartsIterator() as $word) {
        // find word in terms
        $index = $terms->indexOf($word) + 1;
        if ($index > 0) {
            $replace = true;
            // wrap in a "span" element
            $span = $document->createElement('span');
            $span->textContent = $word;
            $span->setAttribute('class', 'term');
            $span->setAttribute('data-term-index', $index);
            $fragment->appendChild($span);
        } else {
            $fragment->appendChild($document->createTextNode($word));
        }
    }
    if ($replace) {
        // replace original text node with new fragment
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

// DOMDocument::loadHTML() will have wrapped the HTML to 
// create a whole document
$result = '';
foreach ($xpath->evaluate('//body/node()') as $node) {
    $result .= $document->saveHTML($node);
}
echo $result;

class Terms {

    private $_normalize;    
    private $_hashes;
    
    public function __construct(
        callable $normalize, 
        string ...$terms
    ) {
        $this->_normalize = $normalize;
        $this->_hashes = array_flip(
            array_map(
                function(string $term): string { 
                   $normalize = $this->_normalize;
                   return $normalize($term);
                },
                $terms
            )
        );
    }
    
    public function indexOf(string $word): int {
       $normalize = $this->_normalize;
       $hash = $normalize($word);
       return $this->_hashes[$hash] ?? -1;
    }
}

Output:

<div>
Would you like a <span class="term" data-term-index="1">café</span>, Mister <span class="term" data-term-index="2">Kàpêk</span>?
</div>

Extending this to partial matches is possible, but it can get complex. You would have to simplify the current word (and keep track of the position) until it matches a term, then build the output fragment.
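
One possible way to keep track of the position for partial matches is sketched below. It is not part of the answer above, only an illustration under stated assumptions: ext/intl and ext/mbstring are loaded, the input is valid UTF-8, and the helper names foldChar() and highlightPartial() are hypothetical. It takes a single term and builds the result as a plain string for brevity; inside the DOM loop you would create text nodes and span elements from the same start/end positions instead.

```php
<?php
// Fold one character: decompose it (NFD), drop the combining marks, lowercase.
function foldChar(string $char): string {
    $decomposed = Normalizer::normalize($char, Normalizer::FORM_D);
    return mb_strtolower(preg_replace('/\p{Mn}+/u', '', $decomposed), 'UTF-8');
}

// Hypothetical helper: match on a folded copy, cut the pieces from the original.
function highlightPartial(string $text, string $term): string {
    $chars  = mb_str_split($text, 1, 'UTF-8'); // original characters
    $folded = '';                              // accent-free search string
    $map    = [];                              // folded index => original index
    foreach ($chars as $i => $char) {
        foreach (mb_str_split(foldChar($char), 1, 'UTF-8') as $f) {
            $folded .= $f;
            $map[]   = $i;
        }
    }

    $needle = implode('', array_map('foldChar', mb_str_split($term, 1, 'UTF-8')));
    $length = mb_strlen($needle, 'UTF-8');
    if ($length === 0) {
        return $text;
    }

    $out    = '';
    $cursor = 0; // next original character still to be copied
    $from   = 0; // where to continue searching in the folded string
    while (($pos = mb_strpos($folded, $needle, $from, 'UTF-8')) !== false) {
        $from  = $pos + $length;
        $start = $map[$pos];
        if ($start < $cursor) {
            continue; // overlaps the previous match in the original text
        }
        $end = $map[$pos + $length - 1] + 1;
        // keep marks stored as separate characters together with their letter
        while ($end < count($chars) && foldChar($chars[$end]) === '') {
            $end++;
        }
        $out .= implode('', array_slice($chars, $cursor, $start - $cursor))
            . '<span class="term">'
            . implode('', array_slice($chars, $start, $end - $start))
            . '</span>';
        $cursor = $end;
    }
    return $out . implode('', array_slice($chars, $cursor));
}

echo highlightPartial('Would you like a café, Mister Kàpêk?', 'kape'), PHP_EOL;
// Would you like a café, Mister <span class="term">Kàpê</span>k?
```

The index map is what lets the search run on the simplified text while the highlighted substring is taken from the untouched original. Note that this string version does not HTML-escape anything; the DOM approach takes care of that.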

Thank you for your time and writing this super complex code. It is giving some errors while running "Undefined property: Terms::$normalize in /home/ccp21q8hki8i/public_html/.....php on line 77" "PHP Fatal error: Uncaught Error: Function name must be a string in..." Stack trace: #0 [internal function]: Terms->{closure}('cafe')... Moreover, partial match is necessary in my case.
You should update your PHP version - it seems to be PHP 7 not 8. I refactored it to PHP >= 7. Like I wrote partial is more complex, you can start with this and extend it.

mickmackusa · Accepted Answer · 2022-11-13 02:19:46Z

Here is a not-so-pretty approach to isolate the search terms in the normalized input string, then perform multibyte-safe surgery on the original string based on the offsets of the matches and the lengths of substrings.

I replaced your pattern delimiters with a symbol that preg_quote() will escape by default.

The replacements must be done in reverse so that the offset and length calculations are not skewed.
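
A tiny illustration of why the order matters, using ASCII text and a hypothetical wrapMatches() helper: each inserted tag makes the string longer, so working from the last match to the first leaves the earlier offsets valid.

```php
<?php
// Hypothetical helper: wrap each [match, offset] pair, last match first,
// so that the offsets of earlier matches still point at the right place.
function wrapMatches(string $text, array $matches): string {
    foreach (array_reverse($matches) as [$match, $offset]) {
        $text = substr_replace($text, "<b>$match</b>", $offset, strlen($match));
    }
    return $text;
}

// Pairs in the shape that PREG_OFFSET_CAPTURE produces.
$matches = [['one', 0], ['three', 8]];
echo wrapMatches('one two three', $matches), PHP_EOL;
// <b>one</b> two <b>three</b>
```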

Normally this sort of task calls for preg_replace_callback(), but because the search is on the normalized string and the replacement is on the original string, the replacement step must be separated from the matching step.

I used strtr() to brute-force the normalization because I am not aware of the most reliable way to normalize accented characters. Feel free to replace that subprocess.

Code: (Demo)

define(
    'ACCENT_MAP',
    [
        "ъ" => "-", "ь" => "-", "Ъ" => "-", "Ь" => "-",
        "А" => "A", "Ă" => "A", "Ǎ" => "A", "Ą" => "A", "À" => "A", "Ã" => "A", "Á" => "A", "Æ" => "A", "Â" => "A", "Å" => "A", "Ǻ" => "A", "Ā" => "A", "א" => "A",
        "Б" => "B", "ב" => "B", "Þ" => "B",
        "Ĉ" => "C", "Ć" => "C", "Ç" => "C", "Ц" => "C", "צ" => "C", "Ċ" => "C", "Č" => "C", "©" => "C", "ץ" => "C",
        "Д" => "D", "Ď" => "D", "Đ" => "D", "ד" => "D", "Ð" => "D",
        "È" => "E", "Ę" => "E", "É" => "E", "Ë" => "E", "Ê" => "E", "Е" => "E", "Ē" => "E", "Ė" => "E", "Ě" => "E", "Ĕ" => "E", "Є" => "E", "Ə" => "E", "ע" => "E",
        "Ф" => "F", "Ƒ" => "F",
        "Ğ" => "G", "Ġ" => "G", "Ģ" => "G", "Ĝ" => "G", "Г" => "G", "ג" => "G", "Ґ" => "G",
        "ח" => "H", "Ħ" => "H", "Х" => "H", "Ĥ" => "H", "ה" => "H",
        "I" => "I", "Ï" => "I", "Î" => "I", "Í" => "I", "Ì" => "I", "Į" => "I", "Ĭ" => "I", "I" => "I", "И" => "I", "Ĩ" => "I", "Ǐ" => "I", "י" => "I", "Ї" => "I", "Ī" => "I", "І" => "I",
        "Й" => "J", "Ĵ" => "J",
        "ĸ" => "K", "כ" => "K", "Ķ" => "K", "К" => "K", "ך" => "K",
        "Ł" => "L", "Ŀ" => "L", "Л" => "L", "Ļ" => "L", "Ĺ" => "L", "Ľ" => "L", "ל" => "L",
        "מ" => "M", "М" => "M", "ם" => "M",
        "Ñ" => "N", "Ń" => "N", "Н" => "N", "Ņ" => "N", "ן" => "N", "Ŋ" => "N", "נ" => "N", "ŉ" => "N", "Ň" => "N",
        "Ø" => "O", "Ó" => "O", "Ò" => "O", "Ô" => "O", "Õ" => "O", "О" => "O", "Ő" => "O", "Ŏ" => "O", "Ō" => "O", "Ǿ" => "O", "Ǒ" => "O", "Ơ" => "O",
        "פ" => "P", "ף" => "P", "П" => "P",
        "ק" => "Q",
        "Ŕ" => "R", "Ř" => "R", "Ŗ" => "R", "ר" => "R", "Р" => "R", "®" => "R",
        "Ş" => "S", "Ś" => "S", "Ș" => "S", "Š" => "S", "С" => "S", "Ŝ" => "S", "ס" => "S",
        "Т" => "T", "Ț" => "T", "ט" => "T", "Ŧ" => "T", "ת" => "T", "Ť" => "T", "Ţ" => "T",
        "Ù" => "U", "Û" => "U", "Ú" => "U", "Ū" => "U", "У" => "U", "Ũ" => "U", "Ư" => "U", "Ǔ" => "U", "Ų" => "U", "Ŭ" => "U", "Ů" => "U", "Ű" => "U", "Ǖ" => "U", "Ǜ" => "U", "Ǚ" => "U", "Ǘ" => "U",
        "В" => "V", "ו" => "V",
        "Ý" => "Y", "Ы" => "Y", "Ŷ" => "Y", "Ÿ" => "Y",
        "Ź" => "Z", "Ž" => "Z", "Ż" => "Z", "З" => "Z", "ז" => "Z",
        "а" => "a", "ă" => "a", "ǎ" => "a", "ą" => "a", "à" => "a", "ã" => "a", "á" => "a", "æ" => "a", "â" => "a", "å" => "a", "ǻ" => "a", "ā" => "a", "א" => "a",
        "б" => "b", "ב" => "b", "þ" => "b",
        "ĉ" => "c", "ć" => "c", "ç" => "c", "ц" => "c", "צ" => "c", "ċ" => "c", "č" => "c", "©" => "c", "ץ" => "c",
        "Ч" => "ch", "ч" => "ch",
        "д" => "d", "ď" => "d", "đ" => "d", "ד" => "d", "ð" => "d",
        "è" => "e", "ę" => "e", "é" => "e", "ë" => "e", "ê" => "e", "е" => "e", "ē" => "e", "ė" => "e", "ě" => "e", "ĕ" => "e", "є" => "e", "ə" => "e", "ע" => "e",
        "ф" => "f", "ƒ" => "f",
        "ğ" => "g", "ġ" => "g", "ģ" => "g", "ĝ" => "g", "г" => "g", "ג" => "g", "ґ" => "g",
        "ח" => "h", "ħ" => "h", "х" => "h", "ĥ" => "h", "ה" => "h",
        "i" => "i", "ï" => "i", "î" => "i", "í" => "i", "ì" => "i", "į" => "i", "ĭ" => "i", "ı" => "i", "и" => "i", "ĩ" => "i", "ǐ" => "i", "י" => "i", "ї" => "i", "ī" => "i", "і" => "i",
        "й" => "j", "Й" => "j", "Ĵ" => "j", "ĵ" => "j",
        "ĸ" => "k", "כ" => "k", "ķ" => "k", "к" => "k", "ך" => "k",
        "ł" => "l", "ŀ" => "l", "л" => "l", "ļ" => "l", "ĺ" => "l", "ľ" => "l", "ל" => "l",
        "מ" => "m", "м" => "m", "ם" => "m",
        "ñ" => "n", "ń" => "n", "н" => "n", "ņ" => "n", "ן" => "n", "ŋ" => "n", "נ" => "n", "ŉ" => "n", "ň" => "n",
        "ø" => "o", "ó" => "o", "ò" => "o", "ô" => "o", "õ" => "o", "о" => "o", "ő" => "o", "ŏ" => "o", "ō" => "o", "ǿ" => "o", "ǒ" => "o", "ơ" => "o",
        "פ" => "p", "ף" => "p", "п" => "p",
        "ק" => "q",
        "ŕ" => "r", "ř" => "r", "ŗ" => "r", "ר" => "r", "р" => "r", "®" => "r",
        "ş" => "s", "ś" => "s", "ș" => "s", "š" => "s", "с" => "s", "ŝ" => "s", "ס" => "s",
        "т" => "t", "ț" => "t", "ט" => "t", "ŧ" => "t", "ת" => "t", "ť" => "t", "ţ" => "t",
        "ù" => "u", "û" => "u", "ú" => "u", "ū" => "u", "у" => "u", "ũ" => "u", "ư" => "u", "ǔ" => "u", "ų" => "u", "ŭ" => "u", "ů" => "u", "ű" => "u", "ǖ" => "u", "ǜ" => "u", "ǚ" => "u", "ǘ" => "u",
        "в" => "v", "ו" => "v",
        "ý" => "y", "ы" => "y", "ŷ" => "y", "ÿ" => "y",
        "ź" => "z", "ž" => "z", "ż" => "z", "з" => "z", "ז" => "z", "ſ" => "z",
        "™" => "tm",
        "@" => "at",
        "Ä" => "ae", "Ǽ" => "ae", "ä" => "ae", "æ" => "ae", "ǽ" => "ae",
        "ĳ" => "ij", "Ĳ" => "ij",
        "я" => "ja", "Я" => "ja",
        "Э" => "je", "э" => "je",
        "ё" => "jo", "Ё" => "jo",
        "ю" => "ju", "Ю" => "ju",
        "œ" => "oe", "Œ" => "oe", "ö" => "oe", "Ö" => "oe",
        "щ" => "sch", "Щ" => "sch",
        "ш" => "sh", "Ш" => "sh",
        "ß" => "ss",
        "Ü" => "ue",
        "Ж" => "zh", "ж" => "zh",
    ]);

With:

function highlightTerm($text, $keyword) {
    $mbLength = mb_strlen($text);
    $unaccented = strtr($text, ACCENT_MAP);
    $words = explode(" ", $keyword);
    $regex = implode('|', array_map('preg_quote', $words));
    if (preg_match_all("#$regex#ui", $unaccented, $m, PREG_OFFSET_CAPTURE)) {
        foreach (array_reverse($m[0]) as [$match, $offset]) {

            // normalized length
            $length = strlen($match);

            // new multibyte-safe substring
            $tag = '<span style="background:yellow;">'
                . mb_substr($text, $offset, $length)
                . '</span>';

            // actual multibyte-safe replacement on original text
            $text = mb_substr($text, 0, $offset)
                . $tag
                . mb_substr($text, $offset + $length);
        }
    }
    return $text;
}

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

Output:

Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?
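
One detail that may be worth checking before reusing this on other input: PREG_OFFSET_CAPTURE reports offsets in bytes, even with the u modifier, while mb_substr() counts characters. The two agree in the demo because the normalized string is plain ASCII. They drift apart as soon as a multibyte character that is not in the map precedes a match, and entries that turn one character into several (such as "ß" => "ss") shift positions relative to the original text. A hedged sketch of the byte-to-character conversion, with a hypothetical helper name:

```php
<?php
// Hypothetical helper: turn a byte offset from preg_match_all() into a
// character offset that mb_substr() understands.
function byteToCharOffset(string $subject, int $byteOffset): int {
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// "¿" is not a key in ACCENT_MAP, so it survives strtr() as a 2-byte character.
$unaccented = '¿Un cafe?';
preg_match('#caf#ui', $unaccented, $m, PREG_OFFSET_CAPTURE);

echo $m[0][1], PHP_EOL;                                // 5 (bytes)
echo byteToCharOffset($unaccented, $m[0][1]), PHP_EOL; // 4 (characters)
```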

mickmackusa · Accepted Answer · 2022-11-13 02:17:13Z

Instead of normalizing the text, you can use the tedious approach of creating a dynamic, accent-agnostic regex pattern and then directly perform replacements on the input string.

The regex map (based on the second code block of this answer):

define(
    'ACCENT_MAP',
    [
        "A" => "[AАĂǍĄÀÃÁÆÂÅǺĀא]",
        "B" => "[BБבÞ]",
        "C" => "[CĈĆÇЦצĊČץ]",
        "D" => "[DДĎĐדÐ]",
        "E" => "[EÈĘÉËÊЕĒĖĚĔЄƏע]",
        "F" => "[FФƑ]",
        "G" => "[GĞĠĢĜГגҐ]",
        "H" => "[HחĦХĤה]",
        "I" => "[IIÏÎÍÌĮĬIИĨǏיЇĪІ]",
        "J" => "[JЙĴ]",
        "K" => "[KĸכĶКך]",
        "L" => "[LŁĿЛĻĹĽל]",
        "M" => "[MמМם]",
        "N" => "[NÑŃНŅןŊנŉŇ]",
        "O" => "[OØÓÒÔÕОŐŎŌǾǑƠ]",
        "P" => "[PפףП]",
        "Q" => "[Qק]",
        "R" => "[RŔŘŖרР]",
        "S" => "[SŞŚȘŠСŜס]",
        "T" => "[TТȚטŦתŤŢ]",
        "U" => "[UÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ]",
        "V" => "[VВו]",
        "Y" => "[YÝЫŶŸ]",
        "Z" => "[ZŹŽŻЗז]",
        "a" => "[aаăǎąàãáæâåǻāא]",
        "b" => "[bбבþ]",
        "c" => "[cĉćçцצċčץ]",
        "ch" => "(?:ch|ч)",
        "d" => "[dдďđדð]",
        "e" => "[eèęéëêеēėěĕєəע]",
        "f" => "[fфƒ]",
        "g" => "[gğġģĝгגґ]",
        "h" => "[hחħхĥה]",
        "i" => "[iiïîíìįĭıиĩǐיїīі]",
        "j" => "[jйĵ]",
        "k" => "[kĸכķкך]",
        "l" => "[lłŀлļĺľל]",
        "m" => "[mמмם]",
        "n" => "[nñńнņןŋנŉň]",
        "o" => "[oøóòôõоőŏōǿǒơ]",
        "p" => "[pפףп]",
        "q" => "[qק]",
        "r" => "[rŕřŗרр]",
        "s" => "[sşśșšсŝס]",
        "t" => "[tтțטŧתťţ]",
        "u" => "[uùûúūуũưǔųŭůűǖǜǚǘ]",
        "v" => "[vвו]",
        "y" => "[yýыŷÿ]",
        "z" => "[zźžżзזſ]",
        "ae" => "(?:ae|[ÄǼäæǽ])",
        "ch" => "(?:ch|[Чч])",
        "ij" => "(?:ij|[ĳĲ])",
        "ja" => "(?:ja|[яЯ])",
        "je" => "(?:je|[Ээ])",
        "jo" => "(?:jo|[ёЁ])",
        "ju" => "(?:ju|[юЮ])",
        "oe" => "(?:oe|[œŒöÖ])",
        "sch" => "(?:sch|[щЩ])",
        "sh" => "(?:sh|[шШ])",
        "ss" => "(?:ss|[ß])",
        "ue" => "(?:ue|[Ü])",
        "zh" => "(?:zh|[Жж])"
    ]);

Code: (Demo)

function highlightTerm($text, $keyword) {
    $regex = implode(
        '|',
        array_map(
            fn($w) => strtr(preg_quote($w), ACCENT_MAP),
            explode(" ", $keyword)
        )
    );
    return preg_replace(
               "#$regex#ui",
               '<span style="background:yellow;">$0</span>',
               $text
           );
}

echo highlightTerm("Would you like a café, Mister Kàpêk?", "kape caf");

Output:

Would you like a <span style="background:yellow;">caf</span>é, Mister <span style="background:yellow;">Kàpê</span>k?
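
To make the pattern-building step easier to follow, here is a reduced sketch that prints the regex generated for one keyword. MINI_MAP and buildPattern() are hypothetical names, and the map holds only the handful of letters needed for the illustration, not the full map from above.

```php
<?php
// Reduced, hypothetical map: just enough letters for this illustration.
const MINI_MAP = [
    'a' => '[aàáâ]',
    'c' => '[cç]',
    'e' => '[eèéê]',
    'f' => '[f]',
];

// Escape the keyword first, then swap each plain letter for its class.
function buildPattern(string $keyword, array $map): string {
    return strtr(preg_quote($keyword, '#'), $map);
}

echo buildPattern('cafe', MINI_MAP), PHP_EOL;
// [cç][aàáâ][f][eèéê]

preg_match('#' . buildPattern('cafe', MINI_MAP) . '#ui', 'Un CAFÉ', $m);
echo $m[0], PHP_EOL;
// CAFÉ
```

Because strtr() never rescans text it has already replaced, the letters inside each generated character class are not expanded a second time.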

Thank you very much again and again for your time. It works as expected. Just one more request: can we add Arabic/Persian support as well? Arabic has diacritics (short-vowel marks) such as "َ", "ِ" ، "ُ" ... Usually a search is typed without these marks, but our text could contain them. As a result, the search term "لحم" will not find "لَحَم". Ignoring the marks would solve this issue.
I am afraid we are going out of my realm of understanding. You should have included this range of characters in your original sample data (a sufficiently complex minimal reproducible example is critical to a clear, complete question). I don't know how these characters should be translated, so I cannot adjust my answers.
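
For the Arabic/Persian case raised in the comment above, one direction that might work is to allow any run of combining marks after each letter of the keyword, using the Unicode property \p{Mn}. This is only a sketch under assumptions: the helper names are hypothetical, the keyword is assumed to be typed without marks, and it has not been checked against real Arabic or Persian search requirements. The sample strings are written with code point escapes: lam, hah, meem, with fatha (U+064E) added in the text.

```php
<?php
// Hypothetical helper: let any run of combining marks (\p{Mn}) follow each
// letter of the keyword, so an unmarked keyword matches marked text.
function markTolerantPattern(string $keyword): string {
    $letters = preg_split('//u', $keyword, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_map(
        fn($letter) => preg_quote($letter, '#') . '\p{Mn}*',
        $letters
    ));
}

function highlightIgnoringMarks(string $text, string $keyword): string {
    return preg_replace(
        '#' . markTolerantPattern($keyword) . '#u',
        '<span style="background:yellow;">$0</span>',
        $text
    );
}

$keyword = "\u{0644}\u{062D}\u{0645}";                 // three bare letters
$text    = "\u{0644}\u{064E}\u{062D}\u{064E}\u{0645}"; // same word with fatha marks
echo highlightIgnoringMarks($text, $keyword), PHP_EOL;
```

Here the whole marked word ends up inside the span, marks included, because the marks are matched by the optional \p{Mn}* parts rather than removed from the text.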
