Searching for string and ignoring accents (ajax/json)

Need to find a string in a textarea, accented or not.

         

ianevans

3:51 am on Jan 13, 2020 (gmt 0)

10+ Year Member



I'm hoping this is a) caffeine-deprivation b) a misunderstanding of PHP multi-byte functions or c) a misunderstanding of passing stuff via ajax.

Site and database are UTF-8 encoded. I have a database of people's names and urls to their site bios. Let's say one of those names is Renée Zellweger. If I go into PHPMyAdmin I can find her db entry whether I search for "Renée" or "Renee" without the accent.

I have a page where I enter news articles in a textarea. I pass the textarea to another PHP script that searches the textarea for names in my database and replaces them with a Textile-formatted link.

Unlike the PHPMyAdmin search, the code appears not to be ignoring whether there are accents or not. If my textarea contains "Renée Zellweger" it will match to the "Renée Zellweger" in the database, but "Renee Zellweger" without accents in the textarea will not match the "Renée Zellweger" in the db.

Here's the jQuery that's triggered to pass the textarea to the database:

<script type="text/javascript">
<!--
$("#link-bios").on("click", function(e) {
    e.preventDefault();
    var textarea = $("#markItUp");
    var peopleid = "<?php echo $peopleid; ?>";
    // Send the textarea contents to the PHP script that links the names.
    var db_request = $.ajax({
        url: "/rather/linkcelebs.php",
        type: "POST",
        data: {
            textarea: textarea.val(),
            peopleid: peopleid
        },
        dataType: "json"
    });
    // Put the linked text back into the textarea.
    db_request.done(function(response) {
        textarea.val(response.text);
    });
});
// -->
</script>


And here's the applicable part of the code from the script it calls. $name and $url are results from the database, so the $name from the db is the needle and the textarea is the haystack.


while ($stmt->fetch()) {
    if (mb_stristr($_POST['textarea'], $name) == TRUE AND mb_stristr($_POST['textarea'], $url) == FALSE) {
        $_POST['textarea'] = str_replace($name, "\"$name\":/cr/$url/", $_POST['textarea']);
    }
}


Is mb_stristr the wrong function to call if I want to ignore accents?

Many thanks for advice/insights.

ianevans

3:58 am on Jan 13, 2020 (gmt 0)

10+ Year Member



Sheesh. As I read this over, I do see one possible part of the puzzle:

$_POST['textarea'] = str_replace($name, "\"$name\":/cr/$url/", $_POST['textarea']);

If the name in the text area is "Renee Zellweger", the str_replace will never find it if the DB entry is "Renée Zellweger". So is there a way to make the str_replace accent-insensitive?
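
One way to picture an accent-insensitive replace, sketched in JavaScript rather than PHP: decompose the text so accents become separate combining marks, then match each letter of the name followed by any number of marks. The function names linkName and escapeRegExp are made up for this illustration, and the /cr/ link format is copied from the snippet above. It does not skip names that are already linked.

```javascript
// Escape regex syntax characters so a name is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace every accented or unaccented spelling of
// `name` in `text` with a Textile link that uses the database spelling.
function linkName(text, name, url) {
  // Strip combining marks from the name: "Renée" becomes "Renee".
  const folded = name.normalize("NFD").replace(/\p{M}/gu, "");
  // Allow optional combining marks after every character of the name.
  const pattern = Array.from(folded, (ch) => escapeRegExp(ch) + "\\p{M}*").join("");
  const re = new RegExp(pattern, "giu");
  // Work on the decomposed text, then recompose the result.
  return text
    .normalize("NFD")
    .replace(re, () => `"${name}":/cr/${url}/`)
    .normalize("NFC");
}

console.log(linkName("An Oscar for Renee Zellweger.", "Renée Zellweger", "renee-zellweger"));
```

Letters that have no Unicode decomposition (ø or ł, for example) are not covered by this approach and would need an explicit mapping.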

phranque

5:15 am on Jan 13, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i don't see any way to do this without doing several find-and-replace methods to replace any expected letters with diacritics with the non-diacritic equivalent for both "the needle and the haystack" before doing the str-replace.

e.g., replacing:
[àáâãäå]
with:
a
etc
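
A minimal sketch of that folding step, in JavaScript for illustration: instead of listing every accented letter by hand, it relies on Unicode decomposition (NFD) and then drops the combining marks. The name foldAccents is an assumption for this example, not an existing function.

```javascript
// Hypothetical helper: fold accented letters to their base letters by
// decomposing them and removing the combining marks.
function foldAccents(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// Fold both the needle and the haystack before comparing them.
const needle = foldAccents("Renée Zellweger").toLowerCase();
const haystack = foldAccents("Tonight: Renee Zellweger wins again").toLowerCase();
console.log(haystack.includes(needle));
```

This treats every decomposable accented letter alike, so it is only appropriate where that kind of loose matching is what you want.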

lucy24

10:28 pm on Jan 13, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“Ignore accent”, like “case-insensitive”, is one of those rare concepts that is easier for a human than for a computer. Some individual programs may include a set of arrays: “e and its relatives”, “a and its relatives” and so on, but there’s no inherent connection between, say, e and é, or n and ñ.

phranque, every time I see å listed as an a-type letter it makes my skin crawl, because in the Scandinavian languages it is an entirely separate letter, 29th in the alphabet, not replaceable by “a” alone. (I think the same grapheme is used in Turkish, where it may really be an a-relative because they do all that weird business with vowel harmony.)

phranque

11:11 pm on Jan 13, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I think the only thing that's relevant here is that 99+% of all users with an English keyboard would use the 'a' to substitute for that letter.

lucy24

11:29 pm on Jan 13, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What’s relevant is that users can do this because their human brain says “Oh, that’s an ‘a’ with some doodahs on it, which in turn is identical to an ‘A’”. A computer can’t do either of those substitutions unless it has been explicitly taught that suchandsuch codepoint belongs to suchandsuch subset.

Jonesy

4:55 pm on Jan 18, 2020 (gmt 0)

10+ Year Member Top Contributors Of The Month



I wonder how Soundex plays in the UTF-8 (et al.) world?
[en.wikipedia.org]
Even if it could be an option, it would mean a DB redesign and logic rework for the OP.

lammert

6:06 pm on Jan 18, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is possible without much hassle if you let the database do the comparison work rather than PHP. You already mentioned that in PHPMyAdmin an accent-independent search works as you expected. This is because PHPMyAdmin sends a query directly to MySQL, which uses the collation for the table/field to determine how texts should be compared.

So instead of taking all names out of the database and comparing them in PHP with your text, you should take all words in your text and send them in a query to your database.

ianevans

6:23 pm on Jan 18, 2020 (gmt 0)

10+ Year Member



That thought had crossed my mind but I wasn't able to think of the logic to extract the names from the articles.

NickMNS

7:23 pm on Jan 18, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Step one: split your text into an array of words.
Step two: remove words that are unlikely to be names but are almost certainly contained in the text, e.g. the, was, when, where, one, two, if, and, etc. Lists of these words are available; they are commonly referred to as "stop words". Here is an example: [gist.github.com...]
Step three: compare each word or pair of words to the DB. If you are using two words you will need to compare (word 0 + word 1), then (word 1 + word 2), and so on.
This makes up the basic structure; then you'll need to refine it for edge cases and other exceptions.

Note that this will work fine for small blocks of text that are being compared to short lists of names. But this could be a very long process for big lists as each word in the text needs to be compared to each term in the DB. O(n**2) if I'm not mistaken.
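
The steps above can be sketched roughly as follows, in JavaScript for illustration. The stop-word list is a tiny placeholder and candidatePhrases is a made-up name; each returned phrase would then be looked up in the database.

```javascript
// Placeholder stop-word list; a real list would be much longer.
const STOP_WORDS = new Set(["the", "was", "when", "where", "one", "two", "if", "and"]);

// Hypothetical helper: return single words and adjacent word pairs
// that might be names.
function candidatePhrases(text) {
  // Split on anything that is not a letter, combining mark, apostrophe or hyphen.
  const words = text
    .split(/[^\p{L}\p{M}'-]+/u)
    .filter((w) => w !== "" && !STOP_WORDS.has(w.toLowerCase()));
  const phrases = [];
  for (let i = 0; i < words.length; i++) {
    phrases.push(words[i]);
    if (i + 1 < words.length) {
      phrases.push(words[i] + " " + words[i + 1]);
    }
  }
  return phrases;
}

console.log(candidatePhrases("When Renee Zellweger was one"));
```

Because stop words are dropped before pairing, a pair can join two words that were not adjacent in the original text. That is one of the edge cases to refine.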

lammert

7:51 pm on Jan 18, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You might check whether the following MySQL statement works:
SELECT * FROM `peopledatabase` WHERE 'This is your text area with Renee Zellweger without accent and a lot of other words' LIKE CONCAT('%',`name`,'%')
This statement should return all records from your people database where the field `name` matches part of the text while ignoring accents.

lucy24

8:30 pm on Jan 18, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



php-as-such may not be able to associate é with e or ã with a, but it can definitely identify character ranges.* So rather than looking at all words, another option is to look for the ones that contain non-ASCII (but still \w) characters.


* Disclaimer: I don’t know the actual command, but am working on the assumption that anything javascript can do, php can do.
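
For what it's worth, here is how that character-range check might look in JavaScript. The function name wordsWithNonAscii is made up for this sketch. Note that it only flags words that already carry accents in the text.

```javascript
// Hypothetical helper: return the words that contain at least one
// character outside the ASCII range.
function wordsWithNonAscii(text) {
  // A "word" here is a run of Unicode letters and combining marks.
  const words = text.match(/[\p{L}\p{M}]+/gu) || [];
  return words.filter((w) => /[^\x00-\x7F]/.test(w));
}

console.log(wordsWithNonAscii("Renée Zellweger and Penélope Cruz"));
```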

ianevans

12:17 am on Jan 19, 2020 (gmt 0)

10+ Year Member



lammert,

Dang, will have to do some test coding using that when my schedule is clear.

I can think of some issues because the purpose of this is to look for instances of non-Textile linked names in the articles and link them with the name's url field. So:

Renee Zellweger

turns into:

"Renée Zellweger":/url-of-bio

If I try this method, I'll have to be able to do the regex search and replace in MySQL.

Putting my thinking cap 🧢 on.

topr8

11:56 am on Jan 19, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



i'm thinking the down and dirty - and probably not the best way to do this ... but it's the easiest if you have the system otherwise in place and working except for the lack of accents.

why don't you just create another database field, which would be the actor name but just using the english alphabet characters - it would be easy enough to populate the field from the field with the accented value and then run a few search and replaces to be left with just english alphabet characters. then however you are currently doing it, use this other field as well.

(it could be a separate table rather than adding a field to the current table - especially as very few of the names are actually accented i imagine - so it would be a lot smaller that way.)

it occurs to me that not all actors names are two words, some are 3 or even 4...
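
a rough sketch of the extra-field idea, in JavaScript for illustration, with an in-memory map standing in for the second field or table. the row shape ({name, url}) and the names buildPlainIndex and plainKey are assumptions made up for this example.

```javascript
// Hypothetical helper: reduce a name to lowercase unaccented letters.
function plainKey(name) {
  return name.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

// Build a lookup from the plain spelling to the original row, which
// plays the role of the extra "plain name" field.
function buildPlainIndex(people) {
  const index = new Map();
  for (const person of people) {
    index.set(plainKey(person.name), person);
  }
  return index;
}

const index = buildPlainIndex([
  { name: "Renée Zellweger", url: "renee-zellweger" },
  { name: "Helena Bonham Carter", url: "helena-bonham-carter" },
]);

// Either spelling typed into the article resolves to the same row.
console.log(index.get(plainKey("Renee Zellweger")).name);
```

since the key is the whole name, three- and four-word names work the same way as two-word ones.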