This is in a Perl script using the DBI module, so I wasn't really sure whether to put this under Perl or MySQL.
The first part of the script checks the database to see if there's already a match; if so, $dupe is set to a true value:
##
$var = param('submitted_content');
$dupe = 0;
my $temp = $dbh->prepare("SELECT COUNT(1) FROM table WHERE col=?");
..$temp->execute($var);
..$dupe = $temp->fetchrow_array();
##
That's not the exact code, of course, but pretty close. It's been working for a few years, so please ignore any typos.
Now, though, I'm wanting to test for people trying to create a value that LOOKS like a match, but isn't; ie, exchanging a capital "i" with a lowercase "L", or a capital "o" with a number "0".
This is what I've written to follow the above, but not tested:
##
if (!$dupe && $var =~ /il10\./i) {
..# Remove opening or trailing .
..if ($var =~ /^\.|\.$/) { $t_user = $var =~ s/^\.|\.$//; }
..# Replace common hacks with _ wildcard
..$t_user = $var =~ s/[il10\.]/_/gi;
..my $temp = $dbh->prepare("SELECT COUNT(1) FROM table WHERE col REGEXP ?");
..$temp->execute($t_user);
..$dupe = $temp->fetchrow_array();
}
##
I'm not sure that this will work, though; shouldn't using the ? wildcard automatically escape the _ so that it's no longer a regex wildcard?
I'm sure that I could just send "...REGEXP $t_user", but then of course I'm opened up to all kinds of problems.
Of course, it's not really perfect, anyway; even if the code was fine, then a submission of "live10" would match "a very", which really isn't even close to a duplicate.
Can you guys suggest a better way to catch these types of intentional copies?
# Note: the .. are added above to space the code out, hopefully making it more legible.