Using a variable for the regex pattern

         

csdude55

1:27 am on May 21, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have a regex that I use fairly often, to see if the text submitted is a valid email:

[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z0-9]{2,4}


I'm sure that it's not exactly perfect (I think I picked it up somewhere in the late 90s), but it's good enough for my purposes.

Since I use it a lot, I'm moving it to a string variable:

$emailPattern = "[\w\-\.\+]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z0-9]{2,4}";

if ($text =~ /$emailPattern/) { whatever; }


The question is... is the "proper" method to double-escape everything since I'm using it in a string? Like this?

$emailPattern = "[\\w\\-\\.\\+]+\\@[a-zA-Z0-9\\.\\-]+\.[a-zA-Z0-9]{2,4}";


This works, I'm just not sure if it's the "approved" method. But I tried qq~ ~; and quotemeta() but neither worked. Double escaping looks ugly and is hard to read, but I don't know of a better option.
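For what it's worth, the double-escaping is needed because Perl processes backslashes in a double-quoted string before the regex engine ever sees the pattern. Here is a minimal sketch of the difference, reusing the $emailPattern name from the post. The single-quoted form is one possible alternative to double-escaping, not necessarily the "approved" one, and the sample inputs are made up:

```perl
use strict;
use warnings;

# Double quotes handle backslashes themselves, before the regex engine runs.
my $double  = "\.";    # backslash is dropped: the string is just "."
my $escaped = "\\.";   # doubled backslash survives as "\."
my $single  = '\.';    # single quotes leave "\." untouched

print "double:  $double\n";
print "escaped: $escaped\n";
print "single:  $single\n";

# One way to avoid double-escaping: keep the pattern in single quotes.
my $emailPattern = '[\w.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4}';

for my $text ('csdude@example.com', 'not an address') {
    my $verdict = $text =~ /$emailPattern/ ? 'looks like an email' : 'no match';
    print "$text: $verdict\n";
}
```

qq~ ~ behaves like double quotes, so it has the same backslash problem, and quotemeta() escapes every metacharacter, which turns the whole pattern into a literal string.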

lucy24

2:51 am on May 21, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Psst! You don't need to escape . and + inside grouping brackets. (Hyphens depend on where in the group it is, and then on the exact RegEx engine, so play it safe and escape them regardless.) And I shouldn’t think you need to escape @ anywhere, unless this is in some weird dialect where @ has meaning.

\.[a-zA-Z0-9]{2,4}
If you mean “no more than 4 characters”, as in “.com”, you should put a \b at the end. Otherwise there could be more of them after the pattern is done.

Or a \\b as the case may be.

$emailPattern = "[\\w\\-\\.\\+]+\\@[a-zA-Z0-9\\.\\-]+\.[a-zA-Z0-9]{2,4}"
Don't you need to double-escape the final, and most crucial, \. then too?

How come it’s \w in one place and [A-Za-z0-9] in the other? Do you especially need to exclude lowlines and/or non-ASCII alphabetics?
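Both points can be checked quickly. The sketch below assumes single-quoted pattern strings (so no double-escaping is involved) and a made-up address. It shows that without \b the {2,4} part is satisfied by a prefix of a longer extension, and that \w accepts the lowline while the explicit ASCII class does not:

```perl
use strict;
use warnings;

# Single-quoted so the backslashes reach the regex engine unchanged.
my $loose   = '[\w.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]{2,4}';
my $bounded = $loose . '\b';

my $addr = 'someone@example.toolongtld';   # made-up address for illustration

# Without \b the pattern is satisfied by a prefix: "someone@example.tool".
print "loose:   ", ($addr =~ /$loose/   ? "match" : "no match"), "\n";
# With \b the 2-4 characters must be followed by a word boundary.
print "bounded: ", ($addr =~ /$bounded/ ? "match" : "no match"), "\n";

# \w accepts the lowline; the explicit ASCII class does not.
print "\\w matches '_':          ", ('_' =~ /^\w$/          ? "yes" : "no"), "\n";
print "[A-Za-z0-9] matches '_': ", ('_' =~ /^[A-Za-z0-9]$/ ? "yes" : "no"), "\n";
```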

csdude55

2:19 am on May 22, 2019 (gmt 0)




Psst! You don't need to escape . and + inside grouping brackets. (Hyphens depend on where in the group it is, and then on the exact RegEx engine, so play it safe and escape them regardless.)

Great tip, thanks! I think I knew that at some point, but haven't really worked in Perl much for the last few years. And I'm getting old. And I've developed a taste for dark beer :-D

In retrospect, if you put the - at the beginning or end then you don't need to escape it. So this is good:

[\w+.-]

But this is bad, because it would be seen as a range:

[\w-+.]
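A small sketch of the hyphen-position rule, using two illustrative classes. As far as I know, Perl itself treats the hyphen in [\w-+.] as a literal and only warns about a "False [] range", because \w cannot start a range; other engines may behave differently. A hyphen between two literal characters really is a range, though:

```perl
use strict;
use warnings;

# Hyphen last: it is a literal, so the class is \w, "+", "." and "-".
my $literal_hyphen = qr/^[\w+.-]$/;

# Hyphen between two literals: "+-." is the range "+" through ".",
# which also happens to include the comma.
my $range = qr/^[+-.]$/;

for my $ch ('-', ',') {
    printf "%s  hyphen-last: %-3s  range: %s\n", $ch,
        ($ch =~ $literal_hyphen ? 'yes' : 'no'),
        ($ch =~ $range          ? 'yes' : 'no');
}
```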

And I shouldn’t think you need to escape @ anywhere, unless this is in some weird dialect where @ has meaning.

Since it was in double quotes, I thought that Perl would think the @ was designating an array. But not escaping it seemed to work fine, so I guess not :-)

If you mean “no more than 4 characters”, as in “.com”, you should put a \b at the end. Otherwise there could be more of them after the pattern is done.

Or a \\b as the case may be.

Great tip again, thanks!

Don't you need to double-escape the final, and most crucial, \. then too?

Yup, good catch! It didn't throw an error, of course, so I missed that one...

How come it’s \w in one place and [A-Za-z0-9] in the other? Do you especially need to exclude lowlines and/or non-ASCII alphabetics?

Well, that's a good question. Are _ valid now in domain names? I'm getting mixed information on that when I Google it. Like this:

[ubuntu101.co.za...]

And I can't think of any domain extension that allows a _... but I can't think of any that allows a number, either, so maybe that should just be [a-zA-Z]{2,4}.

But then again, now that you can get a variety of weird extensions, I guess that's not even really right anymore. Someone could realistically have a .international now, right? I think it would be very rare for any of my users to have something other than a .com or .net, but not impossible. So I should probably drop the {2,4} altogether.

Maybe this would be better:

$emailPattern = "[\\w.+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z](\.[a-zA-Z])?\\b";


That's a-z, 0-9, _, ., +, or - (repeated)... followed by @... followed by a-z, 0-9, ., or - (repeated) for the domain name, followed by .whatever, and optionally followed by a second .whatever (to cover something like .co.za).

Edit: weird thing, though... with the last (\.[a-zA-Z])?, if I double-escape the . then it's not matching csdude@example.com, but if I single-escape then it matches csdude@example.com and csdude@example.co.za. What's up with that?

lucy24

3:16 am on May 22, 2019 (gmt 0)




\\.[a-zA-Z](\.[a-zA-Z])?
I think you need a plus sign in there, and again a double-escape:
\\.[a-zA-Z]+(\\.[a-zA-Z])?
Unless I've really misunderstood the type of pattern you're trying to match.

A non-escaped . --or a single-escaped one where a double escape is required-- means “any one character”. That's why it has to be escaped. (You knew that. You just forgot for a moment.)
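To see the effect, it helps to print what the regex engine actually receives. This is a rough sketch under a few assumptions: the character classes are written as a-zA-Z, the @ is escaped to keep Perl from reading it as an array sigil, and the second test address is made up to show the unescaped dot accepting an arbitrary character:

```perl
use strict;
use warnings;

# The three variants from this exchange, written as double-quoted strings.
my $single_escaped = "[\\w.+-]+\@[a-zA-Z0-9.-]+\\.[a-zA-Z](\.[a-zA-Z])?\\b";
my $double_escaped = "[\\w.+-]+\@[a-zA-Z0-9.-]+\\.[a-zA-Z](\\.[a-zA-Z])?\\b";
my $with_plus      = "[\\w.+-]+\@[a-zA-Z0-9.-]+\\.[a-zA-Z]+(\\.[a-zA-Z]+)?\\b";

# Printing the strings shows the pattern after Perl has eaten one level of backslashes.
print "$_\n" for $single_escaped, $double_escaped, $with_plus;

for my $addr ('csdude@example.com', 'csdude@example.c9m') {
    printf "%-20s single: %-3s double: %-3s plus: %s\n", $addr,
        ($addr =~ /$single_escaped/ ? 'yes' : 'no'),
        ($addr =~ /$double_escaped/ ? 'yes' : 'no'),
        ($addr =~ /$with_plus/      ? 'yes' : 'no');
}
```

In the single-escaped variant the group ends up as (.[a-zA-Z])?, so "om" in ".com" is swallowed as "any character plus a letter", and so is "9m" in the bogus address.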

csdude55

5:25 am on May 22, 2019 (gmt 0)




Sheesh... I gotta start sleeping more. Yeah, the + is what messed me up that time... which makes sense. Without the + and not double-escaping the second ., I was matching the required 3 characters. Duh.

So the final is:

$emailPattern = "[\\w.+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+(\\.[a-zA-Z]+)?\\b";

lucy24

6:16 am on May 22, 2019 (gmt 0)




Wait, wait, come back!
@[a-zA-Z0-9.-]+\\.
With that literal . in the group that comes immediately after the @, then the final, optional bit
(\\.[a-zA-Z]+)?
becomes superfluous, because the only requirement is that there be at least one literal period somewhere in the pattern; you’ve now set it up to allow for more than one anyway. It’s possible you meant something like
[\\w.+-]+@[a-zA-Z0-9-]+(\\.[a-zA-Z]+)+
or, more fancily,
[\\w.+-]+@[a-zA-Z0-9-]{2,}(\\.[a-zA-Z]{2,}){1,2}$

Now that you've dropped the {2,4} constraint, the \b is no longer needed. You might instead want a $ meaning “the end of the entire string that we’re evaluating”.

I don't remember if PCRE-as-such allows the open-ended {2,} form or if you have to pop in a second number like {2,20}. It's also a question of whether you will need to be very persnickety to weed out not-quite-right fakers, or if you only need to weed out entirely spurious strings created by someone's cat walking across the keyboard while they were in the middle of filling out the Contact form.
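Perl's own regex engine does accept the open-ended {2,} form. A minimal sketch comparing the \b ending with the $ ending, assuming qr// patterns with the @ escaped and made-up test strings:

```perl
use strict;
use warnings;

# Same pattern twice; only the ending differs.
my $word_end   = qr/[\w.+-]+\@[a-zA-Z0-9-]{2,}(\.[a-zA-Z]{2,}){1,2}\b/;
my $string_end = qr/[\w.+-]+\@[a-zA-Z0-9-]{2,}(\.[a-zA-Z]{2,}){1,2}$/;

for my $text ('user@example.com', 'user@example.com!!!', 'user@example.c') {
    printf "%-22s \\b: %-3s  \$: %s\n", $text,
        ($text =~ $word_end   ? 'yes' : 'no'),
        ($text =~ $string_end ? 'yes' : 'no');
}
```

With \b the trailing "!!!" is ignored, because a word boundary exists right after "com"; with $ the junk causes a rejection. The one-letter extension fails both, since {2,} demands at least two letters.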

csdude55

6:35 pm on May 22, 2019 (gmt 0)




Hmm, that's a very good point, @lucy24.

Which way would you do it?

# with the . in the first group
[\\w.+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+$

# minus the . in the first group, with the optional 3rd group
[\\w.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z]+(\\.[a-zA-Z]+)?$

# the two that you posted
[\\w.+-]+@[a-zA-Z0-9-]+(\\.[a-zA-Z]+)+$

[\\w.+-]+@[a-zA-Z0-9-]{2,}(\\.[a-zA-Z]{2,}){1,2}$


The first one that you posted seems like the most versatile in the long run... we used to see the occasional 1-character domain (overstock.com used to promote o.co), and while I don't expect any of them to be signing up on my site, it's not impossible that it would become more common in the future. But I can't imagine any 1-letter extensions, so maybe just:

[\\w.+-]+@[a-zA-Z0-9-]+(\\.[a-zA-Z]{2,}){1,2}$


(I'm 99% sure that using the open ended {2,} is fine)

I don't need to be too persnickety (haha)... like you said, I just need to protect against evil cats, or sometimes the people that don't know their own email address (I get a ton of "example@yahoo" type entries).
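A quick way to sanity-check a candidate like this is to run it over a handful of inputs. The sketch below assumes the pattern is stored single-quoted (so no double-escaping) and uses made-up sample addresses:

```perl
use strict;
use warnings;

# The last pattern from this post, single-quoted so the backslashes stay as written.
my $emailPattern = '[\w.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z]{2,}){1,2}$';

my @samples = (
    'csdude@example.com',
    'csdude@example.co.za',
    'someone@o.co',
    'example@yahoo',
    'asdfjkl;',
);

for my $addr (@samples) {
    printf "%-22s %s\n", $addr, ($addr =~ /$emailPattern/ ? 'accepted' : 'rejected');
}
```

One thing to be aware of: with {1,2} an address such as user@mail.example.co.za would be rejected, because only two dotted parts are allowed after the first label.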

lucy24

8:11 pm on May 22, 2019 (gmt 0)




get a ton of "example@yahoo" type entries
If you really do get a lot, you could always expand to

^(?!example)[\\w.+-]+@[a-zA-Z0-9-]+(\\.[a-zA-Z]{2,}){1,2}$
or even--if you find yourself vexed with stupid robots claiming to be from example.tld--
^(?!example)[\\w.+-]+@(?!example)[a-zA-Z0-9-]+(\\.[a-zA-Z]{2,}){1,2}$
The exact pattern is pretty much six of one, half a dozen of the other, whatever exact form looks best to you. You probably should include both ^ and $ anchors, since the point is to evaluate the whole string, not just pick out the part of it that matches. I expect you've already got something that strips away leading and trailing spaces.
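A rough sketch of how the anchored, lookahead version might be wrapped together with the whitespace trimming. The helper name looks_like_email is hypothetical, the pattern is written with qr// and an escaped @, and the inputs are made up:

```perl
use strict;
use warnings;

# Anchored at both ends; (?!example) refuses "example" at the start of
# the local part and at the start of the domain.
my $email_re = qr/^(?!example)[\w.+-]+\@(?!example)[a-zA-Z0-9-]+(\.[a-zA-Z]{2,}){1,2}$/;

# Hypothetical helper: trims surrounding whitespace, then tests the whole string.
sub looks_like_email {
    my ($input) = @_;
    (my $trimmed = $input) =~ s/^\s+|\s+$//g;
    return $trimmed =~ $email_re ? 1 : 0;
}

for my $input ('  bob@yahoo.com  ', 'example@yahoo.com', 'bob@example.com', 'examples@yahoo.com') {
    printf "%-22s %s\n", "'$input'", (looks_like_email($input) ? 'accepted' : 'rejected');
}
```

Note that the lookahead only checks a prefix, so a legitimate local part such as "examples" is refused as well.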

fishmonger

4:25 pm on Jun 22, 2019 (gmt 0)

5+ Year Member



I realize that I'm coming in late and you already have a solution, but here's some food for thought.

The @ symbol should be escaped when used in a double quoted string. And, you should probably use the qr operator.
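A minimal sketch of both suggestions, assuming one of the patterns discussed earlier in the thread and made-up addresses. With qr// the backslashes are written exactly as they would be between slashes, so no double-escaping is needed:

```perl
use strict;
use warnings;

# qr// yields a compiled regex; the @ is escaped so Perl never tries
# to interpolate an array inside the pattern.
my $email_re = qr/[\w.+-]+\@[a-zA-Z0-9-]+(?:\.[a-zA-Z]{2,}){1,2}/;

for my $text ('csdude@example.com', 'example@yahoo') {
    if ($text =~ /\A$email_re\z/) {
        print "$text: ok\n";
    } else {
        print "$text: rejected\n";
    }
}

# The same escaping applies to a double-quoted string literal.
my $contact = "webmaster\@example.com";   # made-up address
print "contact: $contact\n";
```

The compiled pattern can be embedded in a larger regex, as done here with the \A and \z anchors, or used directly with =~.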

Even the most "complete" regex given so far is very rudimentary in its ability to validate RFC822 addresses and will fail some properly formatted addresses. The better approach would be to use the well-established and proven Email::Valid module. [metacpan.org]
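For anyone who wants to try the module, here is a small sketch. It assumes Email::Valid is installed from CPAN (it is not part of core Perl), so it is loaded defensively. The wrapper name check_address is hypothetical and the addresses are made up. As I understand it, address() returns the address when it is acceptable and undef otherwise, and with default settings a domain that is not fully qualified should be refused, but the module's documentation is the authority here:

```perl
use strict;
use warnings;

# Email::Valid comes from CPAN, so load it defensively.
my $have_email_valid = eval { require Email::Valid; 1 };

# Hypothetical wrapper: returns 1 or 0, or undef when the module is missing.
sub check_address {
    my ($addr) = @_;
    return unless $have_email_valid;
    return defined( Email::Valid->address($addr) ) ? 1 : 0;
}

for my $addr ('csdude@example.com', 'example@yahoo') {
    my $result = check_address($addr);
    my $status = !defined $result ? 'Email::Valid not installed'
               : $result          ? 'valid'
               :                    'invalid';
    print "$addr: $status\n";
}
```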

Here's the regex it uses to validate RFC822 addresses.

# Regular expression built using Jeffrey Friedl's example in
# _Mastering Regular Expressions_ (http://www.ora.com/catalog/regexp/).

$RFC822PAT = <<'EOF';
[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)
EOF

$RFC822PAT =~ s/\n//g;

tangor

8:32 pm on Jun 22, 2019 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@fishmonger ...

Welcome to Webmasterworld!

Nice tips. :)