Forum Moderators: phranque


More fun with regex, matching multiple ( ) from IF condition

         

csdude55

6:00 pm on Jan 27, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let's say that I have this:

if ($foo =~ /(\w+)/ && $bar =~ /(\w+)/) { ... }

I originally thought that $1 would match the "foo" result and $2 would match the "bar" result, but nope! The "bar" result is $1 and overwrites the "foo" result.

Is there a way to get both of the matches without doing them as separate conditions?

I'm doing this in Apache, but I posted it here because I'm hoping for a solution that's not language specific :-)

lucy24

7:21 pm on Jan 27, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This may actually be language-specific. It looks as if each element is treated separately, so if you have
if ($foo =~ /(jiggery)(pokery)/ && $bar =~ /(hoity)(toity)/)
then $1 would be "jiggery", later replaced by "hoity", while $2 is "pokery" later replaced by "toity".
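For anyone who wants to poke at this outside Apache, here is a rough Python sketch of the same behaviour. Python is only a stand-in (it is not what Apache or Perl does internally), and the sample strings are the made-up ones from above. Reusing a single name for the match result mimics the shared $1/$2, while keeping one result per test preserves both sets of captures.

```python
import re

foo = "jiggerypokery"
bar = "hoitytoity"

# Reusing one name mimics the shared $1/$2: the later match replaces the earlier one.
m = re.search(r"(jiggery)(pokery)", foo)
m = m and re.search(r"(hoity)(toity)", bar)
shared = m.groups() if m else None

# Keeping each result under its own name preserves both sets of captures.
m_foo = re.search(r"(jiggery)(pokery)", foo)
m_bar = re.search(r"(hoity)(toity)", bar)
both = (m_foo.groups() + m_bar.groups()) if m_foo and m_bar else None

print(shared)  # ('hoity', 'toity')
print(both)    # ('jiggery', 'pokery', 'hoity', 'toity')
```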

Think of a RewriteCond, where only captures from the most recently matched Condition can be used in the target:
RewriteCond {foo} (jiggery)(pokery) [OR]
RewriteCond {bar} (hoity)(toity)
RewriteRule blahblah /%1%2 [L]
(simplified for posting purposes) If the first Condition matches but the second doesn't, then the target is "jiggerypokery"; if the second Condition matches--whether or not the first one did--then it’s "hoitytoity".

phranque

11:25 pm on Jan 27, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



this is similar to the case when using a repetition operator [* or +] on a capture group in a regular expression.
a repeated capture group is overwritten each time it matches, so only the last match is captured.
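a small Python sketch of that overwrite, assuming a toy input string that is not from the thread. the group below matches three times, but only the last repetition survives; matching the repeated unit on its own with findall() is one way to keep every hit.

```python
import re

text = "abcdef"  # hypothetical sample input

# The group matches three times (ab, cd, ef); only the last repetition is kept.
last_only = re.fullmatch(r"(ab|cd|ef)+", text).group(1)

# To keep every repetition, match the repeated unit by itself.
every_hit = re.findall(r"ab|cd|ef", text)

print(last_only)  # ef
print(every_hit)  # ['ab', 'cd', 'ef']
```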

csdude55

12:32 am on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The problem with RewriteCond, though, is that it runs out of order and won't match against ENVs set by SetEnvIf :-/

I guess one option would be to use named groups; eg,

if ($foo =~ /(?P<one>\w+)/ && $bar =~ /(?P<two>\w+)/) { ... }

I'm not sure if you can do that in Apache, though, or how to refer back to the named group :-/

If that's not an option then I guess my only option will be to do a series of conditions:

if ($foo =~ /(\w+)/) {
$one = $1;
}

if ($one && $bar =~ /(\w+)/) {
$two = $2;
}

if ($one && $two && $this =~ /(\w+)/) {
$three = $3;
}

// and so on

I hate doing that because, when I have several conditions like that, it has to run through ALL of them even after one of them has failed. And, of course, creating extra variables for a one-time use just takes up resources unnecessarily. I guess that in a programming language I could use an array instead of unique variables, but I don't think that's an option in Apache.

Blerg.
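For comparison, this is roughly what the "array instead of unique variables" idea could look like in a general-purpose language. It is only a Python sketch: the helper name capture_each and the sample values are made up for illustration. It stops at the first value that fails, so later values are never tested.

```python
import re
from typing import List, Optional

def capture_each(pattern: str, values: List[str]) -> Optional[List[str]]:
    """Return group 1 from each value, or None as soon as one value fails."""
    captured = []
    for value in values:
        m = re.search(pattern, value)
        if m is None:
            return None  # stop early; the remaining values are never tested
        captured.append(m.group(1))
    return captured

print(capture_each(r"(\w+)", ["foo!", "bar?", "this"]))  # ['foo', 'bar', 'this']
print(capture_each(r"(\w+)", ["foo!", "???", "this"]))   # None
```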

lucy24

2:41 am on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oops, sorry, I didn't mean to suggest that you should use mod_rewrite. I just brought it up as analogy to what happens in the OP with a later thing overwriting an older one. And phranque's point explains why repeating captures have to be expressed as ((blahblah)+) or ((?:blahblah)+) so you get the whole thing. (I have been bitten by this many a time.) But that, too, is an analogy.
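A quick Python sketch of the two wrapper forms, using "blahblahblah" as a made-up input. With a capturing inner group you get the whole run plus the last repetition; with a non-capturing inner group you get only the whole run.

```python
import re

text = "blahblahblah"  # hypothetical sample input

# Capturing inner group: group 1 is the whole run, group 2 only the last repeat.
m = re.fullmatch(r"((blah)+)", text)
whole, last = m.group(1), m.group(2)

# Non-capturing inner group: only the whole run is captured.
m2 = re.fullmatch(r"((?:blah)+)", text)

print(whole, last)  # blahblahblah blah
print(m2.groups())  # ('blahblahblah',)
```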

Do you really mean (\w+) or are you just simplifying for posting purposes? Otherwise it seems like you could just as well say
if ($foo && $bar && $this && $that)
and so on.

But, again, it probably will help to lay out the real-life issues behind the programming questions.

csdude55

6:54 am on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you really mean (\w+) or are you just simplifying for posting purposes? Otherwise it seems like you could just as well say
if ($foo && $bar && $this && $that)
and so on.

I'm technically using [A-Za-z] because any preferred match will be letters and I want it to end when it gets to any /. But since I'm trying to save what it matches, I have to have something there.

But, again, it probably will help to lay out the real-life issues behind the programming questions.

In this particular case, remember in a previous thread I'd figured out a hacky way to get environment variable values in the SetEnvIf by using:

SetEnvIfExpr "%{ENV:foo} =~ /(.+)/" this=$1


But then I ran into an issue where I needed to compare a few things and use multiple results; like so:

# breaks added for readability
SetEnvIfExpr "-z %{ENV:default} && %{ENV:linker} =~ /^example$/ && %{REQUEST_URI} =~ m#^/(?:foo:bar)([a-z]+)#i"
default=$1
home=%{ENV:home}/$1
'siteName=%{ENV:siteName} - %{ENV:ext} - %{ENV:home} - %{ENV:default}'

I now know that this won't work since I can't use the ENV variables in the values. But my previous hack doesn't work if I can only refer to the first result.

And in retrospect, my lengthier option of multiple conditions won't work, either, if I can't refer back to an ENV. I can't think of how I would append "home" with the value of "default" if I can't say home=$1/%{ENV:default}.

I thought maybe I could get $1 and $2 by concatenating the ENVs and then delimiting them in the expression, like so:

SetEnvIfExpr "%{ENV:home}|:|%{ENV:linker} =~ /([^|]+)\|:\|(.+)/" home=$1/$2

But that throws an error. I tried everything I could think of to join the variables, but nothing worked:

%{ENV:home}|:|%{ENV:linker}
%{ENV:home} . |:| . %{ENV:linker}
%{ENV:home} + |:| + %{ENV:linker}
(%{ENV:home} . |:| . %{ENV:linker})
(%{ENV:home} + |:| + %{ENV:linker})

Blerg :-/

csdude55

6:56 am on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oops, typo in my 12:32 am post; it should have said:

if ($foo =~ /(\w+)/) {
$one = $1;
}

if ($one && $bar =~ /(\w+)/) {
$two = $1;
}

if ($one && $two && $this =~ /(\w+)/) {
$three = $1;
}

// and so on

Sorry if that confused you!

lucy24

4:54 pm on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When I said “lay it out in English” I really did mean in English, before you get to Regular Expressions and SetEnvIf directives and configuration files: “If the user requests suchandsuch, and the existing file is thisandthat and the surrounding circumstances are thusandso, then take the following action”.

csdude55

6:23 pm on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm. OK, lemme give it a shot. It gets pretty complicated, so I'll try to simplify it the best that I can.

I set a bunch of variables in Apache. I used to set them in PHP, but I moved them here because it processed each page a LOT faster.

So first I set a bunch of default variables, just standards to be used unless a condition changes them later. I'll use "home" as an example because it's simple; it's literally "https://" + the lowercase version of HTTP_HOST. So home = "https://www.foo.com".

Next, I try to set "default". If QUERY_STRING contains a "default" param then I set "default" to that value; if not, I look for a specific keyword in the domain name. If there's no param or the keyword doesn't exist in the domain then "default" is not set. So if they go to foo.com (a pre-approved domain), default = "foo"; if they go to foo.com?default=bar, default = "bar".

Then, if the URL matches a list of specific domains with no keyword (eg, ex.net or example.com), "default" has not been set (so no param), AND the first subdirectory doesn't match a predefined list, then I set "default" to that first subdirectory and change "home" to "$home/$default". So if they go to example.com, "default" is not set; if they go to example.com/bar, default = "bar" and home is changed to "https://www.example.com/bar".
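To make those three steps concrete, here's a rough Python sketch of the logic. It is not my actual setup, just the same rules written out: the function name resolve, the KEYWORDS and RESERVED_DIRS lists, and the sample hosts are all placeholders, and I'm assuming the first subdirectory is letters only.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs

KEYWORDS = ("foo",)                        # hypothetical domain keywords
PLAIN_DOMAINS = ("ex.net", "example.com")  # domains with no keyword
RESERVED_DIRS = ("images", "css")          # hypothetical excluded subdirectories

def resolve(host: str, path: str, query: str) -> Tuple[str, Optional[str]]:
    host = host.lower()
    home = "https://" + host

    # 1. A "default" query parameter wins.
    params = parse_qs(query)
    default = params["default"][0] if "default" in params else None

    # 2. Otherwise look for a keyword in the domain name.
    if default is None:
        default = next((k for k in KEYWORDS if k in host), None)

    # 3. Otherwise, on a plain domain, fall back to the first subdirectory
    #    and append it to "home".
    if default is None and host.endswith(PLAIN_DOMAINS):
        m = re.match(r"/([a-z]+)", path, re.IGNORECASE)
        if m and m.group(1).lower() not in RESERVED_DIRS:
            default = m.group(1)
            home = home + "/" + default

    return home, default

print(resolve("www.foo.com", "/", "default=bar"))  # ('https://www.foo.com', 'bar')
print(resolve("www.example.com", "/bar", ""))      # ('https://www.example.com/bar', 'bar')
```

Step 3 is the part that reads the current value of "home" and appends to it.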

That's where I'm hitting a stumbling block; appending to "home" when I can't use the value of "home".

I can't manually write it since there are 2 potential domains, unless I write 2 identical rules (one for each domain).

But you see, this is my simple example. I have a more complicated variable where I literally build the logo based on this criteria, too. That one uses 4 ENV variables whose values I need to access, and I don't think it's physically possible to write a unique rule for each of them.

Better, or still too much?

lucy24

7:43 pm on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't manually write it since there are 2 potential domains, unless I write 2 identical rules (one for each domain).
Sometimes writing the same rule twice may be less trouble than putting together a single hideously complicated rule to do it all at once. Similarly, it is often simpler to first do something globally, and then un-do it in selected situations, rather than put together a “do this thing” rule that does the picking-and-choosing in advance. Obvious example: set a bunch of environmental variables, and then un-set them (“poke holes”) as needed.

Even though config has an advantage over htaccess in that all those Regular Expressions only have to be compiled once (a point that phranque made in one of these threads), it isn't an infinite advantage: the rules still have to be executed.

I have a more complicated variable where I literally build the logo based on this criteria, too.
OK, see, that’s what I was getting at. What is the ultimate purpose of all these environmental variables?

In all honesty, I myself would long since have thrown in the towel and done the whole thing in php-or-similar ;)

csdude55

9:07 pm on Jan 28, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What is the ultimate purpose of all these environmental variables?

That's a little harder to explain, and I don't really want to say it all publicly because I've said a LOT on here, and posting my identity with all of that could come back to haunt me :-O But in short, I have a little over 100 domains parked on top of the main domain, and then I show content based on "default". There are 20-25 variables for each potential "default".

In all honesty, I myself would long since have thrown in the towel and done the whole thing in php-or-similar ;)

That's how it was originally built (with all of the variables in MySQL, which was then read by PHP or Perl), but then I discovered that moving them to Apache was causing a much smaller load on my server and each page was loading faster. Like, a LOT faster! And in my case, load time has a direct impact on pages per session, which has a direct impact on my Adsense revenue. So I began moving as many variables to Apache as I could, using mod_rewrite.

I thought it was all good until a couple of weeks ago, when I discovered a bug in the system. In trying to fix that I kept hitting stumbling blocks that led me to using IF-ELSE, but then THAT hit a stumbling block that led me to use SetEnvIf. And now the order of operations is messing me up; I realize that if RewriteRule would process before SetEnvIf then everything would be fine! But nooooooo...

I realize that I've probably gone a LOT farther than Apache intended, but I'm too deep to turn back now! LOL

phranque

10:01 pm on Jan 28, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i agree with lucy24 that you should go back to using perl or PHP as a solution.
you're trying to carve something with a hammer.
there is certainly a solution for making your script more responsive.

csdude55

5:24 am on Feb 1, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



FYI, I had a solution on this one given to me on another site. I was pretty close with this:

SetEnvIfExpr "%{ENV:home}|:|%{ENV:linker} =~ /([^|]+)\|:\|(.+)/" home=$1/$2

but the way that works is:

SetEnvIfExpr "env('home') . '|:|' . env('linker') =~ /([^|]+)\|:\|(.+)/" home=$1/$2

Or more accurately, using negative lookahead to look for the multi-character delimiter and making "linker" optional:

SetEnvIfExpr "env('home') . '|:|' . env('linker') =~ /((?!\|:\|).+?)\|:\|(.*)/" home=$1/$2

In theory I could add as many as I wanted, delimited by |:| (or whatever else you want, as long as it would never, ever be in the value of the string).
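If anyone wants to test the join-and-split idea outside Apache, here's a small Python sketch. It reuses the same pattern as above; the sample values, the third "site name" value, and the way the longer pattern is built are just my assumptions for illustration.

```python
import re

DELIM = "|:|"  # must never occur inside any of the values

home = "https://www.example.com"  # sample values, not real config
linker = "bar"

# Same idea as env('home') . '|:|' . env('linker')
joined = home + DELIM + linker

# Same pattern as the SetEnvIfExpr above: group 1 is "home", group 2 is "linker".
m = re.match(r"((?!\|:\|).+?)\|:\|(.*)", joined)
new_home = m.group(1) + "/" + m.group(2)
print(new_home)  # https://www.example.com/bar

# Extending to three values: one lazy group per value, last one optional.
values = [home, linker, "Example Site"]
pattern = re.escape(DELIM).join(["(.+?)"] * (len(values) - 1) + ["(.*)"])
parts = re.fullmatch(pattern, DELIM.join(values)).groups()
print(parts)  # ('https://www.example.com', 'bar', 'Example Site')
```

Note that with an empty "linker" the second group is just an empty string, so home would end up with a trailing slash.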

Should I do it this way? I dunno, that seems to be a matter of contention 'round these here parts. But now I know that I can do it, anyway :-)