Fun with regex, getting the entire last number

Forum Moderators: phranque

csdude55

7:42 pm on Dec 23, 2022 (gmt 0)

I've been dealing with this one for 2 days and it's still not right!

Let's say that I have:

$str = 'csdude555';

I want a regex to match the opening string and the trailing number separately; eg:

$1 = 'csdude';
$2 = '555';

I had this:

/([A-Za-z_-]+)(\d+)$/

but since it's possible to have a number within the opening string ('cs5dude55') then I had to change it to:

/([\w-]+)(\d+)$/

But that only gives me the very last numeric digit; eg, '5' instead of '555'.

Any suggestions on a modification to the regex to match the entire trailing number? My only thought was to put it in a while loop and keep grabbing the last number (then prepending it to a different string) until there aren't any trailing numbers left.
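For anyone following along, the behaviour described above is easy to reproduce. This is only a minimal sketch, written in Python rather than Perl for illustration; the pattern is the second one from the post, and the helper name split_greedy is made up. The greedy [\w-]+ swallows as many characters as it can, digits included, and only gives back the single digit that \d+ needs in order to match.

```python
import re

# The second pattern from the post, unchanged.
GREEDY = re.compile(r'([\w-]+)(\d+)$')

def split_greedy(s):
    # Hypothetical helper: return (opening string, trailing number) or None.
    m = GREEDY.search(s)
    return m.groups() if m else None

print(split_greedy('csdude555'))  # ('csdude55', '5')
print(split_greedy('cs5dude55'))  # ('cs5dude5', '5')
```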

robzilla

8:04 pm on Dec 23, 2022 (gmt 0)

it's possible to have a number within the opening string ('cs5dude55')

What if there's a number within but at the very end of the string?

A while loop could work, or a single regex to only fetch the last digits and then strip that from the end of the string to get the first part.
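The second option (fetch the last digits, then strip them off) might look roughly like this. It is a sketch in Python for illustration only, and the function name split_trailing_digits is hypothetical.

```python
import re

def split_trailing_digits(s):
    # Find the run of digits at the very end of the string, if there is one.
    m = re.search(r'\d+$', s)
    if m is None:
        return s, ''
    # Everything before that run is the first part.
    return s[:m.start()], m.group()

print(split_trailing_digits('cs5dude55'))  # ('cs5dude', '55')
print(split_trailing_digits('csdude'))     # ('csdude', '')
```

Because nothing before the digits has to match, this also copes with a string that is all digits.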

phranque

8:42 pm on Dec 23, 2022 (gmt 0)

i would try something like this:

/([\w-]*[^\d]+)(\d+)$/

csdude55

10:02 pm on Dec 23, 2022 (gmt 0)

That seems to work great, @phranque, thanks! I had to make a slight modification, though, to match in case there's no trailing number:

/([\w-]*[^\d]+)(\d*)$/
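A quick check of the modified pattern, sketched in Python for illustration (split_name is a made-up helper name). One thing the sketch suggests is worth keeping in mind: the first group still needs at least one non-digit character, so a string made up only of digits does not match at all.

```python
import re

# The modified pattern from this post, unchanged.
OPTIONAL_NUMBER = re.compile(r'([\w-]*[^\d]+)(\d*)$')

def split_name(s):
    # Hypothetical helper: return (opening string, trailing number) or None.
    m = OPTIONAL_NUMBER.search(s)
    return m.groups() if m else None

print(split_name('csdude'))     # ('csdude', '')
print(split_name('cs5dude55'))  # ('cs5dude', '55')
print(split_name('555'))        # None
```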

What if there's a number within but at the very end of the string?

@robzilla, do you mean at the end of the opening string? I would want it to be within $2, not $1.

A while loop could work, or a single regex to only fetch the last digits and then strip that from the end of the string to get the first part.

I had gone the route of the while loop, like so (in Perl):

while ($str =~ /[0-9]$/) {
 ($str, $number) = $str =~ /(.*)(\d)$/;  # strip one trailing digit from $str per pass
 $append = $number . $append;            # rebuild the number from right to left
}

And while that worked, in production I'll be looping over about 500,000 iterations of $str! So that could create a pretty big series of loops that I really would like to avoid. I think phranque's suggestion will work for me, though, so that'll save me a ton of stress :-D

Thanks for the quick replies!

lucy24

10:03 pm on Dec 23, 2022 (gmt 0)

Edit: Overlapping again, darn it!

/([\w-]+)(\d+)$/

You'll need to say

/([\w-]*\D)(\d+)$/

instead. This is essentially the same thing phranque said, under the head of Individual Coding Preferences. But I'm pretty sure all RegEx engines recognize the \X = [^\x] syntax.

csdude55

6:51 am on Dec 24, 2022 (gmt 0)

To be clear, @lucy24, phranque's had [^\d]+. I understand that \D is equivalent to [^\d], but do you also suggest leaving off the +?

In preliminary tests it works just fine (eg, "cs1234dude55"), but I wanted to make sure that I'm not overlooking something. I honestly don't understand why either work, anyway, so it's tough for me to read it and understand the logic.

phranque

8:23 am on Dec 24, 2022 (gmt 0)

do you also suggest leaving off the +?

six of one, half a dozen of the other.
my suggestion means the first group ends in "one or more non-digit characters" whereas lucy24's suggestion specifies that it ends in exactly "one non-digit character", but the preceding [\w-]* will include any alphanumerics that precede the final non-digit character.

I honestly don't understand why either work, anyway, so it's tough for me to read it and understand the logic.

the solutions suggested by lucy24 and i require that the first capture group ends in a non-digit character.
therefore the second capture group will contain any and all trailing digits in the string.
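To see that the two suggestions agree, here is a small comparison on the sample strings from this thread. It is a sketch in Python for illustration; the constant names PHRANQUE and LUCY24 are just labels. The two patterns are only checked against these inputs, not proven equivalent for every possible string.

```python
import re

PHRANQUE = re.compile(r'([\w-]*[^\d]+)(\d+)$')
LUCY24 = re.compile(r'([\w-]*\D)(\d+)$')

for s in ('csdude555', 'cs5dude55', 'cs1234dude55', 'a1'):
    # Both patterns force group 1 to end on a non-digit,
    # so group 2 receives the whole trailing number.
    assert PHRANQUE.search(s).groups() == LUCY24.search(s).groups()
    print(s, LUCY24.search(s).groups())
```

The last sample, a1, has a one-character opening string, so [\w-]* matches nothing and the non-digit part supplies all of the first group.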

robzilla

11:21 am on Dec 24, 2022 (gmt 0)

@robzilla, do you mean at the end of the opening string?

Yes. If there can be a number within the opening string, can't there also be a number at the very end of it? That would make it impossible to separate the final number from the opening string. Example: a username formula122 where formula1 is the opening string and 22 the ID or whatever. But perhaps this doesn't apply to your situation.

lucy24

6:13 pm on Dec 24, 2022 (gmt 0)

do you also suggest leaving off the +

In my version + is replaced with * to allow for the possibility that the first part of the string is just one character, so the \D or [^\d] is all you get.

csdude55

6:31 pm on Dec 24, 2022 (gmt 0)

the solutions suggested by lucy24 and i require that the first capture group ends in a non-digit character.
therefore the second capture group will contain any and all trailing digits in the string.

Am I the only one that lays in bed and can't sleep, cause I'm going over code in my head?

It clicked at around 6am, while I was staring into the darkness. The [\w-]* would match any letter, number, underscore, or -, so it catches those last numbers. Adding the \D in there ensures that the match ends on something that's not a number. Makes sense now :-D

If there can be a number within the opening string, can't there also be a number at the very end of it?

@robzilla, in your example of formula122, I would want:

$1 = formula
$2 = 122

I get what you're saying, but I don't think it's applicable to mine.

My end goal here is that I'm creating a unique URL-friendly identifier for every username. The usernames are unique, of course, but for this identifier I'm removing anything following an @, then converting anything that's not [\w-] to a -.

While I might have a username of "csdude55" and another of "csdude55@gmail.com", when I remove anything following the @ they would be the same. So to fix that, I'm stripping off the last number and incrementing it by 1 (or, if a number doesn't exist, I just append a 1).

The regex here applies to the issue when the string ends with, say, 99. Stripping off the last digit, adding 1, then appending it back to the string would make it 910, where I wanted 100. So it got more complicated, requiring me to strip off the entire number.
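Putting the steps described above together, a rough sketch could look like the following. It is in Python for illustration, and every name in it (base_identifier, bump, unique_identifier, the taken set) is hypothetical. Note that bump uses a digits-only match at the end of the string, closer to robzilla's suggestion than to the pattern adopted in this thread, so that an identifier consisting only of digits is handled too. It also assumes leading zeros in the number do not need to be preserved.

```python
import re

def base_identifier(username):
    # Drop everything from the first @ onward, then turn anything
    # outside [\w-] into a hyphen.
    local = username.split('@', 1)[0]
    return re.sub(r'[^\w-]', '-', local)

def bump(identifier):
    # Split off the whole trailing number (possibly empty) and add 1;
    # with no trailing number, append 1.
    m = re.search(r'\d*$', identifier)
    head, number = identifier[:m.start()], m.group()
    return head + str(int(number) + 1 if number else 1)

def unique_identifier(username, taken):
    # Assumption: `taken` is a set of identifiers already in use.
    candidate = base_identifier(username)
    while candidate in taken:
        candidate = bump(candidate)
    return candidate

print(bump('csdude99'))  # csdude100
print(unique_identifier('csdude55@gmail.com', {'csdude55'}))  # csdude56
```

Incrementing the whole number is what avoids the 910 result mentioned above: 99 becomes 100 rather than 9 followed by 10.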

lucy24

5:38 am on Dec 25, 2022 (gmt 0)

Makes sense now

Yay!