Help with the logic on this RegExp

Why does this need to be in a loop?

         

csdude55

7:28 pm on Dec 1, 2021 (gmt 0)




I'm hoping you all can help me understand the logic of this scenario. I'm specifically looking at this in JavaScript, but I don't think the language really matters:

var a = `
Test this one:
<a href="www.google.com/?id=1234&utm_foo=ONE&utm_bar=TWO&start=0">one</a>
and this one:
<a href="www.google.com/?id=4321&ocid=THREE&startview=20&gclid=FOUR&start=0">two</a>
`.trim();

// remove params utm_\w+, ocid, trkid, and gclid
// split across rows for readability here; a JS regex literal can't actually span lines, so the real one is on a single line
var utm_match = /
// $1
(
<a[^>]+href=
// $2
(
" |
'
)
[^>]+
[?&]
)
(?:
(?:
// the list of params to be removed
utm_\w+? |
ocid |
trkid |
gclid
)
=[^\2&]+&?
)
// $3
(
[^>]*>
)/gi;

while (utm_match.test(a))
  a = a.replace(utm_match, '$1$3');


Live:
[jsfiddle.net...]

In a while loop, the regex removes all of the param values listed properly. But if I remove the while, it only removes utm_bar=TWO and gclid=FOUR (presumably, the last ones in each link).
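
To make the difference concrete, here is a minimal sketch of the two cases side by side. The regex is the same one as above, just collapsed onto a single line; `makeUtmMatch`, `stripOnce` and `stripLooped` are placeholder names made up for this comparison.

```javascript
// Same pattern as above, collapsed onto one line so it is a valid JS literal.
// A fresh RegExp is returned on each call so no lastIndex state is shared.
const makeUtmMatch = () =>
  /(<a[^>]+href=("|')[^>]+[?&])(?:(?:utm_\w+?|ocid|trkid|gclid)=[^\2&]+&?)([^>]*>)/gi;

const input = `
Test this one:
<a href="www.google.com/?id=1234&utm_foo=ONE&utm_bar=TWO&start=0">one</a>
and this one:
<a href="www.google.com/?id=4321&ocid=THREE&startview=20&gclid=FOUR&start=0">two</a>
`.trim();

// Single pass: rely on /g alone.
function stripOnce(html) {
  return html.replace(makeUtmMatch(), '$1$3');
}

// Repeated passes: keep replacing until nothing matches any more.
function stripLooped(html) {
  const re = makeUtmMatch();
  while (re.test(html)) {
    html = html.replace(re, '$1$3');
  }
  return html;
}

// The full text of every match found in one /g pass over the input.
const matches = [...input.matchAll(makeUtmMatch())].map((m) => m[0]);

console.log(stripOnce(input));   // utm_foo=ONE and ocid=THREE are still there
console.log(stripLooped(input)); // all four tracking params are gone
console.log(matches);            // two entries, each a whole opening <a ...> tag
```

The last log shows that a single pass finds exactly one match per link, and that each match covers the entire opening tag rather than just one param.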

The question is... why? Why does it need to be in a loop, instead of /g doing the work?

NickMNS

2:23 am on Dec 2, 2021 (gmt 0)




A regex-free solution. ES6 notation because it's time to move on, but it can be done in legacy JS; it'll just be a lot uglier.

const div = document.createElement('div')
// let the browser parse the markup; the div is never attached to the page
div.innerHTML = a
const links = Array.from(div.querySelectorAll('a'))
links.forEach((link) => {
  // link.href is already a string, so no toString() is needed
  if (link.href.includes('utm_')) {
    const [url, params] = link.href.split('?')
    console.log(
      [
        url,
        // keep only the params that do not contain "utm_"
        params.split('&').filter(param => !param.includes('utm_')).join('&')
      ].join('?')
    )
  } else {
    console.log(link.href)
  }
})


And a fiddle to boot:
[jsfiddle.net...]

I should add an explanation.
The first step is to create an element with the DOM API, so that the browser does the heavy lifting of parsing the markup. You can think of it as a kind of "shadow DOM" (loosely speaking, not the actual Shadow DOM API), as the element is never attached to the page or displayed. Then find the <a> tags, take each href attribute, split it by "?" and then by "&", take out what you don't need, and glue it all back together.
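
The split/filter/join step on its own, as a rough sketch that runs without a DOM. `stripTrackingParams` and `isTracked` are names made up for illustration, and the list of params is taken from the original post. It assumes the URL has at most one "?" and no "#fragment".

```javascript
// Params to drop, per the original post: utm_*, ocid, trkid, gclid.
const isTracked = (param) => /^(?:utm_\w+|ocid|trkid|gclid)=/i.test(param);

// Hypothetical helper: split on "?", then on "&", filter, and glue back together.
// Assumes at most one "?" in the URL and no "#fragment".
function stripTrackingParams(url) {
  const [base, query] = url.split('?');
  if (query === undefined) return url; // no query string, nothing to strip
  const kept = query.split('&').filter((param) => !isTracked(param));
  return kept.length > 0 ? `${base}?${kept.join('&')}` : base;
}

console.log(stripTrackingParams('www.google.com/?id=1234&utm_foo=ONE&utm_bar=TWO&start=0'));
// www.google.com/?id=1234&start=0
console.log(stripTrackingParams('www.google.com/?id=4321&ocid=THREE&startview=20&gclid=FOUR&start=0'));
// www.google.com/?id=4321&startview=20&start=0
```

Because each param is tested on its own, one pass over the list is enough and no loop is needed.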

One big caveat: if the string comes from user input, you really need to be careful about what is in it before parsing it, because it could be used to inject code into your page. So this may not be the ideal solution for you. But it really depends on where the string comes from and, to a lesser extent, what you plan on doing with it afterwards.