Forum Moderators: coopster


Function to clean up a string using preg replace

Leaves me with spaces

         

trillianjedi

12:50 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi all,

I've built a PHP function to strip certain characters from strings for an application I'm building.

My intention is to remove anything that is not alphanumeric (case insensitive) or one of certain functional characters, replacing it with nothing.

Some searching around the net landed me with this (which I've tweaked slightly from the original).

function stripJunk($junk) {
    $cleanedString = preg_replace("/[^A-Za-z0-9\s\s+\.\-\/+\!;\n\t\r]/","", $junk);

    return $cleanedString;
}

This works great, but replaces some of these chars with a space and I don't understand why?

Any pointers ?

Here's some test code displaying this:-

<?php

function stripJunk($junk) {
    $cleanedString = preg_replace("/[^A-Za-z0-9\s\s+\.\-\/+\!;\n\t\r]/","", $junk);

    return $cleanedString;
}

$string = "This & contains () %$£ junk and test 1234 text in the' ' \" same sentence!?\!±";

echo $string."\n";
echo stripJunk($string)."\n";

?>

For me this outputs:-

Edit: WebmasterWorld strips out the double chars, but basically the ( and the ) and some others are replaced with spaces, rather than with nothing, which is what I intended.

Thanks!

rocknbil

3:43 pm on Aug 9, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know if you're seeing it right. I ran the code as posted. First I got an undefined variable due to "...$....", so changed the string to single quotes:

$string = 'This & contains () %$£ junk and test 1234 text in the\' \' " same sentence!?\!±';

then noted how the last bit, "sentence!?\!±" correctly left no spaces:

sentence!

Site is zapping my double exclamation points, carrying on . . .

So I did a slight mod to NOT replace the pipe character, by adding it to the character class,

$cleanedString = preg_replace("/[^A-Za-z0-9\s\s+\.\-\/+\!;\n\t\r|]/","", $junk);

And inserted it as a "marker" surrounding all the "junk" so I could see what's getting replaced:

$string = 'This |&| contains |()| |%$£| junk and test 1234 text in the|\'| |\'| " same sentence|!?\!±|';

... and noted the characters are being replaced with nothing, but what's left is the space characters that were already sitting on either side of them. As you said, you can't see them here, but you can see by the pipes there's "nothing" where the subs were.

This || contains || || junk and test 1234 text in the|| || same sentence|!|
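Another way to see the leftover spaces, sketched here under a couple of assumptions: showSpaces() is a hypothetical helper that swaps each space for an underscore, and the character class is a trimmed-down one meant to keep the same characters as the function above.

```php
<?php
// Hypothetical helper: turns every space into "_" so runs of spaces can be counted.
function showSpaces(string $text): string {
    return str_replace(' ', '_', $text);
}

// Keeps letters, digits, whitespace and . - / + ! ; and strips everything else.
function stripJunk(string $junk): string {
    return preg_replace('/[^A-Za-z0-9\s.\-\/+!;]/', '', $junk);
}

$before = 'contains () junk';
$after  = stripJunk($before);

echo showSpaces($before), "\n"; // contains_()_junk
echo showSpaces($after), "\n";  // contains__junk
```

The two underscores in the second line are the spaces that were on either side of the brackets. Nothing was replaced with a space; the spaces were simply there all along.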

So I think you're going to have to add a second preg_replace that collapses runs of whitespace, immediately after stripping out the unwanted characters,

<?php
function stripJunk($junk) {
    // Strip everything that is not in the allowed set.
    $cleanedString = preg_replace("/[^A-Za-z0-9\s\.\-\/+\!;\n\t\r]/","", $junk);
    // Collapse each run of whitespace into a single space.
    $cleanedString = preg_replace("/\s+/"," ",$cleanedString);
    return $cleanedString;
}

$string = 'This & contains () %$£ junk and test 1234 text in the\' \' " same sentence!?\!±';

echo $string."\n";
echo stripJunk($string)."\n";
?>

Which gives you single spaces between all words:

This contains junk and test 1234 text in the same sentence!
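If two separate calls feel clumsy, preg_replace also accepts arrays of patterns and replacements and applies them in order. A minimal sketch, assuming that leading and trailing whitespace should go as well: the trim() and the name stripJunkCollapsed are additions for illustration, not part of the function above.

```php
<?php
// Hypothetical variant: strip junk, collapse whitespace, then trim the ends.
function stripJunkCollapsed(string $junk): string {
    $patterns = [
        '/[^A-Za-z0-9\s.\-\/+!;]/', // 1. drop anything outside the allowed set
        '/\s+/',                    // 2. squeeze each run of whitespace
    ];
    $replacements = ['', ' '];

    return trim(preg_replace($patterns, $replacements, $junk));
}

echo stripJunkCollapsed('This & contains () junk and test 1234 text!'), "\n";
// This contains junk and test 1234 text!
```

Bear in mind that \s+ also swallows newlines and tabs, so multi-line input comes back as a single line.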

Note that:

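-- A-Z with the i (case insensitive) modifier is identical to A-Za-z. \w is close but not identical: it matches A-Za-z0-9 plus the underscore, so the third version below also keeps underscores (and its \d is redundant). The first two do exactly the same thing: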

$cleanedString = preg_replace("/[^A-Za-z0-9\s\.\-\/+\!;\n\t\r]/","", $junk);

$cleanedString = preg_replace("/[^A-Z\d\s\.\-\/+\!;\n\t\r]/i","", $junk);

$cleanedString = preg_replace("/[^\w\d\s\.\-\/+\!;\n\t\r]/i","", $junk);

-- \d is identical to 0-9.

-- In the original regexp, \s\s+ is redundant. Outside a character class, \s+ is "one or more whitespace characters" and \s is a single one. But since your regexp is a class, basically saying "anything that is not these characters," quantifiers don't apply inside it: the + is just a literal plus sign, and you really only need the single \s. \s also already covers \n, \t and \r, so listing those is harmless but unnecessary. There are some cases where you may want to use "zero or more," \s* (but this is not one of them.)
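A quick way to check these notes is to run the variants side by side. The sketch below uses made-up sample strings and shortened classes, and it is worth noticing two things in the output: \w keeps the underscore, and a + inside square brackets is a literal plus sign, not a quantifier.

```php
<?php
$input = 'a_b & 1+1';

// Explicit ranges and the /i form behave the same: the underscore is stripped.
$explicit  = preg_replace('/[^A-Za-z0-9\s+]/', '', $input);
$caseless  = preg_replace('/[^A-Z\d\s+]/i', '', $input);
// \w covers letters, digits AND "_", so the underscore survives here.
$wordClass = preg_replace('/[^\w\s+]/', '', $input);

echo $explicit, "\n";  // ab  1+1
echo $caseless, "\n";  // ab  1+1
echo $wordClass, "\n"; // a_b  1+1

// Inside [...] the + is a literal character: with it, plus signs are kept.
echo preg_replace('/[^a-z\s+]/', '', 'a + b ++ c'), "\n"; // a + b ++ c
echo preg_replace('/[^a-z\s]/', '', 'a + b ++ c'), "\n";  // a  b  c

// \s on its own already matches space, tab, carriage return and newline.
var_dump(preg_match('/^\s+$/', " \t\r\n") === 1); // bool(true)
```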

An alternative: you can play with zero or more spaces around your "junk" to try and get it in a single regexp, like

$cleanedString = preg_replace("/\s*[^\w\d\s\.\-\/+\!;\n\t\r]\s*/i","", $junk);

But this may give unexpected results, removing spaces where you wouldn't want it to:

$string='Oops &no ±spaces';
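To make the difference concrete, here is a small comparison of the one-step pattern against the two-step approach on that string. It is only a sketch: the character class is a shortened one with explicit ranges, and the variable names are illustrative.

```php
<?php
$string = 'Oops &no ±spaces';

// One step: the \s* on either side eats the spaces next to each junk character.
$oneStep = preg_replace('/\s*[^A-Za-z0-9\s.\-\/+!;]\s*/', '', $string);

// Two steps: strip the junk first, then collapse whatever whitespace is left.
$twoStep = preg_replace('/[^A-Za-z0-9\s.\-\/+!;]/', '', $string);
$twoStep = preg_replace('/\s+/', ' ', $twoStep);

echo $oneStep, "\n"; // Oopsnospaces
echo $twoStep, "\n"; // Oops no spaces
```

The one-step version glues the words together because the only spaces between them happened to sit next to a junk character.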

trillianjedi

5:12 pm on Aug 9, 2009 (gmt 0)




Bill - many thanks for your help.

I ran the code as posted. First I got an undefined variable due to "...$....",

I think WebmasterWorld changes some things and messes with the char set (quite rightly, I might be malicious :) ).

Your version of the function with the additional space preg works great for me.

Thanks also for the pointers on what the regexes all mean. Unfortunately I never write enough of them to get it memorised.

rocknbil

3:30 pm on Aug 10, 2009 (gmt 0)




I ran the code as posted. First I got an undefined variable due to "...$....",

I think WebmasterWorld changes some things and messes with the char set ...

Just a note, in that case that wasn't what was happening. When you double quote strings,

$string = "this is my text $£";

Variables are interpolated. So it was expecting $£ to be a variable, and it was not defined.

When you single quote,

$string = 'this is my text $£';

$£ is not interpolated and is handled as a literal character.
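The same behaviour is easier to see with an ordinary variable. A short sketch, where $price is just an illustrative name:

```php
<?php
$price = 5;

echo "Total: $price", "\n";  // Total: 5       (double quotes: variable is interpolated)
echo 'Total: $price', "\n";  // Total: $price  (single quotes: literal text)
echo "Total: \$price", "\n"; // Total: $price  (escaped dollar sign stays literal)
```

So for test strings full of odd characters, single quotes (or a backslash before each dollar sign) avoid the surprise.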