Find this range of Unicode characters

         

csdude55

7:16 am on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm looking at someone else's code that modifies HTML entities to decimal references, using a range of:

[\u0080-\u024F]

I know that the \u means something like "use Unicode", but what I can't figure is what the range of 0080-024F is.

Can you suggest how I could print every character within that range to see what they are? And possibly expand it to other non-English characters if necessary?

not2easy

12:56 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



When I need to look up seldom used things I have found Wikipedia useful: https://en.wikipedia.org/wiki/Unicode
May or may not help for your purpose, but offers a lot of further roads until Lucy can help.

lucy24

5:41 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: detour to Character Viewer (which is now called something else, but to heck with that) ::

0080-009F are the C1 control characters in Unicode: nonprinting, and not allowed as numeric character references in HTML. That causes problems if you’re trying to reinterpret (not convert) from a one-byte encoding such as 1252, which does use this range for printable characters. They can’t possibly have named entities.

00A0-00FF is Latin-1.

0100-024F takes you through Latin-Extended-A and Latin-Extended-B, the next two Unicode blocks. Roughly speaking, they’re the less common diacritics, plus assorted things that look like letters and are probably used mainly by linguists.

I don’t think most of the latter even have named html entities, while quite a few entities--for example the Greek letters--live much further up, well beyond 024F.
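To see the range on screen, something like the following could be pasted into a browser console or run under Node. It is only a sketch: `BLOCKS` and `listRange` are names made up for this example, and the block boundaries are the ones given above. It also shows the hex-to-decimal relationship, since `\u0080` to `\u024F` is simply 128 to 591 written in hexadecimal.

```javascript
// Block boundaries as described above (inclusive code point ranges).
const BLOCKS = [
  { name: 'C1 controls', start: 0x0080, end: 0x009f },
  { name: 'Latin-1 Supplement', start: 0x00a0, end: 0x00ff },
  { name: 'Latin Extended-A', start: 0x0100, end: 0x017f },
  { name: 'Latin Extended-B', start: 0x0180, end: 0x024f },
];

// Hypothetical helper: returns one row per code point in [start, end].
function listRange(start, end) {
  const rows = [];
  for (let cp = start; cp <= end; cp++) {
    const block = BLOCKS.find((b) => cp >= b.start && cp <= b.end);
    rows.push({
      hex: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
      decimal: '&#' + cp + ';',
      char: String.fromCodePoint(cp),
      block: block ? block.name : 'other',
    });
  }
  return rows;
}

// Print the printable part of the range; the C1 controls are skipped
// because they have no visible glyph.
for (const row of listRange(0x00a0, 0x024f)) {
  console.log(row.hex, row.decimal, row.char, row.block);
}
```

Widening the two arguments to `listRange` is enough to look at other blocks; anything outside the four listed ones is labelled "other".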

csdude55

6:30 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I know that I can easily get all of the decimals by printing � through ✏ to the screen and see what they are, so my original plan was to convert anything that looks like an English letter to an actual English letter.

But as my list hit 3,000 (most of them on the subjective side of things) and I still have a long way to go, I realized that I'm probably going backwards: I should whitelist instead of blacklist.

So now the idea is to convert anything that's not a letter, number, or anything else common for an HTML page (<, >, =, ?, &, @, etc... I guess anything found on the standard keyboard without using tricks to add umlauts or something) to a decimal entity, then filter anything that's boundary-letter-decimal-letter, boundary-decimal-letter, or decimal-decimal-decimal; eg,

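# \# because a bare # starts a comment under /x; lookarounds instead of \b,
# which cannot match between a space and &, or between ; and a space
s{(?<!\w)(?:
\w+(?:&\#\d+;)+\w*(?:&\#\d+;)*|
(?:&\#\d+;)+\w+(?:&\#\d+;)*\w*|
(?:&\#\d+;){3,}
)(?!\w)}
{****}xg;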
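The same filter can be tried out in the browser before it goes into Perl. This is a rough JavaScript port, not a tested production filter; `SUSPECT` and `maskSuspect` are made-up names. It uses the lookarounds `(?<!\w)` and `(?!\w)` rather than `\b`, because `\b` only matches between a word character and a non-word character, so it never matches between a space and `&`.

```javascript
// Three alternatives, mirroring the description above:
//   letters-entity-letters, entity-letters, or three or more entities in a row.
const SUSPECT = new RegExp(
  '(?<!\\w)(?:' +
    '\\w+(?:&#\\d+;)+\\w*(?:&#\\d+;)*|' +
    '(?:&#\\d+;)+\\w+(?:&#\\d+;)*\\w*|' +
    '(?:&#\\d+;){3,}' +
  ')(?!\\w)',
  'g'
);

// Hypothetical helper: replaces every suspect run with four asterisks.
function maskSuspect(text) {
  return text.replace(SUSPECT, '****');
}

console.log(maskSuspect('a pr&#233;tty day'));
console.log(maskSuspect('plain text &#169; 2021'));
```

A lone entity surrounded by spaces, such as the copyright sign in the second call, is left alone, which is the case that matters for text pasted from other pages.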


This is still tricky, though. If someone copies from a page that includes a weird symbol then it could make the sentence illegible. So this might not work at all, but I guess I won't know if it's too much of a pain until I try.

So I'm trying to find a list or range of characters that shouldn't be converted to decimal entities.

Do you think it would be relatively safe to convert [\u0080-\u024F] to decimal entities, then walk through whatever the decimals are for 0100-024F to convert to regular letters, then apply the above regex to any decimal entities that weren't converted to a regular letter?

lucy24

8:03 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“I guess anything found on the standard keyboard without using tricks to add umlauts or something”
If your users are in the US, that’s ASCII. But if they have a non-English keyboard (any French-Canadians use your site?) certain common modified characters will be right there on the keyboard. Other common non-keyboard characters are things like currency: € £ ¢ and so on.

:: detour to check something ::

Yup, the British keyboard has the £ sign in the location occupied on US keyboards by #. (! ! ! Is that why phone menus inexplicably call # the “pound sign”?)

<tangent>
The present site dislikes repeated punctuation marks, whether separated by a space or not. Happily, they don’t count nonbreaking spaces.
</tangent>

csdude55

8:21 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My demographic is entirely US, and I block anything in the firewall that's not US, MP, or PR. So I THINK that I can safely enforce ASCII :-)

Can you suggest a better way to either convert or strip anything that's not ASCII than what I mentioned in my last post? Right now it's looking like:


// HTML
<div id="contenteditable">
thïš ïš ä prétty<b></b> thöröûgh štrïng
</div>

<textarea name="data" id="data" hidden></textarea>

// JQuery
// I know jQuery isn't necessary here, but I'm already using it for other things
String.prototype.encodeHTML = function () {
  return this.replace(/[\u0080-\u024F]/g,
    function (v) { return '&#' + v.charCodeAt(0) + ';'; }
  );
};

$('#data').text(
  $('#contenteditable').html().encodeHTML()
);

// at this point, the textarea contains:
// th&#239;&#353; &#239;&#353; &#228; pr&#233;tty<b></b> th&#246;r&#246;&#251;gh &#353;tr&#239;ng

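// Perl
$_ = param('data');

# decimal code point => ASCII look-alike
%chars = (
'239' => 'i',
'353' => 's',
# and so on
);

# convert approved entities to letters; entities missing from %chars
# are left as-is so the filter below can still catch them
s{&#(\d+);}
{exists $chars{$1} ? $chars{$1} : "&#$1;"}ge;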

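# filter anything else
# (\# because a bare # starts a comment under /x; lookarounds instead of \b,
# which cannot match between a space and &, or between ; and a space)
s{(?<!\w)(?:
\w+(?:&\#\d+;)+\w*(?:&\#\d+;)*|
(?:&\#\d+;)+\w+(?:&\#\d+;)*\w*|
(?:&\#\d+;){3,}
)(?!\w)}
{****}xg;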


If that seems logical, can you suggest how I can view the entire list of decimal references for 0100-024F so that I can manually create a conversion list?
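Since the goal is plain ASCII, one option is to encode everything outside ASCII rather than only U+0080 to U+024F. This is just a sketch: it assumes a standalone function (the name `encodeNonAscii` is made up here) instead of an addition to `String.prototype`. The `u` flag makes the regex work on whole code points, so characters above U+FFFF, such as emoji, come out as one reference instead of two surrogate halves.

```javascript
// Hypothetical helper: replaces every non-ASCII character with a
// decimal character reference; ASCII (including HTML markup) passes through.
function encodeNonAscii(text) {
  return text.replace(/[^\x00-\x7F]/gu, (ch) => '&#' + ch.codePointAt(0) + ';');
}

console.log(encodeNonAscii('pr\u00e9tty<b></b> \u0161tr\u00efng'));
```

Used in place of `encodeHTML()`, this leaves the Perl side to decide what to do with every entity it receives, which fits the whitelist idea.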

lucy24

11:30 pm on Nov 7, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



“can you suggest how I can view the entire list of decimal references for 0100-024F”
Criminy. I knew That Other OS was weird about text input, but is there really nothing that lets you view the characters?

This is one of those rare cases where, as not2easy suggests, wikipedia [en.wikipedia.org] may be the way to go. You want the first three post-ASCII sets: “Latin-1 Supplement”, “Latin Extended A” and “Latin Extended B”. (Note that what Unicode insists on calling a “caron” is known everywhere else in the world as a hacek.)
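Much of the manual conversion list could probably be generated rather than typed. The sketch below uses a different technique from the hand-built table discussed in this thread: Unicode canonical decomposition via `String.prototype.normalize('NFD')`, which splits a letter such as ï into a plain i plus a combining mark. The helper name `buildAsciiMap` is made up for this example. Characters with no decomposition (ø, ß, æ and the like) are skipped and would still need manual entries.

```javascript
// Hypothetical helper: builds { decimalCodePoint: asciiLetter } for every
// character in [start, end] that decomposes into a single ASCII letter
// followed only by combining marks (U+0300 to U+036F).
function buildAsciiMap(start, end) {
  const map = {};
  for (let cp = start; cp <= end; cp++) {
    const decomposed = String.fromCodePoint(cp).normalize('NFD');
    const base = decomposed.replace(/[\u0300-\u036f]/g, '');
    if (base.length < decomposed.length && /^[A-Za-z]$/.test(base)) {
      map[cp] = base;
    }
  }
  return map;
}

const asciiMap = buildAsciiMap(0x00a0, 0x024f);

// Emit lines that can be pasted into a Perl hash.
for (const [cp, letter] of Object.entries(asciiMap)) {
  console.log(`'${cp}' => '${letter}',`);
}
```

The output is only a starting point; whether a given look-alike should be accepted is still a judgment call.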

csdude55

6:20 pm on Nov 8, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One last question (well, hopefully last).

The list shows decimal entities from 32 to 126, then a list that is left-padded from 0160 to 0255, then 256 to 591 and (skipping a few) 688 to 767.

In testing, &#162; and &#0162; both show the cent symbol. So why is it padded? Am I going to run into issues, warnings, or errors if I ignore the padding?

lucy24

6:36 pm on Nov 8, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've never used a leading zero and it definitely isn’t needed with decimal entities. I have no idea why wikipedia did it that way. Maybe different editors with different preferences over the years. It would make some small sense if you padded the part of the table that goes 0997, 0998, 0999, 1000, 1001 because then the numbers would line up--or they would if the table used a fixed-pitch font, which it doesn’t.
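That matches how the digits are normally handled: they are read as a base-10 number, so leading zeros simply drop out. A small illustration, assuming a hand-rolled decoder (`decodeDecimalEntities` is a made-up name, not what any browser literally runs):

```javascript
// Hypothetical helper: turns decimal character references back into
// characters. Values beyond the Unicode range are left untouched.
function decodeDecimalEntities(text) {
  return text.replace(/&#(\d+);/g, (match, digits) => {
    const cp = parseInt(digits, 10); // "0162" and "162" both parse to 162
    return cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
  });
}

console.log(decodeDecimalEntities('&#162;') === decodeDecimalEntities('&#0162;'));
```

On the Perl side the same holds if the captured digits are treated as a number; as hash keys they are strings, so '0162' and '162' would be different keys unless the padding is stripped first.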

:: insert “no idea” emoticon ::

For hexadecimal entities, where numbers always go in pairs, do include the leading zero to be safe. Depending on what application is reading the data, it may or may not make a difference.

tangor

5:23 am on Nov 9, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Yup, the British keyboard has the £ sign in the location occupied on US keyboards by #. (! Is that why phone menus inexplicably call # the “pound sign”?)"


Fun aside: the # sign was used as early as 1850 as an indicator of one pound (weight) in US and English, long before it was called an octothorp(e) or "hash tag".

Note: Phones did not have # on rotary dials ... only after the switch to push button did it appear on handsets ... and later on smart phones.

More commonly, these days, the # symbol appears on forums as a character replacement for coarse language. :)