Can You Help Me Alter My Regular Expression To Include A Particular Range Of Unicode Characters?

March 31, 2024

I am allowing users to create comments within my app. I have created a JavaScript regular expression that matches the characters I would like to allow within the comment. This inc

Solution 1:

As a direct response to your question, I would propose the following regex:

/^(?:[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9 \/.,\-_$!\'&*()="?#+%:;\<\[\]\r\r\n]|(?:\ud83c[\udf00-\udfff])|(?:\ud83d[\udc00-\ude4f\ude80-\udeff])){1,2000}$/

But really, this requires some explanation before you go on. First of all, let's go back over a few definitions. You probably know some of these already, but they are necessary for the answer to make sense.

Regexes are state machines that consume "characters". That sounds simple enough, but regex engines differ in what they call a "character", with two predominant variants: either a character is a single byte, or a character is a UTF-16 code unit (that is, each 16-bit unit of the text when it is encoded as UTF-16). JavaScript uses the second variant, at least for a regex written without the u flag.
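To make the "code unit" idea concrete, here is a minimal sketch. The sample strings and variable names are my own, chosen only for illustration; the regexes are deliberately written without the u flag, which is the mode this answer deals with.

```javascript
// "A" fits in one UTF-16 code unit; the emoji U+1F600 needs two.
const ascii = "A";
const emoji = "\ud83d\ude00"; // U+1F600, written as its two code units

console.log(ascii.length); // 1
console.log(emoji.length); // 2

// Without the u flag, "." consumes exactly one code unit,
// so a single dot cannot span the whole emoji.
console.log(/^.$/.test(ascii));  // true
console.log(/^.$/.test(emoji));  // false
console.log(/^..$/.test(emoji)); // true
```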

The emoji characters in question each require two consecutive UTF-16 code units (a surrogate pair); that is why, in a UTF-16-based regex, they must be matched as two consecutive characters (for example \ud83c[\udf00-\udfff]). The two code units form a pair, and their order must be preserved in the regex.
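As a small illustration of the pair being matched as a sequence, the sketch below uses U+1F680 as a sample emoji (my choice, not something from the question) together with the \ud83d branch of the regex above.

```javascript
const rocket = "\ud83d\ude80"; // U+1F680, inside \ud83d[\ude80-\udeff]

// The two code units that make up the pair.
const high = rocket.charCodeAt(0).toString(16); // "d83d"
const low = rocket.charCodeAt(1).toString(16);  // "de80"
console.log(high, low);

// High surrogate first, then a range of low surrogates.
const pair = /^\ud83d[\udc00-\ude4f\ude80-\udeff]$/;
console.log(pair.test(rocket)); // true

// Reversing the order breaks the pair, so the match fails.
console.log(pair.test("\ude80\ud83d")); // false
```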

In a regex, a character class (for example [a-z0-9 ,-]) matches a single input character, provided that it appears in the specified list. There is no sequence and no ordering among the characters inside the class: it only ever matches one character at a time. Emojis therefore cannot be matched correctly simply by adding their UTF-16 code units to a long list of accepted characters (doing so would actually produce a regex that accepts all valid input, but also accepts a lot of invalid input, such as a lone half of a pair).
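The difference can be seen with two deliberately shortened patterns. Both are illustrative stand-ins of my own, not the full regex from the question: one drops the surrogate code units into a single class, the other keeps them as ordered pairs.

```javascript
// Naive version: surrogate code units dropped into one character class.
const naive = /^[A-Za-z\ud83c\ud83d\udc00-\udfff]+$/;
// Sequence-aware version: each pair is matched as two consecutive units.
const paired = /^(?:[A-Za-z]|\ud83c[\udf00-\udfff]|\ud83d[\udc00-\ude4f\ude80-\udeff])+$/;

const valid = "Hi\ud83d\ude00"; // "Hi" followed by U+1F600
const broken = "Hi\ud83d";      // ends with a lone high surrogate

console.log(naive.test(valid));   // true
console.log(naive.test(broken));  // true  (invalid input slips through)
console.log(paired.test(valid));  // true
console.log(paired.test(broken)); // false
```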

A character class can equivalently be replaced by a long list of alternatives: (?:a|b|c|...|y|z|0|1...|9| |,|-). Note that I used a non-capturing group, (?:...), instead of a capturing group, (...); this is desirable whenever you do not intend to refer to the value of a group, since capturing that value has a performance cost. Admittedly, a long list of alternatives is far less efficient than a character class. There is, however, an advantage to writing it this way: alternatives can match sequences of multiple characters. For example, one could write (?:apple|banana|cherry|...). In this form, it becomes possible to match emoji characters correctly: (?:\ud83c\udf00|\ud83c\udf01|\ud83c\udf02...\ud83c\udfff|...). But expanding every alternative this way would result in a ridiculously long and hard-to-maintain regex, so you will definitely want to mix character classes and alternatives appropriately.
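A quick sketch of these three points (class versus alternation, multi-character alternatives, capturing versus non-capturing groups), using toy patterns that are assumptions of mine rather than anything from the question:

```javascript
// A class and an alternation of single characters accept the same inputs.
const asClass = /^[abc]$/;
const asAlternation = /^(?:a|b|c)$/;
for (const ch of ["a", "b", "c", "d"]) {
  console.log(ch, asClass.test(ch) === asAlternation.test(ch)); // always true
}

// Only alternation can match multi-character sequences.
const fruit = /^(?:apple|banana|cherry)$/;
console.log(fruit.test("banana")); // true
// Inside brackets the same text is just a set of single characters.
console.log(/^[apple|banana]$/.test("banana")); // false

// A capturing group stores its match; a non-capturing group does not.
const captured = "ab".match(/(a)b/);
const notCaptured = "ab".match(/(?:a)b/);
console.log(captured.length);    // 2: whole match plus one group
console.log(notCaptured.length); // 1: whole match only
```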

So your regex will basically have the following form:

(?: [all acceptable single characters] |
    \ud83c [all acceptable low surrogates for pairs starting with d83c] |
    \ud83d [all acceptable low surrogates for pairs starting with d83d] )
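That skeleton can be assembled from named parts, roughly as follows. This is only a sketch: the single-character class is a shortened stand-in for the real list, the variable names are hypothetical, and the ^...$ anchors and + quantifier are there only so that the fragment can be tested on whole strings.

```javascript
// Shortened single-character class, for illustration only.
const singles = "[A-Za-z0-9 .,!?-]";
// One branch per high surrogate, each followed by its allowed low surrogates.
const d83c = "\\ud83c[\\udf00-\\udfff]";
const d83d = "\\ud83d[\\udc00-\\ude4f\\ude80-\\udeff]";

const allowedSketch = new RegExp(`^(?:${singles}|${d83c}|${d83d})+$`);

console.log(allowedSketch.test("Nice work! \ud83d\ude00")); // true
console.log(allowedSketch.test("tab\there"));               // false, tab is not listed
console.log(allowedSketch.test("\ud83d"));                  // false, lone surrogate
```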

From this point, I simply plugged in the character classes that you provided in your question, and removed extra spaces...

In your question, your regex was moreover wrapped in ^(...){1,2000}$, meaning that it only matches if the string, from beginning (^) to end ($), consists of between 1 and 2000 of the allowed characters. Adding this around the pattern constructed above gives the regex at the beginning of my answer. I should warn you, however, that this might not be the most appropriate way to test the length of the input string. Why are you imposing the 2000-character limit? Does that limit actually apply to your storage model? If so, then you should definitely take into account that each emoji counts as one repetition in the regex but takes up two "characters" (code units) in the string. And the relationship will be even more complex if your backend stores values in UTF-8. You should therefore consider checking the length of the input text with a separate test, written directly in JavaScript, rather than using a regex repetition specifier. If you decide to do so, replace the {1,2000} with a * suffix (which simply means "any number of repetitions").
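A separate length test might look something like the sketch below. The function name validateComment and the constant MAX_UNITS are hypothetical, and I am assuming here that the limit is meant in UTF-16 code units; if your storage counts code points or UTF-8 bytes, swap in the corresponding measure shown at the top.

```javascript
// Three ways to measure the same text.
const sample = "ok \ud83d\ude00"; // "ok " followed by U+1F600
console.log(sample.length);                           // 5 UTF-16 code units
console.log(Array.from(sample).length);               // 4 code points
console.log(new TextEncoder().encode(sample).length); // 7 UTF-8 bytes

// Regex from the top of the answer, with {1,2000} replaced by *.
const commentPattern = /^(?:[A-Za-z0-9\u00C0-\u017F\u20AC\u2122\u2150\u00A9 \/.,\-_$!\'&*()="?#+%:;\<\[\]\r\r\n]|(?:\ud83c[\udf00-\udfff])|(?:\ud83d[\udc00-\ude4f\ude80-\udeff]))*$/;

const MAX_UNITS = 2000; // assumption: the limit is counted in UTF-16 code units

// Hypothetical helper: the length check is done in plain JavaScript,
// and the regex is only responsible for the allowed characters.
function validateComment(text) {
  return text.length >= 1 &&
         text.length <= MAX_UNITS &&
         commentPattern.test(text);
}

console.log(validateComment(sample));           // true
console.log(validateComment(""));               // false, empty
console.log(validateComment("a".repeat(2001))); // false, too long
console.log(validateComment("x@y"));            // false, "@" is not allowed
```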
