UTF-8 alphanumeric input validation for Unicode strings

The standard alphanumeric regular expression [a-zA-Z0-9]+ is not suitable for validating multilingual Unicode strings, because it matches only unaccented ASCII letters and digits.
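
To see the limitation concretely, here is a minimal Java sketch. The class name AsciiOnlyDemo, the helper method and the sample strings are illustrative only; they are not part of any library.

```java
final class AsciiOnlyDemo {
    // The ASCII-only pattern discussed above.
    static final String ASCII_ALNUM = "[a-zA-Z0-9]+";

    // String.matches() requires the whole input to match the pattern.
    static boolean isAsciiAlnum(String s) {
        return s != null && s.matches(ASCII_ALNUM);
    }

    public static void main(String[] args) {
        // Plain ASCII letters and digits are accepted.
        System.out.println(isAsciiAlnum("cafe123"));   // true
        // "cafe" spelled with e-acute (U+00E9), which lies outside a-z.
        System.out.println(isAsciiAlnum("caf\u00e9")); // false
    }
}
```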

In a multilingual application, you have set up your Unicode database columns and configured your interface to accept the UTF-8 character set, but you still want to validate your users' input. That is to say, you still want 'alphanumeric' validation, but it needs to accept accented Latin characters and even Chinese characters, while rejecting special characters such as !”£$%^&. You can of course modify the combinations below to suit your needs.

Here is a regular expression, with comments, that will help you validate the input.


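/**
* Valid characters for a localised string.
*
* Numbers
*    [\u0030-\u0039]+
* English alpha characters
*    [\u0041-\u005a\u0061-\u007a]+
* Hyphen and underscore
*    [\u002d\u005f]+
* Double quote, apostrophe, acute accent
*    [\\u0022\u0027\u00b4]+
* Whitespace
*    [\\s]+
* Latin-1 Supplement - accented letters only, with the symbols below U+00C0 left out
*    (the range still includes the multiplication and division signs, U+00D7 and U+00F7).
*    Suitable for German, Italian, Portuguese, Spanish.
*    [\u00c0-\u00ff]+
* Latin Extended-A - extra characters missing from ISO 8859-1
*    Dotless i (U+0131) and the capital IJ ligature (U+0132) used in Dutch,
*    as per http://en.wikipedia.org/wiki/ISO/IEC_8859-1
*    Note: the lowercase ij ligature (U+0133) is not included.
*    [\u0131-\u0132]+
* Latin Extended-A - extra characters missing from ISO 8859-1
*    Capital OE ligature (U+0152) and Y with diaeresis (U+0178) used in French,
*    as per http://en.wikipedia.org/wiki/ISO/IEC_8859-1
*    Note: the lowercase oe ligature (U+0153) is not included.
*    [\u0152\u0178]+
* Chinese characters
*    The range starts at the CJK Unified Ideographs (U+4E00) and runs on through
*    the following blocks, including Yi and the Hangul syllables (up to U+D7A3).
*    [\u4e00-\ud7a3]+
*/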
private static final String VALID_CHARS = "[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\\s\\u0022\u0027\u00b4\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+";

In Java you might use it like this:


private boolean isValidChars(String inputValue) {
  // A null value is treated as valid; any other value must consist entirely of allowed characters.
  return inputValue == null || inputValue.matches(VALID_CHARS);
}
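
As a quick check of the pattern, the following self-contained sketch repeats the constant and the method in a hypothetical class named ValidCharsDemo and runs a few sample inputs. The class name and the samples are illustrative assumptions, not part of the original snippet.

```java
final class ValidCharsDemo {
    // Same character class as VALID_CHARS above.
    static final String VALID_CHARS = "[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\\s\\u0022\u0027\u00b4\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+";

    // Null is treated as valid; anything else must match in full.
    static boolean isValidChars(String inputValue) {
        return inputValue == null || inputValue.matches(VALID_CHARS);
    }

    public static void main(String[] args) {
        // ASCII letters, digits and spaces.
        System.out.println(isValidChars("Hello World 123")); // true
        // German surname with u-umlaut (U+00FC).
        System.out.println(isValidChars("M\u00fcller"));     // true
        // Two Chinese characters (U+5317, U+4EAC).
        System.out.println(isValidChars("\u5317\u4eac"));    // true
        // The exclamation mark is not in the allowed set.
        System.out.println(isValidChars("hello!"));          // false
        // The pound sign (U+00A3) sits below the accepted Latin-1 range.
        System.out.println(isValidChars("50\u00a3"));        // false
    }
}
```

Note that an empty string is rejected, because the trailing + requires at least one character.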

In JavaScript the same check can be written as:


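function isValidChars(inputValue) {
  // Anchored with ^ and $ so the whole string must match, as String.matches() does in Java.
  var VALID_CHARS = /^[\u0030-\u0039\u0041-\u005a\u0061-\u007a\u002d\u005f\u0022\u0027\u00b4\s\u00c0-\u00ff\u0131-\u0132\u0152\u0178\u4e00-\ud7a3]+$/;
  // A null or undefined value is treated as valid, mirroring the Java version.
  return inputValue == null || VALID_CHARS.test(inputValue);
}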
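
If listing code-point ranges by hand becomes unwieldy, Java's Pattern class also supports Unicode categories such as \p{L} (any letter) and \p{Nd} (any decimal digit). Here is a minimal sketch, assuming you want to accept letters and digits from any script plus whitespace, underscore and hyphen. The class and method names are hypothetical, and this set is deliberately broader than the ranges above.

```java
import java.util.regex.Pattern;

final class UnicodeCategoryDemo {
    // Letters and decimal digits from any script, whitespace, underscore, and a
    // trailing hyphen (placed last in the class so it is read literally).
    private static final Pattern LETTERS_AND_DIGITS =
            Pattern.compile("[\\p{L}\\p{Nd}\\s_-]+");

    static boolean isLettersAndDigits(String s) {
        return s != null && LETTERS_AND_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A Greek word (Athina), which the hand-written ranges above would reject.
        System.out.println(isLettersAndDigits("\u0391\u03b8\u03ae\u03bd\u03b1")); // true
        // Punctuation such as the exclamation mark is still rejected.
        System.out.println(isLettersAndDigits("hello!"));                         // false
    }
}
```

The trade-off is control: category classes accept every script at once, whereas explicit ranges let you admit only the languages you actually support.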

6 Responses

  1. Internationalization with Java « Vijayendra Rao’s Weblog Says:

    [...] In fact, you can find a good amount of details about this here…and if you would like to know in detail about the usage of regular expressions with UTF-8, you can refer to this post. [...]

  2. Chari Says:

    Adrian

    Thank you for this wonderful post. It really helped. I really appreciate this.

    Peace
    Chari

  3. Andreas Andreou Says:

    Hey, you forgot Greek :)

    [\u0391-\u03a9\u03b1-\u03c9]+

  4. Pat Says:

    It's a lot simpler than that!

    [^\\p{L}\\d]

    Look in the Pattern javadoc under “Unicode support”

  5. Pat Says:

    It's a lot simpler than that!

    [\\p{L}\\d]

    Look in the Pattern javadoc under “Unicode support”

    (removed the “^” in the previous comment)

  6. JC Castro Says:

    Great post, it helped me so much
