
Unicode Character Classes

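These are the Unicode “General Category” character class names used in regular expression matching. In Perl, for example, \pP or \p{Punctuation} matches any Unicode character that has the “punctuation” property. The Syntax column in the table gives the short abbreviation with a colon prefix (e.g. :Lu); in Perl the same class is written \p{Lu} or \p{Uppercase_Letter}.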
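As a rough illustration of what a class such as \pP selects, here is a minimal Python sketch. Python's built-in re module does not support \p{...}, so the sketch looks up each character's General Category with the standard unicodedata module instead. The helper names has_category and find_all are made up for this example, and the prefix test only covers the one- and two-letter abbreviations (it does not handle LC).

```python
import unicodedata

def has_category(ch, prop):
    # Rough stand-in for \p{prop}: "P" accepts Pc, Pd, Ps, ...; "Lu" accepts only Lu.
    # Assumption: prop is a one- or two-letter General Category abbreviation.
    return unicodedata.category(ch).startswith(prop)

def find_all(text, prop):
    # Collect every character in text whose category falls under prop.
    return [ch for ch in text if has_category(ch, prop)]

sample = "Hello, world! (test)"
print(find_all(sample, "P"))   # [',', '!', '(', ')']
print(find_all(sample, "Lu"))  # ['H']
```

In an engine with native support, the same selection is a single pattern such as \pP or \p{Lu}.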

Expression                 Syntax  Long Name              Description
Letter                     :L      Letter                 Matches any letter: Ll | Lm | Lo | Lt | Lu.
Uppercase letter           :Lu     Uppercase_Letter       Matches any one capital letter. For example, :Luhe matches “The” but not “the”.
Lowercase letter           :Ll     Lowercase_Letter       Matches any one lowercase letter. For example, :Llhe matches “the” but not “The”.
Title case letter          :Lt     Titlecase_Letter       Matches characters that combine an uppercase letter with a lowercase letter, such as the digraphs ǋ (Nj, U+01CB) and ǲ (Dz, U+01F2).
Modifier letter            :Lm     Modifier_Letter        Matches letters or punctuation, such as commas, cross accents, and double prime, used to indicate modifications to the preceding letter.
Other letter               :Lo     Other_Letter           Matches other letters, such as Gothic letter ahsa.
Cased letter               :LC     Cased_Letter           Matches any letter with case: Ll | Lt | Lu.
Mark                       :M      Mark                   Matches any mark: Mc | Me | Mn.
Non-spacing mark           :Mn     Nonspacing_Mark        Matches non-spacing marks.
Spacing mark               :Mc     Spacing_Mark           Matches spacing combining marks.
Enclosing mark             :Me     Enclosing_Mark         Matches enclosing marks.
Number                     :N      Number                 Matches any number: Nd | Nl | No.
Decimal digit              :Nd     Decimal_Number         Matches decimal digits such as 0-9 and their full-width equivalents.
Letter number              :Nl     Letter_Number          Matches letter numbers such as Roman numerals and ideographic number zero.
Other number               :No     Other_Number           Matches other numbers such as Old Italic numeral one.
Punctuation                :P      Punctuation            Matches any punctuation: Pc | Pd | Pe | Pf | Pi | Po | Ps.
Connector punctuation      :Pc     Connector_Punctuation  Matches connector punctuation such as the underscore (_).
Dash punctuation           :Pd     Dash_Punctuation       Matches dashes and hyphens, such as the hyphen-minus (-) and the en dash (–).
Open punctuation           :Ps     Open_Punctuation       Matches opening punctuation such as open brackets and braces.
Close punctuation          :Pe     Close_Punctuation      Matches closing punctuation such as closing brackets and braces.
Initial quote punctuation  :Pi     Initial_Punctuation    Matches opening quotation marks such as ‘ and “.
Final quote punctuation    :Pf     Final_Punctuation      Matches closing quotation marks such as ’ and ”.
Other punctuation          :Po     Other_Punctuation      Matches commas (,), ?, ", !, @, #, %, &, *, \, colons (:), semi-colons (;), ', and /.
Symbol                     :S      Symbol                 Matches any symbol: Sc | Sk | Sm | So.
Math symbol                :Sm     Math_Symbol            Matches +, =, ~, |, <, and >.
Currency symbol            :Sc     Currency_Symbol        Matches $ and other currency symbols.
Modifier symbol            :Sk     Modifier_Symbol        Matches modifier symbols such as circumflex accent, grave accent, and macron.
Other symbol               :So     Other_Symbol           Matches other symbols, such as the copyright sign, registered sign, and degree sign.
Separator                  :Z      Separator              Matches any separator: Zl | Zp | Zs.
Paragraph separator        :Zp     Paragraph_Separator    Matches the Unicode character U+2029.
Space separator            :Zs     Space_Separator        Matches space characters such as the space (U+0020) and the no-break space (U+00A0).
Line separator             :Zl     Line_Separator         Matches the Unicode character U+2028.
Other control              :Cc     Control                Matches control characters such as tab, line feed, and carriage return.
Other format               :Cf     Format                 Matches formatting control characters such as the bidirectional control characters.
Surrogate                  :Cs     Surrogate              Matches one half of a surrogate pair.
Other private-use          :Co     Private_Use            Matches any character from the private-use area.
Other not assigned         :Cn     Unassigned             Matches code points that are not assigned to any character.
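
The composite rows in the table (L, LC, M, N, P, S, Z) are unions of two-letter categories. The following Python sketch spells those unions out and checks a few of the table's examples against the standard unicodedata module. The names GROUPS, in_class, and EXAMPLES are illustrative only, and the categories reported depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Composite classes from the table, expanded into their two-letter members.
GROUPS = {
    "L": {"Ll", "Lm", "Lo", "Lt", "Lu"},
    "LC": {"Ll", "Lt", "Lu"},
    "M": {"Mc", "Me", "Mn"},
    "N": {"Nd", "Nl", "No"},
    "P": {"Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"},
    "S": {"Sc", "Sk", "Sm", "So"},
    "Z": {"Zl", "Zp", "Zs"},
}

def in_class(ch, name):
    # True if ch belongs to the class name (e.g. "Lu", "P", "LC").
    # A name that is not a composite is treated as a single two-letter category.
    return unicodedata.category(ch) in GROUPS.get(name, {name})

# A few characters mentioned in the table, with the category each should have.
EXAMPLES = {
    "A": "Lu", "a": "Ll", "\u01cb": "Lt",       # U+01CB is the titlecase digraph Nj
    "_": "Pc", "-": "Pd", "(": "Ps", ")": "Pe",
    "\u201c": "Pi", "\u201d": "Pf", ",": "Po",  # opening and closing double quotes
    "+": "Sm", "$": "Sc", "^": "Sk", "\u00a9": "So",
    " ": "Zs", "\u2028": "Zl", "\u2029": "Zp",
    "5": "Nd", "\u2167": "Nl", "\n": "Cc",      # U+2167 is Roman numeral eight
}

for ch, expected in EXAMPLES.items():
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} (table says {expected})")

print(in_class("\u01cb", "LC"))  # True: a titlecase letter is a cased letter
print(in_class("\u02b0", "LC"))  # False: U+02B0 is a modifier letter (Lm)
print(in_class("\u02b0", "L"))   # True: Lm is still part of L
```

The explicit sets make the difference between L and LC visible: modifier letters (Lm) and other letters (Lo) are letters but have no case.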

References:

unicode.org

Unicode Character Properties

Unicode Regular Expressions

Unicode Property Aliases

Perl Regular Expressions

PCRE
