Regular Expression Character Classes
A character class defines a set of characters, any one of which can match a single character in an input string, instead of matching one specific literal or escaped character. A character class can be written in several ways:
A symbol that represents a set of characters, e.g. . = any character.
An escaped character class that represents a predefined group of characters, e.g. \p{name}, \P{name}, \w, \W, \s, \S, \d, \D.
A pair of square brackets, [], that encloses a set of characters, e.g. [abc] = a or b or c.
A hyphen character, -, that acts as a range separator unless it is the first or last character of the group, e.g. [a-c] = a or b or c.
A leading caret character, ^, that negates the group so that none of the listed characters may match, e.g. [^abc] = any character except a, b, or c.
A hyphen character, -, followed by a nested group that excludes characters from the base group, e.g. [a-c-[b]] = a or c.
.NET Any Character, .
The period character, ., matches any single character, including the carriage return character (\r or \u000D), except the newline character (\n or \u000A). Inside a character group, however, a period is treated as a literal period character.
| Character Class | Description | Exception |
| --- | --- | --- |
| . | Wildcard: Matches any single character except \n. | To match a literal period character (. or \u002E), you must precede it with the escape character (\.). In a character class, a period, ., is treated as a literal period character. |
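The wildcard behavior described above can be checked with a small console program. This is a minimal sketch: the module name and sample strings are made up for illustration, and RegexOptions.Singleline is a standard .NET option that is not covered in the text above.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module DotDemo
    Sub Main()
        ' Two lines separated by a Windows line ending (\r\n).
        Dim input As String = "ab" & vbCr & vbLf & "cd"

        ' By default, "." matches \r but stops at \n.
        Dim m1 As Match = Regex.Match(input, ".+")
        Console.WriteLine(m1.Length)   ' 3: "a", "b", and the carriage return

        ' With RegexOptions.Singleline, "." also matches \n.
        Dim m2 As Match = Regex.Match(input, ".+", RegexOptions.Singleline)
        Console.WriteLine(m2.Length)   ' 6: the whole input

        ' Inside a character group, "." is a literal period.
        Console.WriteLine(Regex.Matches("a.b.c", "[.]").Count)   ' 2
    End Sub
End Module
```

Because the first match silently includes the trailing \r, patterns such as .+ applied to text with Windows line endings often need an explicit \r? before the line break.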
Escaped Character Class \
In a regular expression, the backslash character, \, followed by one of the letters below introduces one of the following escaped character classes.
.NET Unicode category or Unicode block: \p{name}
The backslash character, \, followed by the character p and a name in braces matches any single character that belongs to the named Unicode general category or named block. The name is either a category abbreviation or a named block name.
.NET Negative Unicode category or Unicode block: \P{name}
The backslash character, \, followed by the character P and a name in braces matches any single character that does not belong to the named Unicode general category or named block. The name is either a category abbreviation or a named block name.
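As a rough illustration of \p{name} and \P{name}, the sketch below counts matches in a short string that mixes Basic Latin characters with one Greek letter. The module name and the sample string are assumptions chosen for the example.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module UnicodeCategoryDemo
    Sub Main()
        ' "A", Greek capital alpha (U+0391), "b", "1".
        Dim input As String = "A" & ChrW(913) & "b1"

        ' \p{Lu}: uppercase letters, here "A" and the Greek alpha.
        Console.WriteLine(Regex.Matches(input, "\p{Lu}").Count)            ' 2

        ' \P{Lu}: everything that is not an uppercase letter, here "b" and "1".
        Console.WriteLine(Regex.Matches(input, "\P{Lu}").Count)            ' 2

        ' \p{IsGreek}: a named block; only the alpha at index 1 belongs to it.
        Console.WriteLine(Regex.Match(input, "\p{IsGreek}").Index)         ' 1

        ' \p{IsBasicLatin}: "A", "b", and "1".
        Console.WriteLine(Regex.Matches(input, "\p{IsBasicLatin}").Count)  ' 3
    End Sub
End Module
```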
.NET Word Character: \w
The backslash character, \, followed by the character w matches any single word character. By default, word characters are the members of a predefined set of Unicode categories: \w is equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
.NET Non-word Character: \W
The backslash character, \, followed by the character W matches any single character that is not a word character. By default, word characters are the members of a predefined set of Unicode categories: \W is equivalent to [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified, \W is equivalent to [^a-zA-Z_0-9].
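The difference between the default and the ECMAScript-compliant definitions of \w and \W shows up as soon as the input contains a non-ASCII letter. The following sketch assumes a made-up sample string and uses RegexOptions.ECMAScript to request ECMAScript-compliant behavior.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module WordCharDemo
    Sub Main()
        ' "caf", e with acute accent (U+00E9), "_", "1", "!".
        Dim input As String = "caf" & ChrW(233) & "_1!"

        ' Default: the accented letter counts as a word character.
        Console.WriteLine(Regex.Matches(input, "\w").Count)                           ' 6
        Console.WriteLine(Regex.Matches(input, "\W").Count)                           ' 1

        ' ECMAScript: only [a-zA-Z_0-9] are word characters.
        Console.WriteLine(Regex.Matches(input, "\w", RegexOptions.ECMAScript).Count)  ' 5
        Console.WriteLine(Regex.Matches(input, "\W", RegexOptions.ECMAScript).Count)  ' 2
    End Sub
End Module
```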
.NET Whitespace Character: \s
The backslash character, \, followed by the character s matches any single whitespace character. By default, whitespace characters are a predefined set of escape sequences and Unicode categories: \s is equivalent to [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified, \s is equivalent to [ \f\n\r\t\v].
.NET Non-whitespace Character: \S
The backslash character, \, followed by the character S matches any single character that is not a whitespace character. By default, whitespace characters are a predefined set of escape sequences and Unicode categories: \S is equivalent to [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified, \S is equivalent to [^ \f\n\r\t\v].
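A short sketch of \s and \S under both behaviors follows. The sample string is an assumption: it contains an ordinary space, a tab, and an em space (U+2003, category Zs), which only the default definition treats as whitespace.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module WhitespaceDemo
    Sub Main()
        ' "a", space, "b", tab, "c", em space (U+2003), "d".
        Dim input As String = "a b" & vbTab & "c" & ChrW(&H2003) & "d"

        ' Default: space, tab, and em space are all whitespace.
        Console.WriteLine(Regex.Matches(input, "\s").Count)                           ' 3
        Console.WriteLine(Regex.Matches(input, "\S").Count)                           ' 4

        ' ECMAScript: the em space is not in [ \f\n\r\t\v].
        Console.WriteLine(Regex.Matches(input, "\s", RegexOptions.ECMAScript).Count)  ' 2
        Console.WriteLine(Regex.Matches(input, "\S", RegexOptions.ECMAScript).Count)  ' 5
    End Sub
End Module
```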
.NET Decimal Digit Character: \d
The backslash character, \, followed by the character d matches any single decimal digit character. By default, decimal digit characters are the members of a predefined Unicode category: \d is equivalent to [\p{Nd}]. If ECMAScript-compliant behavior is specified, \d is equivalent to [0-9].
.NET Non-decimal Digit Character: \D
The backslash character, \, followed by the character D matches any single character that is not a decimal digit character. By default, decimal digit characters are the members of a predefined Unicode category: \D is equivalent to [\P{Nd}]. If ECMAScript-compliant behavior is specified, \D is equivalent to [^0-9].
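The same comparison can be made for \d and \D. In this sketch the sample string is an assumption: it holds an ASCII digit, an Arabic-Indic digit (U+0663, category Nd), and a letter.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module DigitDemo
    Sub Main()
        ' "7", Arabic-Indic digit three (U+0663), "x".
        Dim input As String = "7" & ChrW(&H663) & "x"

        ' Default: any character in the Nd category is a digit.
        Console.WriteLine(Regex.Matches(input, "\d").Count)                           ' 2
        Console.WriteLine(Regex.Matches(input, "\D").Count)                           ' 1

        ' ECMAScript: only 0-9 are digits.
        Console.WriteLine(Regex.Matches(input, "\d", RegexOptions.ECMAScript).Count)  ' 1
        Console.WriteLine(Regex.Matches(input, "\D", RegexOptions.ECMAScript).Count)  ' 2
    End Sub
End Module
```

When validating input that must contain ASCII digits only, [0-9] or the ECMAScript option is usually the safer choice than the default \d.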
Character Group, []
.NET Positive Character Group, []
A pair of square brackets specifies a set of characters, any one of which can match a character in an input string. The set of characters may be specified individually, as a range, or both.
[*character_group*]: character_group can consist of any combination of one or more literal characters, escape characters, or character classes.
[*firstCharacter*-*lastCharacter*]: firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. Two characters are contiguous if they have adjacent Unicode code points. firstCharacter must be the character with the lower code point, and lastCharacter must be the character with the higher code point. A hyphen character, -, is always interpreted as the range separator unless it is the first or last character of the group.
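The following sketch illustrates individual characters, ranges, and the position-dependent meaning of the hyphen in a positive character group. The sample strings and module name are made up for the example.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module PositiveGroupDemo
    Sub Main()
        Dim input As String = "a-b-c-d"

        ' [a-c] is a range: "a", "b", and "c" are removed, hyphens stay.
        Console.WriteLine(Regex.Replace(input, "[a-c]", ""))      ' ---d

        ' [ac-] ends with the hyphen, so it is a literal: "a", "c", and "-" are removed.
        Console.WriteLine(Regex.Replace(input, "[ac-]", ""))      ' bd

        ' A group can mix an escaped class with a range: digits or A through F.
        Console.WriteLine(Regex.Matches("7G A", "[\dA-F]").Count) ' 2
    End Sub
End Module
```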
.NET Negative Character Group, [^]
A pair of square brackets with a leading caret specifies a set of characters that must not match a character in an input string. The set of characters may be specified individually, as a range, or both.
[^*character_group*]: the leading caret, ^, indicates a negative character group. character_group can consist of any combination of one or more literal characters, escape characters, or character classes that cannot appear in an input string.
[^*firstCharacter*-*lastCharacter*]: the leading caret, ^, indicates a negative character group. firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. Two characters are contiguous if they have adjacent Unicode code points. firstCharacter must be the character with the lower code point, and lastCharacter must be the character with the higher code point. A hyphen character, -, is always interpreted as the range separator unless it is the first or last character of the group.
A negative character group in a larger regular expression pattern is not a zero-width assertion. That is, after evaluating the negative character group, the regular expression engine advances one character in the input string.
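The point that a negative character group consumes a character can be seen in a small sketch. The words used here are arbitrary sample input.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module NegativeGroupDemo
    Sub Main()
        ' [^aeiou] matches every character that is not a lowercase vowel.
        Console.WriteLine(Regex.Replace("regex", "[^aeiou]", ""))   ' ee

        ' q[^u] needs a character after "q" that is not "u".
        ' "Iraq" ends at "q", so there is nothing for [^u] to consume.
        Console.WriteLine(Regex.IsMatch("Iraq", "q[^u]"))           ' False

        ' In "Iraqi" the group consumes the "i", so the match is two characters long.
        Console.WriteLine(Regex.Match("Iraqi", "q[^u]").Value)      ' qi
    End Sub
End Module
```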
.NET Character Subtraction Group, [base_group-[excluded_group]]
A character subtraction group specifies a set of characters through subtraction, any one of which can match a character in an input string. The resulting set is the base character group with the characters of the excluded character group removed.
The form of a character subtraction group is [base_group-[excluded_group]]. The hyphen, -, indicates that the nested group that follows is the excluded_group.
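A minimal sketch of character subtraction follows, using made-up sample strings. Each call removes the characters that the subtraction group matches, so the output shows what the group leaves alone.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module SubtractionDemo
    Sub Main()
        ' [a-c-[b]] matches "a" or "c" but not "b".
        Console.WriteLine(Regex.Replace("abcabc", "[a-c-[b]]", ""))         ' bb

        ' [a-z-[aeiou]] matches lowercase consonants, so only the vowels remain.
        Console.WriteLine(Regex.Replace("character", "[a-z-[aeiou]]", ""))  ' aae
    End Sub
End Module
```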
Supported Unicode
Supported Unicode general categories and supported named blocks

| Category (\p{name}) | Description | \w | \s | \d |
| --- | --- | --- | --- | --- |
| Lu | Letter, Uppercase | ✔ | | |
| Ll | Letter, Lowercase | ✔ | | |
| Lt | Letter, Titlecase | ✔ | | |
| Lm | Letter, Modifier | ✔ | | |
| Lo | Letter, Other | ✔ | | |
| L | All letter characters. This includes the Lu, Ll, Lt, Lm, and Lo characters. | | | |
| Mn | Mark, Nonspacing | ✔ | | |
| Mc | Mark, Spacing Combining | | | |
| Me | Mark, Enclosing | | | |
| M | All diacritic marks. This includes the Mn, Mc, and Me categories. | | | |
| Nd | Number, Decimal Digit | ✔ | | ✔ |
| Nl | Number, Letter | | | |
| No | Number, Other | | | |
| N | All numbers. This includes the Nd, Nl, and No categories. | | | |
| Pc | Punctuation, Connector | ✔ | | |
| Pd | Punctuation, Dash | | | |
| Ps | Punctuation, Open | | | |
| Pe | Punctuation, Close | | | |
| Pi | Punctuation, Initial quote (may behave like Ps or Pe depending on usage) | | | |
| Pf | Punctuation, Final quote (may behave like Ps or Pe depending on usage) | | | |
| Po | Punctuation, Other | | | |
| P | All punctuation characters. This includes the Pc, Pd, Ps, Pe, Pi, Pf, and Po categories. | | | |
| Sm | Symbol, Math | | | |
| Sc | Symbol, Currency | | | |
| Sk | Symbol, Modifier | | | |
| So | Symbol, Other | | | |
| S | All symbols. This includes the Sm, Sc, Sk, and So categories. | | | |
| Zs | Separator, Space | | | |
| Zl | Separator, Line | | | |
| Zp | Separator, Paragraph | | | |
| Z | All separator characters. This includes the Zs, Zl, and Zp categories. | | ✔ | |
| Cc | Other, Control | | | |
| Cf | Other, Format | | | |
| Cs | Other, Surrogate | | | |
| Co | Other, Private Use | | | |
| Cn | Other, Not Assigned (no characters have this property) | | | |
| C | All control characters. This includes the Cc, Cf, Cs, Co, and Cn categories. | | | |
| Block name | Code point range |
| --- | --- |
| IsBasicLatin | 0000 - 007F |
| IsLatin-1Supplement | 0080 - 00FF |
| IsLatinExtended-A | 0100 - 017F |
| IsLatinExtended-B | 0180 - 024F |
| IsIPAExtensions | 0250 - 02AF |
| IsSpacingModifierLetters | 02B0 - 02FF |
| IsCombiningDiacriticalMarks | 0300 - 036F |
| IsGreek or IsGreekandCoptic | 0370 - 03FF |
| IsCyrillic | 0400 - 04FF |
| IsCyrillicSupplement | 0500 - 052F |
| IsArmenian | 0530 - 058F |
| IsHebrew | 0590 - 05FF |
| IsArabic | 0600 - 06FF |
| IsSyriac | 0700 - 074F |
| IsThaana | 0780 - 07BF |
| IsDevanagari | 0900 - 097F |
| IsBengali | 0980 - 09FF |
| IsGurmukhi | 0A00 - 0A7F |
| IsGujarati | 0A80 - 0AFF |
| IsOriya | 0B00 - 0B7F |
| IsTamil | 0B80 - 0BFF |
| IsTelugu | 0C00 - 0C7F |
| IsKannada | 0C80 - 0CFF |
| IsMalayalam | 0D00 - 0D7F |
| IsSinhala | 0D80 - 0DFF |
| IsThai | 0E00 - 0E7F |
| IsLao | 0E80 - 0EFF |
| IsTibetan | 0F00 - 0FFF |
| IsMyanmar | 1000 - 109F |
| IsGeorgian | 10A0 - 10FF |
| IsHangulJamo | 1100 - 11FF |
| IsEthiopic | 1200 - 137F |
| IsCherokee | 13A0 - 13FF |
| IsUnifiedCanadianAboriginalSyllabics | 1400 - 167F |
| IsOgham | 1680 - 169F |
| IsRunic | 16A0 - 16FF |
| IsTagalog | 1700 - 171F |
| IsHanunoo | 1720 - 173F |
| IsBuhid | 1740 - 175F |
| IsTagbanwa | 1760 - 177F |
| IsKhmer | 1780 - 17FF |
| IsMongolian | 1800 - 18AF |
| IsLimbu | 1900 - 194F |
| IsTaiLe | 1950 - 197F |
| IsKhmerSymbols | 19E0 - 19FF |
| IsPhoneticExtensions | 1D00 - 1D7F |
| IsLatinExtendedAdditional | 1E00 - 1EFF |
| IsGreekExtended | 1F00 - 1FFF |
| IsGeneralPunctuation | 2000 - 206F |
| IsSuperscriptsandSubscripts | 2070 - 209F |
| IsCurrencySymbols | 20A0 - 20CF |
| IsCombiningDiacriticalMarksforSymbols or IsCombiningMarksforSymbols | 20D0 - 20FF |
| IsLetterlikeSymbols | 2100 - 214F |
| IsNumberForms | 2150 - 218F |
| IsArrows | 2190 - 21FF |
| IsMathematicalOperators | 2200 - 22FF |
| IsMiscellaneousTechnical | 2300 - 23FF |
| IsControlPictures | 2400 - 243F |
| IsOpticalCharacterRecognition | 2440 - 245F |
| IsEnclosedAlphanumerics | 2460 - 24FF |
| IsBoxDrawing | 2500 - 257F |
| IsBlockElements | 2580 - 259F |
| IsGeometricShapes | 25A0 - 25FF |
| IsMiscellaneousSymbols | 2600 - 26FF |
| IsDingbats | 2700 - 27BF |
| IsMiscellaneousMathematicalSymbols-A | 27C0 - 27EF |
| IsSupplementalArrows-A | 27F0 - 27FF |
| IsBraillePatterns | 2800 - 28FF |
| IsSupplementalArrows-B | 2900 - 297F |
| IsMiscellaneousMathematicalSymbols-B | 2980 - 29FF |
| IsSupplementalMathematicalOperators | 2A00 - 2AFF |
| IsMiscellaneousSymbolsandArrows | 2B00 - 2BFF |
| IsCJKRadicalsSupplement | 2E80 - 2EFF |
| IsKangxiRadicals | 2F00 - 2FDF |
| IsIdeographicDescriptionCharacters | 2FF0 - 2FFF |
| IsCJKSymbolsandPunctuation | 3000 - 303F |
| IsHiragana | 3040 - 309F |
| IsKatakana | 30A0 - 30FF |
| IsBopomofo | 3100 - 312F |
| IsHangulCompatibilityJamo | 3130 - 318F |
| IsKanbun | 3190 - 319F |
| IsBopomofoExtended | 31A0 - 31BF |
| IsKatakanaPhoneticExtensions | 31F0 - 31FF |
| IsEnclosedCJKLettersandMonths | 3200 - 32FF |
| IsCJKCompatibility | 3300 - 33FF |
| IsCJKUnifiedIdeographsExtensionA | 3400 - 4DBF |
| IsYijingHexagramSymbols | 4DC0 - 4DFF |
| IsCJKUnifiedIdeographs | 4E00 - 9FFF |
| IsYiSyllables | A000 - A48F |
| IsYiRadicals | A490 - A4CF |
| IsHangulSyllables | AC00 - D7AF |
| IsHighSurrogates | D800 - DB7F |
| IsHighPrivateUseSurrogates | DB80 - DBFF |
| IsLowSurrogates | DC00 - DFFF |
| IsPrivateUse or IsPrivateUseArea | E000 - F8FF |
| IsCJKCompatibilityIdeographs | F900 - FAFF |
| IsAlphabeticPresentationForms | FB00 - FB4F |
| IsArabicPresentationForms-A | FB50 - FDFF |
| IsVariationSelectors | FE00 - FE0F |
| IsCombiningHalfMarks | FE20 - FE2F |
| IsCJKCompatibilityForms | FE30 - FE4F |
| IsSmallFormVariants | FE50 - FE6F |
| IsArabicPresentationForms-B | FE70 - FEFF |
| IsHalfwidthandFullwidthForms | FF00 - FFEF |
| IsSpecials | FFF0 - FFFF |
\s matches the characters in [\f\n\r\t\v\x85\p{Z}].
\d matches the standard decimal digits 0-9 as well as the decimal digits of a number of other character sets.
See also
Regular Expression Language - Quick Reference
Examples
Examples of Character Classes
ASP.NET Code Input:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Sample Page</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
<script runat="server">
    ' Runs each sample pattern against the test string and shows the results in lbl01.
    Sub Page_Load()
        ' Test input: digits, spaces, Greek capital alpha (U+0391), letters, and two CR+LF line breaks.
        Dim xstring As String = "01 345" & ChrW(913) & "67 9abc def" & Chr(13) & Chr(10) & "7890" & Chr(13) & Chr(10)
        Dim xmatchstr As String = ""
        xmatchstr = xmatchstr & "Given string: " & """01 345""&ChrW(913)&""67 9abc def""&Chr(13)&Chr(10)&""7890""&Chr(13)&Chr(10)" & "<br />"
        ' Any character.
        xmatchstr = xmatchstr & showresult(xstring, ".+", RegexOptions.None)
        ' Unicode categories and named blocks.
        xmatchstr = xmatchstr & showresult(xstring, "\P{Ll}", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\p{Ll}", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\p{IsBasicLatin}", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\P{IsBasicLatin}", RegexOptions.None)
        ' Word, whitespace, and decimal digit classes.
        xmatchstr = xmatchstr & showresult(xstring, "\w", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\W", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\s", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\S", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\d", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "\D", RegexOptions.None)
        ' Positive, negative, and subtraction character groups.
        xmatchstr = xmatchstr & showresult(xstring, "[1357\r]", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "[^1357\r]", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "[1357\r-[5]]", RegexOptions.None)
        xmatchstr = xmatchstr & showresult(xstring, "[^1357\r-[5]]", RegexOptions.None)
        lbl01.Text = xmatchstr
    End Sub

    ' Returns an HTML fragment listing the count, value, index, and length of every match.
    Function showresult(xstring As String, xpattern As String, xoption As RegexOptions) As String
        Dim xmatches As MatchCollection
        Dim xmatchstr As String = ""
        Dim xint As Integer
        ' xoption.ToString() prints the option name (for example, None) rather than its numeric value.
        xmatchstr = xmatchstr & "<br />Result of Regex.Matches(string,""" & xpattern & """," & xoption.ToString() & "): "
        xmatches = Regex.Matches(xstring, xpattern, xoption)
        xmatchstr = xmatchstr & "<br />->Result of MatchCollection.Count: """
        xmatchstr = xmatchstr & xmatches.Count & """<br />"
        For xint = 0 To xmatches.Count - 1
            xmatchstr = xmatchstr & "->->Result of MatchCollection(" & xint & ").Value, Index, Length: """
            xmatchstr = xmatchstr & xmatches(xint).Value & ", " & xmatches(xint).Index & ", " & xmatches(xint).Length & """<br />"
        Next
        Return xmatchstr
    End Function
</script>
</head>
<body>
<% Response.Write ("<h1>This is a Sample Page of Character Classes</h1>") %>
<p>
<%-- Set on Page_Load --%>
<asp:Label id="lbl01" runat="server" />
</p>
</body>
</html>