Sideway BICK BLOG from Sideway


Regular Expression Features

Character Classes

A character class is a group of literals, any one of which can match one single character of a searched string.

Customized Character Classes

To allow any one of several literals to match the same single character of a searched string in a simple way, a customized character class is written by placing the literals within a pair of square brackets []. In other words, a character class, also called a character set, is a collection of specified literals, any one of which may match one single character of the searched string; its content is interpreted as a list of literals rather than as an ordinary regular subexpression. Since every member of the class is tried against the same single character, the order in which the members are listed does not matter, and listing a member twice adds nothing. A character class is therefore an unordered set of unique elements. Logically, matching a character class performs a logical OR over the matches of its individual members. A character class is therefore equivalent to a grouping expression whose single-character alternatives are joined by the or | operator.

character a:
matches an a as a single character of a searched string.
character class [a]:
matches an a as a single character of a searched string, exactly like the plain character a.
character class [aeiou]:
matches an a, or an e, or an i, or an o, or a u as the same single character of a searched string and is equivalent to (a|e|i|o|u).
character class [ueioa]:
matches a u, or an e, or an i, or an o, or an a as the same single character of a searched string and is equivalent to (u|e|i|o|a), and therefore also to [aeiou].
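
The equivalences listed above can be checked with a short script. This is only a sketch using Python's re module; the helper name matched_chars and the sample text are chosen for illustration, and other regular expression engines may differ in details.

```python
import re

def matched_chars(pattern, text):
    """Return every single-character match of `pattern` found in `text`."""
    return re.findall(pattern, text)

text = "regular expression"

# The order of the members inside the brackets does not matter.
vowels_a = matched_chars(r"[aeiou]", text)
vowels_b = matched_chars(r"[ueioa]", text)

# A character class behaves like an alternation of its members.
# A non-capturing group is used so findall returns the matched text.
vowels_alt = matched_chars(r"(?:a|e|i|o|u)", text)

# A class with a single member behaves like the plain literal.
single_class = matched_chars(r"[a]", text)
single_literal = matched_chars(r"a", text)

print(vowels_a, vowels_b, vowels_alt)
print(single_class, single_literal)
```

All three vowel patterns return the same list of matches, in the order the vowels occur in the text.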

Two control operators, the caret (^) and the hyphen (-), are provided for building character classes. When used inside a character class, both characters carry a meaning different from the one they have elsewhere in a regular expression pattern. A caret (^) placed as the first character immediately after the opening [ is the internal negated set indicator. A hyphen (-) placed between two characters specifies a range of characters running from the pre-hyphen character to the post-hyphen character.

Besides customized character classes, some common character classes are bundled as standard regular expression metacharacters, for example the period . and the escaped characters \d, \D, \s, \S, \w, and \W. Outside the square brackets, each of these metacharacters is a shorthand for a standard character class. Inside the square brackets, the escaped shorthands act as a subgroup of literals within the character class, whereas the period stands only for a literal period.

Literals of Character Classes

Since the members of a character class are always literals, the only special characters or metacharacters used in defining a character class are the opening square bracket [, the closing square bracket ], the backslash \, the caret ^, and the hyphen -, where the pair of square brackets [] are the delimiters and the backslash \ introduces escaped characters. In other words, to use one of these five special characters as a literal in a character set, it should be escaped with a backslash, while other special characters can be used as ordinary characters without being escaped. However, four of the five special characters can also be used directly, without escaping, in some cases. For example,

[: An opening square bracket [ that follows the opening delimiter is usually not interpreted as the delimiter of another character class. e.g. element [ = [[]
^: The caret ^ can be used as a literal if it is not placed immediately after the opening square bracket, because a caret anywhere else in the character class is usually not interpreted as the negated set indicator. e.g. not element ^ = [^^]
]: If a regular expression engine assumes that a character class is never an empty set, a closing square bracket ] placed immediately after the opening square bracket delimiter is usually interpreted as a literal rather than as the closing delimiter. The same holds for a closing square bracket ] placed immediately after the opening delimiter and the internal negated set indicator, as in [^]]. Otherwise, the closing square bracket ] ends the class at that point, and the pattern may be interpreted as an empty class or as an error. e.g. element ] = []]
-: The hyphen - can be used as a literal if it is placed immediately after the opening square bracket delimiter, immediately after the opening delimiter and the internal negated set indicator, or immediately before the closing square bracket. e.g. element - = [-]
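
The placement rules above can be tried out as follows. This sketch assumes Python's re module, which treats a ] placed first in a class as a literal; engines that allow empty classes (JavaScript, for example) interpret [] differently. An unescaped [ placed first in a class is left out here because recent Python versions warn about it, so the escaped form is shown instead. The variable names are illustrative only.

```python
import re

# ] immediately after the opening [ is taken as a literal by Python's re.
closing_first = re.findall(r"[]]", "a]b[c")

# ^ that is not the first character is a literal caret.
caret_not_first = re.findall(r"[a^]", "a^b")

# [^^] negates the set containing a literal caret.
negated_caret = re.findall(r"[^^]", "a^b")

# - is a literal when it is first or last inside the brackets.
hyphen_only = re.findall(r"[-]", "a-b")
hyphen_last = re.findall(r"[a-]", "a-b")

# Escaping always works for all five special characters.
escaped = re.findall(r"[\[\]\^\-\\]", "[a]^-\\")

print(closing_first, caret_not_first, negated_caret)
print(hyphen_only, hyphen_last, escaped)
```

When portability between engines matters, the escaped forms are the safer choice, since the unescaped placements depend on how each engine parses the class.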

Escape Characters

In general, literal escape characters expressed as ASCII hexadecimal escape values, ASCII control escape values, ASCII octal escape values, or UNICODE hexadecimal escape values carry the same meaning in both a regular expression pattern and a character class. Other escaped characters, however, may be interpreted differently in the two contexts. For example,

Description of Unprintable Control Codes	SYM	char	escaped char	regular expression pattern	character class
Null char	NUL
Start of Heading	SOH
Start of Text	STX
End of Text	ETX
End of Transmission	EOT
Enquiry	ENQ
Acknowledgment	ACK
Bell	BEL
Back Space	BS		\b		Back Space
Horizontal Tab	HT		\t	Horizontal Tab
Line Feed	LF		\n	Line Feed
Vertical Tab	VT		\v	Vertical Tab
Form Feed	FF		\f	Form Feed
Carriage Return	CR		\r	Carriage Return
Shift Out / X-On	SO
Shift In / X-Off	SI
Data Line Escape	DLE
Device Control 1 (oft. XON)	DC1
Device Control 2	DC2
Device Control 3 (oft. XOFF)	DC3
Device Control 4	DC4
Negative Acknowledgement	NAK
Synchronous Idle	SYN
End of Transmit Block	ETB
Cancel	CAN
End of Medium	EM
Substitute	SUB
Escape	ESC
File Separator	FS
Group Separator	GS
Record Separator	RS
Unit Separator	US
Description of printable characters except char 127	SYM	char	escaped char
Space
Exclamation mark	!
Double quotes (or speech marks)	"
Number	#
Dollar	$		n/a
Percent sign	%
Ampersand	&
Single quote	'
Open parenthesis (or open bracket)	(		\(	(
Close parenthesis (or close bracket)	)		\)	)
Asterisk	*		\*	*
Plus	+		\+	+
Comma	,
Hyphen	-		\-	n/a	-
Period, dot or full stop	.		\.	a character class	.
Slash or divide	/		\/	/	n/a
Zero	0	0	\0	a backreference	n/a
One	1	1	\1	a backreference	n/a
Two	2	2	\2	a backreference	n/a
Three	3	3	\3	a backreference	n/a
Four	4	4	\4	a backreference	n/a
Five	5	5	\5	a backreference	n/a
Six	6	6	\6	a backreference	n/a
Seven	7	7	\7	a backreference	n/a
Eight	8	8	\8	a backreference	n/a
Nine	9	9	\9	a backreference	n/a
Colon	:
Semicolon	;
Less than (or open angled bracket)	<
Equals	=
Greater than (or close angled bracket)	>
Question mark	?		\?	?	n/a
At sign	@
Uppercase A	A	A
Uppercase B	B	B	\B	not boundary position
Uppercase C	C	C
Uppercase D	D	D	\D	a character class
Uppercase E	E	E
Uppercase F	F	F
Uppercase G	G	G
Uppercase H	H	H
Uppercase I	I	I
Uppercase J	J	J
Uppercase K	K	K
Uppercase L	L	L
Uppercase M	M	M
Uppercase N	N	N
Uppercase O	O	O
Uppercase P	P	P
Uppercase Q	Q	Q
Uppercase R	R	R
Uppercase S	S	S	\S	a character class
Uppercase T	T	T
Uppercase U	U	U
Uppercase V	V	V
Uppercase W	W	W	\W	a character class	n/a
Uppercase X	X	X
Uppercase Y	Y	Y
Uppercase Z	Z	Z
Opening bracket	[		\[	[
Backslash	\		\\	\
Closing bracket	]		\]
Caret - circumflex	^		\^	^
Underscore	_
Grave accent	`
Lowercase a	a	a
Lowercase b	b	b	\b	boundary position	back space
Lowercase c	c	c	\c	escaped control character
Lowercase d	d	d	\d	a character class
Lowercase e	e	e
Lowercase f	f	f	\f	Form Feed
Lowercase g	g	g
Lowercase h	h	h
Lowercase i	i	i
Lowercase j	j	j
Lowercase k	k	k
Lowercase l	l	l
Lowercase m	m	m
Lowercase n	n	n	\n	Line Feed
Lowercase o	o	o
Lowercase p	p	p
Lowercase q	q	q
Lowercase r	r	r	\r	Carriage Return
Lowercase s	s	s	\s	a character class
Lowercase t	t	t	\t	Horizontal Tab
Lowercase u	u	u	\u	escaped UNICODE hexadecimal character
Lowercase v	v	v	\v	Vertical Tab
Lowercase w	w	w	\w	a character class
Lowercase x	x	x	\x	escaped ASCII hexadecimal character
Lowercase y	y	y
Lowercase z	z	z
Opening brace	{		\{
Vertical bar	|		\|
Closing brace	}		\}
Equivalency sign - tilde	~
Delete
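
A few rows of the table can be illustrated in code. The sketch below assumes Python's re module, where \b is a word boundary in a pattern but a backspace inside a character class, and where the period and the quantifier characters lose their special meaning inside the brackets. The sample strings are made up for illustration.

```python
import re

# \b outside a class: a word boundary position (matches no character itself).
word_boundary = re.findall(r"\bcat\b", "cat concat cat")

# \b inside a class: the backspace control character (code 8).
backspace = re.findall(r"[\b]", "a\x08b")

# . outside a class: any character except a newline.
dot_outside = re.findall(r".", "a.b")

# . inside a class: only a literal period.
dot_inside = re.findall(r"[.]", "a.b")

# Quantifier characters are plain literals inside a class.
quantifier_literals = re.findall(r"[*+?]", "a*b+c?")

print(word_boundary, backspace)
print(dot_outside, dot_inside, quantifier_literals)
```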

Literal Range in Character Classes

To shorten the list of literals in a character class, ranges can be used. A hyphen indicates a closed range of values. The range must be clearly defined by a finite lower boundary, the pre-hyphen character, and a finite upper boundary, the post-hyphen character. In other words, the range runs from the pre-hyphen character through the post-hyphen character, both included. Since a range in a character set is only a shorthand for the literals it covers, literals and literal ranges can be mixed in a character class, more than one range can be used in a character class, and again the order of the ranges inside the character class is immaterial. For example,

character class [af-h]:
matches an a, or an f, or a g, or an h as a single character of a searched string.
character class [a-f0-9]:
matches an a, or a b, or a c, or a d, or an e, or an f, or a 0, or a 1, or a 2, or a 3, or a 4, or a 5, or a 6, or a 7, or an 8, or a 9 as a single character of a searched string.
character class [0-9a-f]:
matches a 0, or a 1, or a 2, or a 3, or a 4, or a 5, or a 6, or a 7, or an 8, or a 9, or an a, or a b, or a c, or a d, or an e, or an f as a single character of a searched string.
character class [aA7p-s51-3B-E9]:
matches an a, or an A, or a 7, or a p, or a q, or an r, or an s, or a 5, or a 1, or a 2, or a 3, or a B, or a C, or a D, or an E, or a 9 as a single character of a searched string.
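
One way to see what a range expands to is to list every character a class accepts. The following is a minimal sketch using Python's re module; class_members is a hypothetical helper written for this illustration, and it only probes the printable ASCII characters.

```python
import re
import string

def class_members(char_class):
    """Return the set of printable ASCII characters matched by a class."""
    pattern = re.compile(char_class)
    return {ch for ch in string.printable if pattern.fullmatch(ch)}

mixed = class_members(r"[af-h]")
hex_letters_first = class_members(r"[a-f0-9]")
hex_digits_first = class_members(r"[0-9a-f]")
scattered = class_members(r"[aA7p-s51-3B-E9]")

print(sorted(mixed))
print(sorted(hex_letters_first))
print(sorted(scattered))
```

Because the result is a set, the two hexadecimal classes compare equal, which reflects that the order of ranges inside the brackets is immaterial.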

Negated Character Classes

Sometimes, instead of matching any one of the literals in a character class to a single character of a searched string, it is more convenient to use the character class in its negative sense, that is, to match a single character that is none of the listed literals. A character class is negated by placing a caret ^, the internal negated set indicator, as the first character immediately after the opening square bracket. A negated character class matches any single printable or unprintable character that is not listed in the class. Since all characters of a negated character class, except the leading caret ^, are tested against the same single character of the searched string, the order of the characters listed in the negated character class does not matter. Apart from the negated set indicator at the beginning, a negated character class is therefore also an unordered set of unique elements. Logically, matching a negated character class performs a logical AND over the negated matches of its individual members. For example,

negated character class [^a]:
matches a single character of a searched string that is not an a.
negated character class [^aeiou]:
matches a single character of a searched string that is not any element of the negated character class, i.e. not an a, not an e, not an i, not an o, and not a u.
negated character class [^ueioa]:
matches a single character of a searched string that is not a u, and not an e, and not an i, and not an o, and not an a; it is the same set as [^aeiou].
negated character class [^ha-f]:
matches a single character of a searched string that is neither an h, nor an a, nor a b, nor a c, nor a d, nor an e, nor an f.
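
The logical AND reading of a negated class can be checked directly. This is a sketch using Python's re module; the sample text and variable names are illustrative, and the check covers only printable ASCII characters.

```python
import re
import string

text = "hedge fig"

# Every character that is not a vowel, spaces included.
not_vowel = re.findall(r"[^aeiou]", text)

# Negated literal plus negated range: not h and not a..f.
not_h_a_to_f = re.findall(r"[^ha-f]", text)

# [^aeiou] accepts a character exactly when it differs from
# every listed member (a logical AND of the negated matches).
negated = re.compile(r"[^aeiou]")
and_of_negations_holds = all(
    bool(negated.fullmatch(ch)) == all(ch != member for member in "aeiou")
    for ch in string.printable
)

print(not_vowel, not_h_a_to_f, and_of_negations_holds)
```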

Examples of Character Classes and Negated Character Classes

To match any one of the listed literals in a character class to a single character of a searched string. e.g.
- [a] matches an a.
- [aeiou] matches an a, or an e, or an i, or an o, or a u.
- [ueioa] matches a u, or an e, or an i, or an o, or an a.
To match any one of a range of literals in a character class to a single character of a searched string. e.g.
- [e-i] matches an e, or an f, or a g, or an h, or an i.
- [E-I] matches an E, or an F, or a G, or an H, or an I.
- [a-z] matches an a, or a b, or ..., or a y, or a z.
- [A-Z] matches an A, or a B, or ..., or a Y, or a Z.
- [3-7] matches a 3, or a 4, or a 5, or a 6, or a 7.
- [0-9] matches a 0, or a 1, or ..., or an 8, or a 9.
To match any one of several literal ranges in a character class to a single character of a searched string. e.g.
- [e-iE-I] matches an e, or an f, or a g, or an h, or an i, or an E, or an F, or a G, or an H, or an I.
- [a-z3-7] matches an a, or a b, or ..., or a y, or a z, or a 3, or a 4, or a 5, or a 6, or a 7.
- [A-Zw-ze-i] matches an A, or a B, or ..., or a Y, or a Z, or a w, or an x, or a y, or a z, or an e, or an f, or a g, or an h, or an i.
- [E-I3-7] matches an E, or an F, or a G, or an H, or an I, or a 3, or a 4, or a 5, or a 6, or a 7.
- [a-z0-9A-Z] matches an a, or a b, or ..., or a y, or a z, or a 0, or a 1, or ..., or an 8, or a 9, or an A, or a B, or ..., or a Y, or a Z.
To match any one of the listed literals and literal ranges in a character set to a single character of a searched string. e.g.
- [A3e-iBkE-I] matches an A, or a 3, or an e, or an f, or a g, or an h, or an i, or a B, or a k, or an E, or an F, or a G, or an H, or an I.
- [a-z93-71Z] matches an a, or a b, or ..., or a y, or a z, or a 9, or a 3, or a 4, or a 5, or a 6, or a 7, or a 1, or a Z.
- [3A-Ze-i] matches a 3, or an A, or a B, or ..., or a Y, or a Z, or an e, or an f, or a g, or an h, or an i.
- [E-I3-7b] matches an E, or an F, or a G, or an H, or an I, or a 3, or a 4, or a 5, or a 6, or a 7, or a b.
- [0-9aeiouA-Z] matches a 0, or a 1, or ..., or an 8, or a 9, or an a, or an e, or an i, or an o, or a u, or an A, or a B, or ..., or a Y, or a Z.
To match a single character of a searched string that is not any one of the listed literals and literal ranges in a character set. e.g.
- [^a] matches a character that is not an a.
- [^aeiou] matches a character that is not an a, and not an e, and not an i, and not an o, and not a u.
- [^e-i] matches a character that is not an e, and not an f, and not a g, and not an h, and not an i.
- [^e-iE-I] matches a character that is not an e, and not an f, and not a g, and not an h, and not an i, and not an E, and not an F, and not a G, and not an H, and not an I.
- [^A3e-iBkE-I] matches a character that is not an A, and not a 3, and not an e, and not an f, and not a g, and not an h, and not an i, and not a B, and not a k, and not an E, and not an F, and not a G, and not an H, and not an I.

Standard Character Classes

Besides customized character classes, some commonly used character classes are provided as standard character classes. These standard character classes are represented by metacharacter shorthands: the period . and the lowercase escaped characters \d, \s, and \w. When a metacharacter shorthand is placed in an expression outside any square bracket pair, the standard character class it names is used to match a single character of a searched string. When an escaped shorthand such as \d, \s, or \w is placed inside the square brackets of a character class, the standard character class it names acts as a subgroup of literals that, together with the other members of the character class, matches the same single character of a searched string. For example,

metacharacter .:
matches any single character of a searched string except the newline character.
metacharacter \d:
matches any single digit character of a searched string and is equivalent to [0-9]. Logically Or(d).
metacharacter \s:
matches any single white-space character of a searched string, including the space, tab, line feed, carriage return, vertical tab, and form feed characters, and is equivalent to [ \f\n\r\t\v]. Logically Or(s).
metacharacter \w:
matches any single word character of a searched string, that is, any alphanumeric character or the underscore (A-Z, a-z, 0-9, and _), and is equivalent to [A-Za-z0-9_]. Logically Or(w).
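
The stated equivalences can be compared side by side. This is a sketch with Python's re module; the re.ASCII flag is an assumption added here because, without it, Python's \d, \s, and \w also accept non-ASCII Unicode characters and would no longer equal the bracketed forms. The sample text is made up.

```python
import re

text = "Room 42,\tfloor_3\n"

# Shorthand on the left, explicit customized class on the right.
digits = re.findall(r"\d", text, re.ASCII)
digits_explicit = re.findall(r"[0-9]", text)

spaces = re.findall(r"\s", text, re.ASCII)
spaces_explicit = re.findall(r"[ \f\n\r\t\v]", text)

word_chars = "".join(re.findall(r"\w", text, re.ASCII))
word_chars_explicit = "".join(re.findall(r"[A-Za-z0-9_]", text))

# The period matches everything except the newline.
dots = re.findall(r".", "a\nb")

print(digits, spaces, word_chars, dots)
```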

Negated Standard Character Classes

Similar to customized character classes, each standard character class has a corresponding negated standard character class, which matches a single character of a searched string that does not belong to the standard character class. These negated standard character classes are represented by uppercase metacharacter shorthands. For example,

metacharacter \D:
matches any single character of a searched string that is not a digit character and is equivalent to [^0-9]. Logically negated(Or(d))=And(negated(d)).
metacharacter \S:
matches any single character of a searched string that is not a white-space character (space, tab, line feed, carriage return, vertical tab, or form feed) and is equivalent to [^ \f\n\r\t\v]. Logically negated(Or(s))=And(negated(s)).
metacharacter \W:
matches any single character of a searched string that is not a word character (A-Z, a-z, 0-9, or underscore) and is equivalent to [^A-Za-z0-9_]. Logically negated(Or(w))=And(negated(w)).
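
Since each uppercase shorthand is the negation of its lowercase counterpart, every character should match exactly one of the two. The sketch below checks this over the printable ASCII characters with Python's re module; is_complement is a hypothetical helper, and the re.ASCII flag is an assumption that keeps the shorthands limited to ASCII.

```python
import re
import string

PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def is_complement(shorthand, negated, alphabet=string.printable):
    """True if every character matches exactly one of the two shorthands."""
    pos = re.compile(shorthand, re.ASCII)
    neg = re.compile(negated, re.ASCII)
    return all(
        bool(pos.fullmatch(ch)) != bool(neg.fullmatch(ch))
        for ch in alphabet
    )

results = {pair: is_complement(*pair) for pair in PAIRS}
print(results)
```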

Standard Character Classes in Character Class

Since a metacharacter shorthand represents a standard character class, a metacharacter shorthand can be used as a character subclass inside a customized character class.

Using only a single metacharacter shorthand in Character Class

For example,

[^\d]=[\D]:
⇒negated(Or(d))=And(negated(d)).
Since \d=[0-9] matches any one digit character and \D=[^0-9] matches a character that is none of the digit characters, [^\d]=[^0-9]=\D=[\D]: matching none of the digit characters is equivalent to not matching (any one of the digit characters). Logically [\D]=\D=And(negated(d))=negated(Or(d))=[^\d].
[^\s]=[\S]:
⇒negated(Or(s))=And(negated(s)).
Since \s=[ \f\n\r\t\v] matches any one white-space character and \S=[^ \f\n\r\t\v] matches a character that is none of the white-space characters, [^\s]=[^ \f\n\r\t\v]=\S=[\S]: matching none of the white-space characters is equivalent to not matching (any one of the white-space characters). Logically [\S]=\S=And(negated(s))=negated(Or(s))=[^\s].
[^\w]=[\W]:
⇒negated(Or(w))=And(negated(w)).
Since \w=[A-Za-z0-9_] matches any one word character and \W=[^A-Za-z0-9_] matches a character that is none of the word characters, [^\w]=[^A-Za-z0-9_]=\W=[\W]: matching none of the word characters is equivalent to not matching (any one of the word characters). Logically [\W]=\W=And(negated(w))=negated(Or(w))=[^\w].
[^\D]=[\d]:
⇒negated(And(negated(d)))=Or(d).
Since \d=[0-9] matches any one digit character and \D=[^0-9] matches a character that is none of the digit characters, [^\D]=[0-9]=\d=[\d]: matching any one of the digit characters is equivalent to not matching (none of the digit characters). Logically [\d]=\d=Or(d)=negated(And(negated(d)))=[^\D].
[^\S]=[\s]:
⇒negated(And(negated(s)))=Or(s).
Since \s=[ \f\n\r\t\v] matches any one white-space character and \S=[^ \f\n\r\t\v] matches a character that is none of the white-space characters, [^\S]=[ \f\n\r\t\v]=\s=[\s]: matching any one of the white-space characters is equivalent to not matching (none of the white-space characters). Logically [\s]=\s=Or(s)=negated(And(negated(s)))=[^\S].
[^\W]=[\w]:
⇒negated(And(negated(w)))=Or(w).
Since \w=[A-Za-z0-9_] matches any one word character and \W=[^A-Za-z0-9_] matches a character that is none of the word characters, [^\W]=[A-Za-z0-9_]=\w=[\w]: matching any one of the word characters is equivalent to not matching (none of the word characters). Logically [\w]=\w=Or(w)=negated(And(negated(w)))=[^\W].
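
The six identities above can be verified mechanically by testing each pair of classes against every ASCII character. This is a sketch with Python's re module; the helper name same_class is illustrative, and the re.ASCII flag is an assumption that keeps the shorthands equal to their bracketed ASCII forms.

```python
import re

ASCII_CHARS = [chr(code) for code in range(128)]

EQUIVALENT = [
    (r"[^\d]", r"[\D]"),
    (r"[^\s]", r"[\S]"),
    (r"[^\w]", r"[\W]"),
    (r"[^\D]", r"[\d]"),
    (r"[^\S]", r"[\s]"),
    (r"[^\W]", r"[\w]"),
]

def same_class(left, right):
    """True if both classes accept exactly the same ASCII characters."""
    a = re.compile(left, re.ASCII)
    b = re.compile(right, re.ASCII)
    return all(
        bool(a.fullmatch(ch)) == bool(b.fullmatch(ch))
        for ch in ASCII_CHARS
    )

checks = [same_class(left, right) for left, right in EQUIVALENT]
print(checks)
```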

Using a metacharacter shorthand together with other members in Character Class

For example,

[a\d]=Or(a,d):
⇒Or(a,\d)=negated(And(negated(a),negated(\d))).
⇒Or(a,d)=negated(And(negated(a),And(negated(d)))).
⇒Or(a,d)=negated(And(negated(a),negated(d))).
Since \d=[0-9] matches any one digit character, [a\d]=[a0-9]=(a|\d) matches either an a or any digit character. Logically [a\d]=a or Or(d)=a or \d=(a|\d).
[a\D]=Or(a,And(negated(d))):
⇒Or(a,\D)=negated(And(negated(a),negated(\D))).
⇒Or(a,And(negated(d)))=negated(And(negated(a),\d)).
⇒And(Or(a,negated(d)))=negated(Or(And(negated(a),d))).
Since \D=[^0-9] matches a character that is none of the digit characters, [a\D]=[a] or [^0-9]=(a|\D) matches either an a or any non-digit character. Because an a is itself a non-digit character, the result is the same set as \D. Logically [a\D]=a or negated(Or(d))=a or \D=(a|\D).
[^a\d]=negated(Or(a,d)):
⇒negated(Or(a,\d))=And(negated(a),negated(\d)).
⇒negated(Or(a,d))=And(negated(a),negated(d)).
Since \d=[0-9] matches any one digit character, [^a\d]=[^a0-9] matches a character that is neither an a nor any digit character. Logically [^a\d]=[^a0-9]=negated(a) and negated(\d).
[^a\D]=negated(Or(a,And(negated(d)))):
⇒negated(Or(a,\D))=And(negated(a),negated(\D)).
⇒negated(Or(a,And(negated(d))))=And(negated(a),\d).
⇒negated(And(Or(a,negated(d))))=Or(And(negated(a),d)).
Since \D=[^0-9] matches a character that is none of the digit characters, [^a\D]=negated([a] or [^0-9])=negated(a) and \d matches a character that is neither an a nor a non-digit character, that is, any digit character. Logically [^a\D]=negated(a or negated(Or(d)))=negated(a) and Or(d)=negated(a) and \d.
[\s\d]=Or(s,d):
⇒Or(\s,\d)=negated(And(negated(\s),negated(\d))).
⇒Or(s,d)=negated(And(negated(s),negated(d))).
Since \s=[ \f\n\r\t\v] matches any one white-space character and \d=[0-9] matches any one digit character, [\s\d]=[ \f\n\r\t\v0-9]=(\s|\d) matches either any white-space character or any digit character. Logically [\s\d]=Or(s) or Or(d)=\s or \d=(\s|\d).
[\S\D]=negated(Or(And(s,d))):
⇒Or(\S,\D)=negated(And(negated(\S),negated(\D))).
⇒Or(negated(\s),negated(\d))=negated(And(\s,\d)).
⇒Or(And(negated(s)),And(negated(d)))=negated(And(Or(s),Or(d))).
⇒And(Or(negated(s),negated(d)))=negated(Or(And(s,d))).
Since \S=[^ \f\n\r\t\v] matches a character that is none of the white-space characters and \D=[^0-9] matches a character that is none of the digit characters, [\S\D]=[^ \f\n\r\t\v] or [^0-9]=(\S|\D) matches a character that is either not a white-space character or not a digit character. Because no character is both a white-space character and a digit character, [\S\D] matches every character. Logically [\S\D]=negated(Or(s)) or negated(Or(d))=\S or \D.
[^\s\d]=negated(Or(s,d)):
⇒negated(Or(\s,\d))=And(negated(\s),negated(\d)).
⇒negated(Or(s,d))=And(negated(s),negated(d)).
Since \s=[ \f\n\r\t\v] matches any one white-space character and \d=[0-9] matches any one digit character, [^\s\d]=[^ \f\n\r\t\v0-9]=negated(\s|\d) matches a character that is neither a white-space character nor a digit character. Logically [^\s\d]=negated(Or(s) or Or(d))=negated(Or(s,d))=negated(\s|\d).
[^\S\D]=Or(And(s,d)):
⇒negated(Or(\S,\D))=And(negated(\S),negated(\D)).
⇒negated(Or(negated(\s),negated(\d)))=And(\s,\d).
⇒negated(Or(And(negated(s)),And(negated(d))))=And(Or(s),Or(d)).
⇒negated(And(Or(negated(s),negated(d))))=Or(And(s,d)).
Since \S=[^ \f\n\r\t\v] matches a character that is none of the white-space characters and \D=[^0-9] matches a character that is none of the digit characters, [^\S\D]=[ \f\n\r\t\v] and [0-9] requires a character that is both a white-space character and a digit character. Because no such character exists, [^\S\D] never matches. Logically [^\S\D]=negated(negated(Or(s)) or negated(Or(d)))=\s and \d.
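
The combined classes can be expanded into plain sets to confirm the logical results, including the two edge cases where a class matches everything or nothing. This is a sketch with Python's re module over the 128 ASCII characters; members is a hypothetical helper, and the re.ASCII flag is an assumption that limits the shorthands to ASCII.

```python
import re

ASCII_CHARS = {chr(code) for code in range(128)}

def members(pattern):
    """Return the set of ASCII characters accepted by a class or shorthand."""
    compiled = re.compile(pattern, re.ASCII)
    return {ch for ch in ASCII_CHARS if compiled.fullmatch(ch)}

digits = members(r"\d")
spaces = members(r"\s")

a_or_digit = members(r"[a\d]")          # Or(a, d)
a_or_non_digit = members(r"[a\D]")      # same set as \D, since a is not a digit
not_a_not_digit = members(r"[^a\d]")    # And(negated(a), negated(d))
not_a_is_digit = members(r"[^a\D]")     # And(negated(a), \d): the digits
space_or_digit = members(r"[\s\d]")     # union of both standard classes
non_space_or_non_digit = members(r"[\S\D]")   # every character
space_and_digit = members(r"[^\S\D]")         # intersection: empty

print(len(a_or_digit), len(non_space_or_non_digit), len(space_and_digit))
```

Set union and intersection on the expanded members correspond to the Or and And operations used in the derivations.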

Link:http://output.to/sideway/default.asp?qno=160800022
