What is a Regular Expression?

A regular expression (also called regex or regexp) is a way to describe a pattern. It is used to locate or validate specific strings or patterns of text in a sentence, document, or any other character input.

Regular expressions use both basic and special characters. Basic characters are standard letters, numbers, and general keyboard characters, while all other characters are considered special.

This article will introduce special characters and discuss using them in regex and extended regular expressions (ERE) with plenty of examples.

Special Characters in Regular Expressions

The following characters are the essential special characters used in the regex syntax.

What Programming Languages Use Regular Expressions?

Some programming languages that use regular expressions include Perl, JavaScript, and PHP. Regex is also used in various programs and editors like vim, sed, or grep. It’s important to understand regex in these contexts so that you can easily find or manipulate information or help decipher the work of other programmers.

The same general regex rules apply to all of these applications, though their specific syntax implementations may differ slightly. We will focus on extended regular expressions for this article.

What are Extended Regular Expressions?

To understand extended regular expressions, we need to compare them to basic regular expressions (BRE). These expressions are also referred to as POSIX (Portable Operating System Interface for Unix) basic regular expressions. The difference between ERE and BRE is what happens when you add a backslash in front of a character within an expression.

Example Expression

Using the special characters grid and the definition of extended regular expressions, the expression this+thing in ERE would match the input thissssthing, since the plus sign is a special character that matches one or more occurrences of the preceding character. To do the same in BRE, you need to use this\+thing to make the plus sign a special character (the \+ form is supported by GNU tools such as grep and sed, though POSIX BRE does not define it).
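As a rough illustration, the sketch below uses Python's built-in re module. Python's syntax is neither POSIX BRE nor ERE, but it treats an unescaped plus sign as a quantifier and an escaped one as a literal, the way ERE does. The variable names and the second sample input are only for demonstration.

```python
import re

# Unescaped: '+' is a quantifier meaning "one or more of the preceding 's'".
quantified = re.compile(r"this+thing")

# Escaped: '\+' is a literal plus sign, as it would be in ERE.
literal = re.compile(r"this\+thing")

print(bool(quantified.fullmatch("thissssthing")))  # True
print(bool(quantified.fullmatch("this+thing")))    # False
print(bool(literal.fullmatch("this+thing")))       # True
print(bool(literal.fullmatch("thissssthing")))     # False
```

BRE reverses these roles for the plus sign, which cannot be shown directly in Python.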

The following characters are special in ERE and basic in BRE unless escaped:

  • Question mark: ?
  • Plus sign: +
  • Open or left brace: {
  • Closing or right brace: }
  • Pipe or vertical bar: |
  • Open or left parenthesis: (
  • Closed or right parenthesis: )

How to Write a Regular Expression Pattern

Now that we know more about regex let’s learn some specific tools to write a pattern.

Assertions

Assertions are special characters in regular expressions that remove ambiguity or partial matching from an expression. If you don’t use assertions in your expression, you could match only a portion of what a user inputs to a field or document, pulling unwanted results. Assertions come as either anchors or lookarounds; we won’t cover lookarounds because they are a bit advanced, but they essentially allow you to match text without including it in your match result.

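The two most important anchors are the caret (^), which matches the beginning of a line, and the dollar sign ($), which matches the end of a line. The \b anchor, which matches a word boundary, also has value.

The caret and dollar sign can be used together within a regular expression to match a specific group of characters on a single input line.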

Assertion Example 1

Below is a regular expression looking for any input that includes at least one character from a to z.

  • Expression: [a-z]
  • Input: 123word!

Since there are no assertion characters in the expression, the input 123word! would match the expression because it includes letters within the a-z range: w, o, r, or d.

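Because the goal is to reject input containing anything other than letters in the a to z range, we need to add assertions to the expression. Update your expression as shown below, which anchors the match to the beginning and end of the line. As written, it accepts only a line consisting of a single lowercase letter; the quantifiers covered later in this article let the group repeat.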

^[a-z]$
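The difference the anchors make can be sketched in Python with the re module, using the same 123word! input. This is only an illustration; the variable names are made up for the example.

```python
import re

text = "123word!"

# Without anchors, the pattern finds a letter somewhere in the input.
unanchored = re.search(r"[a-z]", text)
print(unanchored.group())  # w

# With both anchors, the whole input must be exactly one lowercase letter.
anchored = re.search(r"^[a-z]$", text)
print(anchored)  # None

# A single lowercase letter on its own does satisfy the anchored pattern.
print(bool(re.search(r"^[a-z]$", "w")))  # True
```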

Assertion Example 2

We want to use regex to validate a JavaScript form on a website to ensure only correctly formatted email addresses are entered. An email address must have an @ symbol and at least a two-letter TLD (top-level domain) like .co or .com, as shown below.

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

The syntax above indicates:

  • Any combination of lowercase and uppercase letters and numbers, as well as periods, underscores, percent signs, plus signs, and minus signs.
Note:
The plus sign outside the brackets indicates that the preceding group should match one or more characters.
  • An @ sign. The expression wants to match this character exactly, so it is not in a group.
  • Any combination of lowercase and uppercase letters and numbers, as well as periods and minus signs.
  • A period. As a period is a special character in ERE, we must escape it (include a backslash), so we can match the period character exactly.
  • A minimum of two uppercase and/or lowercase letters. The curly braces quantify the group to 2 or more matches.

Because both anchors are present, at the beginning and end of the expression, the pattern must account for every character on the line rather than just a portion of it.
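The email pattern above can be tried out with a short Python sketch. The function name looks_like_email and the sample inputs are hypothetical, and the check only confirms that the text fits the pattern, not that the address exists.

```python
import re

# The email pattern from the example above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def looks_like_email(value: str) -> bool:
    """Return True if value fits the pattern. This checks format only."""
    return EMAIL_PATTERN.search(value) is not None


# Hypothetical inputs: one well-formed address and three malformed ones.
for candidate in ["any@email.com", "any@email,com", "any@email.c", "no-at-sign.com"]:
    print(candidate, looks_like_email(candidate))
```

Only the first candidate passes; the others fail on the comma, the one-letter TLD, and the missing @ sign respectively.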

Assertion Example 3

Another example is to verify valid American Express credit card numbers have been input into a system. American Express credit card numbers begin with either 34 or 37 and are 15 digits long.

We will want to use both the caret and dollar sign anchors to prevent excess characters from being accepted for our credit card filter. Since we know the card number starts with 3, our expression starts out looking like this.

^3$

Groups and Ranges

A group (often called a character class or bracket expression) is a specific set of characters that will match any one character inside it. For instance, [abcde] is a group that would match any one of those five letters. In basic implementations for simple matching, groups are placed into square brackets. We will cover the use of parentheses later in the article.

Ranges are used within groups to match a bunch of known characters all at once. For example, our [a-z] expression is a range within a group.

Ranges can be truncated or added together in a single group with additional characters. For example, the regular expression [a-fA-F0-9] will match any single hexadecimal digit.

You can negate a group by adding a caret as the first character. Therefore, [^a-zA-Z] will match any non-letter character and [^0-9] will match any non-numeric character.
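A quick way to see groups, ranges, and negation in action is a small Python sketch like the one below. The sample strings are invented for illustration.

```python
import re

hex_digit = re.compile(r"[a-fA-F0-9]")   # any single hexadecimal digit
non_letter = re.compile(r"[^a-zA-Z]")    # any single non-letter character
non_digit = re.compile(r"[^0-9]")        # any single non-numeric character

# findall returns every individual character that the group matches.
print(hex_digit.findall("0xBEEF, g7"))  # ['0', 'B', 'E', 'E', 'F', '7']
print(non_letter.findall("ab1-c"))      # ['1', '-']
print(non_digit.findall("4a2!"))        # ['a', '!']
```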

Group and Range Example 1

In our email address example, we defined three groups:

  • First Group: Uses all lowercase and uppercase letter characters, numbers, and the specific characters for a period, underscore, percent sign, plus sign, and minus sign.
  • Second Group: Uses all lowercase and uppercase letter characters, numbers, and the specific characters for a period and minus sign. Dropping the underscore, percent sign, and plus sign leaves only the characters that are legal in non-Unicode domain names.
  • Third Group: Uses all lowercase and uppercase letter characters with a two-character minimum to match non-Unicode TLDs.

Group and Range Example 2

We began our credit card filter expression by indicating all cards must start with the number 3. As American Express cards start with either 34 or 37, we add the group [47] to the expression, followed by a grouped range to include all other digits. Our expression is nearly complete!

^3[47][0-9]$

Quantifiers

Quantifiers allow for some flexibility in matching as they define the number of times a character, pattern, or group appears in a regex match.

Quantifier Example 1

For example, you might want to match as many characters in the group [a-zA-Z] as possible, as long as at least one letter is present. In this case, you can use the plus sign character after the group to match one or more of the preceding characters.

[a-zA-Z]+

Quantifier Example 2

Suppose you want to validate an input text box that only allows numbers and no other characters (letters, hyphens, periods, etc.). Below is the regex syntax, with both anchors included so the entire input must consist of digits.

^[0-9]+$

Our syntax requires the field to include at least one digit but allows for any number of digits.

Quantifier Example 3

The question mark is useful for optional characters. For example, write the below expression if you want to match the names Ashle, Ashlee, and Ashley.

Ashle[ey]?

The quantifier doesn’t only have to follow a group; it can follow a single character as well. 

Quantifier Example 4

There are plus sign quantifiers in our email address example after the email name and domain name groups, so each group matches any combination of its letters, numbers, and characters.

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

The curly-brace quantifier at the end signifies the email ending should be two or more characters in length, like .com, .io, .ninja, or .photography.

Quantifier Example 5

No quantifier is needed for the [47] group in our credit card filter since this will match one of those characters by default. There are 15 numbers in an American Express card, so excluding the two we already referenced in our expression, 13 numbers remain. Therefore, the final group should match 13 digits exactly. The full expression for our credit card example is below.

^3[47][0-9]{13}$
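The finished filter can be exercised with a short Python sketch. The function name looks_like_amex is hypothetical, and the sample numbers are placeholders built from zeros, not real card numbers. The pattern checks only the prefix and length; it does not verify the card's checksum.

```python
import re

# The full expression from above: 3, then 4 or 7, then exactly 13 more digits.
AMEX_PATTERN = re.compile(r"^3[47][0-9]{13}$")


def looks_like_amex(number: str) -> bool:
    """Return True if number has the American Express prefix and length."""
    return AMEX_PATTERN.search(number) is not None


# Placeholder inputs assembled so the digit counts are easy to verify.
print(looks_like_amex("34" + "0" * 13))  # True  (15 digits, starts with 34)
print(looks_like_amex("37" + "0" * 13))  # True  (15 digits, starts with 37)
print(looks_like_amex("35" + "0" * 13))  # False (wrong second digit)
print(looks_like_amex("34" + "0" * 12))  # False (only 14 digits)
print(looks_like_amex("34" + "0" * 14))  # False (16 digits)
```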

Metacharacters

Metacharacters, also known as shorthand, are additional special characters that replace longer expressions. Shorthand such as \s (whitespace) and \d (digit) is used in non-POSIX implementations of regex, like PCRE (Perl Compatible Regular Expressions). While these metacharacters do not have any bearing on extended regular expressions, except for the period, they are still handy to know.

Metacharacter Example

While there are no letter-based metacharacters in extended regular expressions except for the period, we still have provided an example for programming in Perl.

Sometimes it’s helpful to match space-delimited output. A space delimiter is a symbol or space that separates your data. An example of space-delimited output comes from the command-line utility column -t. While this output is supposedly tab-delimited, the number of tabs can vary, and spaces are also used to align the text visually.

Let’s say that we have output from column -t with letter and number combinations in the second column for apartment numbers. They will always have one letter and one to three digits, surrounded by whitespace. We can match this like so in PCRE.

\s[a-zA-Z]\d+\s

This expression matches like so:

  • Whitespace: \s
  • A single letter (upper or lower case): [a-zA-Z]
  • One or more digits: \d+ (to enforce the limit of three digits, use \d{1,3} instead)
  • Whitespace: \s
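Python's re module supports the \s and \d shorthand in the same way PCRE does, so the expression can be sketched there as well. The rows below are invented stand-ins for aligned column -t output, and the name APARTMENT_PATTERN is made up for the example.

```python
import re

# Whitespace, one letter, one or more digits, whitespace.
APARTMENT_PATTERN = re.compile(r"\s[a-zA-Z]\d+\s")

# Hypothetical rows of aligned, space-delimited output.
rows = [
    "Smith    B12   Detroit",
    "Jones    22    Lansing",
]

for row in rows:
    match = APARTMENT_PATTERN.search(row)
    # The match includes the surrounding whitespace, so strip it for display.
    print(match.group().strip() if match else None)
```

The first row prints B12; the second prints None because its second column has no leading letter.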

Escaping

Escaping is essential when you want to match a literal character that your regex interpreter could interpret as a special character. As we mentioned earlier, the function of escaping, or adding a backslash before a character, depends on whether you are using ERE or BRE.

To match the smiley face, :^), your extended regular expression would need to look like the syntax below.

:\^\)

Escaping Example

The period between the domain name and the TLD in our email address example is where we need a backslash. Without the escape character ahead of the period, it would match any character and could return results with typos, like any@email,com where there is a comma instead of a period.
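Both escaping cases, the smiley face and the period, can be checked with a brief Python sketch. The simplified any@email patterns are stand-ins for the full email expression, and re.escape is Python's own helper for escaping a literal string, shown here as one convenient option.

```python
import re

# Escaped caret and parenthesis match the literal smiley face.
smiley = re.compile(r":\^\)")
print(bool(smiley.search("Nice work :^)")))  # True

# re.escape builds an escaped pattern from a literal string automatically.
auto_smiley = re.compile(re.escape(":^)"))
print(bool(auto_smiley.search("Nice work :^)")))  # True

# An unescaped period matches any character, including a comma.
unescaped_dot = re.compile(r"^any@email.com$")
escaped_dot = re.compile(r"^any@email\.com$")
print(bool(unescaped_dot.search("any@email,com")))  # True
print(bool(escaped_dot.search("any@email,com")))    # False
print(bool(escaped_dot.search("any@email.com")))    # True
```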

Parentheses

Parentheses create sub-expressions and sequences. A sub-expression is a complete regular expression that can be treated as a unit within a larger expression. Even though we will not cover sequences in this article, a sequence is a small portion of a regular expression that can be replaced with simple variables later and is used in Python, Perl, and sed, amongst others.

Parentheses Example 1

Let’s think about matching specific image formats that we know our program will be able to handle. For example, we can process PNG, JPG (JPEG), and GIF formats. So how could we use regex to make sure that file uploads will match?

We can match a full-text file extension with a parenthetical sub-expression. Much like a group for single characters can go in square brackets, sub-expressions in parentheses can house strings.

(png|jpg|jpeg|gif)
Note:
The pipe between each option is called alternation. It acts as the logical OR operator in regular expressions.
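One way the alternation might be applied to upload checks is sketched below in Python. The leading escaped period and the trailing dollar sign are additions assumed here so that only the file extension is tested; the function name and file names are hypothetical.

```python
import re

# Assumption: anchor the sub-expression to a literal dot at the end of the name.
UPLOAD_PATTERN = re.compile(r"\.(png|jpg|jpeg|gif)$")


def is_allowed_upload(filename: str) -> bool:
    """Return True if filename ends in one of the supported image extensions."""
    return UPLOAD_PATTERN.search(filename) is not None


print(is_allowed_upload("photo.png"))   # True
print(is_allowed_upload("scan.jpeg"))   # True
print(is_allowed_upload("script.exe"))  # False
print(is_allowed_upload("gif.txt"))     # False
```

This sketch is case-sensitive, so a name like PHOTO.PNG would be rejected unless the pattern or the input is adjusted.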

Parentheses Example 2

These sub-expressions can be treated as a single character in regex, meaning they can be affected by quantifiers, even if the sub-expressions contain quantifiers.

Sub(Expres+ion)?

This regex will match Sub, SubExpression, or the misspelled SubExpresion.
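The behavior of the quantified sub-expression can be confirmed with a small Python sketch. fullmatch is used so that each sample word must match in its entirety; the extra non-matching samples are invented for contrast.

```python
import re

# The optional sub-expression contains its own quantifier on the letter 's'.
pattern = re.compile(r"Sub(Expres+ion)?")

for word in ["Sub", "SubExpression", "SubExpresion", "SubExpreion", "SubExp"]:
    print(word, bool(pattern.fullmatch(word)))
```

The first three words match. SubExpreion fails because at least one s is required, and SubExp fails because the sub-expression must appear whole or not at all.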

Parentheses Example 3

Recall our matching for Ashlee, Ashle, and Ashley? We can also add Ashleigh as a name to match using this full expression.

^Ashle(igh|[ey])?$

This sub-expression gives us the flexibility to add another match to our bracketed group without resorting to fuzzy or poor matching, as you might get with Ashle[eyigh]+. That looser pattern will technically work, but it would also match excess names like Ashleiyge or Ashlei.
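The contrast between the two approaches can be sketched in Python. The looser pattern is anchored here, which is an assumption made so the comparison is like for like; the variable names are made up for the example.

```python
import re

strict = re.compile(r"^Ashle(igh|[ey])?$")
loose = re.compile(r"^Ashle[eyigh]+$")  # anchored here for a fair comparison

names = ["Ashle", "Ashlee", "Ashley", "Ashleigh", "Ashleiyge", "Ashlei"]

for name in names:
    # Print whether each pattern accepts the name.
    print(name, bool(strict.search(name)), bool(loose.search(name)))
```

The strict pattern accepts only the four intended names. The loose pattern accepts Ashleiyge and Ashlei as well, and it rejects Ashle because the plus sign requires at least one extra character.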

Conclusion

You should now be able to create a credit card filter that matches correctly formatted American Express credit card numbers and complete various other tasks using extended regular expressions.

Want to match a U.S. phone number with the area code separated by either dots or dashes? How about verifying that uploaded documents within a form are not dangerous executable files? Regular expressions are the key to finding your answer.

About the Author: Andrej Walilko

Andrej Walilko (RHCE6) is a seasoned Linux Administrator, and he is a Migration Project Manager at Liquid Web, developing specialized processes for complex migration types. He enjoys doing woodworking, home improvement, and playing piano in his free time.
