Current Tutorial
Introducing Regular Expressions
Next in the Series

Next in the Series: String Literals

Introducing Regular Expressions

 

Introducing Regular Expressions

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data. You must learn a specific syntax to create regular expressions — one that goes beyond the normal syntax of the Java programming language. Regular expressions vary in complexity, but once you understand the basics of how they are constructed, you will be able to decipher (or create) any regular expression.

In the world of regular expressions, there are many different flavors to choose from, such as grep, Perl, Tcl, Python, PHP, and awk. The regular expression syntax in the java.util.regex API is most similar to that found in Perl.

The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.

A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile() methods, which will then return a Pattern object. These methods accept a regular expression as the first argument; the following sections are covering the required syntax.

A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher() method on a Pattern object.

A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.

Before deep diving each class, you must understand how regular expressions are actually constructed. Let us introduce a simple test harness that will be used repeatedly to explore their syntax.

 

Unicode Support

As of the JDK 7 release, Regular Expression pattern matching has expanded functionality to support Unicode 6.0.

Matching a Specific Code Point

You can match a specific Unicode code point using an escape sequence of the form \uFFFF, where FFFF is the hexadecimal value of the code point you want to match. For example, \u6771 matches the Han character for east.

Alternatively, you can specify a code point using Perl-style hex notation, \x{...}. For example:

String hexPattern = "\x{" + Integer.toHexString(codePoint) + "}";

Unicode Character Properties

Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}. You can match a single character not belonging to a particular category with the expression \P{prop}.

The three supported property types are scripts, blocks, and a "general" category.

Scripts

To determine if a code point belongs to a specific script, you can either use the script keyword, or the sc short form, for example, \p{script=Hiragana}. Alternatively, you can prefix the script name with the string Is, such as \p{IsHiragana}.

Valid script names supported by Pattern are those accepted by UnicodeScript.forName().

Blocks

A block can be specified using the block keyword, or the blk short form, for example, \p{block=Mongolian}. Alternatively, you can prefix the block name with the string In, such as \p{InMongolian}.

Valid block names supported by Pattern are those accepted by UnicodeScript.forName().

General Category

Categories can be specified with optional prefix Is. For example, IsL matches the category of Unicode letters. Categories can also be specified by using the general_category keyword, or the short form gc. For example, an uppercase letter can be matched using general_category=Lu or gc=Lu.

Supported categories are those of The Unicode Standard in the version specified by the Character class.

 

Test Harness

This section defines a reusable test harness, RegexTestHarness.java , for exploring the regular expression constructs supported by this API. The command to run this code is java RegexTestHarness; no command-line arguments are accepted. The application loops repeatedly, prompting the user for a regular expression and input string. Using this test harness is optional, but you may find it convenient for exploring the test cases discussed in the following pages.

 

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {

    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern = 
            Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher = 
            pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

Before continuing to the next section, you may save and compile this code to ensure that your development environment supports the required packages.


Last update: January 10, 2022


Current Tutorial
Introducing Regular Expressions
Next in the Series

Next in the Series: String Literals