FileDocCategorySizeDatePackage
RE.javaAPI DocJava SE 6 API65477Tue Jun 10 00:22:24 BST 2008com.sun.org.apache.regexp.internal

RE

public class RE extends Object implements Serializable
RE is an efficient, lightweight regular expression evaluator/matcher class. Regular expressions are pattern descriptions which enable sophisticated matching of strings. In addition to being able to match a string against a pattern, you can also extract parts of the match. This is especially useful in text parsing! Details on the syntax of regular expression patterns are given below.

To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:

RE r = new RE("a*b");

Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:

boolean matched = r.match("aaaab");
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".

If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:

RE r = new RE("(a*)b"); // Compile expression
boolean matched = r.match("xaaaab"); // Match against "xaaaab"

String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'

int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5

int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
([0-9]+)=\1
will match any string of the form n=n (like 0=0 or 2=2).

The full regular expression syntax accepted by RE is described here:


Characters

unicodeChar Matches any identical unicode character
\ Used to quote a meta-character (like '*')
\\ Matches a single '\' character
\0nnn Matches a given octal character
\xhh Matches a given 8-bit hexadecimal character
\\uhhhh Matches a given 16-bit hexadecimal character
\t Matches an ASCII tab character
\n Matches an ASCII newline character
\r Matches an ASCII return character
\f Matches an ASCII form feed character


Character Classes

[abc] Simple character class
[a-zA-Z] Character class with ranges
[^abc] Negated character class
NOTE: Incomplete ranges will be interpreted as "starts from zero" or "ends with last character".
I.e. [-a] is the same as [\\u0000-a], and [a-] is the same as [a-\\uFFFF], [-] means "all characters".

Standard POSIX Character Classes

[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space and tab characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are printable and are also visible.
(A space is printable, but not visible, while an
`a' is both.)
[:lower:] Lower-case alphabetic characters.
[:print:] Printable characters (characters that are not
control characters.)
[:punct:] Punctuation characters (characters that are not letter,
digits, control characters, or space characters).
[:space:] Space characters (such as space, tab, and formfeed,
to name a few).
[:upper:] Upper-case alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.


Non-standard POSIX-style Character Classes

[:javastart:] Start of a Java identifier
[:javapart:] Part of a Java identifier


Predefined Classes

. Matches any character other than newline
\w Matches a "word" character (alphanumeric plus "_")
\W Matches a non-word character
\s Matches a whitespace character
\S Matches a non-whitespace character
\d Matches a digit character
\D Matches a non-digit character


Boundary Matchers

^ Matches only at the beginning of a line
$ Matches only at the end of a line
\b Matches only at a word boundary
\B Matches only at a non-word boundary


Greedy Closures

A* Matches A 0 or more times (greedy)
A+ Matches A 1 or more times (greedy)
A? Matches A 1 or 0 times (greedy)
A{n} Matches A exactly n times (greedy)
A{n,} Matches A at least n times (greedy)
A{n,m} Matches A at least n but not more than m times (greedy)


Reluctant Closures

A*? Matches A 0 or more times (reluctant)
A+? Matches A 1 or more times (reluctant)
A?? Matches A 0 or 1 times (reluctant)


Logical Operators

AB Matches A followed by B
A|B Matches either A or B
(A) Used for subexpression grouping
(?:A) Used for subexpression clustering (just like grouping but
no backrefs)


Backreferences

\1 Backreference to 1st parenthesized subexpression
\2 Backreference to 2nd parenthesized subexpression
\3 Backreference to 3rd parenthesized subexpression
\4 Backreference to 4th parenthesized subexpression
\5 Backreference to 5th parenthesized subexpression
\6 Backreference to 6th parenthesized subexpression
\7 Backreference to 7th parenthesized subexpression
\8 Backreference to 8th parenthesized subexpression
\9 Backreference to 9th parenthesized subexpression

All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.

Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

  • A newline (line feed) character ('\n'),
  • A carriage-return character followed immediately by a newline character ("\r\n"),
  • A standalone carriage-return character ('\r'),
  • A next-line character ('\u0085'),
  • A line-separator character ('\u2028'), or
  • A paragraph-separator character ('\u2029).

RE runs programs compiled by the RECompiler class. But the RE matcher class does not include the actual regular expression compiler for reasons of efficiency. In fact, if you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:

// Pre-compiled regular expression "a*b"
char[] re1Instructions =
{
0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
0x0000,
};


REProgram re1 = new REProgram(re1Instructions);
You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not threadsafe (for efficiency reasons, and because requiring thread safety in this class is deemed to be a rare requirement), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once expression compiled into the REProgram object, REProgram can be safely shared across multiple threads and RE objects.


ISSUES:

  • com.weusours.util.re is not currently compatible with all standard POSIX regcomp flags
  • com.weusours.util.re does not support POSIX equivalence classes ([=foo=] syntax) (I18N/locale issue)
  • com.weusours.util.re does not support nested POSIX character classes (definitely should, but not completely trivial)
  • com.weusours.util.re Does not support POSIX character collation concepts ([.foo.] syntax) (I18N/locale issue)
  • Should there be different matching styles (simple, POSIX, Perl etc?)
  • Should RE support character iterators (for backwards RE matching!)?
  • Should RE support reluctant {m,n} closures (does anyone care)?
  • Not *all* possibilities are considered for greediness when backreferences are involved (as POSIX suggests should be the case). The POSIX RE "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match of acdacaa where \1 is "a". This is not the case in this RE package, and actually Perl doesn't go to this extent either! Until someone actually complains about this, I'm not sure it's worth "fixing". If it ever is fixed, test #137 in RETest.txt should be updated.
see
recompile
see
RECompiler
author
Jonathan Locke
author
Tobias Schäfer
version
$Id: RE.java,v 1.1.2.1 2005/08/01 00:02:55 jeffsuttor Exp $

Fields Summary
public static final int
MATCH_NORMAL
Specifies normal, case-sensitive matching behaviour.
public static final int
MATCH_CASEINDEPENDENT
Flag to indicate that matching should be case-independent (folded)
public static final int
MATCH_MULTILINE
Newlines should match as BOL/EOL (^ and $)
public static final int
MATCH_SINGLELINE
Consider all input a single body of text - newlines are matched by .
static final char
OP_END
* The format of a node in a program is: * * [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ] * * char OPCODE - instruction * char OPDATA - modifying data * char OPNEXT - next node (relative offset) * *
static final char
OP_BOL
static final char
OP_EOL
static final char
OP_ANY
static final char
OP_ANYOF
static final char
OP_BRANCH
static final char
OP_ATOM
static final char
OP_STAR
static final char
OP_PLUS
static final char
OP_MAYBE
static final char
OP_ESCAPE
static final char
OP_OPEN
static final char
OP_OPEN_CLUSTER
static final char
OP_CLOSE
static final char
OP_CLOSE_CLUSTER
static final char
OP_BACKREF
static final char
OP_GOTO
static final char
OP_NOTHING
static final char
OP_RELUCTANTSTAR
static final char
OP_RELUCTANTPLUS
static final char
OP_RELUCTANTMAYBE
static final char
OP_POSIXCLASS
static final char
E_ALNUM
static final char
E_NALNUM
static final char
E_BOUND
static final char
E_NBOUND
static final char
E_SPACE
static final char
E_NSPACE
static final char
E_DIGIT
static final char
E_NDIGIT
static final char
POSIX_CLASS_ALNUM
static final char
POSIX_CLASS_ALPHA
static final char
POSIX_CLASS_BLANK
static final char
POSIX_CLASS_CNTRL
static final char
POSIX_CLASS_DIGIT
static final char
POSIX_CLASS_GRAPH
static final char
POSIX_CLASS_LOWER
static final char
POSIX_CLASS_PRINT
static final char
POSIX_CLASS_PUNCT
static final char
POSIX_CLASS_SPACE
static final char
POSIX_CLASS_UPPER
static final char
POSIX_CLASS_XDIGIT
static final char
POSIX_CLASS_JSTART
static final char
POSIX_CLASS_JPART
static final int
maxNode
static final int
MAX_PAREN
static final int
offsetOpcode
static final int
offsetOpdata
static final int
offsetNext
static final int
nodeSize
REProgram
program
transient CharacterIterator
search
int
matchFlags
int
maxParen
transient int
parenCount
transient int
start0
transient int
end0
transient int
start1
transient int
end1
transient int
start2
transient int
end2
transient int[]
startn
transient int[]
endn
transient int[]
startBackref
transient int[]
endBackref
public static final int
REPLACE_ALL
Flag bit that indicates that subst should replace all occurrences of this regular expression.
public static final int
REPLACE_FIRSTONLY
Flag bit that indicates that subst should only replace the first occurrence of this regular expression.
public static final int
REPLACE_BACKREFERENCES
Flag bit that indicates that subst should replace backreferences
Constructors Summary
public RE(String pattern)
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.

param
pattern The regular expression pattern to compile.
exception
RESyntaxException Thrown if the regular expression has invalid syntax.
see
RECompiler
see
recompile

                   // Lazy-alloced array of backref ends

                                                                  
        
    
        this(pattern, MATCH_NORMAL);
    
public RE(String pattern, int matchFlags)
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler. If you will be compiling many expressions, you may prefer to use a single RECompiler object instead.

param
pattern The regular expression pattern to compile.
param
matchFlags The matching style
exception
RESyntaxException Thrown if the regular expression has invalid syntax.
see
RECompiler
see
recompile

        this(new RECompiler().compile(pattern));
        setMatchFlags(matchFlags);
    
public RE(REProgram program, int matchFlags)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data. Permits special flags to be passed in to modify matching behaviour.

param
program Compiled regular expression program (see RECompiler and/or recompile)
param
matchFlags One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
see
RECompiler
see
REProgram
see
recompile

        setProgram(program);
        setMatchFlags(matchFlags);
    
public RE(REProgram program)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data.

param
program Compiled regular expression program
see
RECompiler
see
recompile

        this(program, MATCH_NORMAL);
    
public RE()
Constructs a regular expression matcher with no initial program. This is likely to be an uncommon practice, but is still supported.

        this((REProgram)null, MATCH_NORMAL);
    
Methods Summary
private final voidallocParens()
Performs lazy allocation of subexpression arrays

        // Allocate arrays for subexpressions
        startn = new int[maxParen];
        endn = new int[maxParen];

        // Set sub-expression pointers to invalid values
        for (int i = 0; i < maxParen; i++)
        {
            startn[i] = -1;
            endn[i] = -1;
        }
    
private intcompareChars(char c1, char c2, boolean caseIndependent)
Compares two characters.

param
c1 first character to compare.
param
c2 second character to compare.
param
caseIndependent whether comparision is case insensitive or not.
return
negative, 0, or positive integer as the first character less than, equal to, or greater then the second.

        if (caseIndependent)
        {
            c1 = Character.toLowerCase(c1);
            c2 = Character.toLowerCase(c2);
        }
        return ((int)c1 - (int)c2);
    
public intgetMatchFlags()
Returns the current match behaviour flags.

return
Current match behaviour flags (RE.MATCH_*).
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL
see
#setMatchFlags

        return matchFlags;
    
public java.lang.StringgetParen(int which)
Gets the contents of a parenthesized subexpression after a successful match.

param
which Nesting level of subexpression
return
String

        int start;
        if (which < parenCount && (start = getParenStart(which)) >= 0)
        {
            return search.substring(start, getParenEnd(which));
        }
        return null;
    
public intgetParenCount()
Returns the number of parenthesized subexpressions available after a successful match.

return
Number of available parenthesized subexpressions

        return parenCount;
    
public final intgetParenEnd(int which)
Returns the end index of a given paren level.

param
which Nesting level of subexpression
return
String index

        if (which < parenCount)
        {
            switch (which)
            {
                case 0:
                    return end0;

                case 1:
                    return end1;

                case 2:
                    return end2;

                default:
                    if (endn == null)
                    {
                        allocParens();
                    }
                    return endn[which];
            }
        }
        return -1;
    
public final intgetParenLength(int which)
Returns the length of a given paren level.

param
which Nesting level of subexpression
return
Number of characters in the parenthesized subexpression

        if (which < parenCount)
        {
            return getParenEnd(which) - getParenStart(which);
        }
        return -1;
    
public final intgetParenStart(int which)
Returns the start index of a given paren level.

param
which Nesting level of subexpression
return
String index

        if (which < parenCount)
        {
            switch (which)
            {
                case 0:
                    return start0;

                case 1:
                    return start1;

                case 2:
                    return start2;

                default:
                    if (startn == null)
                    {
                        allocParens();
                    }
                    return startn[which];
            }
        }
        return -1;
    
public REProgramgetProgram()
Returns the current regular expression program in use by this matcher object.

return
Regular expression program
see
#setProgram

        return program;
    
public java.lang.String[]grep(java.lang.Object[] search)
Returns an array of Strings, whose toString representation matches a regular expression. This method works like the Perl function of the same name. Given a regular expression of "a*b" and an array of String objects of [foo, aab, zzz, aaaab], the array of Strings returned by grep would be [aab, aaaab].

param
search Array of Objects to search
return
Array of Strings whose toString() value matches this regular expression.

        // Create new vector to hold return items
        Vector v = new Vector();

        // Traverse array of objects
        for (int i = 0; i < search.length; i++)
        {
            // Get next object as a string
            String s = search[i].toString();

            // If it matches this regexp, add it to the list
            if (match(s))
            {
                v.addElement(s);
            }
        }

        // Return vector as an array of strings
        String[] ret = new String[v.size()];
        v.copyInto(ret);
        return ret;
    
protected voidinternalError(java.lang.String s)
Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption). In practice, this should be very rare.

param
s Error description

        throw new Error("RE internal error: " + s);
    
private booleanisNewline(int i)

return
true if character at i-th position in the search string is a newline

        char nextChar = search.charAt(i);

        if (nextChar == '\n" || nextChar == '\r" || nextChar == '\u0085"
            || nextChar == '\u2028" || nextChar == '\u2029")
        {
            return true;
        }

        return false;
    
public booleanmatch(java.lang.String search, int i)
Matches the current regular expression program against a character array, starting at a given index.

param
search String to match against
param
i Index to start searching at
return
True if string matched

        return match(new StringCharacterIterator(search), i);
    
public booleanmatch(CharacterIterator search, int i)
Matches the current regular expression program against a character array, starting at a given index.

param
search String to match against
param
i Index to start searching at
return
True if string matched

        // There is no compiled program to search with!
        if (program == null)
        {
            // This should be uncommon enough to be an error case rather
            // than an exception (which would have to be handled everywhere)
            internalError("No RE program to run!");
        }

        // Save string to search
        this.search = search;

        // Can we optimize the search by looking for a prefix string?
        if (program.prefix == null)
        {
            // Unprefixed matching must try for a match at each character
            for ( ;! search.isEnd(i - 1); i++)
            {
                // Try a match at index i
                if (matchAt(i))
                {
                    return true;
                }
            }
            return false;
        }
        else
        {
            // Prefix-anchored matching is possible
            boolean caseIndependent = (matchFlags & MATCH_CASEINDEPENDENT) != 0;
            char[] prefix = program.prefix;
            for ( ; !search.isEnd(i + prefix.length - 1); i++)
            {
                int j = i;
                int k = 0;

                boolean match;
                do {
                    // If there's a mismatch of any character in the prefix, give up
                    match = (compareChars(search.charAt(j++), prefix[k++], caseIndependent) == 0);
                } while (match && k < prefix.length);

                // See if the whole prefix string matched
                if (k == prefix.length)
                {
                    // We matched the full prefix at firstChar, so try it
                    if (matchAt(i))
                    {
                        return true;
                    }
                }
            }
            return false;
        }
    
public booleanmatch(java.lang.String search)
Matches the current regular expression program against a String.

param
search String to match against
return
True if string matched

        return match(search, 0);
    
protected booleanmatchAt(int i)
Match the current regular expression program against the current input string, starting at index i of the input string. This method is only meant for internal use.

param
i The input string index to start matching at
return
True if the input matched the expression

        // Initialize start pointer, paren cache and paren count
        start0 = -1;
        end0   = -1;
        start1 = -1;
        end1   = -1;
        start2 = -1;
        end2   = -1;
        startn = null;
        endn   = null;
        parenCount = 1;
        setParenStart(0, i);

        // Allocate backref arrays (unless optimizations indicate otherwise)
        if ((program.flags & REProgram.OPT_HASBACKREFS) != 0)
        {
            startBackref = new int[maxParen];
            endBackref = new int[maxParen];
        }

        // Match against string
        int idx;
        if ((idx = matchNodes(0, maxNode, i)) != -1)
        {
            setParenEnd(0, idx);
            return true;
        }

        // Didn't match
        parenCount = 0;
        return false;
    
protected intmatchNodes(int firstNode, int lastNode, int idxStart)
Try to match a string against a subset of nodes in the program

param
firstNode Node to start at in program
param
lastNode Last valid node (used for matching a subexpression without matching the rest of the program as well).
param
idxStart Starting position in character array
return
Final input array index if match succeeded. -1 if not.

        // Our current place in the string
        int idx = idxStart;

        // Loop while node is valid
        int next, opcode, opdata;
        int idxNew;
        char[] instruction = program.instruction;
        for (int node = firstNode; node < lastNode; )
        {
            opcode = instruction[node + offsetOpcode];
            next   = node + (short)instruction[node + offsetNext];
            opdata = instruction[node + offsetOpdata];

            switch (opcode)
            {
                case OP_RELUCTANTMAYBE:
                    {
                        int once = 0;
                        do
                        {
                            // Try to match the rest without using the reluctant subexpr
                            if ((idxNew = matchNodes(next, maxNode, idx)) != -1)
                            {
                                return idxNew;
                            }
                        }
                        while ((once++ == 0) && (idx = matchNodes(node + nodeSize, next, idx)) != -1);
                        return -1;
                    }

                case OP_RELUCTANTPLUS:
                    while ((idx = matchNodes(node + nodeSize, next, idx)) != -1)
                    {
                        // Try to match the rest without using the reluctant subexpr
                        if ((idxNew = matchNodes(next, maxNode, idx)) != -1)
                        {
                            return idxNew;
                        }
                    }
                    return -1;

                case OP_RELUCTANTSTAR:
                    do
                    {
                        // Try to match the rest without using the reluctant subexpr
                        if ((idxNew = matchNodes(next, maxNode, idx)) != -1)
                        {
                            return idxNew;
                        }
                    }
                    while ((idx = matchNodes(node + nodeSize, next, idx)) != -1);
                    return -1;

                case OP_OPEN:

                    // Match subexpression
                    if ((program.flags & REProgram.OPT_HASBACKREFS) != 0)
                    {
                        startBackref[opdata] = idx;
                    }
                    if ((idxNew = matchNodes(next, maxNode, idx)) != -1)
                    {
                        // Increase valid paren count
                        if ((opdata + 1) > parenCount)
                        {
                            parenCount = opdata + 1;
                        }

                        // Don't set paren if already set later on
                        if (getParenStart(opdata) == -1)
                        {
                            setParenStart(opdata, idx);
                        }
                    }
                    return idxNew;

                case OP_CLOSE:

                    // Done matching subexpression
                    if ((program.flags & REProgram.OPT_HASBACKREFS) != 0)
                    {
                        endBackref[opdata] = idx;
                    }
                    if ((idxNew = matchNodes(next, maxNode, idx)) != -1)
                    {
                        // Increase valid paren count
                        if ((opdata + 1) > parenCount)
                        {
                            parenCount = opdata + 1;
                        }

                        // Don't set paren if already set later on
                        if (getParenEnd(opdata) == -1)
                        {
                            setParenEnd(opdata, idx);
                        }
                    }
                    return idxNew;

                case OP_OPEN_CLUSTER:
                case OP_CLOSE_CLUSTER:
                    // starting or ending the matching of a subexpression which has no backref.
                    return matchNodes( next, maxNode, idx );

                case OP_BACKREF:
                    {
                        // Get the start and end of the backref
                        int s = startBackref[opdata];
                        int e = endBackref[opdata];

                        // We don't know the backref yet
                        if (s == -1 || e == -1)
                        {
                            return -1;
                        }

                        // The backref is empty size
                        if (s == e)
                        {
                            break;
                        }

                        // Get the length of the backref
                        int l = e - s;

                        // If there's not enough input left, give up.
                        if (search.isEnd(idx + l - 1))
                        {
                            return -1;
                        }

                        // Case fold the backref?
                        final boolean caseFold =
                            ((matchFlags & MATCH_CASEINDEPENDENT) != 0);
                        // Compare backref to input
                        for (int i = 0; i < l; i++)
                        {
                            if (compareChars(search.charAt(idx++), search.charAt(s + i), caseFold) != 0)
                            {
                                return -1;
                            }
                        }
                    }
                    break;

                case OP_BOL:

                    // Fail if we're not at the start of the string
                    if (idx != 0)
                    {
                        // If we're multiline matching, we could still be at the start of a line
                        if ((matchFlags & MATCH_MULTILINE) == MATCH_MULTILINE)
                        {
                            // If not at start of line, give up
                            if (idx <= 0 || !isNewline(idx - 1)) {
                                return -1;
                            } else {
                                break;
                            }
                        }
                        return -1;
                    }
                    break;

                case OP_EOL:

                    // If we're not at the end of string
                    if (!search.isEnd(0) && !search.isEnd(idx))
                    {
                        // If we're multi-line matching
                        if ((matchFlags & MATCH_MULTILINE) == MATCH_MULTILINE)
                        {
                            // Give up if we're not at the end of a line
                            if (!isNewline(idx)) {
                                return -1;
                            } else {
                                break;
                            }
                        }
                        return -1;
                    }
                    break;

                case OP_ESCAPE:

                    // Which escape?
                    switch (opdata)
                    {
                        // Word boundary match
                        case E_NBOUND:
                        case E_BOUND:
                            {
                                char cLast = ((idx == 0) ? '\n" : search.charAt(idx - 1));
                                char cNext = ((search.isEnd(idx)) ? '\n" : search.charAt(idx));
                                if ((Character.isLetterOrDigit(cLast) == Character.isLetterOrDigit(cNext)) == (opdata == E_BOUND))
                                {
                                    return -1;
                                }
                            }
                            break;

                        // Alpha-numeric, digit, space, javaLetter, javaLetterOrDigit
                        case E_ALNUM:
                        case E_NALNUM:
                        case E_DIGIT:
                        case E_NDIGIT:
                        case E_SPACE:
                        case E_NSPACE:

                            // Give up if out of input
                            if (search.isEnd(idx))
                            {
                                return -1;
                            }

                            char c = search.charAt(idx);

                            // Switch on escape
                            switch (opdata)
                            {
                                case E_ALNUM:
                                case E_NALNUM:
                                    if (!((Character.isLetterOrDigit(c) || c == '_") == (opdata == E_ALNUM)))
                                    {
                                        return -1;
                                    }
                                    break;

                                case E_DIGIT:
                                case E_NDIGIT:
                                    if (!(Character.isDigit(c) == (opdata == E_DIGIT)))
                                    {
                                        return -1;
                                    }
                                    break;

                                case E_SPACE:
                                case E_NSPACE:
                                    if (!(Character.isWhitespace(c) == (opdata == E_SPACE)))
                                    {
                                        return -1;
                                    }
                                    break;
                            }
                            idx++;
                            break;

                        default:
                            internalError("Unrecognized escape '" + opdata + "'");
                    }
                    break;

                case OP_ANY:

                    if ((matchFlags & MATCH_SINGLELINE) == MATCH_SINGLELINE) {
                        // Match anything
                        if (search.isEnd(idx))
                        {
                            return -1;
                        }
                    }
                    else
                    {
                        // Match anything but a newline
                        if (search.isEnd(idx) || isNewline(idx))
                        {
                            return -1;
                        }
                    }
                    idx++;
                    break;

                case OP_ATOM:
                    {
                        // Match an atom value
                        if (search.isEnd(idx))
                        {
                            return -1;
                        }

                        // Get length of atom and starting index
                        int lenAtom = opdata;
                        int startAtom = node + nodeSize;

                        // Give up if not enough input remains to have a match
                        if (search.isEnd(lenAtom + idx - 1))
                        {
                            return -1;
                        }

                        // Match atom differently depending on casefolding flag
                        final boolean caseFold =
                            ((matchFlags & MATCH_CASEINDEPENDENT) != 0);

                        for (int i = 0; i < lenAtom; i++)
                        {
                            if (compareChars(search.charAt(idx++), instruction[startAtom + i], caseFold) != 0)
                            {
                                return -1;
                            }
                        }
                    }
                    break;

                case OP_POSIXCLASS:
                    {
                        // Out of input?
                        if (search.isEnd(idx))
                        {
                            return -1;
                        }

                        switch (opdata)
                        {
                            case POSIX_CLASS_ALNUM:
                                if (!Character.isLetterOrDigit(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_ALPHA:
                                if (!Character.isLetter(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_DIGIT:
                                if (!Character.isDigit(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_BLANK: // JWL - bugbug: is this right??
                                if (!Character.isSpaceChar(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_SPACE:
                                if (!Character.isWhitespace(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_CNTRL:
                                if (Character.getType(search.charAt(idx)) != Character.CONTROL)
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_GRAPH: // JWL - bugbug???
                                switch (Character.getType(search.charAt(idx)))
                                {
                                    case Character.MATH_SYMBOL:
                                    case Character.CURRENCY_SYMBOL:
                                    case Character.MODIFIER_SYMBOL:
                                    case Character.OTHER_SYMBOL:
                                        break;

                                    default:
                                        return -1;
                                }
                                break;

                            case POSIX_CLASS_LOWER:
                                if (Character.getType(search.charAt(idx)) != Character.LOWERCASE_LETTER)
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_UPPER:
                                if (Character.getType(search.charAt(idx)) != Character.UPPERCASE_LETTER)
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_PRINT:
                                if (Character.getType(search.charAt(idx)) == Character.CONTROL)
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_PUNCT:
                            {
                                int type = Character.getType(search.charAt(idx));
                                switch(type)
                                {
                                    case Character.DASH_PUNCTUATION:
                                    case Character.START_PUNCTUATION:
                                    case Character.END_PUNCTUATION:
                                    case Character.CONNECTOR_PUNCTUATION:
                                    case Character.OTHER_PUNCTUATION:
                                        break;

                                    default:
                                        return -1;
                                }
                            }
                            break;

                            case POSIX_CLASS_XDIGIT: // JWL - bugbug??
                            {
                                boolean isXDigit = ((search.charAt(idx) >= '0" && search.charAt(idx) <= '9") ||
                                                    (search.charAt(idx) >= 'a" && search.charAt(idx) <= 'f") ||
                                                    (search.charAt(idx) >= 'A" && search.charAt(idx) <= 'F"));
                                if (!isXDigit)
                                {
                                    return -1;
                                }
                            }
                            break;

                            case POSIX_CLASS_JSTART:
                                if (!Character.isJavaIdentifierStart(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            case POSIX_CLASS_JPART:
                                if (!Character.isJavaIdentifierPart(search.charAt(idx)))
                                {
                                    return -1;
                                }
                                break;

                            default:
                                internalError("Bad posix class");
                                break;
                        }

                        // Matched.
                        idx++;
                    }
                    break;

                case OP_ANYOF:
                    {
                        // Out of input?
                        if (search.isEnd(idx))
                        {
                            return -1;
                        }

                        // Get character to match against character class and maybe casefold
                        char c = search.charAt(idx);
                        boolean caseFold = (matchFlags & MATCH_CASEINDEPENDENT) != 0;
                        // Loop through character class checking our match character
                        int idxRange = node + nodeSize;
                        int idxEnd = idxRange + (opdata * 2);
                        boolean match = false;
                        for (int i = idxRange; !match && i < idxEnd; )
                        {
                            // Get start, end and match characters
                            char s = instruction[i++];
                            char e = instruction[i++];

                            match = ((compareChars(c, s, caseFold) >= 0)
                                     && (compareChars(c, e, caseFold) <= 0));
                        }

                        // Fail if we didn't match the character class
                        if (!match)
                        {
                            return -1;
                        }
                        idx++;
                    }
                    break;

                case OP_BRANCH:
                {
                    // Check for choices
                    if (instruction[next + offsetOpcode] != OP_BRANCH)
                    {
                        // If there aren't any other choices, just evaluate this branch.
                        node += nodeSize;
                        continue;
                    }

                    // Try all available branches
                    short nextBranch;
                    do
                    {
                        // Try matching the branch against the string
                        if ((idxNew = matchNodes(node + nodeSize, maxNode, idx)) != -1)
                        {
                            return idxNew;
                        }

                        // Go to next branch (if any)
                        nextBranch = (short)instruction[node + offsetNext];
                        node += nextBranch;
                    }
                    while (nextBranch != 0 && (instruction[node + offsetOpcode] == OP_BRANCH));

                    // Failed to match any branch!
                    return -1;
                }

                case OP_NOTHING:
                case OP_GOTO:

                    // Just advance to the next node without doing anything
                    break;

                case OP_END:

                    // Match has succeeded!
                    setParenEnd(0, idx);
                    return idx;

                default:

                    // Corrupt program
                    internalError("Invalid opcode '" + opcode + "'");
            }

            // Advance to the next node in the program
            node = next;
        }

        // We "should" never end up here
        internalError("Corrupt program");
        return -1;
    
public voidsetMatchFlags(int matchFlags)
Sets match behaviour flags which alter the way RE does matching.

param
matchFlags One or more of the RE match behaviour flags (RE.MATCH_*):
MATCH_NORMAL // Normal (case-sensitive) matching
MATCH_CASEINDEPENDENT // Case folded comparisons
MATCH_MULTILINE // Newline matches as BOL/EOL

        this.matchFlags = matchFlags;
    
protected final voidsetParenEnd(int which, int i)
Sets the end of a paren level

param
which Which paren level
param
i Index in input array

        if (which < parenCount)
        {
            switch (which)
            {
                case 0:
                    end0 = i;
                    break;

                case 1:
                    end1 = i;
                    break;

                case 2:
                    end2 = i;
                    break;

                default:
                    if (endn == null)
                    {
                        allocParens();
                    }
                    endn[which] = i;
                    break;
            }
        }
    
protected final voidsetParenStart(int which, int i)
Sets the start of a paren level

param
which Which paren level
param
i Index in input array

        if (which < parenCount)
        {
            switch (which)
            {
                case 0:
                    start0 = i;
                    break;

                case 1:
                    start1 = i;
                    break;

                case 2:
                    start2 = i;
                    break;

                default:
                    if (startn == null)
                    {
                        allocParens();
                    }
                    startn[which] = i;
                    break;
            }
        }
    
public voidsetProgram(REProgram program)
Sets the current regular expression program used by this matcher object.

param
program Regular expression program compiled by RECompiler.
see
RECompiler
see
REProgram
see
recompile

        this.program = program;
        if (program != null && program.maxParens != -1) {
            this.maxParen = program.maxParens;
        } else {
            this.maxParen = MAX_PAREN;
        }
    
public static java.lang.StringsimplePatternToFullRegularExpression(java.lang.String pattern)
Converts a 'simplified' regular expression to a full regular expression

param
pattern The pattern to convert
return
The full regular expression

        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < pattern.length(); i++)
        {
            char c = pattern.charAt(i);
            switch (c)
            {
                case '*":
                    buf.append(".*");
                    break;

                case '.":
                case '[":
                case ']":
                case '\\":
                case '+":
                case '?":
                case '{":
                case '}":
                case '$":
                case '^":
                case '|":
                case '(":
                case ')":
                    buf.append('\\");
                default:
                    buf.append(c);
                    break;
            }
        }
        return buf.toString();
    
public java.lang.String[]split(java.lang.String s)
Splits a string into an array of strings on regular expression boundaries. This function works the same way as the Perl function of the same name. Given a regular expression of "[ab]+" and a string to split of "xyzzyababbayyzabbbab123", the result would be the array of Strings "[xyzzy, yyz, 123]".

Please note that the first string in the resulting array may be an empty string. This happens when the very first character of input string is matched by the pattern.

param
s String to split on this regular exression
return
Array of strings

        // Create new vector
        Vector v = new Vector();

        // Start at position 0 and search the whole string
        int pos = 0;
        int len = s.length();

        // Try a match at each position
        while (pos < len && match(s, pos))
        {
            // Get start of match
            int start = getParenStart(0);

            // Get end of match
            int newpos = getParenEnd(0);

            // Check if no progress was made
            if (newpos == pos)
            {
                v.addElement(s.substring(pos, start + 1));
                newpos++;
            }
            else
            {
                v.addElement(s.substring(pos, start));
            }

            // Move to new position
            pos = newpos;
        }

        // Push remainder if it's not empty
        String remainder = s.substring(pos);
        if (remainder.length() != 0)
        {
            v.addElement(remainder);
        }

        // Return vector as an array of strings
        String[] ret = new String[v.size()];
        v.copyInto(ret);
        return ret;
    
public java.lang.Stringsubst(java.lang.String substituteIn, java.lang.String substitution)
Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".

param
substituteIn String to substitute within
param
substitution String to substitute for all matches of this regular expression.
return
The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).


                                                                                                             
         
    
        return subst(substituteIn, substitution, REPLACE_ALL);
    
public java.lang.Stringsubst(java.lang.String substituteIn, java.lang.String substitution, int flags)
Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", a String to substituteIn of "aaaabfooaaabgarplyaaabwackyb" and the substitution String "-", the resulting String returned by subst would be "-foo-garply-wacky-".

It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. A regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", a String to substituteIn of "visit us: http://www.apache.org!" and the substitution String "<a href=\"$0\">$0</a>", the resulting String returned by subst would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".

Note: $0 represents the whole match.

param
substituteIn String to substitute within
param
substitution String to substitute for matches of this regular expression
param
flags One or more bitwise flags from REPLACE_*. If the REPLACE_FIRSTONLY flag bit is set, only the first occurrence of this regular expression is replaced. If the bit is not set (REPLACE_ALL), all occurrences of this pattern will be replaced. If the flag REPLACE_BACKREFERENCES is set, all backreferences will be processed.
return
The string substituteIn with zero or more occurrences of the current regular expression replaced with the substitution String (if this regular expression object doesn't match at any position, the original String is returned unchanged).

        // String to return
        StringBuffer ret = new StringBuffer();

        // Start at position 0 and search the whole string
        int pos = 0;
        int len = substituteIn.length();

        // Try a match at each position
        while (pos < len && match(substituteIn, pos))
        {
            // Append string before match
            ret.append(substituteIn.substring(pos, getParenStart(0)));

            if ((flags & REPLACE_BACKREFERENCES) != 0)
            {
                // Process backreferences
                int lCurrentPosition = 0;
                int lLastPosition = -2;
                int lLength = substitution.length();
                boolean bAddedPrefix = false;

                while ((lCurrentPosition = substitution.indexOf("$", lCurrentPosition)) >= 0)
                {
                    if ((lCurrentPosition == 0 || substitution.charAt(lCurrentPosition - 1) != '\\")
                        && lCurrentPosition+1 < lLength)
                    {
                        char c = substitution.charAt(lCurrentPosition + 1);
                        if (c >= '0" && c <= '9")
                        {
                            if (bAddedPrefix == false)
                            {
                                // Append everything between the beginning of the
                                // substitution string and the current $ sign
                                ret.append(substitution.substring(0, lCurrentPosition));
                                bAddedPrefix = true;
                            }
                            else
                            {
                                // Append everything between the last and the current $ sign
                                ret.append(substitution.substring(lLastPosition + 2, lCurrentPosition));
                            }

                            // Append the parenthesized expression
                            // Note: if a parenthesized expression of the requested
                            // index is not available "null" is added to the string
                            ret.append(getParen(c - '0"));
                            lLastPosition = lCurrentPosition;
                        }
                    }

                    // Move forward, skipping past match
                    lCurrentPosition++;
                }

                // Append everything after the last $ sign
                ret.append(substitution.substring(lLastPosition + 2, lLength));
            }
            else
            {
                // Append substitution without processing backreferences
                ret.append(substitution);
            }

            // Move forward, skipping past match
            int newpos = getParenEnd(0);

            // We always want to make progress!
            if (newpos == pos)
            {
                newpos++;
            }

            // Try new position
            pos = newpos;

            // Break out if we're only supposed to replace one occurrence
            if ((flags & REPLACE_FIRSTONLY) != 0)
            {
                break;
            }
        }

        // If there's remaining input, append it
        if (pos < len)
        {
            ret.append(substituteIn.substring(pos));
        }

        // Return string buffer as string
        return ret.toString();