MySQL Regular Expression Searches

Revision as of 19:52, 16 October 2007 by Neil (Talk | contribs) (Handling Special Characters)

Revision as of 19:52, 16 October 2007 by Neil (Talk | contribs) (Handling Special Characters)

In previous chapters we have looked at some basic data retrieval filtering techniques using comparison operators and wildcards. Whilst this approaches to data filtering are fine for basic searching, more advanced searching quickly becomes cumbersome using these techniques.

In this chapter we will look at yet another filtering and searching technique supported by MySQL called regular expressions.

What are Regular Expressions?

The term regular expressions is one of those technical terms where the words do very little to explain what the technology does. regular expressions are a feature common to many programming languages and are a topic on which entire can, and indeed, have been written. Fortunately (or unfortunately depending on your perspective), MySQL does not provide as extensive support for regualr expressions as some other programming languages. This is good in that it makes the subject area easier to learn, but may be frustrating if you are already proficient with the rich regular expression syntax available in other languages.

Regular expressions are essentially a set of instructions using a predefined syntax for matching text in a variety of flexible ways. For example, you might want to extract all the occurrences of a particular word sequence from a block of text. You might also want to perform a search for a particular piece of text and replace it with some alternate text.

Regular Expression Character Matching

In order to introduce the REGEXP operator, we will begin by looking at a use of regular expressions that could similarly be used with the LIKE operator. As in the previous chapter we need to retrieve rows from a table taking into consideration the difference in spelling of the color gray (grey). To perform this retrieval we will use the regex dot character matching (.). Rather like the LIKE underscore wildcard, this character indicates that any character in this position in the text will be considered a match. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr.y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.05 sec)

So far we haven't done anything that could not have been achieved using wildcards. Regular expressions, however, do not stop here.


Matching from a Group of Characters

One problem with the approach outlined above is that any letter between the 'Gr' and the 'y' would have registered as a match. In reality we are only interested in words that contain either an 'a' or an 'e' in that location. Fortunately, regular expressions allow us to specify a group of acceptable character matches for any character position. The syntax for this requires that the characters be places in square brackets at the desired location in the match text. For example:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'Gr[ae]y Computer Case';
+-----------+--------------------+-------------------+
| prod_code | prod_name          | prod_desc         |
+-----------+--------------------+-------------------+
|        11 | Grey Computer Case | ATX PC CASE       |
|        12 | Gray Computer Case | ATX PC CASE (USA) |
+-----------+--------------------+-------------------+
2 rows in set (0.00 sec)

Use of this syntax ensures that only the words 'Grey' and 'Gray will match the search criteria. There is no limit to the number of characters that can be grouped in the brackets when using this filtering technique.

Matching from a Range of Characters

The character group matching syntax can be extended to cover range of characters. For example, instead of declaring a regular expression to cover the letters between A and F as [ABCDEF] we could simply specify a range of characters using the '-' character between the upper and lower ranges [A-F]. We could, for example, list a product based on certian model numbers which begin with numbers ranging from 1 to 6:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [1-6]543';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
|        15 | CD-RW Model 6543 | CD Writer |
|        16 | CD-RW Model 2543 | CD Writer |
+-----------+------------------+-----------+
4 rows in set (0.00 sec)

Handling Special Characters

As you have seen, regular expressions assign special meaning to particular characters. For example the dot (.) and square brackets ([]) all have special meaning. Those who studying critical thinking at college will already be questioning what to do if the character sequence that is the subject of a search contains one or more of these characters. Obviously if you are are looking for the following text that looks like a regular expression, the text for which you want to search is, itself, going to be viewed as regular expression syntax.

To address this issue, a concept known as escaping is used. In SQL, escaping involves preceding any characters that may be mis-interpreted as a regular expression special character with double back slashes (\\). For example, suppose we have a row in our product table which reads as follows:

+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+

If we were to search for this without regard to the fact that the prod_name value contains regular expression special characters we will not get what we wanted:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model [7543]';
+-----------+------------------+-----------+
| prod_code | prod_name        | prod_desc |
+-----------+------------------+-----------+
|         1 | CD-RW Model 4543 | CD Writer |
|        14 | CD-RW Model 5543 | CD Writer |
+-----------+------------------+-----------+
2 rows in set (0.00 sec)

The cause of the problem is that the regular expression has been interpreted as a search for any entries that read 'CD-RW Model' followed by either a 7, 5, 4 or 3. Clearly, what we wanted was the actual text [7543]. If instead we escape the brackets with the \\ escape sequence we get the result we want:

mysql> SELECT * FROM product WHERE prod_name REGEXP 'CD-RW Model \\[7543\\]';
+-----------+--------------------+-----------+
| prod_code | prod_name          | prod_desc |
+-----------+--------------------+-----------+
|        17 | CD-RW Model [7543] | CD Writer |
+-----------+--------------------+-----------+
1 row in set (0.00 sec)

Regular Expressions and Whitespace Characters

Regular expression syntax also provides a mechanism to reference whitespace characters such as tabs, carriage returns and line feeds. These are referenced in a regular expression using metacharacters. These metacharacters are outlined in the following table:

MetacharacterDescription
\\nNew line (line feed)
\\fForm feed
\\tTab
\\rCarriage return
\\vVertical tab


==