Don’t Have Two Problems
Jamie Zawinski once posted this on a message board: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” He wasn’t the first to use this quote, but his use of it seems to be the most famous. Jamie was lamenting the misuse of regular expressions in Perl, a language in which they are commonly used. Regular expressions, or regexes, have gained this notoriety because although they are powerful, they are difficult to read and easy to use incorrectly, which can lead to logic errors and exponential running times.
According to the Medium story Everything you need to know about Regular Expressions, regexes are commonly used for verifying the structure of strings, selecting substrings, searching and replacing, and splitting a string into an array. Regexes are their own language and take a considerable amount of time to learn. I think it is important to learn them because they are valuable when working with strings and text. The blog post Regular Expressions: Now You Have Two Problems (codinghorror.com) likens them to hot sauce: you have to use them sparingly. That post also points out that you don’t have to use one big regex. You can use a series of smaller regexes and it will be just fine.
In my last project, I used regexes to validate the structure of a new password. I used a separate regex for each part of the password so I could see which part was missing. In the text-based video poker program I wrote in 2018, I used this regex to determine which cards the player wanted to hold or draw: (h|d){5}. Matched against the whole input, it accepts exactly 5 characters, each either h or d. The alternative would have been a loop or a higher-order function that checks the same thing.
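Here is a minimal Python sketch of both ideas: a series of small regexes for a password, and the hold/draw pattern. The password rules, the names PASSWORD_RULES, missing_password_parts, and is_valid_hold_draw are all made up for illustration, since the actual rules from the project are not listed here.

```python
import re

# Hypothetical rules: one small regex per requirement, so a failure
# can be reported by name instead of as a single yes/no answer.
PASSWORD_RULES = {
    "at least 8 characters": re.compile(r".{8,}"),
    "a lowercase letter": re.compile(r"[a-z]"),
    "an uppercase letter": re.compile(r"[A-Z]"),
    "a digit": re.compile(r"[0-9]"),
}


def missing_password_parts(password):
    """Return the description of every rule the password does not satisfy."""
    return [
        description
        for description, pattern in PASSWORD_RULES.items()
        if pattern.search(password) is None
    ]


# The hold/draw pattern from the video poker example.
HOLD_DRAW = re.compile(r"(h|d){5}")


def is_valid_hold_draw(choice):
    """True only if the whole input is exactly 5 characters, each h or d."""
    # fullmatch anchors the pattern to the entire string, so a sixth
    # character or a stray letter is rejected.
    return HOLD_DRAW.fullmatch(choice) is not None
```

Calling missing_password_parts("abc") returns the names of the three rules that fail, which is the kind of feedback a single combined regex cannot give. Using fullmatch (or adding ^ and $) matters for the poker pattern, because a plain search would also accept a longer string that merely contains five valid characters.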
The most important rule of using a regex, besides matching and rejecting correctly, is that it should never cause catastrophic backtracking. This is when a regex takes a number of steps that grows exponentially with the length of the input. Suppose you wanted to match the following: two or more lowercase e’s followed by a lowercase k and then an exclamation point (!). You might write a regex like this: /(e+e+)+k!/ If you use this regex to match “eeeeeeeeeeeeeeeeee!” (18 e’s followed by an exclamation point), it takes RegexBuddy 655,538 steps to find out the k is missing. The nested quantifiers are the cause: the engine tries every way of dividing the e’s among the groups before it gives up. Refactor the regex as /e{2,}k!/ and it terminates without a match in 34 steps on the same input.
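To see the difference for yourself, a rough sketch like the following times both patterns on inputs that are missing the k. It uses Python’s re module rather than RegexBuddy, so it measures wall-clock time instead of steps, and the helper name time_failed_search is invented for this example. Actual timings will vary by machine.

```python
import re
import time

NESTED = re.compile(r"(e+e+)+k!")  # nested quantifiers: prone to backtracking
FLAT = re.compile(r"e{2,}k!")      # same intent, a single quantifier


def time_failed_search(pattern, n):
    """Search n e's followed by '!' (no k) and return (match, seconds)."""
    subject = "e" * n + "!"
    start = time.perf_counter()
    result = pattern.search(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Keep n modest: the nested pattern's work grows very quickly with n.
    for n in (10, 14, 18):
        _, nested_time = time_failed_search(NESTED, n)
        _, flat_time = time_failed_search(FLAT, n)
        print(f"n={n}: nested {nested_time:.6f}s, flat {flat_time:.6f}s")
```

Both patterns accept and reject the same strings, so the refactor changes only how much work a failed match costs.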
Another potential issue with regexes is that you might need additional processing on the string to validate it if you want to avoid an overly complex regex. Let’s say you wanted to match an IPv4 address. You might write something like this: /([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})/ Unfortunately, this would match 111.222.333.444, which is not valid because each part of an IPv4 address must be between 0 and 255. Credit to Everything you need to know about Regular Expressions for this example. The same problem arises with a regex for validating North American phone numbers that I found on Stack Overflow: /^(\([0-9]{3}\) |[0-9]{3}-)[0-9]{3}-[0-9]{4}$/ This matches 911-911-0000, which is not a valid phone number. As stated before, regexes are useful for validating the structure of a string. To validate the content, you need capturing groups plus parsing logic, a more complex regex, or both.
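One way to combine the two, sketched here in Python, is to let the simple regex check the shape and capture the four parts, then check the numeric range in ordinary code. The function name is_valid_ipv4 is just for illustration, and this sketch deliberately accepts leading zeros such as 001, which a stricter validator might reject.

```python
import re

# Structure only: four groups of 1 to 3 digits separated by dots.
IPV4_SHAPE = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")


def is_valid_ipv4(text):
    """Check the shape with a regex, then the content with plain code."""
    match = IPV4_SHAPE.fullmatch(text)
    if match is None:
        return False  # wrong structure, e.g. too few parts or letters
    # Content check the regex does not do: every part must be 0..255.
    return all(int(part) <= 255 for part in match.groups())
```

With this split, 111.222.333.444 passes the structural regex but fails the range check, while the regex itself stays short enough to read at a glance.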
Regular expressions are a valuable tool for any software developer, even if they can be difficult to read. They are not the most appropriate tool in every situation, but used in the right places they are better than loops, and they can be very efficient when properly constructed. If you are not familiar with them, you should learn how to use them. Then you won’t have two problems.