This post does not introduce regex or cover its basics. Instead, we’ll discuss one possible strategy for constructing and understanding lengthy regular expressions, so it is assumed that the reader is somewhat familiar with regex.
Introduction
As a software engineer, sooner rather than later you stumble upon regular expressions. At first glance, these might look a little intimidating or even nonsensical. One could say that the entangled sequence of characters looks like Egyptian hieroglyphs. However, if one understands and masters regex, one might become the Lord of the Rings in the string-searching universe. 🧙‍♂️
Let’s get into it
Why the title ‘Regex that matches the entire universe’, you may wonder? Well, this post is about tackling very lengthy regular expressions, such as the one below.
([\s]{3,10})(\w)+(\s)+(([0-9]{1,2}((\–|\—|-)([0-9]{1,2}(:[0-
9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|:[0-9]{1,2}([A-Z]{1,2}(\–|\—|
-)[0-9]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})|(\–|\—|-)[0-9
]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|[A-Z]{1,2}(\–|\—|-
)[0-9]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})))|Collaboratio
n|Silver)(\s\s|(([0-9]{1,2}((\–|\—|-)([0-9]{1,2}(:[0-9]{1,2}
[A-Z]{1,2}|[A-Z]{1,2}))|:[0-9]{1,2}([A-Z]{1,2}(\–|\—|-)[0-9]
{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})|(\–|\—|-)[0-9]{1,2}(
:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|[A-Z]{1,2}(\–|\—|-)[0-9]{
1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})))|Collaboration|Silve
r)\s\s)(\w)+(\s{1})(\w)+(\s{1})(\w)+(\s{1})(\w)+(\s{1})(\w)+
The expression above might seem a little tricky, right? If you had to construct or understand something similar, how would you go about it?
Problems can be tackled in various ways, so the solution proposed here is just one way to construct or understand lengthy regular expressions.
Solution
If you have to build a lengthy regex that matches a specific pattern, you may want to construct a tree-like structure that lets you quickly identify all possible cases. Then it’s just a matter of carefully translating each branch into regex. Let’s take an easy example that illustrates this.
Let’s say you want to identify time intervals in a text.
You know that these intervals have the following format: hours-hours.
For example:
12-5PM
2-5PM
5PM-1AM
etc.
Construct the tree-like structure
This will help you identify all possible cases.
number(s)
├── -
│   └── number(s)
│       └── xM
└── xM
    └── -
        └── number(s)
            └── xM
After constructing the above tree-like structure, you can easily check whether you’ve taken all cases into account. For example, we can see that 2-5PM can be obtained by following:
number(s)
→ -
→ number(s)
→ xM
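The case check described above can also be done mechanically. The following is a minimal Python sketch, assuming the tree is written down as nested dictionaries; the names `TREE` and `enumerate_paths` are hypothetical and only serve the illustration. It lists every root-to-leaf path, and each path corresponds to one alternative the regex has to cover.

```python
# Hypothetical representation of the tree: each key is a node label and
# each value holds that node's children (an empty dict marks a leaf).
TREE = {
    "number(s)": {
        "-": {"number(s)": {"xM": {}}},
        "xM": {"-": {"number(s)": {"xM": {}}}},
    }
}


def enumerate_paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of node labels."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            # Inner node: keep walking down each branch.
            yield from enumerate_paths(children, path)
        else:
            # Leaf: the path describes one complete case.
            yield path


paths = list(enumerate_paths(TREE))
for path in paths:
    print(" ".join(path))
```

For this tree the sketch prints two paths: one for intervals such as 2-5PM and one for intervals such as 5PM-1AM.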
Construct the regular expression
For this particular example, one possible regex is:
([0-9]{1,2})          # number(s)
(
    (-)               # -
    ([0-9]{1,2})      # number(s)
    ([A-Z]{2})        # xM
  |
    ([A-Z]{2})        # xM
    (-)               # -
    ([0-9]{1,2})      # number(s)
    ([A-Z]{2})        # xM
)
In this particular example, we assume no intervals that contain minutes, e.g. ‘12:30-2PM’ is not taken into account.
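The commented layout above can be kept almost verbatim in code. Here is a small sketch, assuming Python and its standard `re` module: with the `re.VERBOSE` flag, whitespace in the pattern is ignored and `#` starts a comment, so the tree annotations stay next to the parts they describe. The constant names are hypothetical.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats "#" as the
# start of a comment, so the annotated layout can be used as-is.
INTERVAL_VERBOSE = re.compile(
    r"""
    ([0-9]{1,2})          # number(s)
    (
        (-)               # -
        ([0-9]{1,2})      # number(s)
        ([A-Z]{2})        # xM
      |
        ([A-Z]{2})        # xM
        (-)               # -
        ([0-9]{1,2})      # number(s)
        ([A-Z]{2})        # xM
    )
    """,
    re.VERBOSE,
)

# The same expression written on a single line, for comparison.
INTERVAL_COMPACT = re.compile(
    r"([0-9]{1,2})((-)([0-9]{1,2})([A-Z]{2})|([A-Z]{2})(-)([0-9]{1,2})([A-Z]{2}))"
)

for sample in ["12-5PM", "2-5PM", "5PM-1AM"]:
    verbose_match = INTERVAL_VERBOSE.fullmatch(sample)
    compact_match = INTERVAL_COMPACT.fullmatch(sample)
    # Both forms should capture exactly the same groups.
    print(sample, verbose_match.groups() == compact_match.groups())
```

Keeping the verbose form in the source makes it easier to compare the regex against the tree later, branch by branch.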
Final step
Test your expression with various examples. Copy the expression to the clipboard and start testing.
([0-9]{1,2})((-)([0-9]{1,2})([A-Z]{2})|([A-Z]{2})(-)([0-9]{1,2})([A-Z]{2}))
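One way to run such tests is a short script. This is a minimal sketch, assuming Python's standard `re` module; the helper `is_interval` and the two sample lists are hypothetical names, and the sample strings are the ones used earlier in this post.

```python
import re

INTERVAL = re.compile(
    r"([0-9]{1,2})((-)([0-9]{1,2})([A-Z]{2})|([A-Z]{2})(-)([0-9]{1,2})([A-Z]{2}))"
)

# Sample strings taken from the examples in this post.
SHOULD_MATCH = ["12-5PM", "2-5PM", "5PM-1AM"]
# Minutes are out of scope for this expression, so this should be rejected.
SHOULD_NOT_MATCH = ["12:30-2PM"]


def is_interval(text):
    """Return True when the whole string is a time interval."""
    return INTERVAL.fullmatch(text) is not None


for text in SHOULD_MATCH:
    assert is_interval(text), f"expected a match for {text!r}"
for text in SHOULD_NOT_MATCH:
    assert not is_interval(text), f"expected no match for {text!r}"

# search() finds an interval embedded in a longer text.
found = INTERVAL.search("Office hours: 5PM-1AM on weekdays")
print(found.group(0))  # prints 5PM-1AM
```

One caveat worth testing for: `fullmatch` rejects ‘12:30-2PM’, but `search` on the same string would still find the substring 30-2PM, so how the expression is applied matters as much as the expression itself.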
Conclusion
There are various ways to solve a problem that involves regex. However, once the regex gets lengthy, you need a strategy that helps reduce possible mistakes. One such strategy is to construct a tree-like structure that takes all possible cases into account.
See you in the next one!