Creating more readable regular expressions with Simple Regex Language
Clear-Sighted
Regular expressions are a powerful tool, but they can also be very hard to digest. The Simple Regex Language lets you write regular expressions in natural language.
Regular expressions are a fundamental feature of Linux – and many other modern operating systems. A regular expression is a search term with special placeholders representing several possible characters at the same time. The concept of a regular expression is an extension of the idea behind the "wildcard" character used in many GUI search tools, but the power and subtlety of regular expressions far exceeds what you can do with a simple wildcard.
For example, suppose you want to search the system.log file for errors, but you don't know whether the term Error will appear with initial cap or all lowercase (Error or error). You could use a regular expression as part of the Grep command:
grep -e '[eE]rror' system.log
The expression [eE] means: There is either a lowercase e or uppercase E.
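The same character class behaves identically in most regex engines. The following is a minimal sketch in Python, assuming a handful of invented log lines (they are made up for illustration and do not come from a real system.log):

```python
import re

# The character class [eE] accepts either a lowercase or an uppercase "e".
pattern = re.compile(r"[eE]rror")

# Hypothetical log lines, invented for this example.
lines = [
    "Error: disk full",
    "kernel: I/O error on device sda",
    "ERROR: all caps is not covered",
    "everything is fine",
]

# Keep only the lines that contain "Error" or "error" somewhere.
matching = [line for line in lines if pattern.search(line)]
for line in matching:
    print(line)
```

Note that the all-caps ERROR line is not matched, because only the first letter is allowed to vary.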
A quick check for capitalization is easy to read and interpret, but some regular expressions are much more exotic. Who is able to say right away what text the following expression describes:
/^(?:\w|[\.\-\+])+(?:@)(?:[a-z]|[0-9]|[\.\-])+(?:\.)[a-z]{2,}$/i
Once you derive an expression like this, it can be a powerful tool for a script or a string search tool like Grep. However, for the human who created the expression, and for the other humans who come along later and want to read it, decoding a regular expression can be a time-consuming endeavor. What is more, a small error that creeps into the expression can be difficult to spot, even though it might have a significant effect on the search result. If the expression is used to validate input, such an error could even open the door to malicious code and an Internet attack.
The fledgling Simple Regex Language (SRL, [1]) from the developer Karim Geiger aims to address the problem of incomprehensibility in regular expressions. Geiger started SRL as a bit of fun in Fall 2016, and since then, other developers have helped to implement SRL in various programming languages.
The SRL allows you to write regular expressions in natural English. In the previous example of the logfile, the two words Error and error start with either E or e. In SRL, you could say:
one of "eE"
and follow it with the character string rror:
one of "eE" literally "rror"
This line forms a complete expression in the SRL. SRL keywords are not case-sensitive, so LITERALLY is the same as literally. Literal strings, however, are case-sensitive: literally "Error" means something completely different from literally "error".
In SRL, the developer can frame strings (rror in the example) with single or double quotes. You have the option of separating the individual components of the complete expression with a comma or a line break. Adding a comma or a break does not change the logic but simply improves the legibility:
one of "eE", literally "rror"
The example expression matches all text passages where the character strings error or Error appear. Hence the word Terrorism would also be a valid match.
Empty Words
To match the word only when it stands on its own, require whitespace on both sides:
whitespace one of "eE" literally "rror" whitespace
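The difference between the unanchored expression and the whitespace-delimited one can be checked in any regex engine. The sketch below uses Python; the two patterns are hand-written equivalents of the SRL expressions (assuming that the SRL keyword whitespace corresponds to \s), not output copied from the SRL tool.

```python
import re

# Hand-written equivalents of the two SRL expressions (an assumption,
# not generated by SRL itself).
unanchored = re.compile(r"[eE]rror")      # one of "eE" literally "rror"
delimited = re.compile(r"\s[eE]rror\s")   # whitespace ... whitespace

# The unanchored pattern also fires inside longer words.
in_terrorism = bool(unanchored.search("Terrorism"))

# The delimited pattern needs whitespace on both sides of the word.
delimited_terrorism = bool(delimited.search("Terrorism"))
delimited_mid = bool(delimited.search("an error occurred"))
delimited_start = bool(delimited.search("error at line start"))

print(in_terrorism, delimited_terrorism, delimited_mid, delimited_start)
```

The last check illustrates a side effect: because a whitespace character is required before the word, an error at the very beginning of a line is no longer found.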
The word error is usually at the beginning of a line in logfiles. If you are only interested in these lines, you just need to write:
begin with one of "eE" literally "rror"
The tested text now needs to start with Error or error. However, the expression only works if the program treats each line of the file as a separate text to be tested (similar to grep).
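The line-by-line caveat is easy to reproduce. The following Python sketch uses ^[eE]rror as a hand-written equivalent of the SRL expression and an invented three-line log text; both are assumptions made for illustration.

```python
import re

# Invented log text with three lines; two of them start with the keyword.
log_text = "error: first line\nwarning: no error here\nError: third line\n"

line_start = re.compile(r"^[eE]rror")

# Applied to the whole text, ^ anchors only at the very beginning.
whole = line_start.findall(log_text)

# Applied line by line, as grep does, every line gets its own anchor.
per_line = [ln for ln in log_text.splitlines() if line_start.search(ln)]

# Alternatively, re.MULTILINE lets ^ match after every newline.
multiline = re.findall(r"^[eE]rror", log_text, flags=re.MULTILINE)

print(whole)
print(per_line)
print(multiline)
```

Only the first keyword is found when the whole file is treated as a single string; splitting into lines or enabling multiline mode recovers the second one.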
Some logfiles mark errors with the abbreviation EE, which you could include in the expression with:
begin with any of (literally "EE", (one of "eE" literally "rror"))
As with traditional regular expressions, parentheses group matching subexpressions. The term any of serves as a logical Or. In the example, the expression looks for lines beginning either with the character string EE or with Error or error. The comma is cosmetic.
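In conventional regex syntax, this kind of logical Or is written as an alternation inside a group. The Python sketch below uses a hand-written pattern that is assumed to be equivalent to the SRL expression; the sample lines are invented.

```python
import re

# Hand-written equivalent (an assumption, not SRL output):
# the line must start with "EE", "Error", or "error".
pattern = re.compile(r"^(?:EE|[eE]rror)")

# Invented sample lines.
samples = [
    "EE failed to load module",
    "Error: permission denied",
    "error: file not found",
    "(EE) marker not at line start",
    "ee in lowercase",
]

results = {s: bool(pattern.search(s)) for s in samples}
for text, matched in results.items():
    print(matched, text)
```

The last two samples fail: one because EE is not at the start of the line, the other because the literal EE is case-sensitive.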
When the Post Rings
Sometimes characters should be repeated several times. For example, with the abbreviation EE, there are exactly two Es in succession. In SRL, you could say: literally "E" exactly 2 times. Instead of exactly 2 times, you could also write twice.
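Traditional regular expressions express such a repetition with a counted quantifier in curly braces. As a small Python sketch (E{2} is a hand-written equivalent, assumed to correspond to the SRL phrase):

```python
import re

# "E" exactly 2 times, written as a counted quantifier.
two_es = re.compile(r"E{2}")

# fullmatch() requires the whole string to consist of exactly two Es.
exact = {s: bool(two_es.fullmatch(s)) for s in ["E", "EE", "EEE"]}

# search() only needs two consecutive Es somewhere in the string,
# so a run of three Es still contains a match.
found_in_longer = bool(two_es.search("EEE"))

print(exact)
print(found_in_longer)
```

"Exactly" therefore refers to the repetition count of the element itself; without anchors, a longer run of Es still contains a match.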
In the following expression:
begin with any of (any character, one of ".-+") once or more
the expression any character stands for any letter between A and Z, a digit between 0 and 9, or an underscore _. Uppercase and lowercase are of no importance. The permitted characters can be repeated as often as desired; however, the entry once or more ensures a minimum of one character.
If the string you are looking for is an email address, you'll also need to ensure the presence of the @ character: literally "@". The domain name behind it may, in turn, be made up of several letters or numbers and the special characters . and -:
any of (letter, digit, one of ".-") once or more
The any character expression does not work for the domain name because domain names prohibit the underscore _. The letter and digit expressions specify letters and numerals without additional characters. The top-level domain, which starts with a period, forms the end:
literally "."
At least two more letters follow:
letter at least 2 times must end
By explicitly adding case insensitive, the developer states that uppercase and lowercase are irrelevant.
Listing 1 shows the whole expression. The expression deliberately keeps the email address test simple; for example, the standard allows other special characters in front of the @. In addition, a real domain name must always end with a letter or a number, which the expression does not check.
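Listing 1
Checking an Email Address
begin with any of (any character, one of ".-+") once or more,
literally "@",
any of (letter, digit, one of ".-") once or more,
literally ".",
letter at least 2 times,
must end, case insensitive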
Testing, Testing, 1, 2, 3
You can test your SRL expression directly at the SRL project website under the menu item Build [2]. Just enter the SRL expression under Your SRL Query, type a test text under Test Input, and have it checked via Run Query (Figure 1). At the bottom of the page, developers immediately find out whether the test text matches the SRL expression. In addition, the page supplies the corresponding regular expression for comparison.
Figure 2 shows the expression for Listing 1 as an example – which, by the way, is identical to the cryptic regular expression at the beginning of this article. If the tester places a check mark in front of Save Query (to the right of Test Input), the server keeps track of all entries. The tester can use the URL at the bottom of the page to access the page with the SRL expression at any time. It remains unclear where the stored data will reside, so testers should not use sensitive data with Test Input.
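If you prefer to check the result locally, the generated regular expression can also be tried in a script. The following Python sketch uses the pattern from the beginning of the article (written here without any embedded whitespace) and a few invented addresses; the re.ASCII flag is an added assumption that restricts \w to A-Z, 0-9, and the underscore, as the text describes.

```python
import re

# Pattern from the start of the article; the trailing /i becomes
# re.IGNORECASE. re.ASCII is an assumption added for this sketch.
EMAIL = re.compile(
    r"^(?:\w|[\.\-\+])+(?:@)(?:[a-z]|[0-9]|[\.\-])+(?:\.)[a-z]{2,}$",
    re.IGNORECASE | re.ASCII,
)

# Invented test addresses.
candidates = [
    "user.name+tag@example.org",
    "User@Example.COM",
    "user_name@example.com",
    "user@exam_ple.com",       # underscore in the domain
    "user@example.c",          # top-level domain too short
    "no-at-sign.example.com",  # missing @
]

results = {addr: bool(EMAIL.search(addr)) for addr in candidates}
for addr, ok in results.items():
    print(ok, addr)
```

As the article points out, the expression is deliberately simple, so passing this check does not mean an address is valid according to the standard.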