Scary Spookvember: Regex!

This is the first post in a new series where I learn and blog about things that scare developers. I came up with the idea right after Halloween ended, so unfortunately this has to be the Scary Spookvember series instead of Spooktober, which would've made more sense 🤦.

This first article is about Regex. Many developers think of it as a template that matches strings, but Regex is actually a simple programming language! To show this, let's take a look at a simple Regex that matches a lowercase UUID, such as e81dd3ba-44bb-11ec-81d3-0242ac130003. The Regex that I found on Stack Overflow is:
\b([0-9a-f]){8}-([0-9a-f]){4}-([0-9a-f]){4}-([0-9a-f]){4}-([0-9a-f]){12}\b
At first glance, this looks quite confusing and scary, but once we reformat it a bit, things start taking shape:
\b		# match the beginning or end of a word
(
  [0-9a-f]	# match any character from 0-9 or a-f 
) {8}		# loop 8 times
-		# match the - character
(
  [0-9a-f]	# match any character from 0-9 or a-f 
) {4}		# loop 4 times
-		# match the - character
(
  [0-9a-f]	# match any character from 0-9 or a-f 
) {4}		# loop 4 times
-		# match the - character
(
  [0-9a-f]	# match any character from 0-9 or a-f 
) {4}		# loop 4 times
-		# match the - character
(
  [0-9a-f]	# match any character from 0-9 or a-f 
) {12}		# loop 12 times
\b		# match the beginning or end of a word
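The annotated layout above isn't just a reading aid. Some Regex engines can run a pattern written this way. Here's a minimal sketch using Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The names UUID_PATTERN and find_uuids are my own, and I dropped the capture groups since nothing here reads them back.

```python
import re

# Verbose-mode version of the UUID pattern. With re.VERBOSE, whitespace and
# "#" comments inside the pattern are ignored by the regex engine.
UUID_PATTERN = re.compile(
    r"""
    \b              # match the beginning or end of a word
    [0-9a-f]{8}     # 8 characters from 0-9 or a-f
    -               # match the - character
    [0-9a-f]{4}     # 4 characters from 0-9 or a-f
    -
    [0-9a-f]{4}
    -
    [0-9a-f]{4}
    -
    [0-9a-f]{12}    # 12 characters from 0-9 or a-f
    \b              # match the beginning or end of a word
    """,
    re.VERBOSE,
)


def find_uuids(text):
    """Return every lowercase UUID found in text (hypothetical helper)."""
    return [match.group(0) for match in UUID_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_uuids("request id=e81dd3ba-44bb-11ec-81d3-0242ac130003 done"))
```

Since the character class only lists a-f in lowercase, an uppercase UUID won't match unless you also pass re.IGNORECASE.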
Now, the Regex is trivial to understand! We can even use this type of reasoning to build much more complicated Regex patterns. For example, I was signing up for an Epic Games account a few months back to get some of their free games, and when setting a password, they required it to contain a number, contain a letter, be at least 7 characters long, and have no whitespace.
How can we build a way to validate Epic Games passwords? Let's start by abstracting away some details into "functions".
has_number()
has_letter()
has_greater_than_or_equal_to_seven_characters()
not has_whitespace()
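Before translating these into Regex, it may help to see roughly what the four checks mean as ordinary code. This is only a sketch in plain Python: the first four function names come from the list above, is_valid_password_plain is a name I made up, and I'm assuming "letter" means ASCII a-z/A-Z and "number" means the digits 0-9.

```python
import string


def has_number(password):
    # True if at least one character is a digit 0-9.
    return any(ch in string.digits for ch in password)


def has_letter(password):
    # True if at least one character is an ASCII letter (assumption: ASCII only).
    return any(ch in string.ascii_letters for ch in password)


def has_greater_than_or_equal_to_seven_characters(password):
    # True if the password is 7 or more characters long.
    return len(password) >= 7


def has_whitespace(password):
    # True if any character is whitespace (space, tab, newline, ...).
    return any(ch.isspace() for ch in password)


def is_valid_password_plain(password):
    # Hypothetical helper: all three positive checks pass and the negative one fails.
    return (
        has_number(password)
        and has_letter(password)
        and has_greater_than_or_equal_to_seven_characters(password)
        and not has_whitespace(password)
    )
```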
Now, we can think about solving each one of these functions on its own. Let's try has_number first:
.		# match any character
  *?		# loop 0 times and try the rest of the regex, loop 1 time and try the rest of the regex, etc. until match
[0-9]		# match any digit
Then, let's do has_letter which is pretty similar to has_number:
.		# match any character
  *?		# loop 0 times and try the rest of the regex, loop 1 time and try the rest of the regex, etc. until match
[a-zA-Z]	# match any letter
Then, let's do has_greater_than_or_equal_to_seven_characters:
.		# match any character
  {7,}		# loop at least 7 times and try the rest of the regex
Another way to do this, since nothing forces the string to end right after the match, is something like:
.		# match any character
  {6}		# loop 6 times
.		# match any character
And finally, let's do has_whitespace, which we will negate later:
.		# match any character
  *?		# loop 0 times and try the rest of the regex, loop 1 time and try the rest of the regex, etc. until match
\s		# match a whitespace character
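Each of these small patterns can be tried on its own before gluing anything together. Below is a quick sketch in Python; the constant names and the check helper are my own inventions. I use re.match because it only tries the pattern from position 0, which is the same position the combined pattern will test from.

```python
import re

# The sub-patterns from above, written compactly.
HAS_NUMBER = re.compile(r".*?[0-9]")
HAS_LETTER = re.compile(r".*?[a-zA-Z]")
HAS_SEVEN_OR_MORE = re.compile(r".{7,}")
HAS_SEVEN_OR_MORE_ALT = re.compile(r".{6}.")  # the alternative form
HAS_WHITESPACE = re.compile(r".*?\s")


def check(pattern, text):
    # re.match anchors the attempt at the start of text but does not
    # require the pattern to reach the end of text.
    return pattern.match(text) is not None


if __name__ == "__main__":
    for name, pattern in [
        ("has_number", HAS_NUMBER),
        ("has_letter", HAS_LETTER),
        ("seven_or_more", HAS_SEVEN_OR_MORE),
        ("seven_or_more_alt", HAS_SEVEN_OR_MORE_ALT),
        ("has_whitespace", HAS_WHITESPACE),
    ]:
        print(name, check(pattern, "hunter22"))
```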
Now, we can combine these to form a full password validator. To do this, we will use positive and negative lookaheads. A lookahead attaches a condition to a match without consuming any characters. Say I wanted to write code to match the "x" character at position p; I might write "x" == word[p]. Attaching a lookahead to this "x" match looks like x(?=some_function) for a positive lookahead and x(?!some_function) for a negative lookahead. In pseudocode, the positive version is "x" == word[p] and some_function(word[p+1:]), and the negative version is "x" == word[p] and not some_function(word[p+1:]). A lookahead doesn't need a character in front of it, either: placed right after ^, it simply checks the string from the very start. With this extra bit of knowledge, we can now get to the Regex for our password validator:
^		# match the start of the line
  (?=		# positive lookahead for has_number function
    .
      *?
    [0-9]
  )
  (?=		# positive lookahead for has_letter function
    .
      *?
    [a-zA-Z]
  )
  (?=		# positive lookahead for has_greater_than_or_equal_to_seven_characters function
    .
      {7,}
  )
  (?!		# negative lookahead for has_whitespace function
    .
      *?
    \s
  )
.		# match any character
  *		# loop to the end
$		# match the end of the line
All in all, this Regex looks like: ^(?=.*?[0-9])(?=.*?[a-zA-Z])(?=.{7,})(?!.*?\s).*$ and it works!
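To back up the "it works" claim, here's a small sketch that first pokes at lookaheads in isolation and then runs the final pattern against a few passwords. PASSWORD_PATTERN and is_valid_password are names I picked for illustration, and the sample passwords are made up.

```python
import re

# Lookaheads on their own: the condition is checked right after "x",
# but the characters it inspects are not part of the match.
assert re.search(r"x(?=y)", "xy").group(0) == "x"   # "y" follows, so "x" matches
assert re.search(r"x(?=y)", "xz") is None           # "y" does not follow
assert re.search(r"x(?!y)", "xz").group(0) == "x"   # anything but "y" follows
assert re.search(r"x(?!y)", "xy") is None

# The final pattern from the post, unchanged.
PASSWORD_PATTERN = re.compile(
    r"^(?=.*?[0-9])(?=.*?[a-zA-Z])(?=.{7,})(?!.*?\s).*$"
)


def is_valid_password(password):
    """Return True if password satisfies all four requirements."""
    return PASSWORD_PATTERN.match(password) is not None


if __name__ == "__main__":
    for candidate in ["hunter22", "hunter", "1234567", "abc 1234"]:
        print(repr(candidate), is_valid_password(candidate))
```

All four lookaheads sit right after ^, so each one scans the password from the start without moving the match position. The trailing .*$ is the only part that actually consumes characters.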

That's all for this post about Regex. For more information, check out this great conference talk, Understanding and Using Regular Expressions, and for some extra practice, try implementing a Regex display name validator for Epic Games!

