Are you a Star (Wars|Trek) fan?

Imagine you are a definite Trekkie, and happen to share a computer with a Star Wars fan. Or the other way around. Whichever you prefer, I am not going to debate (in this post) which one is better. And on that computer, you have all episodes of Star Wars and all series of Star Trek. And you forget where these are saved on the computer, so you need to find them. But as the geek you already are, you want to find your favourite films without having to stumble upon any of your friend’s (or are they friends?!) films. This is where the labyrinthine game of regex golf comes into play.

In regex golf, you are meant to find the shortest possible regular expression that matches all terms in one set without matching any of the terms in another set. Just like code golf, where you have to create the shortest possible code to solve a given task. Just like in golf, where you are meant to visit all holes in as few shots as possible. So, given a set of Star Wars subtitles and one of Star Trek subtitles, can you write an as-short-as-possible regex to match only one of them?

Regex Golf

This regex, /m | [tn]|b/, which seems to be case-insensitive, dates back from when only six Star Wars episodes existed. Now that Episode VII has been released, let’s update the set to have the following:

  • “The Phantom Menace”
  • “Attack of the Clones”
  • “Revenge of the Sith”
  • “A New Hope”
  • “The Empire Strikes Back”
  • “Return of the Jedi”
  • “The Force Awakens”
  • “The Motion Picture”
  • “The Wrath of Khan”
  • “The Search For Spock”
  • “The Voyage Home”
  • “The Final Frontier”
  • “The Undiscovered Country”
  • “Generations”
  • “First Contact”
  • “Insurrection”
  • “Nemesis”
  • the one without a subtitle
  • “Into Darkness”

You will notice that the addition of “The Force Awakens” breaks the regex. This episode’s name is not captured by any of the three options given in the regex. So what’s the smallest change we can make to make the regex work again?

I found two equally long possibilities to this end: either we add an ‘a’ to the middle option, or replace the ‘b’ with the sequence ‘ke’.

/m | [tna]|b/
/m | [tn]|ke/

I’ll come back to this topic later on with a heuristic to find you an approximation of the shortest possible regex that you can have when playing golf.


Regular expressions training

I found Regex Crossword the other day, and immediately got addicted to it. So addicted, that my current ranking is 286, and just in a couple of days. Most puzzles are squares, some of the puzzles are hexagonal, many of them are extremely interesting, even using Unicode characters (e.g., Checkmate, Atheists). They pushed my knowledge of regular expressions to the limit. I have to admit there were several cases where I had to use an engine (I find regex101 pretty cool and easy to work with) to make sure I understand correctly what the regex wants.

Unicode can be tricky to work with when building a regular expression. Mostly because of the apparently infinite number of characters having similar functions: 17 white spaces, 24 dashes, 1984 lowercase letters etc. Luckily, you can rely on the fact that every single Unicode character is assigned to one of the (currently) 31 categories. You can view this list of categories, if interested in seeing what they are. And luckier us, most languages that support regexes have included these categories in their engines, to some extent, where they are equivalent to character classes.

If you look at this Word Character Class crossword, you will notice lots of, obviously, character classes. Imagine you would need to write those regexes for all letters, explicitly. Computer says no! It’s much easier to both write and understand if you just use \p{Ll} for lowercase letters, whichever script they might be in.

Happy regexing!