Are you a Star (Wars|Trek) fan?

Imagine you are a definite Trekkie, and happen to share a computer with a Star Wars fan. Or the other way around. Whichever you prefer, I am not going to debate (in this post) which one is better. And on that computer, you have all episodes of Star Wars and all series of Star Trek. And you forget where these are saved on the computer, so you need to find them. But as the geek you already are, you want to find your favourite films without having to stumble upon any of your friend’s (or are they friends?!) films. This is where the labyrinthine game of regex golf comes into play.

In regex golf, you are meant to find the shortest possible regular expression that matches all terms in one set without matching any of the terms in another set. Just like code golf, where you have to create the shortest possible code to solve a given task. Just like in golf, where you are meant to visit all holes in as few shots as possible. So, given a set of Star Wars subtitles and one of Star Trek subtitles, can you write an as-short-as-possible regex to match only one of them?

Regex Golf

This regex, /m | [tn]|b/, which seems to be case-insensitive, dates back from when only six Star Wars episodes existed. Now that Episode VII has been released, let’s update the set to have the following:

  • “The Phantom Menace”
  • “Attack of the Clones”
  • “Revenge of the Sith”
  • “A New Hope”
  • “The Empire Strikes Back”
  • “Return of the Jedi”
  • “The Force Awakens”
  • “The Motion Picture”
  • “The Wrath of Khan”
  • “The Search For Spock”
  • “The Voyage Home”
  • “The Final Frontier”
  • “The Undiscovered Country”
  • “Generations”
  • “First Contact”
  • “Insurrection”
  • “Nemesis”
  • the one without a subtitle
  • “Into Darkness”

You will notice that the addition of “The Force Awakens” breaks the regex. This episode’s name is not captured by any of the three options given in the regex. So what’s the smallest change we can make to make the regex work again?

I found two equally long possibilities to this end: either we add an ‘a’ to the middle option, or replace the ‘b’ with the sequence ‘ke’.

/m | [tna]|b/
/m | [tn]|ke/

I’ll come back to this topic later on with a heuristic to find you an approximation of the shortest possible regex that you can have when playing golf.


Current weather status

Ever wondered what the weather is like somewhere on this planet right now? Maybe somewhere where your special other is, or where you want to travel, or just out of data sciencey curiosity. Or maybe you want to study weather patterns and global warming. Of course, it is implied that you do not want to use Google or other more or less accurate readily available services.

Well, you are close to being able to find it all by yourself, and really accurately too. One of the industries that relies massively on accurate weather information is aviation. Almost all airports open for traffic publish weather information every 30 minutes, and sometimes even forecasts the weather for the following few hours, if there are chances of it changing significantly. These reports are called METAR reports.

All you need to do is find a way to access that kind of data. Fortunately, some people have already thought about this and provide weather data in real time. One such service is AVWX, the Aviation Weather REST API.

All you are left with to do is to call the API for a specific airport code (ICAO format), e.g., EGLL for London Heathrow. The example below uses Python and the urllib2 and json libraries to call the service and decode the answer into JSON.


Once you reach this point, it is just a matter of parsing the JSON values that are returned.

{"Altimeter": "1016", "Cloud-List": [["BKN", "031"], ["BKN", "046"]], "Dewpoint": "02", "Flight-Rules": "VFR", "Other-List": [], "Raw-Report": "EGLL 191320Z AUTO 22013KT 9999 BKN031 BKN046 10/02 Q1016 NOSIG", "Remarks": "NOSIG", "Remarks-Info": {}, "Runway-Vis-List": [], "Station": "EGLL", "Temperature": "10", "Time": "191320Z", "Units": {"Altimeter": "hPa", "Altitude": "ft", "Temperature": "C", "Visibility": "m", "Wind-Speed": "kt"}, "Visibility": "9999", "Wind-Direction": "220", "Wind-Gust": "", "Wind-Speed": "13", "Wind-Variable-Dir": []}

The raw METAR is included if you want to parse that directly. Additionally, some of the values included in there are explicit, some are just abbreviations used in aviation. A good place to look at if you want to understand what they mean is Texas A&M University, where you can find a quick course on METARs.

There are some Twitter accounts which already publish real-time weather information based on these METAR reports: Vremea la Iași and Vremea la Cluj-Napoca.

Of course, you are limited by the need of having a reporting weather station (usually an airport) wherever you want to look. But it should do the job to start with.

Good weather to all!

Regular expressions training

I found Regex Crossword the other day, and immediately got addicted to it. So addicted, that my current ranking is 286, and just in a couple of days. Most puzzles are squares, some of the puzzles are hexagonal, many of them are extremely interesting, even using Unicode characters (e.g., Checkmate, Atheists). They pushed my knowledge of regular expressions to the limit. I have to admit there were several cases where I had to use an engine (I find regex101 pretty cool and easy to work with) to make sure I understand correctly what the regex wants.

Unicode can be tricky to work with when building a regular expression. Mostly because of the apparently infinite number of characters having similar functions: 17 white spaces, 24 dashes, 1984 lowercase letters etc. Luckily, you can rely on the fact that every single Unicode character is assigned to one of the (currently) 31 categories. You can view this list of categories, if interested in seeing what they are. And luckier us, most languages that support regexes have included these categories in their engines, to some extent, where they are equivalent to character classes.

If you look at this Word Character Class crossword, you will notice lots of, obviously, character classes. Imagine you would need to write those regexes for all letters, explicitly. Computer says no! It’s much easier to both write and understand if you just use \p{Ll} for lowercase letters, whichever script they might be in.

Happy regexing!