Tuesday, January 31, 2012

Regular Expressions vs. Legacy String Functions

Back in the days of Visual Basic 6, when you wanted to pull some information out of a long string, you were (almost) required to write a function that parsed the string section by section using switch or If blocks. In today's world, regular expressions lighten the load quite a bit. As I've been developing apps over the years, I've learned to love regular expressions and figured they were worth mentioning in a post.

Regular expressions are basically patterns, backed by their own parsing engine, for extracting information from a string of text. For example, say I want to pull a certain value out of a nasty-looking string, and I know that the string will be in a certain "format". By this I mean that some parts of the string never change, while the parts that do change are the ones I want to use for something else. Instead of cracking my knuckles and writing a loop to sift through the string character by character, I'm able to write a single expression that pulls out the "matches" one by one.

Consider the following connection string:


            Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\data.xls;Extended Properties="Excel 8.0;HDR=YES";

From this string, I want to be able to pull out the value of 'Data Source' (in this case c:\data.xls) and the value of 'HDR' (i.e. YES or NO). Back in the day, I would write a few lines of code to Split() the string into pieces, then split those pieces into shorter pieces and save them into variables. Something like this:

Dim filePath As String = ""
Dim hasHeader As Boolean = False

' Naive split on ";". The quoted Extended Properties value contains a ";"
' of its own, so HDR=YES" arrives as a separate section.
Dim sections() As String
sections = Split(connectionString, ";")

For Each section As String In sections
  ' Split on the first "=" only, and skip sections that have no value.
  Dim pieces() As String
  pieces = Split(section, "=", 2)
  If pieces.Length < 2 Then Continue For

  ' Strip whitespace and any stray double quote left over from the split.
  Dim key As String = pieces(0).Trim()
  Dim value As String = pieces(1).Trim().Trim(""""c)

  If key = "Data Source" Then
    filePath = value
  Else
    If key = "HDR" Then
      hasHeader = (value = "YES")
    End If
  End If
Next

Look at all of that code!? Just to get two values from a single string. I used to write code like this in VB6 all the time. That code makes a lot of assumptions about the shape of the string, it is not optimized at all, and it is hardly understandable at first sight. There are plenty of chances for errors to creep in as well... Not fun.

Regular Expressions to the Rescue
When .NET was introduced, there was this new (to me) concept of regular expressions that allowed pieces to be picked out of the string as needed using a "pattern". This is very common in Unix and many other programming languages, but it was new to me at the time and intimidated me. It was one more thing I had to learn. I used to approach it with dread and felt like I had to relearn the syntax every time I used it.

Today, I actually think in regular expressions a lot of the time. There are many special characters that tell the regular expression engine what to do. For example, a "^" means "beginning of string" and a "$" means "end of string". A "." means any character. A "+" means "one or more times". A "*" means "zero or more times". When you put these special characters together, you can do some very powerful things. There are all kinds of documentation on the internet to help you understand this language if you are interested. If you are new to computer programming, then I recommend that you learn this sooner rather than later. It will make your life much easier.
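As a rough illustration of those special characters, here is a minimal sketch using the .NET Regex class. The sample strings and the module name are made up for the demo; only Regex.IsMatch comes from the framework.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MetacharacterDemo
    Sub Main()
        ' "^" anchors the match to the beginning of the string.
        Console.WriteLine(Regex.IsMatch("Provider=Jet", "^Provider"))   ' True
        Console.WriteLine(Regex.IsMatch("My Provider", "^Provider"))    ' False

        ' "$" anchors the match to the end of the string.
        Console.WriteLine(Regex.IsMatch("data.xls", "xls$"))            ' True

        ' "." matches any single character (except a newline by default).
        Console.WriteLine(Regex.IsMatch("cat", "c.t"))                  ' True

        ' "+" requires one or more of the preceding item.
        Console.WriteLine(Regex.IsMatch("ct", "ca+t"))                  ' False

        ' "*" allows zero or more of the preceding item.
        Console.WriteLine(Regex.IsMatch("ct", "ca*t"))                  ' True
    End Sub
End Module
```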

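To get the values from the above connection string using regular expressions, all it takes is a single pattern. It is shown here as it would appear inside a VB string literal (so each double quote is doubled), and it needs to be matched with case-insensitivity turned on:

provider\=microsoft\..+?\.oledb\..+;Data\sSource\=(?<filename>.*)\;Extended\sProperties\=\"".+\;HDR\=(?<hasheader>.+)\""\;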
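Here is one way such a pattern might be applied in VB.NET, as a sketch rather than the only way to do it. It assumes the two captures are named groups called "filename" and "hasheader", that the closing quote comes before the final semicolon (as it does in the sample connection string), and that matching is case-insensitive. The module and variable names are just for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ConnectionStringDemo
    Sub Main()
        Dim connectionString As String = _
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\data.xls;" & _
            "Extended Properties=""Excel 8.0;HDR=YES"";"

        ' Inside a VB string literal, each double quote is written twice.
        Dim pattern As String = _
            "provider\=microsoft\..+?\.oledb\..+;Data\sSource\=(?<filename>.*)\;" & _
            "Extended\sProperties\=\"".+\;HDR\=(?<hasheader>.+)\""\;"

        ' IgnoreCase lets the lowercase "provider" match "Provider".
        Dim m As Match = Regex.Match(connectionString, pattern, RegexOptions.IgnoreCase)

        If m.Success Then
            ' Named groups are read back by name from the Groups collection.
            Dim filePath As String = m.Groups("filename").Value
            Dim hasHeader As Boolean = (m.Groups("hasheader").Value = "YES")
            Console.WriteLine(filePath)    ' c:\data.xls
            Console.WriteLine(hasHeader)   ' True
        Else
            Console.WriteLine("Connection string did not match the pattern.")
        End If
    End Sub
End Module
```

Checking m.Success first matters: if the string does not fit the pattern, the groups come back empty instead of raising an error, which is easy to miss.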

Once I've defined this pattern, I can use it to pull out the "filename" and "hasheader" values very quickly and efficiently. While this example is borderline elementary, consider extracting values from a 20 KB XML file or a huge HTML string that you pulled from a web page. Better yet, consider the power that it offers you when parsing a 5 MB log file for information.
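To give a feel for the log file case, here is a small sketch that collects every ERROR line from a block of text. The log format (date, time, level, message separated by spaces) and the sample lines are invented for the example, so a real log would need its own pattern.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module LogScanDemo
    Sub Main()
        ' Hypothetical log lines: "<date> <time> <LEVEL> <message>",
        ' separated by line feeds.
        Dim logText As String = _
            "2012-01-31 08:15:02 INFO Service started" & vbLf & _
            "2012-01-31 08:15:09 ERROR Could not open c:\data.xls" & vbLf & _
            "2012-01-31 08:16:44 ERROR Timeout waiting for reply"

        ' Multiline makes ^ and $ match at the start and end of each line.
        Dim pattern As String = "^(?<date>\S+)\s(?<time>\S+)\sERROR\s(?<message>.*)$"

        ' Regex.Matches returns every match, not just the first one.
        For Each m As Match In Regex.Matches(logText, pattern, RegexOptions.Multiline)
            Console.WriteLine(m.Groups("time").Value & " " & m.Groups("message").Value)
        Next
    End Sub
End Module
```

With the sample text above, this prints the time and message of the two ERROR lines and skips the INFO line.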

This might be over many people's heads and it might be common knowledge for others. For me, it was way over my head for a few years. However, after using it so much and relying on it, it's become common knowledge; it is baked into my daily routines now. Regular expressions are also very common inside Vim and many Unix command line programs. For example, if I were to paste the connection string into Vim, I could place my cursor at the beginning of the connection string, type "d/Data Source" and press Enter, and everything from "Provider" up to (but not including) "Data Source" is removed.

Here is a killer utility application that you should put on your thumb drive and use if you need to parse a large text file for values: http://regexlab.codeplex.com/


It's free, and I've found it to be very powerful when constructing complex regular expressions in my daily programming tasks.

Regular expressions. Learn them. Use them.
