The joy of Regex
Listed in this article are a few real-world applications of the awesome Regular Expression (regex) patterns and their processors. Regex was made popular by the Perl programming language, although it has been around since the first Unix operating system was written. In Unix, one can access regex using sed, a command line utility which searches for regex patterns in the pattern space, i.e. the text you want to process, and applies a number of operations to it to produce the output text you desire. The operations include delete, replace and match. It is not my intention in this article to cover the basics of regex, which can be found by searching for regex on most search engines. I would like to concentrate on the things you can do with regex in real life. Sometimes a very complex problem is reduced to a trivial piece of code by applying regex to the pattern space.
Regex is now ubiquitous in modern programming languages (Perl, Java, C#, Python and Ruby, to name a few).
More advanced features of regex have also been incorporated into most implementations, as compared to the original version embedded into Unix's sed. These days one can perform:
- greedy vs. ungreedy matching
- positive and negative lookahead assertions
- matching of character classes instead of just literal matching e.g. non-space characters (\S), digits only (\d) etc.
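To make the three features above concrete, here is a minimal Ruby sketch. The sample strings and variable names are made up for illustration; only the regex constructs themselves come from the list.

```ruby
# Greedy vs. ungreedy matching: ".*" grabs as much as it can,
# while ".*?" stops at the first place the rest of the pattern fits.
html   = '<b>bold</b> and <i>italic</i>'
greedy = html[/<.*>/]    # runs from the first "<" to the last ">"
lazy   = html[/<.*?>/]   # stops at the first ">"

# Lookahead: test what follows without consuming it.
text   = 'total: 42 apples, 7 pears'
apples = text[/\d+(?= apples)/]        # positive: number followed by " apples"
others = text.scan(/\d+\b(?! apples)/) # negative: numbers NOT followed by " apples"

# Character classes: \S is any non-space character, \d is any digit.
sample  = 'a1 b22  c333'
words   = sample.scan(/\S+/)
numbers = sample.scan(/\d+/)

puts greedy, lazy, apples, others.inspect, words.inspect, numbers.inspect
```

The \b in the negative lookahead matters: without it the engine would backtrack and happily match the "4" of "42", because "4" on its own is not followed by " apples".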
OK, here are a few good ones I have accumulated in the course of developing this site. If you know of any other kick-ass regex patterns, let me know.
-
Parse columns from a CSV (comma separated) file
When you have to process comma separated records, like those imported into or exported from the ubiquitous Excel, the format to cope with is this: the first line contains the names of the columns, and the rest of the lines contain the actual data. Each data line contains exactly the same number of columns as the first, or header, line. Each column is separated from the next by a comma. If you have a comma in the data itself, you enclose that whole column's value in double quotes.
So, the parser needs to be able to cope with the following variations:
- field1,field2,field3,field4
- field1,"field2 with embedded space",field3,field4
- field1,field2,"field3 with embedded comma (,)",field4
The Ruby code for splitting the above lines into an Array of fields, in order of appearance, is as follows. The magic is contained in the regex, which splits only on commas that are followed by an even number of double quotes, i.e. commas that sit outside a quoted field, and so copes with all the variations above:
record = line.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/) # [ "field1", "field2", "field3", "field4" ]
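One thing to watch with the split above is that a quoted field keeps its surrounding double quotes. Below is a small sketch of how I would wrap it up; parse_csv_line and CSV_SPLIT are names invented for this example, and it assumes quotes are never escaped inside a field.

```ruby
# The same pattern as above, without the redundant backslashes.
CSV_SPLIT = /,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/

# Hypothetical helper: split one line and strip the enclosing quotes.
def parse_csv_line(line)
  # The -1 limit keeps trailing empty columns, so every row has the
  # same number of fields as the header line.
  line.chomp.split(CSV_SPLIT, -1).map do |field|
    field.sub(/\A"(.*)"\z/m, '\1')
  end
end

lines = [
  'name,comment,price',
  'apple,"red, shiny",1.20',
  'pear,,'
]

# The first line is the header; pair each data row with the column names.
header, *rows = lines.map { |l| parse_csv_line(l) }
records = rows.map { |row| header.zip(row).to_h }

records.each { |r| puts r.inspect }
```

If your data can contain doubled quotes ("") or line breaks inside a field, the csv library that ships with Ruby is the safer choice; the regex is for the quick and simple case.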
-
Replace embedded emoticons with the equivalent images
A lot of blogs these days allow users to pick emoticons from a popup in an embedded WYSIWYG editor. Emoticons, or smileys, reflect the mood of the author, or convey his/her emotions where they are relevant to the discussion. However, this is overkill for a quick message board. When people type into a message board, they sometimes leave a shorthand version of these smileys. There is a convention of sorts for these shorthands: they are built from characters on the standard keyboard and look like a face when you turn them sideways. For example, :-) represents a smile.
This is another area where regex comes into its own:
line.gsub!(/\:\-\)|\:\)/mi,'<img src="/images/smileys/regular_smile.gif"/>')
line.gsub!(/\;\-\)|\;\)/mi,'<img src="/images/smileys/wink_smile.gif"/>')
line.gsub!(/\:\-\(|\:\(/mi,'<img src="/images/smileys/sad_smile.gif"/>')
line.gsub!(/\:\-D|\:D/mi,'<img src="/images/smileys/teeth_smile.gif"/>')
line.gsub!(/8\-\)|8\-D/mi,'<img src="/images/smileys/shades_smile.gif"/>')
line.gsub!(/\:\"\)/mi,'<img src="/images/smileys/embarassed_smile.gif"/>')
line.gsub!(/\:\-s/mi,'<img src="/images/smileys/confused_smile.gif"/>')
line.gsub!(/\:,\(/mi,'<img src="/images/smileys/cry_smile.gif"/>')
line.gsub!(/\:\-\>|\:\>/mi,'<img src="/images/smileys/devil_smile.gif"/>')
line.gsub!(/\:\-o|\:\-0/mi,'<img src="/images/smileys/surprised_smile.gif"/>')
line.gsub!(/\:\-P|\:P/mi,'<img src="/images/smileys/tongue_smile.gif"/>')
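If the list of smileys keeps growing, the same idea can be driven from a table instead of one gsub! per line. This is only a sketch: SMILEYS and replace_smileys are names I made up, the table shows just three of the images above, and unlike the originals the generated patterns are case sensitive.

```ruby
# Image name => the shorthand forms that should map to it.
SMILEYS = {
  'regular_smile' => [':-)', ':)'],
  'wink_smile'    => [';-)', ';)'],
  'sad_smile'     => [':-(', ':(']
}.freeze

# Hypothetical helper: returns a new string with every known
# shorthand replaced by its <img> tag.
def replace_smileys(text, table = SMILEYS)
  table.reduce(text) do |out, (image, tokens)|
    # Regexp.union escapes the literal strings, so there is no need
    # to backslash every colon, dash and bracket by hand.
    pattern = Regexp.union(tokens)
    out.gsub(pattern, %(<img src="/images/smileys/#{image}.gif"/>))
  end
end

puts replace_smileys('Thanks for the tip :-) see you later ;)')
```

The main gain is that adding a smiley becomes a one-line change to the table, and the escaping mistakes that creep into hand-written patterns go away.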
-
Replace embedded email addresses with equivalent anchors
When people send me comments, they usually put their email addresses into the body of the message. When I display these comments on these blog pages, it would be nice to be able to click on an email address and have it pop up a mail window (Outlook, Thunderbird etc., take your pick), allowing me and other blog readers to quickly send that person a message. The regex to do this on the fly is:
line.gsub!(/(\S+@\S+)/mi,' <a href="mailto:\1">\1</a>')
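The pattern above is deliberately loose: \S+ on either side of the @ will also swallow a trailing comma or full stop. A slightly stricter sketch follows; EMAIL and link_emails are names invented here, and the pattern is a rough approximation of what an address looks like, not a full validator.

```ruby
# Local part: word characters, dots, plus and minus signs.
# Domain: one or more dot-separated labels after the first one.
EMAIL = /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/

# Hypothetical helper: wrap every address in a mailto: anchor,
# leaving the surrounding punctuation and spacing alone.
def link_emails(text)
  text.gsub(EMAIL) { |addr| %(<a href="mailto:#{addr}">#{addr}</a>) }
end

puts link_emails('Drop me a line at joe@example.com, thanks.')
```

Using the block form of gsub also means the replacement does not need the leading space trick, since nothing around the address is consumed by the match.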
-
Replace embedded URLs with clickable anchors
Back to that message board again. To automatically turn the URLs people send into real anchors when rendering the message for the web page, the pattern is:
line.gsub!(/\s(\w+:\/\/)([^\s\/]+)([^\s]*)/mi,' <a href="\1\2\3" target="_blank">\2</a>')
Of course, this approach will not optimise the anchor for the search engines because we miss out on the title attribute (it can not be derived from the URL itself, unless we grab the page it points to and perform a keyword analysis on it, which is not very practical). Note also that the pattern requires a whitespace character in front of the URL, so a URL at the very start of the text is left alone.
The other limitation of the above technique is that it always puts the name of the website, without the URI path, into the visible text of the anchor. You can replace it with something generic like external link, but beyond that it can not derive the visible text from the link itself, unless this text is known in advance.
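Both points can be handled with a small variation on the pattern. The sketch below is an assumption on my part rather than the code running on this site: URL and link_urls are invented names, and the optional label argument is the generic visible text mentioned above.

```ruby
# Same three pieces as the original pattern (scheme, host, path), but
# the leading part also accepts the start of a line, not just a space.
URL = /(^|\s)(\w+:\/\/)([^\s\/]+)(\S*)/

# Hypothetical helper: label overrides the visible text of the anchor;
# when it is nil the host name is shown, as in the original.
def link_urls(text, label: nil)
  text.gsub(URL) do
    lead, scheme, host, path = $1, $2, $3, $4
    visible = label || host
    %(#{lead}<a href="#{scheme}#{host}#{path}" target="_blank">#{visible}</a>)
  end
end

puts link_urls('see http://example.com/a/b for details')
puts link_urls('http://example.com/a/b', label: 'external link')
```

Capturing the leading whitespace and putting it back keeps the original spacing intact, which matters when the message contains line breaks.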
As ever, I only talk about things I have experimented with, and you will see examples of the above usages throughout this site (well, apart from the CSV upload, which I use to maintain the Promotions page, which I am keeping away from the world, for good reasons).
by David at 15 Jun 2006 14:49:45