Regular expressions have been one of my favorite programming tools since I first discovered them. They are wonderfully robust, and there are usually many ways to accomplish the same thing with them. For example, here are multiple ways to match the dotted form of an IPv4 address (none of them check that each number is 255 or less):
^\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?$
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
^(\d{1,3}\.){3}\d{1,3}$
^([0-9]{1,3}\.){3}[0-9]{1,3}$
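As a quick sanity check, here is a minimal Python sketch that runs the four patterns above against a few test strings and confirms they agree with each other. The sample strings and the helper name all_agree are my own, made up for illustration.

```python
import re

# The four equivalent IPv4 patterns from above.
PATTERNS = [
    r"^\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?$",
    r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$",
    r"^(\d{1,3}\.){3}\d{1,3}$",
    r"^([0-9]{1,3}\.){3}[0-9]{1,3}$",
]

# Hypothetical test strings: two that should match, the rest should not.
SAMPLES = ["192.168.0.1", "1.2.3.4", "1.2.3", "1.2.3.4.5", "1234.1.1.1", "a.b.c.d"]


def all_agree(text):
    """Return the shared verdict if every pattern agrees, else raise."""
    verdicts = {re.search(p, text) is not None for p in PATTERNS}
    if len(verdicts) != 1:
        raise ValueError(f"patterns disagree on {text!r}")
    return verdicts.pop()


for sample in SAMPLES:
    print(sample, all_agree(sample))
```

One caveat if you try this in Python 3: \d also matches non-ASCII digits in Unicode strings unless re.ASCII is passed, so the [0-9] version is slightly stricter than the other three.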
One of my major annoyances, though, has always been lists. I have always written them as ^(REGEX,)*REGEX$.
For example, I would do a list of IP addresses like this: ^((\d{1,3}\.){3}\d{1,3},)*(\d{1,3}\.){3}\d{1,3}$.
I recently realized, however, that a list can be done much more elegantly as follows: ^(REGEX(,|$))+(?<!,)$. It works like this:
^: Start of the statement (test string)
(REGEX(,|$))+: A list of items, each followed by either a comma or EOS (end of statement). If we keep this regular expression in non-multiline mode (the default), then the EOS can only happen at the end of the statement.
(?<!,): This is a look-behind assertion saying that the last character before the EOS cannot be a comma. If we didn’t have this, the list could look like this, with a comma at the end: “ITEM,ITEM,ITEM,”.
$: The end of the statement
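To see what the look-behind buys us, here is a small Python sketch comparing the old list style, the new style, and the new style with the look-behind removed. ITEM is a made-up stand-in for REGEX (lowercase words), and the variable names are mine.

```python
import re

# Hypothetical item pattern standing in for REGEX.
ITEM = r"[a-z]+"

old_style = re.compile(rf"^({ITEM},)*{ITEM}$")
new_style = re.compile(rf"^({ITEM}(,|$))+(?<!,)$")
# Same as new_style, but without the look-behind assertion.
no_lookbehind = re.compile(rf"^({ITEM}(,|$))+$")

for text in ["item", "item,item,item", "item,item,item,", "item,,item", ""]:
    print(
        repr(text),
        bool(old_style.search(text)),
        bool(new_style.search(text)),
        bool(no_lookbehind.search(text)),
    )
```

The only string the three disagree on is the one with the trailing comma: without the look-behind it slips through, because the final comma satisfies (,|$) and the closing $ then matches right after it.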
So the new version of the IP address list would look like this: ^((\d{1,3}\.){3}\d{1,3}(,|$))+(?<!,)$, instead of this: ^((\d{1,3}\.){3}\d{1,3},)*(\d{1,3}\.){3}\d{1,3}$.
Also, since an IP address is just a list of numbers separated by periods, it could also look like this: ^(\d{1,3}(\.|$)){4}(?<!\.)$.
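Here is a short Python sketch that exercises the IP address versions: the old and new list patterns, plus the single address written as a list of numbers. The test addresses and variable names are just examples I picked.

```python
import re

ip_list_old = re.compile(r"^((\d{1,3}\.){3}\d{1,3},)*(\d{1,3}\.){3}\d{1,3}$")
ip_list_new = re.compile(r"^((\d{1,3}\.){3}\d{1,3}(,|$))+(?<!,)$")
# A single address treated as a list of exactly four numbers.
ip_as_list = re.compile(r"^(\d{1,3}(\.|$)){4}(?<!\.)$")

# Hypothetical inputs for the list patterns.
for text in ["10.0.0.1", "10.0.0.1,192.168.1.1", "10.0.0.1,", "10.0.0.1,,10.0.0.2"]:
    print(repr(text), bool(ip_list_old.search(text)), bool(ip_list_new.search(text)))

# Hypothetical inputs for the single-address pattern.
for text in ["10.0.0.1", "10.0.0", "10.0.0.", "10.0.0.1.", "1.2.3.4.5"]:
    print(repr(text), bool(ip_as_list.search(text)))
```

The {4} does the counting that the + could not: three numbers are too few, and a fifth number has nowhere to go once the look-behind rejects the period after the fourth.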
Today I thought I’d give a demonstration on the use of regular expressions [reference page here]. Regular expressions are basically a simplified scripting language for finding and replacing complex text strings, and are built into much of today’s software that involves a lot of text editing. They are a fabulously handy tool for computer users and are especially useful for programmers. I believe RegExps originally gained their popularity through the Perl programming language. I also recently heard that it is definite that the new version of C++ (C++0x) will have native library support for regular expressions, yay!
Since I posted yesterday on DNS stuff, and have the examples from it handy, I figured I’d use those :-).
Let’s say you had a group of .com domains and wanted to find out their name servers (I’ve had to do this when switching to new name servers to make sure all the domains we did not control at the registrar level had their name servers set to the new ones). For this example, we will use the following domains “castledragmire.com”, “riaboy.com”, “NonExistantDomainA.com”, and “dakusan.com”.
First, we’d need to have the list of the domains. For this example, one domain per line is used.
Next, we need to turn them into a bash (Linux) script to grab all the information we need.
Replace: “^(.*)$”
With: “echo '!?$1?!'; host -t ns $1 a.gtld-servers.net | grep ' name server ';”
Sample output: (The !? ?! stuff are markers for easier viewing and parsing)
echo '!?castledragmire.com?!'; host -t ns castledragmire.com a.gtld-servers.net | grep ' name server ';
echo '!?riaboy.com?!'; host -t ns riaboy.com a.gtld-servers.net | grep ' name server ';
echo '!?NonExistantDomainA.com?!'; host -t ns NonExistantDomainA.com a.gtld-servers.net | grep ' name server ';
echo '!?dakusan.com?!'; host -t ns dakusan.com a.gtld-servers.net | grep ' name server ';
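If your editor does not do regular expression replacements, the same step can be sketched in Python with re.sub. This is only an illustration under my own assumptions: the variable names are mine, Python writes back-references as \1 where the editor syntax above uses $1, and re.MULTILINE is needed so that ^ and $ match at every line.

```python
import re

# One domain per line, as in the example above.
domains = """castledragmire.com
riaboy.com
NonExistantDomainA.com
dakusan.com
"""

# strip() drops the trailing newline, so ^(.*)$ cannot match an empty last line.
script = re.sub(
    r"^(.*)$",
    r"echo '!?\1?!'; host -t ns \1 a.gtld-servers.net | grep ' name server ';",
    domains.strip(),
    flags=re.MULTILINE,
)

print(script)
```

The printed text is the bash script, one command line per domain, ready to be saved to a file and run.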
Next, we run the script, and it would output the following:
!?castledragmire.com?!
castledragmire.com name server ns3.deltaarc.com.
castledragmire.com name server ns4.deltaarc.com.
!?riaboy.com?!
riaboy.com name server ns3.deltaarc.com.
riaboy.com name server ns4.deltaarc.com.
!?NonExistantDomainA.com?!
!?dakusan.com?!
dakusan.com name server ns3.deltaarc.com.
dakusan.com name server ns4.deltaarc.com.
Next, we would keep running the following regular expression until no more replacements are found.
This would combine all domains with multiple name servers onto one line with name servers separated by spaces.
Replace: “(.*?) name server (.*)\n\1 name server (.*)”
With: “$1 name server $2 $3”
It would output the following:
!?castledragmire.com?!
castledragmire.com name server ns3.deltaarc.com. ns4.deltaarc.com.
!?riaboy.com?!
riaboy.com name server ns3.deltaarc.com. ns4.deltaarc.com.
!?NonExistantDomainA.com?!
!?dakusan.com?!
dakusan.com name server ns3.deltaarc.com. ns4.deltaarc.com.
The final regular expression would turn the output into a single line per domain, followed by its name servers. The marker line that currently sits before each list of name servers is there to help spot any domains that did not provide us with name servers.
Replace: “!\?(.*?)\?!\n\1 name server (.*)”
With: “#$1\t$2”
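For completeness, here is the same final replacement as a Python sketch. The function name collapse_markers and the example.com style test data are placeholders of my own, not real lookups.

```python
import re

# A marker line followed by the merged name server line for the same domain.
FINAL = re.compile(r"!\?(.*?)\?!\n\1 name server (.*)")


def collapse_markers(text):
    """Turn each marker + name server pair into '#domain<TAB>servers'."""
    return FINAL.sub(r"#\1\t\2", text)


# Hypothetical input: one domain with name servers, one without.
sample = (
    "!?example.com?!\n"
    "example.com name server ns1.example.net. ns2.example.net.\n"
    "!?missing.example?!"
)

print(collapse_markers(sample))
# prints:
# #example.com<TAB>ns1.example.net. ns2.example.net.   (<TAB> is a real tab character)
# !?missing.example?!
```

A domain with no name servers has nothing for the pattern to match, so its marker line is left untouched and stands out in the result.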
Running this final replacement would output the following data: