Thursday, April 29, 2010

Parsing Delimited Data

So I had to parse some delimited data today in Oracle and it turned out to be a little more complicated that it probably needed to be.  This is a great example of something being simple in one technology and difficult in another.

In .NET, we can simply delimited data string, and call the Split method and provide the delimiter and we're provided with an array of the tokenized elements.

Since what I was parsing was IP address data, I will carry that theme into my example.

In .NET...

string a;
a = "192.168.1.1";
string[] aParts = a.Split('.'); 
 
What I end up with is an array of 4 elements 192, 168,1,1.  I have parsed my IP address.  I can then loop through each string in the array or reference each element using its zero-based index.
 
In Oracle, this proved to be a bit more complicated.  A split type function doesn't exist in Oracle.  I could have written one but that was overkill since I'm writing this db procedure to be ran one time for a data migration of a few thousand records.
 
So let's take a look at what I did.  Keep in mind, I took of the FROM and WHERE clause from these queries because they don't really pertain to what I'm talking about.

We will assume the IPADDR value is 192.168.1.2.  Keep in mind that I'm doing the REPLACE on the entire SUBSTRING operation to remove the period (.) from the IP.
 
You'll notice I'm using substr and instr heavily in this example. 
 
INSTR returns the position of an character inside a string.  INSTR  has several overloads.  This overload I'm using takes 1) the string to examine, 2) the character(s) to find, 3) what position to start in the string, and 4) which instance of the character(s) to return the position of.
 
SUBSTRING returns a subset of characters inside a string.  The overload I'm using for it takes 1) the source string, 2) the position to start from, and 3) the number of characters to grab.
 
1. SELECT replace(substr(IPADDR, 1, instr(IPADDR, '.', 1,1)), '.', '') INTO mIP1

So for the "192" part of the IP ... I'm looking in my string, in the 1st position, and stopping at the location of the 1st period.    The instr returns 4 so I actually get back "192." and I just remove the period. 

You'll notice in Oracle, these strings are a 1-based index unlike C# which uses the 0-based.
2. SELECT replace(substr(IPADDR, instr(IPADDR, '.', 1,1), instr(IPADDR, '.', 1,2)- instr(IPADDR, '.', 1,1)), '.', '') INTO mIP2

So now I need the "168" part.  So I want to start after the 1st period and end at the 2nd period.  You'll notice that I substract the position of the 2nd period instr(IPADDR, '.', 1,2) from the location of the 1st period instr(IPADDR, '.', 1,1) to get the length between the two periods.

And again, I remove any periods with the outer replace.
3. SELECT replace(substr(IPADDR, instr(IPADDR, '.', 1,2), instr(IPADDR, '.', 1,3)- instr(IPADDR, '.', 1,2)), '.', '') INTO mIP3

Here I do basically the same thing to the "1" except I look for the location of the 2nd period and subtract the location of the 3rd period from the 2nd period to get my length.

4. SELECT replace(substr(IPADDR, instr(IPADDR, '.', 1,3), 4), '.', '') INTO mIP4

At the end, I start from the position of the 3rd period and go a hard coded 4 characters.  This is misleading because there may not be 4 characters but rather this just goes to the end of the string.  I didn't realize I'd done that until I look at it again for this blog.  But its not a big deal in this case.

So there you go.  Parsing an IP Address in C# vs. parsing it in Oracle.  I should point out that this same technique should work in MS SQL as-is or with minor modifications.  As a bonus lesson, you can see that if I had a complex data parsing situation, maybe using C# would be quicker and easier than trying to do the parsing in the database.

As a final note, I could have done a Regular Expression as well but that's probably more illegible than these queries were!  Plus I'm did this for a client and after I'm gone this should be easier to understand/maintain.

No comments:

Post a Comment