North East Bytes - a Microsoft technology usergroup in North East England.

Powered by Squarespace

« We've all been there | Main | NEBytes November 16th: Virtualising SharePoint and Mobile BI »
Wednesday
Nov162011

Finding duplicate keys in a data file with PowerShell

In my day job I've sometimes found myself looking for anomalies in data exported from our Siemens HiPath PBX. Siemens produce a tool that we run on a schedule to export some of the configuration settings for telephone extensions so that we can use it in various other systems. The file is called port.txt and it occasionally contains duplicate values in its unique_key field. (Don't you just love irony?!)

The HiPath doesn't care (apparently) but when we import the data into SQL Server it goes into a table where unique_key is the primary key, so the DTS package doing the import fails. We can find the duplicate unique_key very quickly with PowerShell and report back to the telecoms people to fix their config.

The data arrives in a CSV file, which are easy to work with in PowerShell. (I don't know about anyone else, but I work with CSV files a lot.) Finding our duplicate is about as simple as:

Import-CSV port.txt | Group-Object unique_key | `
  Where-Object {$_.count -gt 1} | Select-Object -ExpandProperty Group

Sadly this file is not immediately usable. The header row is a,b,c,d,
i.e. The last item doesn't have a name. Shame on you Siemens!

So we have to cope with that. There are a number of ways we could do this but we're using one of the CSV data handling cmdlets that came along in PowerShell 2, ConvertFrom-CSV, which basically takes some comma-separated values and converts them into objects in the same way as Import-CSV does, but without the file.

Obviously we need to get it into a proper CSV format with no empty headers first. In fact, in this particular example there's a comma on the end of every row with nothing after it, so we'll simply replace those trailing commas with a comma followed by something using a regular expression match:

Get-Content port.txt | ForEach-Object {$_ -replace ",$", ",foo"}

That gives us something like a,b,c,d,foo which we can convert to objects and deal with as before, so the whole job can be achieved with:

GC port.txt | Foreach {$_ -replace ",$", ",foo"} | `
  ConvertFrom-CSV | Group unique_key | `
  Where {$_.count -gt 1} | Select -ExpandProperty Group

That should be clear, perhaps with the exception of the last cmdlet. When you group objects by some value, you get a GroupInfo object. That contains the value that you grouped on, the count of objects in the group, and a collection of the objects in the grouping called "group". I'm using the ExpandProperty parameter on the Select-Object cmdlet to expand those objects back out so I can see the data that's causing the problem. In this case it looks something like this:

unique_key       : 87734
switch_name      : KN0A
pen              : 1-14-25-25
extension        : 496625
tcos             : 6
least_cost_rout1 : 32
foo              : foo

unique_key       : 87734
switch_name      : KN0A
pen              : 1-15-49-73
extension        : 497573
tcos             : 6
least_cost_rout1 : 32
foo              : foo

PrintView Printer Friendly Version

EmailEmail Article to Friend

References (1)

References allow you to track sources for this article, as well as articles that were written in response to this article.
  • Response
    Excellent Web site, Maintain the fantastic work. Many thanks.

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>