Well, it’s been a looong time since I added any new posts to my blog – almost 4 years! I was busy 😉
In my day job, a lot of my time is spent automating manual tasks. Recently I was asked to automate the tidying of server reports so that clean data could feed into a tracker. The data was supplied in xlsb format (Excel binary) and contained many duplicate rows that needed removing. The manual process was to open the file in Excel, go to the ‘Data’ tab of the ribbon and choose ‘Remove Duplicates’ – this defaulted to all columns being selected, so clicking ‘OK’ did the job.
The automated script needed to run across many(!) xlsb files and I found that Excel ran out of memory after the first few if I relied on Excel macros to open multiple consecutive files. My solution was to use PowerShell – this meant that Excel could be closed between workbooks and so free up memory.
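As a sketch of that approach (the folder path and script name here are hypothetical), a small wrapper can invoke the per-file script once per workbook, so each run starts and quits its own Excel instance and memory stays flat:

```powershell
# Hypothetical wrapper: run the dedupe script once per .xlsb file.
# "C:\Reports" and Remove-Duplicates.ps1 are placeholders for your own paths.
Get-ChildItem -Path "C:\Reports" -Filter *.xlsb | ForEach-Object {
    # Each invocation creates a fresh Excel process and quits it on exit,
    # avoiding the memory build-up seen when one Excel session opens many files.
    powershell.exe -File .\Remove-Duplicates.ps1 -filepath $_.FullName
}
```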
There was a lot of trial and error to get the .RemoveDuplicates() part of the script to work. The MSDN docs say the Columns argument is optional, but that does not seem to be the case when calling it from PowerShell – my solution was to pass an array of column numbers.
The relevant parts of the script are below – hopefully it will help someone!
Param([String]$filepath) # accept the workbook path from the command line
$xl = New-Object -ComObject "Excel.Application"
$xl.DisplayAlerts = $false # suppress Excel prompts while scripting
$wb = $xl.Workbooks.Open($filepath) # open the workbook
$ws = $wb.Sheets.Item("Sheet1") # specify the worksheet
$columns = 1, 2, 3, 4, 5 # array of column numbers to check for duplicates
$ws.UsedRange.RemoveDuplicates($columns) # remove duplicate rows
I then had code that saved the deduplicated sheet.
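For anyone curious, a sketch of that save-and-close step might look like the following – the output folder is hypothetical, and 50 is Excel's XlFileFormat constant (xlExcel12) for the .xlsb binary format:

```powershell
$xlExcel12 = 50 # XlFileFormat value for the .xlsb (Excel Binary) format
# "C:\Reports\clean" is a hypothetical output folder; keep the original file name
$wb.SaveAs("C:\Reports\clean\" + (Split-Path $filepath -Leaf), $xlExcel12)
$wb.Close($false) # close without a second save prompt
$xl.Quit() # quit Excel so its memory is freed before the next file
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($xl) | Out-Null # release the COM reference
```

Quitting Excel and releasing the COM object between workbooks is what keeps memory usage under control across a long batch.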
Let me know if this helps, or if you have a better way of doing it!