Find and Delete Duplicate Rows From Tables without Primary Key

Someone recently asked me how to find and delete duplicate rows from sample tables, which do not have a Primary Key. I have written one such sample on MSDN code Find and/or Delete Duplicate Rows which I will share here

SAMPLE Data

-- Suppress data loading messages
SET NOCOUNT ON

-- Create Table
CREATE TABLE #Customers (ID integer, CustName varchar(20), Pincode int)

-- Load Sample Data in Table
INSERT INTO  #Customers VALUES (1, 'Jack',45454 )
INSERT INTO  #Customers VALUES (2, 'Jill', 43453)
INSERT INTO  #Customers VALUES (3, 'Tom', 43453)
INSERT INTO  #Customers VALUES (4, 'Kathy', 22343)
INSERT INTO  #Customers VALUES (5, 'David', 65443)
INSERT INTO  #Customers VALUES (6, 'Kathy', 22343)
INSERT INTO  #Customers VALUES (7, 'Kim', 65443)
INSERT INTO  #Customers VALUES (8, 'Hoggart', 33443)
INSERT INTO  #Customers VALUES (9, 'Kate', 61143)
INSERT INTO  #Customers VALUES (10, 'Kim', 65443)

To indentify duplicate rows, the trick is to Group the rows by CustName, Pincode. Rows having similar CustName and Pincode will have more than one rows in the grouping. So locate them using HAVING COUNT(*) > 1. If duplicate values are encountered, return the maximum ID for each duplicate row. Using the outer query, delete any ID returned by subquery.

-- Find Duplicate Rows
SELECT    MAX(ID) as ID, CustName, Pincode FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1

-- Delete Duplicate Rows
DELETE FROM #Customers
WHERE ID IN
( SELECT    MAX(ID) FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1)

SELECT * FROM #Customers

OUTPUT

About The Author

Suprotim Agarwal, MCSD, MCAD, MCDBA, MCSE, is the founder of DotNetCurry, DNC Magazine for Developers, SQLServerCurry and DevCurry. He has also authored a couple of books 51 Recipes using jQuery with ASP.NET Controls and a new one recently at The Absolutely Awesome jQuery CookBook.

Suprotim has received the prestigous Microsoft MVP award for nine times in a row now. In a professional capacity, he is the CEO of A2Z Knowledge Visuals Pvt Ltd, a digital group that represents premium web sites and digital publications comprising of Professional web, windows, mobile and cloud developers, technical managers, and architects.

Get in touch with him on Twitter @suprotimagarwal, LinkedIn or befriend him on Facebook

4 comments:

Anonymous said...: I'd change ...WHERE ID IN (...) to ...WHERE ID NOT IN (...), that way you'd get rid of all the duplicates, regardless how many there are.; October 18, 2010 at 6:05 AM
Federico Emi said...: Robert adding NOT IN will delete the rows that are 'not duplicate' and that's exactly the opposite of what the author describes.

If you are suggesting an alternate query, please elaborate since I get the desired results using the query shown in this article; October 18, 2010 at 9:12 AM
Anonymous said...: This won't work if there is more than one duplicate as it only deletes the duplicate with the largest ID field.; October 18, 2010 at 5:09 PM
Unknown said...: To satisfy the requirement that all dups are deleted, no matter how many there are, use the row_number() window function. Number the rows within the same group according to your id field in an ascending order, and then delete all rows that have the generated row_number() > 1. This way you'll be left with no dups and it's all in one go.; November 4, 2010 at 9:48 AM