Someone recently asked me how to find and delete duplicate rows from sample tables, which do not have a Primary Key. I have written one such sample on MSDN code Find and/or Delete Duplicate Rows which I will share here
SAMPLE Data
-- Suppress data loading messages
SET NOCOUNT ON
-- Create Table
CREATE TABLE #Customers (ID integer, CustName varchar(20), Pincode int)
-- Load Sample Data in Table
INSERT INTO #Customers VALUES (1, 'Jack',45454 )
INSERT INTO #Customers VALUES (2, 'Jill', 43453)
INSERT INTO #Customers VALUES (3, 'Tom', 43453)
INSERT INTO #Customers VALUES (4, 'Kathy', 22343)
INSERT INTO #Customers VALUES (5, 'David', 65443)
INSERT INTO #Customers VALUES (6, 'Kathy', 22343)
INSERT INTO #Customers VALUES (7, 'Kim', 65443)
INSERT INTO #Customers VALUES (8, 'Hoggart', 33443)
INSERT INTO #Customers VALUES (9, 'Kate', 61143)
INSERT INTO #Customers VALUES (10, 'Kim', 65443)
To indentify duplicate rows, the trick is to Group the rows by CustName, Pincode. Rows having similar CustName and Pincode will have more than one rows in the grouping. So locate them using HAVING COUNT(*) > 1. If duplicate values are encountered, return the maximum ID for each duplicate row. Using the outer query, delete any ID returned by subquery.
-- Find Duplicate Rows
SELECT MAX(ID) as ID, CustName, Pincode FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1
-- Delete Duplicate Rows
DELETE FROM #Customers
WHERE ID IN
( SELECT MAX(ID) FROM #Customers
GROUP BY CustName, Pincode
HAVING COUNT(*) > 1)
SELECT * FROM #Customers
OUTPUT
4 comments:
I'd change ...WHERE ID IN (...) to ...WHERE ID NOT IN (...), that way you'd get rid of all the duplicates, regardless how many there are.
Robert adding NOT IN will delete the rows that are 'not duplicate' and that's exactly the opposite of what the author describes.
If you are suggesting an alternate query, please elaborate since I get the desired results using the query shown in this article
This won't work if there is more than one duplicate as it only deletes the duplicate with the largest ID field.
To satisfy the requirement that all dups are deleted, no matter how many there are, use the row_number() window function. Number the rows within the same group according to your id field in an ascending order, and then delete all rows that have the generated row_number() > 1. This way you'll be left with no dups and it's all in one go.
Post a Comment