There can be two types of duplication of rows in a table
1. Entire row getting duplicated because there is no primary key or unique key.
2. Only primary key or unique key value is different, but remaining all values are same.
We want remove one of the duplicate records of John.
By issuing the following summary query, we can see which see which records are duplicate.
Now we will add this row to a local temporary table.
Now the situation is that the duplicate row is in the local temporary table. All we need to now is to delete records from main table customers1 as per matching custid of the local temporary table.
Will the above query work? Not entirely, as by using the above query, we lost all the duplicate records!! Let us see the table again.
Now to keep one record of John, we will take help of the local temporary table again. Let us add the same record from temporary table into customers1 table.
Finally we got a single record of John at the end. Let us confirm by seeing the Customers1 table.
Once done, we can drop the local temporary table.
Here is the same customer’s record, but this time John’s record has been added thrice with different customer ids but same Passport number!
We need to use the technique of Self Join initially to check for duplicate records containing different custid but same passport number.
Let’s first truncate the customers2 table and add the same rows again.
So the query will be as follows.
1. Entire row getting duplicated because there is no primary key or unique key.
2. Only primary key or unique key value is different, but remaining all values are same.
Scenario 1: Delete duplicate rows without primary key or unique key.
Let us create the following example.create table customers1 (CustId Int, CustName Varchar(20), CustCity Varchar(20), Passport_Number Varchar(20)) go Insert into customers1 Values(1, 'John', 'Paris', 'P123X78') Insert into customers1 Values(2, 'Martin', 'London', 'L873X92') Insert into customers1 Values(3, 'Smith', 'New York', 'N293Y99') Insert into customers1 Values(1, 'John', 'Paris', 'P123X78') go select * from customers1 go
We want remove one of the duplicate records of John.
By issuing the following summary query, we can see which see which records are duplicate.
select * from customers1 Group by Custid,CustName, CustCity, Passport_Number Having count(*) > 1
Now we will add this row to a local temporary table.
Select * into #Temp_customers1 from customers1 where 1 = 2
Insert into #Temp_customers1 select * from customers1 Group by Custid,CustName, CustCity, Passport_Number Having count(*) > 1
Now the situation is that the duplicate row is in the local temporary table. All we need to now is to delete records from main table customers1 as per matching custid of the local temporary table.
Delete from customers1 where custid in (select Custid from #Temp_customers1)
Will the above query work? Not entirely, as by using the above query, we lost all the duplicate records!! Let us see the table again.
select * from customers1 go
Now to keep one record of John, we will take help of the local temporary table again. Let us add the same record from temporary table into customers1 table.
Insert into Customers1 select * from #Temp_customers1 go
Finally we got a single record of John at the end. Let us confirm by seeing the Customers1 table.
select * from customers1 go
Once done, we can drop the local temporary table.
Scenario 2: Delete duplicate rows where primary key or unique key value is different but remaining values are same.
Let us create the following example.create table customers2 (CustId Int Primary Key, CustName Varchar(20), CustCity Varchar(20), Passport_Number Varchar(20)) go Insert into customers2 Values(1, 'John', 'Paris', 'P123X78') Insert into customers2 Values(2, 'Martin', 'London', 'L873X92') Insert into customers2 Values(3, 'Smith', 'New York', 'N293Y99') Insert into customers2 Values(4, 'John', 'Paris', 'P123X78') Insert into customers2 Values(5, 'John', 'Paris', 'P123X78') select * from customers2 go
Here is the same customer’s record, but this time John’s record has been added thrice with different customer ids but same Passport number!
Scenario 2.a: Delete Duplicate rows but keep one using CTE
select distinct a.* from customers2 a join customers2 b on a.custid <> b.custid and a.CustName = b.CustName and a.CustCity = b.CustCity and a.Passport_Number = b.Passport_Number
Now we have realized that custid 1, 4 & 5 are duplicate. The self-join statement accompanied by delete statement will give us the desired output of keeping the last duplicate record by eliminating all the previous duplicate records. We will use the Common Table Expression (CTE) and put the Self Join query in it.
With Duplicates as (select distinct a.custid as Customer_ID from customers2 a join customers2 b on a.custid <> b.custid and a.CustName = b.CustName and a.CustCity = b.CustCity and a.Passport_Number = b.Passport_Number ) Delete from Customers2 where custid in (select Customer_ID from Duplicates) and custid <> (select max(Customer_ID) from Duplicates)Let’s check which rows got deleted.
select * from customers2 go
Scenario 2.b: Delete all duplicate records but keep the first original one
Truncate Table customers2 go Insert into customers2 Values(1, 'John', 'Paris', 'P123X78') Insert into customers2 Values(2, 'Martin', 'London', 'L873X92') Insert into customers2 Values(3, 'Smith', 'New York', 'N293Y99') Insert into customers2 Values(4, 'John', 'Paris', 'P123X78') Insert into customers2 Values(5, 'John', 'Paris', 'P123X78') goThe only change in the sub query will be that we need to use min(CustomerID) instead of max(CustomerID).
So the query will be as follows.
With Duplicates as (select distinct a.custid as Customer_ID from customers2 a join customers2 b on a.custid <> b.custid and a.CustName = b.CustName and a.CustCity = b.CustCity and a.Passport_Number = b.Passport_Number ) Delete from Customers2 where custid in (select Customer_ID from Duplicates) and custid <> (select min(Customer_ID) from Duplicates)Let us confirm this in the customers2 table.
select * from customers2 go
And that’s how we can delete duplicate records in SQL Server with tables without primary key, containing primary key and by keeping one original row.
2 comments:
Scenario 1
There's no need for the initial SELECT INTO for the temp table. Just do it like this:
SELECT *
FROM customers1
GROUP BY
Custid
, CustName
, CustCity
, Passport_Number
HAVING COUNT(*) > 1
But this method is expensive, involving an INSERT, a DELETE and another INSERT. It hasn't been necessary to do it like this since the days of SQL Server 2000. It also has the fatal flaw that it won't work if you have duplicate IDs with different data. Add this row to your table and see what happens with your technique:
INSERT INTO customers1 VALUES (1, 'Betty', 'Huddersfield', 'J467C25')
Here is a method that is cheaper and more general:
WITH Numbered AS (
SELECT
ROW_NUMBER() OVER (
PARTITION BY Custid,CustName, CustCity, Passport_Number
ORDER BY (SELECT NULL)) AS RowNo
, Custid
, CustName
, CustCity
, Passport_Number
FROM customers1
)
DELETE FROM Numbered
WHERE RowNo > 1
Scenario 2
This works as far as it goes, although your method will tolerate two people having the same passport number, which I don't imagine is a good thing. Of course, if you had two people with different names but the same passport number, you'd have to make a decision on which one to delete, something you probably can't do in code, so let's stick with the scenario as you describe it for the sake of simplicity. Here is a more concise way of doing it. You can use ASC or DESC in the ORDER BY, depending on whether you want to keep the highest or lowest ID:
WITH Numbered AS (
SELECT
ROW_NUMBER() OVER (
PARTITION BY CustName, CustCity, Passport_Number
ORDER BY Custid ASC) RowNo
, Custid
, CustName
, CustCity
, Passport_Number
FROM customers2
)
DELETE FROM Numbered
WHERE RowNo > 1
John
Thanks for your solution John. Definitely better than the original proposed solution.
Post a Comment