Typical. You answer a question only to find that the answer you provided isn’t what the original poster was looking for by reading the dozen or so comments that arrived on the question while you were figuring out your answer… So I’ve taken my answer out and will put it here instead. 🙂
How can I deduplicate data?
In more detail…
Let’s work this through with a sample. A ludicrously simple sample.
DECLARE @DeDupe TABLE (f1 VARCHAR(100), i1 INT) INSERT INTO @DeDupe SELECT 'Fred', 1 UNION ALL SELECT 'Fred', 1 UNION ALL SELECT 'John', 1 UNION ALL SELECT 'John', 2 SELECT * FROM @DeDupe
At this point, we have a (very) small table with a duplicate row. So now we’ll use a CTE and that ROW_NUMBER function mentioned above and add the following to our query above:
WITH DeDupeTemp AS ( SELECT f1, i1, ROW_NUMBER() OVER ( PARTITION BY f1,i1 ORDER BY f1,i1) AS RowNum FROM @DeDupe) SELECT * FROM DeDupeTemp
As you can see, that final
SELECT gives us an extra column, the row number of the line within that particular combination of fields.
We now have an easy way of deleting the data we don’t need. Replace the final line of the above with:
DELETE FROM DeDupeTemp WHERE RowNum > 1 SELECT * FROM @DeDupe
Hurrah! No duplicate data. Manipulating the CTE like this has directly affected the base table.
The next step is to wrap that code up into a stored procedure that takes the name of a table as a parameter, prepares dynamic T-SQL statements and executes them to deduplicate the named table.