Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
russellhq
Regular Visitor

Deleting Duplicate Rows in a Data Warehouse Table?

I've created a data warehouse which has a dataflow that connects to an excel file. Each week the excel file is updated but some of the old data is still in it, so when the dataflow runs and appends the new data to the table, I end up with some duplicate rows in my table.

 

I tried multiple queries to get this to work, some along the lines of this link

 

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/development/remove-duplicate-rows...

 

But I keep getting an error when I run the query.

 

Has anyone got this to work for them? Ideally the query should find records that match on one column and delete them.

 

I did get this to work by splitting the CTE into one query, running that amd creating a new table, then using the delete query to delete all the duplicate rows. But can't for the life of me get them to work together!

1 ACCEPTED SOLUTION

Nothing to be proud of but if it works and doesn't break any GUIDs then so be it.

 

Maybe add an index column to your data to avoid this in the future.

View solution in original post

7 REPLIES 7
russellhq
Regular Visitor

OK, I've managed to come up with a solution for this, I don't know if it's a very good one, but it seems to work for what I want it to do.

 

The solution was to write a stored procedure that creates a Temp table by selecting the distinct rows in the original table. Then it drops the original table, creates a new table using the original tables name and selects all records from the temp table, then drops the temp table.

 

The procedure looks like this:

 

SET ANSI_NULLS ON;

SET QUOTED_IDENTIFIER ON;

CREATE TABLE [Excel Warehouse].[dbo].[TempDailyChecks] AS
SELECT DISTINCT *
FROM [DailyChecks];

DROP TABLE [DailyChecks];

CREATE TABLE [Excel Warehouse].[dbo].[DailyChecks] AS
SELECT *
FROM [TempDailyChecks];

DROP TABLE [TempDailyChecks];

 

Nothing to be proud of but if it works and doesn't break any GUIDs then so be it.

 

Maybe add an index column to your data to avoid this in the future.

Thanks @lbendlin 

My data comes from an excel spreadsheet that's exported from an online service and it has no index column. So I'm not sure how best to go about creating one.

 

For example, if I create an index column in my dataflow when I import it into my table in the Data Warehouse, when I get a new excel file and it contains some rows that are already in the data warehouse, I'm not sure where I'd go from here to match the rows in the data warehouse to the rows in the excel workbook as the index would likely be different.

 

Using DISTINCT seems to be the only way I could think of. But maybe this is my lack of understanding as I'm pretty new to SQL and databases in general.

My data comes from an excel spreadsheet that's exported from an online service and it has no index column

yeah, that's unfortunate.  If your solution works reliably in a reasonable amount of time then leave it as is.

russellhq
Regular Visitor

Thansk @Anonymous .

The database I am using is a Fabric Data Warehouse. I'm using a CTE with ROW_NUMBER() as you describe, but keep getting an error. The error I get it:

 

Internal Query Processor Error: The query processor could not produce a query plan. For more information, contact Customer Support Services.
Msg 8624, Level 16, State 1, Code line 1
 
For testing, I used a table called duplicate_test which has one column called Col1, with the values 1, 2, 3, 4, 5, 5, 3, 5
 
The queries I have tried are:

 

 

 
WITH CTE AS (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1 DESC) AS RowNum
        FROM
            dbo.duplicate_test
    )

    DELETE FROM CTE
    WHERE RowNum > 1 
 

and

 WITH CTE([Col1], 
    [DuplicateCount])
AS (SELECT [Col1],
           ROW_NUMBER() OVER(PARTITION BY [Col1]
           ORDER BY Col1) AS DuplicateCount
    FROM duplicate_test)
DELETE FROM CTE
WHERE DuplicateCount > 1 

 

Both give the following error:

 

 Internal Query Processor Error: The query processor could not produce a query plan. For more information, contact Customer Support Services.
Msg 8624, Level 16, State 1, Code line 1 

 

And in both, there is a red underline under DELETE, with the problem described as:

mismatched input 'DELETE' expecting {'SELECT', '('} 

 

 

 

 

 

 

 

Anonymous
Not applicable

Hi @russellhq ,

Based on your description, it appears that you are attempting a method to remove duplicates but are encountering errors when trying to execute the query. 

 

To effectively remove duplicates based on specific columns while avoiding errors, consider using the ROW_NUMBER() function in conjunction with a CTE. This method assigns each row in a dataset partition a unique row number based on a specified order. Any row within each partition with a row number greater than 1 can then be deleted, effectively removing duplicate rows.

 

If you are still encountering errors using this method, could you please provide more detailed information about the error messages you are receiving? It is also helpful to know the specific database system you are using (e.g. SQL Server, Azure SQL) as the syntax and functionality may vary.

 

Best Regards,

Ada Wang

If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.

After some more Googling, it looks like this may be a limitation of CTEs in a Fabric Data Warehouse

 

https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=s...

 

This stackoverflow post pointed me to this, and also gave a possible option to the OP on how to solve their issue, which was similar but not the same.

https://stackoverflow.com/questions/54624737/cte-with-delete-alternative-for-sql-data-warehouse

 

Is there a way to delete duplicates using a sub-query like the example in the stackoverflow post?

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.