Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Be one of the first to start using Fabric Databases. View on-demand sessions with database experts and the Microsoft product team to learn just how easy it is to get started. Watch now

Reply
scabral
Helper IV
Helper IV

deleting duplicate rows in lakehouse

Hi,

 

i developed a data pipeline in Fabric Data Factory that copy Parquet file data into a fabrick lakehouse table.  After loading the data, i need to remove duplicates from the lakehouse table.

 

I tried to add a script step in the pipeline, but it seems to only allow connections to warehouse and not lakehouse.  I also tried to create a notebook using spark sql and run a sql script against the lakehouse table like this:

 

;WITH
Dups
AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cmal.Id ORDER BY cmal.CreationTime) AS [Rank]
FROM dbo.AuditLog cmal)

DELETE Dups
FROM Dups
WHERE Dups.Rank > 1
GO

 

but i get the following error:

 

no viable alternative at input 'with'

 

so i don't think spark sql recognizes the 'with' funciton.  Does anyone know of another way to delete duplicates from a lakehouse table in Fabric?

 

thanks

scott

1 ACCEPTED SOLUTION
v-nikhilan-msft
Community Support
Community Support

Hi @scabral 
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3. 

vnikhilanmsft_0-1710439172869.png


After dropping the duplicates you can write this dataframe into the table in lakehouse.

vnikhilanmsft_1-1710439260549.png


Hope this helps. Please let me know if you have any further queries.



View solution in original post

5 REPLIES 5
v-nikhilan-msft
Community Support
Community Support

Hi @scabral 
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .

I have created a repro based on a sample table named List3. 

vnikhilanmsft_0-1710439172869.png


After dropping the duplicates you can write this dataframe into the table in lakehouse.

vnikhilanmsft_1-1710439260549.png


Hope this helps. Please let me know if you have any further queries.



thanks, i will try it out and reply with my findings

Hi @scabral 
We haven’t heard from you on the last response and was just checking back to see if your query got resolved. Otherwise, will respond back with the more details and we will try to help.
Thanks

thanks,

 

here is the code that I created using pyspark to remove duplicates from my table in lakehouse after loading:

 

df = spark.sql("Select Id, CreationTime from stgadminactivitieslog")
df1 = df.dropDuplicates(['Id', 'CreationTime'])
df1.write.format("delta").mode("overwrite").save("Tables/stgadminactivitieslog")
 
Scott

Hi @scabral 
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.

Helpful resources

Announcements
Las Vegas 2025

Join us at the Microsoft Fabric Community Conference

March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount!

Dec Fabric Community Survey

We want your feedback!

Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.

ArunFabCon

Microsoft Fabric Community Conference 2025

Arun Ulag shares exciting details about the Microsoft Fabric Conference 2025, which will be held in Las Vegas, NV.

December 2024

A Year in Review - December 2024

Find out what content was popular in the Fabric community during 2024.