Starting December 3, join live sessions with database experts and the Microsoft product team to learn just how easy it is to get started
Learn moreShape the future of the Fabric Community! Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions. Take survey.
Hi,
i developed a data pipeline in Fabric Data Factory that copy Parquet file data into a fabrick lakehouse table. After loading the data, i need to remove duplicates from the lakehouse table.
I tried to add a script step in the pipeline, but it seems to only allow connections to warehouse and not lakehouse. I also tried to create a notebook using spark sql and run a sql script against the lakehouse table like this:
;WITH
Dups
AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cmal.Id ORDER BY cmal.CreationTime) AS [Rank]
FROM dbo.AuditLog cmal)
DELETE Dups
FROM Dups
WHERE Dups.Rank > 1
GO
but i get the following error:
no viable alternative at input 'with'
so i don't think spark sql recognizes the 'with' funciton. Does anyone know of another way to delete duplicates from a lakehouse table in Fabric?
thanks
scott
Solved! Go to Solution.
Hi @scabral
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .
I have created a repro based on a sample table named List3.
After dropping the duplicates you can write this dataframe into the table in lakehouse.
Hope this helps. Please let me know if you have any further queries.
Hi @scabral
Thanks for using Fabric Community.
You can use the dropDuplicates() function in pyspark to drop the duplicates. This will return a new DataFrame with duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to the dropDuplicates() function.
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']) .
I have created a repro based on a sample table named List3.
After dropping the duplicates you can write this dataframe into the table in lakehouse.
Hope this helps. Please let me know if you have any further queries.
thanks, i will try it out and reply with my findings
Hi @scabral
We haven’t heard from you on the last response and was just checking back to see if your query got resolved. Otherwise, will respond back with the more details and we will try to help.
Thanks
thanks,
here is the code that I created using pyspark to remove duplicates from my table in lakehouse after loading:
Hi @scabral
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.
Your insights matter. That’s why we created a quick survey to learn about your experience finding answers to technical questions.
Check out the November 2024 Fabric update to learn about new features.
User | Count |
---|---|
4 | |
4 | |
2 | |
1 | |
1 |
User | Count |
---|---|
13 | |
7 | |
5 | |
4 | |
4 |