Hi,
I developed a data pipeline in Fabric Data Factory that copies Parquet file data into a Fabric lakehouse table. After loading the data, I need to remove duplicates from the lakehouse table.
I tried to add a script step in the pipeline, but it seems to only allow connections to a warehouse, not a lakehouse. I also tried creating a notebook with Spark SQL and running a SQL script against the lakehouse table like this:
;WITH
Dups
AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY cmal.Id ORDER BY cmal.CreationTime) AS [Rank]
FROM dbo.AuditLog cmal)
DELETE Dups
FROM Dups
WHERE Dups.Rank > 1
GO
but I get the following error:
no viable alternative at input 'with'
so I don't think Spark SQL recognizes the 'with' syntax. Does anyone know of another way to delete duplicates from a lakehouse table in Fabric?
Thanks,
Scott
Hi @scabral
Thanks for using Fabric Community.
You can use the dropDuplicates() function in PySpark to drop the duplicates. It returns a new DataFrame with the duplicate rows removed. You can also specify which columns to use to identify duplicates by passing a list of column names to dropDuplicates().
For example, to remove duplicate rows based on the id and name columns, you would use the following code: dataframe.dropDuplicates(['id', 'name']).
I have created a repro based on a sample table named List3.
After dropping the duplicates, you can write this DataFrame back to the table in the lakehouse.
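A minimal sketch of that read, dedupe, and write-back pattern, assuming the List3 sample table from the repro and the id and name columns used above (in a Fabric notebook, spark is the predefined SparkSession):

# Read the existing lakehouse table into a DataFrame.
df = spark.read.table("List3")

# Keep one row per (id, name) combination; omit the column list
# to deduplicate on entire rows instead.
deduped = df.dropDuplicates(["id", "name"])

# Overwrite the lakehouse (Delta) table with the deduplicated rows.
deduped.write.mode("overwrite").saveAsTable("List3")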
Hope this helps. Please let me know if you have any further queries.
Thanks, I will try it out and reply with my findings.
Hi @scabral
We haven’t heard back from you on the last response and just wanted to check whether your query got resolved. If not, please reply with more details and we will try to help.
Thanks
Thanks,
Here is the code that I created using PySpark to remove duplicates from my table in the lakehouse after loading:
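Something along these lines (a sketch of the approach, assuming the AuditLog table with the Id and CreationTime columns from the original query, run in a Fabric notebook where spark is the predefined SparkSession):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the lakehouse table that the pipeline loaded.
df = spark.read.table("AuditLog")

# Rank rows within each Id, earliest CreationTime first,
# mirroring the ROW_NUMBER() logic from the original T-SQL attempt.
w = Window.partitionBy("Id").orderBy(F.col("CreationTime"))
ranked = df.withColumn("rn", F.row_number().over(w))

# Keep only the first row per Id and drop the helper column.
deduped = ranked.filter(F.col("rn") == 1).drop("rn")

# Overwrite the lakehouse (Delta) table with the deduplicated rows.
deduped.write.mode("overwrite").saveAsTable("AuditLog")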
Hi @scabral
Glad that your query got resolved. Please continue using Fabric Community for any help regarding your queries.