Reply
DebbieE
Community Champion

Fabric PySpark Help. Adding a min into a groupBy agg in Notebooks

It would be really useful if we had a PySpark forum.

 

I'm SQL through and through, and learning PySpark is a NIGHTMARE.

I have the following code that finds all the contestants with more than one record in a list that should be unique:

from pyspark.sql.functions import col, count

dfcont.groupBy('contestant') \
    .agg(count('contestant').alias('TotalRecords')) \
    .filter(col('TotalRecords') > 1) \
    .show(1000)
 
There are 4 contestants that have more than one value in the list
 
I also want to bring through the min contestantKey into the above result set.
 
In SQL it would be:
SELECT Min(CustomerID), Customer, Count(*)
GROUP BY Customer
HAVING Count(*) > 1
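For intuition, the intent of that query, the minimum key per group restricted to groups with more than one row, can be sketched in plain Python. The rows, names, and keys below are made-up sample data, not from the thread:

```python
# Sketch of "min key per group, only for groups appearing more than once",
# using hypothetical (key, name) rows.
from collections import defaultdict

rows = [
    (101, "Alice"), (102, "Alice"),               # duplicate group
    (103, "Bob"),                                 # unique, excluded
    (104, "Cara"), (105, "Cara"), (106, "Cara"),  # duplicate group
]

# Collect all keys for each name (the GROUP BY step).
groups = defaultdict(list)
for key, name in rows:
    groups[name].append(key)

# Keep only groups with more than one row (HAVING COUNT(*) > 1),
# returning (min key, record count) per group.
duplicates = {
    name: (min(keys), len(keys))
    for name, keys in groups.items()
    if len(keys) > 1
}
print(duplicates)  # Alice and Cara qualify; Bob does not
```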
 
I find it unbelievably frustrating that I can just type that in a couple of seconds, and yet I have been struggling to do it in PySpark for... well, too long.
 
Any help would be appreciated.  
1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @DebbieE ,

Thanks for using Fabric Community.
As I understand it:


 



Spark SQL Code:


 

df = spark.sql("""
    SELECT MIN(CustomerID), CompanyName, COUNT(*)
    FROM gopi_lake_house.customer_table1
    GROUP BY CompanyName
    HAVING COUNT(*) > 1
""")
display(df)

 


PySpark Code:


 



Can you please try the below code:

 

 

from pyspark.sql.functions import col, count, min

# Build the filtered DataFrame first; .show() returns None,
# so it should not be assigned to a variable.
result = dfcont.groupBy('CompanyName') \
    .agg(min('CustomerID').alias('minCustomerID'),
         count('CustomerID').alias('TotalRecords')) \
    .filter(col('TotalRecords') > 1)

result.show(1000)

 



Hope this is helpful. Please let me know in case of further queries.

 


3 REPLIES
DebbieE
Community Champion

Yey. It worked. Thank you so much. The exercise is to try to do everything I usually do in PySpark, so having these extra examples is gold too.

Anonymous
Not applicable

Hi @DebbieE ,

Glad to know that your query got resolved. Please continue using Fabric Community for your further queries.

