Ostrzak
Helper II

Looking for ideas: non-equi join in pyspark notebook

Hi,

 

I'm looking for a way to write a join in a PySpark notebook that accepts "not equal" in the join clause. Simplified example:

 

SELECT *
FROM tableA
JOIN tableB ON (tableA.id = tableB.id) AND (tableA.diff_id <> tableB.diff_id)

 

I tried different approaches, both SQL syntax in Spark SQL and column expressions in the PySpark API; nothing worked 😞

I am aware of the "cross join and filter" approach, but it is very inefficient and my datasets are relatively big.
Any help will be appreciated!

1 ACCEPTED SOLUTION
v-shex-msft
Community Support

Hi @Ostrzak,

The ON clause is meant to hold the key fields that relate tableA and tableB, so I'd keep the filter condition out of it.
I'd suggest adding a WHERE condition after the join to do the filtering, or pre-filtering the two tables before the join. (From your description, you want the records where diff_id is not equal, so you could also find the matching records first and use them as a condition to exclude those records from the 'main table'.)
Join and filter:

SELECT *
FROM tableA
JOIN tableB ON tableA.id = tableB.id
WHERE tableA.diff_id <> tableB.diff_id

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accepting it as the solution to help other members find it more quickly.


2 REPLIES
Ostrzak
Helper II

Hi Xiaoxin,

 

Thank you very much for your help. I found something similar yesterday, but your approach is much more elegant, especially since I need to change the JOIN and WHERE clauses dynamically. Now I can do it with f-strings 🙂
I'm accepting your post as the solution.
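
As an illustration of the f-string idea mentioned above, here is a minimal sketch of building the JOIN and WHERE clauses from lists of column names. The helper name build_join_query and its parameters are hypothetical, not something from the thread, and the sketch assumes the table and column names come from trusted code rather than user input, since they are interpolated directly into the SQL text.

```python
def build_join_query(left, right, key_cols, not_equal_cols):
    """Build a SQL string that joins on key_cols and filters not_equal_cols in WHERE.

    Hypothetical helper: identifiers are interpolated as-is, so pass trusted names only.
    """
    if not key_cols:
        raise ValueError("at least one key column is required")

    # Equality conditions go into the ON clause.
    on_clause = " AND ".join(f"{left}.{c} = {right}.{c}" for c in key_cols)
    query = f"SELECT *\nFROM {left}\nJOIN {right} ON {on_clause}"

    # "Not equal" conditions go into the WHERE clause, if there are any.
    if not_equal_cols:
        where_clause = " AND ".join(f"{left}.{c} <> {right}.{c}" for c in not_equal_cols)
        query += f"\nWHERE {where_clause}"
    return query


query = build_join_query("tableA", "tableB", ["id"], ["diff_id"])
print(query)
# In a notebook with an active session, the string can then be run with spark.sql(query).
```

For the thread's example this prints the same four-line query as the accepted solution, with the "not equal" condition in WHERE.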

