Hi,
I'm looking for a way to write a join in a PySpark notebook that accepts a "not equal" condition in the join clause. A simplified example:
SELECT *
FROM tableA
JOIN tableB ON (tableA.id = tableB.id) AND (tableA.diff_id <> tableB.diff_id)
I tried different approaches, both SQL syntax in Spark SQL and column expressions in the PySpark API; nothing worked 😞
I am aware of the "cross join and filter" approach, but it is very inefficient and my datasets are relatively big.
Any help will be appreciated!
Hi @Ostrzak,
The ON clause should hold the key fields used to join tableA and tableB, not filter conditions.
I'd suggest adding a WHERE condition after the join to filter the result, or pre-filtering the two tables before the join. (From your description, you want the records whose diff_id values are not equal, so you can find the records where they match and use them as a condition to exclude those rows from the 'main table'.)
Join and filter:
SELECT *
FROM tableA
JOIN tableB ON tableA.id = tableB.id
WHERE tableA.diff_id <> tableB.diff_id
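To see what the join-and-filter query returns, here is a minimal sketch on made-up data. It assumes nothing about the real tables: the rows are invented, and SQLite (Python's built-in sqlite3 module) stands in for Spark SQL only so that the snippet runs without a Spark session.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption:
# the same SELECT works unchanged when passed to spark.sql in a notebook).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, diff_id INTEGER);
    CREATE TABLE tableB (id INTEGER, diff_id INTEGER);
    -- Hypothetical sample rows, not taken from the original thread.
    INSERT INTO tableA VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO tableB VALUES (1, 10), (2, 99), (4, 30);
""")

# Equi-join on id, then keep only the pairs whose diff_id values differ.
query = """
    SELECT tableA.id, tableA.diff_id, tableB.diff_id
    FROM tableA
    JOIN tableB ON tableA.id = tableB.id
    WHERE tableA.diff_id <> tableB.diff_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(2, 20, 99)]
```

Only id 2 survives: id 1 matches on both columns, and ids 3 and 4 have no partner in the other table.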
Regards,
Xiaoxin Sheng
Hi Xiaoxin,
Thank you very much for your help. I found something similar yesterday, but your approach is much more elegant, especially since I need to change the JOIN and WHERE clauses dynamically. Now I can do it with f-strings 🙂
I'm accepting your post as a solution.
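For anyone who finds this later, building the JOIN and WHERE clauses with f-strings could look roughly like the sketch below. The helper name build_join_query and its parameters are hypothetical, and it assumes the same column names exist in both tables and that all names come from trusted code, since they are pasted straight into the SQL text.

```python
def build_join_query(left, right, equal_cols, not_equal_cols):
    """Build a join-and-filter query string (hypothetical helper).

    equal_cols go into the ON clause, not_equal_cols into the WHERE clause.
    Table and column names are interpolated as-is, so they must be trusted.
    """
    on_clause = " AND ".join(f"{left}.{c} = {right}.{c}" for c in equal_cols)
    query = f"SELECT * FROM {left} JOIN {right} ON {on_clause}"
    if not_equal_cols:
        where_clause = " AND ".join(
            f"{left}.{c} <> {right}.{c}" for c in not_equal_cols
        )
        query += f" WHERE {where_clause}"
    return query


query = build_join_query("tableA", "tableB", ["id"], ["diff_id"])
print(query)
# In a notebook with an active session, the string would then be run with:
# df = spark.sql(query)
```

With the arguments above, the generated string is the same join-and-filter query as in the accepted answer, written on a single line.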