<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Looking for ideas: non-equi join in pyspark notebook in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4055294#M3203</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/634523"&gt;@Ostrzak&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;The ON clause should contain the key fields used to join TableA and TableB; I'd suggest keeping the filter condition out of it.&lt;BR /&gt;Instead, add a WHERE condition after the join, or pre-filter the two tables before joining. (From your description, you want the records whose diff_id values differ, so you could also find the matching records first and use them to exclude rows from the main table.)&lt;BR /&gt;Join and filter:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;SELECT *
FROM tableA
JOIN tableB ON tableA.id = tableB.id
WHERE tableA.diff_id &amp;lt;&amp;gt; tableB.diff_id&lt;/LI-CODE&gt;
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Xiaoxin Sheng&lt;/P&gt;</description>
    <pubDate>Tue, 23 Jul 2024 04:08:52 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2024-07-23T04:08:52Z</dc:date>
    <item>
      <title>Looking for ideas: non-equi join in pyspark notebook</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4054025#M3195</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm looking for a way to write a join in a PySpark notebook that accepts "not equal" in the join clause. A simplified example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;SELECT *&lt;/P&gt;&lt;P&gt;FROM tableA&lt;/P&gt;&lt;P&gt;JOIN tableB ON (tableA.id = tableB.id) AND&amp;nbsp;(tableA.diff_id &amp;lt;&amp;gt; tableB.diff_id)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried different approaches, both SQL syntax in Spark SQL and column expressions in the PySpark API; nothing worked &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I am aware of the "cross join and filter" approach, but it is very inefficient and my datasets are relatively big.&lt;BR /&gt;Any help will be appreciated!&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 13:20:50 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4054025#M3195</guid>
      <dc:creator>Ostrzak</dc:creator>
      <dc:date>2024-07-22T13:20:50Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for ideas: non-equi join in pyspark notebook</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4055294#M3203</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/634523"&gt;@Ostrzak&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;The ON clause should contain the key fields used to join TableA and TableB; I'd suggest keeping the filter condition out of it.&lt;BR /&gt;Instead, add a WHERE condition after the join, or pre-filter the two tables before joining. (From your description, you want the records whose diff_id values differ, so you could also find the matching records first and use them to exclude rows from the main table.)&lt;BR /&gt;Join and filter:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;SELECT *
FROM tableA
JOIN tableB ON tableA.id = tableB.id
WHERE tableA.diff_id &amp;lt;&amp;gt; tableB.diff_id&lt;/LI-CODE&gt;
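&lt;P&gt;If the table or column names need to vary, the same join-and-filter query could be assembled in Python. This is only a minimal sketch: the helper name build_join_filter_query and its parameters are made up for illustration, and sqlite3 is used purely as a stand-in engine so the example runs without a Spark session. It writes the not-equal test as != rather than the operator used in the query above.&lt;/P&gt;

```python
import sqlite3


def build_join_filter_query(left, right, key, diff_col):
    """Build a join-then-filter query string (hypothetical helper)."""
    # Identifiers are interpolated directly into the SQL text, so they
    # must come from trusted code, never from user input.
    return (
        f"SELECT {left}.{key} AS {key}, "
        f"{left}.{diff_col} AS left_diff, {right}.{diff_col} AS right_diff "
        f"FROM {left} "
        f"JOIN {right} ON {left}.{key} = {right}.{key} "
        f"WHERE {left}.{diff_col} != {right}.{diff_col}"
    )


query = build_join_filter_query("tableA", "tableB", "id", "diff_id")

# sqlite3 is an assumption made only so the sketch is runnable anywhere.
# In a notebook, with tableA and tableB registered as tables or temp
# views, the same string could be passed to spark.sql(query) instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (id INTEGER, diff_id INTEGER)")
conn.execute("CREATE TABLE tableB (id INTEGER, diff_id INTEGER)")
conn.executemany("INSERT INTO tableA VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany(
    "INSERT INTO tableB VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 21), (3, 30)],
)
# Keep only pairs that share an id but differ in diff_id.
rows = sorted(conn.execute(query).fetchall())
conn.close()
```

&lt;P&gt;With this sample data, the pair with id 1 and equal diff_id values is dropped, and the two pairs whose diff_id values differ are kept.&lt;/P&gt;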
&lt;P&gt;Regards,&lt;/P&gt;
&lt;P&gt;Xiaoxin Sheng&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 04:08:52 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4055294#M3203</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-07-23T04:08:52Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for ideas: non-equi join in pyspark notebook</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4056208#M3216</link>
      <description>&lt;P&gt;Hi Xiaoxin,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you very much for your help. I found something similar yesterday, but your approach is much more elegant, especially since I need to change the JOIN and WHERE clauses dynamically. Now I can do it with f-strings &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;I'm accepting your post as a solution.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 10:56:34 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Looking-for-ideas-non-equi-join-in-pyspark-notebook/m-p/4056208#M3216</guid>
      <dc:creator>Ostrzak</dc:creator>
      <dc:date>2024-07-23T10:56:34Z</dc:date>
    </item>
  </channel>
</rss>

