Solved: Hash Function for Row Compare

adamlob · ‎10-22-2025

Hi,

I'm working on a Data Pipeline that loads data into a Dataverse table. I do a row compare to detect changes between loads, so I am only loading rows that have changed.

Is there anyway to hash the concat of rows? At the moment it seems I can only do plain-text and then convert it to Binary. Hashing would help save on space.

AntoineW · ‎10-23-2025

Hi @adamlob,

In Dataflow Gen2 / Fabric Data Pipeline (Data Factory), there’s no native “Hash” transformation yet.

You can use Notebooks to do that :

from pyspark.sql.functions import sha2, concat_ws

df_hashed = df.withColumn(
"row_hash",
sha2(concat_ws("|", *df.columns), 256)
)

Then save it back to your Lakehouse table and use that hash for change-detection.

✅ Benefits:

Very fast and scalable,
Produces fixed-length SHA-256 strings (~64 chars),
Easy to use as a comparison key.

Doc :

- https://spark.apache.org/docs/latest/api/sql/index.html#sha2

Hope it can help you !

Best regards,

Antoine

View solution in original post

AntoineW · ‎10-23-2025