Replicating TPC-DS Benchmark in Fabric

VictorMed — Thu, 13 Mar 2025 15:50:32 GMT

Hi All,

I'm trying to replicate the TPC-DS benchmark to compare the performance of Spark and DuckDB.

I'm following this repo to set up the dataset: BlueGranite/tpc-ds-dataset-generator (Generate big TPC-DS datasets with Databricks).

I've generated a new environment and a Configuration notebook replicating the steps from their configure.py.

After creating the .sh script, I can't find a way in Fabric to configure the cluster/pool to run it as an init script, as we did in Databricks.

I have 2 questions:

1. Has anybody used the TPC-DS dataset in Fabric notebooks successfully? If so, how?
2. How can I import spark.sql.perf into my environment to use it? I can't find it among the public libraries or the built-in ones.

Thank you.

Regards,

Victor

Re: Replicating TPC-DS Benchmark in Fabric

v-menakakota — Fri, 14 Mar 2025 10:39:52 GMT

Hi @VictorMed ,
Thank you for reaching out to us on the Microsoft Fabric Community Forum.

Since spark.sql.perf is not available in Fabric’s built-in libraries, you can try adding it manually in your Fabric notebooks. Once spark.sql.perf is present, you can generate the TPC-DS dataset.

If this post was helpful, please give us Kudos and consider marking Accept as solution to assist other members in finding it more easily.

Re: Replicating TPC-DS Benchmark in Fabric

VictorMed — Fri, 14 Mar 2025 15:03:37 GMT

Hi @v-menakakota ,

I've manually built the spark.sql.perf.jar file on my local machine and uploaded it as a custom library to the Fabric environment.

The current issue is that I can't import it into my notebook. I get the following error:

----> 1 import spark.sql.perf
ModuleNotFoundError: No module named 'spark'

Re: Replicating TPC-DS Benchmark in Fabric

v-menakakota — Tue, 18 Mar 2025 10:26:21 GMT

Hi @VictorMed ,

Thanks for the update! Since you've manually uploaded the spark.sql.perf JAR, here are a few steps to ensure it is correctly imported in your Fabric Notebook:

1. Try adding the JAR to your Spark session manually, and make sure the path matches where you uploaded the JAR in your OneLake Files or Lakehouse.
2. Since you’re getting ModuleNotFoundError: No module named 'spark', try importing the library by its full package name, com.databricks.spark.sql.perf, instead. If com.databricks.spark.sql.perf is not found, it likely means the JAR isn’t properly loaded into the Spark environment.
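One way to tell "the JAR is not loaded" apart from "the import is wrong" is to ask the driver JVM directly whether it can see a class from the library. The snippet below is only a sketch: it relies on PySpark's private `_jvm` gateway, the helper name `jvm_class_loaded` is made up for illustration, and the class name `com.databricks.spark.sql.perf.tpcds.TPCDSTables` is an assumption that should be checked against the JAR you actually built.

```python
def jvm_class_loaded(spark, class_name):
    """Return True if the Spark driver JVM can load class_name, else False.

    `spark` is the active SparkSession. `_jvm` is a private PySpark
    attribute (the py4j gateway), so this is a diagnostic, not a stable API.
    """
    try:
        thread = spark._jvm.java.lang.Thread.currentThread()
        loader = thread.getContextClassLoader()
        loader.loadClass(class_name)  # raises if the class is not on the classpath
        return True
    except Exception:
        return False


# In a notebook cell, with the session variable `spark` already defined:
# jvm_class_loaded(spark, "com.databricks.spark.sql.perf.tpcds.TPCDSTables")
```

If this returns False, the JAR is probably not attached to the session. If it returns True, the classes are present on the JVM side, and the Python `import` fails only because a JAR does not provide Python modules.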

spark.sql.perf is primarily a Scala-based benchmarking library, so a Python import statement cannot load it directly. Fabric notebooks default to PySpark, but a cell can be switched to Spark (Scala) with the %%spark magic, where the library's classes can be imported once the JAR is on the classpath.

If this post was helpful, please give us Kudos and mark it as Accepted Solution to assist other community members.
Thank you.

Re: Replicating TPC-DS Benchmark in Fabric

VictorMed — Wed, 19 Mar 2025 13:44:37 GMT

I finally generated the data using the DuckDB TPC-DS extension and saved it as Parquet files in my lakehouse.
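For anyone landing here later, the DuckDB route can be sketched roughly as follows. This is not the exact code used above, just a minimal illustration: the function names, the scale factor, and the output directory are all assumptions, and it presumes the `duckdb` Python package is installed and can download its `tpcds` extension.

```python
from pathlib import Path


def copy_to_parquet_sql(table, out_dir):
    """Build the DuckDB COPY statement that writes one table to a Parquet file."""
    target = Path(out_dir) / f"{table}.parquet"
    return f"COPY {table} TO '{target.as_posix()}' (FORMAT PARQUET)"


def export_tpcds(out_dir, scale_factor=1):
    """Generate TPC-DS tables in an in-memory DuckDB and export each to Parquet."""
    import duckdb  # imported lazily so the helper above works without DuckDB

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    con = duckdb.connect()  # in-memory database
    con.execute("INSTALL tpcds")
    con.execute("LOAD tpcds")
    con.execute(f"CALL dsdgen(sf = {scale_factor})")  # populate the TPC-DS tables
    tables = [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    for table in tables:
        con.execute(copy_to_parquet_sql(table, out_dir))
    con.close()
    return tables


# Hypothetical usage in a Fabric notebook with a default lakehouse attached
# (the mount path below is an assumption):
# export_tpcds("/lakehouse/default/Files/tpcds/sf1", scale_factor=1)
```

Generating in memory keeps the sketch short, but larger scale factors may need a file-backed DuckDB database instead.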