<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Pyspark code running twice causing a LIVY status = DEAD in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</link>
    <description>&lt;P&gt;Hello, I am running the following cell (all packages have been imported and parameters set in previous cells, and the code runs)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = spark.sql(f"select * from LH.STOCK_Price where YEAR = {parameter_year} and MONTH &amp;lt;= {parameter_month}")

# Filter the DataFrame to include only rows where the year in TIMESTAMP is parameter
#df = df.filter((col("YEAR") == parameter_year) &amp;amp; (col("MONTH") &amp;lt;= parameter_month))

#df.cache()

# Ensure the data is partitioned appropriately
df = df.repartition("TYPE", "Vendor", col("TIMESTAMP").cast("date"))

# Define an aggregate expression to count distinct IDs
distinct_count = F.expr("count(distinct ID)").alias("ID_DISTINCT_COUNT")

# Group by TYPE, Vendor, and TIMESTAMP (cast to date)
result = df.groupBy(
    col("TYPE"),
    col("Vendor"),
    col("TIMESTAMP").cast("date").alias("DATE")
).agg(
    count(col("ID")).alias("ID_COUNT"),
    distinct_count
)

result.cache()

# Write the result to a Delta table with partitioning
result.write.format("delta").mode("append").partitionBy("DATE").saveAsTable("LH.Count_Date_Wise_STOCKS")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The aim of my code is to read the&amp;nbsp;STOCK_Price delta table from a lakehouse for a specific year and month into a dataframe df, then group by several columns and compute the count of ID and its distinct count. I cache the result and then write it to an aggregate delta table (Count_Date_Wise_STOCKS), but part of my code seems to be running twice, as I can see from the Spark diagnostic below the cell:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722533244553.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143254i17EC72F344B83AF6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722533244553.png" alt="mkj1213_0-1722533244553.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The second run ends up producing a session error with the error code LIVY state = DEAD.&lt;BR /&gt;&lt;BR /&gt;Some statistics about my tables:&lt;BR /&gt;STOCK_Price: around 700 billion rows&lt;BR /&gt;df: about 25 billion rows&lt;BR /&gt;Count_Date_Wise_STOCKS: around 10,000 rows only.&lt;BR /&gt;Executing the cell sometimes takes up to 10 hours for 6 months of data and returns a&amp;nbsp;Count_Date_Wise_STOCKS of around 4,000 rows.&lt;BR /&gt;&lt;BR /&gt;Questions: which part of my code is causing the second run, and how can I avoid it if possible?&lt;BR /&gt;Regards&lt;/P&gt;</description>
    <pubDate>Thu, 01 Aug 2024 17:31:02 GMT</pubDate>
    <dc:creator>mkj1213</dc:creator>
    <dc:date>2024-08-01T17:31:02Z</dc:date>
    <item>
      <title>Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</link>
      <description>&lt;P&gt;Hello, I am running the following cell (all packages have been imported and parameters set in previous cells, and the code runs)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = spark.sql(f"select * from LH.STOCK_Price where YEAR = {parameter_year} and MONTH &amp;lt;= {parameter_month}")

# Filter the DataFrame to include only rows where the year in TIMESTAMP is parameter
#df = df.filter((col("YEAR") == parameter_year) &amp;amp; (col("MONTH") &amp;lt;= parameter_month))

#df.cache()

# Ensure the data is partitioned appropriately
df = df.repartition("TYPE", "Vendor", col("TIMESTAMP").cast("date"))

# Define an aggregate expression to count distinct IDs
distinct_count = F.expr("count(distinct ID)").alias("ID_DISTINCT_COUNT")

# Group by TYPE, Vendor, and TIMESTAMP (cast to date)
result = df.groupBy(
    col("TYPE"),
    col("Vendor"),
    col("TIMESTAMP").cast("date").alias("DATE")
).agg(
    count(col("ID")).alias("ID_COUNT"),
    distinct_count
)

result.cache()

# Write the result to a Delta table with partitioning
result.write.format("delta").mode("append").partitionBy("DATE").saveAsTable("LH.Count_Date_Wise_STOCKS")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The aim of my code is to read the&amp;nbsp;STOCK_Price delta table from a lakehouse for a specific year and month into a dataframe df, then group by several columns and compute the count of ID and its distinct count. I cache the result and then write it to an aggregate delta table (Count_Date_Wise_STOCKS), but part of my code seems to be running twice, as I can see from the Spark diagnostic below the cell:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722533244553.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143254i17EC72F344B83AF6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722533244553.png" alt="mkj1213_0-1722533244553.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The second run ends up producing a session error with the error code LIVY state = DEAD.&lt;BR /&gt;&lt;BR /&gt;Some statistics about my tables:&lt;BR /&gt;STOCK_Price: around 700 billion rows&lt;BR /&gt;df: about 25 billion rows&lt;BR /&gt;Count_Date_Wise_STOCKS: around 10,000 rows only.&lt;BR /&gt;Executing the cell sometimes takes up to 10 hours for 6 months of data and returns a&amp;nbsp;Count_Date_Wise_STOCKS of around 4,000 rows.&lt;BR /&gt;&lt;BR /&gt;Questions: which part of my code is causing the second run, and how can I avoid it if possible?&lt;BR /&gt;Regards&lt;/P&gt;</description>
      <pubDate>Thu, 01 Aug 2024 17:31:02 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4074124#M3415</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-01T17:31:02Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075179#M3421</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The number of Spark jobs doesn't mean that the code ran that many times. I ran some tests based on your code (with some modifications to fit my data). In the following images, you will see that it produced 7 Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_1-1722576963149.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143581iD9ACD3076DF32847/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_1-1722576963149.png" alt="vjingzhanmsft_1-1722576963149.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_2-1722576989183.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143582iDB6E8DB5758DA3A7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_2-1722576989183.png" alt="vjingzhanmsft_2-1722576989183.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;When I query the result delta table, it doesn't show any duplicated rows. This means the code ran only once, even though it shows 7 Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_3-1722577234596.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143596i402D1D2E84CE6F91/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_3-1722577234596.png" alt="vjingzhanmsft_3-1722577234596.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;According to my research, this "&lt;SPAN&gt;LIVY status = DEAD&lt;/SPAN&gt;"&amp;nbsp;error is usually the result of a lack of resources or of a resource going over its limit. I found an Azure Synapse Analytics blog post related to this error, linked below. Following the solution in it, you can try to &lt;STRONG&gt;increase the node size of the Spark pool &lt;/STRONG&gt;used to run the notebook.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/livy-is-dead-and-some-logs-to-help/ba-p/1573227" target="_blank" rel="noopener"&gt;Livy is dead and some logs to help. - Microsoft Community Hub&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is some Fabric documentation about configuring Spark pools:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/workspace-admin-settings" target="_blank" rel="noopener"&gt;Workspace administration settings in Microsoft Fabric - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/capacity-settings-management" target="_blank" rel="noopener"&gt;Manage settings for data engineering and science capacity - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_4-1722578246456.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143609i0A929559708FC9DB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_4-1722578246456.png" alt="vjingzhanmsft_4-1722578246456.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this will be helpful!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Jing&lt;BR /&gt;If this post helps, please Accept it as Solution to help other members find it. Appreciate your Kudos!&lt;/P&gt;</description>
      <pubDate>Fri, 02 Aug 2024 06:12:40 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075179#M3421</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-02T06:12:40Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075243#M3422</link>
      <description>&lt;P&gt;Thanks for giving my problem some time on your end.&lt;BR /&gt;I will look into the links you provided in more detail and get back to you. The only difference I noticed between the structure of your code and mine was the agg part: in my code I did two aggregations (count and distinct count), while in yours you did only one. Is it possible to add another aggregation method (maybe a count) and share the Spark jobs screenshot (like the second image you included in your comment)?&lt;BR /&gt;&lt;BR /&gt;On the Spark pool settings, it turned out that I was using the Large size (maybe I could increase it to X-Large or XX-Large):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="mkj1213_0-1722584013232.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1143688iA31D306045F7BE6B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="mkj1213_0-1722584013232.png" alt="mkj1213_0-1722584013232.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Regards&lt;/P&gt;</description>
      <pubDate>Fri, 02 Aug 2024 07:33:41 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4075243#M3422</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-02T07:33:41Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4088637#M3488</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;This is my testing result with two&amp;nbsp;&lt;SPAN&gt;aggregation methods. It produced 6 Spark jobs the first time.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_0-1723102839948.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1147092iBD6644A3951673D2/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_0-1723102839948.png" alt="vjingzhanmsft_0-1723102839948.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;The second time:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vjingzhanmsft_1-1723103134276.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1147095i5E401805E8BF8DF8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vjingzhanmsft_1-1723103134276.png" alt="vjingzhanmsft_1-1723103134276.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 07:46:07 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4088637#M3488</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-08T07:46:07Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4091069#M3506</link>
      <description>&lt;P&gt;Thanks for the reply. I am wondering: how large is the data in the bing_covid_19_data table?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 07:45:46 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4091069#M3506</guid>
      <dc:creator>mkj1213</dc:creator>
      <dc:date>2024-08-09T07:45:46Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4097136#M3578</link>
      <description>&lt;P&gt;This is small sample data. The underlying parquet file is around 51 MB and the table has around 4.7 million rows.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you run the code cell with less data from the same tables, does the error still occur? If not, you might consider splitting the data and running the aggregation separately for different time periods.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In addition, did you try increasing the node size or the number of nodes? Does that give a better result?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;What's more, you may try using&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitoring-overview" target="_blank" rel="noopener"&gt;Fabric Spark monitoring&lt;/A&gt; to find more details about the notebook's Spark jobs.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-item-recent-runs#all-runs-within-a-notebook" target="_blank" rel="noopener"&gt;Browse item's recent runs - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitor-debug" target="_blank" rel="noopener"&gt;Monitor Spark jobs within a notebook - Microsoft Fabric | Microsoft Learn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Best Regards,&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;Jing&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2024 08:22:33 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4097136#M3578</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-13T08:22:33Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark code running twice causing a LIVY status = DEAD</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4110749#M3705</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/777184"&gt;@mkj1213&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Have you resolved this issue? If any of the answers provided were helpful, please consider accepting them as a solution. If you have found other solutions, we would greatly appreciate it if you could share them with us. Thank you!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Jing&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2024 08:38:12 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Pyspark-code-running-twice-causing-a-LIVY-status-DEAD/m-p/4110749#M3705</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-08-21T08:38:12Z</dc:date>
    </item>
  </channel>
</rss>