dbeavon3
Memorable Member

Timeout failures when using semantic-link for spark sql

Our primary tenant is in West US and our capacity is in North Central.  When I run certain types of Spark SQL SELECT statements, they fail.  I haven't found a pattern, except that the Spark logs make it clear there is a timeout error related to West US.
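For reference, these statements go through the semantic link Spark native connector. Here is a minimal sketch of the kind of query I mean, assuming the "pbi" catalog and "_Metrics" virtual table described in the semantic link docs (linked further down); the model, column, and measure names are placeholders:

# Sketch only: querying a Power BI semantic model from Fabric Spark via the
# semantic link Spark native connector. "spark" is the ambient SparkSession
# in a Fabric notebook; the model/column/measure names are placeholders.
df = spark.sql("""
    SELECT `Fiscal Week[Fiscal Year Number]`, `USD Price MBF`
    FROM pbi.`Some Model`.`_Metrics`
""")
df.show()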

 

We had accidentally created our tenant in West US.  So whenever I see anything about West US in the error details, I know it is related to some perfunctory cross-region network bug in Power BI.  Our dedicated capacities are never actually hosted in West US.

 

I see the errors on both "starter pools" and "custom pools".
We have disabled the automatic tracking of machine learning experiments and models.

 

The error is like so:

 

ERROR PBIMeasurePartitionReader "libraryName":"SynapseML" "errorMessage":"java.net.SocketTimeoutException"

 

2025-02-11 18:16:28,647 INFO InMemoryCacheClient$ [scala-execution-context-global-60]: get token for ml from in memory cache is called
2025-02-11 18:16:28,648 INFO InMemoryCacheClient$ [scala-execution-context-global-60]: Token for ml successfully fetched from in-memory cache
2025-02-11 18:16:28,648 INFO TokenLibrary [scala-execution-context-global-60]: InMemory Cache hit! Returning cached token for ml. TimeTaken in ms: 1
2025-02-11 18:16:28,648 INFO TokenLibrary [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]:  ThreadId: 51 ThreadName: Executor task launch worker for task 0.0 in stage 5.0 (TID 5) getAccessToken internal for pbi is called
2025-02-11 18:16:28,649 INFO InMemoryCacheClient$ [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: get token for pbi from in memory cache is called
2025-02-11 18:16:28,650 INFO InMemoryCacheClient$ [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: Token for pbi successfully fetched from in-memory cache
2025-02-11 18:16:28,650 INFO TokenLibrary [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: InMemory Cache hit! Returning cached token for pbi. TimeTaken in ms: 2
2025-02-11 18:16:28,651 INFO SynapseMLLogging [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: sending {"groupBy":[{"column":"Code","table":"Random"},{"column":"Fiscal Year Number","table":"Fiscal Week"},{"column":"Fiscal Week Number","table":"Fiscal Week"},{"column":"Fiscal Week","table":"Fiscal Week"}],"metrics":[{"measure":"USD Price MBF","table":"Random Measures"},{"measure":"USD Price MSF","table":"Random Measures"}],"paginationSettings":{"continuationToken":""},"provider":{"datasetId":"02f9b36c-e947-416e-b082-27a83ffb8da2"}}
2025-02-11 18:18:10,632 WARN HandlingUtils [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: Encountered Socket Timeout: Read timed out
2025-02-11 18:18:10,634 ERROR PBIMeasurePartitionReader [Executor task launch worker for task 0.0 in stage 5.0 (TID 5)]: {"protocolVersion":"0.0.1","method":"query","libraryName":"SynapseML","errorMessage":"java.net.SocketTimeoutException","errorType":"java.net.SocketTimeoutException","className":"class com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader","libraryVersion":"1.0.8-spark3.5","modelUid":"PBIMeasurePartitionReader_0fe72b74a082"}
java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net/v1.0/myOrg/internalMetrics/query)
	at com.microsoft.azure.synapse.ml.powerbi.PBISchemas.post(PBISchemas.scala:100)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.$anonfun$executeQuery$1(PBIMeasurePartitionReader.scala:107)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
	at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.logVerb(PBIMeasurePartitionReader.scala:17)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.executeQuery(PBIMeasurePartitionReader.scala:105)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.<init>(PBIMeasurePartitionReader.scala:142)
	at com.microsoft.azure.synapse.ml.powerbi.measure.PBIReaderFactory.createReader(PBIMeasureScan.scala:26)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
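For reference, the "sending {...}" payload in the log above corresponds to a measure query against the model. The sketch below reproduces the same query through sempy's documented evaluate_measure API (the dataset ID, measures, and columns are taken from the payload; the mapping is my assumption). Since this goes through the sempy client rather than the SynapseML Spark reader, it is a useful cross-check for isolating which side fails:

import sempy.fabric as fabric

# Cross-check: the same measure query as the logged payload, issued through
# the sempy client instead of the Spark-side PBIMeasurePartitionReader.
result = fabric.evaluate_measure(
    dataset="02f9b36c-e947-416e-b082-27a83ffb8da2",  # datasetId from the payload
    measure=["USD Price MBF", "USD Price MSF"],
    groupby_columns=[
        "Random[Code]",
        "Fiscal Week[Fiscal Year Number]",
        "Fiscal Week[Fiscal Week Number]",
        "Fiscal Week[Fiscal Week]",
    ],
)
print(result.head())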

 

Here is a simple example of a query that works:

 

[screenshot: dbeavon3_0-1739298492245.png]

 

 

Here is an example of one that fails:

[screenshot: dbeavon3_2-1739298558411.png]



Interestingly, the DAX is formatted and executed in less than a second.  The following trace is successful, and it indicates the activity is triggered by the semantic-link notebooks.

[screenshot: dbeavon3_3-1739298686436.png]
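So the model itself answers the generated DAX almost instantly. As a cross-check from the notebook, the documented sempy evaluate_dax call returns quickly too (a sketch; the DAX body below is a placeholder for the query shown in the trace):

import sempy.fabric as fabric

# Cross-check: run a DAX query directly against the model. This path succeeds
# in under a second while the Spark reader times out. The DAX is a placeholder.
df = fabric.evaluate_dax(
    dataset="02f9b36c-e947-416e-b082-27a83ffb8da2",
    dax_string="""
        EVALUATE
        SUMMARIZECOLUMNS(
            'Fiscal Week'[Fiscal Year Number],
            "USD Price MBF", [USD Price MBF]
        )
    """,
)
print(df.head())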

 

 

... based on these observations, it seems like the Spark side of things is the only side failing.  It is failing to perform some type of perfunctory operation related to the corporate tenant, which is hosted in West US.


I'm fairly certain this is supposed to be a GA feature of Spark in Fabric:

https://learn.microsoft.com/en-us/fabric/data-science/semantic-link-power-bi?tabs=sql

 

 

I've opened a support ticket as well.  Those tickets with Mindtree typically take ~2 or 3 weeks, and it is EXTREMELY rare for us to get any sort of Microsoft FTE engagement.  So I'm hoping there is someone from the Microsoft Spark team who is involved in these forums.  Else I will take this to Reddit, since that seems to be the place where this PG hangs out the most.

2 REPLIES
Anonymous

Hi @dbeavon3,

Thank you for reaching out in Microsoft Community Forum.

Since your primary tenant is in West US but your Fabric capacity (where the Spark pool runs) is in North Central US, some requests may still be routed through the tenant region, leading to timeouts.

The java.net.SocketTimeoutException error from SynapseML occurs when running Spark notebooks in a scenario where your tenant and capacity are in different regions. This cross-region configuration can sometimes lead to network latency issues, especially with services like SynapseML.

Please follow the steps below to try to resolve the error:

1. Run ping or traceroute to check network delays, and make sure no firewall or proxy restrictions are affecting connections (see the connectivity sketch after this list).

2. Scale up Spark pool resources for better performance.

3. Identify any authentication or connection failures, and break large queries into smaller steps to improve performance.
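As a rough illustration of step 1, here is a minimal connectivity probe that can run from the notebook itself, assuming outbound HTTPS from the Spark pool is allowed; the hostname is taken from the SocketTimeoutException above:

import socket, time

# Hypothetical probe: measure TCP connect latency from the Spark driver to the
# Power BI endpoint named in the SocketTimeoutException.
host = "WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net"
start = time.monotonic()
try:
    with socket.create_connection((host, 443), timeout=10):
        print("TCP connect OK in %.0f ms" % ((time.monotonic() - start) * 1000))
except OSError as exc:
    print("Connection failed after %.0f ms: %s" % ((time.monotonic() - start) * 1000, exc))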

Please continue using the Microsoft Community Forum.

If you found this post helpful, please consider marking it as "Accept as Solution" and giving it a Kudos to help other members find it more easily.

Regards,
Pavan.

@Anonymous 
I now have a ticket open with Mindtree.

Please reduce the amount of copy/pasted answers from ChatGPT.  I would ask ChatGPT the question myself if I wanted that sort of answer.

 

This bug probably impacts every single multi-region customer in Fabric.  Despite the wide impact, I'm still not likely to motivate this PG to fix it very quickly.  I'm using Mindtree for support, and they seem to have a hard time escalating these bugs past the "SME".  I'm not sure why it is so difficult...
