
Reply
FelixL
Advocate I

Delayed data refresh in SQL analytics endpoint

I am currently moving from Synapse Serverless SQL to the Fabric SQL analytics endpoint for reading my external Delta Lake tables (managed and updated in Azure Synapse, and linked to a Fabric Lakehouse using ADLS Gen2 shortcuts). However, I am facing issues with the data not being refreshed in a timely manner in Fabric, whereas it is always up to date the second the underlying Delta Lake table is updated when reading through Synapse Serverless.

 

When I run my Azure Synapse pipelines and refresh my Delta Lake tables, I also trigger Power BI dataset refreshes at the end of every pipeline, to load any updated data into Power BI. This works flawlessly with Synapse Serverless as the source, but fails about 50% of the time with the Fabric SQL analytics endpoint. The data read from Fabric is sometimes the newest data and sometimes old (yesterday's) data, even though the dataset refresh is triggered by the same pipeline that updated the data, so I KNOW the data is updated at the time of refreshing the dataset.

 

Adding a "wait" activity in my pipelines seem to (mostly) solve the problem. Around 10-15 minutes wait seem to do the trick. But this is never needed when using Synapse Serverless.

 

Does the Fabric SQL analytics endpoint work differently from Synapse Serverless in some fundamental way? Does Fabric keep some kind of cache of external Delta Lake table data that could explain why I sometimes see up to a 15-minute lag before the SQL analytics endpoint is refreshed? I was hoping the endpoint would simply read the Delta Lake tables as they are, on the lake, but something else seems to be happening here, and it really is a worse experience than using Azure Synapse Serverless SQL for this purpose.

 

Anyone else seen similar things? 

 

Thanks,
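(A possible alternative to the fixed "wait" activity described above is to poll the endpoint for a freshness watermark before triggering the dataset refresh. A minimal sketch, with the probe injected as a callable - in a real pipeline it would run something like `SELECT MAX(updated_at)` against the SQL analytics endpoint via pyodbc; the stub below just simulates a probe that becomes fresh on the third call:)

```python
import time

def wait_until_fresh(probe, is_fresh, timeout_s=900, interval_s=15):
    """Poll `probe` (e.g. a SELECT MAX(updated_at) against the SQL
    analytics endpoint) until `is_fresh` accepts the returned value,
    instead of using a fixed-length Wait activity. Returns the fresh
    value, or raises TimeoutError when the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = probe()
        if is_fresh(value):
            return value
        time.sleep(interval_s)
    raise TimeoutError("SQL endpoint did not reflect the new data in time")

# Stubbed demo: the probe returns stale data twice, then fresh data.
calls = iter(["old", "old", "new"])
result = wait_until_fresh(lambda: next(calls), lambda v: v == "new",
                          timeout_s=5, interval_s=0)
print(result)  # -> new
```

This way the pipeline waits only as long as the endpoint actually lags, rather than a worst-case 10-15 minutes on every run.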

1 ACCEPTED SOLUTION (see the marked reply below)

9 REPLIES
v-cboorla-msft
Community Support

Hi @FelixL 

 

Glad that your query got resolved.

Please continue using Fabric Community for any help regarding your queries.

 

Thanks

AndyDDC
Solution Sage

Hi @FelixL, can you check a setting on the Power BI dataset (semantic model) and see if it makes any difference? In the following doc, under the Refresh heading, there is a setting called "Keep your Direct Lake data up to date":

 

https://learn.microsoft.com/en-us/power-bi/enterprise/directlake-overview

Hi, I am aware of this setting and I enable it whenever possible. But sometimes, due to data type errors or other refresh issues, this setting gets automatically disabled.

 

And in this case I don't really care about the default Direct Lake dataset - I only want to query my Delta Lake tables with T-SQL via the SQL endpoint (something that always works flawlessly, with no refresh delay, in Synapse).

 

My understanding is that the SQL endpoint is not dependent on the default dataset. Or is it? I should be able to deselect all of the SQL endpoint's tables from the default dataset (ending up with an empty dataset) and still be able to query the SQL endpoint.

 

Are you saying that I'm wrong, and there is a circular reference going on here? The metadata in the SQL endpoint needs to be updated in order to select which tables to include in the default dataset, but the default dataset needs to be up to date (reframed) in order to present data in the SQL endpoint? 😕 That doesn't make much sense, and I hope it is not the case.

 

I'm trying to understand if there is some "reframing" activity going on under the hood even for the SQL endpoint, causing the delay I am experiencing.

The SQL Endpoint isn't dependent on the default dataset (though the other way around is true). OK, so just to confirm: if you query the delta data via the shortcut table, using SQL via the SQL Endpoint, can you see the updated data? I've just tested this: I have a delta table being written to (overwrite mode) using a Synapse notebook, and when I query it via a shortcut in the Fabric Lakehouse SQL Endpoint I can see the updated data as soon as it is written to the external data lake folder.

 

Update: I've also tested with a report in Power BI on the default dataset, and I can see the data as soon as it lands in the external data lake.

I also load all my data using Synapse notebooks, in overwrite mode. When I query the table through the SQL endpoint in Fabric, the data is updated within seconds of the underlying delta table being updated most of the time - but sometimes it takes up to a few minutes for the SQL endpoint to reflect the change.

 

This (somewhat short) delay is what messes with my API-triggered dataset refreshes. I would assume that once my activity to load the delta lake table is done, it is safe to run the dataset refresh, but this is not always the case. Whenever data has failed to appear in the dataset, I go straight back to the SQL endpoint and query the table, at which point I can always see the new/today's data there.

 

I have added a "5 minute wait" activity in my pipeline, and it now works "most of the time", but still sometimes fail to fetch new data, leading me to believe that 5 minutes is not enough in all cases. I will extend this wait activity to 10-15 minutes, to see if this will give the SQL endpoint sufficient time to do "whatever needs to be done" to refresh the shortcuts/links.

 

But as far as you know, the SQL endpoint should always be up to date with the underlying data? There are no "reframing" activities going on behind the scenes, where it holds a cache of which parquet files make up the latest version of the Delta Lake table?
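(For reference, the API-triggered dataset refresh mentioned above can be issued directly against the Power BI REST API. A minimal sketch of building the call - the workspace and dataset GUIDs are placeholders, and AAD token acquisition is out of scope here:)

```python
def refresh_url(workspace_id: str, dataset_id: str) -> str:
    """Build the Power BI REST API URL for triggering a dataset refresh
    (Refresh Dataset In Group); the actual call is a POST carrying an
    AAD bearer token."""
    return (f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}"
            f"/datasets/{dataset_id}/refreshes")

url = refresh_url("<workspace-guid>", "<dataset-guid>")
print(url)
# e.g. requests.post(url,
#                    headers={"Authorization": f"Bearer {token}"},
#                    json={"notifyOption": "NoNotification"})
```

Gating this POST on a freshness check against the SQL endpoint, rather than on a fixed wait, avoids refreshing the dataset before the endpoint has caught up.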

There certainly is reframing with Power BI datasets, in terms of updating metadata to understand what the latest delta data is. But for the SQL Endpoint I am not aware of any reframing process; it should read the latest transaction log and display the latest data. It's also the fact that it works "some of the time" in your scenario that's troubling me.

 

What's the volume of data?
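(Background on "read the latest transaction log": Delta Lake records each committed version as a zero-padded JSON file under the table's `_delta_log` directory, so the latest committed version can be determined independently of any engine just by listing that directory. A small illustration with hypothetical filenames:)

```python
import re

def latest_delta_version(log_files):
    """Given filenames from a Delta table's _delta_log directory, return
    the highest committed version. Commits are 20-digit zero-padded JSON
    files such as 00000000000000000042.json; checkpoint parquet files
    and _last_checkpoint are ignored."""
    versions = [int(m.group(1))
                for f in log_files
                if (m := re.fullmatch(r"(\d{20})\.json", f))]
    if not versions:
        raise ValueError("no Delta commit files found")
    return max(versions)

files = ["00000000000000000000.json",
         "00000000000000000001.json",
         "00000000000000000002.json",
         "00000000000000000010.checkpoint.parquet",
         "_last_checkpoint"]
print(latest_delta_version(files))  # -> 2
```

Comparing this log-derived version with what the endpoint returns is one way to confirm whether the lag is in metadata discovery rather than in the lake itself.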

Yes, this is troubling for me too.

 

The most affected tables are rather small: around ~100 MB in total parquet size and ~2 million records.

 

I can easily reproduce the SQL endpoint delay by starting a notebook and performing a Spark SQL update against a delta lake table (update xxx set yyy = current_timestamp()), while in parallel checking the SQL endpoint with a query (select max(yyy) from xxx).

 

From the moment my Spark update query returns "success", it takes anywhere from a few seconds up to a few minutes until I see the data change in the SQL endpoint (the old timestamp is returned for a number of T-SQL queries, until the new timestamp is finally returned).

 

I would expect never to see the old timestamp if I start querying through the SQL endpoint after the Spark SQL query has finished - but this is not the case.
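(The repro above - a Spark update in one session and a polled `select max(yyy)` through the SQL endpoint in another - can be scripted to measure the lag. A sketch with the write and read sides injected as callables, stubbed here since the real versions would use a Spark session and a pyodbc connection to the endpoint:)

```python
import time

def measure_propagation(write_marker, read_marker, timeout_s=600, interval_s=2):
    """Write a marker value (e.g. a Spark SQL UPDATE setting a timestamp
    column), then poll the reader (e.g. SELECT MAX(...) via the SQL
    endpoint) until it returns that marker. Returns the observed lag
    in seconds, or raises TimeoutError."""
    marker = write_marker()
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if read_marker() == marker:
            return time.monotonic() - start
        time.sleep(interval_s)
    raise TimeoutError("marker never became visible through the reader")

# Stubbed demo: the reader catches up on the third poll.
seen = iter([None, None, "m1"])
lag = measure_propagation(lambda: "m1", lambda: next(seen),
                          timeout_s=5, interval_s=0)
print(lag >= 0)  # -> True
```

Logging these lag samples over a day would show whether the delay is bounded (a few minutes) or occasionally longer, which is useful evidence when raising a support case.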

So my info from Microsoft is that it should take a few seconds to a minute for the SQL Endpoint to discover the updated delta logs. They did have an issue, which has apparently been fixed.

 

If your issue persists, it would be worth logging it with MS support, giving them your workspace ID, lakehouse ID, and the time it was happening.

Thank you for looking into this.

 

This few-minutes delay would explain what I am seeing, and why my API-triggered Power BI dataset refreshes fail to fetch new data. I have had my Power BI refreshes running with a 10-minute delay for a few days now, and everything has been working so far. A 5-minute delay does not seem to be sufficient even now - I "sometimes" miss data using only 5 minutes.

 

It's a real shame that there is "any" delay here though, since I can't be 100% sure "when" my data is available for pushing to Power BI via the endpoint. It's especially bad since this did not appear to be the case with Synapse Serverless.

 

I have some Spark jobs loading data into some of my tables every 30 minutes (and running for ~30 minutes), after which an incremental dataset load is triggered by the same pipeline that runs the notebooks. Having to add a 10-minute wait activity to my pipeline here will have a ~30% negative effect on the loading times of my jobs (from job start to when business users see data in Power BI).

 

I guess the long-term solution here would be to use Direct Lake semantic models in Power BI instead of the SQL endpoint. That functionality is, however, still missing some critical features for me to be able to use it right now (or I would have to heavily remodel my underlying Gold layer to handle the shortcomings of Direct Lake).

 

For now, though, I am satisfied knowing why this is happening. Thanks.
