I am currently working on moving from Synapse Serverless SQL to the Fabric SQL Analytics Endpoint for reading my external Delta Lake tables (managed and updated in Azure Synapse, and linked to a Fabric Lakehouse using ADLS Gen2 shortcuts). However, I am facing issues with the data not being refreshed in a timely manner in Fabric, whereas it is always up to date the second the underlying Delta Lake table is updated when reading through Synapse Serverless.
When I run my Azure Synapse pipelines and refresh my Delta Lake tables, I also trigger Power BI dataset refreshes at the end of every pipeline to load any updated data into Power BI. This works flawlessly when using Synapse Serverless as the source, but fails about 50% of the time when using the Fabric SQL Analytics Endpoint. The data being read from Fabric is sometimes the newest data, and sometimes old (yesterday's) data - even though the dataset refresh is triggered by the same pipeline that has just updated the data, so I KNOW the data is updated at the time of refreshing the dataset.
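A refresh triggered at the end of a pipeline like this typically comes down to a call to the Power BI REST API refresh endpoint. As a minimal sketch (the workspace ID, dataset ID, and access token below are placeholders, not values from this thread):

```python
# Minimal sketch: queue a Power BI dataset refresh via the REST API.
# Workspace/dataset IDs and the AAD access token are placeholders.
import requests

WORKSPACE_ID = "<workspace-guid>"    # placeholder
DATASET_ID = "<dataset-guid>"        # placeholder
ACCESS_TOKEN = "<aad-bearer-token>"  # acquired separately (user or service principal)

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={"notifyOption": "NoNotification"},
)
resp.raise_for_status()  # HTTP 202 means the refresh was queued, not completed
```

Note that this call only queues the refresh; the dataset then reads whatever the SQL endpoint exposes at the moment the refresh actually runs.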
Adding a "wait" activity in my pipelines seem to (mostly) solve the problem. Around 10-15 minutes wait seem to do the trick. But this is never needed when using Synapse Serverless.
Does Fabric SQL Analytical Endpoint work differently than Synapse Serverless in some fundamental way? Does Fabric keep some kind of cache for external Delta Lake table data, that could explain why I sometimes see an up to 15 minute lagg before the SQL Analytical Endpoint is refreshed? I was hoping the endpoint would simply read the Delta Lake tables as they are, on the lake, but something else seems to be happening here, and it really is a worse experience than using Azure Synapse Sevrerless SQL for this purpose.
Anyone else seen similar things?
Thanks,
Hello All,
I am also facing the same issue. Stored procedures are processing stale data from the lakehouse, since there appears to be a delay between the completion of the notebooks and the data being fully available in the DW. As suggested by Microsoft support, I introduced a 'wait' activity in the pipeline to try and overcome this problem. We can still see the problem when the wait time is less than 30 minutes.
Is there any long-term solution, such as using Direct Lake semantic models in Power BI instead of the SQL endpoint?
Any suggestions?
Regards,
Murugan
This might be a similar issue; you could possibly try the workaround mentioned here and see if that helps:
Hi @FelixL
Glad that your query got resolved.
Please continue using Fabric Community for any help regarding your queries.
Thanks
Hi @FelixL, can you check a setting for the Power BI dataset (semantic model) and see if it makes any difference? In the following doc, under the Refresh heading, you'll find a setting called "Keep your Direct Lake data up to date".
This issue is still there. @Microsoft, can you just add an API that we can trigger using service principal credentials to manually refresh the SQL endpoint? There is an unofficial one, but it requires access via an AAD account with MFA.
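To illustrate what a supported service-principal flow could look like: the token acquisition below is standard MSAL, but the refresh URL is a hypothetical placeholder, since the unofficial endpoint isn't named in this thread.

```python
# Sketch only: acquire a token as a service principal (no interactive MFA) and
# POST to a metadata-refresh endpoint. REFRESH_URL is a hypothetical placeholder,
# not a documented Fabric API.
import msal
import requests

TENANT_ID = "<tenant-guid>"          # placeholder
CLIENT_ID = "<app-registration-id>"  # placeholder
CLIENT_SECRET = "<client-secret>"    # placeholder
REFRESH_URL = "<sql-endpoint-metadata-refresh-url>"  # hypothetical

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)

resp = requests.post(
    REFRESH_URL,
    headers={"Authorization": f"Bearer {token['access_token']}"},
)
print(resp.status_code)
```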
Hi, I am aware of this setting and I enable it whenever possible. But sometimes, due to data type errors or other refresh issues, this setting gets automatically disabled.
And in this case I don't really care about the default Direct Lake dataset - I only want to query my Delta Lake tables with T-SQL via the SQL endpoint (something that always works flawlessly, with no refresh delay, in Synapse).
My understanding is that the SQL endpoint is not dependent on the default dataset. Or is it? I should be able to deselect all of the SQL endpoint's tables from the default dataset (ending up with an empty dataset) and still be able to query the SQL endpoint.
Are you saying that I'm wrong, and there is a circular reference going on here? The metadata in the SQL endpoint needs to be updated in order to select which tables to include in the default dataset, but the default dataset needs to be up to date (reframed) in order to present data in the SQL endpoint? 😕 That doesn't make much sense, and I hope this is not the case?
I'm trying to understand if there is some "reframing" activity going on under the hood even for the SQL endpoint, causing the delay I am experiencing...
The SQL Endpoint isn't dependent on the default dataset (the other way round is true, though). OK, so just to confirm: if you query the delta data via the shortcut table using SQL via the SQL Endpoint, can you see the updated data? I've just tested this with a delta table being written to (overwrite mode) using a Synapse notebook, then queried via a shortcut in the Fabric Lakehouse SQL Endpoint, and I can see the updated data as soon as it's written to the external data lake folder.
Update: I've also tested with a report in Power BI on the default dataset, and I can see the data as soon as it refreshes in the external data lake.
I also load all my data using Synapse notebooks, in overwrite mode. When I query the table in the SQL endpoint in Fabric, the data is updated within seconds of the underlying delta table being updated most of the time - but sometimes it takes up to a few minutes for the SQL endpoint to reflect the change in data.
This (somewhat short) delay is what messes with my API-triggered dataset refreshes. I would assume that when my activity to load the delta lake table is done, it's safe to run the dataset refresh. But this is not always the case. Whenever data has failed to appear in the dataset, I go right back to the SQL endpoint and query the table, at which point I can always see the new/today's data there.
I have added a "5 minute wait" activity in my pipeline, and it now works "most of the time", but it still sometimes fails to fetch new data, leading me to believe that 5 minutes is not enough in all cases. I will extend this wait activity to 10-15 minutes, to see if that gives the SQL endpoint sufficient time to do "whatever needs to be done" to refresh the shortcuts/links.
But as far as you know, the SQL endpoint should always be up to date with the underlying data? There is no "reframing" activity going on behind the scenes, where it holds a cache of which parquet files make up the latest version of the Delta Lake table?
There certainly is reframing with Power BI datasets, in terms of updating metadata to understand which is the latest delta data. But for the SQL Endpoint I am not aware of any reframing process; it should read the latest transaction log and display the latest data. It's also the fact that it works "some of the time" in your scenario that's troubling me.
What's the volume of data?
Yes, this is troubling for me too.
Most affected tables are rather small, around ~100 MB in total parquet size and ~2 million records.
I can easily reproduce the SQL endpoint delay issue by starting a notebook and performing a Spark SQL update against a delta lake table (update xxx set yyy = current_timestamp()), while in parallel checking the SQL endpoint with a query (select max(yyy) from xxx).
From the moment my Spark update query returns "success", it takes anywhere from a few seconds up to a few minutes until I see my data change in the SQL endpoint (the old timestamp is returned for a number of T-SQL queries, until the new timestamp is finally returned).
I would expect never to see the old timestamp if I start querying through the SQL endpoint after the Spark SQL query has finished - but this is not the case.
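A rough sketch of that repro, for anyone who wants to measure the lag themselves (table and column names follow the example above; the connection string is a placeholder, pyodbc is assumed for the T-SQL side, and spark is the notebook's built-in SparkSession):

```python
# Rough sketch of the repro: update a Delta table from Spark, then poll the
# SQL endpoint until the new timestamp becomes visible and report the lag.
# The connection string is a placeholder; pyodbc is assumed for the T-SQL side.
import time
import pyodbc

conn = pyodbc.connect("<sql-endpoint-connection-string>")  # placeholder
cursor = conn.cursor()

def endpoint_max_yyy():
    cursor.execute("SELECT MAX(yyy) FROM xxx")
    return cursor.fetchone()[0]

before = endpoint_max_yyy()  # old watermark, read via the SQL endpoint

# Spark side of the repro (spark = the notebook's SparkSession)
spark.sql("UPDATE xxx SET yyy = current_timestamp()")
update_done = time.time()

# Poll the endpoint until it reflects the update
while endpoint_max_yyy() == before:
    time.sleep(5)

print(f"SQL endpoint reflected the update after {time.time() - update_done:.0f} s")
```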
So my info from Microsoft is that it should take a few seconds to a minute for the SQL Endpoint to discover the updated delta logs. They did have an issue which was fixed apparently.
If your issue persists, it would be worth logging it with MS support and giving them your workspace ID, lakehouse ID, and the time it was happening.
Thank you for looking into this.
This few-minute delay would explain what I am seeing, and why my API-triggered Power BI dataset refreshes fail to fetch new data. I have had my Power BI refreshes running with a 10-minute delay for a few days now, and everything has been working so far. A 5-minute delay does not seem to be sufficient even now - I "sometimes" miss data using only 5 minutes.
It's a real shame that there is "any" delay here, though, since I can't be 100% sure "when" my data is available for pushing to Power BI via the endpoint. It's especially bad since this did not appear to be the case in Synapse Serverless.
I have some Spark jobs loading data into some of my tables every 30 minutes (and running for ~30 minutes), after which an incremental dataset load is triggered via the same pipeline that runs the notebooks. Having to add a 10-minute wait activity to my pipeline here will have a roughly 30% negative effect on the loading times of my jobs (from job start to when business users see data in Power BI).
I guess the long-term solution here would be to use Direct Lake semantic models in Power BI instead of the SQL endpoint. However, this functionality is still missing some critical features for me to be able to use it right now (or I would have to heavily remodel my underlying Gold layer to handle the shortcomings of Direct Lake...).
For now, though, I am satisfied knowing why this is happening. Thanks.
I too have this issue, and I think it is a real drawback. I don't understand why there would be a delay in this update between delta tables and the SQL Endpoint. My issue is that if I create a delta table in the Lakehouse, it is not queryable in the SQL Endpoint (even after several minutes), failing with the error Invalid object name 'lakehousename.dbo.tablename'. Really frustrating.
Hi!
I'm experiencing the same delay with Fabric and the SQL Analytics endpoint. Or rather, the users experience it.
When they get back to me, I can see that the data is updated in the SQL views but not in the semantic model. If I refresh the semantic model manually, it shows the correct value. But no pipelines have updated any data in OneLake between the automatic refresh and my manual refresh.
I'm doing all the data processing in Fabric pipelines and notebooks (no Synapse or links to ADLS), but I still experience the same thing.
I'll try to add a delay between copying the data from Silver to Gold and the refresh of the semantic model to see if it has any effect. Not the greatest of solutions when refresh time is important, but that's life :).
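If the refresh is orchestrated from a Fabric notebook rather than a pipeline activity, one way to keep the delay and the refresh together is to trigger the semantic model refresh right after the Gold write. A sketch, assuming the semantic-link (sempy) package and its refresh_dataset helper are available in the Fabric Spark runtime; the dataset and workspace names are placeholders:

```python
# Sketch: refresh the semantic model from a Fabric notebook after the Gold write.
# Assumes the semantic-link (sempy) package; names below are placeholders.
import time
import sempy.fabric as fabric

DATASET = "<semantic-model-name>"  # placeholder
WORKSPACE = "<workspace-name>"     # placeholder

# ... Silver -> Gold write has just finished in this notebook ...

time.sleep(600)  # crude stand-in for the "wait" discussed in this thread
fabric.refresh_dataset(DATASET, workspace=WORKSPACE)
```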