Hi All,
We use a pipeline that pulls JSON data from an API. In the next step of that pipeline, we ingest those JSON files into a lakehouse "raw" table. Once the 5 JSON files are in the raw table, the next step is an ETL to the production tables.
The problem we are facing is inconsistency in the ingestion of the JSON files into the raw table. The step automatically runs at 5 am, but we see that some data in the JSON from the day before is actually not ingested into the raw table, while other data is.
When I look at the actual JSON file, I see the data is actually in the file itself, so there is no reason for it not to be in the raw table. The really weird thing is, if I run the pipeline manually at, say, 11 am, new files are created, which then get ingested into the raw table correctly.
I have done a compare between the 5 am and the 11 am file and they are identical, except of course for the date they were created.
The pipeline itself completes without any error. The create-JSON and ingest-JSON loop does run concurrently, since it involves 5 entities and the supplier API allows us to run concurrent sessions.
Since there seems to be no logic to why it works in one run and not in another, I have no clue where to even start troubleshooting this.
I initially thought it was maybe because of the concurrent connections, but then I would expect it to miss data on the 11 am manual run as well, which it appears not to. (I would think the lakehouse would have no issues dealing with 5 concurrent data streams anyway.)
Could the concurrency still be the problem, or does anyone have any ideas where to start troubleshooting this one?
Warm regards
Hans.
Good morning all,
A small update and a recap.
MS has been in touch and I had a Teams call with them. The "pipeline" team has now also been involved. No answer yet, and I am not sure we will ever get a solution, looking at the article MS posted about the fact that there can be a delay of on average 1 minute (but it can be more, and we have seen more) between the lakehouse SQL endpoint and the data in the lakehouse itself. By the looks of it, this issue was introduced in the July updates (we hadn't experienced it before, and neither had my colleagues).
So where does this leave us? The workaround below works for us, but we will have to rewrite a few pipelines.
Workaround:
Use the lakehouse_raw to get the files and store them as files in the lakehouse. Next, use a copy data activity to ingest the files and copy the data to a warehouse_raw, the warehouse_raw being a new bronze medallion.
From there you can run a copy data activity (or a stored procedure, or a script, or whatever you need to do) to ETL the data. In our case the data is staged into a staging table (silver medallion) in the warehouse_raw, but that will vary per use case. Last but not least, use another activity to copy the data to the warehouse for the BAs to use.
Of course the steps above are for our scenario, yours may vary.
Cheers
Hans
Good afternoon all, a little update on this. It looks like the SQL endpoint for the lakehouse doesn't immediately reflect the data changes in the lake, which causes these weird intermittent issues.
Is this a known issue?
I have seen several users who have mentioned confusing results due to delay between Lakehouse and SQL Analytics Endpoint.
Why/how are you querying the SQL Analytics Endpoint?
Could you query the Lakehouse tables directly instead?
When moving data from the raw lakehouse table to the staging table (also in the same lakehouse) we have to do a lot of grouping and ETL. Using SQL is the easiest way of doing that.
When researching further we found that the data is actually in the raw table, but not in the staging table, and that step is the one where we use the SQL endpoint. We could create a notebook to do the same, but if the same delay happens in the staging table, then there is no guarantee that we won't face the same issue when moving the cleaned, grouped and relational data to the warehouse. Our data scientist uses the warehouse to do his work.
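For illustration, I imagine the notebook version of that grouping step would look something like the sketch below; the table and column names are made up for the example, not our real schema:

```python
# 'spark' is the session provided by the Fabric notebook runtime; the notebook
# reads the lakehouse Delta table directly, so the SQL endpoint sync delay does not apply.
staged = spark.sql("""
    SELECT entity_id,
           reading_date,
           SUM(reading_value) AS total_value
    FROM   raw_api_data          -- hypothetical raw lakehouse table
    GROUP  BY entity_id, reading_date
""")

# Overwrite the staging table in the same lakehouse.
staged.write.mode("overwrite").saveAsTable("staging_api_data")
```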
Is your setup like this?
1. Data source is an API which returns JSON data.
2. You use Data pipeline to get data from the API and land it as JSON files in the Raw layer. Could you explain which pipeline activities you are using?
3. In the same Data pipeline, you load the JSON files to a table in Raw layer. Could you explain which pipeline activities you are using? If I understood correctly, this is the first step where you get unexpected results.
4. You use an ETL step inside your Data pipeline to transform and load the data to Production layer. Can you explain which pipeline activity you are using for this?
Is everything happening inside one Data pipeline or are there separate, independent Data pipelines?
You mention that you also have a staging table and a warehouse. Where do they fit in the above list? Could you make a complete list and highlight where the unexpected results occur?
If you can, please share a screenshot. Remember not to share any internal or secret information.
Hi @frithjof_v
1. Correct, this is what we do.
2. We use a web activity to log on to the API; it uses a bearer token.
3. We use a for each loop to query the API and create the JSON files (7 of them); they can have multiple days of data. In the same loop we read the JSON files and ingest them into the lakehouse raw table.
4. Once the files are read into the raw table, there is a copy data step, which uses the lakehouse SQL endpoint to run a SQL script to normalise and transform the data. The data goes to another lakehouse "staging table". Here is where we see the discrepancy occur. For instance, earlier this afternoon I ran the API and the raw table had 11 records for one specific type of data, but the staging table had only 9 records for that dataset, and the last record was 0.00000 instead of its correct value (see below).
Once we have ETL'ed and normalised the data, the staging table gets split into 4 tables and loaded into the warehouse, using 4 script activities (production, production-history, quality and quality-history).
These 4 scripts check for new data to be inserted or older data to be updated, since some data can be updated.
The activities we use are all very basic (some variables, a web activity, a loop, a copy data activity and script activities).
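(If we were to move the ingest of step 3 into a notebook instead of the loop's copy data activity, I imagine it would look roughly like the sketch below; the paths, entity names and table names are placeholders, not our real setup.)

```python
# Hypothetical notebook version of step 3: read the landed JSON files and append
# them to per-entity raw Delta tables, writing to the lakehouse directly.
entities = ["entity1", "entity2", "entity3"]   # placeholder entity names

for entity in entities:
    df = (
        spark.read
        .option("multiLine", True)   # needed if each file is a single JSON array/object
        .json(f"Files/raw/{entity}/*.json")
    )
    df.write.mode("append").saveAsTable(f"raw_{entity}")
```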
We have also raised a case with MS, and I showed them the discrepancies we are seeing between the lakehouse data and the SQL endpoint data.
I hope this clarifies it a bit 🙂
cheers
Hans
My 2 cents - I am currently doing similar processing for bulk ingestion.
1. Define an array. In my case it is [0, 1, 2, ... 19].
2. For each element of the array:
a. Get data from the data vendor through an on-prem gateway from their SQL Server and dump the raw data to the Files layer (as opposed to writing directly to a lakehouse table, which is not recommended at all).
(Copy activity - {source: on-prem SQL, destination: Lakehouse Files section (in a dedicated folder) as a parquet file})
b. Notebook activity (defaulted to the same lakehouse as the landed file and the final target table) - programmatically look in the dedicated folder from step a, pick up the most recent parquet file that landed, transform it in the notebook (as opposed to using the SQL endpoint) and write it to the final target lakehouse table, factoring in the upsert/append/overwrite write modes. A rough sketch of this notebook step is below.
I am currently doing this for 20 tables and it has been good so far.
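A rough sketch of the notebook step in b, assuming the notebook has a default lakehouse attached (mounted locally at /lakehouse/default); the folder and table names are placeholders:

```python
import glob
import os

mount = "/lakehouse/default/"            # local mount of the attached lakehouse
landing = "Files/landing/my_table"       # hypothetical dedicated landing folder

# Pick the parquet file that landed most recently, using the local mount...
latest = max(glob.glob(f"{mount}{landing}/*.parquet"), key=os.path.getmtime)

# ...then read it with Spark via the lakehouse-relative path
# ('spark' is the session provided by the Fabric notebook runtime).
df = spark.read.parquet(latest.replace(mount, ""))

# Transform as needed, then write straight to the lakehouse Delta table
# (append / overwrite / upsert as required), bypassing the SQL analytics endpoint.
df.write.mode("append").saveAsTable("my_table")
```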
We indeed do not write to a lakehouse table directly; the REST API creates JSON files in a dedicated folder, which then get read and ingested in a for each loop. That step works well.
Thanks for sharing.
Could it be that the data in the SQL Endpoint is not properly synced with the Lakehouse itself, so that when you are running the queries against the SQL Endpoint, you get unexpected results?
Perhaps you could try to do the transformations in a Notebook instead of a script activity. Because the Notebook will use the Lakehouse tables directly, instead of the SQL Analytics Endpoint. Just to see if you get the expected results then.
I can see how using Notebook can become a bit tricky especially when loading from Lakehouse to Warehouse. One option is to just have the data scientist use the Lakehouse instead.
To use Notebook to write to Warehouse, there is something called pyodbc. I have never tried it myself and I don't know how performant it will be. Another option would perhaps be to use data pipeline to copy the data from Lakehouse to a staging table in the Warehouse, and then run a stored procedure in the Warehouse to upsert data from the staging table into the production table. This would be a way to avoid using the Lakehouse's SQL Analytics Endpoint, in order to try to isolate the error and see if the process works fine when avoiding the SQL Analytics Endpoint.
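For what it's worth, a rough and untested sketch of the pyodbc approach from a notebook might look like the following. The connection string details, the token audience and the stored procedure name are placeholders you would have to adapt and verify for your environment:

```python
import struct
import pyodbc
from notebookutils import mssparkutils  # available in Fabric notebooks

# Get an Entra access token from the notebook session.
# The audience key is an assumption here; verify which audience your environment needs.
token = mssparkutils.credentials.getToken("pbi").encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token)}s", len(token), token)
SQL_COPT_SS_ACCESS_TOKEN = 1256  # pyodbc connection attribute for passing an access token

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<warehouse-sql-connection-string>;"  # placeholder: copy from the Warehouse settings
    "Database=<your_warehouse>;"                 # placeholder
)
conn = pyodbc.connect(conn_str, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct})

cursor = conn.cursor()
# For example, run a (hypothetical) stored procedure that upserts staging data
# into the production table inside the Warehouse.
cursor.execute("EXEC dbo.upsert_production_from_staging")
conn.commit()
cursor.close()
conn.close()
```

The idea is that this connection talks to the Warehouse's own SQL endpoint, so the Lakehouse SQL Analytics Endpoint is not involved at all.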
Good to hear that you have a support ticket. Please tell if your issue gets solved.
We have thought about using a notebook to transform the data to the staging table as well, which would solve that step's problem. But the step after that is copying the data from the staging table to the warehouse, and for that we would need to use the SQL endpoint again, so we suspect that would not really solve our problem, but just move it to the next step.
Moving the gold medallion to a lakehouse instead of a warehouse is not an option right now, since that is no small change for us.
One last thing, which I did actually ask MS about while updating the case: we have seen in development, when doing a lot of testing and changing, that sometimes you have to force a checkpoint on the lakehouse table by using a notebook.
Since MS themselves, in the article you linked, advise manually pressing the refresh button (which is very hit and miss in practice, by the way), I was wondering if including that "checkpoint notebook" before the ETL step would maybe force the update of the SQL endpoint as well.
Last but not least, since this is a production issue for us, we have now created a raw warehouse, which ingests the JSON files (created in the raw lakehouse) into a raw warehouse table, and from there we do the ETL. This seems to work for now.
Here is documentation which explains the delay between Lakehouse and SQL Analytics Endpoint. I wasn't aware of this documentation until now:
https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance
This certainly matches up with other rumours we have heard. Looking at the date of the document, it is extremely new. Rumours we have picked up from other sources indicated this was introduced after a preview release not too long ago, which makes sense looking at the document date.
If you have found out that the JSON data gets properly loaded into your raw tables, and the only issue is with the further transformations that use the SQL Analytics Endpoint, then I would use a notebook to do all the data transformations (working with the Lakehouse tables, not the SQL Endpoint) and have the notebook write the finished table to a Lakehouse.
If you prefer to use data warehouse as the final layer (I don't know why you want to use the data warehouse) then I guess the question is how to get data from the Lakehouse into the Data warehouse without using the SQL Analytics Endpoint of the Lakehouse.
Perhaps you need to use data pipeline to copy data from the Lakehouse into a staging table in the Warehouse, and then run a stored procedure to load the data into the production table in the Warehouse.
(Edit: Here I am assuming that the Data pipeline copy data activity doesn't use the SQL Analytics Endpoint as the source of the copy data activity. However I'm not entirely sure if the Lakehouse connector in Data pipeline actually uses SQL Analytics Endpoint or not).
Alternatively, there is something called pyodbc which makes it possible to write to a Warehouse from a Notebook. I don't have experience with it.
It seems to me that the key is to not use the SQL Analytics Endpoint when querying data from the Lakehouse, if you need to be sure to get the most recent data.
Hi @smeetsh ,
While you mentioned that there is nothing wrong with concurrent pipeline runs, concurrency can still cause intermittent problems. You may want to check whether there are any race conditions or resource contention during the 5 am run.
Also, regarding how you are ingesting the JSON files: you may need to specify format='multijson' instead of format='json'. Please provide a relevant screenshot and describe it.
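As a rough, optional check from a notebook (separate from the pipeline setting above), you could compare how Spark parses the landed files as newline-delimited JSON versus multi-line JSON documents; a parse-mode mismatch can silently drop or merge records. The path below is just a placeholder:

```python
# Read the same files both ways and compare. A "_corrupt_record" column or a
# suspicious record count hints at a parse-mode mismatch.
path = "Files/raw/entity1/*.json"   # placeholder path

as_ndjson    = spark.read.json(path)
as_multiline = spark.read.option("multiLine", True).json(path)

print("NDJSON:    ", as_ndjson.count(), as_ndjson.columns)
print("multiLine: ", as_multiline.count(), as_multiline.columns)
```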
Best Regards,
Adamk Kong
If this post helps, then please consider accepting it as the solution to help the other members find it more quickly.