Scott_Powell
Advocate III

Gen2 Dataflow to Warehouse - really slow performance

Hi, I'm hitting an issue loading data from an on-prem SQL Server to a Fabric Warehouse using Dataflows Gen2. I'm only moving 34k rows (about 30 columns), and it's taking almost 13 minutes (see image at bottom). Staging is disabled for this (these jobs literally just copy the data as is, with no transformations).

 

  • If I enable staging, I can see the staging objects load in < 30 seconds, and then it takes 12+ minutes to write from the staging warehouse to the actual real warehouse.
  • If I write the records to a Lakehouse instead of a Warehouse, the whole thing always completes in 30 - 45 seconds.

 

Is this type of performance typical for loading a Warehouse? I can't believe it is, but the jobs are too simple for anything to be messed up, I think. Any ideas?

 

Thanks,

Scott

 

 

[Image attachment: Scott_Powell_0-1697572196809.png]

 

26 REPLIES
JP_Helsinki
Frequent Visitor

Any update on this issue? I'm trying to load data with the KNIME ETL tool to the Fabric warehouse, but it is incredibly slow.

I"m trying to do the same with Alteryx and still experiencing extreme performance issues.  Data output is just a few thousand rows.  Job runs in 9 sec without trying to write to Fabric Warehouse, and takes 35min plus to run loading to warehouse.  Sometimes it just fails.  

If you're trying to perform this with a tool other than Dataflows, and you've noticed that the bottleneck is loading the data to the warehouse, definitely try reaching out to our support team or post this question in the Data Engineering community, which oversees topics directly related to the data warehouse.

I don't seem to have the ability to submit support requests. With all the AI tools available, I wish Microsoft would do a better job of summarizing issues and consolidating these problems across various forums... I shouldn't have to re-post issues multiple times, but I did it anyway: Poor ODBC insert performance in Warehouse - Microsoft Fabric Community

Here's the link to the support page:

https://support.fabric.microsoft.com/support

 

If you're having issues creating a support ticket, I'd recommend getting in touch with your admin so they can create it on your behalf. This is the route where our support team can tell you exactly what could be happening and how to address the scenario.

 

I can see that you've created another thread in the data warehouse forum. That's great! Typically we would move entire topics between forums, but because this was a reply rather than a newly created topic, and it's quite specific to external tools rather than Dataflow Gen2, we couldn't simply move the reply and create a new topic on your behalf. In cases like this we encourage users to post their full scenario as a new topic in the most appropriate forum.

 

Nevertheless, I still encourage you to go through the support route, as it lets you engage with our team directly to get to the bottom of the situation.

 

Best!

I basically gave up on Dataflows Gen2 because there was a speed requirement it couldn't satisfy. However, doing the ETL in PySpark and storing the data in the lake works really well and is fast enough. You do need to allow 1-2 minutes for the Spark pool to find you a session. Besides, you have an extensive library of Python packages that lets you do basically anything. BTW, I'm not a true Python coder, but Bing Copilot and ChatGPT were enough to get me started.
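For reference, a minimal sketch of the "store to the lake" step in a Fabric notebook might look like the following; the DataFrame and table name are placeholders, not the exact code from my pipeline.

# Assume df is a Spark DataFrame already built from the API responses.
# With a default Lakehouse attached to the notebook, saveAsTable writes a Delta table into it.
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("raw_api_data")   # placeholder table name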

Hi @davelukasaddpro, how do you get the PySpark notebook to hit on-prem data sources?

 

Thanks,

Scott

What kind of on-prem data source are you thinking of? I haven't tried to access any on-prem SQL databases or other sources; I just used API calls to several endpoints. But maybe you could try using ODBC/JDBC from Spark if you want to connect to SQL Server?

from pyspark.sql import SparkSession

# spark.jars points at your SQL Server JDBC driver jar
spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.jars", "/path/to/your/sqljdbc42.jar") \
    .getOrCreate()

jdbc_url = "jdbc:sqlserver://your_server:1433;databaseName=your_database"
connection_properties = {"user": "username", "password": "password", "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
df = spark.read.jdbc(url=jdbc_url, table="your_table", properties=connection_properties)

Or:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"

# Put the SQL Server JDBC driver jar on the driver classpath
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath", "sqljdbc_7.2/enu/mssql-jdbc-7.2.1.jre8.jar")

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

database = "test"
table = "dbo.Employees"
user = "username"
password = "password"

# Read the table over JDBC into a Spark DataFrame
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://localhost:1433;databaseName={database}") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

jdbcDF.show()

Or:

from pyspark import SparkContext, SparkConf, SQLContext
import pyodbc
import pandas as pd

appName = "PySpark SQL Server Example - via ODBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master)

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

database = "test"
table = "dbo.Employees"
user = "username"
password = "password"

# Read via pyodbc into pandas first, then convert the result to a Spark DataFrame
conn = pyodbc.connect(f'DRIVER={{ODBC Driver 13 for SQL Server}};SERVER=localhost,1433;DATABASE={database};UID={user};PWD={password}')
query = f"SELECT * FROM {table}"
pdf = pd.read_sql(query, conn)
sparkDF = spark.createDataFrame(pdf)
sparkDF.show()

You may need to install pyodbc first: !pip install pyodbc

Please let me know if it works.

I'm experiencing the same performance issues with Dataflow Gen 2 using an On-prem data gateway:

  • Pulling the data from the underlying SQL database is quick. I can see that in the refresh details (about 10 tables each taking less than 1 min to pull the data).

  • However, the second Dataflow Gen 2 step where it's writing the data (WriteToDataDestination) to the warehouse takes ages and eventually times out. It succeeds in writing about 3-4 tables but then fails.

"104100 Couldn't refresh the entity because of an issue with the mashup document MashupException.Error: Microsoft SQL: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)"

Is there a way to see what jobs/requests are running on the warehouse? It seems that running too many requests (e.g., 1 Dataflow Gen 2 with many tables or running multiple Dataflows Gen 2 at the same time against the same Warehouse) results in locking up the Warehouse.
Is there even a way to stop all current jobs?

I'm inclined to look for other solutions, but notebooks and fast copy are still not supported for on-premises sources as of Jan 2024. Many new features are coming for Dataflow Gen 2 in Q1 of 2024, though. As a next step, I will try Dataflow Gen 2 to store data into the lakehouse to get a properly robust ETL process set up, and I'll let you know if that resolves it.

UPDATE (to the above comment): Writing and storing data from the on-prem SQL Server directly to the Lakehouse with staging turned off (using a data gateway) has been able to deliver on my requirements, working around the limitations of the warehouse.

Even though I don't know any Python or Scala, I was able to do all the required transformations in notebooks using Spark SQL.
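For anyone curious, here's a minimal sketch of what a pure Spark SQL transformation in a notebook can look like; the table and column names below are made up for illustration.

# Hypothetical example: build a cleaned table from a raw Lakehouse table using only SQL
spark.sql("""
    CREATE OR REPLACE TABLE silver_orders AS
    SELECT CAST(order_id AS INT)    AS order_id,
           TRIM(customer_name)      AS customer_name,
           CAST(order_date AS DATE) AS order_date
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")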

davelukasaddpro
Frequent Visitor

I'm having the exact same experience, but I fetch my data from an API endpoint. The timeout is set to a few seconds; if the endpoint fails, the next dataflow is executed. My pipeline consists of 21 dataflows. Each dataflow fetches a minimal amount of data from an endpoint, we are talking 2-3 KB of JSON each run. There is just something utterly wrong if the monitor shows some dataflows running 1 minute but others running 4 hours... and they seem to eat all my F2 capacity. As a consequence, every other dataflow executed after the one that runs 4 hours fails... I mean, Dataflows Gen2 are really powerful and I see them as the main selling point for companies looking for great visual ETL. So, Microsoft, please put some serious attention on these dataflow issues.

Same thing on our end. I can't even get tables with a few thousand rows to write to the warehouse. I've had a support ticket open for 2 months now and they have been stringing me along.

 

I wish MSFT would acknowledge these issues. I have high hopes for DFG2 and Fabric as a whole, but it seems like things just don't work now.

After the Christmas/New Year holidays I plan to try something else. Currently the data is stored in a Data Warehouse for practical "SQL" reasons; the JSON is basically structured as a table row and easy to insert into a table. But from what I'm hearing and reading, the Lakehouse seems to work better with Gen2, and it's a minor modification to get the dataflows to write the fetched endpoint data to a Lakehouse instead of the current warehouse. Also, I changed the scheduling for the main pipeline from 50 minutes to 1.5 hours. I counted that approx 2-3 minutes go to "startup" of the dataflow, and the rest is the execution of the dataflow, approx 2 minutes, so any dataflow should complete within approx 4-5 minutes. Now my pipeline with its 21 dataflows is stable. But 90 minutes to get a few KB of data is way too slow.

v-cboorla-msft
Community Support

Hi @Scott_Powell 

 

Apologies for the delay in response and for the inconvenience that you are facing here.

 

Just as a workaround, instead of loading on-prem data to Dataflow Gen2 and then to the warehouse (On-Prem -> DFg2 -> Warehouse), can you please test loading the data from on-prem to Dataflow Gen2 and then to a Lakehouse (On-Prem -> DFg2 -> Lakehouse)? (Files or tables; I would test both.)
Then create a Data Pipeline to move the data from the Lakehouse to the Warehouse.

Just like the medallion architecture we often see, we want to do a quick load of the source into a Raw layer, then work our way up to the Warehouse, which would be Silver/Gold.


The issue you are facing while loading the data into the warehouse is just temporary. This won't be a thing once Data Pipelines can use the on-premises data gateway, which is in development.

[Image attachment: vcboorlamsft_0-1701455507952.png]

 


Hope this helps. Please let me know if you have any further questions.

 

Hi @v-cboorla-msft, I'm actually in the process of getting this ready to test right now. I was able to successfully load about 1 billion rows of transactions into a lakehouse using Gen2 dataflows (running multiple dataflows in parallel across several gateway clusters). Now I'll try copying those across to a DWH using the copy utility. I'll update this post as soon as I have results to report (this is a side project, so it might be a couple of days).

 

One question / idea: it really feels to me like the underlying Dataflow Gen2 "engine" should be able to tap into Spark compute just like notebooks do. Similar to how it "pushes down" SQL queries to a database, I wish it could "push down" some of this type of work to Spark - it seems like it would be much faster.

 

Thanks!

Scott 

Hi @Scott_Powell 

 

Thanks for the feedback.

If you want to enable Spark compute, please upvote this idea: Microsoft Idea

 

Thanks

Scott_Powell
Advocate III

Hi all and especially @miguel, just a quick update. First, I'm having a really hard time getting support to actually pay any attention to this, even though I submitted a ticket over a month ago. I'm a bit frustrated, as I don't think they've actually done anything at all on the ticket.

 

Having said that, I think I've discovered at least one of the causes of really poor performance. I was trying yesterday to load about 11 million rows into a Warehouse. The load job takes 5 minutes writing to a Lakehouse, but timed out after hours on the warehouse. 

 

Looking into the new "query insights" schema available in the Warehouse, I could see the records were being written in batches of only 1,000 records each, and each commit seems to have been taking 3 or 4 seconds. With 11 million records, that works out to 11,000 commits * 4 seconds each = 12+ hours.
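(For anyone who wants to look at the same data, here's a rough sketch of pulling the request history from the Warehouse's SQL connection string via a notebook. The server, database, and login below are placeholders, and the exact columns are best checked against the queryinsights views in your own warehouse.)

import pyodbc

# Placeholder connection - point this at your Warehouse's SQL connection string and your own login
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=yourworkspace.datawarehouse.fabric.microsoft.com;"
    "DATABASE=YourWarehouse;"
    "Authentication=ActiveDirectoryInteractive;UID=you@yourtenant.com"
)

# The queryinsights views hold recent request history; look for long runs of small INSERT batches
cursor = conn.cursor()
cursor.execute("SELECT TOP 100 * FROM queryinsights.exec_requests_history")
for row in cursor.fetchall():
    print(row)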

 

If this is indeed the issue - MS you've GOT to change the commit size. Maybe make it 1 million, or 10 million, but not 1000 rows. 

 

P.S. I'm still not sure this is "all" of the issue - in my original issue above it was only writing 34,000 records, which would have been 34 commits at 4 seconds each = roughly two minutes of excess time. I was seeing way more than that. I'll try to go back and reproduce this tomorrow, looking at the query logs to see what's going on in the database.

 

P.P.S. Microsoft, if it helps, the ticket I opened was 2310180040012820

 

Thanks,

Scott

I am in the same boat as you. I think this all comes down to the write speed to the warehouse. Things just seem to lock up and time out. I can't even get 100k rows to write...

I watched the query against our on-prem SQL Server and noticed that it had an ASYNC_NETWORK_IO wait the whole time. According to the MSFT docs, these occur when the client application (here, the warehouse side) isn't accepting results fast enough.
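If anyone wants to check for the same wait on their source server, here's a rough sketch of watching the running request's wait type from Python (connection details are placeholders; the same SELECT works directly in SSMS):

import pyodbc

# Placeholder connection to the on-prem SQL Server the dataflow is reading from
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=your_server;DATABASE=your_db;Trusted_Connection=yes"
)

# Show currently executing requests and what they are waiting on
cursor = conn.cursor()
cursor.execute("""
    SELECT session_id, status, command, wait_type, wait_time, last_wait_type
    FROM sys.dm_exec_requests
    WHERE session_id > 50   -- skip system sessions
""")
for row in cursor.fetchall():
    print(row)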

 

I also noticed that the warehouse becomes unresponsive while the dataflow is running. 

 

[Image attachment: Joshrodgers123_1-1700538899154.png]

 

I can't imagine we are the only ones experiencing these issues. There must be some solution or acknowledgment if Fabric is now GA. I'm trying to sell Fabric to my org, but it's a bit hard when we can't get our data into it.

 

Here's my support ticket if it helps: 2311060040014961

 

Same boat here.

Cannot get the dataflows to write into the warehouse for the life of me.

Support ticket is this: 2311150050004676

Joshrodgers123
Advocate V

I'm seeing the same thing as well. I'd really love to start using warehouses but can't seem to get anything but really slow performance or errors.

 

I have a DFG2 writing 2 tables, 300k and 400k records each.

With no destination: 50 seconds

With warehouse destination, staging enabled: 50 minutes

With warehouse destination, staging disabled: times out after 3.5 hours and locks up the warehouse so it can't be accessed.
