Hello,
Instead of creating a separate dataflow for each table, I am looking for a performant, data-driven way to transfer data generically from an on-premises SQL Server to MS Fabric.
I have a list of many tables on the on-premises SQL Server 2019 (version 2022 in about two months) and I want to copy them to Fabric Delta/Parquet files (overwrite).
My experience so far:
- Pipelines can be parameterized, but don't support on-premises data sources
- Dataflow Gen 2 can't parameterize the destination table
- I was not able to create "Delta"/Parquet files on premises with Python that were accepted by a Lakehouse.
Not tested yet (may be possible):
- ODBC/OLE DB connection to a Warehouse (Fabric) and push data with SSIS?
Every idea is welcome (Dataflow Gen 2, Pipeline, Python, Spark, PowerShell (on premises), ...). Performance/compression/reliability is important because of the amount of data.
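To make the goal concrete, here is a minimal sketch of the metadata-driven pattern I have in mind: one generic job that is fed a plain list of table names and derives the per-table query and destination from it. All table names, the schema tuples, and the `Files/staging` path are placeholders, not anything Fabric-specific.

```python
# Minimal sketch of a metadata-driven export driver: one generic job,
# parameterized by table name. Table list and paths are placeholders.

def build_work_items(tables, lakehouse_root="Files/staging"):
    """Build one (query, destination) pair per (schema, table) tuple."""
    items = []
    for schema, table in tables:
        items.append({
            "table": f"{schema}.{table}",
            "query": f"SELECT * FROM [{schema}].[{table}]",
            "destination": f"{lakehouse_root}/{schema}/{table}.parquet",
        })
    return items

if __name__ == "__main__":
    work = build_work_items([("dbo", "Customers"), ("sales", "Orders")])
    for item in work:
        print(item["table"], "->", item["destination"])
```

Adding a table would then mean adding one row to the metadata list, not building a new dataflow.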
Many thanks,
Oliver
Hi @Olbu, yes, Data Factory uses a "Self-Hosted Integration Runtime" that you install on your on-prem infrastructure (e.g. a Windows server/machine), which creates a tunnel from the Data Factory service to the server you have installed it on.
Create a self-hosted integration runtime - Azure Data Factory & Azure Synapse | Microsoft Learn
You'll need to make sure that if you have any outbound port restrictions in place, you allow access to these ports.
You also need to follow these instructions if you want to create Parquet files from your on-prem data sources Parquet format - Azure Data Factory & Azure Synapse | Microsoft Learn
"you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine"
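As a rough sketch of what the parameterized, per-table copy could look like as a Data Factory pipeline (all names are illustrative; dataset references, the integration runtime binding, and most runtime settings are omitted):

```json
{
  "name": "CopyAllTables",
  "properties": {
    "parameters": { "tableList": { "type": "array" } },
    "activities": [
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
          "activities": [
            {
              "name": "CopyTableToParquet",
              "type": "Copy",
              "typeProperties": {
                "source": {
                  "type": "SqlServerSource",
                  "sqlReaderQuery": {
                    "value": "SELECT * FROM [@{item().schema}].[@{item().table}]",
                    "type": "Expression"
                  }
                },
                "sink": { "type": "ParquetSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

The `tableList` parameter could come from a Lookup activity against a metadata table, so one pipeline covers every table.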
At the moment, the process in Dataflows Gen 2 requires the explicit definition of a specific destination on a per-query basis, where one query sinks the data to a single table. Any suggestions beyond this would be great to post on the ideas site at aka.ms/FabricIdeas
I'm moving this to the Pipelines forum to check if there are any suggestions from the pipelines side.
Hi @Olbu,
I wrote a blog post some time back on getting my on-premises SQL Server data into a Lakehouse in Fabric.
https://datavibe.co.uk/getting-on-premises-sql-server-data-to-fabric-lakehouse/
Let me know if this helps.
Hello @Pragati11,
thanks for your reply and your blog, but I already know how to get data from an on-premises SQL Server to a Lakehouse with Dataflow Gen 2.
I don't want to create an explicit dataflow for every table. I am searching for a way to do this data-driven: one job that can be parameterized by a table name. For every table name, this job would, for example:
a) copy the on-premises data to the Lakehouse/Warehouse
b) merge the data into existing tables
c) delete the data from (a)
plus logging and error handling, which would be the same for every table. Changes on premises should be propagated automatically to Fabric.
I have solved (b) and also (c) (as soon as an existing pipeline bug is fixed). So I am looking for parameterizable options for (a), and I want to avoid creating hundreds of static dataflows.
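For step (b), the statement itself can also be generated from metadata, so the same job works for every table. This is only a sketch of the idea; the table, staging, and column names are illustrative, and column lists would come from the metadata table in practice:

```python
# Sketch: compose the T-SQL MERGE for step (b) from metadata, so one
# generic job handles every table. All names below are illustrative.

def build_merge_sql(target, staging, key_columns, columns):
    """Build a MERGE statement from target/staging names and column lists."""
    on = " AND ".join(f"t.[{c}] = s.[{c}]" for c in key_columns)
    update = ", ".join(f"t.[{c}] = s.[{c}]" for c in columns if c not in key_columns)
    insert_cols = ", ".join(f"[{c}]" for c in columns)
    insert_vals = ", ".join(f"s.[{c}]" for c in columns)
    return (
        f"MERGE {target} AS t USING {staging} AS s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {update} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )

if __name__ == "__main__":
    print(build_merge_sql("dbo.Customers", "staging.Customers",
                          ["CustomerID"], ["CustomerID", "Name", "City"]))
```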
Hi @Olbu ,
As I understand it, you would like to create a common connection to on-prem and execute this job with parameters by table name.
Currently this is not supported by Fabric.
As per the Fabric release plan, this feature will be available during Q1 2024 (estimated release) - Link
Appreciate your patience.
Hope this helps. Please let us know if you have any further issues.
Hello @v-gchenna-msft ,
are you sure that this feature would allow a pipeline to get data from an on-premises data source? I'm not sure about that. It seems to let me parameterize the connection, but the connection type itself has to be supported (for example, connecting to different blob storages by parameter).
More likely this feature would help me:
But I'm looking for options available today and hoped someone had a good idea, for example a blog post where Parquet files are created on premises in PowerShell (or similar) and uploaded by script, which then triggers a notebook in Fabric, or maybe there are some possibilities with Spark. I can't imagine that I'm the only one who wants to transfer data from on-premises sources without building explicit jobs for every table.
Many thanks
Oliver
One thing I could suggest is using the Data Factory service in Azure. I know it's not part of the Fabric suite, but once the on-prem data pipeline functionality is available in Fabric, it shouldn't be much effort to port over the Data Factory version of the pipeline.
I would say that you could pump the data into the Files area of a Fabric Lakehouse in Parquet format using the Copy Data activity.
@AndyDDC Many thanks for your answer. Any idea is welcome! Does the "Data Factory service in Azure" support the on-premises enterprise gateway? (I have to ask because we have no VNet to Azure.)
For testing, yesterday I created a little C# program which accepts a SQL query and creates a Parquet file. Unfortunately it isn't the fastest. So my current idea is:
Source, on-premises SQL Server 2019 query: 63 columns (various data types), 10,597,630 rows
=> Result in Lakehouse\Tables: ~664 MB after about 15 minutes in total
This could be developed completely data-driven. Currently I have no idea how I could trigger a "Notebook" or "Pipeline" after uploading new files, but this should be possible.
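For the triggering part, one option could be the Fabric REST API for running an item job on demand. The exact endpoint shape and the `RunNotebook` job type below are my assumption from the public API surface, not something I have verified here, and the workspace/notebook IDs are placeholders; the actual call is a POST with an AAD bearer token.

```python
# Sketch: build the URL for triggering a Fabric notebook run on demand.
# Endpoint shape and jobType value are assumptions; IDs are placeholders.

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def notebook_run_url(workspace_id, notebook_id):
    """Build the on-demand job URL for a notebook item (POST with bearer token)."""
    return (f"{FABRIC_API}/workspaces/{workspace_id}"
            f"/items/{notebook_id}/jobs/instances?jobType=RunNotebook")

if __name__ == "__main__":
    print(notebook_run_url("<workspace-guid>", "<notebook-guid>"))
```

The upload script could call this right after the last Parquet file lands in the Lakehouse Files area.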
For comparison, I created a static "Dataflow Gen 2" job which retrieves the same amount of data; after 80(!) minutes of refreshing, it threw an error. I'll try it again, but at least my "complex" data-driven-capable job (3 steps) seems to be a lot faster, even though my C# program isn't very fast.
Compressing the data locally doesn't seem like the worst idea to me. Dataflow Gen 2 seems to transfer the data uncompressed; I have no other explanation for why it is so much slower.
Hi @Olbu ,
We haven't heard back from you since the last response and were just checking whether your query got resolved. Otherwise, please reply with more details and we will try to help.
Thank you
Hi @Olbu ,
We haven't heard back from you since the last response and were just checking whether your query got resolved. Otherwise, please reply with more details and we will try to help.
If there are no other ideas and solutions, this post could be closed.
Hi @Olbu ,
Glad to know that you got some insights regarding your query. Please continue using the Fabric Community for any help with your queries.