Hello,
Instead of creating a separate dataflow for each table, I am looking for a performant, data-driven way to transfer data generically from an on-premises SQL Server to MS Fabric.
I have a list of many tables on the on-premises SQL Server 2019 (version 2022 in about two months) and I want to copy them to Fabric Delta/Parquet files (overwrite).
My experience so far:
- Pipelines can be parameterized, but don't support on-premises data sources
- Dataflow Gen 2 can't parameterize the destination table
- I was not able to create "Delta"/Parquet files on premises with Python that were accepted by a Lakehouse.
Not tested yet (may be possible):
- ODBC/OLE DB connection to a Warehouse (Fabric) and push data with SSIS?
Every idea is welcome (Dataflow Gen 2, Pipeline, Python, Spark, PowerShell (on premises), ...). Performance/compression/reliability is important because of the amount of data.
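To make the goal concrete, here is a minimal sketch of the metadata-driven pattern I have in mind: one generic job that is fed a plain list of table names and derives the per-table query and destination from it. All table names, the schema tuples, and the `Files/staging` path are placeholders, not anything Fabric-specific.

```python
# Minimal sketch of a metadata-driven export driver: one generic job,
# parameterized by table name. Table list and paths are placeholders.

def build_work_items(tables, lakehouse_root="Files/staging"):
    """Build one (query, destination) pair per (schema, table) tuple."""
    items = []
    for schema, table in tables:
        items.append({
            "table": f"{schema}.{table}",
            "query": f"SELECT * FROM [{schema}].[{table}]",
            "destination": f"{lakehouse_root}/{schema}/{table}.parquet",
        })
    return items

if __name__ == "__main__":
    work = build_work_items([("dbo", "Customers"), ("sales", "Orders")])
    for item in work:
        print(item["table"], "->", item["destination"])
```

Adding a table would then mean adding one row to the metadata list, not building a new dataflow.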
Many thanks,
Oliver
Hi @Olbu, yes, Data Factory uses a "Self-Hosted Integration Runtime" that you install on your on-prem infrastructure (e.g. a Windows server/machine), which creates a tunnel from the Data Factory service to the server you have installed it on.
Create a self-hosted integration runtime - Azure Data Factory & Azure Synapse | Microsoft Learn
You'll need to make sure that if you have any outbound port restrictions in place, you allow access to these ports.
You also need to follow these instructions if you want to create Parquet files from your on-prem data sources Parquet format - Azure Data Factory & Azure Synapse | Microsoft Learn
"you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR machine"
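As a rough sketch of what the parameterized, per-table copy could look like as a Data Factory pipeline (all names are illustrative; dataset references, the integration runtime binding, and most runtime settings are omitted):

```json
{
  "name": "CopyAllTables",
  "properties": {
    "parameters": { "tableList": { "type": "array" } },
    "activities": [
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
          "activities": [
            {
              "name": "CopyTableToParquet",
              "type": "Copy",
              "typeProperties": {
                "source": {
                  "type": "SqlServerSource",
                  "sqlReaderQuery": {
                    "value": "SELECT * FROM [@{item().schema}].[@{item().table}]",
                    "type": "Expression"
                  }
                },
                "sink": { "type": "ParquetSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

The `tableList` parameter could come from a Lookup activity against a metadata table, so one pipeline covers every table.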
At the moment, the process in Dataflows Gen 2 requires the explicit definition of a specific destination on a per-query basis, where one query sinks the data to a single table. Any suggestions beyond this would be great to post on the ideas site at aka.ms/FabricIdeas
I'm moving this to the Pipelines forum to check if there are any suggestions from the pipelines side.
Hi @Olbu,
I wrote a blog post some time back on getting my on-premises SQL Server data into a Lakehouse in Fabric.
https://datavibe.co.uk/getting-on-premises-sql-server-data-to-fabric-lakehouse/
Let me know if this helps.
Hello @Pragati11,
thanks for your reply and your blog, but I already know how to get data from an on-premises SQL Server to a Lakehouse with Dataflow Gen 2.
I don't want to create an explicit dataflow for every table. I am searching for a way to do this data-driven: one job that can be parameterized by a table name. For every table name, this job would, for example:
a) copy the on-premises data to the Lakehouse/Warehouse
b) merge the data into existing tables
c) delete the data from (a)
plus logging and error handling, which would be the same for every table. Changes on premises should be propagated automatically to Fabric.
I have solved (b) and also (c) (as soon as an existing pipeline bug is fixed). So I am looking for parameterizable options for (a), and I want to avoid creating hundreds of static dataflows.
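For step (b), the statement itself can also be generated from metadata, so the same job works for every table. This is only a sketch of the idea; the table, staging, and column names are illustrative, and column lists would come from the metadata table in practice:

```python
# Sketch: compose the T-SQL MERGE for step (b) from metadata, so one
# generic job handles every table. All names below are illustrative.

def build_merge_sql(target, staging, key_columns, columns):
    """Build a MERGE statement from target/staging names and column lists."""
    on = " AND ".join(f"t.[{c}] = s.[{c}]" for c in key_columns)
    update = ", ".join(f"t.[{c}] = s.[{c}]" for c in columns if c not in key_columns)
    insert_cols = ", ".join(f"[{c}]" for c in columns)
    insert_vals = ", ".join(f"s.[{c}]" for c in columns)
    return (
        f"MERGE {target} AS t USING {staging} AS s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {update} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals});"
    )

if __name__ == "__main__":
    print(build_merge_sql("dbo.Customers", "staging.Customers",
                          ["CustomerID"], ["CustomerID", "Name", "City"]))
```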
Hi @Olbu ,
As I understand it, you would like to create a common connection to on-prem and execute this job with parameters by table name.
Currently this is not supported by Fabric.
As per the Fabric release plan, this feature will be available during Q1 2024 (estimated release) - Link
Appreciate your patience.
Hope this helps. Please let us know if you have any further issues.
Hello @v-gchenna-msft ,
are you sure that this feature would allow a pipeline to get data from an on-premises data source? I'm not sure about that. It seems to let me parameterize the connection, but the connection type itself has to be supported (for example, connecting to different blob storages by parameter).
More likely this feature would help me:
But I'm looking for options available today and hoped someone had a good idea, for example a blog post where Parquet files are created on premises in PowerShell (or similar) and uploaded by script, which then triggers a notebook in Fabric, or maybe there are some possibilities with Spark. I can't imagine that I'm the only one who wants to transfer data from on-premises sources without building explicit jobs for every table.
Many thanks
Oliver
One thing I could suggest is using the Data Factory service in Azure. I know it's not part of the Fabric suite, but once the on-prem data pipeline functionality is available in Fabric, it shouldn't be much effort to port over the Data Factory version of the pipeline.
I would say that you could pump the data into the Files area of a Fabric Lakehouse in Parquet format using the Copy Data activity.
@AndyDDC Many thanks for your answer. Any idea is welcome! Does the "Data Factory service in Azure" support the on-premises enterprise gateway? (I have to ask because we have no VNet to Azure.)
For testing, yesterday I created a little C# program which accepts a SQL query and creates a Parquet file. Unfortunately it isn't the fastest. So my current idea is:
Source, on-premises SQL Server 2019 query: 63 columns (various data types), 10,597,630 rows
=> Result in Lakehouse\Tables: ~664 MB after about 15 minutes in total
This could be developed completely data-driven. Currently I have no idea how I could trigger a "Notebook" or "Pipeline" after uploading new files, but this should be possible.
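For the triggering part, one option could be the Fabric REST API for running an item job on demand. The exact endpoint shape and the `RunNotebook` job type below are my assumption from the public API surface, not something I have verified here, and the workspace/notebook IDs are placeholders; the actual call is a POST with an AAD bearer token.

```python
# Sketch: build the URL for triggering a Fabric notebook run on demand.
# Endpoint shape and jobType value are assumptions; IDs are placeholders.

FABRIC_API = "https://api.fabric.microsoft.com/v1"

def notebook_run_url(workspace_id, notebook_id):
    """Build the on-demand job URL for a notebook item (POST with bearer token)."""
    return (f"{FABRIC_API}/workspaces/{workspace_id}"
            f"/items/{notebook_id}/jobs/instances?jobType=RunNotebook")

if __name__ == "__main__":
    print(notebook_run_url("<workspace-guid>", "<notebook-guid>"))
```

The upload script could call this right after the last Parquet file lands in the Lakehouse Files area.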
For comparison, I created a static "Dataflow Gen 2" job which retrieves the same amount of data; after 80(!) minutes of refreshing, it threw an error. I'll try it again, but at least my "complex" data-driven-capable job (3 steps) seems to be a lot faster, even though my C# program isn't very fast.
Compressing the data locally doesn't seem like the worst idea to me. Dataflow Gen 2 seems to transfer the data uncompressed; I have no other explanation for why it is so much slower.
Hi @Olbu ,
We haven't heard back from you since the last response and were just checking whether your query got resolved. Otherwise, please reply with more details and we will try to help.
Thank you
Hi @Olbu ,
We haven't heard back from you since the last response and were just checking whether your query got resolved. Otherwise, please reply with more details and we will try to help.
If there are no other ideas and solutions, this post could be closed.
Hi @Olbu ,
Glad to know that you got some insights regarding your query. Please continue using the Fabric Community for any help with your queries.