Hi. We are intending to move to MS Fabric and are running a number of tests. My first test is importing 1000+ tables from our source system. I want to make all our imports metadata-driven.
Note that our primary source systems are still hosted on-premises and need to be accessed via a gateway.
I've been testing with pipelines to import data, but that is unsatisfactory. Importing my first set of tables drained my entire capacity, and I'm only importing a fraction of what my daily load would be.
I'm pretty sure I could build a nice script to import the data via notebooks, but notebooks don't appear to connect to the gateway.
I'm trying to achieve what is below.
- What tools in Fabric would you use to achieve these goals? I'm really looking for tips so I can build a very solid foundation for this design.
- Notice that I'm trying to do an outer join on 2 tables, I'm trying to import incrementally where possible using 2 methods, and I'm trying to merge the output table. (The 20 databases I use as sources share identical data models and should be merged in our data warehouse.)
Hi @m_vandermeij,
To accomplish this, begin by separating concerns among ingestion, storage, transformation, and orchestration. For ingestion, use Dataflows Gen2 along with the Fabric Gateway to connect to your local SQL sources. You can parameterize Dataflows with metadata, which removes the need to hardcode table names or load logic, making the ingestion process fully metadata-driven. For storage, load the data into Delta tables within a Lakehouse staging area. Delta tables support ACID transactions, schema evolution, and efficient MERGE INTO operations, which are important for incremental loading.
For transformation, utilize Notebooks (PySpark or SQL). This allows you to loop through metadata tables, apply incremental logic (such as importing where ID > last ID or syscreated/sysmodified > last datetime), and perform joins or merges. Notebooks can also be used to combine data from multiple databases into a single company-level table. For orchestration, use Pipelines solely for coordination. Pipelines should trigger dataflows and notebooks, but avoid creating one activity per table, as this can lead to capacity problems. Instead, have the pipeline iterate through your metadata list (GetList) to dynamically manage ingestion and processing.
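The incremental logic described above can be sketched in plain Python. The metadata fields (`table_name`, `watermark_column`, `watermark_type`, `last_value`) and the helper function are illustrative assumptions, not Fabric APIs; adapt them to your own metadata table:

```python
# Sketch: derive an incremental extraction query per table from a metadata row.
# Field names here are assumptions for illustration, not a Fabric schema.

def build_incremental_query(meta: dict) -> str:
    """Return a SELECT with an incremental filter, or a full load if no watermark."""
    base = f"SELECT * FROM {meta['table_name']}"
    wm_col = meta.get("watermark_column")
    last = meta.get("last_value")
    if wm_col is None or last is None:
        return base  # first run, or a table without a usable watermark: full load
    if meta.get("watermark_type") == "datetime":
        return f"{base} WHERE {wm_col} > '{last}'"   # syscreated/sysmodified style
    return f"{base} WHERE {wm_col} > {last}"         # numeric ID style

metadata = [
    {"table_name": "dbo.Orders", "watermark_column": "sysmodified",
     "watermark_type": "datetime", "last_value": "2024-01-01T00:00:00"},
    {"table_name": "dbo.Items", "watermark_column": "ID",
     "watermark_type": "int", "last_value": 10500},
    {"table_name": "dbo.Settings"},  # small table: always full load
]

for meta in metadata:
    print(build_incremental_query(meta))
```

A notebook can loop over the metadata table this way and hand each generated query to the ingestion step, updating `last_value` after each successful load.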
When applying this pattern, it is important to separate ingestion and transformation: use Dataflows for ingestion through the gateway and notebooks for processing within Fabric. To improve your metadata tables, consider adding fields such as load frequency (e.g., daily, hourly), last successful load time (to facilitate delta extraction), and partitioning strategy (for large tables). Whenever possible, push incremental filters upstream, but utilize Delta MERGE in the Lakehouse to consolidate and deduplicate data. During testing, carefully monitor Fabric capacity; you may need to batch tables or increase capacity to manage peak ingestion periods.
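The Delta MERGE step mentioned above can be generated from the same metadata. This is a minimal sketch that builds the MERGE statement as a string; the table and column names are hypothetical, and in a Fabric notebook you would run the result with `spark.sql(merge_sql)`:

```python
# Sketch: build a Delta MERGE INTO statement for upserting a staged batch into
# the consolidated table. Names are illustrative, not from a real workspace.

def build_merge_sql(target: str, source: str, keys: list[str], cols: list[str]) -> str:
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    insert_cols = ", ".join(keys + cols)
    insert_vals = ", ".join(f"s.{c}" for c in keys + cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON {on} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

merge_sql = build_merge_sql(
    target="lakehouse.orders",
    source="staging.orders_batch",
    keys=["source_db", "order_id"],   # composite key: 20 identical source databases
    cols=["amount", "sysmodified"],
)
# In a Fabric notebook: spark.sql(merge_sql)
```

Including the source database in the merge key keeps rows from the 20 identically-modelled databases from overwriting one another in the consolidated table.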
Playbook for metadata driven Lakehouse implementation in Microsoft Fabric | Microsoft Fabric Blog | ...
Thank you.
Hi @m_vandermeij,
Just wanted to check regarding your question. We haven’t heard back and want to ensure you're not stuck. If you need anything else or have updates to share, we’re here to help!
Thank you.
Hi @m_vandermeij,
We wanted to follow up since we haven't heard back from you regarding our last response. We hope your issue has been resolved.
Thank you.
Hi,
I have been looking at Dataflows Gen2, but that mainly seems to be built on Power Query.
My initial import is from about 20 databases, each with a set of about 50 tables. So the input is around 1000 tables, with an output of around 50 tables.
The Dataflow Gen2 method appears to work through Power Query, and that keeps trying to show previews. That seems very badly suited for my purpose. Or am I mistaken here?
Hi @m_vandermeij,
You’re right that Dataflows Gen2 are built on Power Query, and the preview feature can add overhead, especially with large-scale data ingestion.
For a scenario like yours, with over 1000 tables across 20 databases, relying only on Dataflows for initial bulk loads may not be the most efficient. Still, Dataflows Gen2 are helpful when you need gateway connectivity to on-premises SQL sources and want to use parameterization for ingestion.
In cases like this, a hybrid approach is usually recommended:
Use Dataflows Gen2 with Fabric Gateway for secure ingestion from on-premises systems into your Lakehouse staging area. You can parameterize queries to reduce preview impact and let orchestration manage batch execution, rather than creating a separate dataflow for each table.
After loading raw data into the Lakehouse (Delta tables), use Notebooks (PySpark/SQL) for more complex processing, such as iterating over metadata, applying incremental logic, and merging tables from your 20 sources into 50 output tables. This avoids the limitations of the Power Query UI.
Use Pipelines to orchestrate the workflow triggering ingestion first, then notebook-based transformations. This keeps ingestion and processing separate and helps manage capacity.
To reduce preview issues in Dataflows Gen2, you can disable “Enable Data Profiling” or use parameterized queries that return minimal sample data. For very large table imports, consider using a direct SQL-based copy with Pipelines Copy Data activity and gateway as an alternative if Dataflows previews are too limiting.
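The notebook-side consolidation step described above (merging the 20 identically-modelled databases into company-level tables) can be sketched in plain Python. At scale this would run as PySpark over Delta tables; the row shape and key names here are illustrative assumptions:

```python
# Sketch: consolidate rows from several identically-modelled source databases
# into one company-level table, keeping only the latest version per business key.
# Pure-Python illustration of the logic a PySpark notebook would apply at scale.

def consolidate(batches: dict[str, list[dict]], key: str, version: str) -> list[dict]:
    """batches maps source-db name -> rows; dedupe on (source_db, key), keeping the max version."""
    latest: dict[tuple, dict] = {}
    for db, rows in batches.items():
        for row in rows:
            tagged = {**row, "source_db": db}  # remember which database the row came from
            k = (db, row[key])
            if k not in latest or tagged[version] > latest[k][version]:
                latest[k] = tagged
    return list(latest.values())

batches = {
    "company01": [{"order_id": 1, "amount": 10, "sysmodified": "2024-05-01"},
                  {"order_id": 1, "amount": 12, "sysmodified": "2024-06-01"}],
    "company02": [{"order_id": 1, "amount": 99, "sysmodified": "2024-04-01"}],
}
rows = consolidate(batches, key="order_id", version="sysmodified")
# Each company keeps its own order 1; company01 keeps only the newest version.
```

Tagging each row with its source database before merging is what lets the 50 output tables hold all 20 companies side by side without key collisions.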
Thank you.
We have successfully made a dataflow and pipeline.
The pipeline now feeds the table list and database list to the dataflow.
We're currently having problems with storing the data to the lakehouse, though.
It is generating errors which I do not expect.
I've been searching for answers. Strangely, the responses start asking questions about firewalls.
But I have been able to load a set of tables directly via pipelines from source gateway to lakehouse files and tables.
I have no clue why this would fail, as I thought browsing your lakehouse is something purely on the cloud side...
Hi @m_vandermeij,
Thank you for the update.
Based on your description, the pipeline and dataflow setup are functioning as expected, but the issue arises when the Dataflow attempts to write to the Lakehouse. Since you can load tables directly from the gateway to the Lakehouse using a pipeline, firewall restrictions are unlikely to be the cause: the Lakehouse is a cloud service, so on-premises firewalls do not apply to the write step.
To address this, I recommend first testing the Dataflow output by writing to Lakehouse files (Parquet). If successful, this suggests the issue lies with table schema mapping.
Also, you can use Dataflows for raw ingestion only and perform transformations in the Lakehouse using Notebooks. For large-scale data loads, especially with over 1000 tables, it is generally more effective to use Pipelines with the gateway for raw data ingestion, and then manage transformations within Fabric using Notebooks or SQL.
Thank you.