<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Gen2 df external table in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683550#M9129</link>
    <description>&lt;P&gt;[Edited by admin for unnecessary tagging without context]&lt;/P&gt;
&lt;P&gt;A widely preferred pattern for lakehouse data engineering, for us, has been the creation of external Delta tables. This is only possible for data sources that can be consumed from a notebook.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, some data sources exist beyond that reach, and the only alternative for them is a Gen2 dataflow. But a Gen2 dataflow only inserts into a lakehouse table; is there any way to insert into a chosen lakehouse subfolder instead of a table?&lt;/P&gt;
&lt;P&gt;I don’t think this is doable now. If that is the case, is it on the cards?&lt;/P&gt;</description>
    <pubDate>Wed, 07 May 2025 20:55:27 GMT</pubDate>
    <dc:creator>smpa01</dc:creator>
    <dc:date>2025-05-07T20:55:27Z</dc:date>
    <item>
      <title>Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683550#M9129</link>
      <description>&lt;P&gt;[Edited by admin for unnecessary tagging without context]&lt;/P&gt;
&lt;P&gt;A widely preferred pattern for lakehouse data engineering, for us, has been the creation of external Delta tables. This is only possible for data sources that can be consumed from a notebook.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;However, some data sources exist beyond that reach, and the only alternative for them is a Gen2 dataflow. But a Gen2 dataflow only inserts into a lakehouse table; is there any way to insert into a chosen lakehouse subfolder instead of a table?&lt;/P&gt;
&lt;P&gt;I don’t think this is doable now. If that is the case, is it on the cards?&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 20:55:27 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683550#M9129</guid>
      <dc:creator>smpa01</dc:creator>
      <dc:date>2025-05-07T20:55:27Z</dc:date>
    </item>
    <item>
      <title>Re: Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683709#M9132</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/24978"&gt;@smpa01&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You are right, Dataflow Gen2 currently supports writing data only to Lakehouse tables, not to specific subfolders.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One thing you can try is to use Dataflow Gen2 to land the data in a staging table, then read that staging table from a notebook and write it to your desired location.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 21:42:32 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683709#M9132</guid>
      <dc:creator>nilendraFabric</dc:creator>
      <dc:date>2025-05-07T21:42:32Z</dc:date>
    </item>
    <item>
      <title>Re: Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683719#M9133</link>
      <description>&lt;P&gt;Honestly, that's too much to maintain: pre_bronze -&amp;gt; bronze -&amp;gt; silver, and so on.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Dataflows have an advantage over notebooks when it comes to connecting to certain sources that don't have equivalent connectors available in notebooks — for example, on-premises SQL Server, SharePoint, etc. In such cases, there is no alternative but to use a dataflow.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Currently, dataflows remain relevant largely because of this connector gap. So, for writing to a destination, it only makes sense that Dataflow Gen2 provides the same options as a notebook does.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To keep up with the norm, Dataflow Gen2 must offer the ability to write to subfolders. After all, any bronze data should land in Files for audit trailing.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 22:04:46 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683719#M9133</guid>
      <dc:creator>smpa01</dc:creator>
      <dc:date>2025-05-07T22:04:46Z</dc:date>
    </item>
    <item>
      <title>Re: Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683739#M9134</link>
      <description>&lt;P&gt;I'm curious, what are the benefits of writing to files instead of just appending to a lakehouse bronze delta table?&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 22:52:35 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683739#M9134</guid>
      <dc:creator>frithjof_v</dc:creator>
      <dc:date>2025-05-07T22:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683772#M9135</link>
      <description>&lt;H3&gt;&lt;STRONG&gt;Why External Tables Are Ideal for the Bronze Layer in Production Data Lakes (based on my practical experience in data engineering and serving BI)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;In a well-architected Data Lake, data flows through three layers:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Bronze&lt;/STRONG&gt; (Raw Ingestion),&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Silver&lt;/STRONG&gt; (Cleaned &amp;amp; Enriched),&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Gold&lt;/STRONG&gt; (Curated Business Data with Semantic Models).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The &lt;STRONG&gt;Bronze Layer&lt;/STRONG&gt; is where raw data from various sources like on-prem SQL, SharePoint, Azure SQL, Oracle, APIs, and Databricks is ingested. Using &lt;STRONG&gt;external tables&lt;/STRONG&gt; for this layer is highly advantageous for the following reasons:&lt;/P&gt;
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;1. Data Persists Beyond Table Lifetime&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;External tables store data separately from the metadata, so &lt;STRONG&gt;dropping the table does not delete the data&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;This ensures raw ingested data is always available for reprocessing or auditing.&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;2. Easy Table Rebuilds Without Re-ingestion&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Since the data remains in the storage layer, you can &lt;STRONG&gt;recreate the table schema at any time&lt;/STRONG&gt; without fetching the source data again.&lt;/LI&gt;
&lt;LI&gt;This is crucial for schema adjustments or optimization without risking data loss.&lt;/LI&gt;
&lt;/UL&gt;
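&lt;P&gt;As a minimal sketch of the rebuild step above (the table and path names are hypothetical; in a Fabric notebook the resulting DDL would be passed to spark.sql):&lt;/P&gt;

```python
# Sketch: re-register an external Delta table over data that already
# lives under the lakehouse Files area. Names are illustrative only.

def external_table_ddl(table_name: str, files_path: str) -> str:
    """Build DDL that recreates the table's metadata without touching
    or re-ingesting the underlying Delta files."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{files_path}'"
    )

ddl = external_table_ddl("bronze_orders", "Files/bronze/orders")
# In a Fabric notebook: spark.sql(ddl)
```

&lt;P&gt;Because the data stays in place, this DDL can be rerun after a drop without any re-ingestion.&lt;/P&gt;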
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;3. Multiple Silver/Gold Views from the Same Data&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;External tables allow you to build &lt;STRONG&gt;multiple transformations&lt;/STRONG&gt; (Silver/Gold) from the same Bronze data.&lt;/LI&gt;
&lt;LI&gt;This eliminates redundancy and maintains a &lt;STRONG&gt;single source of truth&lt;/STRONG&gt; for different business units like Finance, Procurement, Leasing, and Engineering.&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;4. Flexible Backfills and Schema Evolutions&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Adding new columns, adjusting schemas, and performing historical backfills are seamless.&lt;/LI&gt;
&lt;LI&gt;You can introduce new attributes for all past, present, and future data without re-ingesting or dropping the table.&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;5. Enhanced Audit Traceability&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Every row can be traced back to its &lt;STRONG&gt;original source file&lt;/STRONG&gt; or API batch.&lt;/LI&gt;
&lt;LI&gt;This provides clear visibility into when and where data was ingested — critical for regulatory compliance and debugging.&lt;/LI&gt;
&lt;/UL&gt;
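&lt;P&gt;A minimal sketch of such lineage tagging (the column names _source_file and _ingested_at are assumptions, not a Fabric convention; adapt them to your schema):&lt;/P&gt;

```python
# Sketch: stamp every ingested row with its originating file and an
# ingestion timestamp so it can be traced back to its source batch.
from datetime import datetime, timezone

def tag_with_lineage(rows, source_file):
    """Return copies of rows with audit columns attached."""
    stamp = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_source_file": source_file, "_ingested_at": stamp}
        for row in rows
    ]

tagged = tag_with_lineage([{"id": 1}], "orders_2025-05-07.csv")
```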
&lt;HR /&gt;
&lt;H3&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;External tables in the Bronze layer offer:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Data safety&lt;/STRONG&gt; beyond table lifecycle&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Rebuild flexibility&lt;/STRONG&gt; without re-fetching data&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-view capability&lt;/STRONG&gt; for different business requirements&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Smooth schema evolution&lt;/STRONG&gt; and backfills&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Full audit traceability&lt;/STRONG&gt; for compliance and debugging&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This design pattern forms the backbone of a resilient, scalable, and auditable Data Lake architecture.&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2025 00:37:05 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683772#M9135</guid>
      <dc:creator>smpa01</dc:creator>
      <dc:date>2025-05-08T00:37:05Z</dc:date>
    </item>
    <item>
      <title>Re: Gen2 df external table</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683878#M9136</link>
      <description>&lt;P&gt;At the moment, Dataflow Gen2 only loads data to tables. Do feel free to suggest new destinations (and formats) in the Fabric Ideas portal (&lt;A href="https://aka.ms/FabricIdeas" target="_blank"&gt;https://aka.ms/FabricIdeas&lt;/A&gt;)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;An alternative is to leverage the copy activity or a copy job. This fits especially well because the bronze layer typically holds files in their raw state; no transformation should be performed at that layer, so a simple copy activity should be good enough. If a connector is missing from the copy job / copy activity, would you mind letting us know what the source is? You can also post a new idea for such a connector in the Ideas Portal.&lt;/P&gt;</description>
      <pubDate>Thu, 08 May 2025 03:53:03 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Gen2-df-external-table/m-p/4683878#M9136</guid>
      <dc:creator>miguel</dc:creator>
      <dc:date>2025-05-08T03:53:03Z</dc:date>
    </item>
  </channel>
</rss>

