
bw_chec
Helper I

Data Lakehouse Modelling Question

I'm looking for some clarity on how I am modelling my medallion lakehouse.

 

I have a source MySQL server with many tables whose rows are frequently modified (think accumulating snapshot). It isn't modelled well: rows are both inserted and updated in every table (each has created and modified fields). The server is slow, so loading into my Bronze layer with an incremental query, e.g. created > <max_bronze_timestamp>, is not feasible.

 

I am overwriting the whole bronze layer each night (~4 minutes for 200-300 GB).

I am modelling the Silver layer in a better way, merging/unioning some tables, etc., so it isn't a straightforward ETL process.

To go from bronze -> silver I am thinking of something like:

 

- Storing a timestamp in a file / getting the max last_updated datetime from the silver tables

- Getting the changes in each bronze table using created > max time OR modified > max time

- Doing the ETL on just these changes and merging them into the silver tables (rough sketch below)
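
Roughly what I have in mind, as a sketch (PySpark in a Fabric notebook; bronze.appointments, silver.appointments, the key and the column names are just placeholders):

    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    # 1) Watermark: the latest change already processed into silver
    #    (the first load / empty table case would need handling separately)
    watermark = (spark.table("silver.appointments")
                 .agg(F.max("last_updated"))
                 .collect()[0][0])

    # 2) Pull only the bronze rows created or modified since then
    changes = (spark.table("bronze.appointments")
               .filter((F.col("created") > watermark) | (F.col("modified") > watermark)))

    # ... transformations / unions with other bronze tables would go here ...

    # 3) Merge the changed rows into the silver table on the business key
    silver = DeltaTable.forName(spark, "silver.appointments")
    (silver.alias("s")
     .merge(changes.alias("c"), "s.appointment_id = c.appointment_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())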

 

My questions are:

- Is overwriting the bronze layer the right approach?

- Should I check every single table in bronze for changes every day?

- Should the software team build an event-sourcing solution that sends events whenever anything in the source system changes? That would remove the need for the above, but it would take time I don't currently have, although it could be swapped in later.

 

1 ACCEPTED SOLUTION

I don't know whether using Copy activity with a Delta table that has CDF enabled will work; that's something to test for sure. And yes, if there is a big bulk movement of data, CDF isn't really going to be much use anyway.
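
In case it helps with that testing, enabling CDF on a lakehouse Delta table and reading the change feed from a notebook looks roughly like this (a minimal sketch; the table name and starting version are made up):

    # Enable Change Data Feed on an existing Delta table
    spark.sql("""
        ALTER TABLE bronze.appointments
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Read the row-level changes (inserts/updates/deletes) recorded since version 5
    changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 5)
               .table("bronze.appointments"))
    changes.show()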

 

IMHO you won't need to use Type 2 in the raw/cleansed layer (I assume you mean a slowly changing dimension?), as you can handle that further down the pipeline in your dimensional model. You just need to be able to identify the changed records so you can apply the appropriate transformations downstream.
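
If a Type 2 history is needed later in the dimensional model, a common pattern is a two-step Delta operation downstream: close out the current row, then append the new version. A rough sketch (PySpark; gold.dim_appointment and all column names are hypothetical, and the incoming set is assumed to hold one latest row per key):

    from pyspark.sql import functions as F

    # Changed records identified upstream (hypothetical table)
    updates = spark.table("silver.appointments")
    updates.createOrReplaceTempView("appointment_updates")

    # Step 1: expire the current dimension row for every incoming key
    spark.sql("""
        MERGE INTO gold.dim_appointment AS d
        USING appointment_updates AS u
          ON d.appointment_id = u.appointment_id AND d.is_current = true
        WHEN MATCHED THEN UPDATE SET
          is_current = false,
          valid_to   = current_timestamp()
    """)

    # Step 2: append the new versions as the current rows
    (updates
     .withColumn("is_current", F.lit(True))
     .withColumn("valid_from", F.current_timestamp())
     .withColumn("valid_to", F.lit(None).cast("timestamp"))
     .write.format("delta").mode("append").saveAsTable("gold.dim_appointment"))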

 

I think the crux here is being able to load incrementally from the MySQL box. If that isn't possible, then I don't know any other way to identify changed records than to bulk load the data into the Fabric Lakehouse and work out what's changed using the date/time stamps. You could possibly store the last-updated timestamp for each table in a custom metadata table in the lakehouse; then, after the bulk load into the raw layer, query each table using that timestamp and load only the changed rows into the silver layer.
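
Something like the metadata-table idea could look like this (a rough sketch; the meta.etl_watermarks table, its columns and the source table list are all made up for illustration):

    from pyspark.sql import functions as F

    # Hypothetical metadata table: one row per source table holding the last
    # timestamp already processed into silver
    spark.sql("""
        CREATE TABLE IF NOT EXISTS meta.etl_watermarks (
            table_name  STRING,
            last_loaded TIMESTAMP
        ) USING DELTA
    """)

    def get_watermark(table_name):
        # Default far in the past so the first run picks up everything
        rows = (spark.table("meta.etl_watermarks")
                .filter(F.col("table_name") == table_name)
                .collect())
        return rows[0]["last_loaded"] if rows else "1900-01-01"

    def set_watermark(table_name, new_value):
        spark.sql(f"""
            MERGE INTO meta.etl_watermarks t
            USING (SELECT '{table_name}' AS table_name,
                          TIMESTAMP'{new_value}' AS last_loaded) s
              ON t.table_name = s.table_name
            WHEN MATCHED THEN UPDATE SET last_loaded = s.last_loaded
            WHEN NOT MATCHED THEN INSERT *
        """)

    for table in ["appointments", "patients"]:   # hypothetical source tables
        wm = get_watermark(table)
        changed = (spark.table(f"bronze.{table}")
                   .filter((F.col("created") > wm) | (F.col("modified") > wm)))
        # ... transform `changed` and merge it into the matching silver table ...
        new_wm = changed.agg(F.max(F.greatest("created", "modified"))).collect()[0][0]
        if new_wm is not None:
            set_watermark(table, new_wm)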

 

A blog on CDF, in case any of it helps in this situation...

Delta Lake Change Data Feed in Fabric Lakehouses (serverlesssql.com)


3 REPLIES
AndyDDC
Super User

Hi @bw_chec, it would certainly be good to optimise that bronze/raw layer load, so my thoughts on that are:

 

  • Is there any indexing on the relevant date columns in the MySQL db so that only the data that has changed can be selected?
  • Is there any way to identify only the tables that have changed data (thereby limiting what you're reading)?
  • I assume there's no possibility of increasing the power of the compute MySQL is running on?

Ideally you don't want to drop/recreate the bronze layer for each load (it should be appended to). How are you overwriting the bronze layer (Lakehouse tables?)

Thanks for your response @AndyDDC,

 

- There is no indexing on the MySQL side on the datetime columns, only on the PKs.

- The only way I can think of is to do a comparison on the dates, or to have events fired from that system whenever something changes.

- No, we can't bump up the MySQL compute.

 

I don't want to be overwriting the bronze layer every day, as I can't put CDF on this layer when overwriting via Copy activity (or at least it seemed to fail when I tried). Separate note: if I use Copy activity with append, would the tables be allowed to have CDF enabled?

The data in the system isn't append-only though; the rows change often (e.g. appointment status). Should I append with Type 2 then? That feels like a lot of heavy lifting in the ingestion phase, which the MySQL server may struggle to handle.

 

Keen to hear your thoughts on a solution.

 

Thanks, 

BW

