Hi Experts!
Very, very new to Fabric and eager to learn!
I am currently building out my first ETL workflow. The problem I am dealing with is how to pass metadata to Dataflow Gen2. My research says this is not currently possible (though it may be in the future). One potential workaround I have read about is writing the metadata to a table that Dataflow Gen2 can then read; however, I have also read that writing metadata to a Lakehouse is not currently supported, only to a Warehouse. Finally, a lot of people are bypassing Dataflow Gen2 entirely and deploying Notebook scripts, of which I am woefully ignorant. Basically, a lot of gotchas.
Therefore, I am at a complete loss as to how to integrate Dataflow Gen2 into pipelines. Can someone please point me to a guide, or explain how to pass a list of filenames from a Get Metadata activity to a Dataflow Gen2 activity? It seems simple enough for such a basic, yet critical, step, and without it I am unable to build out my ETL workflow.
Thanks in advance!
Here is a solution which includes a simple notebook.
I know you didn't want a notebook; however, this is what I came up with...
I asked ChatGPT to help me with the code, which made it a lot easier.
My variable is similar to yours:
My Notebook activity:
My Notebook looks like this:
I have connected my Notebook to a default Lakehouse. This is indicated by the pin symbol.
This is called "attaching a default Lakehouse", and you can do it easily in the user interface. When you see the pin symbol, it's set up correctly. You can also work without a default Lakehouse, but then you need to add some more code.
The notebook creates a file, which I just called output.json. Every time I run the pipeline, the file content gets overwritten.
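Here is a minimal sketch of what such a notebook could look like (assuming the Notebook activity passes the pipe-separated string in a base parameter; the parameter name filenames and the exact output path are assumptions):

# Parameters cell: the Notebook activity's base parameters overwrite this value.
# The parameter name "filenames" is an assumption; use whatever name you map
# the pipeline variable to.
filenames = "File1.csv|File2.csv|File3.csv"

import json

# With a default Lakehouse attached (the pin symbol), its Files section is
# mounted at /lakehouse/default/Files. Without a default Lakehouse you would
# need to write to the full abfss:// path instead.
output_path = "/lakehouse/default/Files/output.json"

# Write a record with a single "value" field holding the pipe-separated list,
# which is the shape the Dataflow Gen2 query below expects. Rerunning the
# pipeline simply overwrites the file.
with open(output_path, "w") as f:
    json.dump({"value": filenames}, f)

In the pipeline, you would then map your filename variable to this base parameter in the Notebook activity's settings.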
My Dataflow Gen2 connects to this output.json file:
let
    // Connect to the Lakehouse and navigate down to the output.json file
    Source = Lakehouse.Contents([]),
    #"Navigation 1" = Source{[workspaceId = "<EnterWorkspaceIdHere>"]}[Data],
    #"Navigation 2" = #"Navigation 1"{[lakehouseId = "<EnterLakehouseIdHere>"]}[Data],
    #"Navigation 3" = #"Navigation 2"{[Id = "Files", ItemKind = "Folder"]}[Data],
    #"Navigation 4" = #"Navigation 3"{[Name = "output.json"]}[Content],
    // Parse the JSON and keep only the "value" field that the notebook wrote
    ToJson = Json.Document(#"Navigation 4"),
    #"Converted to table" = Record.ToTable(ToJson),
    #"Filtered rows" = Table.SelectRows(#"Converted to table", each ([Name] = "value")),
    #"Removed other columns" = Table.SelectColumns(#"Filtered rows", {"Value"}),
    // Split the pipe-separated string into one row per filename
    #"Split column by delimiter" = Table.ExpandListColumn(
        Table.TransformColumns(
            Table.TransformColumnTypes(#"Removed other columns", {{"Value", type text}}),
            {{"Value", Splitter.SplitTextByDelimiter("|"), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}
        ),
        "Value"
    )
in
    #"Split column by delimiter"
Here are some workarounds if you still don't want to use a notebook. The post titles mention the Web activity, but the same principles should apply even if you're not using a Web activity as your source.
Solved: How to save output from a web activity? - Microsoft Fabric Community
How to save complete web activity output to ADLS Gen2? - Microsoft Q&A
Thank you so much for your reply. As you guessed, I really am trying to avoid Notebooks. Therefore, I have come up with a simpler solution, but I am still running into trouble. I will make a new post to better describe it. Your expertise is very much appreciated, so please keep a look out for it! 🙂
Did you take a look at the workarounds I mentioned (the links) at the bottom of my long comment? These workarounds don't use a Notebook, only the Data Pipeline.
Also, consider voting for the Ideas mentioned, to highlight the need for native support for cases like this in Data Pipelines.
Hi, yes, I did look at those links and even tried to set up a Web activity, but I couldn't figure it out. So, I decided to go with your Notebook solution.
If you are able to implement one of those Web solutions and post a reply, that would be awesome. I know that is a lot to ask though.
I have managed to implement the Notebook solution, but it takes more than 10 minutes to run, which is why I haven't marked your reply as a solution just yet. I have another post for that problem too. It seems there are so many issues with simply setting up a pipeline. I was even watching a YouTube video where the guys demonstrating Notebooks encountered errors they were unable to resolve. LOL
Is it the Notebook activity that is taking the 10 minutes? I would expect the Notebook activity to be a lot faster than that.
Or is it the Dataflow Gen2 activity that consumes most of the time?
Unfortunately, I don't have the time at the moment to test different setups. There was also a workaround which involved a dummy file and a Copy activity. But for my part, I think I will just stick with using a Notebook. At the moment it seems that the Notebook is a central piece in data pipelines, at least as long as it's not possible to pass data between a pipeline and Dataflow Gen2.
Where are you getting the metadata from, by the way?
Perhaps you could do the Get Metadata part in a Dataflow Gen2 as well?
In that case you could do everything inside your Dataflow Gen2, and thus avoid having to pass the information from the pipeline into it.
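For example, if your source files live in a Lakehouse, something along these lines could list them directly inside the dataflow (just a sketch, reusing the navigation pattern from the query above; the folder listing exposes the file names in a Name column):

let
    // Navigate to the Lakehouse Files folder, as in the query above
    Source = Lakehouse.Contents([]),
    ThisWorkspace = Source{[workspaceId = "<EnterWorkspaceIdHere>"]}[Data],
    ThisLakehouse = ThisWorkspace{[lakehouseId = "<EnterLakehouseIdHere>"]}[Data],
    FilesFolder = ThisLakehouse{[Id = "Files", ItemKind = "Folder"]}[Data],
    // The folder listing has one row per file; keep just the Name column
    FileNames = Table.SelectColumns(FilesFolder, {"Name"})
in
    FileNames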
This would also be a nice feature, and it could help a bit in your case. Consider voting for this one as well.
https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=77919252-7f67-ee11-a81c-000d3ae3b222
To add a little more information in the hopes that someone can steer me in the right direction, I have put together the following steps: Get Metadata -> For Each -> Set Variable. The For Each activity iterates through each child item from the Get Metadata activity and appends the name to the file_list array variable. Then the Set Variable activity joins the file_list array into the filename variable using the pipe as a separator. See below.
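In pipeline expression syntax, the two steps look roughly like this (the variable names file_list and filename are as described above):

Append Variable (inside the For Each): @item().name
Set Variable (filename): @join(variables('file_list'), '|')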
Now, I need to somehow save the filename variable to a table in the Lakehouse, preferably parsed and in columnar format, e.g.:

ID | Filename
---|---
1 | File1.csv
2 | File2.csv
3 | File3.csv

(although the ID column is not critical).
Can someone please guide me on what the next step(s) would be in my pipeline to save the filename variable to a table?
After this, I think I can pretty easily point my Dataflow Gen2 activity to the table and proceed from there.
Many thanks!
Please vote for this idea:
Enable to pass variables from pipelines as parameters into a Dataflow Gen2
https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=2f761603-a400-ee11-a81c-6045bdb98602