Hi, I've uploaded several hundred daily Power BI audit logs into a Lakehouse under the Files area. I'd like to use a Dataflow Gen2 to combine the files and then reshape them. Is there a way to do this?
If I use a Lakehouse connection, I seem to be able to select an individual file, but not the whole folder.
If I use a Folder connection, it wants to send everything through an on-prem gateway (which makes no sense).
Appreciate any insights - thanks in advance!
Scott
Were you able to get this to work? I'm trying to do the same: I have a ton of files loaded into a folder in a lakehouse and would like to use a dataflow to shape them.
Thanks,
Zach
Hi @ZachRoberts, unfortunately I did not find a way to do this. If you do, please let me know; we could really use this.
Thanks,
Scott
Hi @Scott_Powell ,
I wasn't able to figure out loading the files in my folder through a Dataflow, but I ended up using a notebook to ingest the files into a table and went from there.
Below are the notebook details if you want to give it a shot:
The code below loads all files from within a folder and removes the first row of each file (not sure if your files are delimited, but in the example below mine are |-delimited). In the step where you provide the column names, you don't have to do this for every column: you can select just the columns you want to load, as long as you provide the appropriate column number, starting from 0.
Cell 1
from pyspark.sql.functions import split, row_number, input_file_name
from pyspark.sql import Window
# Read all files from the directory into a DataFrame
df = spark.read.text("Files/Expense/*.txt")
# Add a column for the input file name
df_with_filename = df.withColumn("filename", input_file_name())
# Add a row number to each row, partitioned by the input file name
# (ordering by the row text, which assumes the header sorts first within each file)
windowSpec = Window.partitionBy("filename").orderBy("value")
df_with_rownum = df_with_filename.withColumn("rownum", row_number().over(windowSpec))
# Filter out the first row from each file
df_filtered = df_with_rownum.filter(df_with_rownum.rownum > 1)
# Split the value column by the | delimiter
df_split = df_filtered.withColumn("split_values", split(df_filtered["value"], r"\|"))
# If you know the number of columns and want to give them names, you can do so
# For example, if there are two columns:
df_final = df_split.select(
    df_split["split_values"].getItem(0).alias("Constant"),
    df_split["split_values"].getItem(1).alias("BatchId"),
)
# Show the resulting DataFrame
df_final.show()
Cell 2
df_final.write.mode("overwrite").format("delta").saveAsTable("TableName")
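If your files each start with a header row, a shorter variant is to let Spark's CSV reader handle the delimiter and the header directly. The sketch below assumes exactly that (a header row containing the same Constant/BatchId names, plus the same path and table name used above) and sidesteps the ordering assumption noted in Cell 1:
# Sketch, assuming each file's first row is a header with the column names;
# header=True drops that row per file, so no row_number/orderBy step is needed
df = (
    spark.read
        .option("header", True)       # first row of each file becomes the column names
        .option("sep", "|")           # pipe-delimited
        .csv("Files/Expense/*.txt")   # same folder wildcard as Cell 1
)
# Keep only the columns you need, then land the result as a Delta table
df.select("Constant", "BatchId").write.mode("overwrite").format("delta").saveAsTable("TableName")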
Hi @yjh, I realize I wasn't clear at all. I'm not trying to load them into Power BI Desktop; I'm trying to use a Dataflow Gen2 to load them into tables in the Lakehouse.
Thanks,
Scott
=Lakehouse.Contents(null)
Then drill down through the navigation table to your workspace, lakehouse, and the folder under Files.
Sorry for my poor English.
This does not work for multiple files in the same folder though, right? You can only select a single file using this method, as far as I can tell.
You can use the Combine Files experience the same as with other file system views (SharePoint, data lake, folder, and others): once you've drilled down to the folder, the combine button on the Content column expands and appends all of the files. If you're seeing any issues with this approach, please create a separate topic/thread so we can take a closer look at it.
The method is the same.