Streamlined process for appending multiple files with similar structure in a folder
Let’s assume you have multiple files with a similar or identical structure that you need to combine into a single table. These files can live in a local folder or (preferably) on SharePoint or OneDrive, which eliminates the need for a gateway.
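If your files live on SharePoint, only the source connector changes; here is a minimal sketch, assuming a hypothetical site URL (the rest of this post works against a local folder):

let
    // Hypothetical site URL - replace with your own SharePoint site
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/Demo", [ApiVersion = 15]),
    // Same kind of name/extension filter used for the local folder later in this post
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data") and [Extension] = ".txt")
in
    #"Filtered Rows"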
The standard approach would be to use the “Combine Files” option of the “Content” column in Power Query.
This works relatively well but comes with some unintended side effects. It makes assumptions about your files’ origin encoding and about the column types, based on a sample file (usually the first one) and its first 200 rows.
It then produces multiple folders and artifacts that can quickly clutter up your query list.
It can also mess up the column headers for the sample files, and the generated Power Query code is less than optimal. Invoking a custom function outside of our partition, for example, can later get us in trouble with the formula firewall.
let
    Source = Folder.Files("C:\Users\bendlin\Documents\Power BI"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data") and [Extension]=".txt"),
    #"Filtered Hidden Files1" = Table.SelectRows(#"Filtered Rows", each [Attributes]?[Hidden]? <> true),
    #"Invoke Custom Function1" = Table.AddColumn(#"Filtered Hidden Files1", "Transform File", each #"Transform File"([Content])),
    #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
    #"Removed Other Columns1" = Table.SelectColumns(#"Renamed Columns1", {"Source.Name", "Transform File"}),
    #"Expanded Table Column1" = Table.ExpandTableColumn(#"Removed Other Columns1", "Transform File", Table.ColumnNames(#"Transform File"(#"Sample File")))
in
    #"Expanded Table Column1"
Since you know the structure of these files better than Power Query does, it makes sense to explore alternative options that achieve the same goal with less clutter.
The first step is to pick one of the files for the basic transforms. The first file is usually the oldest, so it may make sense to pick the last file instead.
Clicking on the Binary link will recognize the file type (in this case CSV) and start the basic ingestion.
Let’s fix a couple of things. The encoding is UTF-8, not ANSI. The column count isn’t really needed (and it would prevent the Power Query code from handling scenarios where columns are added to the source files). Lastly, QuoteStyle should always be set to ignore quoted line breaks.
Some of these settings can be changed through the dialog, but the column count needs to be removed manually from the formula.
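For illustration, here is the change side by side; the Columns count and ANSI code page in the first call are assumed values for a typical auto-generated step, and sampleBinary stands in for the [Content] binary of the sampled file:

// Typical auto-generated call (Columns=3 and Encoding=1252 are assumed defaults; yours may differ)
Csv.Document(sampleBinary, [Delimiter=",", Columns=3, Encoding=1252, QuoteStyle=QuoteStyle.None])
// Adjusted call: UTF-8 encoding, no fixed column count, quoted line breaks ignored
Csv.Document(sampleBinary, [Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv])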
Much better. The final transform is to elevate the first row to header status.
Note that all column types are listed as text (since this is a CSV source). We leave that alone for now.
Here is the code that was produced for us. Note that we sampled the third file.
let
    Source = Folder.Files("C:\Users\bendlin\Documents\Power BI"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data") and [Extension]=".txt"),
    #"C:\Users\bendlin\Documents\Power BI\_data3 txt" = #"Filtered Rows"{[#"Folder Path"="C:\Users\bendlin\Documents\Power BI\",Name="data3.txt"]}[Content],
    #"Imported CSV" = Csv.Document(#"C:\Users\bendlin\Documents\Power BI\_data3 txt",[Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true])
in
    #"Promoted Headers"
Power Query code is usually written such that each line references the previous one. We can replace these references with the actual code, like so:
let
    Source = Folder.Files("C:\Users\bendlin\Documents\Power BI"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data") and [Extension]=".txt"),
    #"C:\Users\bendlin\Documents\Power BI\_data3 txt" = #"Filtered Rows"{[#"Folder Path"="C:\Users\bendlin\Documents\Power BI\",Name="data3.txt"]}[Content],
    #"Promoted Headers" = Table.PromoteHeaders(Csv.Document(#"C:\Users\bendlin\Documents\Power BI\_data3 txt",[Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv]), [PromoteAllScalars=true])
in
    #"Promoted Headers"
The #"Imported CSV" step is now part of the #"Promoted Headers" step. We can go one step further and replace the previous reference too:
= Table.PromoteHeaders(Csv.Document([Content],[Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv]), [PromoteAllScalars=true])
This is now the basis for our new approach. We remove all steps after the “Filtered Rows” step and then select only the columns we need – the [Name] column and the [Content] column.
(Pro tip: remove steps from the bottom up)
Then we add a custom column and paste the ingestion code.
We end up with a column that contains all the data from our CSV files, with the correct encoding and headers. You can verify that by clicking to the right of one of the “Table” links. This will produce a preview of the contents of that table object.
We don’t need the [Content] column any more.
(Pro tip: More advanced users can use a replacer function to replace the [Content] binary with the transformation instead of adding a new column, as sketched below.)
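Here is a minimal sketch of that variant, using Table.ReplaceValue with a custom replacer against the step names from this post; Table.TransformColumns would achieve the same thing:

    #"Replaced Content" = Table.ReplaceValue(
        #"Filtered Rows",
        // old value: the raw binary in [Content]
        each [Content],
        // new value: the parsed table with promoted headers
        each Table.PromoteHeaders(Csv.Document([Content],[Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv]), [PromoteAllScalars=true]),
        // custom replacer: discard the old value and return the new one
        (currentValue, oldValue, newValue) => newValue,
        {"Content"})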
Next we expand the [Custom] column to rows, which automagically appends all the tables. Note that this will work even if these tables don’t have the exact same structure (see the sketch below).
The “List may be incomplete” warning can be cleared with the “Load more” link if you are observing missing columns.
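If the tables really do differ, one way to make the expansion pick up every column (a sketch, reusing the step names from the final query below) is to expand the union of all column names instead of those of a single sample:

    #"Expanded Custom" = Table.ExpandTableColumn(
        #"Removed Other Columns1", "Custom",
        // union of the column names across all parsed tables
        List.Union(List.Transform(#"Removed Other Columns1"[Custom], Table.ColumnNames)))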
This looks great. The last few steps are to assign the column types and to remove the now-unnecessary groups and artifacts.
Here is our final version of the code:
let
    Source = Folder.Files("C:\Users\bendlin\Documents\Power BI"),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.StartsWith([Name], "data") and [Extension]=".txt"),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Content", "Name"}),
    #"Added Custom" = Table.AddColumn(#"Removed Other Columns", "Custom", each Table.PromoteHeaders(Csv.Document([Content],[Delimiter=",", Encoding=65001, QuoteStyle=QuoteStyle.Csv]), [PromoteAllScalars=true])),
    #"Removed Other Columns1" = Table.SelectColumns(#"Added Custom",{"Name", "Custom"}),
    #"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns1", "Custom", {"Timestamp", "Temperature", "Battery"}, {"Timestamp", "Temperature", "Battery"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Custom",{{"Timestamp", type datetime}, {"Temperature", type number}, {"Battery", Int64.Type}})
in
    #"Changed Type"
(Pro tip: The Table.ExpandTableColumn step can be streamlined further by referencing the column names of one of the “Table” objects from the previous step, instead of specifying all column names.)
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns1", "Custom", Table.ColumnNames(#"Removed Other Columns1"[Custom]{0})),
And that’s it. Yes, it does involve a couple of manual steps, but the result is cleaner code and a slight performance improvement, since we are no longer loading the sample file twice. You have control over the column types (instead of relying on sample-based guessing), and you avoid all the extra folders and expressions.