Dear Pro-Users, most likely an easy question:
What is a performant PQ best practice for querying Folder.Contents (directories only) across different servers/directories,
starting from a table with all directories to query, (maybe) using the function Folder.Contents()
in combination with each or (_) =>, or better with nested functions (filtering on Attributes[Directory] = true)?
let
    // table of all directories to query; [path] extracts the column as a list
    dirs = Table.FromRecords({
        [dirID = 001, path = "\\server01\MainFolder01"],
        [dirID = 002, path = "\\server01\MainFolder02"],
        [dirID = ..., path = "\\server..\MainFolder.."],
        [dirID = 00N, path = "\\server0M\MainFolder0K"] })[path],
    dirsAsTxt = Lines.ToText(dirs), // probably not needed
    // call the function below (a separate query, e.g. named GetSubDirs) once per path and append the results
    CombTbls = Table.Combine(List.Transform(dirs, each GetSubDirs(_)))
in
    CombTbls

// GetSubDirs: applied to each item in dirs
(dir as text) as table =>
let
    SubFolders = Folder.Contents(dir),
    ListSubFolders = Do.SomeTransforms(...) // placeholder, e.g. keep only rows with Attributes[Directory] = true
in
    ListSubFolders
Kind regards (first question, please advise if non-compliant)
Possible value relevance: most companies are SMEs (future value/cash flow), and most SMEs are not fully digitalized.
Many SMEs are MS clients still running file servers; many run RDBs, fewer RDFs or Kafka, i.e. data is stored in files.
The wish is to have a stackable select-ETL (until full digitalization, if that ever happens) to select/access file data via PBI (PQ/PP).
Aim: enable any user to browse & combi-select multiple subdirectories and query the file content with separate ETLs.
Regards
You may want to continue exploring other tools. While Power Query is very powerful and flexible, it is not a "proper" ETL tool. Its sole aim is to prepare data so that it can be used by VertiPaq. There was never a design goal to load anything elsewhere.
A note on the concrete task to be accomplished, which appears to be a standard BI use case.
Task: load data from distributed XLS files (of the same kind), sitting in changing (renamed/moved) folders,
via PBI (PQ) into a flattened MS cube, to consolidate the file data for comparative pivot analysis.
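For reference, a minimal sketch of that standard combine pattern, assuming all files sit somewhere below one root folder and share a hypothetical sheet named "Data" (path, extension filter and sheet name are placeholders):
let
    // all files below the (assumed) root folder, regardless of subfolder renames/moves
    Source     = Folder.Files("\\server01\MainFolder01"),
    XlsOnly    = Table.SelectRows(Source, each Text.Lower([Extension]) = ".xlsx"),
    // parse each workbook; "Data" is a hypothetical sheet name shared by all files
    Parsed     = Table.AddColumn(XlsOnly, "Workbook", each Excel.Workbook([Content], true)),
    Sheets     = Table.ExpandTableColumn(Parsed, "Workbook", {"Name", "Data"}, {"SheetName", "SheetData"}),
    DataSheets = Table.SelectRows(Sheets, each [SheetName] = "Data"),
    // one flat table to load into the model for pivot analysis
    Combined   = Table.Combine(DataSheets[SheetData])
in
    Combined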
This is from when? 2013?
Thank you for the VertiPaq hint, indeed helpful/new.
"You may want to continue exploring other tools"?
Does that mean Power Platform apps/pages/Automate/VA, Event Hubs/Dataverse...?
Coming from py/R/SQL/git, MS strongly advised no/low-code Power Platform.
No, you need to bring out the big guns. AD extracts, Purview, Scanner API. Low code/no code tools tend to be overwhelmed when confronted with massive amounts of data.
Thanks again. Issue closed. Please note: the process is designed precisely to avoid massive data.
The select process serves the purpose of selecting only those few files (data) of current interest.
There is just no single point of truth (SPOT), since the files float around on the file servers.
This fact requires an iterative, and therefore performant, folder search (folder browsing).
Hypothesis: many SMEs should face a similar situation (i.e. money on the street?).
Hopefully this points to a working solution, to be further generalized by a recursive call, i.e. flexible folder levels.
Idea: each user folder-select returns parent/child folders, which is like browsing Windows Explorer (the standard behavior).
(Pretty sure there is a conformant & better standard off-the-shelf solution, but it was ignored or overlooked here.)
// fct named GetSubDirs (apologies if too trivial, it mimics only one step of Explorer browsing)
= (MainDir as text) as table =>
let
    // returns only the next-level subdirectories of MainDir (called once per row of a table)
    Source = Folder.Contents(MainDir),
    SubDir = Table.SelectColumns(
                 Table.SelectRows(Source, each [Attributes][Directory] = true),
                 {"Folder Path", "Name"})
in
    SubDir
The next query calls the custom fct (probably better & faster if done nested, see the sketch further below):
let
    // Sub2Dirs: Excel table holding the user-selected second-level folders (column "Sub2Folder")
    Sub2Dir = Table.TransformColumnTypes(Table.SelectColumns(Excel.CurrentWorkbook(){[Name = "Sub2Dirs"]}[Content], {"Sub2Folder"}), {{"Sub2Folder", type text}}),
    // for each row, call GetSubDirs to list the next (third) folder level
    CustFct = Table.TransformColumnTypes(Table.ExpandTableColumn(Table.AddColumn(Sub2Dir, "SubDir", each GetSubDirs([Sub2Folder]), type table), "SubDir", {"Name"}, {"Sub3Dir"}), {{"Sub3Dir", type text}}),
    // build the full third-level path and keep only non-empty results
    Result = Table.SelectColumns(Table.SelectRows(Table.AddColumn(CustFct, "Sub3Folder", each Text.Combine({[Sub2Folder], [Sub3Dir]}, ""), type text), each [Sub3Folder] <> ""), "Sub3Folder")
in
    Result
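For comparison, a sketch of the nested variant without the separate function query; it is intended to behave the same, and whether it is actually faster is untested:
let
    Sub2Dir  = Table.TransformColumnTypes(Table.SelectColumns(Excel.CurrentWorkbook(){[Name = "Sub2Dirs"]}[Content], {"Sub2Folder"}), {{"Sub2Folder", type text}}),
    // inline version of GetSubDirs: list the immediate subfolders of each Sub2Folder
    Added    = Table.AddColumn(Sub2Dir, "SubDir",
                   each Table.SelectRows(Folder.Contents([Sub2Folder]), each [Attributes][Directory] = true),
                   type table),
    Expanded = Table.ExpandTableColumn(Added, "SubDir", {"Name"}, {"Sub3Dir"}),
    Result   = Table.SelectColumns(
                   Table.SelectRows(
                       Table.AddColumn(Expanded, "Sub3Folder", each Text.Combine({[Sub2Folder], [Sub3Dir]}, ""), type text),
                       each [Sub3Folder] <> ""),
                   "Sub3Folder")
in
    Result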
This can be "simplified" by using recursive functions. But will it be faster? Not likely.
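For completeness, a minimal sketch of such a recursive variant, assuming the GetSubDirs function above; the query name GetAllSubDirs and the Depth parameter are hypothetical, and the query must be saved under the name GetAllSubDirs for the self-reference to resolve:
= (MainDir as text, Depth as number) as table =>
let
    ThisLevel = GetSubDirs(MainDir),
    // recurse into each returned subfolder until the requested depth is reached
    Children  = if Depth <= 1
                then #table({"Folder Path", "Name"}, {})
                else Table.Combine(List.Transform(Table.ToRecords(ThisLevel),
                         each @GetAllSubDirs([#"Folder Path"] & [Name], Depth - 1))),
    Result    = Table.Combine({ThisLevel, Children})
in
    Result
Performance-wise it still issues one Folder.Contents call per folder, so it is unlikely to beat the level-by-level selection.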
Thank you for your response and hint. The limiting externality is the number of files and the resulting response time.
Folder.Files() would be nice, but each file server contains millions of files in constantly changing directories.
Table.Combine({Folder.Files("\\server01\???"), Folder.Files("\\server02\???"), ...}) takes hours to answer.
The (common?) SME use case is a dynamic, multi-folder select starting from different main directories.
Loop: for fast response, the user selects folders level by level (a rough sketch of one step follows below). An easier approach would be highly appreciated.
Step 1: user selects server(s) & calls Folder.Contents( "get only the (next) 1st-level directories for selection" )
Step 2: user selects 1st-level folders & calls Folder.Contents( "get only the (next) 2nd-level directories for selection" )
Step 3: user selects 2nd-level folders & calls Folder.Contents( "get only the (next) 3rd-level directories for selection" )
Stop if there is no further user select, else continue the loop down to the last (deepest) level.
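A rough sketch of what one such step could look like (the selected paths are hypothetical, hard-coded stand-ins for the user's picks):
let
    // folders the user picked in the previous step (hypothetical example values)
    SelectedLevel1 = {"\\server01\MainFolder01", "\\server02\MainFolder07"},
    // for each selected folder, list only its immediate subdirectories (Folder.Contents does not recurse)
    NextLevel  = Table.Combine(List.Transform(SelectedLevel1,
                     (dir) => Table.SelectRows(Folder.Contents(dir), each [Attributes][Directory] = true))),
    // offer these as the candidates for the next selection step
    Candidates = Table.SelectColumns(NextLevel, {"Folder Path", "Name"})
in
    Candidates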
A more PQ-conformant approach would be a great help.
Regards
Remember that the Folder connector gives you all folder names for free.
let
Source = Folder.Files("\\192.168.3.8\pi"),
#"Removed Other Columns" = Table.SelectColumns(Source,{"Folder Path"}),
#"Removed Duplicates" = Table.Distinct(#"Removed Other Columns")
in
#"Removed Duplicates"
So no need to loop through anything again.
Thanks a lot for your hint and response.
Folder.Files won't work because the response time exceeds common user acceptance (explanation see above).
Regards
Unless your data sources (the file servers) can help with periodic/fast metadata exports, I don't think you'll have much luck making this any faster.
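If such an export existed, e.g. a nightly folder listing written to a flat file, consuming it from PQ could be as simple as the following sketch (path and column names are hypothetical):
let
    // hypothetical nightly folder listing exported by the file server to a flat file
    Source   = Csv.Document(File.Contents("\\server01\exports\folderlist.csv"), [Delimiter = ";", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    Typed    = Table.TransformColumnTypes(Promoted, {{"FolderPath", type text}, {"LastChanged", type datetime}})
in
    Typed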