Get certified in Microsoft Fabric—for free! For a limited time, the Microsoft Fabric Community team will be offering free DP-600 exam vouchers. Prepare now
I have a need when connecting to an external Excel file (via Excel.Workbook) to import the entire worksheet including any leading blank rows or columns.
The default behavior is to return only the UsedRange (save to CSV behaves the same way).
For example, if rows 1-6 and columns A-D are blank in the source worksheet the UsedRange will start at cell E7 which will be loaded to row1:Column1 in the PQ editor. The blank rows and columns - areas outside the UsedRange - are ignored.
I need cell E7 data to appear at row7:Column5.
My workaround is to set the Style of A1 in the source worksheet to something other than Normal. That will define the top left corner of the UsedRange to be A1 and the blank rows and columns will be imported.
But, I'd rather not have to do that.
Thanks for any insight.
Solved! Go to Solution.
Hi @JimChisholm,
Try this one. I've optimised the performance by truncating unnecessary checks after the first header match is found, also limiting the number of rows to check for the header (in this case top 100 rows, which seems reasonably high). If you sure this number can be less than 100, change it to what you think is most appropriate. The smaller the number the quicker the code.
let
Source = Excel.Workbook(File.Contents("C:\...\Downloads\Test.xlsx"), null, true),
Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],
fTrimTable = (tbl as table, header as text) =>
let
t = Table.Buffer(tbl),
columns = List.Buffer(Table.ColumnNames(t)),
rowsToCheck = 100,
Column = List.Select(columns, each List.PositionOf(List.FirstN(Table.Column(t, _),rowsToCheck), header)>0){0},
Row = List.PositionOf(Table.Column(t, Column), header),
ScrollRows = Table.RemoveFirstN (t, Row),
ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(columns, List.PositionOf(columns, Column))),
#"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
in
#"Promoted Headers",
Trimmed = fTrimTable(Sheet1_Sheet, "Header100")
in
Trimmed
I've tested it on a file with 600 columns and 2000 rows (about 12Mb), on my fairly old laptop it takes about 10-15 sec to refresh, much better then the older version.
For records: the problem with the old code was that (a) it was loading the entire Excel file each time for each column checked for the position of the header, and (b) checking position of the header in the list of thousands of itemsis not quick, when it is not on hte list. Both not a big problem if the colunm is found quickly (ideal case it is in the A1 cell), but very quickly become one as the number of columns to check overgrow 3-5, in my testing for searhing of "Header100" - the 100th column - it took forever and I had to break the cyle. Optimised version does not have any of the above problems.
Kind regards,
John
Hi @JimChisholm,
Would this be a help in resolving this problem?
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("bU67DsIwDPyXzh2wE8f2zsDOWHUIohIzgv/HF9QoqEiR5dzLtyzTNB/eOu/wZav37cn/2C9FsV3ft8fxw12T+pb7JnvMub4qfGQgNVvMYh4zUYnJgQ9SboSASLAZwplOSCAalSCMESoOfW4ub4h6GaWATDATaz/gjBZZf6SNQEdXVLHWVGGyKDEI0Z0UrAiKOiqqAGBcXz8=", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"(blank)" = _t, #"(blank).1" = _t, #"(blank).2" = _t, #"(blank).3" = _t, #"(blank).4" = _t, #"(blank).5" = _t, #"(blank).6" = _t]),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"(blank)", type text}, {"(blank).1", type text}, {"(blank).2", type text}, {"(blank).3", type text}, {"(blank).4", type text}, {"(blank).5", type text}, {"(blank).6", type text}}),
fTrimTable = (t as table, header as text)=>
let
FindFirstCell = List.Accumulate(Table.ColumnNames(t), null, (a, n)=> if List.PositionOf(Table.Column(t, n), header) > 0 then [Row= List.PositionOf(Table.Column(t, n), "Header1"), Column = n] else a),
ScrollRows = Table.RemoveFirstN (t, FindFirstCell[Row]),
ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(Table.ColumnNames(ScrollRows), List.PositionOf(Table.ColumnNames(ScrollRows), FindFirstCell[Column]))),
#"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
in
#"Promoted Headers",
Output = fTrimTable(#"Changed Type", "Header1")
in Output
The code searches for the left top corner of the data regardless of the spreadsheet structure (as long as the column name is unique, but can be re-written to take just a first appearance).
Kind regards,
John
Hi John,
Using this approach would require I get header names from the user - that of the top left data column and the index column. Easy enough to do.
However, it takes nearly 3 minutes to complete this query on a 450 column by 1,450 row worksheet.
Directly deleting rows and columns based on user-provided input is nearly instantaneous so I'll have to stick with that approach for now.
Thanks for providing this approach. I'll certainly add it to my toolbox.
Jim.
Hi @JimChisholm,
Try this one. I've optimised the performance by truncating unnecessary checks after the first header match is found, also limiting the number of rows to check for the header (in this case top 100 rows, which seems reasonably high). If you sure this number can be less than 100, change it to what you think is most appropriate. The smaller the number the quicker the code.
let
Source = Excel.Workbook(File.Contents("C:\...\Downloads\Test.xlsx"), null, true),
Sheet1_Sheet = Source{[Item="Sheet1",Kind="Sheet"]}[Data],
fTrimTable = (tbl as table, header as text) =>
let
t = Table.Buffer(tbl),
columns = List.Buffer(Table.ColumnNames(t)),
rowsToCheck = 100,
Column = List.Select(columns, each List.PositionOf(List.FirstN(Table.Column(t, _),rowsToCheck), header)>0){0},
Row = List.PositionOf(Table.Column(t, Column), header),
ScrollRows = Table.RemoveFirstN (t, Row),
ScrollColumns = Table.SelectColumns(ScrollRows, List.RemoveFirstN(columns, List.PositionOf(columns, Column))),
#"Promoted Headers" = Table.PromoteHeaders(ScrollColumns, [PromoteAllScalars=true])
in
#"Promoted Headers",
Trimmed = fTrimTable(Sheet1_Sheet, "Header100")
in
Trimmed
I've tested it on a file with 600 columns and 2000 rows (about 12Mb), on my fairly old laptop it takes about 10-15 sec to refresh, much better then the older version.
For records: the problem with the old code was that (a) it was loading the entire Excel file each time for each column checked for the position of the header, and (b) checking position of the header in the list of thousands of itemsis not quick, when it is not on hte list. Both not a big problem if the colunm is found quickly (ideal case it is in the A1 cell), but very quickly become one as the number of columns to check overgrow 3-5, in my testing for searhing of "Header100" - the 100th column - it took forever and I had to break the cyle. Optimised version does not have any of the above problems.
Kind regards,
John
Success!
I tried it on my notoriously badly-structured dataset and performance is no longer an issue.
The user will now have to supply only one piece of information - the name of the top left header.
I'll put the max row check number in a variable on a configuration sheet.
Thanks for your solution John. I think this is a good general purpose approach that should find wide use.
Jim.
Hi @JimChisholm,
I could not find a way to re-configure the settings to take the entire spreadsheet inside PQ, but you can use a workaround by placing a character in the most right-bottom cell of the desired range. In my testing I just put it in the most right-bottom cell [which is XDF1048576] to fetch the entire spreadsheet and it worked.
Kind regrds,
John
Hi @jbwtp ,
The UsedRange is defined by the top left and the bottom right cells.
If you start with a UsedRange of $E$7:$G$13 (columns A-D are empty; rows 1-6 are empty) then placing a character (or changing the Style) in the bottom right-most cell expands the UsedRange to $E$7:$XFD$1048576.
The top left corner is not affected.
In the case of a UsedRange of $E$7:$G$13, the bottom right corner ($G$13) of the UsedRange is correct - it's the empty top rows and left columns I need.
So, other than manually manipulating the UsedRange to define the area to be imported - I don't see a way to import the entire worksheet.
Jim.
Hi @JimChisholm,
I think you are right, but maybe it is better to start from the other end of the story: why do you need those empty lines/cells? What are you trying to achieve?
Thanks,
John
Hi @jbwtp ,
I have a generic file translation process that takes the following file structure parameters from the user:
This allows the process to handle all the messy worksheets that users create (leading blank rows and columns, multi-row headers, odd-spellings and line feeds) that I have no control over.
If they can at least look at the target worksheet and tell me two simple things about the block of data I'll have all the information I need to process the file - if PQ actually ignored the UsedRange and imported the entire sheet.
Since PQ won't do that I'll ask the user for two more parameters -
Then I'll calculate the following to get the as-imported parameters needed for processing -
Check out the October 2024 Power BI update to learn about new features.
Learn from experts, get hands-on experience, and win awesome prizes.
User | Count |
---|---|
72 | |
67 | |
25 | |
18 | |
12 |