Hi
I have snapshot report output files stored in a folder. Files in the folder:
ReportID_100 12012016.txt
...
ReportID_100 11012019.txt
ReportID_100 12012019.txt
ReportID_100 01012020.txt
...
Each file looks like this:
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/01/2019 Report Page 1
11/01/2019 - 11/30/2019
Prescriptions Group by Antibiotics by Department
Prescription PSCP 2 Count PSCP 3 Count PSCP 4 and above Count
---------------------------------------------------------------------------------
Radiology 1 0 1
Outpatient 1 0 1
Pathology 0 0 0
Surgery 1 0 1
Medicine 1 0 1
Orthopedics 1 0 1
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/06/2019 Report Page 2
11/01/2019 - 11/30/2019
Prescriptions Group by Antibiotics by Department
Prescription PSCP 2 Count PSCP 3 Count PSCP 4 and above Count
---------------------------------------------------------------------------------
Adult/Family Medicine 1 0 1
Dermatology 1 0 1
Gastroenterology 1 0 1
...
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/06/2019 Report Page 3
11/01/2019 - 11/30/2019
Prescriptions Group by Antibiotics by Department
Prescription PSCP 2 Count PSCP 3 Count PSCP 4 and above Count
---------------------------------------------------------------------------------
Urology 1 0 1
Pulmonary Medicine 1 0 1
...
Total: 11 10 222
#END OF REPORT#
Each file contains multiple pages of data, dynamic page numbers, and a date range. I am trying to import and model this data.
a) Any tips on modeling or importing this kind of file?
b) I am able to get the data from each file and parse it, but how do I identify the header rows between pages and strip them out? (The first few rows of each file are dynamic, and the same rows also repeat between pages.) For example:
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/06/2019 Report Page 2
11/01/2019 - 11/30/2019
Prescriptions Group by Antibiotics by Department
Prescription PSCP 2 Count PSCP 3 Count PSCP 4 and above Count
---------------------------------------------------------------------------------
Thanks in advance
Hi @sevenhills ,
I'm unable to go into too much detail on your specific scenario, as I think it will require some quite creative transformations. However, your starting point should be to use either the Folder or SharePoint folder data source in Power Query. Once you've selected the folder containing all these reports, you will get the option to Combine & Transform them.
This will allow you to select an example file (ideally one that includes all the issues you'd like to fix across the group), and perform transformations on it that will be automatically replicated across all the files in that folder, before they are all finally appended together into a single query.
While I can't specifically advise on the actual transformations you would need to do, my guess would be that you would use basic filters on certain words to remove header rows, use 'Fill Up' and 'Fill Down' functions to move values up and down to rows that aren't going to be filtered out, and functions like Text.Contains etc. to identify certain words/phrases to be retained.
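As a rough illustration of that filtering idea, here is a minimal Power Query M sketch. The inline sample rows and the list of header phrases are assumptions based on the report layout shown above, not a complete rule set (the date-range line, for instance, would need its own rule):

```
let
    // Hypothetical sample rows standing in for the combined file contents
    Source = Table.FromColumns(
        {{
            "ReportID_100 Report Name: ""Antibiotics Group"" Report Date: 12/01/2019 Report Page 1",
            "Prescriptions Group by Antibiotics by Department",
            "Radiology 1 0 1",
            "#END OF REPORT#"
        }},
        {"Column1"}
    ),
    // Keep only rows that do not start with a known header/footer phrase
    HeaderStarts = {"ReportID_", "Prescriptions Group", "Prescription ", "-----", "#END OF REPORT#"},
    DataRows = Table.SelectRows(
        Source,
        each not List.MatchesAny(
            HeaderStarts,
            (h) => Text.StartsWith([Column1], h)
        )
    )
in
    DataRows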
Again, I think you're going to have to get pretty creative to knock these into a workable shape, but this is the process you need to complete what you require.
Pete
Proud to be a Datanaut!
Thanks for your reply.
To clarify, my questions again with more details:
a) Any tips on modeling or importing this kind of file?
b) I am able to get the data from each file and parse it, but how do I identify the header rows between pages and strip them out? (The first few rows of each file are dynamic, and the same rows also repeat between pages.)
Hope this clarifies the needs 🙂
Thanks
Thanks
For (b), I solved it in Power Query as follows:
* Step 1: Analyze the report and identify the row sets (hard-coded per report):
sets: page header, table header, footer row, page-end row, etc.
This is purely for categorization.
* Step 2:
Get the data from all files, with a row number for each line in each file:
File Name | Row Number |
File 1 | 1 |
File 1 | 2 |
File 1 | 3 |
File 1 | … |
File 2 | 1 |
File 2 | 2 |
File 2 | 3 |
File 2 | … |
File 3 | 1 |
File 3 | 2 |
File 3 | 3 |
File 3 | … |
For each file, go through each row and, based on the content, identify the first row of each set.
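A minimal M sketch of this numbering step, using a hypothetical inline table in place of the real file contents (group by file, then index each group):

```
let
    // Hypothetical stand-in: one row per text line, tagged with its file
    FilesTable = #table(
        {"File Name", "Line"},
        {
            {"File 1", "header"}, {"File 1", "data A"}, {"File 1", "data B"},
            {"File 2", "header"}, {"File 2", "data C"}
        }
    ),
    // Group by file so the index restarts at 1 for each file
    Grouped = Table.Group(
        FilesTable,
        {"File Name"},
        {{"Rows", each Table.AddIndexColumn(_, "Row Number", 1, 1), type table}}
    ),
    Numbered = Table.ExpandTableColumn(Grouped, "Rows", {"Line", "Row Number"})
in
    Numbered
```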
* Step 3: Remove the dynamic text from these rows, such as the page number. For example,
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/06/2019 Report Page 2
becomes
ReportID_100 Report Name: "Antibiotics Group" Report Date: 12/06/2019 Report Page
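One way to do this normalization in M is to trim trailing digits and spaces, so every page-header row reduces to the same text and can be matched. A sketch, with the header hard-coded purely for illustration:

```
let
    Header = "ReportID_100 Report Name: ""Antibiotics Group"" Report Date: 12/06/2019 Report Page 2",
    // Trim trailing digits and spaces; the row now ends with "Report Page"
    Normalized = Text.TrimEnd(Header, {"0".."9", " "})
in
    Normalized
```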
* Step 4: Scan the whole table using these category rows, mark the first and last row of each set, and build a per-file summary of the row ranges that need to be deleted:
File Name | Set Name | Row Begin | Row End |
File 1 | Page Header | 1 | 6 |
File 1 | Table Header | 7 | 9 |
File 1 | Page Header | 22 | 27 |
File 1 | Table Header | 28 | 31 |
... | |||
File 1 | Report End | 1011 | 1011 |
* Expand the summary by adding a custom column holding the list of row numbers to delete:
{[Row Begin]..[Row End]}
* Then do a left anti join against the numbered rows, keeping only the data rows.
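These last two steps might look like this in M. The inline tables are hypothetical stand-ins for the delete-range summary above and the numbered rows from Step 2:

```
let
    // Stand-in for the "Delete Row Details" summary table
    DeleteRanges = #table(
        {"File Name", "Set Name", "Row Begin", "Row End"},
        {{"File 1", "Page Header", 1, 2}}
    ),
    // One list item per row number to delete, then one table row per item
    AddRows = Table.AddColumn(DeleteRanges, "Row Number", each {[Row Begin]..[Row End]}),
    Expanded = Table.ExpandListColumn(AddRows, "Row Number"),
    // Stand-in for the full numbered-lines table from Step 2
    AllLines = #table(
        {"File Name", "Row Number", "Line"},
        {{"File 1", 1, "header"}, {"File 1", 2, "dates"}, {"File 1", 3, "Radiology 1 0 1"}}
    ),
    // Left anti join: keep only lines NOT listed in the delete ranges
    Joined = Table.NestedJoin(
        AllLines, {"File Name", "Row Number"},
        Expanded, {"File Name", "Row Number"},
        "Deleted", JoinKind.LeftAnti
    ),
    DataOnly = Table.RemoveColumns(Joined, {"Deleted"})
in
    DataOnly
```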
🙂 Hope this helps others later.
(FYI: I am moving this logic to a dataflow, which is outside the scope of this post.)