
sevenhills
Super User

Advanced unstructured data file processing

Hi 

 

I am trying to process some advanced semi-structured and unstructured data files, similar to reverse ETL.

  • There are about 100 files per quarter, like snapshot report output files.
  • Each file has multiple records.
  • Each record has info spanning one to multiple pages.
  • If a record spans multiple pages, certain sections repeat.
  • Each record has multiple sections.
  • A few sections have fixed rows and a few sections have optional rows.
  • Each row can have multiple fields, or one field can span multiple rows.
  • Also, I want to turn this into file processing functions.

 

What I am looking for are some examples of data processing using Power Query. Could anyone please share links to Power Query code for advanced unstructured data file processing?

 

Regards

13 REPLIES
sevenhills
Super User

I was away due to personal reasons.

 

Coming back to the solution: I could not solve the four scenarios I mentioned in the reply "08-21-2023 07:15 PM".

 

If anyone else can help me, I would appreciate it.

 

Regards

sevenhills
Super User

I am attaching two sample mockup files.

 

Thanks

Please add information (a screenshot) showing what you expect for output, given the input data you posted.

Sorry, I thought I had included it. Please check the updated zip file, which includes the planned test output file.

 

My plan is, since these are historical files covering 10 years, to deal with the most obvious items first and handle outliers or special cases as we encounter them.

Make it as robust and flexible as possible, and implement it as a function applied to each file (see the sketch below).

Each real file is 4 MB to 10 MB, and there are a few hundred files.
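
For what it's worth, a minimal sketch of the "function per file" idea (the folder path and the fnProcessReportFile stub below are placeholders, not the real parser): read every text file in a folder, run one parsing function per file, and stack the results.

let
    // hypothetical per-file parser: replace this stub with the real parsing logic
    fnProcessReportFile = (lines as list) as table =>
        #table(type table [Line = text], List.Transform(lines, each {_})),
    // hypothetical folder path; point it at wherever the report files live
    Source = Folder.Files("C:\Reports"),
    TextFiles = Table.SelectRows(Source, each [Extension] = ".txt"),
    // apply the parser to every file, then stack all the per-file tables
    Parsed = Table.AddColumn(TextFiles, "Records",
        each fnProcessReportFile(Lines.FromBinary([Content], null, null, 1252))),
    Combined = Table.Combine(Parsed[Records])
in
    Combined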

 

thanks

You can extract the data using text functions (or possibly regular expressions). How simple this might be will depend on the extent of variability in the data.

 

For example, if the type of report, report date, etc. is always in the first line, that information can be extracted using text functions to split the line at various points.

 

The Patient Information lines can be extracted by searching for lines that begin with, for example, "Patient ID:" and then using text functions to split that up.
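
For example, something along these lines (a sketch only; the exact layout of the line is an assumption):

let
    // hypothetical line shape: "Patient ID: 101    Age: 54"
    line      = "Patient ID: 101    Age: 54",
    patientId = Text.Trim(Text.BetweenDelimiters(line, "Patient ID:", "Age:")),
    age       = Number.From(Text.Trim(Text.AfterDelimiter(line, "Age:")))
in
    [#"Patient ID" = patientId, Age = age]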

 

Same with Tobacco information and Treatment information.

 

So long as the various labels are the same, you should be able to search the data and sort the bits of information into columns.

 

But Power Query will produce an output consisting of a Table with rows and columns. It won't be able to create the kind of output you show in your zip file. You can probably do that using Power BI.

 

Here is an example of extracting some of the data.

This should get you started.

You can expand on these techniques for the rest:

 

Custom Function

Rename: fnFindInList

// rename fnFindInList

(li as list, findPhrase as text)=>

let 
    item = List.Select(li, each Text.Contains(_,findPhrase,Comparer.OrdinalIgnoreCase)),
    pos = List.PositionOf(li,item{0})
in  
    pos
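
Note that, as written, item{0} raises an error when the phrase is never found. A variant that returns null instead (a sketch; the name is just a suggestion) can be handy for optional lines:

// rename e.g. fnTryFindInList
(li as list, findPhrase as text) =>
let
    item = List.Select(li, each Text.Contains(_, findPhrase, Comparer.OrdinalIgnoreCase)),
    // null when the phrase is absent, otherwise the 0-based position of the first match
    pos = if List.IsEmpty(item) then null else List.PositionOf(li, item{0})
in
    pos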

 

Main Query

let
    Source = Csv.Document(File.Contents("C:\Users\ron\Desktop\SH Testing 2.txt"),1,"""""",ExtraValues.Ignore,1252),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Column1", type text}}),
    #"Trim" = Table.TransformColumns(#"Changed Type",{"Column1", each Text.Trim(_)}),
    #"Added Index" = Table.AddIndexColumn(#"Trim", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Patients", each 
        if Text.Contains([Column1],"EHR Tobacco Cessation Report") then [Index] else null, Int64.Type),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Index"}),
    #"Filled Down" = Table.FillDown(#"Removed Columns",{"Patients"}),
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Patients"}, {
        
        {"Report Info", (t)=>
            let 
                #"Report ID" = Text.Split(t[Column1]{0}," "){0},
                #"Report Date" = Date.From(Text.Split(Text.Trim(Text.Split(t[Column1]{0},"Date:"){1})," "){0}),
                Report = #table({"Report ID","Report Date"},{{#"Report ID",#"Report Date"}})
            in 
                Report, type table[Report ID=text, Report Date=date]},

        {"Pt Info", (t)=>
            let 
                rw = fnFindInList(t[Column1],"Patient ID:"),
                splitPT = Text.Split(t{rw}[Column1]," "),
                #"Patient ID" = splitPT{2},
                #"Age" = Number.From(List.Last(splitPT)),
                splitNS = Text.Split(Text.Split(t{rw+1}[Column1],"Name: "){1},"Sex: "),
                #"Name" = splitNS{0},
                #"Sex" = splitNS{1},
                Report = #table({"Patient ID","Name","Age","Sex"}, {{#"Patient ID",#"Name",#"Age",#"Sex"}})
            in 
                Report, type table[Patient ID=text, Name=text, Age=Int64.Type, Sex=text]},

        {"Tobacco Info", (t)=>
            let
                #"Cigarette Info" = Text.Split(t[Column1]{fnFindInList(t[Column1],"Cigarettes")},":"),
                    #"Cig Amount" = if Text.StartsWith(Text.Trim(#"Cigarette Info"{1}),"Yes") 
                                    then Number.From(Text.Split(Text.Trim(#"Cigarette Info"{2})," "){0}) *
                                            Number.From(Text.Split(Text.Trim(#"Cigarette Info"{3})," "){0})
                                else null
            in 
                #table({"Cigarettes" & "#(lf)" & "per/day"},{{#"Cig Amount"}}), Int64.Type},
        
        
        {"Count", each _, type table [Column1=nullable text, Patients=number]}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Grouped Rows",{"Patients"}),
    #"Expanded Report Info" = Table.ExpandTableColumn(#"Removed Columns1", "Report Info", {"Report ID", "Report Date"}, {"Report ID", "Report Date"}),
    #"Expanded Pt Info" = Table.ExpandTableColumn(#"Expanded Report Info", "Pt Info", {"Patient ID", "Name", "Age", "Sex"}, {"Patient ID", "Name", "Age", "Sex"}),
    #"Expanded Tobacco Info" = Table.ExpandTableColumn(#"Expanded Pt Info", "Tobacco Info", {"Cigarettes#(lf)per/day"}, {"Cigarettes#(lf)per/day"})
in
    #"Expanded Tobacco Info"

 

Output from your two files

(screenshot of the resulting output table)

 

 

 

Thanks for the reply. Sorry I am not doing well. 

Wondering what the best strategy is to deal with:

~ a record spanning multiple pages (Page 2, ...)

~ dealing with multiple lines of dosing in a medical treatment

~ optional lines, or lines that appear only under certain conditions

~ say, section info lines going onto multiple pages

         ~ In SH Testing 2.txt, the record for Patient ID: 101 goes onto multiple pages.

         ~ The payment details span Page 2 and Page 3.

         ~ Note: When a record spans multiple pages (Page 2, 3, ...), the Patient Info section and the TYPE OF TOBACCO section repeat.

 

Thanks in advance. 

 

It worked on the data sample you presented.

 

If the data sample is not representative of your actual data format and variations, I am not surprised that you are running into trouble.

 

What is the answer if you are unable to adapt what I provided?  YOU have the task of creating a representative test document.

Yes, you are right; the real scenarios have many more combinations and variations.

 

I am NOT tackling those scenarios until I have solved these test documents and the planned model solution. For now, I am working on the test documents and the planned test model. I can change the model as I progress if needed. I am going scenario by scenario and will adapt the same approach to the real scenarios later.

 

With the logic you presented and my own knowledge, I am able to get the details for the "Report file", "Patient Info", and "Type of Tobacco" sections, i.e., Table: File Info, Table: Record Info, Table: Patient Info, Table: Type of Tobacco, roughly 80% there, except for a few things.

 

These sections (tables) are based on fixed rows: the data fields are always present and the order of the rows is also the same.

 

The other sections are tricky, hence I posted the questions back.

 

Scenario: a record spanning multiple pages (Page 2, ...)

A record is the information for one patient visit.
A record can span multiple pages.

This situation is tricky, as we need the info from all pages in tabular format.

How do I handle, model, and present this well? (a modeling question)
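
For illustration only, one possible way to tame the repetition (a sketch; the "Page " marker and the assumption that repeated sections are exact duplicates come from the sample files, not the real data): after tagging and grouping the lines per record as in the earlier query, clean each record's sub-table before parsing it.

// custom function applied to each record's sub-table t ("Column1" holds the lines)
(t as table) as table =>
let
    // drop the page-break marker lines ("Page 2", "Page 3", ...); marker text is assumed
    noPageLines = Table.SelectRows(t, each not Text.StartsWith([Column1], "Page ")),
    // sections repeated verbatim on continuation pages collapse to a single copy;
    // genuinely duplicated data lines would collapse too, so treat this as a sketch
    deduped = Table.Distinct(noPageLines, {"Column1"})
in
    deduped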

 

Scenario: dealing with multiple lines of dosing in medical treatment

Dosing takes multiple lines. The issue here is that the line just before the dosing block, which has no identifier, holds the treatment type. The dosing lines are not a fixed number and can span multiple pages. How do I handle this?
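
For illustration (a sketch; the "Dosing" and "PAYMENT DETAILS" labels are assumptions about the layout): locate the line that starts the dosing block and the line that starts the next section, and take everything in between, however many lines that turns out to be. The treatment-type line just above the block is then li{startPos - 1}.

(li as list) as list =>
let
    // positions of the assumed start marker and of the next section's header
    startPos = List.PositionOf(List.Transform(li, each Text.StartsWith(_, "Dosing")), true),
    endPos   = List.PositionOf(List.Transform(li, each Text.StartsWith(_, "PAYMENT DETAILS")), true),
    // the variable-length dosing block is every line between the two markers
    block    = List.Range(li, startPos + 1, endPos - startPos - 1)
in
    block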

 

Scenario: optional lines, or lines that appear only under certain conditions

Take the "Visit Notes" line, for example. This line is optional, and there are other optional lines like it.

How do I handle all the optional lines?
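
For illustration (a sketch; "Visit Notes:" is an assumed label): an optional line simply becomes a null field when nothing matches.

let
    // hypothetical record lines; in the real query this would be the record's Column1 list
    li        = {"Patient ID: 101", "Sex: F"},
    matches   = List.Select(li, each Text.StartsWith(_, "Visit Notes:")),
    // null when the optional line is missing, otherwise the text after the label
    visitNote = if List.IsEmpty(matches)
                then null
                else Text.Trim(Text.AfterDelimiter(matches{0}, "Visit Notes:"))
in
    visitNote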

 

Scenario: a section spanning multiple pages

The first scenario is at the record level, whereas this one is about a section spanning pages: the section header is on one page and the detail lines can continue onto the following page(s), i.e., the section's info lines go onto multiple pages.

 

Sometimes one set of dosing rows can spill across multiple pages.

 

In SH Testing 2.txt, the record for Patient ID: 101 goes onto multiple pages, and the payment details section spans Page 2 and Page 3.

 

Note: When a record spans multiple pages (Page 2, 3, ...), the report file info, the Patient Info section, and the TYPE OF TOBACCO section repeat.

 

Thanks in advance.

 

 

Hi @ronrsnfld 

 

Any further help? I am not able to get it working. Thanks in advance, and I appreciate your help. Please.

I did not see the link to the test document I asked you to prepare in my previous post, only a description. So no, I will not be providing further help.

Thanks @ronrsnfld .

 

I provided the test documents in the zip file. 

SH Testing 1.txt

SH Testing 2.txt

 

All the use cases/scenarios that I described are based on these text files, i.e., multi-page records. Those are the scenarios I am trying to solve.

 

With your help, the sections with a fixed set/number of rows are solved. It gets a little complicated when a record spans more than one page: those fixed rows repeat on each additional page. For the same record, the remaining sections can have the section header line on one page and the detail lines on another. Also, dosing has a dynamic number of rows, and there are optional rows.

 

No worries if you cannot help further. I appreciate your help with the fixed-rows portion. Thanks.

 

 

And the code I supplied showed you a methodology to deal with the data you supplied, for several of the categories.

 

You should be able to use that same methodology for the other information included in those text files.

Thanks. No worries if you cannot help further. I appreciate your help with the fixed-rows portion.
