Solved: Data Inconsistency excel files different structure...

Anonymous · ‎01-01-2020

Hi Guys,

I need to build a model using excel files which has different structure into a unique structure.

My client has a bunch of excel files coming from an APP (CostX)

The excel files have different structure for example:

Different descriptions for the same item
Bunch of spaces
has no headers
The address is in the first rown then the second row start the columns (name, description of the item, cost, etc)
subtotals in the middle then some spaces
has no pattern for these inconsistencies

The challenge here is the build a model in power bi to automate when the users save these files in the client's database.

Basically I need to build a model to retrieve all these excel files into a unique pattern to build the dashboards,

any idea or help would be very appreciate, thank you so much guys.

Fabricio,

Anonymous · ‎01-02-2020

HI @Anonymous ,

For large amount of randomly structure files, I'd like to suggest you do these transform/conversation by external tools instead of M query code. (notice: these M query codes of transform are complex, low performance and hard to maintain, you need to manually fix these steps codes every time major data structure changes)

In fact, current I haven't found any simple solutions to automatically handle fully randomly data structure formatting, maybe you can try to machine learning?

Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store

Regards,

Xiaoxin Sheng

View solution in original post

Anonymous · ‎01-01-2020

HI @Anonymous,

>>Different descriptions for the same item

For this scenario, I'd like to suggest you create a dictionary table or use 'fuzzy match' feature to mapping different descriptions of similar items.

Automatically create data dictionary for your Power Pivot model

Fuzzy Match with Power Query

>>Bunch of spaces

You can take a look at the following blog about use 'Text.Split' function to remove space characters into values:

Clean WhiteSpace in PowerQuery

>>has no headers

In my opinion, I'd like to suggest you use 'Promoted Headers' function to define the table header if your header is stored in first row of table records or use 'Table.RenameColumns' with 'List.zip' and 'Table.ColumnNames' functions to achieve auto change default table names with default value:

	Table.RenameColumns(PreviousStep,List.Zip({Table.ColumnNames(PreviousStep), {"Columname1","Columname2","Columname3","Columname4","Columname5"}}))

Notice: list.zip function expression should have similar count of items in left and right parameters.

>>The address is in the first rown then the second row start the columns (name, description of the item, cost, etc)

You can add an index column to your table, then you can add a custom column to extract the next/previous odd/even index records(current index +/- 1) then remove empty row to display address and detail descriptions at the same row.

>>subtotals in the middle then some spaces

I'd like to suggest you remove these blank columns and subtotal column. Power bi will aggregation your records based on its category and legends fields when you use them in visuals, so the subtotal fields are not necessarily.

Work with aggregates (sum, average, and so on) in the Power BI service

I hope the above description will help with your scenarios.

Regards,

Xiaoxin Sheng

Anonymous · ‎01-01-2020

Hi @Anonymous

thanks for your message

I'll try to use the fuzzy Match,

for you solution, I can do manually for one or two excel files but do you think would be good if I have hundreds of excel files? I don't know excel how many but I know this app where I'm getting the excel files provides all in different patterns, for example It doesn't provide exctely the same strucure of spaces or columns names and etc,

so I need somehow to automate this ETL,

Fabricio,

Anonymous · ‎01-02-2020

HI @Anonymous ,

For large amount of randomly structure files, I'd like to suggest you do these transform/conversation by external tools instead of M query code. (notice: these M query codes of transform are complex, low performance and hard to maintain, you need to manually fix these steps codes every time major data structure changes)

In fact, current I haven't found any simple solutions to automatically handle fully randomly data structure formatting, maybe you can try to machine learning?

Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store

Regards,

Xiaoxin Sheng

MFelix · ‎01-02-2020

Hi @Anonymous ,

You need to create a function with Power query that will allow you to automate the treatment of each of the files in a block check the post below:

https://www.poweredsolutions.co/2019/02/19/parameters-and-functions-in-power-bi-power-query-custom-functions/

The only question here is that you probably need to apply this to a folder or a sharepoint folder to get all the files as a parameter in your custom function.

Regards,

MFelix

Regards

Miguel Félix

Did I answer your question? Mark my post as a solution!

Proud to be a Super User!

Check out my blog: Power BI em Português

MFelix · ‎01-01-2020

Hi @Anonymous,

When you refer yhqt the fike as inconsistency is that every download is different or the files are always in the same format although is not a table format?

If the files are always downloaded in the same format you can make the query to automate the cleaning of the files to become a table format.

If you can share a sample I can make a small query to exemplify what can be done.

Regards,
MFelix

Regards

Miguel Félix

Did I answer your question? Mark my post as a solution!

Proud to be a Super User!

Check out my blog: Power BI em Português