Dear Power BI Community,
First of all, a quick overview of my situation: I am a univesrity student writing his thesis and who is completly new to using Power Query. As a part of my thesis I have to analyze stock transactions of individuals who have been made public through PDF-files. To organise the data I "bulk" importes hundreds of PDFs containing the transactions via power query and transforming the entire folder into a docuent with tha data formated as tables. As a last step, I would then need to merge all the tables extracted from all the different PDFs (each PDF representing a different filling) so that I could sort the table content and get "one large table with all transactions from all individualds" containing the following properties (headers for columns): "asset", "transaction type", "date" and "transaction value".
My Problem
When executing the last step, which is expanding the tables (figure 1) by all columns they contain, I am encountering the following difficulty: Not every table has the same amount of rows (figure 2). Power query seems to have added columns in between (maybe due to some minor differnces in the length of certain words in the PDFs) which makes the resulting output of this step useless. The table which is generated as an output is distorted, meaning that some columns contain multiple properties such as "tranaction type", "date" and "transaction value" at the same time (figure 3). This makes it impossible for me to use the data generated without manually adjusting every couple of rows (there are over 11'000 of them, about 100'000+ considering my entire data set).
Figure 1
Figure 1
Figure 2
Figure 2
Figure 3
Figure 3
My Attempts to Solve it
I tried to add a custom column (actaully row) to the power query spreadsheet and using the command Table.SelectColumns(#"Filtered Rows" , each ([Data] = "Asset", [Data] = "Transaction Type", [Data] = "Date", [Data] = "Transaction Value")) to select columns refering to the column "Data" to extract only the columns from each table I need. When running this command, the error "we cannot convert the value of type function to type text", which makes sense to me bacause of the difference in data formats, but I don't know how to solve that and I think that there is an easier solution to this problem anyway.
Figure 4
Figure 4
My Question
How can I expand the table in power query (as marked in figure 4) by the columns I need from the tables ("Asset", "Transaction Type", "Date", "Transaction Value") and apply this step for all rows?
Unfortunately I can't attach my excel file. If you need any further information please feel free to ask me 🙂
Thank you very much in advance for your answers! Solving this problem would help me great deal and would get you a mentionning in my thesis (if you'd want) 😉
Kind Regards,
Colin
Dear @ImkeF
Thanks a lot for your answer! It worked.
I was able to follow your instructions and changed my methods: I searched for the longest document (in my case it was a 12 page PDF containing 10 tables) and put it as an example file "000_example" in every folder I imported with Power Query). Then, as you mentioned, I checked the tranform sample file for its format and an removed all unnecassary columns.
Kind Regards,
Colin
Hi @colcol ,
you need to solve this problem before you expand all the data.
In your Queries-Pane you should see a folder like this:
This has autmatically been created when you expanded the files from the folder.
The query "Transform Sample File" contains the steps that will be performed to each file before it will be expanded.
Open that query and check all the colums in there (you can use Ctrl+A for it). Then right-click the mouse and choose "Remove Other columns".
This won't change anything to that current sample query, but it will make sure that from all the other files, only those selected columns will be imported. That should solve the problem.
To improve your general understanding of the Import-from-folder method, I recommend this video:
Combining Multiple Files from a folder using Power Query in Excel or Power BI - YouTube
Imke Feldmann (The BIccountant)
If you liked my solution, please give it a thumbs up. And if I did answer your question, please mark this post as a solution. Thanks!
How to integrate M-code into your solution -- How to get your questions answered quickly -- How to provide sample data -- Check out more PBI- learning resources here -- Performance Tipps for M-queries