NJamigos
Frequent Visitor

Transform multiple sample page / query

I'm trying to get data from PDF files. The first page's data is slightly different from the other pages. First I filtered to the 1st page and did the transformations in the sample transformation query. After that, using "add to new query" from the filter page step, I selected the other pages and did the transformations on them.

Finally I appended both in the initial sample query to form my data.

But when I run this query on a folder with multiple PDFs, the data for pages other than the first gets repeated for the other PDFs, i.e. the first PDF's data is repeated.

Can you please tell me what went wrong, or how to solve this?

 

@amitchandak @olgad 

16 REPLIES 16
Sahir_Maharaj
Community Champion

Let me know if this helps.


Did I answer your question? Mark my post as a solution, this will help others!

If my response(s) assisted you in any way, don't forget to drop me a "Kudos" 🙂

Kind Regards,
Sahir Maharaj

Data Scientist | Data Engineer | Data Analyst | AI Engineer

➤ Website: https://sahirmaharaj.com

➤ Email: sahir@sahirmaharaj.com

➤ Let's connect on LinkedIn: Join my network of 10K+ professionals

➤ Want me to build your Power BI solution? Let's chat about how I can assist!

➤ Join my Medium community of 30k readers! Sharing my knowledge about data science and artificial intelligence


@Sahir_Maharaj why did you post one message as 10 replies?



____________
Please join the Power BI User Group if you need help with dashboard design and usability
https://community.powerbi.com/t5/Power-BI-UX-UI-User-Group/gh-p/PowerBIUXUIUserGroup

Subscribe to my medium blog

@technolog I apologize if my previous responses caused any confusion. Posting one message as multiple replies was intentional: the aim was to make it easier for the recipient to read and respond.

 

By breaking it up into smaller pieces, I aimed to provide a better user experience and make it more manageable for the recipient to process the information.




It seems to me that information is much easier to absorb when it is presented in one message😊




I understand that this may not be the preferred method for everyone. 

 

Thank you for your input, as I am open to learning new ways of presenting my suggestions.

 

I will consider your feedback for future interactions 😊




Sahir_Maharaj
Community Champion

Here are the general steps:




Here is some example M code to get you started:

 




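let
    // Step 1: Get the list of PDF files in the folder (extension match is case-insensitive)
    Source = Folder.Files("C:\Path\To\PDF\Folder"),
    PDFFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".pdf"),

    // Step 2: Put each file's binary into a "Contents" column
    // (Binary.Combine on a one-item list returns that file's binary unchanged)
    CombinePDFs = Table.AddColumn(PDFFiles, "Contents", each Binary.Combine({[Content]})),

    // Step 3: Extract tables and pages from each PDF file
    // (the second argument of Pdf.Tables is an optional options record, not the file name)
    ExtractTables = Table.AddColumn(CombinePDFs, "Tables", each Pdf.Tables([Contents])),
    // Pdf.Tables returns the columns Id, Name, Kind and Data; "Name" is skipped
    // here because the file list already has a "Name" column
    ExpandedTables = Table.ExpandTableColumn(ExtractTables, "Tables", {"Id", "Kind", "Data"}, {"Id", "Kind", "Data"}),

    // Step 4: Clean and reshape data as needed
    // ...

    // Combine all tables into a single table, skipping files that produced no rows
    CombinedTables = Table.Combine(List.RemoveNulls(ExpandedTables[Data]))
in
    CombinedTables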



The data is not structured as tables in the PDF. In my case the first page's structure is a little different from the other pages. I did transformations to make both the same, but when appending in 'Transform Sample', the other pages repeat the first PDF's data.

What might I have done wrong?

This code assumes that all of the PDF files in the folder have tables on their pages.

 

If some of the files do not have tables or have different structures, you may need to add additional logic to handle those cases.
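
As a rough illustration of that kind of additional logic, here is a minimal sketch that drops files from which nothing could be extracted before any expansion happens. The folder path is the same placeholder used elsewhere in this thread, and the step names (WithTables, Readable) are made up for the example. Note that `try ... otherwise` only catches errors raised when the expression is first evaluated, so errors that surface later inside the nested tables may still need separate handling.

```powerquery
let
    // Placeholder folder path, as in the example code in this thread
    Source = Folder.Files("C:\Path\To\PDF\Folder"),
    PDFFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".pdf"),

    // Return null instead of an error when a file cannot be parsed
    WithTables = Table.AddColumn(
        PDFFiles,
        "Tables",
        each try Pdf.Tables([Content]) otherwise null
    ),

    // Keep only the files where at least one table or page was detected
    Readable = Table.SelectRows(
        WithTables,
        each [Tables] <> null and not Table.IsEmpty([Tables])
    )
in
    Readable
```

Files with a different structure can then be routed by inspecting the "Kind" column of the extracted rows ("Table" or "Page") rather than assuming every file contains tables.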




4. Use any necessary transformations to clean and reshape the data.




3. Use the "Pdf.Tables" function to extract the tables from the combined binary column.




2. Use the "Binary.Combine" function to combine the contents of all the PDF files into a single binary column.




1. Use the "Folder.Files" function to get a list of all the PDF files in the folder.



Sahir_Maharaj
Community Champion

To solve this, you can try creating a fully dynamic query that can handle the varying structure of the PDF files. One way to do this is by using the "Binary.Combine" function to combine the PDF files into a single binary column.

 




Sahir_Maharaj
Community Champion

Hello @NJamigos,

 

It sounds like the issue may be that your queries are not fully dynamic and are therefore not handling the varying structure of the PDF files in the folder. When you append the queries from the different PDF files, the queries are likely still referring to the first file's structure, causing the data to be repeated.
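
One likely cause, though nothing in this thread confirms it, is that the extra query created from the sample query is evaluated against the sample file only, so appending it brings the first PDF's pages into every file's result. A minimal sketch of one way around that is to do both the first-page and other-page work inside a single function whose every step derives from the file parameter. The parameter name pdfFile and the step names are hypothetical, the two placeholder steps stand in for the actual transformations, and the sketch assumes Pdf.Tables returns the page rows in page order.

```powerquery
// Hypothetical custom function: every step derives from pdfFile,
// so each file in the folder is processed with its own pages.
(pdfFile as binary) as table =>
let
    // Pdf.Tables returns one row per detected table or page
    // (columns Id, Name, Kind, Data)
    Source = Pdf.Tables(pdfFile),
    Pages = Table.SelectRows(Source, each [Kind] = "Page"),

    // Split THIS file's pages into the first page and the rest
    FirstPage = Table.FirstN(Pages, 1),
    OtherPages = Table.Skip(Pages, 1),

    // Placeholders: apply the first-page and other-page transformations here
    FirstPageData = Table.Combine(FirstPage[Data]),
    OtherPagesData = Table.Combine(OtherPages[Data]),

    // Append both parts for this one file
    Result = Table.Combine({FirstPageData, OtherPagesData})
in
    Result
```

If the function is saved under a name such as TransformPdf (also hypothetical), it can be invoked once per file from the folder query, for example with Table.AddColumn(PDFFiles, "Data", each TransformPdf([Content])), and the resulting "Data" column can then be expanded or combined.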



