I'm trying to get data from PDF files. The first page's data is slightly different from the other pages. First, I filtered to the first page and did the transformations in the sample transformation query. Then I added a new query from the filter-page step, selected the other pages there, and did their transformations.
Finally, I appended both in the initial sample query to form my data.
But when I run this query on a folder with multiple PDFs, the data from the pages other than the first is repeated for the other PDFs, i.e. the first PDF's data gets repeated.
Can you please tell me what went wrong, or how to solve this?
Let me know if this helps.
➤ Website: https://sahirmaharaj.com
➤ Email: sahir@sahirmaharaj.com
➤ Lets connect on LinkedIn: Join my network of 10K+ professionals
➤ Want me to build your Power BI solution? Lets chat about how I can assist!
➤ Join my Medium community of 30k readers! Sharing my knowledge about data science and artificial intelligence
@Sahir_Maharaj why did you post one message as 10 replies?
@technolog I apologize if my previous responses caused any confusion. However, I wanted to clarify that my action to post one message in multiple replies was intentional and made with the aim of making it easier for the recipient to read and respond.
By breaking it up into smaller pieces, I aimed to provide a better user experience and make it more manageable for the recipient to process the information.
It seems to me that information is much easier to absorb when it is presented in one message😊
I understand that this may not be the preferred method for everyone.
Thank you for your input as I am open to learning new ways of presenting my suggestions.
I will consider your feedback for future interactions 😊
Here are the general steps:
Here is an example M code to get you started:
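let
    // Step 1: Get the list of PDF files (extension match is case-insensitive)
    Source = Folder.Files("C:\Path\To\PDF\Folder"),
    PDFFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".pdf"),
    // Step 2: Put each file's binary in a "Contents" column
    // (Binary.Combine over a one-item list returns that file's binary unchanged)
    CombinePDFs = Table.AddColumn(PDFFiles, "Contents", each Binary.Combine({[Content]})),
    // Step 3: Extract tables and pages from each PDF file
    // (the optional second argument of Pdf.Tables is an options record, not a file name, so it is omitted)
    ExtractTables = Table.AddColumn(CombinePDFs, "Tables", each Pdf.Tables([Contents])),
    // Pdf.Tables returns the columns Id, Name, Kind and Data; Name is skipped here
    // because the folder listing already has a Name column
    ExpandedTables = Table.ExpandTableColumn(ExtractTables, "Tables", {"Id", "Kind", "Data"}, {"Id", "Kind", "Data"}),
    // Step 4: Clean and reshape data as needed
    // ...
    // Combine all tables into a single table
    CombinedTables = Table.Combine(ExpandedTables[Data])
in
    CombinedTables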
The data in the PDFs is not structured as tables. In my case, the first page's structure is a little different from the other pages. I did transformations to make both the same, but when I append them in the 'transform sample' query, the other pages repeat the first PDF's data.
What might I have done wrong?
This code assumes that all of the PDF files in the folder have tables on their pages.
If some of the files do not have tables or have different structures, you may need to add additional logic to handle those cases.
4. Use any necessary transformations to clean and reshape the data.
3. Use the "Pdf.Tables" function to extract the tables from the combined binary column.
2. Use the "Binary.Combine" function to place each PDF file's contents in a single binary column, one binary per file (concatenating several PDFs into one binary would not produce a readable PDF).
To solve this, you can try creating a fully dynamic query that can handle the varying structure of the PDF files. One way to do this is to use the "Binary.Combine" function to hold the PDF files' contents in a single binary column, so that every file goes through the same steps.
Hello @NJamigos,
It sounds like the issue may be that your queries are not fully dynamic and are therefore not handling the varying structure of the PDF files in the folder. When you append the queries from the different PDF files, the queries are likely still referring to the first file's structure, causing the data to be repeated.
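As a rough sketch of what "fully dynamic" could look like here: if the query for the other pages was created separately from the sample query, it may still read the sample (first) file instead of the file currently being processed. One way to avoid that is to derive both the first-page and the other-page data from the same function parameter. The names TransformPdf and pdfContent are hypothetical, the folder path is a placeholder, and the sketch assumes that Pdf.Tables lists the page items in page order. The real layout-specific steps would replace the two placeholder lines.

```powerquery
let
    // Hypothetical per-file function: every step starts from pdfContent,
    // so nothing stays bound to the sample (first) file
    TransformPdf = (pdfContent as binary) as table =>
        let
            Items = Pdf.Tables(pdfContent),
            // Keep page-level items only (Kind is "Page" or "Table")
            Pages = Table.SelectRows(Items, each [Kind] = "Page"),
            // Assumption: page items are returned in page order
            FirstPage = Table.FirstN(Pages, 1),
            OtherPages = Table.Skip(Pages, 1),
            // Placeholders: apply the first-page and other-page transformations here
            FirstData = Table.Combine(FirstPage[Data]),
            OtherData = Table.Combine(OtherPages[Data]),
            // Append both parts for this one file
            Result = Table.Combine({FirstData, OtherData})
        in
            Result,
    Source = Folder.Files("C:\Path\To\PDF\Folder"),
    PDFFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".pdf"),
    // Invoke the function once per file, passing that file's own binary
    PerFile = Table.AddColumn(PDFFiles, "Data", each TransformPdf([Content])),
    Combined = Table.Combine(PerFile[Data])
in
    Combined
```

Because the append happens inside the function, each PDF's other pages are appended to that same PDF's first page, not to the sample file's.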