Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Enhance your career with this limited time 50% discount on Fabric and Power BI exams. Ends August 31st. Request your voucher.

Reply
labmgf
New Member

Select range from many identical non tabular excel dimensional report

Good day,

the Q.dept. of the company I'm working for has produced many dimensional analisys reports of hoses in non tabular form like this one  below. 

excel report crop.png

I'm struggling trying to exctract hose code and batch n.

labmgf_4-1701855791361.png

and the dimensional measurements organized in sub batches (1A-1B etc..) in order to later analytics.

 

labmgf_3-1701855721708.png

labmgf_5-1701855821750.png

What woul be Your approach?

Many Thanks

Alessio

 

 

 

1 ACCEPTED SOLUTION
rubayatyasmin
Super User
Super User

Hi, @labmgf 

 

To extract specific data like hose code, batch number, and dimensional measurements from a non-tabular report like the one you've provided, I would suggest the following approach:

  1. Data Identification:

    • Manually review the report to identify the pattern or location where the hose code and batch numbers are consistently located. This might be in the header, footer, or a specific section of the report.
  2. Structured Extraction:

    • If the report is in a consistent format, you can use a script to extract the information. For example, if the report is a PDF, you could use a PDF parsing library in Python, such as PyPDF2 or PDFMiner, to extract text.
    • For an Excel file, you could use libraries like openpyxl or pandas to read the Excel file and extract the relevant cells based on their location.
  3. Regular Expressions:

    • Use regular expressions to match and extract hose codes and batch numbers if they follow a specific pattern.
  4. OCR and Text Recognition:

    • If the report is scanned or if you have to work with images, Optical Character Recognition (OCR) tools like Tesseract can convert images of text into actual text data that can be further processed. Tesseract can be used via the pytesseract Python library.
  5. Data Organization:

    • Once you've extracted the raw data, you can organize it into a structured format like a CSV file or an Excel workbook. You could use the pandas library to handle and organize the data efficiently.

Did I answer your question? Mark my post as a solution!super-user-logo

Proud to be a Super User!


View solution in original post

1 REPLY 1
rubayatyasmin
Super User
Super User

Hi, @labmgf 

 

To extract specific data like hose code, batch number, and dimensional measurements from a non-tabular report like the one you've provided, I would suggest the following approach:

  1. Data Identification:

    • Manually review the report to identify the pattern or location where the hose code and batch numbers are consistently located. This might be in the header, footer, or a specific section of the report.
  2. Structured Extraction:

    • If the report is in a consistent format, you can use a script to extract the information. For example, if the report is a PDF, you could use a PDF parsing library in Python, such as PyPDF2 or PDFMiner, to extract text.
    • For an Excel file, you could use libraries like openpyxl or pandas to read the Excel file and extract the relevant cells based on their location.
  3. Regular Expressions:

    • Use regular expressions to match and extract hose codes and batch numbers if they follow a specific pattern.
  4. OCR and Text Recognition:

    • If the report is scanned or if you have to work with images, Optical Character Recognition (OCR) tools like Tesseract can convert images of text into actual text data that can be further processed. Tesseract can be used via the pytesseract Python library.
  5. Data Organization:

    • Once you've extracted the raw data, you can organize it into a structured format like a CSV file or an Excel workbook. You could use the pandas library to handle and organize the data efficiently.

Did I answer your question? Mark my post as a solution!super-user-logo

Proud to be a Super User!


Helpful resources

Announcements
July 2025 community update carousel

Fabric Community Update - July 2025

Find out what's new and trending in the Fabric community.

July PBI25 Carousel

Power BI Monthly Update - July 2025

Check out the July 2025 Power BI update to learn about new features.