How to Extract Tabular Data from a PDF Document Using Camelot in Python

Introduction

PDF, or Portable Document Format, is one of the most common file formats today. It is widely used across industries, from government offices and healthcare to personal work. As a result, a large amount of unstructured data exists in PDF format, and extracting it to generate meaningful insights is a common task for data scientists.

There are several Python libraries dedicated to working with PDF documents, such as PyPDF2. In this tutorial, I will be using Camelot.

Why Camelot?

  • You are in control: Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel, HTML, and SQLite.
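The point about discarding bad tables can be sketched in a few lines. The helper below is hypothetical (it is not part of Camelot) and its thresholds are arbitrary assumptions. It relies only on each table exposing a parsing_report dictionary with accuracy and whitespace entries. Stand-in objects with made-up numbers are used so that the sketch runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing report passes both (assumed) thresholds."""
    good = []
    for table in tables:
        report = table.parsing_report  # dict with 'accuracy', 'whitespace', 'order', 'page'
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-ins for Camelot tables (made-up numbers), so the example runs without a PDF.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 70.0, "order": 2, "page": 1}),
]
good = keep_good_tables(fake_tables)
print(len(good))  # prints 1
```

With real data, the list returned by camelot.read_pdf could be passed in place of fake_tables. Sensible thresholds depend on the documents, so treat the defaults as placeholders.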

Let’s Begin

Before installing Camelot, we have to install Ghostscript. Once Ghostscript is installed, we can install camelot-py.

Run the command below:

pip install "camelot-py[cv]"
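If you are not sure whether Ghostscript is present, a rough check is to look for its executable on the PATH. This is only a heuristic sketch: the executable names below are the common ones and are an assumption about your system, and finding the executable does not guarantee that Camelot can use it.

```python
import shutil


def find_ghostscript():
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    # 'gs' is the usual name on Linux/macOS; the gswin* names are common on Windows.
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
print(gs_path or "Ghostscript not found on PATH")
```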

Once the camelot-py library is installed, we are all set. We will extract a state-wise GST revenue table from the PDF document gst-revenue-collection-march2020.pdf.

Camelot - Pdf table


import camelot

If Camelot is installed, the import succeeds silently. If it is not, Python raises an ImportError.
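The same check can be done without triggering the error, which is handy at the top of a script. This is a minimal sketch using the standard library; the function name camelot_available is made up for illustration.

```python
import importlib.util


def camelot_available():
    """True if a module named 'camelot' can be found in this environment."""
    return importlib.util.find_spec("camelot") is not None


if camelot_available():
    print("camelot is installed")
else:
    print('camelot is missing; try: pip install "camelot-py[cv]"')
```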

# Syntax of the camelot.read_pdf function 
camelot.read_pdf(
    filepath,
    pages='1',
    password=None,
    flavor='lattice',
    suppress_stdout=False,
    layout_kwargs={},
    **kwargs,
)

To extract tables from several pages, pass the page numbers as a string through the pages argument.
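Camelot's documentation gives '1,3,4', '1,4-end', and 'all' as examples of the pages string. If your page numbers live in a Python list, a small helper can build the comma-separated form. The helper name pages_arg is hypothetical and not part of Camelot.

```python
def pages_arg(page_numbers):
    """Join page numbers into a comma-separated string for the `pages` argument."""
    unique_sorted = sorted(set(int(p) for p in page_numbers))
    return ",".join(str(p) for p in unique_sorted)


print(pages_arg([3, 1, 2, 3]))  # prints 1,2,3
```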

tables2=camelot.read_pdf('gst-revenue-collection-march2020.pdf', flavor='stream', pages='0-3')
tables2

Camelot - Table2

This returns a TableList containing every table Camelot found on those pages. We can select a single table by passing its index.

tables2[2]  # 2 is the index 

Camelot - Table Shape

tables2[2].parsing_report

Camelot - Accuracy

The parsing report gives details such as the accuracy, the whitespace percentage, and the page number. Note that the table is spread over 2 pages.
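When a document yields many tables, it can help to list every parsing report at once instead of inspecting them one by one. The sketch below assumes only that each table has a parsing_report dictionary with page, accuracy, and whitespace keys. The helper name and the numbers in the stand-in objects are made up.

```python
from types import SimpleNamespace


def summarize_reports(tables):
    """Return one (index, page, accuracy, whitespace) tuple per table."""
    rows = []
    for index, table in enumerate(tables):
        report = table.parsing_report
        rows.append((index, report["page"], report["accuracy"], report["whitespace"]))
    return rows


# Stand-in objects with made-up numbers; with Camelot you would pass tables2.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 98.7, "whitespace": 10.5, "order": 1, "page": 2}),
    SimpleNamespace(parsing_report={"accuracy": 97.2, "whitespace": 14.0, "order": 1, "page": 3}),
]
for row in summarize_reports(fake_tables):
    print(row)
```

Tables that sit on consecutive pages with similar shapes are good candidates for being one table split by a page break.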

The following code returns the selected table as a pandas DataFrame.

df2=tables2[2].df
df2  

Camelot - Dataframe

In this case the table is split across two pages, so we need a workaround: extract the second part separately and then combine the two parts.

tables2[3]
tables2[3].parsing_report


Notice that this table was extracted from page 3.

df3=tables2[3].df
df3


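The following code combines df2 and df3 into a single DataFrame.

import pandas as pd
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df4 = pd.concat([df2, df3])
df4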


df5 = df4[1:]  # drop the first row
df5.head()
new_header = df5.iloc[0]  # the next row holds the column names
df5 = df5[1:]
df5.columns = new_header
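The combining and header steps above can be wrapped in one reusable function. This is a sketch under stated assumptions, not Camelot functionality: the name combine_parts and the header_row parameter are hypothetical, the sample data is made up, and the function uses pd.concat because DataFrame.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd


def combine_parts(parts, header_row=0):
    """Stack table fragments and use one row of the result as the header."""
    combined = pd.concat(parts, ignore_index=True)
    header = combined.iloc[header_row]
    # Rows up to and including header_row are dropped from the body.
    body = combined.iloc[header_row + 1:].reset_index(drop=True)
    body.columns = list(header)
    return body


# Made-up fragments standing in for df2 and df3.
part_a = pd.DataFrame([["State", "Revenue"], ["A", "10"]])
part_b = pd.DataFrame([["B", "20"], ["C", "30"]])
result = combine_parts([part_a, part_b])
print(result)
```

For the walkthrough above, where the first row is dropped and the second becomes the header, the equivalent call would be combine_parts([df2, df3], header_row=1). If the second fragment repeats the header row, that row would still need to be removed separately.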


There you go: we have extracted a table from a PDF. We can now export this data to the local system in a format such as CSV, JSON, Excel, or HTML.
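As an illustration of the export step, the sketch below writes a DataFrame to CSV and JSON with pandas. The data, the file names, and the temporary output directory are all made up for the example; in the walkthrough you would export df5 to a location of your choice.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the cleaned DataFrame (df5 in the walkthrough).
df = pd.DataFrame({"State": ["A", "B"], "Revenue": ["10", "20"]})

out_dir = tempfile.mkdtemp()  # hypothetical output location
csv_path = os.path.join(out_dir, "gst_table.csv")
json_path = os.path.join(out_dir, "gst_table.json")

df.to_csv(csv_path, index=False)         # CSV without the index column
df.to_json(json_path, orient="records")  # one JSON object per row

print(sorted(os.listdir(out_dir)))
```

Camelot tables also have their own export methods, such as to_csv on a single table and export on the whole table list. Those write the tables as extracted, before any of the cleanup done here.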

Conclusion

Extracting tabular data from a PDF with the Camelot library is straightforward. A huge amount of unstructured data sits in PDF files, and once the tables are extracted we can do a lot of analysis and visualization based on the business need.

I hope this article helps you and saves you a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

About the Author


Prabhat Kumar – Associate Analyst

I am an engineer currently working at a top MNC as an Associate Analyst, and an innovation enthusiast. I love learning new things. I believe all data has a story, and I love reading those stories.

Prabhat Pathak is an Associate Analyst.
