deepdoctection - An open source package for document intelligence

Janis Meyer

Tuesday 09:00 in A1

Type/Track: Talk, PyData: Natural Language Processing

Document Intelligence refers to the task of understanding and extracting information from visually rich business documents, be they noisy scans, images, PDFs, etc.

While deep learning with CNN architectures or Transformer-based multimodal approaches has brought a lot of improvement, open source projects that offer a framework for using these powerful tools as components of a pipeline are scarce to non-existent. Moreover, when starting a project with digitized documents, every step involved, such as loading a multi-page document, running an OCR task, or cropping regions of interest (ROIs), must be written from scratch.

This talk aims to put deepdoctection on the radar: a new package for building document analysis pipelines.

Deepdoctection grew out of the original problem of extracting and normalizing table contents from investment documents. It therefore offers pre-trained models for document layout analysis and table recognition, as well as wrappers for OCR tools. The goal is to further include solutions for key figure extraction and entity recognition on visually rich structures, and to provide a framework for solving information extraction tasks on visually rich documents.
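To make the pipeline idea concrete, here is a minimal sketch in plain Python of how loading pages, detecting layout regions, cropping ROIs, and recognizing text can be chained as interchangeable components. It does not use or mirror the deepdoctection API: every name below (`Page`, `Region`, `detect_layout`, `crop_regions`, `recognize_text`, `run_pipeline`, `load_document`) is hypothetical, and the "image" is a toy grid of characters so the example runs without any model or OCR engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; not the deepdoctection API.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


@dataclass
class Region:
    category: str
    box: Box
    crop: List[str] = field(default_factory=list)
    text: str = ""


@dataclass
class Page:
    number: int
    rows: List[str]  # toy "image": one string per pixel row
    regions: List[Region] = field(default_factory=list)


# A pipeline component takes a page and returns the (enriched) page.
Component = Callable[[Page], Page]


def load_document(raw_pages: List[List[str]]) -> List[Page]:
    """Wrap raw page data into numbered Page objects (1-based)."""
    return [Page(number=i + 1, rows=rows) for i, rows in enumerate(raw_pages)]


def detect_layout(page: Page) -> Page:
    """Stand-in for a layout model: each non-blank row becomes one 'text' region."""
    for y, row in enumerate(page.rows):
        if row.strip():
            x0 = len(row) - len(row.lstrip())
            x1 = len(row.rstrip())
            page.regions.append(Region("text", (x0, y, x1, y + 1)))
    return page


def crop_regions(page: Page) -> Page:
    """Cut each region's bounding box out of the page."""
    for region in page.regions:
        x0, y0, x1, y1 = region.box
        region.crop = [row[x0:x1] for row in page.rows[y0:y1]]
    return page


def recognize_text(page: Page) -> Page:
    """Stand-in for OCR: the toy crop already consists of characters."""
    for region in page.regions:
        region.text = " ".join(region.crop)
    return page


def run_pipeline(pages: List[Page], components: List[Component]) -> List[Page]:
    """Apply every component, in order, to every page."""
    results = []
    for page in pages:
        for component in components:
            page = component(page)
        results.append(page)
    return results
```

Calling `run_pipeline(load_document(raw_pages), [detect_layout, crop_regions, recognize_text])` runs the three steps in order on each page. Because each component shares the same page-in, page-out signature, a stand-in can be swapped for a real layout model or OCR wrapper without touching the rest of the pipeline.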

Tags: Computer Vision, Natural Language Processing

Level: Domain Expertise: some; Python Skill Level: some

Janis Meyer

Affiliation: self-employed

Janis has a PhD in mathematics from TU Berlin and has been consulting in financial services for more than 13 years, focusing on Data Warehousing, Risk Control, Regulatory and Financial Reporting, as well as Internal Audit.

Recently, he shifted towards machine learning and especially deep learning solutions, driven by the prospect of solving back office problems that could not be tackled with ordinary methods before. This includes information extraction from complex documents, be they business reports, forms, filings, or other types of documents with complex structure.

Visit the speaker at: GitHub