Transforming Financial Data Extraction With AI for a Global Professional Services Firm

Leveraging Generative AI, we automated the extraction of data from 3,000 to 7,000 financial documents yearly for a top professional services firm, saving 3,100 hours annually.

The project in a nutshell

Automated data extraction
AI solution
Financial document processing
Enhanced efficiency
Cognitive transformation

The Challenge

Our client is one of the largest global professional services networks in the world. They operate in numerous industries and have a significant global presence.

They needed to extract and analyze information from a vast collection of 10-K and 8-K reports. The main need was to pull specific fields from the documents filed under Exhibit 10 of the 10-K, which are mostly legal agreements and contracts, each type with its own fields of interest. This task had been performed manually, consuming substantial time and resources.

The first challenge was the lack of a labeled dataset for document classification. Additionally, the initial iteration of the system lacked a test dataset, which was necessary to measure and improve the precision of the extracted fields. Finally, there was a need to design an efficient validation mechanism for the extracted data.
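
To make the role of a test dataset concrete, here is a minimal sketch of how field-level precision could be measured against hand-labeled values. The function names, the document IDs, and the normalization rule (trim and lowercase) are assumptions for illustration, not the project's actual validation mechanism.

```python
def normalize(value: str) -> str:
    """Assumed comparison rule: ignore case and surrounding whitespace."""
    return value.strip().lower()


def field_precision(predicted: dict, expected: dict) -> dict:
    """Per field: share of extracted (non-empty) values that match the label."""
    hits: dict[str, int] = {}
    attempts: dict[str, int] = {}
    for doc_id, fields in predicted.items():
        for name, value in fields.items():
            extracted = normalize(value)
            if not extracted:
                continue  # nothing was extracted, so precision is unaffected
            attempts[name] = attempts.get(name, 0) + 1
            truth = normalize(expected.get(doc_id, {}).get(name, ""))
            if extracted == truth:
                hits[name] = hits.get(name, 0) + 1
    return {name: hits.get(name, 0) / attempts[name] for name in attempts}


# Invented example values, not real client data.
predicted = {
    "d1": {"party": "Acme Corp", "date": "2024-01-01"},
    "d2": {"party": "beta llc ", "date": "2023-05-05"},
    "d3": {"party": "", "date": "2022-02-02"},
}
expected = {
    "d1": {"party": "Acme Corp", "date": "2024-01-01"},
    "d2": {"party": "Beta LLC", "date": "2023-05-06"},
    "d3": {"party": "Gamma", "date": "2022-02-02"},
}
scores = field_precision(predicted, expected)
```

Tracking a score per field, rather than one global number, shows which prompts need another iteration.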

How this partnership impacted our client’s business

8 months to develop and launch an MVP
3,000-7,000 documents processed per year
3,100 hours saved annually

Solution

Our client needed to train and deploy AI models able to extract the right information from all the different types of documents. Together with our client’s team, CloudX leveraged Generative AI to build an advanced data pipeline designed for extracting and analyzing information from 10-K and 8-K documents.

Phase 1: Classification

We used Azure OpenAI to generate a classification dataset, in which a series of documents were labeled according to their specific types. This dataset was then used to train the classification model, which combines document embeddings with a Random Forest classifier. The trained model categorizes new, unseen documents with a high confidence level.
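
The embeddings-plus-Random-Forest approach can be sketched as follows. This is an illustration under stated assumptions: `embed` is a toy keyword-flag stand-in for a real embedding model (the case study does not say which one was used), and the labels, sample texts, and the `min_confidence` threshold are invented. Only `RandomForestClassifier` and its `fit`, `predict_proba`, and `classes_` members are real scikit-learn API.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical keyword list used by the toy embedding below.
KEYWORDS = ["salary", "severance", "executive",
            "lender", "borrower", "loan",
            "landlord", "tenant", "rent"]


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: one presence flag per keyword."""
    words = set(text.lower().split())
    return [1.0 if keyword in words else 0.0 for keyword in KEYWORDS]


# Tiny invented dataset standing in for the LLM-labeled documents.
LABELED_DOCS = [
    ("employment agreement setting executive salary and severance", "employment"),
    ("executive severance plan with base salary schedule", "employment"),
    ("amended agreement on executive salary bonus and severance", "employment"),
    ("credit agreement between borrower and lender for a term loan", "credit"),
    ("revolving loan facility where each lender funds the borrower", "credit"),
    ("loan amendment signed by borrower and administrative lender", "credit"),
    ("office lease between landlord and tenant with monthly rent", "lease"),
    ("tenant pays rent to landlord under this sublease", "lease"),
    ("landlord consent letter on tenant rent adjustment", "lease"),
]


def train(docs):
    """Fit a Random Forest on the embedded documents."""
    features = [embed(text) for text, _ in docs]
    labels = [label for _, label in docs]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return model.fit(features, labels)


def classify(model, text: str, min_confidence: float = 0.6):
    """Return (label, confidence); low-confidence documents go to review."""
    probabilities = model.predict_proba([embed(text)])[0]
    best = probabilities.argmax()
    confidence = float(probabilities[best])
    if confidence < min_confidence:
        return "needs_review", confidence
    return str(model.classes_[best]), confidence


model = train(LABELED_DOCS)
label, confidence = classify(
    model, "sublease where the tenant owes rent to the landlord")
print(label, round(confidence, 2))
```

Using the forest's class probabilities as a confidence score gives a simple way to route uncertain documents to a human instead of extracting from a misclassified agreement.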

Phase 2: Extraction

We developed a solution using Retrieval-Augmented Generation (RAG) and prompts specifically designed to extract the correct fields from the already classified documents. In an iterative process, both the prompts and the RAG pipeline were continuously tested and improved, with the goal of increasing the precision of the model’s answers.
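
A RAG extraction step of this kind might look roughly like the sketch below. It is a simplified illustration, not the project's pipeline: the word-overlap `score` stands in for a real search index such as Azure AI Search, `llm` is any caller-supplied function that sends a prompt to a language model, and the prompt wording and all function names are hypothetical.

```python
from typing import Callable


def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query: str, passage: str) -> int:
    """Toy relevance score: distinct query words that appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]


def build_prompt(field: str, doc_type: str, passages: list[str]) -> str:
    """Assemble a field-specific prompt around the retrieved passages."""
    context = "\n---\n".join(passages)
    return (
        f"You are extracting data from a {doc_type}.\n"
        f"Using only the context below, return the value of '{field}'. "
        "Answer 'NOT FOUND' if it is absent.\n\n"
        f"Context:\n{context}"
    )


def extract_field(field: str, doc_type: str, document: str,
                  llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve relevant passages, prompt the model, return its answer."""
    passages = retrieve(field, chunk(document), k)
    return llm(build_prompt(field, doc_type, passages)).strip()
```

Because the document type comes from the classification phase, the prompt can be tailored per agreement type, which is one place where the iterative prompt tuning described above would apply.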

Phase 3: UAT

In this stage we performed iterative User Acceptance Testing (UAT), testing our solution with its final users: the business team. Their deep business knowledge helped us determine whether the precision of the extraction was high enough or whether it could be improved. We built a user interface (UI) embedded in Microsoft Teams that lets users work with the extracted fields, editing them manually or even modifying the prompts to refine the extraction process. Once the required data is ready, it can be exported to a CSV file for analysis and sharing as needed.
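
The review-then-export step can be illustrated with a short sketch using only the Python standard library. The record layout, the field names, and the shape of the `edits` mapping are assumptions made for the example; the actual UI and export format are not described in this case study.

```python
import csv
import io


def apply_edits(records: list[dict], edits: dict) -> list[dict]:
    """Return copies of the records with reviewer corrections applied.

    `edits` maps (doc_id, field_name) to the corrected value.
    """
    corrected = []
    for record in records:
        updated = dict(record)  # copy so the raw extraction is preserved
        for (doc_id, field), value in edits.items():
            if doc_id == record["doc_id"]:
                updated[field] = value
        corrected.append(updated)
    return corrected


def to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Serialize the reviewed records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented example values, not real client data.
extracted = [
    {"doc_id": "A1", "counterparty": "Acme", "effective_date": "2024-01-01"},
]
reviewer_edits = {("A1", "counterparty"): "Acme Corp"}
csv_text = to_csv(apply_edits(extracted, reviewer_edits),
                  ["doc_id", "counterparty", "effective_date"])
```

Keeping the original extraction untouched and applying edits to a copy makes it possible to compare model output with reviewer corrections later.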

Results

The implementation of our AI-driven solution marked a significant milestone for our client, addressing a challenge that had persisted for nearly a decade. Remarkably, within just 8 months, we developed a Minimum Viable Product (MVP) that effectively resolved this long-standing issue.

Our solution efficiently processes between 3,000 and 7,000 documents annually, automating the extraction of critical information from 10-K and 8-K documents. Previously, this task was performed manually, consuming substantial time and resources. The automation not only streamlined the process but also enhanced accuracy and reliability.


Building on this success, we are now extending the solution's capabilities to include an agreement analyzer. This new application is projected to save approximately 3,100 hours annually, demonstrating the scalability and versatility of our AI solution.

The project’s tech stack

OpenAI
Python
Azure AI Search
Scikit-learn
Azure Blob Storage
Angular