
The digitization of banking transformed how we handle money - from waiting days for cheque deposits to instant transfers. But with this digital revolution came a new challenge: processing the enormous volume of financial data flowing through modern organizations. While digital banking solved the speed problem, it created a data standardization nightmare. Every bank has its own statement format, making it incredibly difficult for businesses to process statements at scale.
Picture this: you're building a platform that needs to process bank statements from hundreds of different banks. Each bank has its own unique way of structuring data, including:
- Different names for the same fields ("Credit" vs "Inflow")
- Multiple languages in a single document (Arabic, Polish, English)
- Varied date and number formats
- Account metadata scattered across headers, footers, and other sections

Traditionally, handling these variations required writing and maintaining custom parsing code for every bank's format, plus manual, error-prone review whenever a layout changed.
We tackled this challenge by building an AI-powered transformation system that can automatically analyze and standardize any bank statement format. Bank statements come in countless shapes - different layouts, languages, and structures - and our pipeline handles this complexity in three main phases:
1. Smart Region Detection
Instead of strict rules or templates, we use an island detection algorithm to find the main transaction table. It first detects all the tables in the entire file and then returns the largest one as a pandas DataFrame (shown here in simplified form; the full implementation appears later in this post):
def detect_largest_island(df: pd.DataFrame) -> Tuple[str, pd.DataFrame]:
    """Find the largest coherent data region while ignoring noise."""
    # First, we look for empty rows to identify potential data boundaries
    empty_rows = df.isna().all(axis=1)
    # Then find the largest contiguous block of populated cells
    largest_island = find_max_data_region(df, empty_rows)
    return largest_island.region, largest_island.data
2. Intelligent Field Mapping
We feed the table from the island detection algorithm to an LLM, which infers the intent of each column so that we can later rename the columns to match our internal format. This intelligent step lets us handle any bank statement globally without writing custom code per bank (again simplified here; the full prompt-based implementation appears later):
def get_column_mappings(df: pd.DataFrame, column_info: Dict) -> Dict[str, Any]:
    """Map variant column names to standard fields."""
    # Send column names and sample data to Claude
    response = claude.analyze(
        column_info=column_info,
        sample_data=df.head(10)
    )
    # Get standardized mappings (e.g., "Credit" = "Inflow" = "دائن")
    return response.mapped_columns
3. Dynamic Transformation Engine
Finally, we convert everything to standard formats:
def transform(df: pd.DataFrame, template_config: Dict) -> pd.DataFrame:
    """Transform data based on detected patterns."""
    # Extract columns using the mapping
    result = extract_columns(df, template_config)
    # Add any required metadata
    result = add_attributes(result, template_config)
    return result
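Putting the three phases together looks roughly like the glue code below. This is a hedged sketch, not our exact implementation: the standardize_statement wrapper, the header=None read, and the way the LLM output is packed into template_config are assumptions for illustration, following the simplified snippets above.
import pandas as pd

def standardize_statement(path: str) -> pd.DataFrame:
    # Load the raw sheet with no header row so every cell is preserved as data
    raw = pd.read_excel(path, header=None)

    # Phase 1: locate the main transaction table
    region, island = detect_largest_island(raw)

    # Phase 2: let the LLM map the island's columns to our standard fields
    mappings = get_column_mappings(island, column_info={"table_region": region})

    # Phase 3: transform the raw sheet using the resulting configuration
    template_config = {"actions": [{"type": "extract_columns",
                                    "config": {"column_mappings": mappings}}]}
    return transform(raw, template_config)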
When processing bank statements, one of the trickiest challenges is identifying where the actual transaction data lives. Bank statements are essentially a maze of headers, metadata, transaction rows, and empty spaces. Here's how we solve this using our "Island Detection" algorithm.

The Core Challenge: Locating Data Islands
Let's start with a real example. Consider this statement with both Arabic and English headers, scattered data, and regular blank rows:
تاريخ التسجيل | مدين | دائن | الرصيد
Posted Date   | Debit | Credit | Balance
01/10/2023    | 300   |        | 1,34,13,442.89
[blank row]
01/10/2023    | 300   |        | 1,34,13,442.89
To make sense of this, we need to ignore the blank rows and surrounding noise and locate the largest contiguous block of populated cells - the region that contains the transaction table:
def detect_largest_island(df: pd.DataFrame, start_row: int = 1) -> Tuple[str, pd.DataFrame]:
    """
    Find the largest rectangular block of non-empty cells.

    Key Steps:
    1. Find all empty rows (our "water")
    2. Between each pair of empty rows:
       - Find non-empty columns
       - Calculate area (rows × columns)
       - Track the largest area found
    3. Return the largest block's coordinates and data

    start_row is a 1-based, Excel-style row number where scanning begins.
    """
    start_row = start_row - 1  # convert to a 0-based offset
    df_slice = df.iloc[start_row:]

    # Find rows that are completely empty
    empty_rows = df_slice.isna().all(axis=1)
    empty_row_indices = empty_rows[empty_rows].index.tolist()

    # Handle case with no empty rows
    if not empty_row_indices:
        empty_row_indices = [len(df_slice)]

    largest_island_size = 0
    largest_island_region = ""
    largest_island_df = None

    # Scan between empty rows for islands
    island_start = 0
    for end_row in empty_row_indices:
        if end_row > island_start:
            # Get the current island slice
            island_slice = df_slice.iloc[island_start:end_row]
            # Find columns containing any data
            non_empty_cols = island_slice.notna().any()
            if non_empty_cols.any():
                # Get the start and end columns of this island
                # (column labels are positional integers when the sheet is read with header=None)
                non_empty_col_indices = non_empty_cols[non_empty_cols].index
                start_col = non_empty_col_indices[0]
                end_col = non_empty_col_indices[-1]
                # Calculate area in cells
                island_size = (end_row - island_start) * (end_col - start_col + 1)
                # Update if this is the largest island found
                if island_size > largest_island_size:
                    largest_island_size = island_size
                    # +1 converts the 0-based offsets to Excel's 1-based row numbers
                    largest_island_region = (
                        f"{col_num_to_letter(start_col + 1)}{start_row + island_start + 1}:"
                        f"{col_num_to_letter(end_col + 1)}{start_row + end_row}"
                    )
                    largest_island_df = island_slice.iloc[:, start_col:end_col + 1]
        # The next island starts just after this empty row
        island_start = end_row + 1 - start_row
    return largest_island_region, largest_island_df
# Convert a column number to Excel-style letters using base-26.
def col_num_to_letter(col_num: int) -> str:
    """
    Example: 28 -> "AB"
        28 - 1 = 27
        27 % 26 = 1  -> "B"
        27 // 26 - 1 = 0
        0 % 26 = 0   -> "A"
        Result: "AB"
    """
    col_num -= 1
    letters = []
    while col_num >= 0:
        letters.append(chr(65 + (col_num % 26)))  # Convert to A-Z
        col_num = (col_num // 26) - 1             # Move to the next "digit"
    return ''.join(reversed(letters))             # Reverse to get correct order
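As a quick sanity check of the conversion (a small usage sketch of the helper above):
print([col_num_to_letter(n) for n in (1, 26, 27, 28)])
# ['A', 'Z', 'AA', 'AB']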
How It Works in Practice
The algorithm moves through the document systematically: it marks every fully empty row as "water", treats each run of rows between empty rows as a candidate island, measures each island's area (rows × non-empty columns), and keeps the largest one, as shown in the sketch below.
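Here is a minimal sketch of running the detector on the bilingual sample from earlier. The DataFrame layout and integer column labels are assumptions that mirror reading an Excel sheet with header=None:
import pandas as pd

raw = pd.DataFrame([
    ["تاريخ التسجيل", "مدين", "دائن", "الرصيد"],      # Arabic header row
    ["Posted Date", "Debit", "Credit", "Balance"],     # English header row
    ["01/10/2023", "300", None, "1,34,13,442.89"],     # transaction
    [None, None, None, None],                          # blank row ("water")
    ["01/10/2023", "300", None, "1,34,13,442.89"],     # another transaction
])

region, island = detect_largest_island(raw)
print(region)        # A1:D3 - the block above the first blank row
print(island.shape)  # (3, 4)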
Edge Cases and Optimizations
Our algorithm handles several edge cases elegantly: when a document has no empty rows, it's treated as a single potential island; columns must contain at least one non-empty cell to be included (preventing empty blocks from skewing our calculations); and all islands must have positive size to be considered.
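For instance, the first of these cases behaves like this (again a small sketch using the function above):
import pandas as pd

no_gaps = pd.DataFrame([
    ["Date", "Amount"],
    ["01/10/2023", "300"],
    ["02/10/2023", "150"],
])
region, island = detect_largest_island(no_gaps)
print(region)  # A1:B3 - with no blank rows, the whole sheet is one island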
This combination of techniques gives us robust table detection that works across different bank statement formats, languages, and layouts. The largest contiguous block of data is almost always our main transaction table, making this a reliable first step in our parsing pipeline.
After finding our data island, we use Claude to understand what each column means. Bank statements use different terms ("Credit" vs "Inflow"), split fields ("Amount" into credit and debit), and mix languages (Arabic/Polish/English). Here's how we standardize this:
def get_column_mappings(island_df: pd.DataFrame, column_info: Dict[str, Any]) -> Tuple[List[Dict[str, Any]], List[str], List[str]]:
    """Map variant column names to standardized fields using LLM analysis.

    Returns (mapped_columns, unmapped_attributes, errors).
    """
    try:
        client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

        # Convert data for LLM input
        column_mapping = json.dumps(column_info, indent=2)
        csv_buffer = io.StringIO()
        island_df.head(10).to_csv(csv_buffer, index=False, header=False)
        sample_data = csv_buffer.getvalue()

        # Core prompt for the LLM
        # (ACCEPTED_FORMAT contains fields like transaction_date, value_date, debit, credit, etc.)
        llm_message = f"""
        You are a financial analyst, mapping bank statement data to a standardized format.
        The accepted format is:
        <accepted_format>
        {ACCEPTED_FORMAT}
        </accepted_format>
        Analyze these inputs:
        1. Column Mapping: Column definitions from the bank statement
        2. Sample Data: First few rows to understand the data format
        Your task:
        1. Map each input column to the standard format
        2. Handle similar names (Credit/Inflow)
        3. Deal with split fields (dates in multiple columns)
        4. Manage multilingual content (Arabic/English)
        5. Note any unmapped fields or errors
        Return a JSON with:
        - mapped_columns: Array of {{"name", "type", "region", "mapped_attribute"}}
        - unmapped_attributes: Fields we couldn't map
        - errors: Any issues found
        Example mapping:
        Input column: "تاريخ التسجيل" (Registration Date)
        {{
            "name": "تاريخ التسجيل",
            "type": "date",
            "region": "A1:A10",
            "mapped_attribute": "transaction_date"
        }}
        """

        # Get the LLM's interpretation
        message = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=3000,
            temperature=0,
            system=llm_message,
            messages=[{
                "role": "user",
                "content": [{
                    "type": "text",
                    "text": f"""
                    Input:
                    <column_mapping>{column_mapping}</column_mapping>
                    <sample_data>{sample_data}</sample_data>
                    """
                }]
            }]
        )

        # Parse and validate the response
        response = json.loads(message.content[0].text)
        return extract_and_validate_llm_response(response)
    except Exception as e:
        print(f"Error: {str(e)}")
        return [], [], []
How It Works
For each statement, the LLM returns a structured response describing how every detected column maps to our schema, along with anything it could not map:
{
    "mapped_columns": [
        {
            "name": "Posted Date",
            "type": "date",
            "region": "A1:A10",
            "mapped_attribute": "transaction_date"
        }
    ],
    "unmapped_attributes": ["value_date"],
    "errors": ["Inconsistent date format in column A"]
}
The LLM approach gives us flexibility to handle new statement formats without hardcoding rules for each bank's specific format. For any unmapped required attributes, we make another LLM call to scan the non-island regions (headers, footers, metadata sections) of the document. When found, these attributes are added to our contract as additional attributes:
{
    "type": "add_attributes",
    "config": {
        "additional_attributes": [
            {
                "name": "account_number",
                "value": "234897489023",
                "region": "B2:B2"    # Found in header
            },
            {
                "name": "currency_code",
                "value": "USD",
                "region": "D1:D1"    # Found in metadata
            }
        ]
    }
}
This two-pass approach ensures we capture all required information, even when it's scattered across different parts of the document.
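The second pass itself isn't shown in detail in this post, but a minimal sketch could look like the following. The find_missing_attributes name, its prompt wording, and the way the non-island rows are passed in are all assumptions for illustration; the Anthropic call mirrors the one used above.
import json

def find_missing_attributes(client, non_island_df, missing_attributes):
    """Ask Claude to locate required attributes outside the transaction table."""
    sample = non_island_df.to_csv(index=False, header=False)
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        temperature=0,
        system=(
            "You extract bank statement metadata from headers, footers and other "
            "regions outside the main transaction table. Return a JSON array of "
            'objects with "name", "value" and "region" keys.'
        ),
        messages=[{
            "role": "user",
            "content": f"Attributes still needed: {missing_attributes}\n\n"
                       f"Non-table regions of the statement:\n{sample}",
        }],
    )
    return json.loads(message.content[0].text)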
After mapping columns, we need to transform the data into a consistent format. Bank statements show dates, numbers, and currencies in various ways - our engine handles this automatically.
The Transformation Pipeline
We use a configuration-driven approach to transform data:
class ColumnMapping(BaseModel):
    """Defines how to map a source column to our standard format"""
    name: str              # Original column name
    type: str              # Source data type
    region: str            # Excel-style location (e.g., "A1:A10")
    mapped_attribute: str  # Target field name
    attribute_type: str    # Target data type


def transform(df: pd.DataFrame, template_config: Dict[str, Any]) -> pd.DataFrame:
    """
    Transforms bank statement data based on configuration rules.
    Handles multiple actions like column extraction and attribute addition.
    """
    try:
        original_df = df.copy()
        result_dfs = []
        for action in template_config["actions"]:
            action_type = action.get("type")
            if action_type == "extract_columns":
                result_dfs.append(extract_columns(original_df, action["config"]))
            elif action_type == "add_attributes":
                result_dfs.append(add_attributes(original_df, action["config"]))
        return pd.concat(result_dfs, axis=1)
    except Exception as e:
        raise Exception(f"Transform error: {str(e)}")
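For reference, a full template_config combining both action types might look like this. The values are illustrative, and the per-action config blocks follow the formats shown in the examples below; extract_columns and add_attributes are assumed to be available alongside transform.
template_config = {
    "actions": [
        {
            "type": "extract_columns",
            "config": {"column_mappings": [
                {"name": "Posting Date", "type": "date", "region": "A7:A10",
                 "mapped_attribute": "updated_date", "attribute_type": "date"},
            ]},
        },
        {
            "type": "add_attributes",
            "config": {"additional_attributes": [
                {"name": "currency_code", "value": "USD"},
            ]},
        },
    ]
}
# raw_statement_df: the header-less sheet loaded earlier in the pipeline
standardized = transform(raw_statement_df, template_config)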
Smart Data Type Handling
We handle various data formats through specialized transformers:
import re
from datetime import timezone
from dateutil import parser


def transform_float(value):
    """
    Converts various number formats to standard floats:
    - Removes currency symbols and commas
    - Handles negative values (amounts are stored as absolute values)
    - Converts strings to numbers
    """
    if isinstance(value, str):
        cleaned = re.sub(r'[^\d.]', '', value)
        return abs(float(cleaned)) if cleaned else None
    return abs(value) if isinstance(value, (int, float)) else None


def transform_date(value):
    """
    Standardizes date formats to ISO 8601:
    - Handles various input formats (day-first, as in the sample statement)
    - Converts to UTC timezone
    - Outputs in ISO format
    """
    try:
        dt = parser.parse(value, fuzzy=True, dayfirst=True)
        dt = dt.replace(tzinfo=timezone.utc)
        return dt.strftime('%Y-%m-%dT%H:%M:%SZ')
    except (ValueError, TypeError):
        return None
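A quick check of these transformers against the sample values from the statement above (assuming the two functions are in scope; the negative amount is an illustrative value):
print(transform_date("01/10/2023"))       # 2023-10-01T00:00:00Z
print(transform_float("1,34,13,442.89"))  # 13413442.89
print(transform_float(-250.75))           # 250.75 - sign handled via abs()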
Example Transformation
Here's how it works with real data:
# Input Configuration
config = {
    "type": "extract_columns",
    "config": {
        "column_mappings": [
            {
                "name": "Posting Date",
                "type": "date",
                "region": "A7:A10",
                "mapped_attribute": "updated_date",
                "attribute_type": "date"
            },
            {
                "name": "Balance",
                "type": "number",
                "region": "I7:I10",
                "mapped_attribute": "balance",
                "attribute_type": "float"
            }
        ]
    }
}
# Input Data (Sample)
"01/10/2023" -> "2023-10-01T00:00:00Z"
"1,34,13,442.89" -> 13413442.89
The engine reads each mapped region from the sheet, applies the matching type transformer (dates to ISO 8601, numbers to floats), and renames each column to its mapped attribute.
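The real extract_columns implementation isn't reproduced in this post, but a minimal sketch of that step - assuming Excel-style regions over a header-less sheet, the transformers defined above, and a hypothetical letter_to_col_num helper - might look like this:
import re
from typing import Any, Dict

import pandas as pd

def letter_to_col_num(letters: str) -> int:
    """Inverse of col_num_to_letter: 'A' -> 1, 'AB' -> 28."""
    num = 0
    for ch in letters:
        num = num * 26 + (ord(ch.upper()) - 64)
    return num

# Reuse the type transformers defined above
TRANSFORMERS = {"date": transform_date, "float": transform_float}

def extract_columns(df: pd.DataFrame, config: Dict[str, Any]) -> pd.DataFrame:
    """Pull each mapped region out of the raw sheet and convert its values."""
    out = {}
    for mapping in config["column_mappings"]:
        start, end = mapping["region"].split(":")
        col_letter, row_start = re.match(r"([A-Z]+)(\d+)", start).groups()
        _, row_end = re.match(r"([A-Z]+)(\d+)", end).groups()
        col_idx = letter_to_col_num(col_letter) - 1
        # Excel rows are 1-based, iloc is 0-based
        values = df.iloc[int(row_start) - 1:int(row_end), col_idx]
        out[mapping["mapped_attribute"]] = values.map(TRANSFORMERS[mapping["attribute_type"]])
    return pd.DataFrame(out).reset_index(drop=True)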
Added Attributes
We can also inject constants and computed fields:
# Adding account metadata
{
    "type": "add_attributes",
    "config": {
        "additional_attributes": [
            {
                "name": "account_number",
                "value": "234897489023"
            },
            {
                "name": "currency_code",
                "value": "USD"
            }
        ]
    }
}
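The add_attributes step itself isn't shown in the post; a minimal sketch, assuming it simply broadcasts each constant value into a column alongside the extracted data, could be:
import pandas as pd
from typing import Any, Dict

def add_attributes(df: pd.DataFrame, config: Dict[str, Any]) -> pd.DataFrame:
    """Broadcast each additional attribute into a constant column."""
    result = pd.DataFrame(index=df.index)
    for attr in config["additional_attributes"]:
        result[attr["name"]] = attr["value"]
    return result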
This configuration-driven approach lets us handle new bank statement formats by just updating the configuration, without changing the code.
The implementation of this AI-powered system has delivered remarkable improvements in how quickly and reliably we can onboard new statement formats, and building it taught us several valuable lessons along the way.
The AI revolution isn't about replacing human effort—it's about amplifying our ability to solve problems at an unprecedented scale. By transforming a manual, error-prone process into an automated, adaptive system, we've made bank statement processing seamless and reliable, no matter the complexity.
And this is just the beginning. As we continue to refine and expand our agentic system, we're paving the way for a future where financial automation is not only intelligent but transformative. We're excited to share the journey ahead, full of breakthroughs and innovations that will redefine what's possible.