The digitization of banking transformed how we handle money - from waiting days for cheque deposits to instant transfers. But with this digital revolution came a new challenge: processing the enormous volume of financial data flowing through modern organizations. While digital banking solved the speed problem, it created a data standardization nightmare. Every bank has its own statement format, making it incredibly difficult for businesses to process statements at scale.
The Challenge: A Sea of Formats
Picture this: You're building a platform that needs to process bank statements from hundreds of different banks. Each bank has its own unique way of structuring data, including:
- Column naming conventions that can exist in multiple languages and vary widely between banks
- Date formats with no standardization (DD/MM/YYYY, MM/DD/YYYY, or YYYY/MM/DD)
- Transaction categorizations that might appear as separate credit and debit columns, a single column with positive/negative values, or a combination with type suffixes (CR/DR)
- Number representation formats that vary across banks - from comma-separated thousands to space-separated, different decimal markers (periods vs commas), and various currency symbol placements (€100.00 vs 100.00€)
- Additional metadata fields that vary in both content and location
The Before and After: Normalization in Action


Traditionally, handling these variations required:
- Creating manual templates for each bank format
- Constant maintenance as formats changed
- Developer intervention for special cases
- Custom implementation for fields like Account Number and Currency that exist as metadata
- Hours of human effort in mapping fields
- Rigid templates that break with even minor changes in column names or order

Enter AI: The Universal Translator
We tackled this challenge by building an AI-powered transformation system that can automatically analyze and standardize any bank statement format. Here's how it works:
The Architecture
Our system operates in three main phases:
1. Analysis & Detection
   - Intelligent region detection identifies the main data table
   - Header analysis determines the true column structure
   - Pattern recognition identifies key financial data points
2. Template Generation (Transformation)
   - An AI agent maps detected columns to standard fields using Excel-style references (e.g., A7:A10)
   - Regions are added to the transformed files
   - For unmapped columns, the AI searches outside the main table
   - Transformation rules are cached for reuse, reducing LLM calls
3. Validation & Processing
   - A validation agent checks the transformed data for accuracy
   - A preview with sample data is generated for user validation
   - The full dataset is normalized after confirmation
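The three phases can be wired together roughly like this. The function names and stub outputs below are illustrative, not the production API; they only sketch how each phase's output feeds the next:

```python
from typing import Any, Dict, List

# Hypothetical phase stubs -- illustrative, not the production code.
def analyze(raw_rows: List[List[Any]]) -> Dict[str, Any]:
    """Phase 1: locate the main table and describe its columns."""
    return {"region": "A3:C5", "columns": ["Date", "Debit", "Credit"]}

def generate_template(analysis: Dict[str, Any]) -> Dict[str, Any]:
    """Phase 2: map detected columns to standard fields (an LLM call in production)."""
    mapping = {"Date": "transaction_date", "Debit": "debit", "Credit": "credit"}
    return {
        "region": analysis["region"],
        "column_mappings": [
            {"name": c, "mapped_attribute": mapping[c]} for c in analysis["columns"]
        ],
    }

def validate(template: Dict[str, Any]) -> bool:
    """Phase 3: check the template before full processing."""
    return all(m["mapped_attribute"] for m in template["column_mappings"])

def run_pipeline(raw_rows: List[List[Any]]) -> Dict[str, Any]:
    analysis = analyze(raw_rows)
    template = generate_template(analysis)
    assert validate(template), "template failed validation"
    return template
```

Caching the returned template (phase 2's output) is what lets repeat statements from the same bank skip the LLM call entirely.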
The Technical Innovation: Parsing Bank Statements with AI
Bank statements come in countless formats - different layouts, languages, and structures. Here's how we built an AI pipeline to handle this complexity:
The Three Pillars
1. Smart Region Detection
Instead of strict rules or templates, we use an island detection algorithm to find the main transaction table. It first detects all the tables in the file and returns the largest one as a pandas DataFrame:
```python
def detect_largest_island(df: pd.DataFrame) -> Tuple[str, pd.DataFrame]:
    """Find the largest coherent data region while ignoring noise."""
    # First, we look for empty rows to identify potential data boundaries
    empty_rows = df.isna().all(axis=1)
    # Then find the largest contiguous block of populated cells
    largest_island = find_max_data_region(df, empty_rows)
    return largest_island.region, largest_island.data
```
2. Intelligent Field Mapping
We use the table from the island detection algorithm to prompt an LLM, letting us infer the intent of each column and later rename the columns to match our internal format. This is the step that empowers us to handle any bank statement globally without needing custom code:
```python
def get_column_mappings(df: pd.DataFrame, column_info: Dict) -> Dict[str, Any]:
    """Map variant column names to standard fields."""
    # Send column names and sample data to Claude
    response = claude.analyze(
        column_info=column_info,
        sample_data=df.head(10)
    )
    # Get standardized mappings (e.g., "Credit" = "Inflow" = "دائن")
    return response.mapped_columns
```
3. Dynamic Transformation Engine
Finally, we convert everything to standard formats:
```python
def transform(df: pd.DataFrame, template_config: Dict) -> pd.DataFrame:
    """Transform data based on detected patterns."""
    # Extract columns using the mapping
    result = extract_columns(df, template_config)
    # Add any required metadata
    result = add_attributes(result, template_config)
    return result
```
Smart Region Detection: Finding Data Islands in Bank Statements
When processing bank statements, one of the trickiest challenges is identifying where the actual transaction data lives. Bank statements are essentially a maze of headers, metadata, transaction rows, and empty spaces. Here's how we solve this using our "Island Detection" algorithm.

The Core Challenge: Locating Data Islands
Let's start with a real example. Consider this statement with both Arabic and English headers, scattered data, and regular blank rows:
```
تاريخ التسجيل | مدين  | دائن   | الرصيد
Posted Date   | Debit | Credit | Balance
01/10/2023    | 300   |        | 1,34,13,442.89
[blank row]
01/10/2023    | 300   |        | 1,34,13,442.89
```
To make sense of this, we need to:
- Ignore the metadata/headers until we find the real table
- Handle mixed languages
- Deal with regular blank rows that actually indicate transaction grouping
```python
def detect_largest_island(df: pd.DataFrame, start_row: int = 1) -> Tuple[str, pd.DataFrame]:
    """
    Find the largest rectangular block of non-empty cells.

    Key steps:
    1. Find all empty rows (our "water")
    2. Between each pair of empty rows:
       - Find the non-empty columns
       - Calculate the area (rows × columns)
       - Track the largest area found
    3. Return the largest block's coordinates and data
    """
    start_row = max(start_row - 1, 0)  # convert the 1-indexed sheet row to 0-indexed
    df_slice = df.iloc[start_row:]
    # Find rows that are completely empty; positions are relative to df_slice
    empty_rows = df_slice.isna().all(axis=1)
    empty_row_indices = [i for i, is_empty in enumerate(empty_rows) if is_empty]
    # Treat the end of the sheet as a final boundary so the last block is scanned
    # (this also covers the case with no empty rows at all)
    empty_row_indices.append(len(df_slice))
    largest_island_size = 0
    largest_island_region = ""
    largest_island_df = None
    # Scan between empty rows for islands
    island_start = 0
    for end_row in empty_row_indices:
        if end_row > island_start:
            # Get the current island slice
            island_slice = df_slice.iloc[island_start:end_row]
            # Find columns containing any data
            non_empty_cols = island_slice.notna().any()
            if non_empty_cols.any():
                # Get the start and end columns of this island
                non_empty_col_indices = non_empty_cols[non_empty_cols].index
                start_col = non_empty_col_indices[0]
                end_col = non_empty_col_indices[-1]
                # Calculate the area in cells
                island_size = (end_row - island_start) * (end_col - start_col + 1)
                # Update if this is the largest island found
                if island_size > largest_island_size:
                    largest_island_size = island_size
                    largest_island_region = (
                        f"{col_num_to_letter(start_col + 1)}{start_row + island_start + 1}:"
                        f"{col_num_to_letter(end_col + 1)}{start_row + end_row}"
                    )
                    largest_island_df = island_slice.iloc[:, start_col:end_col + 1]
        island_start = end_row + 1
    return largest_island_region, largest_island_df
```
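Here is a compressed, self-contained version of the same idea that you can run directly: split the sheet on fully-empty rows, then keep the block with the largest area.

```python
import pandas as pd

# A toy statement: metadata on top, a blank "water" row, then the transaction table.
df = pd.DataFrame([
    ["Account",    "234897489023", None,     None],
    [None,         None,           None,     None],   # blank row = island boundary
    ["Date",       "Debit",        "Credit", "Balance"],
    ["01/10/2023", 300,            None,     13413442.89],
    ["02/10/2023", None,           150,      13413592.89],
])

# Split the frame on fully-empty rows and keep the block with the largest area
empty = df.isna().all(axis=1)
blocks, start = [], 0
for i in list(empty[empty].index) + [len(df)]:
    if i > start:
        blocks.append(df.iloc[start:i])
    start = i + 1
largest = max(blocks, key=lambda b: b.shape[0] * b.notna().any().sum())
print(largest.shape)  # the 3x4 transaction table, not the 1-row metadata block
```

The one-row metadata block scores 1 × 2 = 2 cells, the transaction block 3 × 4 = 12, so the table wins.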
```python
def col_num_to_letter(col_num: int) -> str:
    """
    Convert a column number to Excel-style letters using bijective base-26.

    Example, 28 -> "AB":
        28 - 1 = 27
        27 % 26 = 1   -> "B"
        27 // 26 = 1
        1 - 1 = 0
        0 % 26 = 0    -> "A"
        Result: "AB"
    """
    col_num -= 1
    letters = []
    while col_num >= 0:
        letters.append(chr(65 + (col_num % 26)))  # Convert to A-Z
        col_num = (col_num // 26) - 1             # Move to the next digit
    return ''.join(reversed(letters))             # Reverse to get the correct order
```
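A few spot checks of the conversion (the function is repeated here so the snippet runs standalone):

```python
def col_num_to_letter(col_num: int) -> str:
    # Same bijective base-26 conversion as above, repeated so this runs on its own.
    col_num -= 1
    letters = []
    while col_num >= 0:
        letters.append(chr(65 + (col_num % 26)))
        col_num = (col_num // 26) - 1
    return ''.join(reversed(letters))

assert col_num_to_letter(1) == "A"
assert col_num_to_letter(26) == "Z"
assert col_num_to_letter(27) == "AA"
assert col_num_to_letter(28) == "AB"
assert col_num_to_letter(703) == "AAA"  # 26*26 + 26 + 1
```

The `- 1` before each division is what makes this bijective: unlike plain base-26, there is no zero digit, so "Z" rolls over to "AA" rather than "A0".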
Island detection
How It Works in Practice
The algorithm moves through the document systematically:
1. Column Reference System
   - First, we convert numeric column indices to Excel-style references (1 → A, 27 → AA)
   - This helps us track and return precise regions in a format financial professionals understand
2. Empty Row Detection
   - Scans the document vertically
   - Identifies rows where all cells are empty
   - These empty rows serve as boundaries between potential data islands
3. Island Measurement
   - For each region between empty rows:
     - Finds the leftmost and rightmost columns containing data
     - Calculates the area as: number of rows × number of columns
     - Keeps track of the largest rectangular region found
4. Final Output
   - Returns both Excel-style coordinates (e.g., "A2:F15") and the actual data
   - The largest region typically contains our main transaction table
Edge Cases and Optimizations
Our algorithm handles several edge cases elegantly: when a document has no empty rows, it's treated as a single potential island; columns must contain at least one non-empty cell to be included (preventing empty blocks from skewing our calculations); and all islands must have positive size to be considered.
This combination of techniques gives us robust table detection that works across different bank statement formats, languages, and layouts. The largest contiguous block of data is almost always our main transaction table, making this a reliable first step in our parsing pipeline.
Intelligent Field Mapping with LLMs
After finding our data island, we use Claude to understand what each column means. Bank statements use different terms ("Credit" vs "Inflow"), split fields ("Amount" into credit and debit), and mix languages (Arabic/Polish/English). Here's how we standardize this:
```python
def get_column_mappings(
    island_df: pd.DataFrame, column_info: Dict[str, Any]
) -> Tuple[List[Dict[str, Any]], List[str], List[str]]:
    """Map variant column names to standardized fields using LLM analysis."""
    try:
        client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
        # Convert data for LLM input
        column_mapping = json.dumps(column_info, indent=2)
        csv_buffer = io.StringIO()
        island_df.head(10).to_csv(csv_buffer, index=False, header=None)
        sample_data = csv_buffer.getvalue()
        # Core prompt for the LLM (literal braces are doubled inside the f-string)
        llm_message = f"""
        You are a financial analyst, mapping bank statement data to a standardized format.
        The accepted format is:
        <accepted_format>
        {ACCEPTED_FORMAT}
        </accepted_format>
        Analyze these inputs:
        1. Column Mapping: Column definitions from the bank statement
        2. Sample Data: First few rows to understand the data format
        Your task:
        1. Map each input column to the standard format
        2. Handle similar names (Credit/Inflow)
        3. Deal with split fields (dates in multiple columns)
        4. Manage multilingual content (Arabic/English)
        5. Note any unmapped fields or errors
        Return a JSON with:
        - mapped_columns: Array of {{"name", "type", "region", "mapped_attribute"}}
        - unmapped_attributes: Fields we couldn't map
        - errors: Any issues found
        Example mapping:
        Input column: "تاريخ التسجيل" (Registration Date)
        {{
            "name": "تاريخ التسجيل",
            "type": "date",
            "region": "A1:A10",
            "mapped_attribute": "transaction_date"
        }}
        """
        # Get the LLM's interpretation
        message = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=3000,
            temperature=0,
            system=llm_message,
            messages=[{
                "role": "user",
                "content": [{
                    "type": "text",
                    "text": f"""
                    Input:
                    <column_mapping>{column_mapping}</column_mapping>
                    <sample_data>{sample_data}</sample_data>
                    """
                }]
            }]
        )
        # Parse and validate the response
        response = json.loads(message.content[0].text)
        return extract_and_validate_llm_response(response)
    except Exception as e:
        print(f"Error: {str(e)}")
        return [], [], []
```
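The helper `extract_and_validate_llm_response` is referenced above without a listing. One plausible sketch, assuming the contract's accepted fields are known (`ACCEPTED_ATTRIBUTES` here is an illustrative subset, not the production schema): keep only mappings that target known attributes, and report everything else as an error.

```python
from typing import Any, Dict, List, Tuple

# Illustrative subset of the standard fields the contract accepts.
ACCEPTED_ATTRIBUTES = {"transaction_date", "value_date", "debit", "credit", "balance"}

def extract_and_validate_llm_response(
    response: Dict[str, Any]
) -> Tuple[List[Dict[str, Any]], List[str], List[str]]:
    """Keep mappings that target known attributes; report the rest as errors."""
    mapped, errors = [], list(response.get("errors", []))
    for col in response.get("mapped_columns", []):
        if col.get("mapped_attribute") in ACCEPTED_ATTRIBUTES:
            mapped.append(col)
        else:
            errors.append(f"Unknown attribute for column {col.get('name')!r}")
    # Only surface unmapped attributes we actually require
    unmapped = [a for a in response.get("unmapped_attributes", [])
                if a in ACCEPTED_ATTRIBUTES]
    return mapped, unmapped, errors
```

This validation layer is what makes the LLM step safe to automate: a hallucinated field name is downgraded to an error rather than silently entering the output contract.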
How It Works
1. Data Preparation
   - Convert column definitions to JSON
   - Take sample rows for context
   - Format everything for the LLM
2. Intelligent Mapping
   - Handles semantic variations: "Credit" = "Inflow" = "دائن"
   - Manages mixed languages
   - Validates that data types match the expected format
3. Structured Output
```json
{
  "mapped_columns": [
    {
      "name": "Posted Date",
      "type": "date",
      "region": "A1:A10",
      "mapped_attribute": "transaction_date"
    }
  ],
  "unmapped_attributes": ["value_date"],
  "errors": ["Inconsistent date format in column A"]
}
```
The LLM approach gives us flexibility to handle new statement formats without hardcoding rules for each bank's specific format. For any unmapped required attributes, we make another LLM call to scan the non-island regions (headers, footers, metadata sections) of the document. When found, these attributes are added to our contract as additional attributes:
```python
{
    "type": "add_attributes",
    "config": {
        "additional_attributes": [
            {
                "name": "account_number",
                "value": "234897489023",
                "region": "B2:B2"  # Found in the header
            },
            {
                "name": "currency_code",
                "value": "USD",
                "region": "D1:D1"  # Found in the metadata
            }
        ]
    }
}
```
This two-pass approach ensures we capture all required information, even when it's scattered across different parts of the document.
Dynamic Transformation Engine: Converting Bank Data to Standard Formats
After mapping columns, we need to transform the data into a consistent format. Bank statements show dates, numbers, and currencies in various ways - our engine handles this automatically.
The Transformation Pipeline
We use a configuration-driven approach to transform data:
```python
class ColumnMapping(BaseModel):
    """Defines how to map a source column to our standard format."""
    name: str              # Original column name
    type: str              # Source data type
    region: str            # Excel-style location (e.g., "A1:A10")
    mapped_attribute: str  # Target field name
    attribute_type: str    # Target data type

def transform(df: pd.DataFrame, template_config: Dict[str, Any]) -> pd.DataFrame:
    """
    Transforms bank statement data based on configuration rules.
    Handles multiple actions like column extraction and attribute addition.
    """
    try:
        original_df = df.copy()
        result_dfs = []
        for action in template_config["actions"]:
            action_type = action.get("type")
            if action_type == "extract_columns":
                result_dfs.append(extract_columns(original_df, action["config"]))
            elif action_type == "add_attributes":
                result_dfs.append(add_attributes(original_df, action["config"]))
        return pd.concat(result_dfs, axis=1)
    except Exception as e:
        raise Exception(f"Transform error: {str(e)}") from e
```
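The `extract_columns` helper is referenced above but not listed. A minimal sketch of what it might look like, assuming Excel-style regions are 1-indexed and each mapped region covers a single column (the helper names here are illustrative):

```python
import re
from typing import Any, Dict, Tuple

import pandas as pd

def parse_region(region: str) -> Tuple[int, int, int]:
    """Turn an Excel-style region like "A7:A10" into 0-based (row_start, row_end, col)."""
    def cell(ref: str) -> Tuple[int, int]:
        m = re.fullmatch(r"([A-Z]+)(\d+)", ref)
        col = 0
        for ch in m.group(1):                 # bijective base-26: A=1 ... Z=26
            col = col * 26 + (ord(ch) - 64)
        return int(m.group(2)) - 1, col - 1   # 1-indexed sheet -> 0-indexed frame
    (r1, c1), (r2, _) = (cell(p) for p in region.split(":"))
    return r1, r2, c1

def extract_columns(df: pd.DataFrame, config: Dict[str, Any]) -> pd.DataFrame:
    """Pull each mapped region out of the sheet and rename it to the standard field."""
    out = {}
    for m in config["column_mappings"]:
        r1, r2, col = parse_region(m["region"])
        out[m["mapped_attribute"]] = df.iloc[r1:r2 + 1, col].reset_index(drop=True)
    return pd.DataFrame(out)
```

In production this is also where the per-type transformers (dates, floats) would be applied to each extracted column, keyed on `attribute_type`.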
Smart Data Type Handling
We handle various data formats through specialized transformers:
```python
def transform_float(value):
    """
    Converts various number formats to standard floats:
    - Strips currency symbols and digit-group separators
    - Returns the absolute value (the sign is carried by the debit/credit split)
    - Converts strings to numbers
    """
    if isinstance(value, str):
        cleaned = re.sub(r'[^\d.]', '', value)
        return abs(float(cleaned))
    return abs(value) if isinstance(value, (int, float)) else None

def transform_date(value):
    """
    Standardizes dates to ISO 8601:
    - Parses a variety of input formats
    - Attaches the UTC timezone
    - Outputs an ISO-formatted string
    """
    try:
        dt = parser.parse(value, fuzzy=True)
        dt = dt.replace(tzinfo=timezone.utc)
        return dt.strftime('%Y-%m-%dT%H:%M:%SZ')
    except (ValueError, TypeError):
        return None
```
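The two transformers can be spot-checked in isolation (repeated here so the snippet runs standalone, with an added `dayfirst` flag not in the listing above). Two caveats worth hedging: with dateutil's defaults an ambiguous date like 01/10/2023 parses month-first, so DD/MM statements need `dayfirst=True`; and comma-decimal amounts like "100,50" would need a different cleaner than this regex.

```python
import re
from datetime import timezone

from dateutil import parser  # python-dateutil, a pandas dependency

def transform_float(value):
    if isinstance(value, str):
        return abs(float(re.sub(r'[^\d.]', '', value)))
    return abs(value) if isinstance(value, (int, float)) else None

def transform_date(value, dayfirst=False):
    # dayfirst is an illustrative extension for DD/MM statements
    try:
        dt = parser.parse(value, fuzzy=True, dayfirst=dayfirst)
        return dt.replace(tzinfo=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    except (ValueError, TypeError):
        return None

assert transform_float("1,34,13,442.89") == 13413442.89  # Indian digit grouping
assert transform_float("€100.00") == 100.0               # currency symbol stripped
assert transform_float("-250.50") == 250.5               # sign discarded: abs value
assert transform_date("2023-10-01") == "2023-10-01T00:00:00Z"
assert transform_date("01/10/2023", dayfirst=True) == "2023-10-01T00:00:00Z"
assert transform_date("not a date") is None
```

In practice the day/month convention would come from the template, so it is detected once per bank and then cached with the rest of the transformation rules.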
Example Transformation
Here's how it works with real data:
```python
# Input configuration
config = {
    "type": "extract_columns",
    "config": {
        "column_mappings": [
            {
                "name": "Posting Date",
                "type": "date",
                "region": "A7:A10",
                "mapped_attribute": "updated_date",
                "attribute_type": "date"
            },
            {
                "name": "Balance",
                "type": "number",
                "region": "I7:I10",
                "mapped_attribute": "balance",
                "attribute_type": "float"
            }
        ]
    }
}
```

```
# Sample input -> output
"01/10/2023"     -> "2023-10-01T00:00:00Z"
"1,34,13,442.89" -> 13413442.89
```
The engine:
- Reads the column region ("A7:A10")
- Identifies the data type (date/float)
- Applies the appropriate transformer
- Returns standardized data
Added Attributes
We can also inject constants and computed fields:
```python
# Adding account metadata
{
    "type": "add_attributes",
    "config": {
        "additional_attributes": [
            {
                "name": "account_number",
                "value": "234897489023"
            },
            {
                "name": "currency_code",
                "value": "USD"
            }
        ]
    }
}
```
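Like `extract_columns`, the `add_attributes` helper appears in `transform()` but isn't listed. A plausible sketch, assuming each constant attribute is simply broadcast across every row of the statement:

```python
from typing import Any, Dict

import pandas as pd

def add_attributes(df: pd.DataFrame, config: Dict[str, Any]) -> pd.DataFrame:
    """Broadcast each constant attribute into a column matching the frame length."""
    n = len(df)
    data = {attr["name"]: [attr["value"]] * n
            for attr in config["additional_attributes"]}
    return pd.DataFrame(data)
```

Returning a frame of the same length is what lets `transform()` concatenate the result alongside the extracted transaction columns with `pd.concat(..., axis=1)`.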
This configuration-driven approach lets us handle new bank statement formats by just updating the configuration, without changing the code.
Results and Impact
The implementation of this AI-powered system has delivered remarkable improvements:
- Processing speed reduced from hours to seconds
- Human error eliminated in field mapping
- Automatic handling of new bank formats
- Self-adjusting system for minor format changes
Key Takeaways
The success of this project taught us several valuable lessons:
- Start Simple: Begin by manually performing transformations on a few examples
- Validate Early: Generate previews and validate before full processing
- Stay Flexible: Keep the system modular for future AI enhancements
The AI revolution isn't about replacing human effort—it's about amplifying our ability to solve problems at an unprecedented scale. By transforming a manual, error-prone process into an automated, adaptive system, we've made bank statement processing seamless and reliable, no matter the complexity.
And this is just the beginning. As we continue to refine and expand our agentic system, we're paving the way for a future where financial automation is not only intelligent but transformative. We're excited to share the journey ahead, full of breakthroughs and innovations that will redefine what's possible.