ReconPy is a powerful yet user-friendly data reconciliation engine designed to compare records across two dataframes. Its key features include:
- Flexibility: Adaptable to various data structures and comparison needs.
- Score-based Matching: Utilizes a scoring system to determine the optimal matches between records.
- One-to-One Matching: Each record is matched only once, ensuring no duplication in the reconciliation process.
- Deterministic Resolution: In cases of multiple optimal matches, it consistently selects the first occurrence.
While ReconPy is versatile enough for use across various industries, it's particularly well-suited for financial sector applications, including: asset management firms, banks, brokers-dealers and hedge funds.
- Front Office/Back Office (FOBO) Reconciliation
- Trade Reconciliation
- Position Reconciliation
- Cash Activity Reconciliation
- Cash Balance Reconciliation
ReconPy streamlines these critical financial processes, enhancing accuracy and efficiency in data comparison and validation tasks.
output = reconPy.recon(left_df, right_df, left_exact_match_cols, right_exact_match_cols, left_compare_cols, right_compare_cols, show_left_cols=[], show_right_cols=[], weight=None, tolerance_percentage=None, suffix=['_left', '_right'], show_matching_status=True, show_matching_score=False)
left_df: DataFrame
The primary DataFrame for comparison.
right_df: DataFrame
The secondary DataFrame for comparison.
left_exact_match_cols: list of string
Columns from left_df used for exact matching. These columns join left_df and right_df, requiring identical values.
right_exact_match_cols: list of string
Columns from right_df used for exact matching. These columns join left_df and right_df, requiring identical values.
Note
The order of left_exact_match_cols and right_exact_match_cols is crucial. They link corresponding columns when header names differ between left_df and right_df:
Correct example:
left_exact_match_cols = ['Trade date', 'Settlement date', 'Stock code']
right_exact_match_cols = ['TD', 'SD', 'Bloomberg code']
Incorrect example:
'Trade date' incorrectly linked to 'Bloomberg code'
left_exact_match_cols = ['Trade date', 'Settlement date', 'Stock code']
right_exact_match_cols = ['Bloomberg code', 'SD', 'TD']
left_compare_cols: list of string
Columns from left_df used to calculate matching scores and determine if record pairs are within matching tolerance. Supports float, integer, date, datetime, and string data types. String comparisons will have blank diff columns.
right_compare_cols: list of string
Columns from right_df used to calculate matching scores and determine if record pairs are within matching tolerance. Supports float, integer, date, datetime, and string data types. String comparisons will have blank diff columns.
Note
The order of left_compare_cols and right_compare_cols is crucial. They link corresponding columns when header names differ between left_df and right_df:
Example:
left_compare_cols = ['Quantity', 'Gross amount', 'Settlement amount']
right_compare_cols = ['Qty', 'Gross', 'Net']
show_left_cols: list of string, default = []
Left_df columns to display in the output. Leave empty to omit all left_df headers.
show_right_cols: list of string, default = []
Right_df columns to display in the output. Leave empty to omit all right_df headers.
weight: list of string, default = [1/len(left_compare_cols)] * len(left_compare_cols)
Weights applied when calculating matching scores. Must be positive numbers summing to 1. Length should match left_compare_cols. Default applies equal weights.
Example: For left_compare_cols = ['Quantity', 'Gross amount', 'Settlement amount'], weight = [0.7, 0.2, 0.1] means Quantity's score is multiplied by 0.7, Gross amount by 0.2, etc.
tolerance_percentage: list of string, default = [0] * len(left_compare_cols)
Defines acceptable percentage difference for each compared column. Values must be between 0-100. Length should match left_compare_cols. Default is zero tolerance.
For date/datetime objects, 1% tolerance equals 3.6525 days. See Matching score calculation > Date/ datetime comparison for details.
Example: tolerance_percentage = [7, 8, 9.15] means pairs won't match if the percentage difference exceeds 7%, 8%, and 9.15% for the first, second, and third compared columns respectively.
Note
The lengths of left_compare_cols, right_compare_cols, weight, and tolerance_percentage must be identical.
suffix: list of string, default = ['_left', '_right']
Suffixes appended to left_compare_cols and right_compare_cols. Must be a list of two strings.
show_matching_status: bool, default = True
If True, displays matching status in matching_status in the output dataframe. Possible statuses:
*Matched: Perfect match (zero tolerance)
- Matched with tolerance: Matched within specified tolerance_percentage
- Unmatched: No possible match found
show_matching_score: bool, default = False
If True, displays matching score of the matched record in matching_score in the output dataframe.
>>> py -m pip install reconPy>>> import reconPyThis example demonstrates the reconciliation of broker's trade confirmation against OMS trade reports, a routine task in financial institutions. The goal is to ensure that trades booked in the OMS align with broker trade confirmations by comparing multiple fields such as trade date, settlement date, security details, price, quantity, commission, market charges, and settlement amount.
In this scenario, while both dataframes refer to the same trade records, they differ in headers, values, formatting, and order.
df1 - JPM trade confirmation:
| JPM ID | Trade date | Settlement date | Client direction | BBG Code | ISIN | CUSIP | Gross Price | Executed quantity | Currency | Commission | Market charges | Gross amount | Settlement amount | Flow Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JPM_1 | 10-Oct-24 | 12-Oct-24 | Buy | MSFT US | US5949181045 | 594918104 | 440.37 | 8,500.00 | USD | 425.00 | 12.00 | 3,743,145.00 | 3,743,582.00 | Algo |
| JPM_2 | 10-Oct-24 | 12-Oct-24 | Sell | TSLA US | US88160R1014 | 88160R101 | 249.23 | 5,980.00 | USD | 299.00 | 10.00 | 1,490,395.40 | 1,490,086.40 | DMA |
| JPM_3 | 10-Oct-24 | 12-Oct-24 | Short Sell | TSLA US | US88160R1014 | 88160R101 | 249.23 | 4,650.00 | USD | 930.00 | 8.00 | 1,158,919.50 | 1,157,981.50 | Voice |
| JPM_4 | 10-Oct-24 | 12-Oct-24 | Buy | MSFT US | US5949181045 | 594918104 | 440.00 | 8,500.00 | USD | 425.00 | 12.00 | 3,740,000.00 | 3,740,437.00 | Algo |
| JPM_5 | 10-Oct-24 | 12-Oct-24 | Buy | MSFT US | US5949181045 | 594918104 | 441.00 | 8,500.00 | USD | 425.00 | 12.00 | 3,748,500.00 | 3,748,937.00 | Algo |
df2 - Internal OMS trade report:
| TD | SD | B/S | Stock code | Avg price | Net Price | Qty | Comm | Total charges | Gross | Net | Ccy | Internal reference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024-10-10 | 2024-10-12 | Short Sell | TSLA US | 249.230000 | 249.023978 | 4,650 | 950.00 | 8.00 | 1,158,919.50 | 1,157,961.50 | USD | INTERNAL_1 |
| 2024-10-10 | 2024-10-12 | Sell | TSLA US | 249.230000 | 249.169799 | 5,980 | 350.00 | 10.00 | 1,490,395.40 | 1,490,035.40 | USD | INTERNAL_2 |
| 2024-10-10 | 2024-10-12 | Buy | MSFT US | 440.000000 | 440.051412 | 8,500 | 425.00 | 12.00 | 3,740,000.00 | 3,740,437.00 | USD | INTERNAL_3 |
| 2024-10-10 | 2024-10-12 | Buy | MSFT US | 440.370000 | 440.421412 | 8,500 | 425.00 | 12.00 | 3,743,145.00 | 3,743,582.00 | USD | INTERNAL_4 |
| 2024-10-10 | 2024-10-12 | Buy | MSFT US | 441.000000 | 441.051412 | 8,500 | 425.00 | 12.00 | 3,748,500.00 | 3,748,937.00 | USD | INTERNAL_5 |
>>> df1 = pd.read_excel('JPM_Trade_Confo_20241010.xlsx')
>>> df2 = pd.read_excel('OMS_20241010.xlsx')
>>> output = recon(left_df=df1,
right_df=df2,
show_left_cols=['Trade date','Client direction','BBG Code','JPM ID'],
show_right_cols=['Internal reference'],
left_exact_match_cols=['Trade date','Client direction','BBG Code','Currency'],
right_exact_match_cols=['TD','B/S','Stock code','Ccy'],
left_compare_cols=['Gross Price','Executed quantity','Commission','Market charges','Settlement amount'],
right_compare_cols=['Avg price','Qty','Comm','Total charges','Net'],
weight=[0.1,0.5,0.1,0.1,0.2],
tolerance_percentage=[1,1,20,20,1],
suffix=['_JPM','_OMS'])
>>> print(output)Output:
| Trade date | Client direction | BBG Code | JPM ID | Internal reference | Gross Price_JPM | Avg price_OMS | Gross Price_diff | Executed quantity_JPM | Qty_OMS | Executed quantity_diff | Commission_JPM | Comm_OMS | Commission_diff | Market charges_JPM | Total charges_OMS | Market charges_diff | Settlement amount_JPM | Net_OMS | Settlement amount_diff | matching_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10/10/2024 | Buy | MSFT US | JPM_1 | INTERNAL_4 | 440.370 | 440.370 | 0 | 8,500.00 | 8,500.00 | 0 | 425.00 | 425.00 | 0.00 | 12.00 | 12.00 | 0.00 | 3,743,582.00 | 3,743,582.00 | 0 | Matched |
| 10/10/2024 | Buy | MSFT US | JPM_4 | INTERNAL_3 | 440.000 | 440.000 | 0 | 8,500.00 | 8,500.00 | 0 | 425.00 | 425.00 | 0.00 | 12.00 | 12.00 | 0.00 | 3,740,437.00 | 3,740,437.00 | 0 | Matched |
| 10/10/2024 | Buy | MSFT US | JPM_5 | INTERNAL_5 | 441.000 | 441.000 | 0 | 8,500.00 | 8,500.00 | 0 | 425.00 | 425.00 | 0.00 | 12.00 | 12.00 | 0.00 | 3,748,937.00 | 3,748,937.00 | 0 | Matched |
| 10/10/2024 | Short Sell | TSLA US | JPM_3 | INTERNAL_1 | 249.230 | 249.230 | 0 | 4,650.00 | 4,650.00 | 0 | 930.00 | 950.00 | -20.00 | 8.00 | 8.00 | 0.00 | 1,157,982.00 | 1,157,962.00 | 20 | Matched within tolerance |
| 10/10/2024 | Sell | TSLA US | JPM_2 | INTERNAL_2 | 249.230 | 249.230 | 0 | 5,980.00 | 5,980.00 | 0 | 299.00 | 350.00 | -51.00 | 10.00 | 10.00 | 0.00 | 1,490,086.00 | 1,490,035.00 | 51 | Matched within tolerance |
The program paired the records and detected small differences in commission, as shown in the Commission_diff column. Also in the matching_status column, those records are marked as "Matched within tolerance".
This example involves reconciling monthly commission payments from two MIS reports. It's common to have discrepancies between reports from different systems, and reconciliation is crucial to understand the source of these discrepancies.
As the reports are exported from different systems, they have different columns and slightly different broker names:
JP Morgan Sec plc
J.P. Morgan Securities plc.
df1 - Monthly_Commission_Paid_2024Q1.xlsx:
| Broker entity name | Month | USD brokerage expenses |
|---|---|---|
| JP Morgan Sec | 202401 | 54,772.00 |
| JP Morgan Sec | 202402 | 87,377.00 |
| Merrill Lynch international | 202401 | 638,131.00 |
| Merrill Lynch international | 202402 | 659,135.00 |
| UBS Securities | 202401 | 876,369.00 |
| UBS Securities | 202402 | 574,978.00 |
df2 - Brokerage_Expenses_2024Q1.xlsx:
| Period | Broker long name | Total commission paid (USD) |
|---|---|---|
| 202401 | J.P. Morgan Securities plc | 54,681.00 |
| 202401 | UBS Securities plc | 876,341.00 |
| 202401 | Merrill Lynch international | 637,898.00 |
| 202402 | J.P. Morgan Securities plc | 87,153.00 |
| 202402 | UBS Securities plc | 574,873.00 |
| 202402 | Merrill Lynch international | 658,741.00 |
>>> df1 = pd.read_excel('Monthly_Commission_Paid_2024Q1.xlsx')
>>> df2 = pd.read_excel('Brokerage_Expenses_2024Q1.xlsx')
>>> output_2 = recon(left_df=df1,
right_df=df2,
show_left_cols=[],
show_right_cols=['Period'],
left_exact_match_cols=['Month'],
right_exact_match_cols=['Period'],
left_compare_cols=['Broker entity name','USD brokerage expenses'],
right_compare_cols=['Broker long name','Total commission paid (USD)'],
tolerance_percentage=[50,5])
>>> print(output_2)The commission paid from the same brokers (but with different names on two reports) are paired up, and at the same time, the commission amounts are slightly off, as shown in the USD brokerage expenses_diff column.
output_2:
String comparisons is expected to have blank diff columns.
| Period | Broker entity name_left | Broker long name_right | Broker entity name_diff | USD brokerage expenses_left | Total commission paid (USD)_right | USD brokerage expenses_diff | matching_status |
|---|---|---|---|---|---|---|---|
| 202401 | Merrill Lynch international | Merrill Lynch international | 638,131 | 637,898 | 233 | Matched within tolerance | |
| 202402 | Merrill Lynch international | Merrill Lynch international | 659,135 | 658,741 | 394 | Matched within tolerance | |
| 202401 | UBS Securities | UBS Securities plc | 876,369 | 876,341 | 28 | Matched within tolerance | |
| 202402 | UBS Securities | UBS Securities plc | 574,978 | 574,873 | 105 | Matched within tolerance | |
| 202401 | JP Morgan Sec | J.P. Morgan Securities plc | 54,772 | 54,681 | 91 | Matched within tolerance | |
| 202402 | JP Morgan Sec | J.P. Morgan Securities plc | 87,377 | 87,153 | 224 | Matched within tolerance |
The matching score quantifies the difference or dissimilarity between values in the compared columns. A lower score indicates that the values being compared are more similar or alike. A matching score of 0 would indicate perfect similarity or an exact match.
The matching score is measured using the square of the relative change. To avoid #div/0 error, 0.00000001 is added to the divisor.
The date score is measured as the difference between two dates relative to the number of days in a year. Considering leap years, the average number of days in a year is 365.25.
When the tolerance is 1%, the tolerance is 3.6525 days.
The string comparison is measured using the ratio function from fuzzywuzzy library.
xychart-beta
%%{init: { "themeVariables": {"xyChart": {"plotColorPalette": "#0000FF"} }}}%%
title "Performance"
x-axis "Number of records in left dataframe" ["100", "1,000", "5,000", "50,000", "100,000", "200,000", "500,000"]
y-axis "Runtime (seconds)" 0 --> 50
line [0.0159,0.0931,0.4484,4.4604,9.0372,17.6410,45.4180]
Show numbers in table
| Number of records | Run time (seconds) |
|---|---|
| 100 | 0.0159 |
| 1,000 | 0.0931 |
| 5,000 | 0.4484 |
| 50,000 | 4.4604 |
| 100,000 | 9.0372 |
| 200,000 | 17.641 |
| 500,000 | 45.418 |
The dataset size is determined by the number of rows in the left dataframe. The reported execution time is calculated as the mean duration across 5 separate runs. The specific test parameters are as follows:
- Number of columns used for exact matching: 5
- Number of columns used for comparison: 6 (floats)
- Weighting: Equal weights
- Tolerance level: 0%