montassar-agrebi/data-validator-tool

📊 Data Validation & Reconciliation Tool

A Python-based application designed to automate data validation between source and target datasets during data migration, ETL validation, or system reconciliation processes.

The tool allows analysts and data engineers to quickly detect data inconsistencies, missing records, and column mismatches while providing clear insights about overall data migration health.


Overview

Data migrations and ETL pipelines require validating that the target system accurately reflects the source dataset.

Manual spreadsheet comparisons become inefficient and unreliable when working with large datasets or complex schemas.

This application automates the validation workflow and produces clear validation summaries that highlight discrepancies between datasets.


Key Features

Flexible Primary Key Selection

Users can manually select the primary key used to align records between the source and target datasets.

This flexibility allows validation across different table structures and migration scenarios.

Duplicate Key Detection

The tool automatically checks the selected comparison key for duplicate values in both datasets.

If duplicates are detected, validation stops and an error message is displayed to prevent incorrect comparisons.

Users can then select another column as the comparison key.
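The duplicate check described above can be sketched with pandas as follows. This is a minimal illustration, not the tool's actual code: the function names `find_duplicate_keys` and `check_key` and the sample data are hypothetical.

```python
import pandas as pd


def find_duplicate_keys(df: pd.DataFrame, key: str) -> list:
    """Return the key values that appear more than once in df[key]."""
    # keep=False flags every occurrence of a repeated value, not only the later ones.
    dup_mask = df[key].duplicated(keep=False)
    return sorted(df.loc[dup_mask, key].unique().tolist())


def check_key(source: pd.DataFrame, target: pd.DataFrame, key: str) -> None:
    """Raise ValueError if the key is not unique in either dataset."""
    for name, df in (("source", source), ("target", target)):
        dups = find_duplicate_keys(df, key)
        if dups:
            # Show only a few offending values to keep the message readable.
            raise ValueError(f"Duplicate values in {name} key '{key}': {dups[:5]}")


# Hypothetical sample data: id 2 appears twice in the source.
source = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b2", "c"]})
target = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(find_duplicate_keys(source, "id"))  # [2]
```

Stopping on duplicates matters because joining on a non-unique key multiplies rows and would inflate the mismatch counts.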

Dataset Alignment

The validator aligns rows between datasets using the selected key before performing comparisons.

It also detects records that exist only in one dataset, helping identify potential data loss or unexpected records.
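One way to implement this alignment is an outer join with pandas' `merge` and its `indicator` flag. The sketch below assumes both datasets are DataFrames with a unique key column; the function name `align` and the `_src`/`_tgt` suffixes are illustrative choices, not taken from the repository.

```python
import pandas as pd


def align(source: pd.DataFrame, target: pd.DataFrame, key: str):
    """Outer-join on key and split the result into matched and one-sided rows."""
    # indicator=True adds a "_merge" column with values
    # "both", "left_only" or "right_only" for every row.
    merged = source.merge(
        target, on=key, how="outer", suffixes=("_src", "_tgt"), indicator=True
    )
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    only_source = merged.loc[merged["_merge"] == "left_only", key].tolist()
    only_target = merged.loc[merged["_merge"] == "right_only", key].tolist()
    return matched, only_source, only_target


# Hypothetical sample data: id 1 is missing from the target, id 4 is unexpected.
source = pd.DataFrame({"id": [1, 2, 3], "city": ["Paris", "Tunis", "Rome"]})
target = pd.DataFrame({"id": [2, 3, 4], "city": ["Tunis", "Roma", "Oslo"]})
matched, only_source, only_target = align(source, target, "id")
print(only_source, only_target)  # [1] [4]
```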

Column-Level Validation

Each column shared between the datasets is compared to detect value differences, including:

  • value mismatches
  • null inconsistencies
  • missing values
  • unexpected value changes
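A null-aware column comparison could look like the sketch below. It assumes both frames are already indexed by the comparison key and contain the same rows; `compare_column` is a hypothetical helper, and treating a null on both sides as a match is an assumption about the intended behavior.

```python
import pandas as pd


def compare_column(src: pd.Series, tgt: pd.Series) -> pd.Series:
    """Return a boolean mask that is True where two aligned columns differ."""
    both_null = src.isna() & tgt.isna()
    # A null never compares equal to anything, so a pair that is null on
    # both sides would look like a mismatch; exclude those pairs explicitly.
    differs = src.ne(tgt).fillna(True).astype(bool)
    return differs & ~both_null


# Hypothetical sample data, indexed by the comparison key.
source = pd.DataFrame(
    {"id": [1, 2, 3, 4], "email": ["a@x.io", None, "c@x.io", None]}
).set_index("id")
target = pd.DataFrame(
    {"id": [1, 2, 3, 4], "email": ["a@x.io", "b@x.io", None, None]}
).set_index("id")

# Only columns present in both datasets are compared.
shared = source.columns.intersection(target.columns)
mismatches = {col: compare_column(source[col], target[col]) for col in shared}
print(mismatches["email"].tolist())  # [False, True, True, False]
```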

Comparison Modes

The tool provides two comparison modes:

  • Normalized Mode – ignores formatting differences such as letter case, trailing spaces, and numeric formatting
  • Strict Mode – compares values exactly as stored in the datasets
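The difference between the two modes can be illustrated with a small cell-level comparison. The normalization rules below (trim whitespace, lowercase, parse numbers) are an assumption about what Normalized Mode covers, and `normalize` and `values_match` are hypothetical names.

```python
def normalize(value):
    """Canonicalize one cell for a normalized comparison; None stays None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    try:
        # "10", "10.0" and "1e1" all map to the same canonical string.
        return repr(float(text))
    except ValueError:
        # Not numeric: compare the trimmed, lowercased text.
        return text


def values_match(a, b, mode="normalized"):
    """Compare two cells exactly ("strict") or after normalization."""
    if mode == "strict":
        return a == b
    return normalize(a) == normalize(b)


print(values_match("Paris ", "paris"))                 # True
print(values_match("Paris ", "paris", mode="strict"))  # False
print(values_match("10.0", 10))                        # True
```

Strict Mode is the safer choice when case or numeric formatting carries meaning, for example in identifiers or codes with leading zeros.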

Migration Health Insights

After validation, the application produces a summary showing the overall data migration health.

The dashboard includes:

  • rows compared
  • columns compared
  • mismatched values
  • rows containing discrepancies
  • attribute accuracy scores
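These metrics can all be derived from a boolean mismatch table with one row per compared record and one column per compared attribute. In the sketch below, `health_summary` is a hypothetical function, and defining attribute accuracy as the share of matching values per column is an assumption.

```python
import pandas as pd


def health_summary(mismatch: pd.DataFrame) -> dict:
    """Summarize a boolean frame in which True marks a mismatched cell."""
    rows, cols = mismatch.shape
    # The mean of a boolean column is its mismatch rate; accuracy is the complement.
    accuracy = (1 - mismatch.mean()).round(4).to_dict()
    return {
        "rows_compared": rows,
        "columns_compared": cols,
        "mismatched_values": int(mismatch.to_numpy().sum()),
        "rows_with_discrepancies": int(mismatch.any(axis=1).sum()),
        "attribute_accuracy": accuracy,
    }


# Hypothetical mismatch mask for four aligned rows and two shared columns.
mismatch = pd.DataFrame(
    {"name": [False, False, True, False], "city": [False, True, True, False]}
)
summary = health_summary(mismatch)
print(summary["attribute_accuracy"])  # {'name': 0.75, 'city': 0.5}
```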

Mismatch Classification

Detected issues are automatically categorized to help understand the root cause of discrepancies.

Examples include:

  • perfect matches
  • missing values on source
  • missing values on target
  • mostly incorrect values
  • mixed mismatch patterns
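A rule-based classifier for these categories might be sketched as follows. The decision rules and the 80% threshold for "mostly incorrect" are assumptions made for illustration; the tool's actual rules may differ, and `classify_column` is a hypothetical name.

```python
import pandas as pd


def classify_column(src: pd.Series, tgt: pd.Series, mostly: float = 0.8) -> str:
    """Assign one category label to an aligned source/target column pair."""
    src_null, tgt_null = src.isna(), tgt.isna()
    missing_src = int((src_null & ~tgt_null).sum())  # value only in target
    missing_tgt = int((~src_null & tgt_null).sum())  # value only in source
    both_present = ~src_null & ~tgt_null
    differs = src.ne(tgt).fillna(False).astype(bool)
    wrong = int((both_present & differs).sum())      # both set, values differ

    total = missing_src + missing_tgt + wrong
    if total == 0:
        return "perfect match"
    if missing_src == total:
        return "missing values on source"
    if missing_tgt == total:
        return "missing values on target"
    if wrong / len(src) >= mostly:
        return "mostly incorrect values"
    return "mixed mismatch patterns"


# Hypothetical aligned column: one wrong value, one null on each side.
src = pd.Series(["a", None, "c"])
tgt = pd.Series(["z", "b", None])
print(classify_column(src, tgt))  # mixed mismatch patterns
```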

Issue Sampling for Tracking

For each column containing mismatches, the tool generates up to five example records showing the detected issue.

These samples are formatted as text so they can easily be copied into issue trackers, validation logs, or data quality reports.
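Sampling could be sketched like this, assuming an aligned frame whose source and target values sit in `<column>_src` and `<column>_tgt` columns. That naming convention, the function `sample_issues`, and the text layout are hypothetical.

```python
import pandas as pd


def sample_issues(aligned: pd.DataFrame, key: str, column: str, limit: int = 5) -> str:
    """Format up to `limit` mismatching rows of one column as plain text."""
    src_col, tgt_col = f"{column}_src", f"{column}_tgt"
    src, tgt = aligned[src_col], aligned[tgt_col]
    # A row is an issue when the values differ and are not both null.
    differs = src.ne(tgt).fillna(True).astype(bool) & ~(src.isna() & tgt.isna())
    rows = aligned.loc[differs, [key, src_col, tgt_col]].head(limit)
    lines = [
        f"{key}={k} | {column}: source={s!r}, target={t!r}"
        for k, s, t in rows.itertuples(index=False, name=None)
    ]
    return "\n".join(lines)


# Hypothetical aligned data with six mismatching rows; only five are sampled.
aligned = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7],
        "city_src": ["Rome", "Oslo", "Bern", "Kyiv", "Riga", "Lima", "Doha"],
        "city_tgt": ["Roma", "Oslo", "Berne", "Kiev", "Riga ", "lima", "doha"],
    }
)
report = sample_issues(aligned, "id", "city")
print(report.splitlines()[0])  # id=1 | city: source='Rome', target='Roma'
```

Using `repr`-style quoting in the output makes invisible differences such as trailing spaces visible when the text is pasted into a ticket.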


Quick Launch (Recommended)

The repository includes a Windows launcher that runs the application without a manual Python installation.

The launcher uses a bundled portable Python environment and installs required dependencies automatically when needed.

To run the application:

  1. Download or clone the repository
  2. Open the project folder
  3. Double-click 🚀 Start Data Validation Tool.bat

The Streamlit dashboard will automatically open in your browser.


Developer Setup

If you prefer running the application manually:

Install dependencies:


pip install -r requirements.txt

Run the application:


streamlit run app.py

Tech Stack

  • Python
  • Pandas
  • Streamlit

Project Structure


data-validator-tool
│
├── launch_validator.bat
├── python_portable/
│
├── app.py
│
├── src/
│   ├── validator.py
│   ├── comparison.py
│   └── profiling.py
│
├── sample_data/
│
├── screenshots/
│
├── requirements.txt
└── README.md

Example Workflow

  1. Upload the source dataset
  2. Upload the target dataset
  3. Select the comparison key
  4. Choose the comparison mode
  5. Run the validation
  6. Review mismatch insights and migration metrics

Example Output

  • dataset comparison metrics
  • column accuracy scores
  • migration health summary
  • mismatch samples for issue tracking
  • exportable mismatch reports

Screenshots of the interface can be added in this section.


Application Preview


Future Improvements

  • database connection support
  • automated validation report exports
  • large dataset optimization
  • scheduled validation workflows
  • support for additional file formats

Download Portable Version

A portable version of the application is available in the repository releases.

Download the ZIP package and run the launcher to start the tool without installing Python.
