A Python-based application designed to automate data validation between source and target datasets during data migration, ETL validation, or system reconciliation processes.
The tool allows analysts and data engineers to quickly detect data inconsistencies, missing records, and column mismatches while providing clear insights about overall data migration health.
Data migrations and ETL pipelines require validating that the target system accurately reflects the source dataset.
Manual spreadsheet comparisons become inefficient and unreliable when working with large datasets or complex schemas.
This application automates the validation workflow and produces clear validation summaries that highlight discrepancies between datasets.
Users can manually select the primary key used to align records between the source and target datasets.
This flexibility allows validation across different table structures and migration scenarios.
The tool automatically checks the selected comparison key for duplicate values in both datasets.
If duplicates are detected, validation stops and an error message is displayed to prevent incorrect comparisons.
Users can then select another column as the comparison key.
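The duplicate check described above can be sketched in a few lines of pandas. This is a minimal illustration, not the tool's actual implementation; `check_key_uniqueness` and the sample frames are hypothetical names:

```python
import pandas as pd

def check_key_uniqueness(df: pd.DataFrame, key: str) -> bool:
    """Return True when the comparison key holds no duplicate values."""
    return not df[key].duplicated().any()

# Hypothetical datasets; "id" is the user-selected comparison key.
source = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
target = pd.DataFrame({"id": [1, 2, 2], "name": ["a", "b", "x"]})

print(check_key_uniqueness(source, "id"))  # True
print(check_key_uniqueness(target, "id"))  # False -> validation should stop
```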
The validator aligns rows between datasets using the selected key before performing comparisons.
It also detects records that exist only in one dataset, helping identify potential data loss or unexpected records.
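One common way to implement this alignment is an outer merge on the key with pandas' merge indicator, which flags records present on only one side. A sketch with hypothetical data:

```python
import pandas as pd

source = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})
target = pd.DataFrame({"id": [2, 3, 4], "amount": [20, 35, 40]})

# Outer merge on the key; the indicator column flags one-sided records.
merged = source.merge(target, on="id", how="outer",
                      suffixes=("_src", "_tgt"), indicator=True)

only_in_source = merged[merged["_merge"] == "left_only"]["id"].tolist()
only_in_target = merged[merged["_merge"] == "right_only"]["id"].tolist()
print(only_in_source)  # [1]
print(only_in_target)  # [4]
```

Rows flagged `left_only` suggest potential data loss during migration; `right_only` rows are unexpected records in the target.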
Each column shared between the datasets is compared to detect value differences including:
- value mismatches
- null inconsistencies
- missing values
- unexpected value changes
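A per-column comparison along these lines can be sketched as follows, assuming rows are already aligned by the key. Treating null-vs-null as a match (rather than a mismatch) is an assumption here, not necessarily the tool's behavior:

```python
import pandas as pd

def compare_column(src: pd.Series, tgt: pd.Series) -> pd.Series:
    """True where aligned values differ; NaN vs. NaN counts as a match."""
    both_null = src.isna() & tgt.isna()
    return (src != tgt) & ~both_null

# Hypothetical aligned columns, indexed by the comparison key.
src = pd.Series([10, None, 30], index=[1, 2, 3])
tgt = pd.Series([10, None, 35], index=[1, 2, 3])
print(compare_column(src, tgt).tolist())  # [False, False, True]
```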
The tool provides two comparison modes:
- Normalized Mode – removes formatting differences such as case sensitivity, trailing spaces, and numeric formatting
- Strict Mode – compares values exactly as stored in the datasets
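Normalized Mode can be approximated by canonicalizing each cell before comparison; Strict Mode simply skips this step. The `normalize` helper below is an illustrative sketch, not the tool's exact rules:

```python
def normalize(value):
    """Normalize a cell for comparison: trim, casefold, unify numbers."""
    if value is None:
        return None
    text = str(value).strip()
    try:
        # Collapse numeric formatting, e.g. "1.50" and "1.5" compare equal.
        return str(float(text))
    except ValueError:
        return text.casefold()

print(normalize(" ABC "))                     # "abc"
print(normalize("1.50") == normalize("1.5"))  # True
```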
After validation, the application produces a summary showing the overall data migration health.
The dashboard includes:
- rows compared
- columns compared
- mismatched values
- rows containing discrepancies
- attribute accuracy scores
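An attribute accuracy score of this kind is typically the share of aligned rows whose values match. A minimal sketch, again assuming null-vs-null counts as a match:

```python
import pandas as pd

def column_accuracy(src: pd.Series, tgt: pd.Series) -> float:
    """Percentage of aligned rows whose values match."""
    matches = (src == tgt) | (src.isna() & tgt.isna())
    return round(matches.mean() * 100, 2)

src = pd.Series([1, 2, 3, 4])
tgt = pd.Series([1, 2, 9, 4])
print(column_accuracy(src, tgt))  # 75.0
```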
Detected issues are automatically categorized to help pinpoint the root cause of each discrepancy.
Examples include:
- perfect matches
- missing values on source
- missing values on target
- mostly incorrect values
- mixed mismatch patterns
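A categorization step like this can be driven by simple per-column counts. The thresholds and labels below are hypothetical, chosen only to mirror the categories listed above:

```python
def categorize(total: int, mismatched: int, src_nulls: int, tgt_nulls: int) -> str:
    """Assign a root-cause label to a column's mismatch profile (illustrative rules)."""
    if mismatched == 0:
        return "perfect match"
    if src_nulls == mismatched:
        return "missing values on source"
    if tgt_nulls == mismatched:
        return "missing values on target"
    if mismatched / total > 0.8:  # assumed cutoff for "mostly incorrect"
        return "mostly incorrect values"
    return "mixed mismatch patterns"

print(categorize(100, 0, 0, 0))    # perfect match
print(categorize(100, 90, 0, 0))   # mostly incorrect values
print(categorize(100, 10, 0, 10))  # missing values on target
```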
For each column containing mismatches, the tool generates up to five example records showing the detected issue.
These samples are formatted as text so they can easily be copied into issue trackers, validation logs, or data quality reports.
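Producing such copy-paste-ready samples might look like the sketch below, assuming a mismatch frame with `_src`/`_tgt` suffixed columns (the column names and layout are illustrative):

```python
import pandas as pd

def format_samples(df: pd.DataFrame, column: str, key: str, limit: int = 5) -> str:
    """Render up to `limit` mismatch examples as plain text for issue trackers."""
    lines = [f"Column: {column}"]
    for _, row in df.head(limit).iterrows():
        lines.append(f"  key={row[key]}  source={row[f'{column}_src']}  "
                     f"target={row[f'{column}_tgt']}")
    return "\n".join(lines)

mismatches = pd.DataFrame({"id": [7, 9],
                           "amount_src": [10, 30],
                           "amount_tgt": [12, 31]})
print(format_samples(mismatches, "amount", "id"))
```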
The repository includes a Windows launcher allowing the application to run without installing Python manually.
The launcher uses a bundled portable Python environment and installs required dependencies automatically when needed.
To run the application:
- Download or clone the repository
- Open the project folder
- Double-click 🚀 Start Data Validation Tool.bat
The Streamlit dashboard will automatically open in your browser.
If you prefer running the application manually:

Install dependencies:

```
pip install -r requirements.txt
```

Run the application:

```
streamlit run app.py
```
Built with:
- Python
- Pandas
- Streamlit
```
data-validator-tool
│
├── launch_validator.bat
├── python_portable/
│
├── app.py
│
├── src/
│   ├── validator.py
│   ├── comparison.py
│   └── profiling.py
│
├── sample_data/
│
├── screenshots/
│
├── requirements.txt
└── README.md
```
To validate a migration:
- Upload the source dataset
- Upload the target dataset
- Select the comparison key
- Choose the comparison mode
- Run the validation
- Review mismatch insights and migration metrics
The validation output includes:
- dataset comparison metrics
- column accuracy scores
- migration health summary
- mismatch samples for issue tracking
- exportable mismatch reports
Screenshots of the interface can be added in this section.
Planned enhancements:
- database connection support
- automated validation report exports
- large dataset optimization
- scheduled validation workflows
- support for additional file formats
A portable version of the application is available in the repository releases.
Download the ZIP package and run the launcher to start the tool without installing Python.


