A comprehensive Streamlit-based application for creating balanced, non-overlapping data panels with precise stratification control.
- N-Way Panel Splitting: Split panels into any number of sets (2, 3, 4, etc.) - not just two!
- User-Specified Subdivision: Choose exactly how many sub-sets each panel should be split into
- Enhanced Statistics: Compare distributions across all N sets with max deviation tracking
- Flexible File Naming: Automatic naming convention panel_X_set_Y.csv for any number of sets
- Upload & Analyze: Support for CSV and Excel files with automatic validation
- Flexible Stratification: Select any columns for stratification with custom target proportions
- Equal Deviation Distribution: When samples are insufficient, deviation is distributed equally across ALL panels (Version 1.2)
- Multi-Panel Creation: Create multiple non-overlapping panels with iterative proportional fitting
- N-Way Splitting: Split each panel into 2, 3, 4, or more equally sized, stratified sets (NEW!)
- Comprehensive Validation:
- Distribution matching verification
- Overlap detection across all panels and sets
- Detailed statistical summaries for all N sets
- Easy Export: Download individual files or batch export all results to CSV
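The Equal Deviation Distribution feature listed above can be read in the following way: when a category has too few rows to meet its target in every panel, each panel gets the same reduced quota, so no single panel absorbs the whole shortfall. The sketch below illustrates that reading only. It is not the logic of compute_adjusted_targets(), and the helper name equal_share_quota and its parameters are hypothetical.

```python
def equal_share_quota(available, per_panel_target, num_panels):
    """Hypothetical helper: rows of one category that each panel receives."""
    if available >= per_panel_target * num_panels:
        return per_panel_target  # enough rows: every panel meets its target
    # Not enough rows: give every panel the same reduced quota.
    return available // num_panels


# 250 rows available, 3 panels, each panel ideally wants 100 rows.
quota = equal_share_quota(available=250, per_panel_target=100, num_panels=3)
shortfall = 100 - quota
print(quota, shortfall)  # prints: 83 17
```

Under this reading every panel is 17 rows short in that category, rather than two panels being complete and the third missing 50. How the freed slots are refilled from other categories is not covered by this sketch.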
- Install Python dependencies:
  pip install -r requirements.txt
- Verify installation:
  streamlit --version
- Launch the application:
  ./run_app.sh
  Or directly:
  streamlit run main.py
- Follow the workflow:
Step 1: Upload Data
- Upload your CSV or Excel file
- Review dataset overview and column information
Step 2: Configure Targets
- Select columns for stratification
- Define target proportions for each category
- Option to use current dataset distributions
Step 3: Create Panels
- Specify number of panels and panel size
- System checks data availability
- Creates non-overlapping panels with balanced distributions
- Review detailed statistics for each panel
Step 4: Split Panels
- NEW: Choose how many sets to split each panel into (2-10 sets)
- Split each panel into N equal, balanced sets
- Maintains proportional balance across all stratification variables
- Compare distributions across all sets
Step 5: Validate & Export
- Verify no overlaps exist between any sets
- Export all files to CSV
- Download individual files or batch export
- Generate summary report
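The availability check in Step 3 comes down to simple arithmetic: each category must supply its per-panel quota once for every panel. The sketch below estimates an upper bound on the number of panels from the marginal counts alone. It is a simplified illustration, not the code behind calculate_max_possible_panels(); the function name max_feasible_panels is hypothetical, and joint strata may lower the real limit further.

```python
import pandas as pd


def max_feasible_panels(df, targets, panel_size):
    """Hypothetical helper: upper bound on panels from marginal counts."""
    limit = len(df) // panel_size  # cannot use more rows than exist
    for feature, proportions in targets.items():
        counts = df[feature].value_counts()
        for category, share in proportions.items():
            needed = int(round(panel_size * share))  # rows per panel
            if needed > 0:
                available = int(counts.get(category, 0))
                limit = min(limit, available // needed)
    return limit


# Toy data: 70 Male and 30 Female rows, target split 50/50.
demo = pd.DataFrame({"Gender": ["Male"] * 70 + ["Female"] * 30})
demo_targets = {"Gender": {"Male": 0.5, "Female": 0.5}}
max_panels = max_feasible_panels(demo, demo_targets, panel_size=20)
print(max_panels)  # prints: 3
```

Each panel of 20 needs 10 Female rows, and only 30 exist, so the Female category caps the count at 3 panels even though the dataset has room for 5 panels by size alone.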
Panneling/
├── main.py # Consolidated Streamlit application (all-in-one)
├── tests/ # Test files
│ ├── test_paneling.py
│ ├── test_n_way_split.py
│ └── test_equal_deviation.py
├── data/ # Output directory for CSV files
├── config/
├── requirements.txt # Python dependencies
├── run_app.sh # Launch script
├── README.md # This file
└── Bihar_panels.xlsx # Example dataset
Note: All code has been consolidated into a single main.py file for easier deployment and distribution.
The tool uses iterative proportional fitting with multi-dimensional stratified sampling:
- Hierarchical Stratification: Stratifies across multiple features simultaneously
- Joint Sampling: Creates joint strata from combinations of target features
- Proportional Allocation: Allocates samples to match target proportions
- Non-Overlapping: Maintains a set of used indices to ensure exclusivity
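The four ideas above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the code in main.py: the helper draw_panel is hypothetical, it takes the joint share of a stratum to be the product of the marginal targets (which assumes the features are independent), and it skips the iterative proportional fitting step that the tool uses to reconcile the marginals.

```python
import pandas as pd


def draw_panel(df, targets, panel_size, used, random_state=42):
    """Hypothetical helper: draw one panel from rows not yet in `used`."""
    features = list(targets)
    available = df.loc[~df.index.isin(list(used))]
    picked = []
    for keys, stratum in available.groupby(features):
        keys = keys if isinstance(keys, tuple) else (keys,)
        # Assumption: joint share = product of the marginal targets.
        share = 1.0
        for feature, value in zip(features, keys):
            share *= targets[feature].get(value, 0.0)
        # Never ask a stratum for more rows than it still has.
        quota = min(int(round(panel_size * share)), len(stratum))
        picked.append(stratum.sample(n=quota, random_state=random_state))
    panel = pd.concat(picked)
    used.update(panel.index)  # later panels will skip these rows
    return panel


# Toy dataset: 40 rows, 10 in each Gender x zone combination.
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 20,
    "zone": ["North", "North", "South", "South"] * 10,
})
targets = {
    "Gender": {"Male": 0.5, "Female": 0.5},
    "zone": {"North": 0.5, "South": 0.5},
}
used = set()
panel_1 = draw_panel(df, targets, panel_size=8, used=used)
panel_2 = draw_panel(df, targets, panel_size=8, used=used)
print(set(panel_1.index).isdisjoint(panel_2.index))  # prints: True
```

Because quotas are rounded per stratum, the panel size can drift by a few rows when the shares do not divide evenly; a fuller implementation would redistribute the remainder.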
Each panel can be split into N equal sets using round-robin stratified assignment:
- Strata Creation: Combines all target features into unique strata
- N-Way Distribution: Within each stratum, distributes samples evenly across N sets using round-robin
- Balance Preservation: Ensures all N sets maintain the same proportional distributions
- Max Deviation Tracking: Monitors maximum deviation across all sets for each category
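The round-robin assignment described above can be sketched as follows. This is an illustrative version, not the body of split_panel_into_n_sets(); the helper name round_robin_split is hypothetical. It shuffles the panel, groups rows of the same stratum together, and then deals rows to the N sets in turn, so that within any stratum the set counts differ by at most one.

```python
import numpy as np
import pandas as pd


def round_robin_split(panel, features, num_sets, random_state=42):
    """Hypothetical helper: split a panel into num_sets stratified sets."""
    # Shuffle first so the order within each stratum is random.
    shuffled = panel.sample(frac=1, random_state=random_state)
    # Group rows of the same stratum together, keeping the shuffled order.
    ordered = shuffled.sort_values(features, kind="stable")
    # Deal rows to sets 0, 1, ..., num_sets - 1 in turn.
    assignment = np.arange(len(ordered)) % num_sets
    return [ordered[assignment == k] for k in range(num_sets)]


# Toy panel: 12 rows, 3 in each Gender x zone stratum.
panel = pd.DataFrame({
    "Gender": ["Male", "Female"] * 6,
    "zone": ["North"] * 6 + ["South"] * 6,
})
sets = round_robin_split(panel, ["Gender", "zone"], num_sets=3)
print([len(s) for s in sets])  # prints: [4, 4, 4]
```

Dealing continuously across strata, instead of restarting at the first set for every stratum, keeps the overall set sizes within one row of each other even when stratum sizes are not multiples of N.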
The consolidated main.py file contains all functionality organized into sections:
- check_master_distribution(): Validates dataset has sufficient samples
- create_balanced_sample(): Creates a single balanced sample with target distributions
- compute_adjusted_targets(): Applies equal deviation distribution when samples are insufficient
- create_panels(): Creates multiple non-overlapping panels
- split_panel_into_n_sets(): Splits a panel into N balanced sets (NEW!)
- split_panel_into_two(): Legacy function, wrapper around N-way split
- split_all_panels(): Splits all panels into N sets each

- validate_uploaded_file(): Validates input data
- validate_target_proportions(): Ensures targets sum to 1.0
- check_availability(): Verifies sufficient samples for requested panels
- print_distribution_table(): Creates formatted distribution tables
- check_overlap_between_sets(): Verifies mutual exclusivity
- create_comparison_table(): Compares distributions across multiple sets
- calculate_max_possible_panels(): Calculates maximum feasible panels
- get_feature_statistics(): Provides feature-level statistics

- initialize_session_state(): Manages application state
- main(): Main application with 5-step workflow
The Streamlit UI handles this automatically, but the underlying logic flow is:
    import pandas as pd

    # 1. Load data
    df = pd.read_excel('Bihar_panels.xlsx')

    # 2. Define targets (proportions for each feature must sum to 1.0)
    features = ['Gender', 'zone', '2020 AE', '2024 GE']
    targets = {
        'Gender': {'Male': 0.5, 'Female': 0.5},
        'zone': {...},
        '2020 AE': {...},
        '2024 GE': {...}
    }

    # 3. Create panels
    panels, stats = create_panels(
        df, targets, features=features,
        num_panels=3, panel_size=1050, random_state=42
    )

    # 4. Split panels into N sets (e.g., 3 sets per panel)
    splits, split_stats = split_all_panels(
        panels, targets, features, num_sets=3, random_state=42
    )

    # 5. Export (e.g., for 3 sets per panel); the original index is written as the first column
    for panel_idx, sets in enumerate(splits, 1):
        for set_idx, set_df in enumerate(sets, 1):
            set_df.to_csv(f'data/panel_{panel_idx}_set_{set_idx}.csv')

The tool performs multiple validation checks:
- Pre-Creation: Checks if master dataset has sufficient samples for targets
- Post-Creation: Compares actual vs. target distributions for each panel
- Post-Split: Verifies all N sets have matching distributions with deviation tracking
- Overlap Check: Ensures complete mutual exclusivity across all panels and sets
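The overlap check relies on the original row index being preserved in every panel and set, so mutual exclusivity reduces to comparing index labels pairwise. The sketch below shows one way to do that; it is not the implementation of check_overlap_between_sets(), and the helper name find_overlaps is hypothetical.

```python
from itertools import combinations

import pandas as pd


def find_overlaps(named_sets):
    """Hypothetical helper: map each overlapping pair to its shared index labels."""
    overlaps = {}
    for (name_a, a), (name_b, b) in combinations(named_sets.items(), 2):
        shared = a.index.intersection(b.index)
        if len(shared) > 0:
            overlaps[(name_a, name_b)] = list(shared)
    return overlaps


# Toy example: the last two sets deliberately share row 3.
base = pd.DataFrame({"Gender": ["Male", "Female"] * 3})
named_sets = {
    "panel_1_set_1": base.iloc[0:2],
    "panel_1_set_2": base.iloc[2:4],
    "panel_2_set_1": base.iloc[3:6],
}
overlaps = find_overlaps(named_sets)
print(overlaps)
```

An empty dictionary means every row appears in at most one set. Pairwise comparison is quadratic in the number of sets, which is fine for a handful of panels with up to 10 sets each.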
Generated CSV files follow this naming convention (example with 3 sets per panel):
- panel_1_set_1.csv
- panel_1_set_2.csv
- panel_1_set_3.csv
- panel_2_set_1.csv
- panel_2_set_2.csv
- panel_2_set_3.csv
- etc.
Each file contains:
- All original columns from the input dataset
- Original index preserved for traceability
- Large Datasets: The tool handles datasets with 40,000+ rows efficiently
- Multiple Features: Supports stratification across 4+ features simultaneously
- Progress Tracking: Shows progress bars for long-running operations
- Memory Efficient: Uses pandas for efficient data handling
Issue: "Not enough samples" error
- Solution: Reduce panel size or number of panels, or adjust target proportions to match available data
Issue: Target proportions don't sum to 1.0
- Solution: Use the "Use current distribution" option or manually adjust proportions
Issue: Large deviations from targets
- Solution: Some categories may be underrepresented in the master dataset. Check availability warnings.
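For the proportions issue above, both fixes are easy to reproduce outside the UI. The sketch below derives targets from the current dataset distribution and, separately, rescales hand-entered proportions so they sum to 1.0. The helper names targets_from_data and normalize_targets are hypothetical and are not part of main.py.

```python
import pandas as pd


def targets_from_data(df, features):
    """Hypothetical helper: use the dataset's own distribution as targets."""
    return {f: df[f].value_counts(normalize=True).to_dict() for f in features}


def normalize_targets(targets):
    """Hypothetical helper: rescale each feature's proportions to sum to 1.0."""
    normalized = {}
    for feature, proportions in targets.items():
        total = sum(proportions.values())
        if total <= 0:
            raise ValueError(f"Proportions for {feature!r} must sum to a positive number")
        normalized[feature] = {cat: share / total for cat, share in proportions.items()}
    return normalized


# Targets taken from the data: 3 Male rows and 1 Female row.
sample = pd.DataFrame({"Gender": ["Male"] * 3 + ["Female"]})
current = targets_from_data(sample, ["Gender"])

# Hand-entered targets that sum to 1.2 are rescaled to 0.5 each.
fixed = normalize_targets({"Gender": {"Male": 0.6, "Female": 0.6}})
print(current, fixed)
```

Rescaling preserves the ratios between categories, which may or may not match the intent behind the original numbers, so it is worth reviewing the result before creating panels.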
- Python Version: 3.8+
- Architecture: Single consolidated Python file for easy deployment
- Main Dependencies:
- Streamlit 1.28+
- pandas 2.0+
- numpy 1.24+
- Random State: Fully reproducible results with seed control
- Code Organization: ~1,570 lines organized into utility, core paneling, and UI sections
For questions or issues, please refer to this README or contact your administrator.
- v1.3: N-way panel splitting (2-10 sets per panel)
- v1.2: Equal deviation distribution for insufficient samples
- v1.1: Initial release with 2-way splitting
- v1.4: Consolidated all code into a single main.py file (Current)
Internal use only.