A pure Python implementation of a high-dimensional embedding generator designed for active learning research. It provides a web interface for parameter control and 3D visualization, and supports the FPS (Farthest Point Sampling) algorithm and multiple dimensionality reduction methods. The backend uses CPU-only libraries such as NumPy, SciPy, and scikit-learn.
- Neural Network Encoder Simulation: Simulates various characteristics of real neural network encoders
- Precise Parameter Control: 11 parameters for precise control of geometric and statistical properties
- Modern Web Interface: Dark theme, responsive design, professional research tool experience
- Multiple Dimensionality Reduction: Supports PCA, t-SNE, UMAP algorithms
- FPS Sampling Algorithm: Complete Farthest Point Sampling implementation with 5 distance metrics
- Intelligent Caching System: HDF5 caching improves performance, avoids redundant computation
- Flexible Dimension Support: 3-2048 dimensional embedding generation
ALQuery3D/
├── src/                              # Source code directory
│   ├── data/                         # Data processing modules
│   │   ├── __init__.py
│   │   └── embedding_generator.py    # High-dimensional embeddings generator
│   ├── algorithms/                   # Algorithm implementations
│   │   ├── __init__.py
│   │   └── fps.py                    # FPS Farthest Point Sampling algorithm
│   ├── web/                          # Web interface
│   │   ├── app.py                    # Flask backend
│   │   └── templates/
│   │       └── index.html            # Web frontend interface
│   └── __init__.py
├── data/                             # Data cache directory
│   └── tmp_data.h5                   # HDF5 cache file (generated at runtime)
├── examples/                         # Example code
│   └── generate_embeddings_demo.py   # Embeddings generation demo
├── tests/                            # Test directory
│   └── test_embedding_generator.py   # Unit tests
├── run_web.py                        # Web application startup script
├── requirements.txt                  # Project dependencies
├── README.md
└── LICENSE
pip install -r requirements.txt

Or manually install core dependencies:

pip install numpy scikit-learn matplotlib scipy flask plotly h5py umap-learn

python run_web.py

Then visit http://localhost:5000 in your browser.
import numpy as np

from src.data.embedding_generator import EmbeddingGenerator

# Create generator
generator = EmbeddingGenerator(embedding_dim=128, random_state=42)

# Generate embeddings (all parameters normalized to 0-1 range)
embeddings, labels = generator.generate_clustered_embeddings(
    n_samples_per_class=[100, 150, 120],  # Number of samples per class
    dispersion=0.6,                       # Dispersion (0.0-1.0)
    curvature=0.2,                        # Curvature (0.0-1.0)
    flatness=0.7,                         # Flatness (0.0-1.0)
    inter_class_distance=0.8,             # Inter-class distance (0.0-1.0)
    intra_class_correlation=0.4           # Intra-class correlation (0.0-1.0)
)
print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Number of classes: {len(np.unique(labels))}")

The generator simulates high-dimensional embeddings produced by neural network encoders, with the following characteristics:
- Dispersion: Controls the spread of intra-class samples
- Curvature: Controls nonlinear deformation, forming cone-like distributions
- Flatness: Controls compression in certain dimensions, approaching hyperplanes
- Inter-class Distance: Controls distance between different class centers
- Intra-class Correlation: Controls correlation between intra-class features
- Inter-hyperplane Parallelism: Controls parallelism between class hyperplanes
- Manifold Complexity: Simulates nonlinear activation function effects in neural networks
- Feature Sparsity: Simulates feature sparsity caused by ReLU and other activation functions
- Noise Level: Simulates information loss during encoding process
- Boundary Sharpness: Controls clarity of class boundaries
- Dimensional Anisotropy: Importance differences across different dimensions
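
To make two of these properties concrete, here is a deliberately simplified sketch of how dispersion and inter-class distance could shape a dataset. The helper `toy_clustered_embeddings` is hypothetical and is not the project's `EmbeddingGenerator`; it takes raw (unnormalized) values and models only isotropic Gaussian clusters, with none of the curvature, flatness, or sparsity effects listed above.

```python
import numpy as np

def toy_clustered_embeddings(n_samples_per_class, embedding_dim,
                             dispersion=1.0, inter_class_distance=5.0, seed=0):
    """Toy stand-in: isotropic Gaussian clusters around random centers.

    Hypothetical helper for illustration only; not the project's generator.
    """
    rng = np.random.default_rng(seed)
    embeddings, labels = [], []
    for class_id, n in enumerate(n_samples_per_class):
        # Place each class center on a sphere of radius inter_class_distance.
        direction = rng.normal(size=embedding_dim)
        center = inter_class_distance * direction / np.linalg.norm(direction)
        # dispersion acts as the standard deviation around the class center.
        points = center + dispersion * rng.normal(size=(n, embedding_dim))
        embeddings.append(points)
        labels.append(np.full(n, class_id))
    return np.vstack(embeddings), np.concatenate(labels)

X, y = toy_clustered_embeddings([100, 150, 120], embedding_dim=128)
print(X.shape, np.bincount(y))
```

Raising `dispersion` widens each cluster, while raising `inter_class_distance` pushes the class centers farther from the origin and, in high dimensions, from each other.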
- PCA: Principal Component Analysis, preserves maximum variance
- t-SNE: t-distributed Stochastic Neighbor Embedding, preserves local structure
- UMAP: Uniform Manifold Approximation and Projection, balances global and local structure
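
As a rough illustration of what "preserves maximum variance" means for PCA, the sketch below projects data onto its top three principal directions using a plain NumPy SVD. This is a swapped-in technique for clarity: it is not the project's reduction code, and the function name `pca_reduce` is an assumption made for this example.

```python
import numpy as np

def pca_reduce(X, n_components=3):
    """Project X onto its top principal components via SVD (illustrative)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
reduced, ratio = pca_reduce(X, n_components=3)
print(reduced.shape)
print(ratio)  # fraction of total variance captured by each kept component
```

t-SNE and UMAP have no comparable closed form, which is one reason they are noticeably slower than PCA on the same data.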
Complete FPS (Farthest Point Sampling) implementation:
- Multiple Distance Metrics: Euclidean, cosine, Chebyshev, Manhattan, Minkowski distances
- Interactive Point Selection: Click any point in 3D visualization to set starting position
- Path Visualization: Cyan gradient display of complete FPS traversal path
- Range View Function: View any continuous subsequence of FPS path
- Statistical Analysis: Path distances, class distribution, sampling quality assessment
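
The core greedy loop of Farthest Point Sampling can be sketched in a few lines. This is a minimal version under stated assumptions: it supports only Euclidean distance, and the function name `farthest_point_sampling` is hypothetical rather than the interface exposed by `src/algorithms/fps.py`.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start_idx=0):
    """Greedy FPS with Euclidean distance; returns indices in selection order."""
    n = points.shape[0]
    num_samples = min(num_samples, n)
    selected = [start_idx]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start_idx], axis=1)
    for _ in range(num_samples - 1):
        # Pick the point that is farthest from the current selection.
        next_idx = int(np.argmax(min_dist))
        selected.append(next_idx)
        dist_to_new = np.linalg.norm(points - points[next_idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected

points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
print(farthest_point_sampling(points, num_samples=3))  # [0, 2, 3]
```

Keeping a running `min_dist` array means each step costs one pass over the data, so selecting k points from n costs O(k·n) distance evaluations instead of recomputing all pairwise distances.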
- Parameter Control: Intuitive sliders and input boxes control all parameters
- Real-time Visualization: 3D interactive charts with rotation and zoom support
- Dimensionality Reduction Switching: One-click switching between PCA, t-SNE, UMAP
- Intelligent Caching: Automatic caching of dimensionality reduction results for improved response speed
- Statistical Information: Real-time display of data statistics and dimension information
- FPS Sampling: Complete FPS sampling and visualization functionality
| Parameter | Type | Range | Description |
|---|---|---|---|
| `n_samples_per_class` | List[int] | 10-5000 | Number of samples per class |
| `embedding_dim` | int | 3-2048 | Embedding dimension |
| Parameter | Range | Internal Mapping | Description |
|---|---|---|---|
| `dispersion` | 0.0-1.0 | 0.001-20.0 | Dispersion, controls intra-class sample spread |
| `curvature` | 0.0-1.0 | 0.0-5.0 | Curvature, controls nonlinear deformation |
| `flatness` | 0.0-1.0 | 0.001-1.0 | Flatness, controls dimensional compression |
| `intra_class_correlation` | 0.0-1.0 | 0.0-0.99 | Intra-class correlation, controls feature correlation |
| Parameter | Range | Internal Mapping | Description |
|---|---|---|---|
| `manifold_complexity` | 0.0-1.0 | 0.0-2.0 | Manifold complexity, simulates nonlinear activation functions |
| `feature_sparsity` | 0.0-1.0 | 0.0-0.9 | Feature sparsity, simulates ReLU activation |
| `noise_level` | 0.0-1.0 | 0.0-0.5 | Noise level, simulates information loss |
| `boundary_sharpness` | 0.0-1.0 | 0.0-5.0 | Boundary sharpness, controls decision boundary clarity |
| `dimensional_anisotropy` | 0.0-1.0 | 0.0-0.8 | Dimensional anisotropy, simulates feature importance differences |
| Parameter | Range | Internal Mapping | Description |
|---|---|---|---|
| `inter_class_distance` | 0.0-1.0 | 0.1-50.0 | Inter-class distance, controls distance between class centers |
| `inter_hyperplane_parallelism` | 0.0-1.0 | 0.0-0.99 | Inter-hyperplane parallelism |
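
The "Internal Mapping" column can be read as a rescaling of each normalized value into an internal range. The sketch below assumes a simple linear interpolation, which the tables above do not confirm (the generator might use a logarithmic or other curve for wide ranges such as `dispersion`). The dictionary `INTERNAL_RANGES` and the function `to_internal` are hypothetical names, and only a few parameters are listed.

```python
# Internal ranges copied from the parameter tables; the linear mapping itself
# is an assumption made for illustration.
INTERNAL_RANGES = {
    "dispersion": (0.001, 20.0),
    "curvature": (0.0, 5.0),
    "flatness": (0.001, 1.0),
    "noise_level": (0.0, 0.5),
    "inter_class_distance": (0.1, 50.0),
}

def to_internal(name, value):
    """Map a normalized value in [0, 1] to the parameter's internal range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    low, high = INTERNAL_RANGES[name]
    return low + value * (high - low)

for name, value in [("curvature", 0.2), ("inter_class_distance", 0.8)]:
    print(name, value, "->", to_internal(name, value))
```

Under this assumption, a normalized value of 0.0 always yields the lower bound of the internal range and 1.0 the upper bound.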
# Set different parameters for each class
embeddings, labels = generator.generate_clustered_embeddings(
    n_samples_per_class=[80, 100, 120],
    dispersion=[0.3, 0.6, 0.9],               # Different dispersion per class
    curvature=[0.1, 0.3, 0.5],                # Different curvature per class
    flatness=[0.4, 0.7, 1.0],                 # Different flatness per class
    inter_class_distance=0.7,                 # Global inter-class distance
    intra_class_correlation=[0.2, 0.5, 0.8]   # Different correlation per class
)

# Simulate real neural network encoder
embeddings, labels = generator.generate_clustered_embeddings(
    n_samples_per_class=[200, 200, 200],
    dispersion=0.5,
    curvature=0.3,
    flatness=0.6,
    manifold_complexity=0.3,      # Moderate nonlinearity
    feature_sparsity=0.2,         # Slight sparsity
    noise_level=0.05,             # Small amount of noise
    boundary_sharpness=0.7,       # Clear boundaries
    dimensional_anisotropy=0.4    # Moderate anisotropy
)

# PCA dimensionality reduction to 3D
reduced_pca = generator.reduce_dimensions(n_components=3, method='pca')
# t-SNE dimensionality reduction to 3D
reduced_tsne = generator.reduce_dimensions(n_components=3, method='tsne')
# UMAP dimensionality reduction to 3D
reduced_umap = generator.reduce_dimensions(n_components=3, method='umap')
# Get dimensionality reduction information
info = generator.dimensionality_reduction_info
print(f"Dimensionality reduction method: {info['method']}")

from src.algorithms.fps import create_fps_sampler
# Create FPS sampler
fps_sampler = create_fps_sampler()
# Execute FPS sampling
selected_indices = fps_sampler.sample(
    embeddings,                    # Original high-dimensional data
    start_idx=0,                   # Starting point index
    num_samples=50,                # Number of samples
    distance_metric='euclidean'    # Distance metric
)
# Get statistical information
stats = fps_sampler.get_path_statistics(
    embeddings, selected_indices, labels, 'euclidean'
)
print(f"Sampled {stats['total_points']} points")
print(f"Total path length: {stats['total_distance']:.3f}")

- Select number of classes (1-10)
- Set embedding dimension (3-2048)
- Set independent parameters for each class
- Choose dimensionality reduction method (PCA/t-SNE/UMAP)
- Click "Generate Embeddings" button
- Wait for backend processing (loading animation displayed)
- View 3D visualization results on the right
- Click any point in 3D plot to set starting position
- Configure sampling parameters (quantity, distance metric)
- Click "Start FPS Sampling" to execute sampling
- View FPS path visualization and statistical information
- Set view range (start and end indices)
- Click "View Range" to view specified range
- Observe statistical information within the range
- HDF5 Caching: Intelligent caching of dimensionality reduction results, avoiding redundant computation
- Data Standardization: Automatic standardization to -1~1 range, preserving relative relationships
- Memory Management: Efficient data structure design, supports large-scale data
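
One way to standardize data into the -1 to 1 range while preserving relative relationships is to divide everything by a single global scale factor. The sketch below takes that approach. It is an assumption about the method, not a description of the project's code, which may center the data first or scale each dimension separately; `scale_to_unit_range` is a hypothetical name.

```python
import numpy as np

def scale_to_unit_range(X):
    """Divide by the largest absolute value so all entries fall in [-1, 1]."""
    max_abs = np.max(np.abs(X))
    if max_abs == 0:
        # All-zero input: nothing to scale.
        return X.copy()
    return X / max_abs

X = np.array([[2.0, -4.0], [1.0, 8.0]])
scaled = scale_to_unit_range(X)
print(scaled)
```

Because every coordinate is divided by the same constant, ratios between pairwise distances are unchanged, so nearest-neighbor structure and FPS selection order are unaffected by the scaling.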
- Cyan Gradient: FPS paths displayed with cyan gradient
- Interactive 3D: High-quality interactive charts with Plotly
- Responsive Design: Adapts to different screen sizes
- Modular Design: Easy to add new distance metrics and dimensionality reduction methods
- API Friendly: Provides complete programming interface
- Test Coverage: Complete unit test suite
- Generate datasets with specific characteristics
- Test effectiveness of different sampling strategies
- Visualize sampling results and data distribution
- Compare PCA, t-SNE, UMAP effects on same data
- Study impact of different parameters on dimensionality reduction results
- Simulate different types of neural network encoder outputs
- Study geometric properties of high-dimensional features
- Intuitively demonstrate high-dimensional data characteristics
- Understand impact of different parameters on data distribution
# Launch web application
python run_web.py

# Run complete demo
python examples/generate_embeddings_demo.py

# Run unit tests
python tests/test_embedding_generator.py

- First Use: t-SNE and UMAP first-time computation is slow, please be patient
- Large Samples: t-SNE with 5000 samples may take several minutes to compute
- Cache Cleanup: Regenerating data automatically cleans old cache
- Memory Usage: For large datasets, recommend closing other programs to free memory
- Parameter Effects: Extreme parameter values may produce unexpected data distributions
- Port Occupied: Modify port number in run_web.py
- Missing Dependencies: Run `pip install -r requirements.txt`
- Network Access: Ensure firewall allows port 5000
- Browser Compatibility: Recommend using latest Chrome/Firefox
- For large datasets, recommend reducing sample count
- t-SNE and UMAP computation is slow, please be patient
- Close other browser tabs to free memory
This project is licensed under the MIT License - see the LICENSE file for details.
Issues and pull requests to improve this project are welcome!
ALQuery3D - Providing powerful high-dimensional data generation and visualization tools for active learning research!