A comprehensive Python application for analyzing large-scale gene expression data and extracting meaningful biological insights using Machine Learning.
This application provides a complete pipeline for biological data analysis including:
- Data Preprocessing: Cleaning, normalization, and preparation
- Exploratory Data Analysis: Visualization and statistical analysis
- Unsupervised Learning: Clustering analysis to discover patterns
- Supervised Learning: Classification models for condition prediction
- Comprehensive Reporting: Automated insights and visualizations
The app works with gene expression datasets where:
- Rows = Samples (e.g., patients, cell lines)
- Columns = Gene expressions (e.g., GENE_0001, GENE_0002, ...)
- Target = Condition column (e.g., 'Healthy', 'Cancer_TypeA', 'Cancer_TypeB')
If no dataset is provided, the app automatically generates a realistic sample dataset with 500 samples and 1000 genes.
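The layout described above can be illustrated with a small generator. This is a minimal sketch, not the app's actual generator: the function name `make_sample_dataset`, the log-normal distribution, the class probabilities, and the choice of which genes to shift are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def make_sample_dataset(n_samples=500, n_genes=1000, seed=42):
    """Build a synthetic table: one row per sample, one column per gene."""
    rng = np.random.default_rng(seed)
    conditions = np.array(["Healthy", "Cancer_TypeA", "Cancer_TypeB"])
    # Class probabilities are illustrative; realized counts vary with the seed.
    labels = rng.choice(conditions, size=n_samples, p=[0.4, 0.3, 0.3])

    # Log-normal values loosely mimic right-skewed expression levels.
    expression = rng.lognormal(mean=2.0, sigma=1.0, size=(n_samples, n_genes))

    # Shift an arbitrary block of genes per cancer type so classes differ.
    expression[labels == "Cancer_TypeA", :20] *= 2.0
    expression[labels == "Cancer_TypeB", 20:40] *= 2.0

    gene_names = [f"GENE_{i:04d}" for i in range(n_genes)]
    df = pd.DataFrame(expression, columns=gene_names)
    df["Condition"] = labels
    return df


sample_df = make_sample_dataset()
print(sample_df.shape)
```

The resulting frame has one `Condition` column plus one column per gene, which is the shape the rest of the pipeline expects.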
- Python 3.8+
- pip package manager
- Clone or download the project
      cd bioai-project
- Install dependencies
      pip install -r requirements.txt
- Prepare your data (Optional)
  - Place your gene expression CSV file in the data/ folder
  - Ensure it has a 'Condition' column for the target variable
  - Or let the app generate sample data automatically
- Run the analysis
      python main.py
bioai-project/
├── data/                          # Raw dataset storage
│   └── gene_expression_data.csv   # Gene expression data
├── notebooks/                     # Jupyter notebooks (optional)
├── output/                        # Generated results
│   ├── *.png                      # Visualization plots
│   ├── *.pkl                      # Trained models
│   ├── *.txt                      # Analysis reports
│   └── analysis_report.txt        # Main summary report
├── main.py                        # Main application entry point
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── bioai_analysis.log             # Application logs
- Automatic data loading and validation
- Missing value handling
- Duplicate removal
- Feature scaling (StandardScaler)
- Data quality reporting
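The preprocessing steps listed above might look roughly like the following. This is a sketch under stated assumptions: the function name `preprocess` and the median imputation strategy are hypothetical, and the app's own `preprocess_data` method may handle missing values differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="Condition"):
    """Drop duplicate rows, fill missing values, and standardize gene columns."""
    df = df.drop_duplicates().reset_index(drop=True)
    y = df[target]
    X = df.drop(columns=[target])

    # Median imputation is an assumption; other strategies are equally valid.
    X = X.fillna(X.median())

    # StandardScaler rescales each gene to zero mean and unit variance.
    X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    return X_scaled, y


demo = pd.DataFrame({
    "GENE_0000": [1.0, 2.0, np.nan, 4.0, 4.0],
    "GENE_0001": [10.0, 12.0, 14.0, 16.0, 16.0],
    "Condition": ["Healthy", "Healthy", "Cancer_TypeA", "Cancer_TypeA", "Cancer_TypeA"],
})
X_demo, y_demo = preprocess(demo)
print(X_demo.shape)
```

In the toy frame the last two rows are identical, so one is dropped before the missing value is filled and the columns are scaled.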
- Class Distribution: Bar plots of condition frequencies
- Gene Expression Distributions: Histograms of expression levels
- Correlation Analysis: Heatmaps of gene correlations
- PCA Analysis: Principal component analysis and visualization
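As an illustration of the PCA step, the sketch below counts how many principal components are needed to reach a variance threshold. The helper name `components_for_variance` and the random demo matrix are assumptions for illustration, not part of the app.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.95):
    """Return the number of components whose cumulative variance reaches threshold."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative ratio reaches the threshold, as a count.
    n_components = int(np.searchsorted(cumulative, threshold) + 1)
    n_components = min(n_components, len(cumulative))
    return n_components, pca.explained_variance_ratio_


rng = np.random.default_rng(0)
X_random = rng.normal(size=(100, 20))  # stand-in for scaled expression data
n_components, ratios = components_for_variance(X_random)
print(f"Components for 95% variance: {n_components}")
print(f"First PC explains: {ratios[0]:.2%}")
```

scikit-learn also accepts a float such as `PCA(n_components=0.95)` to select the count directly; the explicit cumulative sum is shown here because it is what a variance plot displays.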
- K-Means Clustering: Automatic cluster discovery
- Elbow Method: Optimal cluster number determination
- Cluster Visualization: 2D PCA scatter plots
- Comparison with True Labels: When available
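The elbow method mentioned above can be sketched as follows: fit K-Means for a range of cluster counts and look for the point where inertia stops dropping sharply. The helper name `elbow_inertias`, the range of `k`, and the three synthetic blobs are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values=range(1, 7), seed=42):
    """Fit K-Means for each k and return a dict mapping k to inertia."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        inertias[k] = model.inertia_
    return inertias


# Three well-separated 2D blobs stand in for PCA-projected samples.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [-8, 8]])
X_blobs = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])

inertias = elbow_inertias(X_blobs)
for k, value in inertias.items():
    print(k, round(value, 1))
```

With data like this, inertia falls steeply up to k=3 and only slightly afterwards, which is the "elbow" used to pick the cluster count.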
- Random Forest Classification: Condition prediction
- Model Evaluation: Accuracy, F1-score, confusion matrix
- Feature Importance: Most important genes identification
- Model Persistence: Save/load trained models
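A rough sketch of the training, evaluation, and gene-ranking steps is shown below. The function name `train_and_evaluate`, the 80/20 stratified split, and the toy data (where only the first gene carries signal) are assumptions; the app's `train_model` method may be organized differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, gene_names, seed=42):
    """Train a Random Forest, score it on a held-out split, and rank genes."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    # Sort genes from most to least important according to the forest.
    order = np.argsort(model.feature_importances_)[::-1]
    top_genes = [gene_names[i] for i in order[:5]]
    return model, metrics, top_genes


# Toy data: the label depends only on the first gene.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = np.where(X_toy[:, 0] > 0, "Cancer_TypeA", "Healthy")
names = [f"GENE_{i:04d}" for i in range(10)]

model, metrics, top_genes = train_and_evaluate(X_toy, y_toy, names)
print(metrics["accuracy"], top_genes[0])
```

Because the toy label is derived from a single column, that gene should rank first, which makes the feature-importance output easy to sanity-check.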
All plots are automatically saved to the output/ folder:
- condition_distribution.png - Target class frequencies
- gene_distributions.png - Expression level distributions
- correlation_heatmap.png - Gene correlation patterns
- pca_analysis.png - PCA variance and 2D projection
- clustering_analysis.png - Discovered clusters
- confusion_matrix.png - Classification performance
- feature_importance.png - Most important genes
    python main.py

    from main import BioDataAnalyzer

    # Initialize with custom data path
    analyzer = BioDataAnalyzer('data/my_gene_data.csv')

    # Run full analysis
    analyzer.run_full_analysis()

    analyzer = BioDataAnalyzer()
    analyzer.load_data()
    analyzer.preprocess_data()

    # Run specific analyses
    X_pca = analyzer.run_eda()
    clusters = analyzer.run_clustering(X_pca)
    accuracy = analyzer.train_model()

The application generates comprehensive results including:
🧬 Biological Data Analysis App using AI
==================================================
Loading data from: data/gene_expression_data.csv
Data loaded successfully. Shape: (500, 1001)
==================================================
DATASET OVERVIEW
==================================================
Shape: (500, 1001)
Columns: ['GENE_0000', 'GENE_0001', 'GENE_0002', 'GENE_0003', 'GENE_0004']...
Memory usage: 3.81 MB
==============================
CLASS DISTRIBUTION
==============================
Healthy: 200 (40.0%)
Cancer_TypeA: 150 (30.0%)
Cancer_TypeB: 150 (30.0%)
BIOLOGICAL DATA ANALYSIS REPORT
==================================================
Analysis Date: 2024-01-20 14:30:45
Dataset: data/gene_expression_data.csv
DATASET SUMMARY:
- Total Samples: 500
- Total Genes: 1000
- Conditions: ['Healthy', 'Cancer_TypeA', 'Cancer_TypeB']
PCA ANALYSIS:
- Components for 95% variance: 87
- First PC explains: 12.45%
- Second PC explains: 8.32%
CLUSTERING ANALYSIS:
- Algorithm: K-Means
- Number of clusters: 3
- Inertia: 2847.52
CLASSIFICATION MODEL:
- Algorithm: Random Forest
- Number of trees: 100
- Model saved for future predictions
For very large datasets, uncomment the Dask imports and modify data loading:
    import dask.dataframe as dd

    # For large datasets
    def load_large_data(self):
        self.raw_data = dd.read_csv(self.data_path)
        return self.raw_data.compute()  # Convert to pandas when needed

Extend the application with argument parsing:
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--data', default='data/gene_expression_data.csv')
    parser.add_argument('--model', choices=['rf', 'svm'], default='rf')
    args = parser.parse_args()

    # In the train_model method
    from sklearn.svm import SVC

    if model_type == 'svm':
        self.classifier = SVC(kernel='rbf', random_state=42)

    # Add custom plots in run_eda method
    import matplotlib.pyplot as plt

    def plot_custom_analysis(self):
        # Your custom visualization code
        plt.savefig(f'{self.output_dir}/custom_plot.png')

- Memory Issues with Large Datasets
  - Use Dask for out-of-core processing
  - Reduce the number of features with feature selection
  - Process data in chunks
- Missing Dependencies
      pip install --upgrade pip
      pip install -r requirements.txt
- Data Format Issues
  - Ensure the CSV has a header row
  - Check that the target column is named 'Condition'
  - Verify that gene expression columns have numeric data types
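The data format checks above can be automated before running the pipeline. The following is a minimal sketch; `validate_expression_table` is a hypothetical helper, not a function provided by the app.

```python
import pandas as pd


def validate_expression_table(df, target="Condition"):
    """Return a list of human-readable problems; an empty list means the table looks usable."""
    problems = []
    if target not in df.columns:
        problems.append(f"missing target column '{target}'")

    # Every column other than the target is expected to hold numeric expression values.
    gene_cols = [c for c in df.columns if c != target]
    non_numeric = [c for c in gene_cols if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric gene columns: {non_numeric[:5]}")
    return problems


good = pd.DataFrame({"GENE_0000": [1.0, 2.0], "Condition": ["Healthy", "Cancer_TypeA"]})
bad = pd.DataFrame({"GENE_0000": ["high", "low"]})

print(validate_expression_table(good))
print(validate_expression_table(bad))
```

A CSV loaded with `pd.read_csv` can be passed to this helper directly; a non-empty result points at the item in the list above that needs fixing.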
This application implements standard bioinformatics workflows:
- Gene Expression Analysis: Study of RNA expression levels
- Principal Component Analysis: Dimensionality reduction for genomics
- Clustering: Discovery of molecular subtypes
- Classification: Biomarker identification and disease prediction
Contributions are welcome! Areas for improvement:
- Additional ML algorithms
- Interactive visualizations with Plotly
- Real-time analysis dashboard
- Integration with biological databases
- Statistical significance testing
This project is open source and available under the MIT License.
- Scikit-learn for machine learning algorithms
- Matplotlib/Seaborn for visualizations
- Pandas for data manipulation
- NumPy for numerical computing
Ready to analyze your biological data? Run python main.py and discover insights in your gene expression data! 🧬✨