A machine learning pipeline that analyzes real gene expression data to predict disease types with high accuracy.

hasini-venisetti/bioai-gene-analysis

🧬 Biological Data Analysis App using AI

A comprehensive Python application for analyzing large-scale gene expression data and extracting meaningful biological insights using Machine Learning.

📋 Overview

This application provides a complete pipeline for biological data analysis including:

  • Data Preprocessing: Cleaning, normalization, and preparation
  • Exploratory Data Analysis: Visualization and statistical analysis
  • Unsupervised Learning: Clustering analysis to discover patterns
  • Supervised Learning: Classification models for condition prediction
  • Comprehensive Reporting: Automated insights and visualizations

🧬 Dataset

The app works with gene expression datasets where:

  • Rows = Samples (e.g., patients, cell lines)
  • Columns = Gene expression values (e.g., GENE_0000, GENE_0001, ...)
  • Target = Condition column (e.g., 'Healthy', 'Cancer_TypeA', 'Cancer_TypeB')

If no dataset is provided, the app automatically generates a realistic sample dataset with 500 samples and 1000 genes.
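
As a rough illustration of the expected layout, the sketch below builds a synthetic table with the same shape and column naming. The function name `make_sample_dataset`, the log-normal noise model, and the shifted gene blocks are assumptions made for this example, not the app's actual generator.

```python
import numpy as np
import pandas as pd


def make_sample_dataset(n_samples=500, n_genes=1000, seed=42):
    """Build a synthetic table: one row per sample, one column per gene."""
    rng = np.random.default_rng(seed)
    conditions = np.array(["Healthy", "Cancer_TypeA", "Cancer_TypeB"])
    labels = rng.choice(conditions, size=n_samples, p=[0.4, 0.3, 0.3])

    # Assumed noise model: log-normal values, so every expression is positive.
    expression = rng.lognormal(mean=2.0, sigma=0.5, size=(n_samples, n_genes))

    # Hypothetical signal: scale a small block of genes for each cancer type.
    expression[labels == "Cancer_TypeA", :20] *= 2.0
    expression[labels == "Cancer_TypeB", 20:40] *= 2.0

    gene_names = [f"GENE_{i:04d}" for i in range(n_genes)]
    df = pd.DataFrame(expression, columns=gene_names)
    df["Condition"] = labels  # Target column expected by the app
    return df
```

With the defaults this returns 500 rows and 1001 columns (1000 genes plus `Condition`).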

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • pip package manager

Installation

  1. Clone or download the project

    cd bioai-project
  2. Install dependencies

    pip install -r requirements.txt
  3. Prepare your data (Optional)

    • Place your gene expression CSV file in the data/ folder
    • Ensure it has a 'Condition' column for the target variable
    • Or let the app generate sample data automatically
  4. Run the analysis

    python main.py

📁 Project Structure

bioai-project/
├── data/                          # Raw dataset storage
│   └── gene_expression_data.csv   # Gene expression data
├── notebooks/                     # Jupyter notebooks (optional)
├── output/                        # Generated results
│   ├── *.png                      # Visualization plots
│   ├── *.pkl                      # Trained models
│   ├── *.txt                      # Analysis reports
│   └── analysis_report.txt        # Main summary report
├── main.py                        # Main application entry point
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── bioai_analysis.log             # Application logs

🔧 Features

🧹 Data Preprocessing

  • Automatic data loading and validation
  • Missing value handling
  • Duplicate removal
  • Feature scaling (StandardScaler)
  • Data quality reporting
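
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the code in `main.py`: the function name `preprocess` and the choice of median imputation for missing values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="Condition"):
    """Return (scaled feature DataFrame, target Series) from a raw table."""
    # Drop exact duplicate rows before splitting features and target.
    df = df.drop_duplicates().reset_index(drop=True)
    y = df[target]

    # Coerce gene columns to numbers; unparseable entries become NaN.
    X = df.drop(columns=[target]).apply(pd.to_numeric, errors="coerce")

    # Assumed strategy: fill missing values with each gene's median.
    X = X.fillna(X.median())

    # Standardize every gene to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(X)
    return pd.DataFrame(scaled, columns=X.columns), y
```

Scaling matters here because PCA and K-Means are both sensitive to the relative variance of each gene.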

📊 Exploratory Data Analysis

  • Class Distribution: Bar plots of condition frequencies
  • Gene Expression Distributions: Histograms of expression levels
  • Correlation Analysis: Heatmaps of gene correlations
  • PCA Analysis: Principal component analysis and visualization
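
For the PCA step, one way to obtain the number of components needed for a given share of variance, plus a 2D projection for plotting, is sketched below. The helper name `pca_summary` and its return values are illustrative assumptions; the input is expected to be the already scaled feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_summary(X_scaled, variance_target=0.95):
    """Return (components needed, first two variance ratios, 2D projection)."""
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)

    # First position where cumulative variance reaches the target.
    n_needed = int(np.searchsorted(cumulative, variance_target)) + 1
    n_needed = min(n_needed, len(cumulative))  # Guard against rounding error

    # Keep the first two principal components for a scatter plot.
    X_2d = pca.transform(X_scaled)[:, :2]
    return n_needed, pca.explained_variance_ratio_[:2], X_2d
```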

🤖 Machine Learning

Unsupervised Learning

  • K-Means Clustering: Automatic cluster discovery
  • Elbow Method: Optimal cluster number determination
  • Cluster Visualization: 2D PCA scatter plots
  • Comparison with True Labels: When available
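
The elbow method can be sketched as follows. Both helper names are hypothetical, and picking the elbow by the largest second difference of the inertia curve is one common heuristic assumed here; the app may choose the cluster count differently.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(X, k_values=range(2, 9), seed=42):
    """Fit K-Means for each k and return a {k: inertia} mapping."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }


def pick_elbow(inertias):
    """Return the k where the inertia curve bends most sharply.

    Needs at least three consecutive k values to form a second difference.
    """
    ks = sorted(inertias)
    values = np.array([inertias[k] for k in ks])
    # second_diff[i] measures the bend at ks[i + 1].
    second_diff = np.diff(values, n=2)
    return ks[int(np.argmax(second_diff)) + 1]
```

The heuristic is only a starting point: on noisy expression data the curve often has no clear bend, so the suggested k is worth checking against the PCA scatter plot.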

Supervised Learning

  • Random Forest Classification: Condition prediction
  • Model Evaluation: Accuracy, F1-score, confusion matrix
  • Feature Importance: Identification of the most important genes
  • Model Persistence: Save/load trained models
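
A minimal sketch of the supervised step is shown below. The function name `train_and_evaluate`, the 80/20 stratified split, the weighted F1 average, and the use of `joblib` for persistence are assumptions for illustration; only the Random Forest with 100 trees comes from the report further down.

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, model_path=None, seed=42):
    """Train a Random Forest on a feature DataFrame and report test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    importances = pd.Series(model.feature_importances_, index=X.columns)
    results = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=model.classes_),
        "top_genes": importances.sort_values(ascending=False).head(10),
    }
    if model_path is not None:
        joblib.dump(model, model_path)  # Reload later with joblib.load(model_path)
    return model, results
```

Stratifying the split keeps the class proportions similar in the training and test sets, which helps when one condition is less frequent than the others.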

📈 Visualizations

All plots are automatically saved to the output/ folder:

  1. condition_distribution.png - Target class frequencies
  2. gene_distributions.png - Expression level distributions
  3. correlation_heatmap.png - Gene correlation patterns
  4. pca_analysis.png - PCA variance and 2D projection
  5. clustering_analysis.png - Discovered clusters
  6. confusion_matrix.png - Classification performance
  7. feature_importance.png - Most important genes

🎯 Usage Examples

Basic Usage

python main.py

With Custom Dataset

from main import BioDataAnalyzer

# Initialize with custom data path
analyzer = BioDataAnalyzer('data/my_gene_data.csv')

# Run full analysis
analyzer.run_full_analysis()

Individual Analysis Steps

analyzer = BioDataAnalyzer()
analyzer.load_data()
analyzer.preprocess_data()

# Run specific analyses
X_pca = analyzer.run_eda()
clusters = analyzer.run_clustering(X_pca)
accuracy = analyzer.train_model()

📊 Sample Output

The application generates comprehensive results including:

Console Output

🧬 Biological Data Analysis App using AI
==================================================
Loading data from: data/gene_expression_data.csv
Data loaded successfully. Shape: (500, 1001)

==================================================
DATASET OVERVIEW
==================================================
Shape: (500, 1001)
Columns: ['GENE_0000', 'GENE_0001', 'GENE_0002', 'GENE_0003', 'GENE_0004']... 
Memory usage: 3.81 MB

==============================
CLASS DISTRIBUTION
==============================
Healthy: 200 (40.0%)
Cancer_TypeA: 150 (30.0%)
Cancer_TypeB: 150 (30.0%)

Analysis Report

BIOLOGICAL DATA ANALYSIS REPORT
==================================================
Analysis Date: 2024-01-20 14:30:45
Dataset: data/gene_expression_data.csv

DATASET SUMMARY:
- Total Samples: 500
- Total Genes: 1000
- Conditions: ['Healthy', 'Cancer_TypeA', 'Cancer_TypeB']

PCA ANALYSIS:
- Components for 95% variance: 87
- First PC explains: 12.45%
- Second PC explains: 8.32%

CLUSTERING ANALYSIS:
- Algorithm: K-Means
- Number of clusters: 3
- Inertia: 2847.52

CLASSIFICATION MODEL:
- Algorithm: Random Forest
- Number of trees: 100
- Model saved for future predictions

🔧 Advanced Features

Large Dataset Support (Optional)

For very large datasets, uncomment the Dask imports and modify data loading:

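import dask.dataframe as dd

# For large datasets: a replacement loader method for BioDataAnalyzer
def load_large_data(self):
    # dd.read_csv is lazy: it sets up partitions without loading the whole file
    self.raw_data = dd.read_csv(self.data_path)
    # compute() materializes a pandas DataFrame, so the result must fit in memory
    return self.raw_data.compute()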

Command Line Interface

Extend the application with argument parsing:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data', default='data/gene_expression_data.csv')
parser.add_argument('--model', choices=['rf', 'svm'], default='rf')
args = parser.parse_args()

🛠️ Customization

Adding New Algorithms

# In the train_model method
from sklearn.svm import SVC

if model_type == 'svm':
    self.classifier = SVC(kernel='rbf', random_state=42)
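
To connect this with the `--model` choices from the command-line example, a small factory can map each key to an estimator. The helper `build_classifier` is hypothetical and not part of the app; it is one possible way to keep `train_model` free of a growing `if`/`elif` chain.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def build_classifier(model_type="rf", seed=42):
    """Return an unfitted estimator for the given model key."""
    factories = {
        "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=seed),
        "svm": lambda: SVC(kernel="rbf", random_state=seed),
    }
    if model_type not in factories:
        raise ValueError(
            f"Unknown model_type {model_type!r}; expected one of {sorted(factories)}"
        )
    return factories[model_type]()
```

Inside `train_model`, the assignment would then read `self.classifier = build_classifier(model_type)`, and a new algorithm only needs one more dictionary entry.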

Custom Visualizations

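# Add custom plots in run_eda method
import matplotlib.pyplot as plt

def plot_custom_analysis(self):
    fig, ax = plt.subplots()
    # Your custom visualization code, drawing on ax
    fig.savefig(f'{self.output_dir}/custom_plot.png')
    plt.close(fig)  # Release the figure once it is saved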

🔍 Troubleshooting

Common Issues

  1. Memory Issues with Large Datasets

    • Use Dask for out-of-core processing
    • Reduce number of features with feature selection
    • Process data in chunks
  2. Missing Dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
  3. Data Format Issues

    • Ensure CSV has proper headers
    • Check for correct 'Condition' column name
    • Verify numeric data types for gene expressions
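
The data format checks above can be run before starting the analysis. The sketch below is an assumption about how such a check might look; `validate_expression_table` is a hypothetical helper, not a function provided by the app.

```python
import pandas as pd


def validate_expression_table(df, target="Condition"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if target not in df.columns:
        problems.append(f"missing '{target}' column")

    # Every column other than the target should hold numeric expression values.
    gene_cols = [c for c in df.columns if c != target]
    non_numeric = [
        c for c in gene_cols if not pd.api.types.is_numeric_dtype(df[c])
    ]
    if non_numeric:
        problems.append(f"non-numeric gene columns: {non_numeric[:5]}")
    return problems
```

For example, `validate_expression_table(pd.read_csv('data/my_gene_data.csv'))` returns an empty list when the file matches the expected layout.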

📚 Scientific Background

This application implements standard bioinformatics workflows:

  • Gene Expression Analysis: Study of RNA expression levels
  • Principal Component Analysis: Dimensionality reduction for genomics
  • Clustering: Discovery of molecular subtypes
  • Classification: Biomarker identification and disease prediction

🀝 Contributing

Contributions are welcome! Areas for improvement:

  • Additional ML algorithms
  • Interactive visualizations with Plotly
  • Real-time analysis dashboard
  • Integration with biological databases
  • Statistical significance testing

📄 License

This project is open source and available under the MIT License.

🙏 Acknowledgments

  • Scikit-learn for machine learning algorithms
  • Matplotlib/Seaborn for visualizations
  • Pandas for data manipulation
  • NumPy for numerical computing

Ready to analyze your biological data? Run python main.py and discover insights in your gene expression data! 🧬✨
