eroj333/DeepDataMiningLearning

 
 

DeepDataMiningLearning

Data mining, machine learning, and deep learning sample code for SJSU CMPE255 Data Mining (Fall 2023 SJSU Official Syllabus) and CMPE258 Deep Learning (Fall 2023 SJSU Official Syllabus).

  • Some Google Colab examples require an SJSU Google account to view
  • The Large Language Models (LLMs) part is newly added
  • You can also view the documents at: readthedocs

Setups

Install this python package (optional) via

% python3 -m pip install flit
% flit install --symlink

See "docs/python.rst" for a detailed description of the Python package.

Open the Jupyter notebook on a local machine:

jupyter lab --ip 0.0.0.0 --no-browser --allow-root

Sphinx docs

After activating the Python virtual environment, use the 'sphinx-build' command to build the documentation:

   % pip install -r requirements.txt
   % sphinx-build docs ./docs/build
   # check the integrity of all internal and external links:
   % sphinx-build docs -W -b linkcheck -d docs/build/doctrees docs/build/html

The generated HTML files are in the "docs/build" folder. You can also view the documents at: readthedocs

Python Data Analytics

Basic python tutorials, numpy, Pandas, data visualization and EDA

Python data apps based on streamlit: streamlittest

Cloud Data Analytics

  • Data Mining based on Google Cloud:
    • Google Cloud access via Colab: colablink
      • Configure Gcloud, Google Cloud Storage, Compute Engine, Colab Terminal
    • Introduction to Google BigQuery with Colab/Jupyter: BigQuery-intro.ipynb -- colablink
      • Natality dataset and Weather data from Google BigQuery
    • COVID19 Data EDA and Visualization based on Google BigQuery (Fall 2022 updated): colablink
      • COVID NYT data, COVID-19 JHU data
    • Additional Google BigQuery examples: colablink
      • Chicago Crime Dataset, Austin Waste Dataset, COVID Racial Dataset (race graph)
    • BigQuery ML examples: colablink
      • COVID, CREDIT_CARD_FRAUD, Predict penguin weight, Natality, US Census Dataset Classification, time-series forecasting from Google Analytics data

Machine Learning Algorithm

Deep Learning

Deep learning notebooks (viewing them through the Colab link is recommended)

New deep learning sample code based on Pytorch (in the "DeepDataMiningLearning" folder)

  • Pytorch Single GPU image classification with/without automatic mixed precision (AMP) training: singleGPU
  • Pytorch Multi-GPU DDP test: testTorchDDP
  • Pytorch Multi-GPU image classification: multiGPU
  • Pytorch Torchvision image classification (Efficientnet) notebook on HPC: torchvisionHPC.ipynb
  • Pytorch Torchvision vision transformer (ViT) notebook on HPC: torchvisionvitHPC.ipynb
  • Pytorch ViT implemented from scratch on HPC: ViTHPC.ipynb
  • Pytorch ImageNet classification example: imagenet
  • Pytorch inference example for top-k class: inference.py
  • TIMM models: testtimm.ipynb
  • Huggingface Images via Transformers: huggingfaceimage.ipynb
  • Siamese network: siamese_network
  • TensorRT example: tensorrt.ipynb
  • Advanced Image Classification: githubrepo
    • General purpose framework for all-in-one image classification for Tensorflow and Pytorch
    • Support for multiple datasets: imagenet_blurred, tiny-imagenet-200, hymenoptera_data, CIFAR10, MNIST, flower_photos
    • Support for multiple custom models ('mlpmodel1', 'lenet', 'alexnet', 'resnetmodel1', 'customresnet', 'vggmodel1', 'vggcustom', 'cnnmodel1'), all models from Torchvision and TorchHub
    • Support HPC training and evaluation
  • Object detection (other repo)
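
The top-k inference entry in the list above is Pytorch-based. As a framework-free illustration of the general idea (not the code in inference.py), the sketch below turns raw model scores into probabilities with a softmax and returns the k most likely classes. The labels, logits, and function names are hypothetical and chosen only for this example.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a four-class model (illustration only)
labels = ["cat", "dog", "bird", "fish"]
logits = [2.0, 0.5, 3.1, -1.0]
result = top_k(logits, labels, k=2)
for label, prob in result:
    print(f"{label}: {prob:.3f}")
```

In a Pytorch workflow the same ranking is usually obtained with torch.topk applied to the softmax output.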

Unsupervised Learning

  • Unsupervised Learning Jupyter notebooks
    • PCA: colablink
      • Numpy/SKlearn SVD, PCA for digits and noise filtering, eigenfaces, PCA vs LDA vs NCA
    • Manifold Learning: colablink
      • Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Isomap Embedding, T-distributed Stochastic Neighbor Embedding for HELLO, S-Curve, and Swiss roll dataset; Isomap on Faces; Regression with Manifold Learning
    • Clustering: colablink
      • K-Means, Gaussian Mixture Models, Spectral Clustering, DBSCAN
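
As a rough companion to the clustering notebook listed above, here is a minimal K-Means sketch (Lloyd's algorithm) in plain Python. It is not taken from the notebook, which likely relies on scikit-learn; the sample points and starting centroids are made up, and the starting centroids are fixed so the run is deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments

# Hypothetical 2-D data with two visually separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
```

Real use would pick the starting centroids randomly or with k-means++ rather than fixing them by hand.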

NLP and Text Mining

  • Text Mining Jupyter notebooks
    • Text Representations: colablink
      • One-Hot encoding, Bag-of-Words, TF-IDF, and Word2Vec (based on gensim); Word2Vec Wiki and Shakespeare examples; Gather data from Google and WordCloud
    • Textract and NLTK: colablink
      • Text Extraction via textract; NLTK text preprocessing
    • Text Mining via Tensorflow-text: colablink
      • Using Keras embedding layer; sentiment classification example; prepare positive and negative samples and create a Skip-gram Word2Vec model
    • Text Classification via Tensorflow: colablink
      • RNN, LSTM, Transformer, BERT
    • Twitter NLP all-in-one example: colablink
      • NLTK, LSTM, Bi-LSTM, GRU, BERT
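
To make the TF-IDF representation mentioned in the text-representation notebook concrete, the sketch below computes it by hand with the standard library. It uses the plain tf × log(N/df) form and whitespace tokenization, both assumptions made for brevity. Libraries such as scikit-learn use a smoothed, normalized variant, so their numbers will differ. The three sample sentences are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical mini-corpus (illustration only)
docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf(docs)
print(vectors[0])
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("sat") is weighted highest.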

Recommendation

  • Recommendation
    • Recommendation via Python Surprise and Neural Collaborative Filtering (Tensorflow): colablink
    • Tensorflow Recommender: colab
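
For readers new to collaborative filtering, the following is a minimal user-based sketch in plain Python: it predicts a missing rating as a similarity-weighted average of other users' ratings. It is not the Surprise or neural collaborative filtering code from the notebooks. The users, items, and ratings are hypothetical, and cosine similarity over co-rated items is one of several possible similarity choices.

```python
import math

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None  # None when no neighbor rated the item

# Hypothetical ratings on a 1-5 scale (illustration only)
ratings = {
    "alice": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
pred = predict(ratings, "alice", "c")
print(round(pred, 2))
```

Because alice's ratings match bob's more closely than carol's, the prediction lands nearer to bob's rating of item "c".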

Large Language Models (LLMs) and Apps

NLP models based on Huggingface Transformer libraries

Pytorch Transformer

Open Source LLMs

LLMs Apps based on OpenAI API

LLMs Apps based on LangChain
