Text Classification Model for Medical Abstracts
Built and deployed a CNN model in TensorFlow to classify sentences in medical abstracts, achieving 83.5% accuracy in distinguishing between abstract sections (e.g., Background, Methods, Results). Developed preprocessing pipelines for text vectorization and embedding, improving model performance on large-scale medical datasets. Utilized pretrained embeddings (e.g., the Universal Sentence Encoder) to improve feature extraction and classification accuracy.
- GPU Check: Confirm you have access to a GPU with `!nvidia-smi -L`.
- Data Download: Clone the dataset from GitHub with `!git clone https://github.com/Franck-Dernoncourt/pubmed-rct`.
- List Files: Verify the dataset files with `!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/`.
- Preprocessing Function: The `preprocess_text_with_line_numbers` function formats your data into dictionaries, which helps create a structured dataset; a sketch of one possible implementation is shown below.
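A minimal sketch of such a function, assuming the PubMed RCT file layout (each abstract starts with a `###<id>` line, each sentence line is `LABEL\tsentence`, and abstracts end with a blank line); the dictionary keys here are illustrative:

```python
def preprocess_text_with_line_numbers(filename):
    """Reads an abstracts file and returns a list of dicts, one per sentence."""
    with open(filename, "r") as f:
        input_lines = f.readlines()

    abstract_samples = []  # one dict per sentence
    abstract_lines = ""    # sentences of the abstract currently being read

    for line in input_lines:
        if line.startswith("###"):   # a new abstract begins
            abstract_lines = ""
        elif line.isspace():         # blank line: the current abstract has ended
            split_lines = abstract_lines.splitlines()
            for line_number, abstract_line in enumerate(split_lines):
                target, text = abstract_line.split("\t", 1)  # "METHODS\tsentence..."
                abstract_samples.append({
                    "target": target,                     # section label
                    "text": text.lower(),
                    "line_number": line_number,           # position within the abstract
                    "total_lines": len(split_lines) - 1,  # last line index in the abstract
                })
        else:
            abstract_lines += line

    return abstract_samples
```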
- Visualization: You visualized the data by converting it into DataFrames and plotting distributions, which is excellent for understanding the data's characteristics (see the snippet below).
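For instance, a quick look at the class balance, assuming `train_samples` is the list of dictionaries produced by the preprocessing step:

```python
import pandas as pd

train_df = pd.DataFrame(train_samples)    # train_samples from the preprocessing step
print(train_df["target"].value_counts())  # how many sentences per section label
train_df["total_lines"].plot.hist()       # distribution of abstract lengths
```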
- Text Vectorization: Convert text to a numerical format using `TextVectorization` and `Embedding` layers, imported with `from tensorflow.keras.layers import TextVectorization, Embedding`.
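A sketch of how these layers might be set up; `train_sentences` and the vocabulary/sequence-length values are assumptions to adapt to your own corpus:

```python
from tensorflow.keras.layers import TextVectorization, Embedding

# Illustrative settings; pick max_tokens and output_sequence_length from your corpus stats.
text_vectorizer = TextVectorization(max_tokens=68000, output_sequence_length=55)
text_vectorizer.adapt(train_sentences)  # learn the vocabulary from the training sentences

token_embed = Embedding(input_dim=len(text_vectorizer.get_vocabulary()),
                        output_dim=128,   # embedding dimension (tunable)
                        mask_zero=True)   # treat padded zeros as masked
```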
- Universal Sentence Encoder: Use pre-trained embeddings from TensorFlow Hub for feature extraction, starting with `import tensorflow_hub as hub`.
- Baseline Model: You started with a TF-IDF + Naive Bayes model as a baseline, using `TfidfVectorizer` from `sklearn.feature_extraction.text` and `MultinomialNB` from `sklearn.naive_bayes`; a sketch follows.
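A minimal version of such a baseline, assuming `train_sentences`/`train_labels` and `val_sentences`/`val_labels` come from the earlier preprocessing steps:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF features fed into Multinomial Naive Bayes.
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

model_0.fit(train_sentences, train_labels)
print(model_0.score(val_sentences, val_labels))  # baseline validation accuracy
```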
- Conv1D Model: Implemented a Conv1D model for sequence processing with token embeddings, built on `from tensorflow.keras import layers` (see the sketch below).
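One way such a model could be wired up, reusing `text_vectorizer` and `token_embed` from the vectorization step; the filter count, kernel size, and five-class output are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 5  # e.g., BACKGROUND, OBJECTIVE, METHODS, RESULTS, CONCLUSIONS

inputs = layers.Input(shape=(1,), dtype=tf.string)  # raw sentence strings in
x = text_vectorizer(inputs)                         # strings -> integer tokens
x = token_embed(x)                                  # tokens -> dense vectors
x = layers.Conv1D(filters=64, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)              # pool over the sequence
outputs = layers.Dense(num_classes, activation="softmax")(x)

model_1 = tf.keras.Model(inputs, outputs)
model_1.compile(loss="categorical_crossentropy",    # assumes one-hot labels
                optimizer="adam", metrics=["accuracy"])
```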
- Feature Extraction with Pretrained Embeddings: Used TensorFlow Hub's Universal Sentence Encoder:

```python
tf_hub_embedding_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        trainable=False)
```
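This layer maps raw sentences straight to 512-dimensional embeddings, so the classifier on top can stay small; a sketch (the dense width and five-class output are assumptions):

```python
import tensorflow as tf
import tensorflow_hub as hub

tf_hub_embedding_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=False)  # keep the pretrained weights frozen for feature extraction

inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)  # one sentence per example
x = tf_hub_embedding_layer(inputs)                         # sentence -> 512-dim vector
x = tf.keras.layers.Dense(128, activation="relu")(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)  # assumed 5 section labels
model_2 = tf.keras.Model(inputs, outputs)
```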
- Character-Level Embeddings: Created character-level tokenization and vectorization for a fine-grained text representation:

```python
import string

char_vectorizer = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                                    output_sequence_length=output_seq_char_len)
```
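The constants above aren't defined in the snippet; one plausible setup (the token count, sequence length, and `split_chars` helper are assumptions):

```python
import string
from tensorflow.keras.layers import TextVectorization, Embedding

alphabet = string.ascii_lowercase + string.digits + string.punctuation
NUM_CHAR_TOKENS = len(alphabet) + 2   # characters + space + OOV token
output_seq_char_len = 290             # e.g., a high percentile of char-sequence lengths

def split_chars(text):
    """'hello' -> 'h e l l o', so each character becomes a token."""
    return " ".join(list(text))

char_vectorizer = TextVectorization(max_tokens=NUM_CHAR_TOKENS,
                                    output_sequence_length=output_seq_char_len)
char_vectorizer.adapt([split_chars(s) for s in train_sentences])

char_embed = Embedding(input_dim=NUM_CHAR_TOKENS, output_dim=25)  # small char embeddings
```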
- Evaluate Models: Continuously compare your models' performance using metrics and validation results. It's crucial to track metrics like accuracy, precision, recall, and F1-score; a small helper like the one below keeps the comparisons consistent.
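A hedged helper for computing those metrics with scikit-learn (`calculate_results` is an illustrative name, not something defined earlier):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """Returns accuracy, precision, recall and F1 (weighted across classes)."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted")
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```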
- Experiment with Hyperparameters: Tune hyperparameters for better model performance. For example, you could adjust the number of filters in Conv1D layers, sequence length, or embedding dimensions; a simple sweep is sketched below.
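A minimal grid sweep, assuming a hypothetical `build_conv1d_model` helper that rebuilds the Conv1D model above with a given filter count, and `train_dataset`/`val_dataset` from earlier:

```python
results = {}
for num_filters in [32, 64, 128]:                        # illustrative grid
    model = build_conv1d_model(num_filters=num_filters)  # hypothetical builder
    history = model.fit(train_dataset, validation_data=val_dataset, epochs=3)
    results[num_filters] = max(history.history["val_accuracy"])

best = max(results, key=results.get)
print(f"Best filter count: {best} (val_accuracy={results[best]:.3f})")
```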
- Explore Advanced Models: Consider exploring more advanced models like BERT or GPT if the simpler models don't meet your needs. These models can capture more complex relationships in text data.
- Ensure Data Balance: Check for class imbalances and use techniques such as oversampling, undersampling, or class weighting if necessary to improve model performance (a class-weight sketch follows).
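One lightweight option is class weighting; a sketch, assuming `train_labels_encoded` holds integer class ids (e.g., from a label encoder) and `model_1`/datasets from the earlier steps:

```python
import numpy as np

classes, counts = np.unique(train_labels_encoded, return_counts=True)

# Weight each class inversely to its frequency so rare sections count more in the loss.
class_weight = {int(c): len(train_labels_encoded) / (len(classes) * n)
                for c, n in zip(classes, counts)}

model_1.fit(train_dataset, validation_data=val_dataset,
            epochs=3, class_weight=class_weight)
```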
- Consider Deployment: Once you have a robust model, think about how to deploy it for real-world use. This might involve creating a web service or integrating it into an application, as in the sketch below.
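A minimal web-service sketch with Flask, assuming the model takes raw sentence strings (as the Universal Sentence Encoder model above does) and was saved to an assumed path:

```python
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("abstract_section_model")  # assumed SavedModel path

@app.route("/predict", methods=["POST"])
def predict():
    sentences = request.get_json()["sentences"]    # list of abstract sentences
    probs = model.predict(tf.constant(sentences))  # model takes raw strings
    preds = probs.argmax(axis=1).tolist()          # predicted class ids
    return jsonify({"predicted_sections": preds})

if __name__ == "__main__":
    app.run(port=8080)
```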
Let me know if you need any more help with specific parts of the process or further clarification!
