Changes from all commits (49 commits)
bc903de
Setup folder configuration. Add README
piradiusquared Oct 1, 2025
a495b75
Renamed topic folder. Added skeleton files
piradiusquared Oct 4, 2025
6d91d17
Added basic imports to load dataset from HuggingFace.
piradiusquared Oct 8, 2025
bb8a7cd
Some basic imports
piradiusquared Oct 8, 2025
3e09217
Added dataset loading. Columns are as expected (in modules for now)
piradiusquared Oct 8, 2025
fa8bed2
Updated requirements. Found additional library for t5 tokenizer
piradiusquared Oct 14, 2025
85d233c
Test preprocessor function. Added additional requirements for t5 models
piradiusquared Oct 14, 2025
5aea675
Preprocessing works without error for now.
piradiusquared Oct 14, 2025
b00fbb2
Import basic LoRA adapter from HuggingFace. Runs in rangpur
piradiusquared Oct 14, 2025
4fc6f5d
Basic README skeleton. Created folder for model diagrams.
piradiusquared Oct 14, 2025
df063ca
Reformatting of current code for easier refactoring. Loaded model for…
piradiusquared Oct 14, 2025
b0649dd
Update to README. Some details filled out
piradiusquared Oct 14, 2025
4dc7fe8
Created helper file constants.py. Contains all values which are used …
piradiusquared Oct 23, 2025
f1ca52e
Dataset function. Uses Pandas to load in dataset directly from Huggin…
piradiusquared Oct 23, 2025
7787f51
Removed HuggingFace API related code. Re-visiting modules (dataset fo…
piradiusquared Oct 23, 2025
a966c5f
Load LoRA into modules.py. Test with some optimal LoRA configuration …
piradiusquared Oct 26, 2025
10ffdd8
Minor updates to dataset configurations. For testing purposes, 50 row…
piradiusquared Oct 27, 2025
c6377a6
Added held-out dataset splitting. Dataset is split into 70/30 ratios …
piradiusquared Oct 27, 2025
5bcbbad
Update to constant values. Testing with t5-small (will revert back to…
piradiusquared Oct 27, 2025
c8d6e75
Skeleton for train.py. Should be the required variables for the train…
piradiusquared Oct 27, 2025
3949559
Added scaler and rouge evaluator into train.py. Adapted basic pytorch…
piradiusquared Oct 27, 2025
0524281
Adapted evaluation loop from Pytorch and online videos. Currently the…
piradiusquared Oct 27, 2025
e5f754d
Added sampling feature to dataset loading. Can now get specified amou…
piradiusquared Oct 27, 2025
b64fba8
Added gitignore for pycache files
piradiusquared Oct 27, 2025
20769c6
Prediction of unseen data from validation set. Compares scores from o…
piradiusquared Oct 30, 2025
36682f8
Training loop now prints out live stats after each epoch when training.
piradiusquared Oct 30, 2025
c6ea437
Added data augmentation for prefixes when preprocessing.
piradiusquared Oct 30, 2025
a49c3ae
Update constants to best performing values (LoRA parameters, training…
piradiusquared Oct 30, 2025
0e6c2b8
Update requirements to what is used in rangpur
piradiusquared Oct 30, 2025
ad17bbe
Minor refactoring to code
piradiusquared Oct 30, 2025
11c9177
Added additional Perplexity scoring in model benchmarking. This code …
piradiusquared Oct 30, 2025
b96beb3
Loss image plots
piradiusquared Oct 30, 2025
06e50ed
Replaced loss3 with updated graph (with data augmentation)
piradiusquared Oct 30, 2025
3b9595f
Transformer architecture.
piradiusquared Oct 30, 2025
abce49c
Update constants to use t5-base instead of small.
piradiusquared Oct 30, 2025
e0126dc
Checkpoint for README.
piradiusquared Oct 30, 2025
81445f6
Added rangpur cluster run files.
piradiusquared Oct 30, 2025
c29259b
Minor change to sample size of training data. Removed call to subset …
piradiusquared Oct 30, 2025
d9c7f4b
Removed some hard coded constants with those in constants.py.
piradiusquared Oct 30, 2025
5bc31f8
Updates to README. Added section for running training/benchmark local…
piradiusquared Oct 30, 2025
f9bd91f
Update to constants for saving the loss plot.
piradiusquared Oct 30, 2025
0d32c89
Minor typo fixes. Added some explanations to improvements.
piradiusquared Oct 31, 2025
7fd0ae8
Minor update to README indicating where model was trained.
piradiusquared Oct 31, 2025
be16c21
Fix typo in README regarding SSH access
piradiusquared Oct 31, 2025
cf5d01c
Add constants for model and training configuration
piradiusquared Oct 31, 2025
e43907a
Enhance documentation with method docstrings
piradiusquared Oct 31, 2025
4d3e2b6
Enhance FlanModel with method docstrings
piradiusquared Oct 31, 2025
4a8db21
Clarify perplexity calculation in predict.py
piradiusquared Oct 31, 2025
cb3f542
Enhance documentation and setup in train.py
piradiusquared Oct 31, 2025
1 change: 1 addition & 0 deletions recognition/FLAN_s4885380/.gitignore
@@ -0,0 +1 @@
/__pycache__/*
503 changes: 503 additions & 0 deletions recognition/FLAN_s4885380/README.md

Large diffs are not rendered by default.

Binary file added recognition/FLAN_s4885380/assets/loss1.png
Binary file added recognition/FLAN_s4885380/assets/loss2.png
Binary file added recognition/FLAN_s4885380/assets/loss3.png
35 changes: 35 additions & 0 deletions recognition/FLAN_s4885380/constants.py
@@ -0,0 +1,35 @@
# File containing all constants used
MODEL_NAME = "google/flan-t5-base"

# Parquet file paths for the BioLaySumm dataset (read directly with pandas)
TRAIN_FILE = "hf://datasets/BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track/data/train-00000-of-00001.parquet"
VALIDATION_FILE = "hf://datasets/BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track/data/validation-00000-of-00001.parquet"
TEST_FILE = "hf://datasets/BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track/data/test-00000-of-00001.parquet"

INPUT_COLUMN = "radiology_report"
TARGET_COLUMN = "layman_report"

# Prompt used exclusively in predict.py
MODEL_PROMPT = "translate this radiology report into a summary for a layperson: "

# Held-out splits
TRAIN_SPLIT = 0.7
VALIDATION_SPLIT = 0.3

# Training parameters
EPOCHS = 3
LEARNING_RATE = 3e-4
TRAIN_BATCH_SIZE = 64
VALID_BATCH_SIZE = 128
MAX_INPUT_LENGTH = 256
MAX_TARGET_LENGTH = 128

# LoRA Parameters
LORA_R = 32
LORA_ALPHA = 64
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["q", "v"]

# File paths for model saving and loss plotting
OUTPUT_DIR = "t5-base-lora-tuned"
LOSS_OUT = "loss.png"
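
The hf:// parquet paths above can be read straight into pandas. A minimal sanity-check sketch, assuming huggingface_hub is installed so pandas can resolve the hf:// filesystem (it is pulled in by the datasets dependency); this snippet is not part of the repository:

# Sanity-check sketch for the dataset constants (assumption: hf:// is resolvable via huggingface_hub)
import pandas as pd

from constants import TRAIN_FILE, INPUT_COLUMN, TARGET_COLUMN

df = pd.read_parquet(TRAIN_FILE)                  # Downloads and reads the training split
print(df.shape)                                   # Number of report/summary pairs
print(df[[INPUT_COLUMN, TARGET_COLUMN]].head(3))  # Confirm both expected columns are present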
78 changes: 78 additions & 0 deletions recognition/FLAN_s4885380/dataset.py
@@ -0,0 +1,78 @@
import random
import pandas as pd
import numpy as np

from torch.utils.data import Dataset
from constants import *


class SplitData:
    """
    Held-out data splitter for training and evaluation.
    Splits the data into a 70/30 ratio.
    """

    def __init__(self, file_path: str, sample_size: int | None = None) -> None:
        self.dataframe = pd.read_parquet(file_path)
        if sample_size is not None:
            self.dataframe = self.dataframe[0:sample_size]
        # else:
        #     self.dataframe = self.dataframe[100:300]  # Testing split

    def get_splits(self) -> tuple[pd.DataFrame, pd.DataFrame]:
        """
        Returns both splits at once from the original dataframe.
        """
        split_index = np.random.random(len(self.dataframe)) < 0.7
        train = self.dataframe[split_index]
        validation = self.dataframe[~split_index]

        return train, validation


class FlanDataset(Dataset):
    """
    Custom dataset loader and preprocessor.
    Prepends one of four similar prompts for training and evaluation.
    """

    def __init__(self, dataframe: pd.DataFrame, tokenizer) -> None:
        self.tokenizer = tokenizer
        # self.prefix = MODEL_PROMPT
        self._prompts = [
            "Translate this radiology report into a summary for a layperson: ",
            "Summarise the following medical report in simple, easy-to-understand terms: ",
            "Explain this radiology report to a patient with no medical background: ",
            "Provide a layperson's summary for this report: "
        ]

        self.dataframe = dataframe

        # The BioLaySumm dataset is distributed as .parquet files
        # Future addition: add support for other basic file formats
        # self.dataframe = pd.read_parquet(file_path)
        # self.dataframe = self.dataframe[0:50]  # Slice data for a subset

    def __len__(self) -> int:
        return len(self.dataframe)

    def __getitem__(self, index: int) -> dict:
        """
        Tokenises the inputs using the tokenizer API, converting strings into model-ready token ids.
        """
        row = self.dataframe.iloc[index]  # Select the row by positional index

        rand_prefix = random.choice(self._prompts)  # Select a random prefix
        report = rand_prefix + str(row[INPUT_COLUMN])
        summary = str(row[TARGET_COLUMN])

        model_inputs = self.tokenizer(  # Tokenise the radiology report
            report,
            max_length=MAX_INPUT_LENGTH,
            truncation=True
        )

        with self.tokenizer.as_target_tokenizer():  # Tokenise the layman summary
            labels = self.tokenizer(
                summary,
                max_length=MAX_TARGET_LENGTH,
                truncation=True
            )
        model_inputs["labels"] = labels["input_ids"]  # Attach the summary token ids as labels
        return model_inputs
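
FlanDataset tokenises without padding, so batches need dynamic padding before they reach the model. A minimal sketch of how SplitData and FlanDataset might be wired into DataLoaders, assuming transformers' DataCollatorForSeq2Seq is used for per-batch padding (train.py is not rendered in this view, so the actual wiring may differ):

# Sketch only: builds DataLoaders from the classes above under the stated assumptions
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

from constants import MODEL_NAME, TRAIN_FILE, TRAIN_BATCH_SIZE, VALID_BATCH_SIZE
from dataset import SplitData, FlanDataset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
train_df, valid_df = SplitData(TRAIN_FILE).get_splits()     # 70/30 held-out split

collator = DataCollatorForSeq2Seq(tokenizer, padding=True)  # Pads input_ids and labels per batch
train_loader = DataLoader(FlanDataset(train_df, tokenizer), batch_size=TRAIN_BATCH_SIZE,
                          shuffle=True, collate_fn=collator)
valid_loader = DataLoader(FlanDataset(valid_df, tokenizer), batch_size=VALID_BATCH_SIZE,
                          collate_fn=collator)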
57 changes: 57 additions & 0 deletions recognition/FLAN_s4885380/modules.py
@@ -0,0 +1,57 @@
from typing import Tuple

from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    get_scheduler
)
from peft import LoraConfig, get_peft_model, TaskType

from dataset import *
from constants import *


class FlanModel:
    """
    Builds and loads the pre-trained model together with its LoRA adapter and AdamW optimiser.
    """

    def __init__(self):
        pass

    def build(self) -> Tuple[AutoModelForSeq2SeqLM, AutoTokenizer]:
        """
        Loads the tokeniser and Flan-T5 base model, and configures LoRA with the parameters defined in constants.py.
        """
        # Load the actual Flan-T5 model and tokeniser
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

        # Load LoRA
        lora_config = LoraConfig(
            r=LORA_R,
            lora_alpha=LORA_ALPHA,
            target_modules=LORA_TARGET_MODULES,
            lora_dropout=LORA_DROPOUT,
            bias="none",
            task_type=TaskType.SEQ_2_SEQ_LM
        )

        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()  # Show trainable parameters
        return model, tokenizer

    def setup_optimiser(self, model, train_dataloader) -> Tuple[AdamW, LambdaLR]:
        """
        Sets up the AdamW optimiser and a linear learning-rate scheduler.
        """
        optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
        num_training_steps = EPOCHS * len(train_dataloader)
        lr_scheduler = get_scheduler(
            "linear",
            optimizer=optimizer,
            num_warmup_steps=0,
            num_training_steps=num_training_steps
        )

        return optimizer, lr_scheduler
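
train.py does not appear in this rendered section, so the sketch below shows one way FlanModel.build() and setup_optimiser() could drive a plain PyTorch training epoch; the commit history mentions a gradient scaler and ROUGE tracking, which are omitted here, so the real train.py may differ:

# Sketch only: a bare training loop over FlanModel and FlanDataset under the stated assumptions
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

from constants import TRAIN_FILE, TRAIN_BATCH_SIZE, EPOCHS
from dataset import SplitData, FlanDataset
from modules import FlanModel

device = "cuda" if torch.cuda.is_available() else "cpu"

flan = FlanModel()
model, tokenizer = flan.build()
model.to(device)

train_df, _ = SplitData(TRAIN_FILE).get_splits()
collator = DataCollatorForSeq2Seq(tokenizer, padding=True)
train_loader = DataLoader(FlanDataset(train_df, tokenizer), batch_size=TRAIN_BATCH_SIZE,
                          shuffle=True, collate_fn=collator)

optimizer, lr_scheduler = flan.setup_optimiser(model, train_loader)

model.train()
for epoch in range(EPOCHS):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss      # Seq2seq LM loss; T5 builds decoder inputs from the labels
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch + 1}: last batch loss {loss.item():.4f}")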

81 changes: 81 additions & 0 deletions recognition/FLAN_s4885380/predict.py
@@ -0,0 +1,81 @@
import torch
import evaluate
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

from constants import *

FINETUNED_MODEL = "t5-base-lora-tuned/epoch_3"  # Take the last epoch for best performance


def perplexity_score(model: AutoModelForSeq2SeqLM,
                     tokenizer: AutoTokenizer,
                     prompt: str,
                     target_text: str,
                     device="cuda") -> float:
    """
    Computes the perplexity score from the model's loss on the target text.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    labels = tokenizer(target_text, return_tensors="pt").input_ids.to(device)

    # Get the loss during benchmarking
    with torch.no_grad():
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss

    perplexity = torch.exp(loss)  # Perplexity is the exponential of the loss
    return perplexity.item()


# Get a new base Flan-T5 model and load in the saved fine-tuned adapter
base_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
base_model.eval()

# Use a completely fresh Flan-T5 model as the backbone for the fine-tuned adapter
new_t5 = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
fine_tuned_model = PeftModel.from_pretrained(new_t5, FINETUNED_MODEL)
fine_tuned_model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Use the datasets API for loading in the dataset
predict_dataset = load_dataset("BioLaySumm/BioLaySumm2025-LaymanRRG-opensource-track")
predict_dataset = predict_dataset.shuffle(seed=3710)
random_predict = predict_dataset["validation"]

predictions = []
references = []

for i in range(5):  # Number of evaluations
    radiology_report = random_predict[i]['radiology_report']
    layman_report = random_predict[i]['layman_report']

    prompt = f"translate this radiology report into a summary for a layperson: {radiology_report}"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Get the fine-tuned model to generate a summary
    with torch.no_grad():
        outputs = fine_tuned_model.generate(**inputs, max_new_tokens=MAX_INPUT_LENGTH)

    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Compare perplexity of the fine-tuned and base models
    fine_tune_perplexity = perplexity_score(fine_tuned_model, tokenizer, prompt, layman_report)
    base_model_perplexity = perplexity_score(base_model, tokenizer, prompt, layman_report)

    print(f"\nExample {i + 1}")
    print(f"Official Layman Report: {layman_report}")
    print(f"Fine tuned Model's Layman Report: {prediction}")

    print(f"\nFine Tuned Model Perplexity on Official Report: {fine_tune_perplexity:.4f}")
    print(f"\nBase Model Perplexity on Official Report: {base_model_perplexity:.4f}")

    predictions.append(prediction)
    references.append(layman_report)  # Official report from the dataset

rouge_scores = evaluate.load("rouge")
scores = rouge_scores.compute(predictions=predictions, references=references, use_stemmer=True)

print("Final ROUGE scores after predictions")
for key, value in scores.items():
    print(f"{key}: {value * 100: .4f}")
18 changes: 18 additions & 0 deletions recognition/FLAN_s4885380/requirements.txt
@@ -0,0 +1,18 @@
# Core Fine Tuning and NLP packages
accelerate==1.10.1
bitsandbytes==0.48.1
datasets==4.2.0
evaluate==0.4.6
peft==0.17.1
safetensors==0.6.2
tokenizers==0.22.1
torch==2.8.0
transformers==4.57.0

# Other Utilities
matplotlib==3.10.7
nltk==3.9.2
numpy==2.3.3
pandas==2.3.3
rouge-score==0.1.2
tqdm==4.67.1
11 changes: 11 additions & 0 deletions recognition/FLAN_s4885380/runners/benchmark
@@ -0,0 +1,11 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --partition=a100
#SBATCH --time=5:00:00
#SBATCH --job-name=flanbenchmark
#SBATCH -o benchmark.out

conda activate flan
python predict.py
11 changes: 11 additions & 0 deletions recognition/FLAN_s4885380/runners/trainer
@@ -0,0 +1,11 @@
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --partition=a100
#SBATCH --time=5:00:00
#SBATCH --job-name=flant5finetune
#SBATCH -o flantune.out

conda activate flan
python train.py