This project explores the capabilities of ModernBERT and the original BERT model by analyzing their tokenization processes and evaluating their performance on various text inputs. It is designed for educational purposes and serves as a starting point for deeper exploration of NLP models.
- Tokenizer Analysis: Analyze the tokenization behavior of the BERT and ModernBERT models.
- Model Evaluation: Evaluate ModernBERT's embedding generation and hidden states.
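The tokenizer comparison can be sketched with the `transformers` library. This is a minimal example, assuming the standard Hugging Face checkpoints `bert-base-uncased` and `answerdotai/ModernBERT-base`; adjust the model IDs to whatever the notebooks actually use:

```python
from transformers import AutoTokenizer

# Checkpoint names are assumptions; substitute the models used in the notebooks.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
modern_tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

text = "Tokenizers split rare words into subword units."

# tokenize() returns the subword strings, making the two schemes easy to compare.
bert_tokens = bert_tok.tokenize(text)
modern_tokens = modern_tok.tokenize(text)

print("BERT:      ", bert_tokens)
print("ModernBERT:", modern_tokens)
```

Running this side by side shows how the two models segment the same input differently, which is the behavior the tokenizer analysis notebook examines in depth.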
Clone the repository:

```bash
git clone https://github.com/Meeex2/ModernBert_Benchmark.git
cd ModernBert_Benchmark
```
Install dependencies:

```bash
pip install -r requirements.txt
```
Explore and run the tokenizer analysis notebook:

```
notebooks/tokenizer_analysis.ipynb
```

Explore and evaluate the model's performance:

```
notebooks/evaluation.ipynb
```

Requirements:

- Python 3.8 or higher
- Libraries: `transformers`, `torch`
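The embedding and hidden-state evaluation can be sketched as follows. This is a minimal example, again assuming the `answerdotai/ModernBERT-base` checkpoint rather than the exact setup in the evaluation notebook:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    # output_hidden_states=True also returns the per-layer hidden states.
    outputs = model(**inputs, output_hidden_states=True)

# last_hidden_state has shape (batch, seq_len, hidden_size);
# mean-pooling over the sequence gives a simple sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
print(len(outputs.hidden_states))  # embedding layer plus one entry per transformer layer
```

Mean pooling is just one simple way to turn token-level states into a sentence embedding; the notebook may use CLS pooling or another strategy.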
Contributions are welcome! Feel free to fork this repository, create a branch, and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- Hugging Face for the Transformers library
- The research community for advancing NLP and transformer models