Skip to content

Rishi-Kora/Tokenizers-using-HuggingFace

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Tokenizers-using-HuggingFace

A hands-on guide to exploring Hugging Face tokenizers across popular LLMs like LLaMA, PHI-3, and StarCoder2. This project demonstrates how to encode, decode, and format text, code, and chat-style messages for large language models.


📌 Features

  • 🔄 Encode and decode text with various tokenizers
  • 💬 Format multi-turn chat prompts using chat templates
  • 🧠 Compare tokenization outputs across models
  • 🧪 Visualize individual tokens and their IDs
  • 🧰 Supports models like:
    • meta-llama/Meta-Llama-3.1-8B-Instruct
    • microsoft/phi-3-mini-4k-instruct
    • bigcode/starcoder2-15b

📂 Folder Structure

Tokenizers-using-HuggingFace/
├── Tokenizers_using_HuggingFace.ipynb
└── README.md

🚀 Getting Started

1. Clone the repository

git clone https://github.com/your-username/Tokenizers-using-HuggingFace.git
cd Tokenizers-using-HuggingFace

2. Install dependencies

pip install transformers

Optional for some models:

pip install torch
pip install sentencepiece

🧪 Example Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct", trust_remote_code=True)
text = "I love exploring tokenizers!"
tokens = tokenizer.encode(text)
decoded = tokenizer.batch_decode(tokens)

print(tokens)
print(decoded)

🧠 License

This project is open-source and available under the MIT License.


🤝 Contributing

Contributions, suggestions, and improvements are welcome! Feel free to open an issue or submit a pull request.


📬 Contact

Created by Rishi Kora (https://github.com/Rishi-Kora) – feel free to reach out with questions or ideas!

About

Explore how Hugging Face tokenizers work across models like LLaMA, PHI-3, and StarCoder2. Includes examples for encoding, decoding, chat formatting, and token visualization. Ideal for understanding text preprocessing in LLMs.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors