This repository provides a practical guide to understanding fundamental concepts in Large Language Models (LLMs), specifically focusing on tokens, tokenization, and embeddings. The primary resource is the Understanding_LLMs.ipynb Jupyter Notebook, which walks through code examples to illustrate these concepts.
- What are Tokens? Tokens are the basic units of text that an LLM processes. They can be words, subwords, or even individual characters, depending on the tokenization strategy.
- What is Tokenization? Tokenization is the process of breaking down a piece of text into these smaller units (tokens). This is a crucial preprocessing step for LLMs, as they operate on numerical representations of these tokens.
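The idea can be sketched with a toy example: split text into units, then map each unit to an integer ID via a vocabulary. (Real LLM tokenizers use learned subword vocabularies such as BPE; the whitespace and character splits below are only for intuition.)

```python
# Toy illustration of tokenization: text -> tokens -> integer IDs.
# Real tokenizers use learned subword vocabularies; this is for intuition only.

def tokenize_words(text):
    """Split on whitespace (word-level tokenization)."""
    return text.lower().split()

def tokenize_chars(text):
    """Split into individual characters (character-level tokenization)."""
    return list(text)

def build_vocab(tokens):
    """Assign each unique token a stable integer ID."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

text = "LLMs process tokens not raw text"
tokens = tokenize_words(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]  # the numeric sequence a model would see

print(tokens)
print(ids)
```

Note how the same text yields different token sequences under different strategies; the model only ever sees the integer IDs, never the raw string.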
- Demonstration: The notebook uses the `AutoTokenizer` class from the Hugging Face `transformers` library to show how raw text is converted into a sequence of token IDs.
- What are Embeddings? Embeddings are dense vector representations of tokens or pieces of text. These vectors capture the semantic meaning and context of the words they represent. Words with similar meanings will have embeddings that are closer together in the vector space.
- Why are they Important? Embeddings allow LLMs to understand and process language in a way that captures nuances, relationships, and context, which is essential for tasks like text generation, translation, and sentiment analysis.
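The "closer together in the vector space" intuition can be made concrete with cosine similarity, which scores how aligned two vectors are. The three-dimensional vectors below are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meanings
print(cosine_similarity(cat, car))     # much lower: dissimilar meanings
```

This is the same measure commonly used to compare real embedding vectors, e.g. for semantic search over documents.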
- Demonstration: The notebook utilizes the `google.genai` library to generate embeddings for sample text, illustrating how text can be converted into meaningful numerical vectors.
The Jupyter Notebook Understanding_LLMs.ipynb serves as an interactive, step-by-step guide to:
- Install and set up necessary libraries.
- Understand and implement text tokenization.
- Generate and understand text embeddings.
- Perform basic text generation using a pre-trained causal language model.
- Dependency Installation:
  - `transformers` (from Hugging Face, for tokenizers and models)
  - `google-generativeai` (for Google's Generative AI models, including embedding generation)
  - `python-dotenv` (for managing API keys)
- API Key Management:
  - The notebook demonstrates setting API keys directly as environment variables (e.g., `os.environ["HF_TOKEN"]`).
  - Recommended Practice: For security and better configuration management, store API keys (like `HF_TOKEN` for Hugging Face and `GOOGLE_API_KEY` for Google AI) in a `.env` file in the root of the project. The notebook can then be modified to load these keys using `python-dotenv`:

    ```python
    # Example of loading from .env (add this at the beginning of the notebook)
    # !pip install python-dotenv
    import os
    from dotenv import load_dotenv

    load_dotenv()
    hf_token = os.getenv("HF_TOKEN")
    google_api_key = os.getenv("GOOGLE_API_KEY")
    ```
- Text Tokenization:
  - Using `AutoTokenizer.from_pretrained("google/gemma-3-1b-it")` to load a tokenizer.
  - Converting text strings into input IDs and attention masks.
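These two steps can be sketched as follows. Note this is a sketch, not the notebook's exact code: the Gemma checkpoint it uses is gated and requires a Hugging Face token, so the openly downloadable `gpt2` checkpoint is substituted here purely for illustration.

```python
# Tokenization sketch, assuming `transformers` and `torch` are installed.
# The notebook uses "google/gemma-3-1b-it" (gated, needs an HF token);
# "gpt2" is substituted here only because it is openly downloadable.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded = tokenizer("Tokens are the basic units of text.", return_tensors="pt")

print(encoded["input_ids"])       # tensor of integer token IDs
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```

Decoding the IDs with `tokenizer.decode(...)` recovers the original string, which is a quick sanity check that tokenization is lossless.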
- Embedding Generation:
  - Initializing the `genai.Client` with an API key.
  - Using `client.models.embed_content()` to get embedding vectors for given text.
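A minimal wrapper around these two calls might look like the sketch below. The `gemini-embedding-001` model name and the `get_embedding` helper are assumptions for illustration; check Google's current embedding model list, and note that actually running the call requires a valid `GOOGLE_API_KEY`.

```python
import os

def get_embedding(text, model="gemini-embedding-001"):
    """Fetch an embedding vector via the google-genai client.

    Requires GOOGLE_API_KEY in the environment. The model name is an
    assumption for illustration; check Google's current embedding models.
    """
    from google import genai  # deferred: only needed when actually called

    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    response = client.models.embed_content(model=model, contents=text)
    return response.embeddings[0].values

# Only attempt a real API call when a key is configured.
if os.getenv("GOOGLE_API_KEY"):
    vector = get_embedding("Embeddings capture semantic meaning.")
    print(len(vector), vector[:5])  # dimensionality and first components
```

Deferring the `google.genai` import keeps the sketch importable even when the SDK is not installed.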
- Text Generation:
  - Loading a pre-trained causal language model: `AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")`.
  - Generating text by providing tokenized input to the model using `model.generate()`.
  - Decoding the generated token IDs back into human-readable text using `tokenizer.batch_decode()`.
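The full tokenize, generate, decode loop can be sketched end to end. As above, this is not the notebook's exact code: `gpt2` stands in for the gated Gemma checkpoint, and the generation parameters shown are illustrative choices.

```python
# End-to-end sketch: tokenize -> generate -> decode, assuming `transformers`
# and `torch` are installed. The notebook loads "google/gemma-3-1b-it"
# (gated, needs an HF token); "gpt2" is substituted as an open checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Large language models are", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                      # greedy decoding: deterministic
    pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token by default
)
text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(text)
```

Because causal LMs generate a continuation, the decoded output begins with the original prompt followed by the newly generated tokens.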
- Clone the repository:
  ```shell
  git clone https://github.com/Shreyas-Walde/Understanding_LLMs.git
  cd Understanding_LLMs
  ```
- Create a `.env` file: In the root directory of the project, create a file named `.env` and add your API keys:

  ```
  HF_TOKEN="your_hugging_face_token"
  GOOGLE_API_KEY="your_google_ai_api_key"
  ```
- Install Dependencies:
The notebook installs dependencies directly within its cells. Ensure you have pip installed.
  Alternatively, you can create a `requirements.txt` file with the following content and run `pip install -r requirements.txt`:

  ```
  transformers
  google-generativeai
  python-dotenv
  torch
  ```
- Open and Run the Notebook:
  Launch Jupyter Notebook or JupyterLab and open `Understanding_LLMs.ipynb`, then execute the cells sequentially to see the concepts in action.
The main Python libraries used in this project are:
- `transformers` (Hugging Face)
- `google-generativeai`
- `python-dotenv`
- `torch`
Please refer to the notebook for specific versions if encountering compatibility issues.