On the ‘Micro’biology of LLMs: How Does an LLM Find an Answer?
Yeah, I know, it’s a cliché title, and I couldn’t find a better one. But it is actually a play on the title of Anthropic’s latest paper, “On the Biology of a Large Language Model”. The paper was released last March to a much-hyped reception among AI researchers. It’s a very interesting paper, since Anthropic is doing something that is not the focus of many AI companies nowadays: explaining what goes on behind the scenes of a large language model, and what mechanism allows these large transformers to generate such clear, coherent, and intelligent-seeming words.
This paper is a good one to replicate. The only challenge is resources. We’re just normal people here, with no data center to spare. Some of you may have a few GPUs on hand, but all I have is a MacBook M1. So I asked ChatGPT Deep Research for help in devising a simple experiment to replicate the paper above with the limited resources that I have. This will help us understand more about how an LLM functions.
We will be using a small model, Mistral 7B, which can run on Google Colab. We will explore the attention layers of this model and observe how they behave when we ask the model to answer a basic fact-finding question.
Understanding what Anthropic is Doing
Before we begin, let us understand what Anthropic’s papers are actually doing.
You can read both of their papers here:
Transformer Circuits Thread
Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the… (transformer-circuits.pub)
Under the March 2025 Articles section, they published two main articles. The first one is “On the Biology of a Large Language Model” and the other one is on circuit tracing.
The circuit-tracing paper is the one that explains the method they use to understand the “biology” of the LLM. And by “biology”, they actually mean the circuitry inside the large language model.
In that paper, they explain in detail how they manage to build a computational graph, via circuit tracing, that can trace the signal processed at each layer/node during LLM text generation.
For simplification, they take out the MLP layers of the transformer and replace them with their Cross-Layer Transcoder (CLT). The attention layers and the CLT are combined to form a replacement model. This replacement model is then trained to match the output of the original model.
Why use a CLT? Because a CLT is highly interpretable. With a CLT, Anthropic’s researchers can understand and map the circuits inside their LLM.
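To make the idea concrete, here is a heavily simplified sketch of the transcoder concept: a frozen toy MLP stands in for one transformer MLP block, and we train a wider, sparsity-regularised encoder/decoder pair to reproduce its output. This is not Anthropic’s actual CLT (a real CLT reads from one layer and writes to many later layers, and is trained on real activations, not random data); all sizes and names below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden, d_features = 16, 64, 128

# A frozen toy MLP, standing in for one transformer MLP block.
mlp = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                    nn.Linear(d_hidden, d_model))
for p in mlp.parameters():
    p.requires_grad_(False)

# The "transcoder": a wider, sparsity-regularised bottleneck trained to
# reproduce the MLP's output directly from the MLP's input.
encoder = nn.Linear(d_model, d_features)
decoder = nn.Linear(d_features, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(2048, d_model)   # stand-in for residual-stream activations
target = mlp(x)

losses = []
for step in range(300):
    feats = torch.relu(encoder(x))   # sparse features: the interpretable units
    recon = decoder(feats)
    # Reconstruction loss plus an L1 penalty that pushes features toward sparsity.
    loss = ((recon - target) ** 2).mean() + 1e-4 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The learned `feats` play the role of the interpretable features: because they are sparse, one can hope to attach a meaning to each one, which is what makes the replacement model easier to analyze than the original MLP.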
However, although their method is clear, doing it ourselves is another matter entirely. Even if we replace the MLP with a CLT, we still need to train an entire transformer-sized replacement model, which is problematic with the limited hardware that we have.
So, what can we do? What other kinds of experiments can we do to replicate some aspects of the Anthropic paper?
Devising Small-Scale Experiment
So, with limited resources, we can still conduct experiments similar in spirit. However, we need to be realistic and keep our experiments small. Looking at the objective, we want to ensure that we understand at least a little about how LLMs work. Specifically, we want to observe how the components inside the LLM behave, how the layers, the neurons, and the weights react when asked a simple question.
To demonstrate this, we will do two simple experiments:
1. Look at how each attention weight changes during token generation.
2. See how each token relates to the next token.
With this, we hope to understand how LLMs are able to provide answers when given fact-finding questions. For this experiment, the question we will ask is:
“What is the capital of the state Perlis?”
For readers outside of Malaysia, Perlis is the smallest state in the country. Since we are running this experiment on a small LLM model (7B parameters), we want to see whether it can generate an accurate answer for a relatively obscure fact. If it succeeds, we aim to understand how it managed to do so.
So, let’s begin.
Preparing the Model
The first thing we need to do is download the model. If you’re using Google Colab, these are also the steps you need to take. We’re going to play around with the internal machinery of the LLM, so we cannot simply use API calls.
For those who are not familiar, the first thing you need to do is head over to the Hugging Face website.
Hugging Face - The AI community building the future.
We're on a journey to advance and democratize artificial intelligence through open source and open science. (huggingface.co)
This is a website hosting thousands of open-source LLMs/AI models that can be used by people around the world. Make sure you register, then generate your Hugging Face token by going to Profile > Settings > Access Tokens.

Once your token is generated, save it in a secure location. Then, we can proceed with the next step.
On your Jupyter notebook or Google Colab, you can run the code below. This installs all the required packages, just to be safe.
!pip install transformers accelerate bitsandbytes xformers optimum huggingface_hub --quiet
The --quiet option is there to avoid large, cluttered cell output.
After that, we need to check our GPU. Just run the code below to ensure we have a GPU in our notebook/Colab.
import torch
print("CUDA available:", torch.cuda.is_available())
If it prints CUDA available: True, run the next code:
!nvidia-smi
This will output the GPU information as below:
This GPU information is very informative. Here we can see the GPU type that we are using, which is an NVIDIA A100. We can also see the VRAM that comes with the GPU (40GB). This information is useful for determining which model we can download and run for our experiment.
We can use this information programmatically to determine which model size we can run. I have some code, configured for Gemma, but what we actually want is the model parameter size, so you may refer to my Bahasa Melayu tutorial here:
Siri Belajar AI : Buat RAG dari kosong (Bahagian Akhir)
Finally, we have arrived at the final part. The most important part of all. In this part, we will… (medium.com)
Or you can run the code below.
# Get the available GPU memory.
import torch

gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = gpu_memory_bytes / (1024.0 ** 3)
print(f"Total GPU memory available: {gpu_memory_gb:.2f} GB")

# Note: the following is Gemma-focused; however, more and more LLMs of the
# 2B and 7B size are appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb:.2f}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb:.2f}GB | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb:.2f}GB | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:
    print(f"GPU memory: {gpu_memory_gb:.2f}GB | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")
This will recommend which model size we can run for our experiment. For my demonstration, since I have 40GB, I will be using the
“mistralai/Mistral-7B-Instruct-v0.2”
mistralai/Mistral-7B-Instruct-v0.2 · Hugging Face
We're on a journey to advance and democratize artificial intelligence through open source and open science. (huggingface.co)
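As a rough rule of thumb behind recommendations like the one above, a model’s weights alone need about (number of parameters) × (bytes per parameter) of VRAM, with activations and the KV cache adding more on top. A quick back-of-the-envelope sketch (the parameter counts here are approximate, for illustration only):

```python
def estimated_vram_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint; activations and the KV cache add more on top."""
    return n_params_billion * 1e9 * bytes_per_param / (1024 ** 3)

# Approximate parameter counts, for illustration only.
for name, n_b in [("Gemma 2B", 2.5), ("Mistral 7B", 7.2)]:
    for precision, nbytes in [("float16", 2.0), ("4-bit", 0.5)]:
        print(f"{name} in {precision}: ~{estimated_vram_gb(n_b, nbytes):.1f} GB")
```

With 40GB of VRAM, a 7B model in float16 (roughly 13-14GB of weights) fits comfortably, which is why Mistral 7B is a safe pick here.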
Alright, now we’re ready to download our model. But before we do that, let’s log in to the Hugging Face CLI from our notebook/Colab. Just run the code below:
!huggingface-cli login
It will then prompt you to enter your token.
Once that’s done, you can proceed with downloading the model. Just run the code below:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# Load tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,    # Use fp16 for efficiency
    device_map="auto",            # Automatically assign layers to available GPU(s)
    attn_implementation="eager",  # Use eager attention (avoids SDPA issues)
    output_attentions=True,       # Make sure attention weights are returned
    output_hidden_states=True,
    trust_remote_code=True,
).eval()

# Make sure we're on GPU.
assert torch.cuda.is_available(), "CUDA is not available!"
Then, once it’s downloaded, you can use it to answer our prompt as below:
prompt = "What is the capital of the state Perlis?"

# Tokenize the prompt and move all tensors to the GPU.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Use the model's EOS token as pad if a pad token is not defined.
pad = tokenizer.pad_token_id or tokenizer.eos_token_id
eos = tokenizer.eos_token_id

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        pad_token_id=pad,
        eos_token_id=eos,
        return_dict_in_generate=True,
        output_attentions=True,   # Return attentions so we can use them later
        output_hidden_states=True,
    )

# Decode the generated tokens.
generated_ids = outputs.sequences[0]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
print("Generated Output:\n", generated_text)
If everything works correctly, we will get the answer as below:
Now our model works. So, let’s try to understand how it arrives at the answer.
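Before digging in, it helps to know the shape of what `generate` returned. `outputs.attentions` is nested: one entry per generated token, each entry a tuple with one tensor per layer, each tensor of shape (batch, heads, query_len, key_len). The toy mock below uses made-up small dimensions (not the real Mistral sizes) purely to illustrate the nesting, including the KV-cache effect where every step after the first has a single query row:

```python
import numpy as np

# Toy mock of the nesting of outputs.attentions (shapes illustrative only).
prompt_len, num_layers, num_heads = 10, 4, 8
mock_attentions = []
for step in range(3):                        # one entry per generated token
    q_len = prompt_len if step == 0 else 1   # with KV cache, later steps have one query
    k_len = prompt_len + step                # keys grow as tokens accumulate
    layer_tuple = tuple(
        np.zeros((1, num_heads, q_len, k_len)) for _ in range(num_layers)
    )
    mock_attentions.append(layer_tuple)

for step, layers in enumerate(mock_attentions):
    print(f"step {step}: {len(layers)} layers, layer shape {layers[0].shape}")
```

Keeping this structure in mind makes the indexing in the loops below much easier to follow: we always pick a layer first, then batch index 0, then deal with heads and the query row.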
Viewing the Attention Weights
The first part of our experiment is to look at how the attention weights change when we ask the model a simple question. Remember that when we generated the answer above, we asked the model to output its attentions. This lets us use the attention weights for visualization and data collection. So, let’s see how we might utilize that.
The first step is to loop back over the token generation. We will record all the attention weights that were active during each generation step. Think of them as the neurons that were activated during token generation.
But instead of looking at individual neurons, we will look at the generated tokens. We want to know which tokens are attended to most when the next token is generated. This will let us understand the behaviour of the language model.
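As a reminder of what an “attention weight” actually is: each new token’s query vector is dotted against the key vector of every earlier token, and a softmax turns the scores into a distribution that sums to one. A minimal NumPy sketch with random vectors (purely illustrative, not real model activations):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # for numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
d = 8
keys = rng.normal(size=(5, d))    # one key vector per previous token
query = rng.normal(size=d)        # the token currently being generated

scores = keys @ query / np.sqrt(d)  # scaled dot-product scores
weights = softmax(scores)           # the "attention weights" we will record
print(weights, weights.sum())
```

The rows we extract from the model are exactly such distributions, averaged over the attention heads.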
So, what you need to do is just run the code below. The code will:
1. Loop over the token generation steps.
2. Collect all the attention weights during token generation.
3. Use them to plot a graph whose colors change depending on the attention weights.
4. Consolidate all the attention weights into a DataFrame/table.
import matplotlib.pyplot as plt
import numpy as np
import imageio
import os
import pandas as pd

# Set up the output directory and output lists.
output_dir = "attention_frames"
os.makedirs(output_dir, exist_ok=True)
frames = []
all_attention_data = []

# Number of tokens in the prompt.
input_len = inputs["input_ids"].shape[1]

# Loop over the token generation steps.
for step, attn_tuple in enumerate(outputs.attentions):
    # attn_tuple holds one tensor per layer; take the last layer,
    # batch index 0, and average over heads: shape (q_len, k_len).
    attn_avg = attn_tuple[-1][0].mean(dim=0)
    seq_len = attn_avg.shape[-1]  # number of tokens seen so far (keys)

    # Attention row of the newest query token. With the KV cache,
    # q_len is 1 for every step after the first.
    attn_for_final = attn_avg[-1].squeeze().cpu().numpy()
    if attn_for_final.ndim > 1:
        attn_for_final = attn_for_final.mean(axis=0)

    current_ids = outputs.sequences[0][:seq_len]
    tokens_current = tokenizer.convert_ids_to_tokens(current_ids.tolist())

    # Index of the token generated at this step.
    gen_token_pos = input_len + step
    if gen_token_pos < len(outputs.sequences[0]):
        generated_token = tokenizer.convert_ids_to_tokens(
            [outputs.sequences[0][gen_token_pos].item()])[0]
    else:
        generated_token = "<no token>"

    # Find the top-3 attended tokens and keep them in a list.
    token_attention_pairs = list(zip(tokens_current, attn_for_final.tolist()))
    token_attention_pairs_sorted = sorted(token_attention_pairs, key=lambda x: x[1], reverse=True)
    top_3_attention = token_attention_pairs_sorted[:3]
    for token, attn_weight in top_3_attention:
        all_attention_data.append({
            "Generation Step": step + 1,
            "Generated Token": generated_token,
            "Top Attending Token": token,
            "Attention Weight": attn_weight,
        })

    # Create the GIF frame.
    plt.figure(figsize=(20, 5))
    attn_row = attn_for_final[None, :]
    plt.imshow(attn_row, cmap="viridis", aspect="auto")
    plt.colorbar(label="Attention Weight")
    ax = plt.gca()
    ax.set_xticks(np.arange(len(tokens_current)))
    ax.set_xticklabels(tokens_current, rotation=90, fontsize=6)
    plt.tight_layout()
    frame_path = os.path.join(output_dir, f"frame_{step+1:03d}.png")
    plt.savefig(frame_path)
    plt.close()
    frames.append(frame_path)

# Save all the data to a DataFrame, then to CSV.
attention_df = pd.DataFrame(all_attention_data)
output_csv = os.path.join(output_dir, "attention_top3_per_token.csv")
attention_df.to_csv(output_csv, index=False)
print(f"Top-3 attention data saved to {output_csv}")

# Save the GIF.
gif_path = "attention_evolution2.gif"
with imageio.get_writer(gif_path, mode='I', duration=2) as writer:
    for frame in frames:
        writer.append_data(imageio.imread(frame))
print(f"GIF saved to {gif_path}")
This long code will produce two outputs. The first one is the GIF as below:
As you can see from the GIF above, when the model tries to generate the next token, it always looks at the current token and the few tokens that come just before it. But what makes it intriguing is that, for some tokens, it also attends to tokens that come much earlier. Let’s look closely at this. Let’s look at the data; maybe we can understand more there.
As you can see from the table, when the model tries to generate the word “Kangar”, it attends to the subword “is”, which comes from “Perlis”. What’s interesting is that it also attends to the word “Malaysia”. This means it tries to gather context before completing the word “Kangar”. It suggests the LLM behaves a bit like a human during answer generation: it finds the context in the question and answers based on that context.
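Once you have the CSV, it is easy to interrogate it further with pandas. For example, summing attention by source token shows which prompt words do the most work overall. The rows below are made-up illustrative data in the same format as attention_top3_per_token.csv, not the numbers from the actual run:

```python
import pandas as pd

# Illustrative rows in the same shape as attention_top3_per_token.csv
# (values are invented for this example, not taken from the real output).
df = pd.DataFrame([
    {"Generation Step": 1, "Generated Token": "▁K", "Top Attending Token": "is", "Attention Weight": 0.31},
    {"Generation Step": 1, "Generated Token": "▁K", "Top Attending Token": "▁Perlis", "Attention Weight": 0.22},
    {"Generation Step": 2, "Generated Token": "ang", "Top Attending Token": "▁K", "Attention Weight": 0.45},
    {"Generation Step": 2, "Generated Token": "ang", "Top Attending Token": "▁Malaysia", "Attention Weight": 0.12},
])

# Which source tokens attract the most attention overall?
totals = (df.groupby("Top Attending Token")["Attention Weight"]
            .sum()
            .sort_values(ascending=False))
print(totals)
```

Running the same groupby on the real CSV gives a quick ranking of which question words the model leaned on most across the whole answer.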
This brings us to the next experiment.
The Attribution Graph Experiment
This experiment maps each generated token to a contribution score from each of the tokens that come before it. This will help us see and understand how much the tokens influence each other during token generation.
To do this, we take the attention scores from the last layer. For each generated token, we look at which of the previous tokens it attended to, and from there we build a connection graph to visualize how each token influences the next token’s generation. So, just run the code below:
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import imageio
import os

# Set a threshold for drawing edges: only tokens with attention weight higher than this will be shown.
threshold = 0.01

# Create an output directory for the graph frames.
output_dir_graph = "attribution_graph_frames"
os.makedirs(output_dir_graph, exist_ok=True)
graph_frames = []  # File paths for each graph frame.

# outputs.attentions is a list with one entry per generation step;
# each entry is a tuple of tensors (one per layer).
for step, attn_tuple in enumerate(outputs.attentions):
    # Extract the attention tensor from the last layer for this generation step.
    attn_tensor = attn_tuple[-1]  # Shape: (batch_size, num_heads, q_len, k_len)
    # Average over heads for the first example (batch index 0).
    attn_avg = attn_tensor[0].mean(dim=0)  # Shape: (q_len, k_len)
    # Print shape for debugging.
    print(f"Step {step+1}: attn_avg shape = {attn_avg.shape}")

    # Determine the current sequence length (number of keys).
    seq_len = attn_avg.shape[-1]
    # Extract the tokens of the sequence up to this step.
    current_ids = outputs.sequences[0][:seq_len]
    tokens_current = tokenizer.convert_ids_to_tokens(current_ids.tolist())

    # With the KV cache, q_len is 1 after the first step, so the final
    # (and only) query row sits at index 0; otherwise it is the last row.
    final_token_idx = 0 if attn_avg.shape[0] == 1 else seq_len - 1

    # Get the attention distribution for the final token (the query).
    try:
        attn_for_final = attn_avg[final_token_idx]  # Shape: (seq_len,)
    except Exception as e:
        print(f"Error at step {step+1}: {e}")
        continue
    attn_for_final = attn_for_final.squeeze().cpu().numpy()
    if attn_for_final.ndim > 1:
        attn_for_final = attn_for_final.mean(axis=0)

    # Build a directed graph. Note: identical token strings share a node.
    G = nx.DiGraph()
    for token in tokens_current:
        G.add_node(token)
    target_token = tokens_current[-1]
    for i, weight in enumerate(attn_for_final):
        if weight > threshold:
            G.add_edge(tokens_current[i], target_token, weight=weight)

    # Use a fixed seed for layout reproducibility.
    pos = nx.spring_layout(G, seed=42)
    plt.figure(figsize=(8, 6))
    nx.draw(G, pos, with_labels=True, node_color='skyblue', edge_color='gray',
            arrowsize=20, font_size=10)
    edge_labels = {(u, v): f"{d['weight']:.2f}" for u, v, d in G.edges(data=True)}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color='red')
    plt.title(f"Attribution Graph at Generation Step {step+1}")
    plt.tight_layout()
    frame_path = os.path.join(output_dir_graph, f"graph_frame_{step+1:03d}.png")
    plt.savefig(frame_path)
    plt.close()
    graph_frames.append(frame_path)

# Create a GIF from the graph frames.
gif_path_graph = "attribution_graph_evolution.gif"
with imageio.get_writer(gif_path_graph, mode='I', duration=5) as writer:
    for frame in graph_frames:
        writer.append_data(imageio.imread(frame))
print(f"Attribution graph GIF saved to {gif_path_graph}")
This code will output the GIF as below:
It’s too fast, but what we want to see is the important part. So, let’s see how it generates the “Kangar” answer.
As you can see, before it generates the word “Kangar” (split into three subwords: “K”, “ang”, and “ar”), it attends to the word “is”. “Malaysia”, “capital”, and “state” are also used to help it generate the next token.
Let’s look at the next token.
When it tries to generate the next token, it attends to the previous token. But let’s see what happens when it tries to complete the word “Kangar”.
As you can see here, when it tries to complete the token, it also attends to the words “Malaysia”, “state”, and “capital”. This suggests that it takes in the context before providing the answer, just like any human would.
We have successfully completed our analysis. This is really exciting. It shows that even with a small-scale experiment, we can peek under the black box of AI, and it helps us really understand what this technology is all about.
And understanding this technology will lead us to building it ourselves one day.