My RM 500 (failed?) experiments on Transformer
“I set out to scale the Transformer. Instead, I only scaled my frustration and lost RM 500.”
So, this is my second write-up on one of my own paper implementations. You can read my first one, on PIDAO, here:
https://medium.com/@maercaestro/pidao-a-new-way-of-deep-learning-training-optimization-d1e864dbd237
I actually did this paper earlier than PIDAO, way back in November. Initially I didn’t want to share it, since it hasn’t been a successful implementation overall. But something tells me it will take a long time before I get it running successfully. And sometimes it’s good to share your failures too. It’s a humbling experience, and it lets you reflect and learn more.
So, here it is. The story of how I burned RM 500 while implementing the Transformer.
Why Transformer?
Short answer: this is the start, the beginning of it all. The shifting point that allowed AI to reach its current level today.
Long answer:
The Transformer is the backbone architecture of all major state-of-the-art (SoTA) models. Companies like OpenAI, Google, Meta, and Microsoft all use the transformer architecture to develop their AI models. It was first introduced in the landmark 2017 paper “Attention Is All You Need” by eight Google scientists.
Here is the paper: (https://arxiv.org/abs/1706.03762)
It was originally introduced to resolve the issues faced by Seq2Seq models, especially in machine translation tasks. Seq2Seq processes text (as you guessed it) in sequence. This causes problems, especially when the model has to process long sequences of text: it must wait for one batch of tokens to be processed before it can move on to the next step of the sequence. This creates a computational bottleneck, leading to longer training times.
The Transformer resolved these issues by removing the need to process text sequentially. The text is processed in parallel, utilizing the power of GPUs. This cut training time, and because the architecture scales so well across GPUs, it is excellent at absorbing larger amounts of data, leading to higher performance.
The Transformer has been so revolutionary because it generalizes well. First envisioned as a model for translation, it is now used everywhere, from generating text to images, audio, and even video.
That’s why it’s very important for every data scientist to at least learn about and implement their own transformer.
What is the Transformer actually?
No, the Transformer is not Optimus Prime, although one of its creators has said he was inspired by the cartoon series when naming it.
It is a neural network architecture that transforms input data (like a sequence of words) into meaningful representations for tasks like translation or text generation.
The picture above shows the architecture in its entirety. It was (and still is) kind of scary for me, having no background in this. But if we know what to look for, we can understand the transformer architecture completely. As a start, let’s separate the transformer into two sections: the left (the encoder) and the right (the decoder).
It basically follows the encoder-decoder structure coming from the Seq2Seq model, which can be represented through the image below:
This is the picture I find easiest for people to understand the encoder-decoder architecture. Although it uses an image as the example, the structure is basically the same. In the picture, we see that the encoder turns the input into a context-rich representation that is easier for the machine to learn from. Training is done on this context-rich representation. The decoder then turns the context-rich representation back into the image, or the original input. So the goal is to build the context-rich representation properly, so that when it is decoded, none of the context disappears and we get the correct output.
From the transformer architecture above, we can see several components in both the encoder and decoder parts, which are:
Input Embedding (already covered this in my tutorial https://medium.com/@maercaestro/siri-belajar-ai-mari-belajar-tentang-penanaman-vektor-vector-embedding-2e06230c01c9)
Positional Encoding (also have covered this here https://medium.com/@maercaestro/siri-belajar-ai-mari-belajar-tentang-penanaman-posisi-positional-embedding-94ad4cdd7cc2)
Multi-Head Attention Blocks (masked and not masked) -https://medium.com/@maercaestro/siri-belajar-ai-mekanisma-perhatian-cd71853ec325
Feed-forward network (a regular fully connected neural network): https://medium.com/@maercaestro/siri-belajar-ai-mari-buat-jaringan-neural-dari-kosong-b525ba11171c
The other parts, which are Add & Norm (addition and normalization), Softmax, and Linear, are just simple mathematical operations; these four are the main components we need to know about.
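Since multi-head attention is the heart of the architecture, here is a minimal sketch of its core operation, scaled dot-product attention, in plain dependency-free Python. This is my own illustrative version, not code from my implementation; a real model would use batched tensor operations on a GPU.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q, K, V are
    lists of row vectors (one per token)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

Multi-head attention simply runs several of these in parallel on learned projections of the input and concatenates the results.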
I will not cover what all of these parts do extensively. As you can see from the links I pasted above, I have covered each of these individual components in my Bahasa Melayu tutorials, so you may read them at your own pace.
In the interest of time, and to make sure I don’t bore you, I’ll go straight to how I set up my experiments and what tools I used. But I will cover the transformer itself in the future (once I manage to get a successful result out of it).
So, let’s dive into my implementations
Scaling the hill (Bukit Beruang): My first implementation
My first implementation of the transformer follows the paper exactly: I built the full encoder-decoder architecture. In the original paper there are two models, Base and Large, each with its own configuration.
Base (size = ~65 million parameters)
model:
  num_layers: 6            # Number of encoder/decoder layers
  d_model: 512             # Hidden size (model dimension)
  num_heads: 8             # Number of attention heads
  d_ff: 2048               # Feed-forward layer dimension
  src_vocab_size: 32000    # Vocabulary size for source
  tgt_vocab_size: 32000    # Vocabulary size for target
  max_len: 512             # Maximum sequence length
  dropout_rate: 0.1        # Dropout rate

Large (size = ~213 million parameters)
model:
  num_layers: 6            # Number of encoder/decoder layers
  d_model: 1024            # Hidden size (model dimension)
  num_heads: 16            # Number of attention heads
  d_ff: 4096               # Feed-forward layer dimension
  src_vocab_size: 32000    # Vocabulary size for source
  tgt_vocab_size: 32000    # Vocabulary size for target
  max_len: 512             # Maximum sequence length
  dropout_rate: 0.1        # Dropout rate

For my model, I used the configuration below.
Megat-Transformer Configuration (size = ~237 million parameters)
model:
  num_layers: 6
  d_model: 512
  num_heads: 8             # Number of attention heads
  d_ff: 2048
  src_vocab_size: 96038    # Vocabulary size for source
  tgt_vocab_size: 96038    # Vocabulary size for target
  max_len: 43              # Maximum sequence length (or greater if needed)
  dropout_rate: 0.2

The larger vocabulary leads to a larger parameter count, which makes my model bigger than the Base model of the original transformer.
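To see where those parameter counts come from, here is a rough back-of-the-envelope counter for an encoder-decoder transformer. This is my own sketch, not the project’s code; exact totals vary with weight tying, biases, layer norms, and positional embeddings, so treat the numbers as approximations.

```python
def approx_params(num_layers, d_model, d_ff, src_vocab, tgt_vocab):
    """Rough parameter count for an encoder-decoder transformer
    (weight matrices only; biases and layer norms ignored)."""
    attn = 4 * d_model * d_model             # Q, K, V and output projections
    ffn = 2 * d_model * d_ff                 # two feed-forward matrices
    encoder = num_layers * (attn + ffn)      # self-attention + FFN per layer
    decoder = num_layers * (2 * attn + ffn)  # self- and cross-attention + FFN
    embeddings = (src_vocab + tgt_vocab) * d_model
    output_proj = tgt_vocab * d_model        # final linear layer (if untied)
    return encoder + decoder + embeddings + output_proj

base = approx_params(6, 512, 2048, 96038, 96038)    # my Megat-Transformer config
large = approx_params(6, 1024, 4096, 96038, 96038)  # the scaled-up config
```

The key point the estimate makes obvious: with a 96k vocabulary, the embedding and output matrices dominate, which is why my model outgrew the paper’s Base model despite identical layer settings.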
For the dataset, I did not use the data stated in the original paper. The original paper uses the WMT 2014 English-to-German translation dataset, which has 4.5 million rows. The bigger model uses the WMT 2014 English-to-French dataset, which has 36 million rows.
Since I want to implement this properly, I would like to use a custom dataset. Considering that I’m Malay, I wanted a Malay translation dataset. Unfortunately, there aren’t any official Malay translation datasets, which is why Malay has always been labelled a ‘low-resource language’ in the world of natural language processing.
But thanks to open-source efforts, we have one dataset compiled by Mesolitica: a Google-Translate-based Malay-to-English translation set with 2 million rows.
To make things worse, I don’t have the computational power to load and train on this whole dataset, so I had to limit it to only 30k rows.
This may hurt the performance of the model, but I just want to prove that I can implement the transformer properly, train it, and maybe get a decent evaluation score out of it.
Speaking of evaluation, we will use the same metric as the paper: BLEU (Bilingual Evaluation Understudy). The original transformer achieved a BLEU score of 27.3 for the Base model and 28.4 for the Large model. Since we’re using only the Base configuration, 27.3 will be our benchmark.
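For readers who haven’t used BLEU before, it is essentially a clipped n-gram precision combined with a brevity penalty. Here is a simplified single-reference, sentence-level version of my own (real evaluations should use a standard library such as sacrebleu, which handles tokenization and multiple references):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU with uniform weights, single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any empty precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = math.exp(min(0.0, 1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n) * 100
```

A perfect match scores 100; scores in the 20s-30s are typical of good machine translation, which puts the paper’s 27.3 benchmark in context.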
I set up the run on Google Colab (since that was the only GPU I could get my hands on, initially) and tracked the training runs through Wandb, as recommended by ChatGPT.
The training ran for 20 epochs, which took around 20 hours. I managed to achieve a loss of 0.2 and a BLEU score of 10.37. Considering that I only trained on 30k rows, that is quite good for a first attempt, right? Right????
Right??????
Wrongg!!!!!!
Basically, I made the mistake of not tracking validation loss during training, so my results may not generalize well.
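The fix is cheap: hold out a slice of the data before training, then compute the loss on that held-out slice after every epoch. A minimal sketch of the split (a hypothetical helper, not from my actual training code):

```python
import random

def train_val_split(rows, val_fraction=0.1, seed=42):
    """Shuffle and split a dataset into disjoint train/validation subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n_val = max(1, int(len(rows) * val_fraction))
    return rows[n_val:], rows[:n_val]

# Each epoch: train on the train split, then evaluate (no gradient updates)
# on the validation split and log both losses. A training loss that keeps
# falling while validation loss rises is the classic overfitting signature.
```

With only 30k rows and a 237M-parameter model, that widening gap is exactly what a validation curve would have exposed.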
I learned that the hard way when I tried to continue my experiments by testing the scaling laws.
Scaling Mount Kinabalu: Testing the Scaling Laws
From the title alone you know something is wrong. No inexperienced hiker would attempt to scale the highest mountain in Southeast Asia right after conquering a small hill in Melaka. But that’s what I did, in AI-research terms. And I paid dearly for it.
In the second implementation, I set out to use the transformer I had built to test the scaling laws.
Based on the original scaling-laws paper by OpenAI (“Scaling Laws for Neural Language Models”), which you can find here (https://arxiv.org/abs/2001.08361), the scaling laws state that the performance of a transformer/neural language model depends on three main factors:
The size of the model
The amount of data fed during training
The compute capacity available for training
Since I’m not a millionaire, we can’t do anything about number 3. All we can do is increase the size of the model and the amount of data fed during training.
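For context, the paper fits each factor with a power law. For model size N (non-embedding parameters), the fit is roughly L(N) ≈ (N_c / N)^α_N with α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13. A one-line sketch, with the constants taken from the paper and best treated as approximate:

```python
def scaling_law_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    """Approximate test loss vs. model size from the OpenAI scaling-laws fit.
    Smaller is better; loss falls as a power law in parameter count."""
    return (n_c / n_params) ** alpha_n
```

The tiny exponent is the whole story: to meaningfully drop the loss, you must multiply the parameter count, not merely add to it, which is why scaling experiments get expensive so fast.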
What we did was increase the number of heads (to 16) and the model dimension (to 1024). This increased the parameter count from 237 million to 512 million.
We also performed base runs with the 237-million-parameter model as a comparison.
My original transformer experiments were done with 30,000 rows extracted from the Mesolitica dataset (mesolitica/google-translate-malay-news). For these experiments, we increased the dataset to 50k and 100k rows.
The training runs were configured as below:
# Training settings
training:
  warmup_steps: 1000
  batch_size: 32
  learning_rate: 0.00001
  epochs: 300
  checkpoint_dir: "checkpoints/"

We also set an early-stopping patience of 30 for each run, meaning that if there is no improvement in validation loss for 30 epochs, training stops.
early_stopping_patience = 30  # Number of epochs with no improvement before stopping

if val_loss < best_val_loss:
    best_val_loss = val_loss
    epochs_no_improve = 0
else:
    epochs_no_improve += 1
    if epochs_no_improve >= early_stopping_patience:
        print("Early stopping triggered. Training stopped.")
        break

For these training runs, we did not use Google Colab. I had heard about RunPod, which offers cheap cloud GPUs for AI training, so I set up my cloud GPU and spent nearly $100 on initial credits.
Once I set everything up, I was ready to begin training. And the results were chaotic. I performed 54 runs, each with a different training configuration. Every one of them ended in early stopping, which was problematic. And after 54 runs, my credits couldn’t hold out anymore. I had to stop training and figure out ways to improve the overall training runs.
To make a long story short, our initial objective was to improve on the BLEU score of 10.37 from our initial run. However, since all runs ended in early stopping, we were unable to improve it. I believe this was caused by the condition of the dataset and our training configuration.
However, there is still some light at the end of the tunnel. Based on the graph above, we can see that increasing the model size and the amount of data ingested leads to improved model performance. Both the larger-model and larger-dataset runs sustained more iterations, lasting longer than the smaller models. This roughly supports the conjecture proposed by the scaling-laws paper, though further experiments are needed to at least beat the BLEU score of the initial run.
So, what have I learned?
A lot, actually. A lot about implementation, coding, theory, and which frameworks to use. I also learned that training an AI model is too damn expensive. That’s why it’s crucial that we find ways to optimize all of this.
But in technical terms related to the transformer, I have three recommendations, listed below:
Perform cleaning on the dataset, and adjust the architecture based on the cleaned dataset.
Analyze the loss graph and improve the training configuration. Adding warm-up steps and a dynamic learning rate (as proposed in the paper) should improve the training loss.
Introduce subword tokenization (byte-pair encoding or SentencePiece) to improve the pre-training.
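On the second recommendation, the schedule proposed in “Attention Is All You Need” (linear warm-up followed by inverse-square-root decay, which I skipped in favor of a fixed learning rate) is simple to implement:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from "Attention Is All You Need":
    linear warm-up for warmup_steps, then ~1/sqrt(step) decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first step
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In my config above I used a fixed learning_rate of 0.00001 with warmup_steps: 1000; the paper instead ties the peak learning rate to d_model and lets it rise then decay, which tends to stabilize early transformer training.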
Will this sort of failure stop me? Nah. This is just the beginning.
I’ll get you next time transformer.









