From Byte-Level LLM Smoke Tests to Tokenized FSDP Pretraining
TL;DR
This is a field note from building a small LLM pretraining stack on two 8-GPU A100 servers. The path was not a clean straight line: I started with a byte-level toy LLM, verified communication and FSDP, trained a small debug model, tried a 7B model from scratch, then moved to a real SentencePiece tokenizer, Hugging Face CCI3-HQ data, checkpointing, cached token shards, and finally a more reasonable 0.5B pretraining target.
The biggest lesson is simple:
For early pretraining experiments, a smaller model trained on enough tokens is more informative than a large model trained on too few tokens.
In my case, a 7B model could run, checkpoint, and generate, but the generation quality was poor because it only saw about 204.8M tokens in the CCI3-HQ run. A 0.5B model trained for 5.24B tokens already produced much more fluent Chinese continuations, even though it was still far from instruction-following or factually reliable.
Hardware and Setup
The experiments used two remote A100 servers:
| Server | GPUs |
|---|---|
| node0 | 8 x A100 80GB |
| node1 | 8 x A100 80GB |
The first version used a very manual but transparent stack:
SSH + tmux + torchrun + NCCL + PyTorch FSDP
This is not a replacement for Slurm or Kubernetes, but it was perfect for understanding what actually happens:
- how many ranks are launched
- which GPU IDs are used
- how MASTER_ADDR, MASTER_PORT, node_rank, and world_size fit together
- whether NCCL communication works
- whether FSDP can save and restore checkpoints
For two fixed servers, manual tmux orchestration was enough. For a larger shared cluster, I would move to Slurm, Kubernetes, Ray, or at least a more complete job manager.
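A minimal sketch of how those launch values fit together for a two-node layout. The helper names, the script name train.py, the address 10.0.0.1, and port 29500 are placeholders for illustration, not the values used in these runs; the flags themselves are standard torchrun options. The rank formula assumes a static launch where node_rank is passed explicitly.

```python
def torchrun_command(node_rank, master_addr, master_port=29500,
                     nnodes=2, gpu_ids=(0, 1, 2, 3, 4, 5, 6, 7),
                     script="train.py"):
    """Build (env, argv) for one node; run it once per node, e.g. in a tmux pane."""
    # Restrict the node to the chosen physical GPU IDs.
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={len(gpu_ids)}",  # one rank per visible GPU
        f"--node_rank={node_rank}",          # 0 on node0, 1 on node1
        f"--master_addr={master_addr}",      # same value on every node
        f"--master_port={master_port}",
        script,
    ]
    return env, argv


def world_size(nnodes, nproc_per_node):
    """Total number of ranks across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of a process under a static node_rank assignment."""
    return node_rank * nproc_per_node + local_rank


env, argv = torchrun_command(node_rank=1, master_addr="10.0.0.1", gpu_ids=(4, 5, 6, 7))
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

The only value that differs between the two nodes is node_rank; everything else, including the master address, must match.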
Stage 1: Communication Probe Before Training
Before training anything real, I ran a communication probe using NCCL collectives. The probe creates GPU tensors and runs operations like:
dist.all_reduce(tensor, group=group)
This tested three levels of communication:
| Group | Meaning |
|---|---|
| world | all GPUs across both machines |
| per-node group | GPUs inside one server |
| inter-node leaders | one leader rank per server, roughly testing cross-node bandwidth |
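The three groups in the table can be described as plain lists of global ranks. This is a sketch of the rank bookkeeping only, assuming ranks are numbered node by node and that the first rank on each node acts as its leader; the function name probe_groups is made up for this note.

```python
def probe_groups(nnodes, gpus_per_node):
    """Return the rank lists for the world, per-node, and leader groups."""
    world = list(range(nnodes * gpus_per_node))
    # Ranks are assumed contiguous per node: node 0 holds 0..7, node 1 holds 8..15.
    per_node = [
        world[n * gpus_per_node:(n + 1) * gpus_per_node]
        for n in range(nnodes)
    ]
    # One leader per node (here: its first rank) approximates cross-node traffic.
    leaders = [ranks[0] for ranks in per_node]
    return {"world": world, "per_node": per_node, "leaders": leaders}


groups = probe_groups(nnodes=2, gpus_per_node=8)
print(groups["leaders"])
```

In a real probe each list would be handed to torch.distributed.new_group(ranks=...), which every rank has to call for every group, in the same order, even for groups it does not belong to.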
Observed result:
| Tensor | World Latency | World Alg BW | World Bus BW | Inter-Node BW |
|---|---|---|---|---|
| 16MB | 3.38 ms | 4.97 GB/s | 9.31 GB/s | 6.26 GB/s |
| 128MB | 23.07 ms | 5.82 GB/s | 10.91 GB/s | 6.47 GB/s |
This was good enough for multi-node FSDP experiments, but it was not a high-end InfiniBand-style training fabric. That matters later: FSDP full-shard depends on repeated parameter all-gather and gradient reduce-scatter.
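The two world bandwidth columns are related by the usual all-reduce convention: algorithm bandwidth is bytes divided by time, and bus bandwidth scales that by 2(n-1)/n. The sketch below checks the table under two assumptions that are mine, not stated above: "MB" means MiB, and the world group has 16 ranks.

```python
def allreduce_bandwidth(num_bytes, seconds, world_size):
    """Return (algorithm GB/s, bus GB/s) for one all-reduce."""
    alg_bw = num_bytes / seconds / 1e9
    # Ring-style all-reduce moves 2 * (n - 1) / n of the data per rank.
    bus_bw = alg_bw * 2 * (world_size - 1) / world_size
    return alg_bw, bus_bw


for size_mib, latency_ms in [(16, 3.38), (128, 23.07)]:
    alg, bus = allreduce_bandwidth(size_mib * 1024**2, latency_ms / 1e3, 16)
    print(f"{size_mib} MiB: alg {alg:.2f} GB/s, bus {bus:.2f} GB/s")
```

Under those assumptions the computed values land within rounding of the measured table, which is a quick sanity check that the probe was timing what it claimed to time.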
Stage 2: Byte-Level Debug Model
The first model was a small LLaMA-like Transformer:
debug_120m
actual parameters: about 98M
tokenizer: byte-level
vocab size: 8192
The goal was not language quality. The goal was to answer:
- Can 16 ranks form an NCCL process group?
- Can FSDP initialize?
- Can forward, backward, optimizer step, and checkpointing all run?
- Can a saved checkpoint be loaded for generation?
A synthetic 5-step FSDP smoke test worked:
| Step | Max Step Time |
|---|---|
| 1 | 1.267 s, warmup |
| 2 | 0.203 s |
| 3 | 0.202 s |
| 4 | 0.249 s |
| 5 | 0.246 s |
Then I trained on an authorized local science-fiction corpus:
files: 196 TXT files
size: 37,762,964 UTF-8 bytes
model: debug_120m
nodes x GPUs: 2 x 4
GPU IDs: 4,5,6,7 on each node
seq_len: 512
micro_batch_size: 1
The 10k-step run reached:
final avg loss: 1.1704
step 10000 time: 0.148 s
throughput: 27,726 tok/s
Generation became more Chinese-looking and science-fiction-like, but the semantics were still weak. This was expected: byte-level training is useful for debugging, but it is inefficient for real Chinese LLM pretraining because each Chinese character expands into multiple bytes.
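The byte expansion is easy to measure directly. A small sketch, using one of the prompts from later in this post; the helper names are illustrative only.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes per character."""
    return len(text.encode("utf-8")) / len(text)


def max_chars_in_window(seq_len_bytes, bytes_per_char):
    """How many characters fit in a byte-level context window."""
    return int(seq_len_bytes // bytes_per_char)


print(utf8_bytes_per_char("宇宙飞船正在飞向"))  # 3.0
print(utf8_bytes_per_char("spaceship"))          # 1.0
print(max_chars_in_window(512, 3))               # 170
```

So a 512-position byte-level window covers roughly 170 common Chinese characters, and the model spends three predictions on each one.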
Stage 3: 7B Can Run, But It Does Not Mean It Has Learned Enough
Next I launched a LLaMA-shaped 7B model:
model: llama_7b
parameters: 6.738B
layers: 32
hidden size: 4096
attention heads: 32
FFN hidden size: 11008
parallelism: FSDP full shard
nodes x GPUs: 2 x 5
GPU IDs: 3,4,5,6,7 on each node
world_size: 10
seq_len: 2048
micro_batch_size: 1
grad_accum_steps: 1
The key formula:
global tokens per optimizer step =
world_size * micro_batch_size * seq_len * grad_accum_steps
For this 7B run:
10 * 1 * 2048 * 1 = 20,480 tokens / step
At 10,000 steps:
trained tokens ~= 204.8M
The system worked:
final avg loss: 1.3864 on the local byte-level science-fiction corpus
step time: about 0.985 s
throughput: about 20,795 tok/s
checkpoint size: about 26GB
But the output quality was weak. This was an important lesson: running 7B is not the same as training 7B well. A 7B model trained from scratch usually needs on the order of hundreds of billions to trillions of tokens. A Chinchilla-style rough target for 7B is:
7B params * 20 ~= 140B tokens
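Putting the two numbers side by side makes the gap concrete. A sketch using the run settings above; the 20 tokens-per-parameter ratio is the rough rule of thumb already quoted, not a precise target.

```python
def tokens_per_step(world_size, micro_batch_size, seq_len, grad_accum_steps=1):
    """Global tokens consumed by one optimizer step."""
    return world_size * micro_batch_size * seq_len * grad_accum_steps


def chinchilla_target(params, tokens_per_param=20):
    """Rough compute-optimal token budget for a given parameter count."""
    return params * tokens_per_param


step_tokens = tokens_per_step(world_size=10, micro_batch_size=1, seq_len=2048)
trained = step_tokens * 10_000
target = chinchilla_target(7e9)

print(step_tokens)                   # 20480
print(trained)                       # 204800000
print(f"{trained / target:.2%}")     # 0.15%
```

The 10,000-step run covered about 0.15% of the rough target, which is why it counts as a systems test rather than a pretraining run.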
So the 7B experiment was primarily a systems validation:
- multi-node FSDP works
- checkpointing works
- single-GPU generation from checkpoint works
- GPU memory is manageable
- the training loop is real
It was not a sufficient 7B pretraining run.
Stage 4: Moving From Byte-Level to SentencePiece
The next scaffold moved from the local corpus to BAAI/CCI3-HQ and from byte-level tokenization to SentencePiece:
dataset: BAAI/CCI3-HQ
tokenizer: SentencePiece Unigram
vocab size: 32,000
byte_fallback: enabled
character_coverage: 0.9995
The tokenizer was trained from a streaming sample:
sample: 200,000 documents
characters: about 376M
vocab size: 32k pieces
This was a major upgrade. Byte-level tokenization is fine for smoke tests; a real tokenizer is necessary for efficient training. For Chinese, this also means traditional word segmentation like jieba is usually not needed: modern LLM tokenizers learn subword pieces directly from raw text, and byte fallback handles rare characters.
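A minimal sketch of a matching SentencePiece training call, assuming the sampled documents were first written to a plain-text file. The file name cci3_sample.txt, the prefix cci3_sp32k, and both helper names are hypothetical; the keyword arguments are standard SentencePiece trainer options set to the values listed above.

```python
def tokenizer_train_kwargs(corpus_path, model_prefix):
    """Trainer options mirroring the settings listed in this section."""
    return dict(
        input=corpus_path,            # one document or sentence per line
        model_prefix=model_prefix,    # writes <prefix>.model and <prefix>.vocab
        model_type="unigram",
        vocab_size=32000,
        character_coverage=0.9995,
        byte_fallback=True,           # rare characters decompose into byte pieces
    )


def train_tokenizer(corpus_path, model_prefix):
    """Train and load the tokenizer; needs the sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the config helper has no dependency

    spm.SentencePieceTrainer.train(**tokenizer_train_kwargs(corpus_path, model_prefix))
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")


print(tokenizer_train_kwargs("cci3_sample.txt", "cci3_sp32k"))
```

With the trained model loaded, sp.encode(text, out_type=int) returns the token ids that later end up in the cached shards.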
Stage 5: HF Streaming Worked, But It Became a Bottleneck
Hugging Face streaming was convenient because it allowed training without first downloading the full dataset:
load_dataset(..., streaming=True)
In distributed training, each rank receives a different stream slice. Conceptually, each GPU is not loading the entire dataset into memory; it pulls examples on demand, tokenizes them, and forms batches.
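The idea of "a different stream slice per rank" can be shown with a pure-Python stand-in. This is a conceptual sketch only: it is not the loader used in these runs, and the datasets library may assign whole shards to ranks rather than interleaving examples.

```python
from itertools import islice


def rank_slice(stream, rank, world_size):
    """Lazily yield every world_size-th example, starting at position `rank`."""
    return islice(stream, rank, None, world_size)


docs = [f"doc{i}" for i in range(10)]
for rank in range(4):
    # Each rank walks its own iterator and never materializes the full dataset.
    print(rank, list(rank_slice(iter(docs), rank, world_size=4)))
```

The slices are disjoint and together cover the stream, but they can differ in length, which a real training loop has to handle when the stream ends.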
This worked, but it had two practical issues:
- Network stalls caused occasional long step times.
- Online tokenization made the input path less predictable.
A 0.5B streaming run showed normal steps around 5.6-5.9s, but with occasional stalls and checkpoint overhead. Profiling and observation made the next direction clear:
HF streaming text
-> SentencePiece token ids
-> uint16 token shards
-> cached token training
The cache-building work so far has produced a token cache of about:
documents: 17.0M
tokens: 26.3B
size: about 97GB
format: uint16 token shards + manifests
After switching to cached token shards, training no longer depended on live HF streaming. This made step time much more stable.
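A sketch of what a uint16 shard amounts to, assuming a flat binary file of token ids with no header. The real shard and manifest layout used here is not shown in this post, so the function names and file name are illustrative. A 32,000-piece vocabulary fits in uint16 because every id is below 65,536.

```python
import numpy as np


def write_shard(path, token_ids):
    """Write token ids as a flat uint16 file, refusing ids that do not fit."""
    arr = np.asarray(token_ids, dtype=np.int64)
    if arr.min() < 0 or arr.max() > np.iinfo(np.uint16).max:
        raise ValueError("token id does not fit in uint16")
    arr.astype(np.uint16).tofile(path)


def read_batch(path, offset, batch_size, seq_len):
    """Read batch_size * seq_len tokens starting at `offset` without loading the file."""
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    n = batch_size * seq_len
    # Copy only the requested window and widen it for embedding lookups.
    chunk = np.asarray(tokens[offset:offset + n], dtype=np.int64)
    return chunk.reshape(batch_size, seq_len)
```

Reading a batch becomes a memory-mapped slice plus a reshape, with no network access or tokenization in the step path, which is consistent with the steadier step times after the switch.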
Stage 6: 0.5B Was the Better First Serious Target
The 0.5B model preset:
parameters: about 557.8M
dim: 1280
layers: 24
heads: 20
FFN hidden_dim: 3456
max_seq_len: 4096
vocab: 32k SentencePiece
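The parameter count can be reproduced from the preset with a few lines of arithmetic. The sketch assumes a LLaMA-style block that the preset does not spell out: four dim x dim attention projections (no grouped-query attention), a three-matrix gated FFN, two norm weight vectors per layer plus a final one, no biases, untied input and output embeddings, and a vocabulary of exactly 32,000.

```python
def llama_param_count(dim, n_layers, ffn_hidden, vocab_size, tied_embeddings=False):
    """Parameter count for a LLaMA-shaped decoder under the stated assumptions."""
    attn = 4 * dim * dim              # Wq, Wk, Wv, Wo
    ffn = 3 * dim * ffn_hidden        # gate, up, down projections
    norms = 2 * dim                   # attention norm + FFN norm
    per_layer = attn + ffn + norms
    embed = vocab_size * dim
    head = 0 if tied_embeddings else vocab_size * dim
    final_norm = dim
    return n_layers * per_layer + embed + head + final_norm


total = llama_param_count(dim=1280, n_layers=24, ffn_hidden=3456, vocab_size=32000)
print(f"{total:,}")  # 557,774,080
```

That agrees with the "about 557.8M" figure, and it shows that roughly 82M of the parameters sit in the two embedding matrices rather than in the Transformer layers.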
The first successful 0.5B run:
server: 6007
GPUs: 8
world_size: 8
seq_len: 2048
micro_batch_size: 8
grad_accum_steps: 8
global tokens/update: 1,048,576
trained steps: 5000
trained tokens: 5.24B
wall time: about 9h 53m
Evaluation:
quick eval loss: 3.0461
ppl: 21.03
eval tokens: 32,768
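The two evaluation numbers are the same measurement in different units, assuming the loss is mean cross-entropy per token in nats:

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_loss_nats)


print(round(perplexity(3.0461), 2))  # 21.03
```

A perplexity near 21 means the model is, on average, about as uncertain as a uniform choice among 21 tokens out of the 32,000 in the vocabulary.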
The full_rank0 checkpoint was also evaluated on one A100 for quick inference:
checkpoint: checkpoints/cci3_llama500m_single_node_8gpus_mb8_ga8_5000steps_5p24Btok_9h53m/full_rank0/step_00005000/model.pt
preset: llama_500m
parameters: about 0.558B
generation: temperature=0.8, top_k=50, max_new_tokens=96
results: results/cci3_llama500m_step5000_full_rank0_generation.jsonl
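As a rough illustration of what the temperature and top_k settings do, here is a pure-Python sketch of one sampling step over a list of logits. The function name sample_top_k is made up for this note, and the actual generation script may differ in details such as tie-breaking or extra filters.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from the top_k logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(logits, top_k=1))  # 1
```

With top_k=1 this reduces to greedy decoding; with top_k=50 and temperature 0.8 the model can still wander, which is part of why the sampled continuations above are fluent but not factual.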
Sampled generation results:
| Prompt | Output excerpt |
|---|---|
| 宇宙飞船正在飞向 | 宇宙飞船正在飞向火星。 这一计划是由美国宇航局航天飞机项目的负责人杰夫·霍顿(Jeff Hondor)提出的。霍顿和他的同事们一直试图通过提高载人船的运力,避免它们被敌人的超大船拖入轨道… |
| 人工智能的发展意味着 | 人工智能的发展意味着 AI 的逐步落地,AI 将成为驱动 AI 发展的核心驱动力。 AI 时代,人工智能将重塑各行各业,加速各行业的数字化转型和创新… |
| 在遥远的未来,人类文明 | 在遥远的未来,人类文明将要迎来一个崭新的时代。人类社会从人类历史开始,就进入了人类历史的时代。历史是由时间来记录的,而时间是历史的真实写照… |
| 地球上的最后一个城市 | 地球上的最后一个城市。 “这个城市位于南极洲。” “如果我去南极,我会去那里,到北极或南极去。” 1992年11月,25岁的格里菲斯在斯德哥尔摩乘坐游轮登上北极… |
This was surprisingly encouraging. With only 8 A100 GPUs, 5,000 optimizer steps, and about 5.24B training tokens, the model had already learned a recognizable sense of Chinese sentence structure, web-text style, exposition rhythm, and topic association. The knowledge itself was not reliably correct, but the language texture was already visible. That distinction felt important: pretraining first teaches distributional form and style before it becomes trustworthy knowledge or instruction-following behavior.
This model was still a base pretraining checkpoint, not an instruction model. It failed many structured QA prompts. For example, it associated Li Bai with wine and 将进酒, but did not reliably answer the question; it listed wrong countries for socialist states; and it failed to write a requested poem.
That led to a useful distinction:
| Stage | What It Learns |
|---|---|
| Base model | Next-token prediction: language structure, knowledge, style, and text patterns |
| SFT / instruction tuning | How to respond in a “user instruction -> assistant answer” format |
| Preference tuning / RLHF / DPO | Human preferences: more helpful, safer, less rambling, less obviously wrong |
So the current 0.5B checkpoint is a base model. It can continue text, but it does not necessarily obey. If prompted to “write a poem”, it may continue with web-style prose because it has only learned text distribution, not the instruction-following behavior of “I should satisfy the user’s request.”
Still, compared with the 7B 10k-step run, it was clearly more fluent. The reason was not magic:
| Item | 7B CCI3-HQ Run | 0.5B Run |
|---|---|---|
| Parameters | ~6.7B | ~0.558B |
| GPUs | 10 | 8 |
| Seq len | 2048 | 2048 |
| Global tokens/update | 20,480 | 1,048,576 |
| Evaluated step | 10,000 | 5,000 |
| Trained tokens | 204.8M | 5.24B |
| Final/near-final loss | ~5.13 | ~3.05 eval loss |
The 0.5B model had seen about 25.6x more tokens. It also had fewer parameters to fit. That combination mattered more than the headline parameter count.
Stage 7: Parallelism: What We Actually Used
The current training setup is:
TP = 1
PP = 1
FSDP-DP = number of GPUs used
That means:
- No Tensor Parallelism: individual matrix multiplications are not split across GPUs.
- No Pipeline Parallelism: different layers are not assigned to different GPUs.
- FSDP full shard: parameters, gradients, and optimizer states are sharded across ranks.
Every rank has the full Python module structure, but it only stores shards of the model states. During training, FSDP gathers the current layer’s parameters, computes forward/backward, then reduce-scatters gradients.
Important collectives:
| Collective | Meaning |
|---|---|
| all-gather | each rank has a shard; after all-gather, each rank has the full tensor |
| all-reduce | each rank has a full tensor; after reduction, each rank has the same reduced full tensor |
| reduce-scatter | reduce full tensors, then scatter reduced shards |
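The three collectives are easier to keep apart with a toy simulation. In this sketch each list element stands for one rank's tensor; there is no real communication, and tensor lengths are assumed to divide evenly by the number of ranks.

```python
def all_gather(shards):
    """Each rank contributes a shard; every rank ends with the concatenation."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(tensors):
    """Each rank contributes a full tensor; every rank ends with the elementwise sum."""
    total = [sum(vals) for vals in zip(*tensors)]
    return [list(total) for _ in tensors]


def reduce_scatter(tensors):
    """Sum full tensors elementwise, then give rank r only the r-th slice."""
    total = [sum(vals) for vals in zip(*tensors)]
    n = len(tensors)
    size = len(total) // n
    return [total[r * size:(r + 1) * size] for r in range(n)]


grads = [[1, 2], [3, 4]]             # two ranks, full gradients each
print(reduce_scatter(grads))         # [[4], [6]]
print(all_gather([[4], [6]]))        # [[4, 6], [4, 6]]
print(all_reduce(grads))             # [[4, 6], [4, 6]]
```

A reduce-scatter followed by an all-gather gives the same result as an all-reduce. FSDP uses the two halves at different times: all-gather for parameters before compute, reduce-scatter for gradients after backward.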
For the experiments:
8 GPU single-node run: FSDP-DP = 8
5 GPU single-node run: FSDP-DP = 5
10 GPU two-node run: FSDP-DP = 10
For 7B, TP=1, PP=1, FSDP-DP=N is a reasonable learning setup. For 32B, 70B, or longer context, I would start considering TP, PP, sequence/context parallelism, or a Megatron/DeepSpeed-style stack.
Stage 8: Micro Batch vs Gradient Accumulation
The formula that explains most experiments:
global_batch_tokens =
world_size * micro_batch_size * grad_accum_steps * seq_len
Definitions:
| Term | Meaning |
|---|---|
| micro_batch_size | samples per GPU per forward/backward |
| grad_accum_steps | how many micro-steps before one optimizer update |
| world_size | total distributed ranks, usually total GPUs |
| seq_len | tokens per sample |
Intuition:
- Increasing micro_batch_size uses more activation memory and may improve GPU utilization, but can hit memory limits quickly.
- Increasing grad_accum_steps increases global batch without much extra peak activation memory, but each optimizer step contains more forward/backward passes.
- If grad_accum_steps > 1, using no_sync() on non-final micro-steps avoids unnecessary gradient synchronization.
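The accumulation pattern from the last point can be sketched without tying it to a specific wrapper. This assumes the model object exposes a no_sync() context manager, as PyTorch's DDP and FSDP wrappers do; accumulate_step and loss_fn are hypothetical names, and this is not the training loop used in these runs.

```python
from contextlib import nullcontext


def accumulate_step(model, optimizer, micro_batches, loss_fn):
    """Run one optimizer update over several micro-batches."""
    n = len(micro_batches)
    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        is_last = (i == n - 1)
        # Skip gradient communication on every micro-step except the last one.
        ctx = nullcontext() if is_last else model.no_sync()
        with ctx:
            # Divide so the accumulated gradient is the mean over micro-batches.
            loss = loss_fn(model, batch) / n
            loss.backward()
    optimizer.step()
```

One caveat worth hedging: according to the PyTorch FSDP documentation, no_sync() keeps unsharded gradients around between syncs, so on FSDP it trades communication for extra memory and may not always be the right choice.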
Two current cached-data experiments made this concrete:
| Run | GPUs | Micro Batch | Grad Accum | Tokens/Step | Step Time | Throughput | Peak Alloc |
|---|---|---|---|---|---|---|---|
| 6007 cached 0.5B | 8 | 32 | 2 | 1.05M | ~5.23 s | ~200k tok/s | ~29.8GB |
| 1024 cached 0.5B | 5 | 64 | 2 | 1.31M | ~10.3 s | ~126k tok/s | ~58.1GB |
This shows a useful rule: a larger per-step token count does not guarantee better throughput. The 5-GPU mb64-ga2 run processes more tokens per optimizer step, but it is slower overall because it uses fewer GPUs and much more memory per GPU. The 8-GPU mb32-ga2 run is healthier.
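The table rows can be rechecked from their own columns. The sketch assumes seq_len is 2048 for both cached runs, which is consistent with the token counts shown but is not stated in the table.

```python
def step_throughput(gpus, micro_batch, grad_accum, seq_len, step_seconds):
    """Return (tokens per optimizer step, tokens per second)."""
    tokens = gpus * micro_batch * grad_accum * seq_len
    return tokens, tokens / step_seconds


runs = {
    "8 GPUs, mb32, ga2": (8, 32, 2, 2048, 5.23),
    "5 GPUs, mb64, ga2": (5, 64, 2, 2048, 10.3),
}
for name, args in runs.items():
    tokens, tok_per_s = step_throughput(*args)
    print(f"{name}: {tokens:,} tokens/step, {tok_per_s / 1e3:.0f}k tok/s")
```

The second run moves 25% more tokens per step but takes about twice as long to do it, so the larger step does not translate into higher tokens per second.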
Stage 9: Checkpointing Lessons
The training scaffold now supports:
full_rank0
dcp
both
none
full_rank0 is convenient for generation and inspection:
full_rank0/step_00005000/model.pt
dcp means PyTorch Distributed Checkpoint. It is better for distributed resume because it can save sharded model and optimizer states:
dcp/step_00005000/
Because both servers share a NAS path, DCP is practical here. But exact sample-level resume is harder with HF streaming. The current implementation restores model state, optimizer state, RNG state, and step number, but streaming dataset position is not exactly replayed. For large shuffled pretraining streams, this is usually acceptable; for strict reproducibility, the data pipeline needs deterministic token shards and stored offsets.
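With cached token shards, "stored offsets" can be as small as two integers per rank. A hypothetical sketch of such a cursor, assuming each rank reads its shards in a fixed order; the DataCursor class and its fields are illustrative and not part of the current scaffold.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DataCursor:
    """Position of one rank in its ordered list of token shards."""
    shard_index: int = 0
    token_offset: int = 0

    def advance(self, tokens_consumed, shard_sizes):
        """Move forward, rolling over into later shards when one is exhausted."""
        self.token_offset += tokens_consumed
        while (self.shard_index < len(shard_sizes)
               and self.token_offset >= shard_sizes[self.shard_index]):
            self.token_offset -= shard_sizes[self.shard_index]
            self.shard_index += 1

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


cursor = DataCursor()
cursor.advance(150, shard_sizes=[100, 100, 100])
print(cursor.to_json())  # {"shard_index": 1, "token_offset": 50}
```

Saving this string next to the model and optimizer state at each checkpoint lets a resumed run continue from the exact token where it stopped, which live streaming cannot easily offer.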
What I Would Keep As Know-How
Here is the distilled checklist I would keep for future pretraining runs:
- Validate communication before training. Run NCCL probes before debugging model code.
- Start with a tiny model. A 100M model tells you if the distributed system works.
- Do not over-interpret early generation. A model can generate text-looking output long before it knows facts or follows instructions.
- Track tokens, not just steps. Steps mean little without world_size * micro_batch * grad_accum * seq_len.
- Byte-level tokenization is for smoke tests. Real Chinese pretraining needs a serious tokenizer.
- Small model plus enough data beats large model plus starvation. The 0.5B run was more informative than the early 7B run.
- FSDP-only is enough to learn distributed training. But TP/PP become important as model size and context length grow.
- Streaming is convenient, cached tokens are stable. Online streaming is great for exploration; pre-tokenized shards are better for long runs.
- Micro batch and grad accumulation trade memory for throughput. They are not interchangeable even if global batch is the same.
- Checkpoint format matters. Save full-rank checkpoints for quick generation, DCP for serious resume.
Personal Takeaway
This exercise changed how I think about “training a model.” At the start, the milestone was simply: can a 7B model run on the GPUs? Later, the better question became: how many clean tokens did the model actually see, how stable is the input pipeline, how are states sharded, and what exactly does one optimizer step mean?
That shift is the real progress. The experiments moved from “make it run” to “make it measurable.”
The near-term plan is to keep pushing on this LLM training path until the whole loop feels clear: tokenizer, cached data, FSDP, checkpoint resume, eval, generation, batch tuning, and eventually SFT. Once that is solid, the natural next experiment is to move from LLMs to VLMs and repeat the same discipline: start with the smallest working version, make the data path measurable, then scale.
