A Practical Map of LLM Training Ecosystems
Published:
This post supports English / 中文 switching via the site language toggle in the top navigation.
The Short Version
After working through a small LLM training pipeline, I find it useful to think of the ecosystem in layers:
PyTorch / CUDA / NCCL
-> Megatron / FSDP / DeepSpeed / Accelerate
-> Hugging Face Transformers / safetensors / Hub
-> vLLM / SGLang / TensorRT-LLM
-> TRL / OpenRLHF / verl / NeMo-Aligner
These tools are not competing in one flat category. They sit at different levels of the stack. For my current setup, Megatron is the pretraining path, Hugging Face safetensors are the exchange and deployment format, vLLM or SGLang handle serving and evaluation, and verl is the more serious RL post-training option. PyTorch sits underneath almost everything.
Megatron Is the Heavy Training Machine
Megatron, especially Megatron-Core and NeMo-Megatron, is built for large-scale training. Its natural language is tensor parallelism, pipeline parallelism, data parallelism, sequence parallelism, distributed optimizer state, and distributed checkpoints.
That matters once the job is no longer a notebook experiment. On two 8-GPU A100 servers, Megatron gives a more direct path to training a 3B or larger dense model than a plain Hugging Face loop. It knows how to split weights across GPUs, how to resume from sharded checkpoints, and how to make NVIDIA hardware do the boring expensive work with fewer surprises.
For from-scratch pretraining or serious continued pretraining, I would keep Megatron as the main training path.
Hugging Face Is the Exchange Layer
Hugging Face is not just a training library. It is the common language of the open-source model world.
The important pieces are:
transformersfor model definitions and loadingdatasetsandtokenizersfor data and preprocessingsafetensorsfor portable model weightsPEFTandTRLfor lightweight fine-tuning and alignment experiments- the Hub as the distribution layer
Hugging Face safetensors can be used for training, especially SFT, LoRA, DPO, and smaller continued pretraining. But for large distributed pretraining, it is usually not the checkpoint format I want to rely on. A Megatron checkpoint contains the training state in the shape Megatron needs: TP/PP/DP partitioning, optimizer state, scheduler state, RNG state, and iteration metadata.
So I do not think of this as choosing Megatron or Hugging Face once and for all. I would train in Megatron, export selected checkpoints to Hugging Face safetensors, and use the Hugging Face version for evaluation, serving, sharing, and lighter post-training.
The conversion can also go the other way. A Hugging Face model can be converted into Megatron format for continued pretraining. But that is usually a weight initialization conversion, not a perfect recovery of optimizer and dataloader state.
PyTorch Is the Ground Floor
PyTorch sits below all of this. Megatron uses PyTorch. Hugging Face uses PyTorch. verl and OpenRLHF usually use PyTorch. Even when the user-facing framework has a different name, the low-level reality is often still tensors, autograd, CUDA, NCCL, and distributed process groups.
This matters because debugging eventually falls through the abstraction. A training run may look like “Megatron failed”, but the real issue can be NCCL topology, CUDA memory fragmentation, a PyTorch distributed checkpoint mismatch, or a dataset worker hanging.
For LLM engineering, knowing PyTorch distributed is not optional forever. You can postpone it, but it comes back.
vLLM and SGLang Are Serving Engines
vLLM and SGLang enter after the model is usable enough to poke.
vLLM is the default choice when I want a stable OpenAI-compatible server, good throughput, and a short path from Hugging Face weights to an API. It is direct and practical.
SGLang overlaps with vLLM, but it is more opinionated about structured generation and inference programs. If the task involves JSON constraints, tool-like multi-step flows, prefix reuse, or agent-style rollout logic, SGLang becomes more interesting.
Neither wants a raw Megatron distributed training checkpoint as the happy path. They want a deployable model format, usually Hugging Face safetensors.
verl Is the RL Post-Training Layer
Reinforcement learning post-training is its own system problem. It is not just “call loss.backward() with a different loss.”
A real RL pipeline has several moving parts:
- actor model
- reference model
- reward model or rule-based reward
- rollout engine
- advantage estimation
- PPO or GRPO update
- distributed scheduling
- checkpointing and evaluation
This is why verl is attractive. It is designed around the RL post-training workflow, not merely around a single trainer class. It can connect training workers with rollout engines such as vLLM or SGLang, and it is a better fit once the job becomes multi-GPU or multi-node.
For a quick proof of concept, TRL is still a good place to start. For a serious 16-GPU RL run, I would look at verl first.
My Current Pipeline
The pipeline I would use now is:
1. Pretrain or continue-pretrain with Megatron
2. Save Megatron distributed checkpoints for resume
3. Convert selected checkpoints to Hugging Face safetensors
4. Evaluate and serve with vLLM or SGLang
5. Run SFT or DPO with HF/TRL if the experiment is small
6. Run GRPO/PPO with verl if the RL stage becomes serious
7. Export the final actor back to HF format for deployment
This keeps each tool in its strongest role. Megatron handles the expensive pretraining phase. Hugging Face makes the model portable. vLLM and SGLang make it easy to test and serve. verl handles the messy RL loop.
The main lesson is simple: do not force one ecosystem to do every job. The practical stack is a chain of handoffs.
Takeaway
With two machines and 16 A100s, I would use Megatron as the main path for from-scratch pretraining. At the same time, I would regularly export selected checkpoints to Hugging Face safetensors, because evaluation, serving, sharing, and post-training are easier once the model is in that format.
