Bio-sequence large language models (LLMs) are computationally intensive. Pre-training is the most demanding phase, but fine-tuning and inference still require significant GPU memory and throughput. This post outlines the hardware needs across the model lifecycle, and how to meet them cost-effectively with cloud infrastructure.
Lifecycle Requirements
There are now more than 40 pretrained bio-sequence LLMs, nearly all of which are trained and deployed on NVIDIA GPUs. While pre-training is the heaviest lift, fine-tuning still requires comparable VRAM. Hundreds of finetuned models have been built on top of these pretrained LLMs, each tuned for a different purpose with specialized datasets. Inference with these models can require less VRAM than training or finetuning.
Rule of thumb: the VRAM used for pre-training is a good guide to what downstream usage will need. If a model is trained on an 80GB A100, it will generally need an 80GB-class GPU for effective fine-tuning and batch inference. The same applies to models trained on a 48GB A6000 or a 32GB V100. Memory, not just raw compute, is often the limiting factor.
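To make the memory point concrete, here is a rough heuristic sketch (an approximation, not a guarantee): mixed-precision Adam training typically costs on the order of 16 bytes of VRAM per parameter (fp16 weights and gradients plus fp32 master weights and optimizer moments), while fp16/bf16 inference needs about 2 bytes per parameter, before activations, which grow with batch size and sequence length. The parameter count below is a placeholder, not a specific published model.

```python
# Heuristic VRAM estimate (approximate; excludes activations, which grow with
# batch size and sequence length and often dominate for long sequences).
def rough_vram_gb(n_params: float, training: bool = True) -> float:
    bytes_per_param = 16 if training else 2  # mixed-precision Adam vs. fp16/bf16 weights only
    return n_params * bytes_per_param / 1e9

n_params = 650e6  # placeholder: a ~650M-parameter protein LLM
print(f"training:  ~{rough_vram_gb(n_params):.1f} GB + activations")
print(f"inference: ~{rough_vram_gb(n_params, training=False):.1f} GB + activations")
```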
Optimization Techniques
At various stages, tools like DeepSpeed, PEFT, and Ray can reduce hardware requirements and wall-clock time. DeepSpeed can offload certain training steps to the CPU, cutting the GPU VRAM required, and it can also speed up inference by around 30% in many cases.
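As an illustrative sketch (not a prescribed setup), a DeepSpeed configuration like the one below offloads optimizer state to the CPU via ZeRO; the model, batch size, and learning rate are placeholders:

```python
# Minimal sketch: ZeRO stage 2 with optimizer-state offload to CPU, trading some
# step time for a much smaller GPU VRAM footprint. Typically launched with the
# `deepspeed` launcher (e.g., `deepspeed train.py`).
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},  # mixed precision on Ampere-class GPUs
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,  # shard gradients and optimizer state across GPUs
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

model = torch.nn.Linear(1280, 1280)  # stand-in for a bio-sequence LLM
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```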
Packaging PyTorch models and compiling them for your specific architecture can help, too. Many biological algorithms are optimized for NVIDIA Ampere GPUs, so choosing an Ampere card (look for the “A” prefix, such as the A100) brings significant benefits like bf16 and TF32 mixed precision. For instance, MMseqs2 says of its GPU-accelerated multiple sequence alignments: “This requires an NVIDIA GPU of the Ampere generation or newer for full speed, however, also works at reduced speed for Tesla-generation GPUs.”
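For example, on an Ampere card a few PyTorch settings enable TF32 and bf16, and torch.compile (PyTorch 2.x) specializes the model for the local hardware. The layer below is a stand-in for a real bio-sequence model:

```python
# Sketch: Ampere-oriented precision settings plus compilation.
# Assumes PyTorch >= 2.0 and an Ampere-class GPU (e.g., A100).
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 matmuls on Ampere
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.TransformerEncoderLayer(
    d_model=1280, nhead=20, batch_first=True
).cuda().eval()
model = torch.compile(model)  # compile for the local architecture

x = torch.randn(8, 512, 1280, device="cuda")  # placeholder batch of embeddings
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    out = model(x)  # bf16 mixed-precision forward pass
```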
When finetuning LLMs, BioLM uses Population Based Bandits (PB2) to reach strong predictive metrics with few trials and GPUs. For speed, ASHA hyperparameter optimization can terminate underperforming trials early and reach similar metrics in less time than PB2. It is also worth considering training an XGBoost model or a simple neural net on top of the LLM’s embeddings, which lowers training and inference cost: no new checkpoint needs to be finetuned, and predictions can reuse the pretrained model’s embeddings (see the sketch below).
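A minimal sketch of that embeddings-plus-lightweight-head approach, with synthetic arrays standing in for per-sequence embeddings and labels from a pretrained model:

```python
# Sketch: a lightweight head on top of frozen LLM embeddings, so no new LLM
# checkpoint has to be finetuned. Arrays below are synthetic placeholders.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1280))   # e.g., mean-pooled per-sequence embeddings
y = rng.integers(0, 2, size=2000)   # e.g., a binary functional label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```

In practice the embeddings come from a single forward pass of the frozen pretrained model, so the only training cost is the small head.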
Cloud Economics: Renting Instead of Buying
Most bio-AI labs rent rather than own GPUs. The biggest gain is being able to spin up as many GPUs as needed, of any type, without the capital outlay of building your own GPU servers. Cloud platforms such as AWS, GCP, Azure, and Lambda offer hourly access to high-end GPUs, including the A100 and V100. Spot Instances can reduce costs by more than 70%, but come with the risk of interruption.
An example: on AWS, a 16GB V100 costs:
- Spot: ~$0.92/hr
- On-Demand: ~$3.06/hr
Folding 10,000 proteins of approximately 380 aa each, sequentially on a single V100 (e.g., using ESMFold), takes ~39 hours (a back-of-envelope check of these numbers follows the figures below). That’s:
- ~$36.28 (Spot)
- ~$120.67 (On-Demand)
Full 24-hour usage costs:
- $22.08 (Spot)
- $73.44 (On-Demand)
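A quick back-of-envelope check of those numbers; rates and per-structure times are approximate, so the totals differ from the figures above only by rounding:

```python
# Rough check of the single-V100 folding costs above.
spot_rate, on_demand_rate = 0.92, 3.06   # approximate $/hr for a 16GB V100 on AWS
hours = 10_000 * 14.2 / 3600             # ~14.2 s per ~380 aa fold -> ~39.4 h total
print(f"sequential runtime: ~{hours:.1f} h")
print(f"spot: ~${spot_rate * hours:.2f}    on-demand: ~${on_demand_rate * hours:.2f}")
print(f"per 24 h: ${spot_rate * 24:.2f} (spot) / ${on_demand_rate * 24:.2f} (on-demand)")
```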
Note that additional charges apply for storage, data egress, and other cloud services. However, the ~39 hours a single V100 needs to fold 10,000 proteins can be cut down simply by adding more cloud GPUs to the task. In essence, this is what BioLM’s architecture does on demand: because we prioritize speed, cost, and performance, if 10,000 folds were submitted to our APIs, our backend would spin up dozens of GPUs to run our optimized ESMFold and stream results back.
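A minimal sketch of that scaling pattern, assuming a hypothetical fold_sequence() wrapper around a structure predictor such as ESMFold; Ray schedules one task per GPU and fans the batch out across however many GPUs the cluster has:

```python
# Sketch: fan folding jobs out across all available GPUs with Ray.
# fold_sequence() is a hypothetical wrapper around a structure predictor;
# swap in a real model call for actual use.
import ray

ray.init()  # on a cluster, this connects to the head node

@ray.remote(num_gpus=1)  # each task waits for, then occupies, one GPU
def fold_sequence(seq: str) -> str:
    # Placeholder: load/run the model on this task's GPU and return a PDB string.
    return f"PDB for {len(seq)}-residue sequence"

sequences = ["MKT..." for _ in range(10_000)]            # placeholder sequences
futures = [fold_sequence.remote(s) for s in sequences]   # one task per sequence
pdbs = ray.get(futures)                                  # Ray packs tasks onto free GPUs
```

In practice you would batch sequences per task and keep the model loaded (e.g., with Ray actors) to amortize model-loading time.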
However, with more consistent GPU usage, such as for finetuning or pre-training, investing in on-prem GPUs becomes much more cost-effective. For instance, a 16GB NVIDIA V100 can be purchased for less than $1,000 US today. Unless your software is designed to handle interruptions from Spot reclamations, finetuning in the cloud might mean running a V100 On-Demand for two weeks, for a total of about $1,028.
Fast Inference, Minimal Dependencies
When screening thousands to millions of designed molecules, choose models with fast inference and minimal dependencies to keep costs down. For instance, ESMFold predicts protein structures far faster than AlphaFold2, requires no external sequence databases, and delivers roughly 99% of AF2’s accuracy.
| Protein Length | AlphaFold2 | ESMFold | Speedup |
|---|---|---|---|
| 384 residues | 752s | 14.2s | ~53x |
| 128 residues | 240s | 0.6s | ~400x |
This makes it feasible to batch-predict tens of thousands of proteins, whereas AlphaFold2 would be significantly more expensive. An initial structure screen using ESMFold would allow a protein designer to minimize the number of sequences run through a final AlphaFold2 screen.
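As a concrete example, batch prediction with the open-source fair-esm package looks roughly like this; the sequences are placeholders, and the model itself still needs a large-VRAM GPU:

```python
# Sketch: batch ESMFold inference with fair-esm (pip install "fair-esm[esmfold]").
# Sequences below are placeholders for a real screening set.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval().cuda()

sequences = {
    "seq_0001": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
    "seq_0002": "GSHMGSGS" * 10,
}

for name, seq in sequences.items():
    with torch.no_grad():
        pdb_string = model.infer_pdb(seq)  # returns a PDB-format string
    with open(f"{name}.pdb", "w") as fh:
        fh.write(pdb_string)
```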
Example Pre-training: Cost Breakdown
Let’s use the ESM-1v models as an example. The ESM-1v models underwent extensive pre-training before being used for function prediction tasks. First, each model was pre-trained for 6 days on 64 V100 GPUs. Then the weights for the MSA Transformer were integrated from the authors’ open-source repository. Finally, the model underwent another 13-day pre-training phase on 128 V100 GPUs.
- Phase 1: 64 V100s × 6 days
- Phase 2: 128 V100s × 13 days
Total GPU hours: 49,152
Training used 8-GPU instances, i.e., 6,144 instance-hours (the arithmetic is reproduced below):
- Low estimate: $9.36/hr → $57,508
- High estimate: $31.22/hr → $191,816
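The arithmetic behind those totals:

```python
# Reproduce the GPU-hour and cost totals above for the two pre-training phases.
gpu_hours = 64 * 6 * 24 + 128 * 13 * 24      # 9,216 + 39,936 = 49,152 GPU-hours
instance_hours = gpu_hours / 8               # 8 GPUs per instance -> 6,144 instance-hours
for label, rate in [("low", 9.36), ("high", 31.22)]:
    print(f"{label}: ${rate * instance_hours:,.0f}")  # ~$57,508 and ~$191,816
```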
Each of the five models was trained using different UniRef clustering thresholds (30% to 100%) to improve generalizability.
What This Means
Access to high-end GPUs is essential for both training and deploying bio-sequence LLMs, which deliver greater accuracy and throughput than traditional ML models. While cloud platforms make it possible to run large-scale workloads without capital expenditure, significant effort still goes into benchmarking GPUs, developing software optimizations, and matching workloads to the right hardware to achieve high-throughput, low-cost training and inference. BioLM takes all of these aspects into consideration when deploying a new API, and tunes each API workload to maximize throughput.
By understanding the GPU demands of LLMs and the economics of cloud computing, research teams can scale their models without overspending—and without compromising.