In this blog, we dive into the computational demands of bio-sequence LLMs, highlighting the crucial role of GPUs in both their development and application. We emphasize that while pre-training requires the most resources, both fine-tuning and prediction still necessitate significant computational power. Additionally, this blog explores cost-effective solutions for accessing powerful GPUs, including cloud computing options and resource optimization strategies.

Demystifying GPU Requirements for Bio-Sequence LLMs: A Closer Look: A thriving ecosystem of bio-sequence LLMs has emerged, with at least 40 well-established models available. These models typically rely on NVIDIA hardware, and their published specifications outline GPU and memory requirements. GPU demands for prediction are generally less stringent than those for pre-training and fine-tuning. As a rule of thumb, however, treat the VRAM of the GPU used for pre-training as the baseline requirement for subsequent fine-tuning and prediction. For example, a model trained on an 80GB NVIDIA A100 calls for a GPU with at least 80GB of VRAM for downstream tasks like fine-tuning and inference. Similarly, a model trained on a 48GB NVIDIA A6000 calls for a GPU with at least 48GB for post-training use. GPUs are therefore critical to both the creation and the application of bio-sequence LLMs.
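The rule of thumb above can be written as a one-line check. This is a minimal sketch, not a sizing tool: the function name `meets_vram_baseline` and the candidate list are illustrative, and the VRAM figures are simply the ones quoted in this post.

```python
def meets_vram_baseline(training_vram_gb, candidate_vram_gb):
    """Rule of thumb: a GPU is suitable for fine-tuning or inference
    if it has at least as much VRAM as the GPU used for pre-training."""
    return candidate_vram_gb >= training_vram_gb


# VRAM sizes (in GB) of GPUs mentioned in this post.
candidates = {"A100 80GB": 80, "A6000 48GB": 48, "V100 16GB": 16}

# Example: a model that was pre-trained on a 48GB A6000.
training_vram_gb = 48
usable = [
    name
    for name, vram_gb in candidates.items()
    if meets_vram_baseline(training_vram_gb, vram_gb)
]
print(usable)
```

In practice, smaller GPUs can sometimes serve inference, since prediction is less demanding than training. The check is deliberately conservative.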

Researchers often rely on rented compute resources, particularly GPUs, to run demanding AI workloads. The Big 3 cloud platforms (Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure) offer flexible hourly rental options for GPUs like the A100. This model lets researchers use expensive hardware, like the $10,000 A100, for specific tasks without incurring the full purchase cost. BioLM prioritizes cost-effective solutions by using Spot Instances whenever possible, since they are priced below traditional On-Demand instances.

The cost of renting a V100 GPU on AWS for one day ranges from $32.16 to $195.12 depending on the instance type (number of GPUs, RAM, vCPUs) and pricing model (on-demand or spot). Spot instances offer significant discounts but can be interrupted, while on-demand instances are predictable but more expensive 1.

Cost-Effective Protein Structure Prediction with ESMFold and Cloud Computing: ESMFold predicts protein structures significantly faster than AlphaFold2. On a single NVIDIA V100 GPU, ESMFold predicts a 384-residue protein structure in 14.2 seconds, compared to AlphaFold2’s 85 seconds, a roughly 6x speedup. The advantage is even greater for shorter sequences: predicting a 128-residue protein structure takes only 0.4 seconds, a 60x improvement over AlphaFold2. Furthermore, unlike some methods, ESMFold does not require external databases for prediction, which further reduces computation time. Note that these timings cover prediction only and exclude the CPU time required for multiple sequence alignment (MSA) and template search, which can be significant for other methods 2.

Taking the above numbers into account, predicting 10,000 protein structures sequentially on a V100 GPU would take about 39 hours, assuming 14.2 seconds per structure. The 16GB V100 on AWS costs $0.92 per hour Spot or $3.06 per hour On-Demand, so folding 10,000 proteins on this hardware would cost approximately $36.28 in Spot compute or $120.67 On-Demand. For comparison, renting a V100 GPU for a full day (24 hours) costs $22.08 with Spot Instances and $73.44 with On-Demand Instances.
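The estimate above is plain arithmetic, so it is easy to rerun with your own numbers. The sketch below uses only figures quoted in this post. The function name `folding_cost` is illustrative, and the calculation assumes every protein takes the 384-residue time of 14.2 seconds and that the GPU is billed only while predicting.

```python
# Figures quoted in the text; names are illustrative.
SECONDS_PER_STRUCTURE = 14.2      # ESMFold, 384 residues, single V100
SPOT_USD_PER_HOUR = 0.92          # 16GB V100 on AWS, Spot
ON_DEMAND_USD_PER_HOUR = 3.06     # 16GB V100 on AWS, On-Demand


def folding_cost(n_structures, seconds_each, usd_per_hour):
    """Return (gpu_hours, cost_usd) for sequential prediction on one GPU."""
    gpu_hours = n_structures * seconds_each / 3600
    return gpu_hours, gpu_hours * usd_per_hour


hours, spot_cost = folding_cost(10_000, SECONDS_PER_STRUCTURE, SPOT_USD_PER_HOUR)
_, on_demand_cost = folding_cost(10_000, SECONDS_PER_STRUCTURE, ON_DEMAND_USD_PER_HOUR)

print(f"GPU hours for 10,000 structures: {hours:.1f}")
print(f"Spot: ${spot_cost:.2f}, On-Demand: ${on_demand_cost:.2f}")

# Cost of holding the GPU for a full 24-hour day.
spot_day = 24 * SPOT_USD_PER_HOUR
on_demand_day = 24 * ON_DEMAND_USD_PER_HOUR
print(f"Full day: ${spot_day:.2f} Spot, ${on_demand_day:.2f} On-Demand")
```

The results agree with the quoted figures to within a few cents. The small difference comes from rounding the run time. Shorter proteins fold faster, so a real workload with mixed sequence lengths could cost less than this estimate.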

Beyond the hourly cost of AWS instances, expect charges for data transfer, storage, and software licenses. Data transfer refers to the movement of data in and out of your instance, such as downloading data sets or uploading results. EBS storage charges apply to any data you store on the attached volumes, which provide additional storage space beyond the instance’s local disk. Finally, if your instance uses any licensed software, such as operating systems or specialized tools, those licenses carry additional fees.

Unveiling the Engine Behind ESM-1v: Training Regime, Resource Requirements, and Cost Analysis: The ESM-1v models undergo extensive pre-training before being used for function prediction tasks. First, each model uses 64 V100 GPUs for a 6-day pre-training period. Then, the weights for the MSA Transformer are integrated from the open-source repository provided by the authors. Finally, the model undergoes another pre-training phase for 13 days using 128 V100 GPUs. This two-stage process prepares the models for accurate function prediction. Notably, once trained, the models are highly efficient for forward inference, requiring minimal additional computational resources during application. To account for potential variations in protein sequence data, five distinct ESM-1v models were trained using various UniRef clustering thresholds, ranging from 30% to 100% at 10% increments. This provides a diverse set of models capable of tackling a wide spectrum of protein function prediction challenges 3.

Cost breakdown for ESM-1v:
- GPU hours: 9,216 + 39,936 = 49,152
- Instance hourly cost, Spot: $9.36 (8x V100s)
- Instance hourly cost, On-Demand: $31.22 (8x V100s)
- Number of instances: 8
- Hours per instance: 768
- Estimated pre-training cloud cost: $57,508 (min, Spot) – $191,816 (max, On-Demand)
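The breakdown above can be reproduced with a few lines of arithmetic. This is a sketch under the figures given in this post: two training phases (64 V100s for 6 days, then 128 V100s for 13 days) billed on 8-GPU instances. The variable names are illustrative. Reading the lower and higher hourly rates as Spot and On-Demand prices is an assumption, based on how they line up with the min and max estimates.

```python
GPUS_PER_INSTANCE = 8

# (number of V100 GPUs, days) for each training phase described in the text.
phases = [(64, 6), (128, 13)]

# Hourly rates for one 8x V100 instance, as listed in the breakdown.
# Assumption: the lower rate is Spot and the higher rate is On-Demand.
LOW_USD_PER_INSTANCE_HOUR = 9.36
HIGH_USD_PER_INSTANCE_HOUR = 31.22

gpu_hours_per_phase = [gpus * days * 24 for gpus, days in phases]
gpu_hours = sum(gpu_hours_per_phase)

# Each instance-hour supplies 8 GPU-hours.
instance_hours = gpu_hours // GPUS_PER_INSTANCE

low_cost = instance_hours * LOW_USD_PER_INSTANCE_HOUR
high_cost = instance_hours * HIGH_USD_PER_INSTANCE_HOUR

print(f"GPU hours per phase: {gpu_hours_per_phase}, total: {gpu_hours}")
print(f"Instance hours: {instance_hours}")
print(f"Estimated cost: ${low_cost:,.0f} to ${high_cost:,.0f}")
```

The 6,144 instance-hours equal the 8 instances times 768 hours per instance listed in the breakdown.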

Accelerating Bio-Sequence Research with Powerful GPUs: By understanding the intricacies of GPU demands throughout the bio-sequence LLM development pipeline, researchers can optimize their resource allocation and expedite their scientific breakthroughs. The availability of flexible cloud computing options, coupled with cost-effective solutions like BioLM.ai’s approach, democratizes access to powerful GPUs, empowering researchers regardless of their financial constraints. With these tools and strategies at their disposal, bio-AI researchers can unlock the full potential of LLMs and accelerate their journey towards life-changing discoveries.

Authors:

Zeeshan Siddiqui: Bioinformatics Scientist and DevRel @BioLM

Nikhil Haas: CEO @BioLM

  1. G4dn.xlarge (no date) Vantage. Available at: https://instances.vantage.sh/aws/ec2/g4dn.xlarge (Accessed: 15 January 2024).
  2. Lin, Z. et al. (2023) ‘Evolutionary-scale prediction of atomic-level protein structure with a language model’, Science, 379(6637), pp. 1123–1130. doi:10.1126/science.ade2574.
  3. Meier, J. et al. (2021) Language models enable zero-shot prediction of the effects of mutations on protein function [Preprint]. doi:10.1101/2021.07.09.450648.