We contributed several analyses to the Adaptyv EGFR Competition paper released this weekend. Below, we share more background on our contribution and our perspective on the results.
Additional authoring by Chance Challacombe, Ahmad Qamar
The 2024 Adaptyv EGFR Binder Competition was a landmark event – a field test for the current generation of AI-driven protein design methods. Over two rounds, hundreds of molecules were affinity-characterized, providing a critical sampling and assessment of today’s AI design methods under real-world experimental conditions. Uniquely, the competition encompassed a broad variety of molecules, including peptides, polypeptides, scFvs, nanobodies, and other proteins. Such a broad affinity-characterization dataset is rare. Naturally, it piqued our interest, and I shared some thoughts last year on metrics our team put together using the R1 data; shoutout to Ahmad and Chance for putting those stats and visualizations together!
We were no strangers to Adaptyv when the competition was announced. We met the founders at SynBioBeta and have ordered from them before, and despite us being in the US and them in Switzerland, we run in some of the same biotech/techbio startup circles. It is also because of companies like Adaptyv that BioLM is able to focus on developing AI/ML, since we can get expression and assays for a fraction of the cost and effort of setting up our own lab. It was through our proactive discussions about the competition, and our expertise in ML/AI for bio, that Adaptyv reached out to us to contribute to the Competition Paper released this weekend.
One of the first things we noticed when looking at the data is that peptides dominated the affinity characterizations by sheer number. This was no surprise, since the competition’s ranking models, AlphaFold2 and ESM-2, were trained on data containing only several thousand antibodies alongside many millions of other proteins. Consequently, their confidence metrics rank de novo antibodies poorly, since the models have seen very little like them in pretraining. Mind you, these models can rank antibodies that are similar to those in their training data well, especially if you reuse the same CDRs to keep the models’ confidence metrics high. This may have been the strategy behind some of the antibody binder submissions, though it would have required insight into how these models behave. Alter those difficult-to-model CDR-H3 regions, and AF2’s confidence in the binding region would likely plummet. Thus, novel antibodies may have had little chance of ranking well against thousands of other non-antibody molecules.
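To make the ranking mechanics concrete, here is a minimal sketch of scoring candidate sequences by ESM-2 pseudo-log-likelihood, one common flavor of the “model confidence” metrics discussed above. The checkpoint, toy sequences, and scoring details are our own illustrative choices, not the competition’s exact pipeline.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t33_650M_UR50D"  # one public ESM-2 checkpoint (illustrative choice)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def pseudo_log_likelihood(seq: str) -> float:
    """Mask each residue in turn; average the log-probability of the true residue."""
    ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in range(1, ids.shape[1] - 1):  # skip BOS/EOS special tokens
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        logits = model(masked).logits
        total += torch.log_softmax(logits[0, pos], dim=-1)[ids[0, pos]].item()
    return total / (ids.shape[1] - 2)  # length-normalize so short peptides aren't favored

# Toy sequences for illustration only.
candidates = {
    "peptide": "ACDEFGHIKLMNPQRSTVWY",
    "nanobody_fragment": "QVQLVESGGGLVQAGGSLRLSCAAS",
}
scores = {name: pseudo_log_likelihood(s) for name, s in candidates.items()}
# Sequences far from the pretraining distribution (e.g., novel CDR-H3 loops)
# tend to score lower, which biases a confidence-based ranking against them.
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.3f}")
```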
Such discrepancies are highlighted in the paper and show the need to use predictive models suited to your distinct molecular features and contexts. You cannot use one model for everything today. Successful outcomes require different models, tailored strategies, and an understanding of both the contextual biology and the ML. We anticipated this, considering both ESM-2 and AlphaFold2 perform poorly on antibodies in multiple respects. Yet, simultaneously, we observe many successful methods, like BindCraft, using AlphaFold2 predictions to design other molecule types.
I’ll pause here in case you’re wondering whether BioLM submitted anything. As far as therapeutics go, our current work focuses on de novo and biobetter mAbs and nanobodies. Given the above regarding AF2, you can imagine we don’t use it in our design processes, and we knew none of our designs would make the cut under the competition’s ranking metrics. We had previously characterized anti-EGFR designs through Adaptyv to baseline internal ML models, so our motivation for competing was low. We threw in a few unusual-looking de novo nanobody designs at the last minute and, fun fact: they did not bind. The analysis we contributed to the paper does show a correlation between AF2 and ESM-2 metrics for peptides, but not for antibody-domain molecules, which aligns with expectations. Still, this competition created a wonderful opportunity to explore what factors underpinned the success of a large variety of design methods, which is covered further in the paper.
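For the curious, the per-class comparison behind that statement looks something like the sketch below: Spearman correlations between AF2 and ESM-2 scores, computed separately for each molecule class. The file and column names are hypothetical placeholders, not the paper’s actual data schema.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per affinity-characterized design, with columns
# "molecule_class" (e.g., peptide, nanobody), "af2_confidence", and "esm2_pll".
df = pd.read_csv("submissions_with_scores.csv")

# A strong per-class correlation suggests the two metrics agree for that
# molecule type; near-zero correlation (as with antibody domains) suggests
# at least one metric is uninformative there.
for mol_class, group in df.groupby("molecule_class"):
    rho, p = spearmanr(group["af2_confidence"], group["esm2_pll"])
    print(f"{mol_class:>12}: Spearman rho={rho:+.2f} (p={p:.3g}, n={len(group)})")
```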
Another key lesson from the competition was that while AI excels at generating and scoring extensive libraries of potential binders, without proper direction it often struggles with critical downstream factors essential for real-world applications. Browsing the submissions, we noticed many authors described generating a small number of sequences – on the order of a few thousand – using models like RFDiffusion, then using its confidence metrics to select a handful for submission. There was no screening for expression or stability, and of course none for manufacturability, aggregation, immunogenicity, or storage conditions (e.g., temperature sensitivity, stability over time). None of these were part of the competition requirements, so why even consider them? But if the only goal was affinity, one would expect the hit rate to be greater than R1’s 3%, no?
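To illustrate what a broader in-silico funnel could look like beyond the generator’s own score, here is a minimal sketch. Both predictors below are toy heuristics we made up for illustration; in a real pipeline they would be trained expression and aggregation models, and the thresholds would be calibrated rather than hard-coded.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    seq: str
    design_confidence: float  # e.g., a generator/AF2 self-consistency score

def hydrophobic_fraction(seq: str) -> float:
    return sum(aa in "AILMFWVY" for aa in seq) / len(seq)

# Toy stand-ins for real predictors; NOT actual expression/aggregation models.
def predict_expression(seq: str) -> float:
    return 1.0 - hydrophobic_fraction(seq)  # very hydrophobic -> assume poor expression

def predict_aggregation(seq: str) -> float:
    return hydrophobic_fraction(seq)  # hydrophobic patches drive aggregation risk

def passes_screen(c: Candidate) -> bool:
    # The point: don't stop at the generator's own confidence score.
    return (
        c.design_confidence >= 0.80
        and predict_expression(c.seq) >= 0.55
        and predict_aggregation(c.seq) <= 0.45
    )

candidates = [
    Candidate("design_001", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 0.91),
    Candidate("design_002", "LLLIVVAFWWLAILVMLFIWAVLL", 0.88),  # greasy: should fail
]
shortlist = [c for c in candidates if passes_screen(c)]
print([c.name for c in shortlist])  # only design_001 survives the funnel
```

Even this crude two-gate funnel changes which designs reach the bench; real developability models only widen that gap.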
Part of the low hit rate was due to that round’s low expression rate, which was in the 70% range. If this were a commercial project, you would have essentially lost 30% of your money at that rate. Critically, it also means you have 30% less data to train ML models on for future design rounds. Over a decade of modeling in this field has certainly made this clear: no expression equals no data. Read the paper to see how the R2 expression rate improved.
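To put rough numbers on that, here is a back-of-the-envelope sketch; the design count and per-design cost are made-up figures, only the ~70% rate comes from the discussion above.

```python
n_designs = 100          # hypothetical submissions sent for synthesis/assay
cost_per_design = 150.0  # hypothetical all-in cost per design (USD)
expression_rate = 0.70   # roughly the R1 rate discussed above

expressed = int(n_designs * expression_rate)               # usable datapoints
wasted = (n_designs - expressed) * cost_per_design         # spend with no data back
cost_per_datapoint = (n_designs * cost_per_design) / expressed

print(f"{expressed} usable datapoints; ${wasted:.0f} spent on non-expressers")
print(f"effective cost per datapoint: ${cost_per_datapoint:.2f}")  # ~43% above nominal
```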
Most of the above was motivation to submit a proposal to the 2024 Bio x ML Hackathon to put together a diverse team to design de novo EGFR nanobodies. Team Silica, made up of 14 members, used their combined expertise to create several de novo nanobody pipelines, and ultimately 93 designs with non-redundant CDR-H3s were tested by Adaptyv. Nine of them were binders, and all 93 expressed. While Silica will publish more detailed methods later, we ended up needing to augment the EGFR competition data with those results to create a fairer assessment of the predictive binding metrics in the paper. Neither dataset alone offers enough statistical power to conclude how models like ESM-2, AF2, and others perform across different molecule types. But combined, we started to see some interesting things emerge, which reinforced previous retrospective analyses of AF2’s failings with antibodies, as well as its utility for other molecule types.
Towards the end of the paper, Adaptyv proposes a broader benchmark initiative, BenchBB. The initiative aims to provide a minimal public challenge for AI binder design methods, built around several carefully selected antigen targets, to promote the development of generalizable protein design methods. One of the problems it aims to address is the disparity in wet-lab validation targets across different published AI architectures and pipelines. In the best case, a given entity would be able to produce good binders for all of the proposed targets, not only one or two.
There are many other considerations the paper does not have the scope to touch upon, such as the immunogenicity risk of different molecule types, developability, IP, and when to create de novo versus affinity-matured designs. In the competition, there was no de novo antibody success – the best binder was an affinity-matured Cetuximab scFv. That does not mean the AI tools were a failure. The tools are good, but today, success still depends on people who know how to use them and translate between biological and ML requirements. AI platforms like BioLM API Services increasingly simplify and accelerate molecule generation, embeddings, and other predictions for those who know how to leverage them. The scale offered is useful, but deciding which molecules to advance still demands experienced judgment. Many participants in the competition learned that a single generative model is not the end-all solution to therapeutically viable molecules, let alone expressible ones. Comprehensive interdisciplinary expertise spanning biological knowledge, AI-driven statistical analysis, and software scalability is often required.
Our own team combines biological insight, deep AI/ML expertise, and years of experience to develop viable AI-designed therapeutics. Our model-agnostic approach ensures that we can optimize designs to meet a variety of unique needs, and we thoroughly vet AI outputs. This means choosing the right models and the ML approach that best match the scientific and business needs.
Ultimately, the future of protein binder design hinges on effectively integrating sophisticated AI methods with robust experimental validation and real-world considerations. For those pushing into areas where rich experimental data does not yet exist, grabbing an off-the-shelf model will not cut it. That is why we are committed to offering the interdisciplinary expertise and practical experience necessary to make AI-designed therapeutics a reality.
Read the full Adaptyv EGFR Competition Paper here.