Cambridge Healthtech Institute's Inaugural
Training Data Generation and Quality
Ensuring Quality Predictions via Attention to Data Quality
January 15 - 16, 2025 ALL TIMES PST
Wednesday, January 15
BENCHMARKING, BIAS, AND CONTROLS
Protein Language Models Are Biased by Unequal Sequence Sampling across the Tree of Life
Frances Ding, PhD, Machine Learning Scientist, Prescient Design, Genentech
Protein language models (PLMs), like all machine learning models, learn biases from data. In this talk I will show that PLMs unintentionally learn a strong species bias. Specifically, PLM likelihoods of protein sequences from certain species (e.g., human, E. coli) are systematically higher, independent of the protein in question. I trace this bias' origins and demonstrate how it can be detrimental for some protein design applications, such as enhancing thermostability.
Data-efficient Training of Natively Paired Antibody Language Models
Bryan Briney, PhD, Assistant Professor, Immunology & Microbial Science, Scripps Research Institute
Antibody language models trained using natively paired data outperform unpaired models, likely due to their ability to learn structurally and functionally relevant cross-chain patterns. Unfortunately, severe data constraints hinder our ability to scale model size and performance. We have evaluated several training techniques for maximizing the training value of every training example to construct a highly data-efficient strategy for antibody language model training.
2:35 pm Refreshment Break in the Exhibit Hall with Poster Viewing
AI Benchmarking Competition Based on High-Throughput Automation and Cloud Lab Experimentation
Peter Kelly, Director, Open Datasets Initiative, Align to Innovate
Align to Innovate, a non-profit research organization, is on a mission to shepherd biology into a data-first discipline through reproducible, scalable, and sharable experimentation. We run a suite of programs that work in conjunction to develop automated wet-lab experimental methods accessible to the community, collect large-scale public protein engineering datasets, and benchmark predictive and generative protein design algorithms. All our work is community-driven, collaborative, and operates under open science principles.
HIGH-THROUGHPUT EXPERIMENTATION
KEYNOTE PRESENTATION: Generation of High-Quality Aggregation Propensity Datasets for Machine Learning by Deep Mutational Scanning and an in vivo Assay
David J. Brockwell, PhD, Professor, School of Molecular and Cellular Biology, University of Leeds
A key requisite of any machine learning campaign is the availability of large volumes of high-quality training data that reports on the property to be predicted. Here we show that a tripartite beta-lactamase assay previously used by our group as a directed evolution screen can be reconfigured into a deep mutational scanning format, providing datasets that can subsequently be used to train predictive models for different biophysical properties.
FEATURED PRESENTATION: High-Throughput Data Generation and Experimental Validation
Gabriel J. Rocklin, PhD, Assistant Professor, Pharmacology, Northwestern University
All proteins continuously fluctuate between different conformational states according to the energies of these states and the barriers between them. Even rare, high-energy states can have large impacts on protein function, aggregation, immunogenicity, and more. These high-energy states are challenging to observe and have never been examined at scale. Using a new high-throughput approach, we quantified protein energy landscapes for 5,000 domains and applied these data to guide protein engineering.
Design of Multifunctional Antibodies with Generative AI and High-throughput Data Iteration
Wei Lu, PhD, Director, AI Drug Design, Aureka Biotechnologies, Inc.
Our generative protein design system combines structure-informed protein models with high-throughput digital biotechnology for scalable functional antibody design. In an obesity therapeutic case, dual-binding antibodies created with this system achieved sub-picomolar potency, surpassing current candidates by 50-fold and clinical benchmarks on early metrics. From design to characterization, the process took under 9 weeks, showcasing a practical, data-driven approach to next-generation therapeutics with minimal experimental overhead.
SPECIAL PRESENTATION
In Vitro Validated Antibody Design Against Multiple Therapeutic Antigens Using Inverse Folding
Amir P. Shanehsazzadeh, Artificial Intelligence Scientist, Absci Corp.
We present an antibody inverse-folding model and demonstrate its application in binder design against 8 therapeutic antigens. For each antigen, we design 100 HCDR3s and 100 HCDR123s using a native antibody-antigen structure, scaffold them into the native antibody, screen using SPR, and identify binders. Inverse-folding has applications to de novo design and lead optimization, and the data from this study are a useful benchmark of antibody-antigen interactions.
Wednesday Night Meet-Up
CANCELLED: What Needs to be Done to Make a Pipeline of Mini-Binders More Developable?
Monica L. Fernandez-Quintero, PhD, Staff Scientist, General Inorganic & Theoretical Chemistry, Scripps Research Institute
Wing Ki Wong, PhD, Senior Scientist, Pharmaceutical Research and Development, Large Molecule Research, Roche Diagnostics GmbH
Network, Inspire Others and Connect
- Get to know fellow peers and colleagues
- Make connections and network with other institutions
- Inspire others and be inspired!
We will meet outside the exhibit hall, then move to the lounge area.
Thursday, January 16
7:45 am Registration and Morning Coffee
PLENARY KEYNOTE SESSION
Transforming Therapeutic Protein Engineering
Marissa Mock, PhD, Executive Director Research, Amgen Inc.
Generative biology is an emerging discipline that integrates artificial intelligence (AI) and machine learning (ML) with advanced life science technologies. The application of generative biology to protein engineering is accelerating the discovery and design of complex proteins with therapeutic potential. Maximizing the benefits of these novel technologies will require seamless integration of both wet- and dry-laboratory technologies.
8:50 am Session Break
DATASET TRAINING FOR SPECIFIC MODELS AND EXPERIMENTS
Training Data Requirements for Antibody-Antigen Binding Affinity Prediction under Multiple Circumstances
Alissa Hummer, DPhil, Postdoctoral Fellow, Stanford University
Antibodies are an important class of medicines, whose efficacy is driven by specific target binding. Given the therapeutic relevance, there have been multiple attempts to predict antibody-antigen binding affinity computationally. I will discuss our findings on how training data influences the success and selection of machine learning strategies to tackle this challenge, ranging from antigen-specific to generalizable and zero-shot affinity prediction.
Enhanced Prediction of Protein-Protein Interface Structure via Augmentation with in vitro Affinity Data
David Noble, Data Scientist II, A Alpha Bio Inc.
The structural complex of a protein-protein interaction (PPI) can yield important mechanistic insights that support drug discovery efforts. Rigid-body docking and predictive models such as AlphaFold-Multimer still perform poorly on difficult but clinically significant systems. Here we present AFInjection, a framework for generating experimental data and incorporating it into AlphaFold to improve complex prediction. AFInjection uses affinity data from in vitro directed coevolution of a PPI to find novel functional sequence pairs, which are incorporated into AlphaFold’s features to better infer the parental complex. We demonstrate the utility of this method on antibody-antigen systems.
Featured Poster Presentation: Leveraging Novel In Vivo Datasets to Generate Machine Learning Models Predicting Protein Aggregation and Developability
Conor McKay, Researcher, Astbury Centre for Structural Molecular Biology, University of Leeds
Protein aggregation impacts neurodegenerative diseases and biotherapeutic manufacturing. Big data tools like AlphaFold and protein language models excel in biology but rely heavily on high-quality training datasets. This work introduces the tripartite β-lactamase assay, a novel method for generating large, high-quality datasets that link protein aggregation to cell survival, enabling deeper insights into protein behavior and aggregation-related challenges.
Featured Poster Presentation: Mimic Antibodies and How to Find Them
Brennan Abanades, PhD, Postdoctoral Fellow, Large Molecule Research, Roche
The majority of antibodies in the PDB that target the same binding site as another, non-antibody protein are mimic antibodies: they share with the other protein a motif composed of key residues at conserved geometrical positions. By investigating mimic antibodies and how they imitate the binding site of other proteins, we develop a method for identifying them in repertoire data and validate it on IL-18RA.
10:30 am Coffee Break in the Exhibit Hall with Poster Viewing
Validating High Throughput Developability Assays to Support ML Model Training
Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca
Classic developability assays are low-to-medium throughput and require complex reagent generation. The large, normalized datasets needed for ML tool building require adapting or replacing these assays with ones amenable to high-throughput (HT) automation. We describe our process and results for validating replacement HT developability assays and an HT developability package compatible with very early HT screening.
TABLE 6: Machine Learning in Biologic Drug Discovery: Leveraging External Data Sources
David Noble, Data Scientist II, A Alpha Bio Inc.
- Quantity: Availability challenges, scaling laws, synthetic data
- Quality: Diversity, leakage, reproducibility, quality vs. quantity
- Collaborative data generation: Industry-academia partnerships, data sharing consortia
- Federated learning: Technical challenges, open-source foundation models
- Intellectual property: Data ownership, balancing openness with commercial interests
- Open-source data: Curation quality, integrating diverse sources with proprietary data

Shamit Shrivastava, CoFounder & CEO, Apoha
The Liquid Brain®, Apoha’s novel physical ecosystem, introduces state diagrams: a new class of time-resolved, high-dimensional biophysical data that grounds AI/ML protein models in physical reality. While existing data often fail to capture the dynamic, real-world behaviour of molecules, the Liquid Brain® bridges this gap via the physics of excitable substrates. Designed with an AI-first mindset, the Liquid Brain® directly interacts with samples to generate predictive insights. Using just 10 μg of antibodies, it performs single-shot classification of developability risks, validated against 12 biophysical assays across 135 clinical-stage antibodies. In a customer study, it identified outliers in 70 VHH-Fc fragments with >80% accuracy via simple unsupervised algorithms and without any training.
Our dataset of risk scores for 135 clinical-stage antibodies is now available on Hugging Face. Join us to discover how the Liquid Brain® can empower you with ML-ready state diagram embeddings!
1:10 pm Ice Cream & Cookie Break in the Exhibit Hall with Last Chance for Poster Viewing
DATASET GENERATION AND CURATION
Curation Strategies for R&D Pipeline Data
Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.
Model-based prediction of biologics developability properties will increase speed to clinic. Previous pipeline program data is a valuable data source for training models but requires data curation, contextualization, annotation, and quality control for this new use. I will describe how we incorporated historical data using data quality control protocols to create reusable data products for machine learning prediction of key attributes, including hydrophobicity and polyspecificity of monoclonal antibodies.
ML Models for Nanobody Developability Trained on a Purpose-Built Multi-Readout Dataset
Samuel Demharter, PhD, Senior Data Scientist, Discovery Data Science and Protein Science & Technologies, Genmab
The biophysical characterisation of biologics requires significant wet-lab resources. To enable large-scale predictions of millions of molecules, protein-language models have become an attractive proposition to accurately predict lab readouts. However, current machine-learning models are limited in accuracy largely due to lack of high-quality and high-volume training data. In this talk, we present the generation of a maximally informative dataset for the purpose of training machine-learning models for nanobody developability predictions.
A Machine Learning-Driven Approach for Multi-Parametric Optimization of T Cell Engagers
Winston Haynes, PhD, Vice President, Computational Sciences and Engineering, LabGenius Therapeutics
T-cell engagers (TCEs) promise breakthroughs in the treatment of solid tumors, but their progression in the clinic is limited by on-target, off-tumor toxicity. In this talk, I describe how our platform integrates active learning, automation, and high-throughput functional assays to efficiently identify highly selective and potent TCEs. I highlight our utilization of the design-build-test-learn ecosystem to generate high-quality data that powers our machine learning models and therapeutic assets.
Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain
Geraldene Munsamy, PhD, Senior Scientist, Deep Learning, Basecamp Research Ltd.
Scaling laws estimate that over a trillion species exist, yet fewer than 0.00001% have been studied. Powered by a global metagenomic data supply chain, BaseFold offers improved protein structure prediction with increased accuracy, achieving up to 80% reductions in RMSD values. This leads to more reliable predictions, better docking results, and advances in therapeutic development, all while incentivizing biodiversity protection.
4:05 pm Close of BioLogic Summit