2025 ARCHIVES

Cambridge Healthtech Institute's Inaugural

Training Data Generation and Quality

Ensuring Quality Predictions via Attention to Data Quality

January 15 - 16, 2025 ALL TIMES PST

The success of machine learning (ML) models in drug discovery relies heavily on the quality and relevance of training data. CHI's Training Data Generation and Quality conference will explore cutting-edge strategies for creating high-quality datasets that drive the performance of ML models. Attendees will learn about amplification strategies, techniques for mitigating bias and noise, the importance of controls and validation, and methods for ensuring the suitability of datasets for specific applications. The meeting will also delve into emerging approaches such as federated learning and the standardization of data generation practices. Additionally, the conference will showcase the latest advances in closed-loop systems that integrate predictive modeling with high-throughput automated experimentation, enabling adaptive screening, active learning, and in silico benchmarking.

Wednesday, January 15

1:30 pm

Chairperson’s Remarks

Arvind Sivasubramanian, PhD, Director, Computational Biology & Platform Technologies, Adimab LLC

1:35 pm

Protein Language Models Are Biased by Unequal Sequence Sampling across the Tree of Life

Frances Ding, PhD, Machine Learning Scientist, Prescient Design, Genentech

Protein language models (PLMs), like all machine learning models, learn biases from data. In this talk I will show that PLMs unintentionally learn a strong species bias. Specifically, PLM likelihoods of protein sequences from certain species (e.g., human, E. coli) are systematically higher, independent of the protein in question. I trace the origins of this bias and demonstrate how it can be detrimental for some protein design applications, such as enhancing thermostability.

2:05 pm

Data-efficient Training of Natively Paired Antibody Language Models

Bryan Briney, PhD, Assistant Professor, Immunology & Microbial Science, Scripps Research Institute

Antibody language models trained using natively paired data outperform unpaired models, likely due to their ability to learn structurally and functionally relevant cross-chain patterns. Unfortunately, severe data constraints hinder our ability to scale model size and performance. We have evaluated several techniques for maximizing the value of every training example to construct a highly data-efficient strategy for antibody language model training.

2:35 pm Refreshment Break in the Exhibit Hall with Poster Viewing

3:10 pm

AI Benchmarking Competition Based on High-Throughput Automation and Cloud Lab Experimentation

Peter Kelly, Director, Open Datasets Initiative, Align to Innovate

Align to Innovate, a non-profit research organization, is on a mission to shepherd biology into a data-first discipline through reproducible, scalable, and sharable experimentation. We run a suite of programs that work in conjunction to develop automated wet-lab experimental methods accessible to the community, collect large-scale public protein engineering datasets, and benchmark predictive and generative protein design algorithms. All our work is community-driven and collaborative, and operates under open science principles.

3:40 pm

KEYNOTE PRESENTATION: Generation of High-Quality Aggregation Propensity Datasets for Machine Learning by Deep Mutational Scanning and an in vivo Assay

David J. Brockwell, PhD, Professor, School of Molecular and Cellular Biology, University of Leeds

A key requisite of any machine learning campaign is the availability of large volumes of high-quality training data that report on the property to be predicted. Here we show that a tripartite beta-lactamase assay previously used by our group as a directed evolution screen can be reconfigured into a deep mutational scanning format, providing datasets that can subsequently be used to train predictive models for different biophysical properties.

4:10 pm

FEATURED PRESENTATION: High-Throughput Data Generation and Experimental Validation

Gabriel J. Rocklin, PhD, Assistant Professor, Pharmacology, Northwestern University

All proteins continuously fluctuate between different conformational states according to the energies of these states and the barriers between them. Even rare, high-energy states can have large impacts on protein function, aggregation, immunogenicity, and more. These high-energy states are challenging to observe and have never been examined at scale. Using a new high-throughput approach, we quantified protein energy landscapes for 5,000 domains and applied these data to guide protein engineering.

4:40 pm

Design of Multifunctional Antibodies with Generative AI and High-Throughput Data Iteration

Wei Lu, PhD, Director, AI Drug Design, Aureka Biotechnologies, Inc.

Our generative protein design system combines structure-informed protein models with high-throughput digital biotechnology for scalable functional antibody design. In an obesity therapeutic case, dual-binding antibodies created with this system achieved sub-picomolar potency, surpassing current candidates by 50-fold and clinical benchmarks on early metrics. From design to characterization, the process took under 9 weeks, showcasing a practical, data-driven approach to next-generation therapeutics with minimal experimental overhead.

5:10 pm

In Vitro Validated Antibody Design Against Multiple Therapeutic Antigens Using Inverse Folding

Amir P. Shanehsazzadeh, Artificial Intelligence Scientist, Absci Corp.

We present an antibody inverse-folding model and demonstrate its application in binder design against 8 therapeutic antigens. For each antigen, we design 100 HCDR3s and 100 HCDR123s using a native antibody-antigen structure, scaffold them into the native antibody, screen using SPR, and identify binders. Inverse folding has applications in de novo design and lead optimization, and the data from this study are a useful benchmark of antibody-antigen interactions.

5:40 pm Close of Day

6:00 pm

CANCELLED: What Needs to be Done to Make a Pipeline of Mini-Binders More Developable?

Monica L. Fernandez-Quintero, PhD, Staff Scientist, General Inorganic & Theoretical Chemistry, Scripps Research Institute

Wing Ki Wong, PhD, Senior Scientist, Pharmaceutical Research and Development, Large Molecule Research, Roche Diagnostics GmbH

Network, Inspire Others and Connect

Get to know fellow peers and colleagues

Make connections and network with other institutions

Inspire others and be inspired!

We will meet outside of the exhibit hall then transition to the lounge area

Thursday, January 16

7:45 am Registration and Morning Coffee

8:15 am

Chairperson's Remarks

M. Frank Erasmus, PhD, Head, Bioinformatics, Specifica, an IQVIA business

8:20 am

Transforming Therapeutic Protein Engineering

Marissa Mock, PhD, Executive Director Research, Amgen Inc.

Generative biology is an emerging discipline that integrates artificial intelligence (AI) and machine learning (ML) with advanced life science technologies. The application of generative biology to protein engineering is accelerating the discovery and design of complex proteins with therapeutic potential, and maximizing the benefits of these novel technologies will require seamless integration of wet- and dry-laboratory technologies.

8:50 am Session Break

8:55 am

Chairperson’s Remarks

Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca

9:00 am

Training Data Requirements for Antibody-Antigen Binding Affinity Prediction under Multiple Circumstances

Alissa Hummer, DPhil, Postdoctoral Fellow, Stanford University

Antibodies are an important class of medicines, whose efficacy is driven by specific target binding. Given the therapeutic relevance, there have been multiple attempts to predict antibody-antigen binding affinity computationally. I will discuss our findings on how training data influences the success and selection of machine learning strategies to tackle this challenge, ranging from antigen-specific to generalizable and zero-shot affinity prediction.

9:30 am

Enhanced Prediction of Protein-Protein Interface Structure via Augmentation with in vitro Affinity Data

David Noble, Data Scientist II, A Alpha Bio Inc.

The structural complex of a protein-protein interaction (PPI) can yield important mechanistic insights that support drug discovery efforts. Rigid-body docking and predictive models such as AlphaFold-Multimer still perform poorly on difficult but clinically significant systems. Here we present AFInjection, a framework for generating experimental data and incorporating it into AlphaFold to improve complex prediction. AFInjection uses affinity data from in vitro directed coevolution of a PPI to find novel functional sequence pairs, which are incorporated into AlphaFold’s features to better infer the parental complex. We demonstrate the utility of this method on antibody-antigen systems.

10:00 am

Featured Poster Presentation: Leveraging Novel In Vivo Datasets to Generate Machine Learning Models Predicting Protein Aggregation and Developability

Conor McKay, Researcher, Astbury Centre for Structural Molecular Biology, University of Leeds

Protein aggregation impacts neurodegenerative diseases and biotherapeutic manufacturing. Big data tools like AlphaFold and protein language models excel in biology but rely heavily on high-quality training datasets. This work introduces the tripartite β-lactamase assay, a novel method for generating large, high-quality datasets that link protein aggregation to cell survival, enabling deeper insights into protein behavior and aggregation-related challenges.

10:15 am

Featured Poster Presentation: Mimic Antibodies and How to Find Them

Brennan Abanades, PhD, Postdoctoral Fellow, Large Molecule Research, Roche

The majority of antibodies in the PDB that target the same binding site as a non-antibody protein are mimic antibodies: they share with the other protein a motif composed of key residues at conserved geometrical positions. By investigating mimic antibodies and how they imitate the binding site of other proteins, we develop a method for identifying them in repertoire data and validate it on IL-18RA.

10:30 am Coffee Break in the Exhibit Hall with Poster Viewing

11:00 am

Validating High-Throughput Developability Assays to Support ML Model Training

Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca

Classic developability assays are low- to medium-throughput and require complex reagent generation. Building ML tools requires large, normalized datasets, which means adapting or replacing these assays with ones amenable to high-throughput (HT) automation. We describe our process and results for validating replacement HT developability assays and an HT developability package compatible with very early HT screening.

11:30 am Interactive Breakout Discussions

TABLE 5: Internal Data Generation and Curation

Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.

Amplification strategies
Avoiding bias
Closed-loop experimentation
Controls and validation
Dealing with skewed data
Historical data

TABLE 6: Machine Learning in Biologic Drug Discovery: Leveraging External Data Sources

David Noble, Data Scientist II, A Alpha Bio Inc.

Quantity: Availability challenges, scaling laws, synthetic data
Quality: Diversity, leakage, reproducibility, quality vs. quantity
Collaborative data generation: Industry-academia partnerships, data sharing consortia
Federated learning: Technical challenges, open-source foundation models
Intellectual property: Data ownership, balancing openness with commercial interests
Open-source data: Curation quality, integrating diverse sources with proprietary data

12:30 pm Session Break

12:40 pm

LUNCHEON PRESENTATION: State Diagram Embeddings to Ground Protein Models in Physical Reality: Single Shot Biophysical Classification of Antibodies

Shamit Shrivastava, CoFounder & CEO, Apoha

The Liquid Brain®, Apoha’s novel physical ecosystem, introduces state diagrams: a new class of time-resolved, high-dimensional biophysical data that grounds AI/ML protein models in physical reality. While existing data often fail to capture the dynamic, real-world behaviour of molecules, the Liquid Brain® bridges this gap via the physics of excitable substrates. Designed with an AI-first mindset, the Liquid Brain® directly interacts with samples to generate predictive insights. Using just 10 μg of antibodies, it performs single-shot classification of developability risks, validated against 12 biophysical assays across 135 clinical-stage antibodies. In a customer study, it identified outliers in 70 VHH-Fc fragments with >80% accuracy via simple unsupervised algorithms and without any training.

Our dataset of risk scores for 135 clinical-stage antibodies is now available on Hugging Face. Join us to discover how the Liquid Brain® can empower you with ML-ready state diagram embeddings!

1:10 pm Ice Cream & Cookie Break in the Exhibit Hall with Last Chance for Poster Viewing

2:00 pm

Chairperson's Remarks

Winston Haynes, PhD, Vice President, Computational Sciences and Engineering, LabGenius Therapeutics

2:05 pm

Curation Strategies for R&D Pipeline Data

Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.

Model-based prediction of biologics developability properties will increase speed to clinic. Previous pipeline program data is a valuable data source for training models but requires data curation, contextualization, annotation, and quality control for this new use. I will describe how we incorporated historical data using data quality control protocols to create reusable data products for machine learning prediction of key attributes, including hydrophobicity and polyspecificity of monoclonal antibodies.

2:35 pm

ML Models for Nanobody Developability Trained on a Purpose-Built Multi-Readout Dataset

Samuel Demharter, PhD, Senior Data Scientist, Discovery Data Science and Protein Science & Technologies, Genmab

The biophysical characterisation of biologics requires significant wet-lab resources. To enable large-scale predictions for millions of molecules, protein language models have become an attractive proposition for accurately predicting lab readouts. However, current machine-learning models are limited in accuracy, largely due to a lack of high-quality and high-volume training data. In this talk, we present the generation of a maximally informative dataset for the purpose of training machine-learning models for nanobody developability predictions.

3:05 pm

A Machine Learning-Driven Approach for Multi-Parametric Optimization of T Cell Engagers

Winston Haynes, PhD, Vice President, Computational Sciences and Engineering, LabGenius Therapeutics

T-cell engagers (TCEs) promise breakthroughs in the treatment of solid tumors, but their progression in the clinic is limited by on-target, off-tumor toxicity. In this talk, I describe how our platform integrates active learning, automation, and high-throughput functional assays to efficiently identify highly selective and potent TCEs. I highlight our utilization of the design-build-test-learn ecosystem to generate high-quality data that powers our machine learning models and therapeutic assets.

3:35 pm

Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain

Geraldene Munsamy, PhD, Senior Scientist, Deep Learning, Basecamp Research Ltd.

Scaling laws estimate that over a trillion species exist, yet less than 0.00001% have been studied. Powered by a global metagenomic data supply chain, BaseFold offers more accurate protein structure prediction, achieving up to 80% reductions in RMSD values. This leads to more reliable predictions, better docking results, and advancements in therapeutic development, all while incentivizing biodiversity protection.

4:05 pm Close of BioLogic Summit