Train your models

Two model training scenarios

Scenario | Type(s) of training | Your inputs | Trained model(s)
A. No experimental data | Self-supervised fine-tuning on homologs | Homolog data | Base sampler
B. With experimental data | Self-supervised fine-tuning; supervised fine-tuning on your primary objective (DPO); multi-head predictor fine-tuning | Homolog data, assay data, primary objective | Base sampler, conditioned sampler, and scorer

Learn more about what training entails, and how these different models are trained and used in How our models are trained.

You will first need to perform a homolog search to compile a dataset to be used for training sampler models.

As an input, you provide one or more seed sequences (the proteins that define the protein "neighborhood" you want to operate in), and Cradle will run a query on public databases and return a list of homologs. For antibody searches, Cradle will also compute CDR annotations (IMGT numbering).

These are the databases Cradle queries:

Modality | Database | Notes
Single-domain binder (VHH) | OAS (Observed Antibody Space) | Unpaired dataset (single chains)
Multi-domain binder (Fab, IgG, scFv) | OAS | Paired dataset, preserving VH/VL co-evolutionary signal
Enzyme, peptide, other proteins | UniRef | No computed annotation available

The search typically takes a couple of hours, depending on the number of homologs and the sequence length, and you can expect to find on the order of thousands of related sequences. You can review the homolog search report to sanity-check the results:

  • Count. Expect at least hundreds of homologs, ideally thousands.
  • Sequence identity and coverage. Ideally, you see high sequence identity (many closely related proteins) together with high coverage, which gives good data on epistatic interactions across the sequence. Very low sequence identity can indicate that your seed isn't well represented in the database, which can lower performance, but there is a wide range of acceptable values.
  • Distributions, not just averages. The report gives distributions, so you can look for surprises, e.g., clusters of homologs with unexpectedly low coverage.
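If you export the per-homolog values from the report, the checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of Cradle's tooling: the record layout (one `(identity, coverage)` pair per homolog, as fractions between 0 and 1), the function name, and both thresholds are assumptions chosen for the example.

```python
from statistics import median


def summarize_homologs(hits, min_count=100, low_coverage=0.5):
    """Summarize a list of (identity, coverage) pairs, one per homolog.

    min_count and low_coverage are illustrative thresholds (assumptions),
    not values recommended by Cradle.
    """
    if not hits:
        raise ValueError("no homologs to summarize")
    identities = [identity for identity, _ in hits]
    coverages = [coverage for _, coverage in hits]
    # Count homologs in the low-coverage tail rather than relying on the average.
    low_cov = sum(1 for c in coverages if c < low_coverage)
    return {
        "count": len(hits),
        "enough_homologs": len(hits) >= min_count,
        "median_identity": median(identities),
        "median_coverage": median(coverages),
        "low_coverage_fraction": low_cov / len(hits),
    }
```

Reporting the low-coverage fraction alongside the medians is what surfaces a hidden cluster: a set where a quarter of the homologs have poor coverage can still have a healthy-looking median.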

Special cases: A chimera or fusion seed will return hits covering each component separately. A seed that is a subunit of a larger protein will return hits that cover the seed but only a fraction of each homolog. For antibodies, expect good coverage of the framework regions, but close hits for the CDRs are unlikely.
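The subunit case is easier to see with numbers. The sketch below assumes coverage is the aligned length divided by the full length of the seed or of the homolog; the report's exact definition is not stated here, and the function name is hypothetical.

```python
def alignment_coverage(aligned_length, seed_length, homolog_length):
    """Return (seed coverage, homolog coverage) as fractions.

    Assumes coverage = aligned length / full sequence length.
    """
    return aligned_length / seed_length, aligned_length / homolog_length


# A 120-residue subunit seed fully aligned to a 480-residue homolog:
seed_cov, homolog_cov = alignment_coverage(120, 120, 480)
```

Here the seed is fully covered (`seed_cov` is 1.0) while only a quarter of the homolog is (`homolog_cov` is 0.25), which is the pattern to expect when the seed is part of a larger protein.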

Note: If you bring your own homolog collection (e.g., from a screening campaign) you can switch to Advanced mode and select your dataset to be used for training instead of running a search. If you would like to combine your homolog collection with the search results, you will need to create a combined dataset first.
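How the combined dataset is created on the platform is not covered here. As a rough offline illustration of what combining means, the sketch below merges two lists of sequences and drops exact duplicates; the function name and the plain-list input format are assumptions for the example.

```python
def combine_homologs(search_results, own_collection):
    """Merge two lists of sequences, keeping the first occurrence of each.

    Sequences are compared after trimming whitespace and upper-casing;
    empty entries are skipped.
    """
    seen = set()
    combined = []
    for seq in list(search_results) + list(own_collection):
        key = seq.strip().upper()
        if key and key not in seen:
            seen.add(key)
            combined.append(key)
    return combined
```

Search results are listed first, so a sequence present in both sources keeps its position from the search results.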

Configuring and starting the training

Depending on the round configuration, you will follow one of two scenarios:

Scenario A. No experimental data

  1. Select the homolog dataset to be used for training the sampler model.

Scenario B. With experimental data

  1. Select the homolog dataset to be used for training the sampler model.
  2. Add your assays: choose your dataset, and the assays and their scale types will be populated from it.
  3. Set the optimization target: a single property used for sampler fine-tuning.

Start training. The training then runs in the background. Once it finishes, the trained models (the base sampler in Scenario A; the base sampler, conditioned sampler, and scorer in Scenario B) are posted with a unique ID and are available for you to use in the next step.
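The two scenarios can be summarized as a simple lookup from "do you have experimental data?" to what is trained. The sketch below only restates the scenario table in Python; `TrainingPlan` and `plan_training` are hypothetical names for illustration and do not correspond to a Cradle API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPlan:
    training_types: tuple[str, ...]
    inputs: tuple[str, ...]
    trained_models: tuple[str, ...]


def plan_training(has_experimental_data: bool) -> TrainingPlan:
    """Return the training steps, inputs, and resulting models for a round."""
    if not has_experimental_data:
        # Scenario A: homologs only.
        return TrainingPlan(
            training_types=("self-supervised fine-tuning on homologs",),
            inputs=("homolog data",),
            trained_models=("base sampler",),
        )
    # Scenario B: homologs plus assay data and a primary objective.
    return TrainingPlan(
        training_types=(
            "self-supervised fine-tuning",
            "supervised fine-tuning on the primary objective (DPO)",
            "multi-head predictor fine-tuning",
        ),
        inputs=("homolog data", "assay data", "primary objective"),
        trained_models=("base sampler", "conditioned sampler", "scorer"),
    )
```

For example, `plan_training(False).trained_models` contains only the base sampler, while `plan_training(True).trained_models` also includes the conditioned sampler and the scorer.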