Data guidelines
Learn more about best practices to enhance data quantity and quality.
To get the best performance for your protein with Cradle, follow the data guidelines below. The models perform better with larger quantities of data (including negative data, raw data, and replicates) and with high-quality data (including controls).
Improving data quantity
Provide as much high-quality data as possible, because larger datasets give our models richer context about the protein space you are working in. To increase the amount of useful data the models can train on, follow these guidelines:
Include negative data: Include all data, even negative data that doesn't support your hypothesis. This helps models better understand a wider distribution of the protein's fitness landscape, leading to better optimization.
Include raw measurements: Submit direct measurements instead of normalized values when measuring improvement against a baseline. Models typically learn more from raw data than from preprocessed data, because raw data carries the most complete information (for example, the range of melting temperatures for your protein of interest).
Include replicates: When assaying a sequence multiple times, submit raw observations for replicates rather than averaged or pooled data. Every individual data point you upload helps the model learn more about the protein you're optimizing and understand the variability it can expect in measurements.
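As an illustration of the guidelines above, the following is a minimal sketch of a dataset kept in raw, one-row-per-replicate form. The column names (`sequence`, `replicate`, `melting_temp_c`), the sequences, and the values are hypothetical and are not a required Cradle schema. The sketch shows that per-replicate rows preserve the measurement spread, which is lost once values are averaged before upload.

```python
import csv
import io
from statistics import mean, stdev

# Hypothetical column names and values, chosen only for illustration.
# The second sequence is a less stable variant: "negative" data that is
# still worth keeping in the dataset.
RAW_CSV = """sequence,replicate,melting_temp_c
MKTAYIAK,1,61.8
MKTAYIAK,2,62.4
MKTAYIAK,3,62.1
MKTAYIAR,1,48.9
MKTAYIAR,2,50.3
MKTAYIAR,3,49.3
"""


def load_rows(text):
    """Parse CSV text into one dict per individual raw measurement."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["melting_temp_c"] = float(row["melting_temp_c"])
    return rows


def summarize(rows):
    """Group raw replicate values by sequence and report count, mean, and spread."""
    by_sequence = {}
    for row in rows:
        by_sequence.setdefault(row["sequence"], []).append(row["melting_temp_c"])
    return {
        seq: {"n": len(vals), "mean": mean(vals), "stdev": stdev(vals)}
        for seq, vals in by_sequence.items()
    }


rows = load_rows(RAW_CSV)
summary = summarize(rows)
for seq, stats in summary.items():
    print(seq, stats)
```

The summary is computed here only to make the point visible: the mean alone could be derived from the raw rows at any time, but the raw rows cannot be recovered from the mean.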
Improving data quality
The output quality of machine learning models depends on the quality of the data they are trained on. As models learn the relationship between protein sequence and fitness, errors in the dataset can compromise training. To prevent training issues, follow these guidelines:
Standardize conditions: Keep workflows, assay conditions, and protocols consistent across experimental rounds. Variability in experimental setup can confuse models and reduce prediction quality.
Include controls: Include both positive and negative controls in every experiment batch and in follow-up rounds to verify data quality and track experimental variation. This adds valuable context for the models to learn from.
Include Batch ID and Sample ID: Adding identifiers to distinguish samples and experimental batches is crucial in helping the models understand variation in assay data:
- A Sample ID is a unique identifier assigned to a physical protein sample, serving as the primary reference for all derived samples. It ensures that our machine learning models can accurately link assay values to the correct biological sample.
- A Batch ID is a unique identifier assigned to a set of assay values measured under comparable experimental conditions. It helps models distinguish to which batch a measurement belongs. Each batch typically contains a set of controls prepared in a way consistent with the sample. When you repeat or start a new experiment and add a newly purified control, create a new batch ID that includes both the new controls and all measurements from that experimental run.
Remove unreliable data: If your dataset contains measurements you don't trust due to experimental errors or technical issues, exclude them from training data.
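The quality guidelines above can be checked before upload with a short script. The following is a minimal sketch, assuming measurements are held as plain records. The field names (`sample_id`, `batch_id`, `role`, `value`, `trusted`) and the role labels are hypothetical and only illustrate the idea; they are not a Cradle schema or API. The sketch drops measurements marked as untrusted, flags rows without identifiers, and reports batches that lack a positive or negative control.

```python
from collections import defaultdict

# Hypothetical records; field names and role labels are illustrative only.
measurements = [
    {"sample_id": "S001", "batch_id": "B1", "role": "positive_control", "value": 0.92, "trusted": True},
    {"sample_id": "S002", "batch_id": "B1", "role": "negative_control", "value": 0.03, "trusted": True},
    {"sample_id": "S003", "batch_id": "B1", "role": "variant", "value": 0.71, "trusted": True},
    {"sample_id": "S004", "batch_id": "B1", "role": "variant", "value": 0.15, "trusted": False},
    {"sample_id": "S005", "batch_id": "B2", "role": "positive_control", "value": 0.88, "trusted": True},
    {"sample_id": "S006", "batch_id": "B2", "role": "variant", "value": 0.64, "trusted": True},
]

REQUIRED_CONTROLS = {"positive_control", "negative_control"}


def drop_untrusted(rows):
    """Keep only measurements that were not flagged as unreliable."""
    return [row for row in rows if row["trusted"]]


def rows_missing_ids(rows):
    """Return rows that lack a Sample ID or a Batch ID."""
    return [row for row in rows if not row.get("sample_id") or not row.get("batch_id")]


def batches_missing_controls(rows):
    """Map each batch ID to the control roles it is missing, if any."""
    roles_by_batch = defaultdict(set)
    for row in rows:
        roles_by_batch[row["batch_id"]].add(row["role"])
    return {
        batch: sorted(REQUIRED_CONTROLS - roles)
        for batch, roles in roles_by_batch.items()
        if not REQUIRED_CONTROLS <= roles
    }


clean = drop_untrusted(measurements)
missing_ids = rows_missing_ids(clean)
missing_controls = batches_missing_controls(clean)
print(len(clean), missing_ids, missing_controls)
```

In this example, batch `B2` is reported because it has no negative control, which is the kind of gap worth fixing in the lab before the data is used for training.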
Following these guidelines will help you achieve optimal results from Cradle's machine learning capabilities.