Data guidelines

To get the best model performance for your protein with Cradle, follow the data guidelines below. The models perform better with larger quantities of data (negative data, raw data, and replicates) and with higher-quality data (including controls).

More high quality data

Provide as much data as possible. Larger datasets give our models richer context about the protein space, which leads to better predictions.

If you don't have any data for your protein yet, Cradle's models can still generate initial libraries to test in the lab since they are trained on more than 1 billion sequences. However, having data specific to your protein will significantly improve model performance.

Include negative data

Include all data, even negative data that doesn't support your hypothesis. This helps the models learn a wider portion of the protein's fitness landscape, which leads to better optimization.

Include raw data

Submit direct measurements instead of normalized values when measuring improvement against a baseline. Models typically learn more from raw data because it carries the most complete information (e.g., the range of melting temperatures for your protein of interest), some of which is lost once the data has been preprocessed.
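As a rough illustration of what normalization can hide, the sketch below uses made-up melting temperatures from two hypothetical assay runs. The field names (`run`, `sequence_id`, `tm_celsius`) and the helper `normalize_to_baseline` are assumptions for this example only and do not describe a Cradle upload format.

```python
# Hypothetical example data: melting temperatures (deg C) from two assay runs.
# Field names and values are illustrative only, not a Cradle upload format.
raw_measurements = [
    {"run": "run_1", "sequence_id": "baseline", "tm_celsius": 55.0},
    {"run": "run_1", "sequence_id": "variant_a", "tm_celsius": 58.0},
    {"run": "run_2", "sequence_id": "baseline", "tm_celsius": 61.0},
    {"run": "run_2", "sequence_id": "variant_b", "tm_celsius": 64.0},
]


def normalize_to_baseline(rows):
    """Return Tm differences relative to the baseline measured in the same run."""
    baseline_by_run = {
        row["run"]: row["tm_celsius"]
        for row in rows
        if row["sequence_id"] == "baseline"
    }
    return [
        {
            "run": row["run"],
            "sequence_id": row["sequence_id"],
            "delta_tm": row["tm_celsius"] - baseline_by_run[row["run"]],
        }
        for row in rows
        if row["sequence_id"] != "baseline"
    ]


normalized = normalize_to_baseline(raw_measurements)
for row in normalized:
    print(row)
```

After normalization both variants show the same improvement over their baseline, even though their raw melting temperatures differ by 6 degrees. The raw table keeps that distinction, and the normalized values can always be recomputed from it, but not the other way around.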

Include replicates

When assaying a sequence multiple times, submit raw observations for replicates rather than averaged or pooled data. Every individual data point you upload helps the model learn more about the protein you're optimizing and understand the variability it can expect in measurements.
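The following sketch, using invented numbers, shows the kind of information that averaging discards. The sequence names and assay values are hypothetical, and the layout (one row per individual measurement) is only an assumption about how replicate data might be organized.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical replicate observations: one row per individual measurement.
observations = [
    ("variant_a", 10.0),
    ("variant_a", 12.0),
    ("variant_a", 14.0),
    ("variant_b", 12.0),
    ("variant_b", 12.0),
    ("variant_b", 12.0),
]

# Group the individual measurements by sequence.
values_by_sequence = defaultdict(list)
for sequence_id, value in observations:
    values_by_sequence[sequence_id].append(value)

# Averaging keeps only the mean; the spread and replicate count are lost.
summary = {
    sequence_id: {"mean": mean(values), "stdev": stdev(values), "n": len(values)}
    for sequence_id, values in values_by_sequence.items()
}

for sequence_id, stats in summary.items():
    print(sequence_id, stats)
```

Both sequences have the same mean, so an averaged table would make them look identical. Only the individual replicates reveal that one sequence was measured far less consistently than the other.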

Data quality

The output quality of machine learning models depends on the quality of the data they are trained on, so it is important to generate the best possible data. If your data has quality issues, we can help you address them so that it is usable for ML. Aim to keep workflows and conditions the same between rounds of experimentation. If your dataset contains assay values that you don't trust, remove them.

Include controls

Include both positive and negative controls to verify data quality and track experimental variation. This adds valuable context for the models to learn from.
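One possible way to use controls for tracking experimental variation is to compare their readings across plates or rounds. The sketch below is a minimal example under assumed column names (`plate`, `role`, `signal`) and invented values; it is not a description of how Cradle processes controls.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical plate data; the "role" column marks controls versus samples.
measurements = [
    {"plate": "plate_1", "role": "positive_control", "signal": 100.0},
    {"plate": "plate_1", "role": "negative_control", "signal": 10.0},
    {"plate": "plate_1", "role": "sample", "signal": 70.0},
    {"plate": "plate_2", "role": "positive_control", "signal": 80.0},
    {"plate": "plate_2", "role": "negative_control", "signal": 8.0},
    {"plate": "plate_2", "role": "sample", "signal": 56.0},
]


def control_means(rows):
    """Average the control readings for each (plate, role) pair."""
    grouped = defaultdict(list)
    for row in rows:
        if row["role"] != "sample":
            grouped[(row["plate"], row["role"])].append(row["signal"])
    return {key: mean(values) for key, values in grouped.items()}


controls = control_means(measurements)
for (plate, role), value in sorted(controls.items()):
    print(plate, role, value)
```

In this invented dataset the positive control reads lower on the second plate than on the first, which suggests that the lower sample signal there may reflect plate-to-plate variation rather than a real difference between sequences. Without the control rows, that context would be missing.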

Summary of best practices

To summarize,

  • More data leads to better model performance.

  • Include all data, even if it doesn't support your hypothesis.

  • Raw data is preferred over preprocessed data.

  • Multiple observations of the same sequence are valuable.

  • Use controls to verify data quality and track experimental variation.

If you want to learn more about data collection and the truth behind common data myths, read this blog post.

Once you have prepared your data for machine learning, start your project with Cradle. Learn more in Getting started with Cradle.
