Importing data
Last updated
With your assays configured, you can begin importing data. Providing high-quality experimental data enhances model performance. If you don't have any experimental data yet, you can proceed to Objectives.
You can import data by uploading a .csv, .xlsx, or .tsv file with the following columns:
This column contains each variant's raw amino acid sequence. If the same sequence was measured multiple times, report each measurement in its own row rather than grouping them (e.g., as an average).
Sequence strings should only include capital letters corresponding to the canonical amino acids. Follow these guidelines:
Supported characters
A C D E F G H I K L M N P Q R S T V W Y
Unsupported characters
- _ * . : ;
His-tags and other purification segments should not be included. Spaces, tabs, and line breaks (as in a FASTA entry) are also not accepted.
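The character rules above can be checked before upload. The following is a minimal sketch, not part of the platform: the function names `is_valid_sequence` and `invalid_characters` are hypothetical, and the check only covers the rules listed on this page (it cannot detect a His-tag, for example).

```python
# The 20 canonical amino acids listed under "Supported characters".
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_characters(sequence: str) -> list[str]:
    """Return the distinct unsupported characters, in order of first appearance."""
    seen: list[str] = []
    for ch in sequence:
        if ch not in CANONICAL and ch not in seen:
            seen.append(ch)
    return seen


def is_valid_sequence(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only canonical capital letters."""
    return len(sequence) > 0 and not invalid_characters(sequence)


if __name__ == "__main__":
    for candidate in ["MKTAYIAK", "mktayiak", "MKT-AYI*", "MKT AYI"]:
        print(repr(candidate), is_valid_sequence(candidate), invalid_characters(candidate))
```

Lowercase letters, gap and stop symbols, and whitespace are all reported as unsupported, so a FASTA entry would need its header removed and its lines joined before it passes.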
Your data file should contain separate columns for each assay you want to import:
At least one assay values column must be included in the sheet. Each data cell must either contain a numeric value or be empty.
Non-numeric characters, such as %, NA, or -, are not supported.
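A quick pre-upload check for assay cells might look like the sketch below. The function name `is_valid_assay_cell` is hypothetical, and treating surrounding whitespace as ignorable and rejecting `nan`/`inf` are assumptions on our part; the platform's exact parsing rules are not described here.

```python
import math


def is_valid_assay_cell(cell: str) -> bool:
    """True if the cell is empty or holds a plain finite number."""
    text = cell.strip()  # assumption: surrounding whitespace is ignored
    if text == "":
        return True  # empty cells are accepted
    if "_" in text:
        return False  # Python's float() accepts "1_000"; we reject it here
    try:
        value = float(text)
    except ValueError:
        return False  # e.g. "12%", "NA", "-"
    # float() also parses "nan" and "inf"; treat them as non-numeric.
    return math.isfinite(value)


if __name__ == "__main__":
    for cell in ["0.42", "-3", "1e-3", "", "12%", "NA", "-"]:
        print(repr(cell), is_valid_assay_cell(cell))
```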
You can import replicate assay values by adding them as separate rows. To do this, create one row per replicate, each with the same protein sequence and its own assay value. During the import process, you'll need to match assay columns with the corresponding assays you have defined on the platform.
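As an illustration of the replicate layout, the sketch below expands per-sequence replicate lists into one row per measurement and writes them as CSV text. The column names `sequence` and `activity`, the helper names, and the numbers are all made up for the example; use whatever headers your file already has, since columns are matched during import.

```python
import csv
import io


def replicate_rows(measurements: dict[str, list[float]]) -> list[dict]:
    """Expand {sequence: [replicate values]} into one row per measurement."""
    rows = []
    for sequence, values in measurements.items():
        for value in values:
            rows.append({"sequence": sequence, "activity": value})
    return rows


def to_csv_text(rows: list[dict], fieldnames: list[str]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Two replicates for the first sequence, one for the second (made-up values).
    measurements = {"MKTAYIAK": [1.2, 1.4], "MKTVYIAK": [0.1]}
    print(to_csv_text(replicate_rows(measurements), ["sequence", "activity"]))
```

Each replicate keeps its own row, so no averaging happens before upload.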
Do not filter out negative data. Focusing solely on top-performing variants overlooks the insights provided by negative data. Uploading assay data for all variants improves machine learning training and predictive accuracy, as machine learning models learn from both negative and positive data.
Include individual measurements and not the average. If your process involves running replicates, upload each measurement separately rather than submitting averages.
Do not normalize values if you are measuring improvement against a baseline. The model prefers raw measurements of a property as opposed to fold improvement.
Do not use pooled data. Each sequence should correspond to a single measurement. If there is no way to deconvolute pooled data, it cannot be used to train the model.
Learn more in Data guidelines.
A batch is a set of assay values measured under comparable experimental conditions. Each batch typically contains a set of controls prepared in a way consistent with the sample.
When scientists repeat or start a new experiment and add a newly purified control (such as their wild type), that defines a new batch of data. That new batch contains your new controls as well as all the data points in your plate(s).
Note: a batch of data is often the same as a round of data.
Cradle's machine learning models focus on relative relationships between measurements. To ensure accuracy, the models should only compare data that was measured under comparable conditions.
Examples
Data belongs to the same batch when:
Sequences belonging to the same plate were assayed simultaneously under consistent lab conditions, with variability low enough to require only one set of controls for the plate.
Two plates were assayed on different days under the same conditions, with control measurements on both plates showing minimal to no variation.
The activity of sequences on the same plate was measured at three different temperatures such as 55°C, 60°C, and 65°C.
Note: While these measurements belong to the same batch, each temperature should be a different Assay, e.g., activity_55, activity_60, and activity_65.
Data belongs to different batches when:
Two plates were assayed on different days under the same conditions, but either no control sequences were included to assess variability, or the control sequences show significant variation between the two assays.
Measurements come from different engineering rounds unless they are repeated together with the same controls.
The data was generated from the same sequences, which were purified multiple times at different time points.
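The examples above can be turned into a Batch ID column before upload. The sketch below is only one way to do it: the `plate` field, the `batch_id` column name, the batch labels, and the measurement values are all hypothetical. Deciding which plates share a batch remains a judgment based on your controls. Here, plates 1 and 2 are assumed to have consistent controls, while plate 3 comes from a later purification.

```python
def add_batch_ids(rows: list[dict], plate_to_batch: dict[str, str]) -> list[dict]:
    """Return copies of `rows` with a batch ID looked up from each row's plate."""
    tagged = []
    for row in rows:
        tagged_row = dict(row)  # copy so the input rows are left unchanged
        tagged_row["batch_id"] = plate_to_batch[row["plate"]]
        tagged.append(tagged_row)
    return tagged


# Assumption: controls on plate_1 and plate_2 agree, so they share a batch;
# plate_3 was re-purified at a later time point, so it starts a new batch.
plate_to_batch = {"plate_1": "batch_A", "plate_2": "batch_A", "plate_3": "batch_B"}

# Each temperature is its own assay column, as in the note on activity_55/60/65.
rows = [
    {"plate": "plate_1", "sequence": "MKTAYIAK", "activity_55": 1.8, "activity_60": 1.1, "activity_65": 0.4},
    {"plate": "plate_2", "sequence": "MKTVYIAK", "activity_55": 0.9, "activity_60": 0.5, "activity_65": 0.1},
    {"plate": "plate_3", "sequence": "MKTAYIAK", "activity_55": 2.0, "activity_60": 1.3, "activity_65": 0.5},
]

if __name__ == "__main__":
    for row in add_batch_ids(rows, plate_to_batch):
        print(row["sequence"], row["plate"], row["batch_id"])
```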
You upload your data by matching the columns in your file with the relevant fields (Protein sequence, Assay values, Batch ID). If your file contains additional data, you can click Don't import for the platform to ignore those columns. Any column you don't match with a field won't be imported either.
After matching all the columns, click Continue to upload the file. If you haven't assigned a Batch ID to your data, you will be asked to do so here.
Now that you have uploaded your data, you can set your Objectives.