
Prepare your dataset(s)

If you have experimental data to train a new model, navigate to Data in the sidebar:

If you still need to upload data, use New import. You will be guided to import your data in CSV format, associate it with an existing table or schema, and match the columns.

If you have already uploaded the data, use New dataset. You will prepare a point-in-time snapshot of your assay data in the right format for model training. To do this, use the SQL query editor to combine and/or filter data into an ML-ready dataset. Learn more about Working with your data in Cradle.
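As an illustration of the kind of query this step involves, the sketch below uses Python's built-in sqlite3 module with small in-memory tables. The table and column names (variants, measurements, assay, value, batch) are hypothetical and do not reflect Cradle's actual schema, and the SQL dialect in Cradle's query editor may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; not Cradle's actual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (variant_id TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE measurements (variant_id TEXT, assay TEXT, value REAL, batch TEXT);
    INSERT INTO variants VALUES
        ('v1', 'MKTAYIAK'), ('v2', 'MKTAYIAR'), ('v3', 'MKTGYIAK');
    INSERT INTO measurements VALUES
        ('v1', 'thermostability', 50.0, 'b1'),
        ('v2', 'thermostability', 55.0, 'b1'),
        ('v2', 'thermostability', NULL, 'b2'),
        ('v3', 'expression', 1.2, 'b1');
""")

# Combine sequences with their measurements, keep a single assay,
# and filter out rows with missing values.
query = """
    SELECT v.variant_id, v.sequence, m.value, m.batch
    FROM measurements AS m
    JOIN variants AS v ON v.variant_id = m.variant_id
    WHERE m.assay = 'thermostability' AND m.value IS NOT NULL
    ORDER BY v.variant_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The join combines sequence and assay data, and the WHERE clause does the filtering; the resulting rows are the kind of flat, one-measurement-per-row table a training dataset typically needs.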

Data analysis

Once a dataset is created, you can check its contents and quality before proceeding with training.

To set up the analysis, first pick a reference sequence onto which mutations across the dataset will be mapped for diversity analysis. Then select the assays you want to inspect, defining the scale and the units (optional). You can choose among three scale options:

  • Additive. When the assay value scales linearly.

Tip: choose it when you think about improvements for this assay in terms of absolute quantities (e.g., you would think of a 50°C to 55°C increase as a +5°C improvement, not a 1.1-fold improvement).

  • Multiplicative. When the assay value scales logarithmically.

Tip: choose it when you think about improvements for this assay in terms of relative quantities (e.g., a 10-fold improvement in affinity from 40 to 4 nM, not a -36 nM change).

  • Rank. For categorical variables.
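The difference between the additive and multiplicative views can be sketched with the two examples from the tips above. This is only an illustration of the arithmetic; the function names are hypothetical, and it is not a description of how Cradle transforms assay values internally.

```python
import math


def additive_change(before: float, after: float) -> float:
    """Absolute difference: the natural view on an additive scale."""
    return after - before


def log10_fold_change(before: float, after: float) -> float:
    """Log of the ratio: the natural view on a multiplicative scale."""
    return math.log10(after / before)


# Thermostability: a move from 50 to 55 degrees C reads as +5 degrees.
tm_gain = additive_change(50.0, 55.0)

# Affinity: 40 -> 4 is a 10-fold decrease in the measured value,
# i.e. about -1 on a log10 scale.
affinity_step = log10_fold_change(40.0, 4.0)

# On a log scale, every 10-fold step has the same size,
# whereas the absolute differences (-36 vs -3.6) are very different.
next_step = log10_fold_change(4.0, 0.4)

print(tm_gain, affinity_step, next_step)
print(additive_change(40.0, 4.0), additive_change(4.0, 0.4))
```

If equal ratios feel like equal improvements for your assay, the multiplicative scale is likely the better fit; if equal differences do, the additive scale is.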

When the analysis is complete, navigate to the report to inspect your data.

| Section | What it covers | What to look for |
|---|---|---|
| Data quantity | Headline numbers: total measurements, unique variants, and reference sequence length | More data is generally better, but a smaller, cleaner dataset may outperform a large one with poor coverage or noisy replicates |
| Overall distribution of assembly lengths | Sequence lengths across all variants | Extremely short or long values may indicate outliers |
| Overall distribution of assay values | Overall spread of measurements | Roughly unimodal, reasonably symmetric distribution. Isolated bars far from the main distribution may suggest measurement errors or outliers, but could also be exceptionally good or bad variants. Bimodal distributions may indicate batch effects or mixed populations |
| Overall distribution of the number of mutations | Number of mutations each variant carries relative to the reference | Variants with unexpectedly high mutation counts, which may indicate sequences from other projects that have slipped into your data |
| Number of measurements for most frequent assemblies by batch | Spread between repeated measurements of the same variants | Tight error bars indicate a precise assay; large error bars mean harder model learning. Systematic noise across the replicate panel, or exceptionally noisy individual variants, are worth investigating before training |
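Two of the checks in the report, mutation counts relative to the reference and spread between replicates, can be approximated offline with a few lines of Python. The sketch below is a simplified stand-in, not how the report is computed: it assumes equal-length, already aligned sequences (substitutions only), and the sequences and values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def count_mutations(reference: str, variant: str) -> int:
    """Count substituted positions; assumes equal-length, pre-aligned sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length (indels not handled)")
    return sum(r != v for r, v in zip(reference, variant))


def replicate_spread(measurements):
    """Return {sequence: (mean, sample std dev)} for sequences measured 2+ times."""
    grouped = defaultdict(list)
    for sequence, value in measurements:
        grouped[sequence].append(value)
    return {
        seq: (mean(vals), stdev(vals))
        for seq, vals in grouped.items()
        if len(vals) >= 2
    }


# Made-up example data: (sequence, assay value) pairs.
reference = "MKTAYIAK"
data = [
    ("MKTAYIAK", 50.0),
    ("MKTAYIAK", 52.0),
    ("MKTAYIAR", 55.0),
    ("MKTGYIAR", 48.0),
    ("MKTGYIAR", 58.0),
]

for seq in sorted({s for s, _ in data}):
    print(seq, "mutations:", count_mutations(reference, seq))

spread = replicate_spread(data)
for seq, (avg, sd) in spread.items():
    print(seq, "mean:", avg, "std dev:", round(sd, 2))
```

In this toy data, the second replicated variant has a much larger standard deviation than the first, which is the kind of noisy individual variant worth investigating before training.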

If you discover any issues in the data, you may need to create a new dataset that filters out or corrects the affected records. Catching and fixing data issues early is important to maximize the chances of a successful round.