Prepare your dataset(s)
If you have experimental data to train a new model, navigate to Data in the sidebar:
If you still need to upload data, use New import. You will be guided to import your data in CSV format, associate it with an existing table / schema, and match the columns.
If you already have the data uploaded, use New dataset. You will prepare a point-in-time snapshot of your assay data in the right format for model training. To do this, use the SQL query editor to combine and/or filter data into an ML-ready dataset. Learn more about Working with your data in Cradle.
Data analysis
Once a dataset is created, you can check its contents and quality before proceeding with training.
To set up the analysis, first pick a reference sequence onto which mutations across the dataset will be mapped for diversity analysis. Then select the assays you want to inspect and define the scale and, optionally, the units for each. You can choose among three scale options:
- Additive. When the assay scales linearly.
Tip: choose it when you think about improvements for this assay in terms of absolute quantities (e.g., you would think of a 50°C to 55°C increment as a +5°C improvement, not a 1.1-fold one).
- Multiplicative. When the assay scales logarithmically.
Tip: choose it when you think about improvements for this assay in terms of relative quantities (e.g., a 10-fold improvement in affinity from 40 to 4 nM, not a -36 nM change).
- Rank. For categorical variables.
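To make the difference between the additive and multiplicative options concrete, here is a minimal Python sketch that applies both views to the two examples in the tips above. The function names are hypothetical and chosen for illustration only; they are not part of Cradle and say nothing about how Cradle processes each scale internally.

```python
import math


def additive_change(before: float, after: float) -> float:
    """Additive view: the improvement is the absolute difference."""
    return after - before


def log10_fold_change(before: float, after: float) -> float:
    """Multiplicative view: the improvement is the log of the ratio."""
    return math.log10(after / before)


# Thermostability: 50 to 55 degrees C reads naturally as "+5 degrees".
tm_change = additive_change(50.0, 55.0)

# Affinity: 40 to 4 nM and 400 to 40 nM are both 10-fold improvements.
# On a log scale they are the same step size; as absolute differences
# (-36 vs. -360) they look very different.
kd_step_small = log10_fold_change(40.0, 4.0)
kd_step_large = log10_fold_change(400.0, 40.0)
kd_diff_small = additive_change(40.0, 4.0)
kd_diff_large = additive_change(400.0, 40.0)

print(f"Tm change: {tm_change:+.1f}")
print(f"log10 fold changes: {kd_step_small:.2f}, {kd_step_large:.2f}")
print(f"absolute differences: {kd_diff_small:.0f}, {kd_diff_large:.0f}")
```

If equal fold changes should count as equally large improvements wherever they start, the multiplicative option is likely the better fit; if equal absolute differences should, the additive option is.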
When the analysis is complete, navigate to the report to inspect your data.
| Section | What it covers | What to look for |
|---|---|---|
| Data quantity | Headline numbers: total measurements, unique variants, and reference sequence length | More data is generally better, but a smaller, cleaner dataset may outperform a large one with poor coverage or noisy replicates |
| Overall distribution of assembly lengths | Sequence lengths across all variants | Extremely short or long values may indicate outliers |
| Overall distribution of assay values | Overall spread of measurements | Roughly unimodal, reasonably symmetric distribution. Isolated bars far from the main distribution may suggest measurement errors or outliers — but could also be exceptionally good or bad variants. Bimodal distributions may indicate batch effects or mixed populations |
| Overall distribution of the number of mutations | Number of mutations each variant carries relative to the reference | Variants with unexpectedly high mutation counts, which may indicate sequences from other projects that have slipped into your data |
| Number of measurements for most frequent assemblies by batch | Spread between repeated measurements of the same variants | Tight error bars indicate a precise assay; large error bars make it harder for the model to learn. Systematic noise across the replicate panel, or exceptionally noisy individual variants, are worth investigating before training |
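If you want to run a rough version of some of these checks yourself, for example on an exported CSV before importing it, the following Python sketch computes the headline counts, per-variant mutation counts, and replicate spread. It is an illustration under stated assumptions, not how Cradle generates the report: the sequences and values are made-up toy data, the helper names are hypothetical, and mutations are counted as substitutions between equal-length sequences only (insertions and deletions are not handled).

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Optional, Tuple


def count_substitutions(reference: str, variant: str) -> Optional[int]:
    """Count positions that differ from the reference.

    Returns None when the lengths differ, since this simple check
    does not align sequences.
    """
    if len(reference) != len(variant):
        return None
    return sum(r != v for r, v in zip(reference, variant))


def replicate_spread(
    measurements: List[Tuple[str, float]]
) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) per sequence.

    Only sequences with at least two measurements are included.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for sequence, value in measurements:
        grouped[sequence].append(value)
    return {
        seq: (mean(vals), stdev(vals))
        for seq, vals in grouped.items()
        if len(vals) >= 2
    }


# Toy data (hypothetical): (sequence, assay value) pairs.
reference = "MKTAYIAK"
measurements = [
    ("MKTAYIAK", 50.0),
    ("MKTAYIAK", 52.0),
    ("MKTAYIAR", 55.0),
    ("MKTAYIAR", 55.0),
    ("MATAYIAR", 48.0),
]

unique_variants = {seq for seq, _ in measurements}
mutation_counts = {
    seq: count_substitutions(reference, seq) for seq in unique_variants
}
spread = replicate_spread(measurements)

print("total measurements:", len(measurements))
print("unique variants:", len(unique_variants))
print("reference length:", len(reference))
print("mutation counts:", mutation_counts)
print("replicate spread (mean, stdev):", spread)
```

Variants whose mutation count or standard deviation stands out from the rest are the ones to inspect first.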
If you discover any issues in the data, you may need to create a new dataset that filters out or corrects the affected records. Catching and fixing data issues early is important for maximizing the chances of a successful round.