DNA Barcoding Machine Learning R&D
Enhancing DNA Metabarcoding with Machine Learning for Cost-Effective Biodiversity Monitoring
Biodiversity monitoring through DNA barcoding faces a fundamental cost-quality tradeoff. Individual specimen barcoding produces reliable, high-quality species identification but is expensive and time-consuming. Metabarcoding—processing bulk environmental samples—is far cheaper but yields less precise results. For organisations running large-scale monitoring programmes, this creates difficult resource allocation decisions.
Working with the International Barcode of Life, I investigated whether machine learning could bridge this gap: predicting what individual barcoding results would show using only the data available from metabarcoding.
I developed a random forest model to predict relative species counts as they would appear in individual barcoding results, trained on samples where both methods had been applied. The model learns the systematic relationship between metabarcoding outputs and the higher-fidelity individual barcoding results, then applies that relationship to enhance metabarcoding-only samples.
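As a rough illustration of this setup, and not the project's actual code, the sketch below fits a multi-output random forest on paired samples and applies it to metabarcoding-only samples. It assumes scikit-learn's RandomForestRegressor, uses synthetic data with a made-up per-species amplification bias, and takes relative read abundances as the only features. The helper names (to_relative, fit_enhancer, enhance) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def to_relative(counts):
    """Convert per-sample counts (rows) into proportions that sum to 1."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty samples as all zeros
    return counts / totals


def fit_enhancer(meta_reads, barcode_counts, seed=0):
    """Fit a random forest mapping metabarcoding read proportions to
    individual-barcoding species proportions (one output per species)."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(to_relative(meta_reads), to_relative(barcode_counts))
    return model


def enhance(model, meta_reads):
    """Predict barcoding-like relative counts for metabarcoding-only samples."""
    pred = np.clip(model.predict(to_relative(meta_reads)), 0.0, None)
    return to_relative(pred)  # renormalise so each sample sums to 1


# Synthetic stand-in data (assumption): 60 samples, 3 species.
rng = np.random.default_rng(0)
barcode_counts = rng.integers(1, 50, size=(60, 3))  # specimens per species
bias = np.array([3.0, 1.0, 0.3])                    # made-up amplification bias
meta_reads = rng.poisson(barcode_counts * bias * 20)

# The first 40 samples have both methods applied; the last 20 are
# treated as metabarcoding-only.
model = fit_enhancer(meta_reads[:40], barcode_counts[:40])
enhanced = enhance(model, meta_reads[40:])

# Compare raw and enhanced proportions against the held-back barcoding
# results, using mean absolute error.
truth = to_relative(barcode_counts[40:])
raw_mae = np.abs(to_relative(meta_reads[40:]) - truth).mean()
enhanced_mae = np.abs(enhanced - truth).mean()
print(f"raw MAE: {raw_mae:.3f}, enhanced MAE: {enhanced_mae:.3f}")
```

Working in proportions rather than raw counts keeps the features comparable across samples with different sequencing depth. A real pipeline would likely add further features and a proper validation scheme.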
This research demonstrates a pathway for monitoring programmes to reduce costs without a proportional loss of data quality. Rather than choosing between an expensive-but-accurate method and a cheap-but-noisy one, organisations could apply individual barcoding to a subset of samples to train and validate the model, then run metabarcoding across the full programme and use the model's predictions to enhance those results.
Project Partners