 Methodology
 Open Access
“Noisy beets”: impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris
Plant Methods volume 12, Article number: 36 (2016)
Abstract
Background
Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic predictions, a widely used statistical application in plant and animal breeding.
Results
In this study, the sensitivity to noise in the data of five classification methods (K-nearest neighbours, KNN; random forest, RF; ridge logistic regression, LR; and support vector machines with linear or radial basis function kernels) was investigated. A sugar beet population of 123 plants phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers was used. Labels (0/1 phenotype) were randomly sampled to generate noise. From the base scenario without errors in the labels, increasing proportions of noisy labels (up to 50 %) were generated and introduced in the data.
Conclusions
Local classification methods (KNN and RF) showed higher tolerance to noisy labels than methods that leverage global data properties (LR and the two SVM models). In particular, KNN outperformed all other classifiers, with AUC (area under the ROC curve) higher than 0.95 up to 20 % noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20 % noise.
Background
Errors in the data collected for scientific experiments or, especially, for routine industrial applications are referred to as noise in the data, and may arise for several reasons (e.g. instrument errors, human errors, environmental noise, inherent randomness in the physical process, corruption of data [1]). Noisy data are a long-known problem in statistics (e.g. [2]). In spite of efforts to clean the data and produce good quality datasets [3], a certain amount of noise is bound to persist in the data: this needs to be dealt with, and the impact on results assessed (e.g. [4–6]). In binary classification problems, noisy data typically take the form of mislabeled observations or flipped labels [6]. For instance, the carrier status of recessive mutations (e.g. [7]) may inadvertently be inverted in some individuals; the same could happen in the case of resistance/susceptibility to diseases (e.g. rhizomania in sugar beet [8], where a proportion of resistant plants could be mislabeled as susceptible, and vice versa). These are examples of possible phenotypic errors in binomial traits.
In the field of genomics, the effect of mislabeled observations on the statistical power of genome-wide association studies (GWAS) has been recognized in case-control studies in human medicine [9, 10]. Buyske et al. [10] found that a 39-fold larger sample size is required to maintain the same power of analysis in case-control studies with 5 % misclassifications. In animal genetics, pedigree errors and their effect on the accuracy of estimated breeding values are a known issue [11]: for instance, in a pig population with 20 % errors in the pedigree, the average genetic gain showed a reduction in the range 3.2–12.4 % for a number of traits. Noisy data are bound to have a detrimental effect also on whole-genome predictions, which are increasingly used for a variety of phenotypes in plant and animal breeding [12]. Additionally, the current trend in precision agriculture is bringing about novel high-throughput phenotyping systems to measure a vast amount of data in an automatic and continuous way [13], which may well harbor a certain proportion of errors. Automatically generated non-curated datasets are prone to contain errors (e.g. [14, 15]). There are currently no studies that address the issue of noisy data in genomic predictions, either in humans or in plants and animals.
In this paper, the impact of random noise on the accuracy of genomic predictions for binary traits is investigated. Starting from a population of sugar beet with known binomial phenotypes, increasing proportions of noisy labels were randomly generated, and the performance of different classification methods was measured.
Methods
Plant phenotypes and genotypes
In total, 123 sugar beet (B. vulgaris) plants were available, 99 with high and 24 with low root vigor. Plants originated from 18 selected sugar beet lines (15 with high and 3 with low root vigor). Root vigor is linked to nutrient uptake and plant productivity [16] and, in selected sugar beet populations, has usually been treated as a binary trait [17, 18]. Classification of plants into high or low root vigor was based upon phenotypic measurement of root elongation on 11-day-old seedlings: root elongation was on average 12.9 and 2.6 mm/day in high and low root vigor plants, respectively. The clearly bimodal distribution can be seen in Biscarini et al. 2015 [18].
All plants were genotyped for 192 SNP markers with the high-throughput marker array QuantStudio 12K Flex system coupled with Taqman OpenArray technology. The average per-sample and per-marker call rates were 0.984 and 0.969, respectively. Only one SNP had a per-marker call rate \({\le }85\,\%\) and was removed. There were in total 738 missing genotypes (3.14 %). After imputation [19], data were edited for minor allele frequency (MAF): 16 SNPs with MAF \({\le }2.5\,\%\) were discarded. After editing, 175 SNPs evenly distributed across the nine chromosomes of the sugar beet genome were left for the analysis.
Further description of phenotypes and genotypes can be found in [17, 18, 20–22].
The study was conducted in accordance with the existing national and international guidelines and legislation.
Classification models
Based on SNP genotypes, the genomic classification of individual sugar beet plants into the two classes (high and low root vigor) was carried out using the following five models:
K-nearest neighbors (KNN) classifier. The predicted class for plant \(x_0\) was obtained by majority vote among the K closest neighbours. The neighbourhood was determined via Euclidean distances based on SNP genotypes (\(D_E=d(x_0,x_i)=\sqrt{\sum _{j=1}^m (x_{0j}-x_{ij})^2}\), for each neighbour i, over m SNP dimensions). The vote of neighbors could be left unweighted, or weighted by the inverse (\(1-D_E\)) or the reciprocal \(\left( {1}/{D_E} \right)\) of the distance from the unlabelled observation \(x_0\). Whether and how to weight neighbouring observations was determined through cross-validation.
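The voting rule described above can be illustrated with a minimal Python sketch. It is not the Weka implementation used in the study: the function names, the toy genotypes and the rescaling of distances to [0, 1] for the \(1-D_E\) weights (dividing by the largest possible distance for 0/1/2 genotypes) are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two genotype vectors (0/1/2 allele counts)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(x0, genotypes, labels, k=3, weighting="none"):
    """Classify x0 by (optionally weighted) majority vote of its k nearest neighbours.

    weighting: "none" (each vote counts 1), "inverse" (1 - D_E) or
    "reciprocal" (1 / D_E).
    """
    m = len(x0)
    # Assumption: distances are rescaled by the largest possible distance
    # for 0/1/2 genotypes, so that 1 - D_E stays within [0, 1].
    max_dist = 2.0 * math.sqrt(m)
    neighbours = sorted(
        (euclidean(x0, g), y) for g, y in zip(genotypes, labels)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        if weighting == "none":
            weight = 1.0
        elif weighting == "inverse":
            weight = 1.0 - dist / max_dist
        elif weighting == "reciprocal":
            weight = 1.0 / (dist + 1e-9)  # small constant avoids division by zero
        else:
            raise ValueError(f"unknown weighting: {weighting}")
        votes[label] += weight
    return max(votes, key=votes.get)

# Hypothetical toy data: 5 plants, 4 SNPs, label 1 = high root vigor.
toy_genotypes = [[0, 0, 1, 2], [0, 1, 1, 2], [2, 2, 0, 0], [2, 1, 0, 0], [0, 0, 2, 2]]
toy_labels = [1, 1, 0, 0, 1]
prediction = knn_predict([0, 0, 1, 1], toy_genotypes, toy_labels, k=3)
```

Because each prediction depends only on the k closest plants, a flipped label affects only the test plants that have it among their neighbours.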
Random forest (RF) classifier. A large number of classification trees was built on B bootstrapped samples of sugar beet plants. Classification trees were decorrelated by using, at each node, a random subset of m of the 175 SNPs. The final classifier was obtained by majority vote over the B classification trees:
\[\hat{f}(x_i) = \text{majority vote} \left\{ \hat{f}_b(x_i) \right\}_{b=1}^{B} \qquad (1)\]
where \(x_i\) is the vector of SNP genotypes for plant i, and \(\hat{f}_b(x_i)\) is the prediction (high/low root vigor) from the classification tree built on the \(b\)-th bootstrapped data sample. More details on random forest can be found in [23].
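The bootstrap-and-vote logic can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the Weka random forest used in the study: each "tree" is reduced to a single-node decision stump (so the random SNP subset is drawn once per tree), and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Pick the single-SNP threshold rule with the best training accuracy."""
    best = None
    for j in features:
        for threshold in (0.5, 1.5):
            for high_is_one in (True, False):
                preds = [int((row[j] > threshold) == high_is_one) for row in X]
                acc = sum(p == t for p, t in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, threshold, high_is_one)
    _, j, threshold, high_is_one = best
    return lambda row: int((row[j] > threshold) == high_is_one)

def fit_forest(X, y, n_trees=25, m=2, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random subset of m SNPs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample plants with replacement
        feats = rng.sample(range(p), m)             # random subset of m SNPs
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over the individual tree predictions."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]
```

A full implementation would grow deeper trees and redraw the SNP subset at every node; the aggregation step, however, is the same majority vote.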
Ridge logistic regression (LR) classifier. The probability of having high root vigor (\(P(Y=1 \mid X)=p(x)\)) was modeled as a linear combination of the SNP genotypes in a logistic regression model:
\[\text{logit}(p(x_i)) = \log \left( \frac{p(x_i)}{1-p(x_i)} \right) = \sum _{j=1}^m z_{ij}\,{\textit{SNP}}_j \qquad (2)\]
where \(p(x_i)\) is \(P(Y=1 \mid X)\) for individual i with vector of SNP genotypes \(x_i\); \({\textit{SNP}}_j\) is the effect of the \(j\)-th marker; \(z_{ij}\) is the genotype of individual i at locus j (0, 1 or 2 for AA, AB and BB genotypes). Since the number of markers in the model (175 SNPs) exceeds the number of observations (123 plants), an \(\ell _2\)-norm penalization (\(\frac{1}{2} \uplambda \sum _{j=1}^m SNP_j^2\)) was applied to the likelihood function to be maximised [24].
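To make the penalized likelihood concrete, the following Python sketch computes the log-likelihood minus the ridge penalty and maximizes it by plain gradient ascent. It is only an illustration under simplifying assumptions: no intercept is fitted, the optimizer, step size and number of iterations are arbitrary choices, and none of the function names come from the software used in the study.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def penalized_loglik(snp_effects, Z, y, lam):
    """Bernoulli log-likelihood minus the ridge penalty 0.5 * lam * sum(SNP_j^2)."""
    loglik = 0.0
    for z_i, y_i in zip(Z, y):
        p = sigmoid(sum(b * z for b, z in zip(snp_effects, z_i)))
        loglik += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return loglik - 0.5 * lam * sum(b * b for b in snp_effects)

def fit_ridge_logistic(Z, y, lam=1.0, step=0.05, n_iter=500):
    """Maximize the penalized log-likelihood by gradient ascent (illustrative only)."""
    m = len(Z[0])
    beta = [0.0] * m
    for _ in range(n_iter):
        grad = [-lam * b for b in beta]  # gradient of the penalty term
        for z_i, y_i in zip(Z, y):
            resid = y_i - sigmoid(sum(b * z for b, z in zip(beta, z_i)))
            for j in range(m):
                grad[j] += resid * z_i[j]  # gradient of the log-likelihood
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: 6 plants, 2 SNPs coded 0/1/2; label 1 = high root vigor.
toy_Z = [[2, 0], [2, 1], [1, 0], [0, 2], [0, 1], [1, 2]]
toy_y = [1, 1, 1, 0, 0, 0]
toy_effects = fit_ridge_logistic(toy_Z, toy_y, lam=1.0)
```

Larger values of \(\uplambda\) shrink the estimated SNP effects towards zero, which is what keeps the problem well posed when markers outnumber plants.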
Support vector machine with linear kernel (SVMLin). SVMLin maps the vector of SNP genotypes \({\mathbf {x}} \in {\mathbb {R}}\) into a higher dimensional feature space \(\phi ({\mathbf {x}}) \in {\mathbb {H}}\) and constructs a separating hyperplane (linear in \({\mathbb {R}}\)) to classify observations based on the width of the margin M and the sign of the classifier:
\[f(x) = \beta _0 + \sum _{i=1}^n \alpha _i K(x, x_i) \qquad (3)\]
The mapping \({\mathbb {R}} \mapsto {\mathbb {H}}\) is performed by a linear kernel function \(K(x_i,x_{i'}) = \langle x_i,x_{i'} \rangle\) which defines an inner product of pairs of SNP genotype vectors in the space \({\mathbb {H}}\). The intercept \(\beta _0\) and the coefficients \(\alpha _i\) are obtained by maximizing the margin M, whose width is controlled by the hyperparameter C, optimized through cross-validation.
Support vector machine with radial basis function kernel (SVMRbf). As in SVMLin, observations are classified by the sign of Eq. 3 and the width of margin M; the only difference is that in SVMRbf the kernel function K is the radial basis function: \(K(x_i,x_{i'})=\exp \left( -\gamma \sum _{j=1}^p(x_{ij}-x_{i'j})^2\right)\). The width of the margin M is again controlled by the hyperparameter C, while the positive constant \(\gamma\) controls the degree of nonlinearity of the decision boundary.
For a full description of SVM with either linear or radial basis function kernel, see [25].
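As a minimal sketch of how the two kernels enter the classifier, the Python code below evaluates a decision function of the form \(\beta _0 + \sum _i \alpha _i K(x, x_i)\) and classifies by its sign. The support vectors, coefficients and \(\gamma\) value are hypothetical numbers chosen for illustration; in practice they come out of the margin optimization, which is not shown here.

```python
import math

def linear_kernel(a, b):
    """Inner product of two SNP genotype vectors."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision_value(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum_i alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

def classify(x, support_vectors, alphas, beta0, kernel):
    """Class 1 (high root vigor) if f(x) > 0, class 0 otherwise."""
    return 1 if decision_value(x, support_vectors, alphas, beta0, kernel) > 0 else 0

# Hypothetical fitted quantities (not estimated from any real data).
toy_support_vectors = [[2, 2, 1], [0, 0, 1]]
toy_alphas = [1.0, -1.0]   # sign encodes the class of each support vector
toy_beta0 = 0.0

def toy_rbf(a, b):
    return rbf_kernel(a, b, gamma=0.1)

label = classify([2, 1, 1], toy_support_vectors, toy_alphas, toy_beta0, toy_rbf)
```

Swapping `toy_rbf` for `linear_kernel` changes only the similarity measure; the decision rule itself stays the same.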
Tuning the hyperparameters, generating noisy labels and measuring classification accuracy
The hyperparameters in the models were optimised through cross-validation over a range of values: for KNN, the number of neighbors \(K \in \left\{ 1, 3, 5, 7 \right\}\) and their weight \(\in \left\{ 1, 1 - D_E, {1}/{D_E} \right\}\); for LR, the value of the penalty \(\uplambda\); for RF, the number B of bootstrapped trees \(\in \left\{ 1, 5, 10, 50, 100 \right\}\) and the size m of the subset of SNP markers per node \(\in \left\{ j, 2, 4 \right\}\), where j is \(int(log_2(\#\_of\_SNPs) + 1)\); for SVM, the cost parameter \(C \in \left\{ 2^2 \cdots 2^9 \right\}\) for both SVMLin and SVMRbf; for SVMRbf, additionally, the positive constant \(\gamma \in \left\{ 10^{-3} \cdots 10^{+1} \right\}\).
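The search space can be written down explicitly, as in the Python sketch below. Two points are assumptions rather than facts from the text: the C and \(\gamma\) ranges are taken to run over integer exponents, and the lower bound of \(\gamma\) is read as \(10^{-3}\). The grid for the LR penalty \(\uplambda\) is not given in the text and is therefore left out. Names such as `grids` and `candidates` are illustrative.

```python
import math
from itertools import product

n_snps = 175
j = int(math.log2(n_snps) + 1)  # subset size derived from the number of SNPs

grids = {
    "KNN": {"K": [1, 3, 5, 7], "weight": ["none", "1-D", "1/D"]},
    "RF": {"B": [1, 5, 10, 50, 100], "m": [j, 2, 4]},
    # Assumption: integer exponents between the stated bounds.
    "SVMLin": {"C": [2 ** e for e in range(2, 10)]},
    "SVMRbf": {
        "C": [2 ** e for e in range(2, 10)],
        "gamma": [10.0 ** e for e in range(-3, 2)],
    },
}

def candidates(grid):
    """Enumerate every combination of hyperparameter values in a grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

n_candidates = {model: len(candidates(grid)) for model, grid in grids.items()}
```

Each candidate combination would then be scored in the inner cross-validation loop, and the best one used to train the model for the corresponding outer fold.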
To test the impact of phenotyping errors on genomic predictions, an increasing fraction of the observations in the training set was randomly mislabelled: from 0 % (no mislabels) up to 50 % (theoretical maximum noise), through 12 intermediate steps (1, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 25, 30, 40 %). At every step, the corresponding fraction of observations was randomly sampled from the original data and the labels were flipped (\(0 \rightarrow 1\); \(1 \rightarrow 0\)). For each proportion of mislabelled observations, the five classification models were tested with a 5-fold cross-validation scheme. The 123 sugar beet plants were randomly split into 5 subsets of approximately the same size. In turn, the observations in one subset were set to missing and predicted using the model trained with the remaining four subsets, until all subsets were used once as validation set. A further nested 5-fold cross-validation run was applied for hyperparameter optimization. Labels predicted in the validation set were compared to the original (true) labels to measure the accuracy of classification. Each experiment (proportion of mislabelled observations per classification model) was repeated 100 times (\(\times\) 5-fold cross-validation = 500 replicates). Results were averaged to explore the variability of prediction and ensure numeric stability.
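The noise-injection and validation scheme described above can be sketched as follows. This is a simplified Python illustration, not the R and Weka code used in the study: the nested cross-validation for hyperparameter tuning is omitted, the number of flipped labels is obtained by rounding, and the function names and the `fit`/`predict` callables are hypothetical placeholders for any classifier.

```python
import random

def flip_labels(labels, fraction, rng):
    """Return a copy of the 0/1 labels with a random fraction of them flipped."""
    n_flip = round(fraction * len(labels))
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = 1 - flipped[i]
    return flipped

def k_fold_indices(n, k, rng):
    """Randomly split indices 0..n-1 into k folds of approximately equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def noisy_cross_validation(X, y_true, fraction, fit, predict, k=5, seed=0):
    """One replicate: train on noisy labels, score against the true labels."""
    rng = random.Random(seed)
    y_noisy = flip_labels(y_true, fraction, rng)
    n_correct = 0
    for held_out in k_fold_indices(len(y_true), k, rng):
        held = set(held_out)
        train = [i for i in range(len(y_true)) if i not in held]
        model = fit([X[i] for i in train], [y_noisy[i] for i in train])
        n_correct += sum(predict(model, X[i]) == y_true[i] for i in held_out)
    return n_correct / len(y_true)  # overall accuracy for this replicate
```

Repeating the call with 100 different seeds for each noise level would reproduce the structure of the experiment; the key design point is that predictions are always compared with the original labels, never with the noisy ones.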
High root vigor (the majority class) was by convention considered positive and low root vigor (the minority class) negative. The accuracy of genomic predictions was measured as: (1) Total error rate (TER: ratio between the number of classification errors and the total number of predictions), (2) False positive rate (FPR: ratio between wrongly predicted positives and the total predicted positives), and (3) False negative rate (FNR: ratio between false negative predictions and all negative predictions). Additionally, the area under the receiver operating characteristic (ROC) curve (AUC) was recorded to monitor FPR and FNR over all possible classification thresholds in [0,1] [26].
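The Python sketch below shows one way to compute these measures. It assumes the conventional denominators (FPR relative to the true negatives, FNR relative to the true positives), which is the convention under which TPR = 1 − FNR and TNR = 1 − FPR hold, and it computes AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. Function names are illustrative and not taken from the software used in the study.

```python
def error_rates(y_true, y_pred):
    """Total error rate, false positive rate and false negative rate (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "TER": (fp + fn) / len(y_true),
        "FPR": fp / (fp + tn),  # assumption: denominator = all true negatives
        "FNR": fn / (fn + tp),  # assumption: denominator = all true positives
    }

def auc(y_true, scores):
    """Rank-based AUC: share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to scores that carry no information about the class, which is the value all classifiers converge to when half of the labels are flipped.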
Software
All models were implemented using the Weka machine learning suite [27]. The open source statistical environment R [28] was used to generate random noisy labels, to parse results and to produce figures and tables.
Results
Error rates (TER, FPR, FNR) for the five classification models over all mislabeling proportions are reported in Table 1. In general, very low error rates were observed with no phenotyping errors in the data (base scenario): there were no errors, overall or in either class, with KNN, LR and SVMLin; errors were below 0.1 % with SVMRbf and around 1 % with RF.
The average AUC as a function of the proportion of mislabeled observations is a good indicator of the relative performance of the five classification models, and their robustness to noise in the data (Fig. 1). The performance of LR and SVMLin decreased approximately linearly with increasing proportions of mislabeled observations. KNN, RF and SVMRbf appeared to be more robust to noise in the data: AUC was \({\gtrapprox }0.95\) for KNN and RF, and larger than 0.90 for SVMRbf, up to 20 % mislabelled observations. Only beyond 20 % phenotyping errors did their performance start to deteriorate rapidly. With mislabeled observations approaching 50 %, AUC from all classification models quickly converged to 0.50 (absence of any predictive value).
With increasing noise in the data, not only did the average performance decrease, but the genomic predictions also became much more variable. Figure 2 shows the boxplots of the 500 (5-fold cross-validation, repeated 100 times) true positive (TPR = 1 − FNR) and true negative (TNR = 1 − FPR) rates per method and proportion of noisy labels. With few or no phenotyping errors, classifications were consistently very accurate. With KNN and SVMRbf there were virtually no misclassifications up to 7.5 and 10 % mislabeled observations, respectively. With larger fractions of noisy labels, classifications became more unstable and the variability of genomic predictions started spanning the entire range between 0 and 100 % correct classifications.
The proportion of low root vigor plants was 0.195 (24 of 123) in the original data. Mislabeled observations were then generated randomly, and this had an effect on the class balance: the proportion went up to 0.520 with 50 % noise. When increasing proportions of noise were introduced, data got progressively more balanced. The frequency of the minority class for each proportion of noisy labels is reported in Table 1.
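The shift in class balance follows from simple arithmetic. As a sketch, assume that a fraction q of labels is flipped at random regardless of class, and let p be the initial frequency of the low root vigor class (24 of 123 plants, about 0.195). The expected frequency after flipping is

\[p' = p(1-q) + (1-p)q = p + q(1-2p),\]

so that \(q = 0.2\) gives \(p' \approx 0.195 + 0.2 \times 0.61 = 0.317\) and \(q = 0.5\) gives \(p' = 0.5\), in line with the value of about 0.52 reported above. Deviations from the expectation in individual replicates reflect random sampling.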
Discussion
Classifying sugar beet plants into high and low root vigor using SNP genotypes was already shown to be very accurate [17, 18]. This provides an excellent starting point, and ensures that observed classification errors are due to noise in the data and the chosen classification model, and not to intrinsic characteristics of the data that could favor some methods over others.
In general, when noise increases, the rate of misclassifications also increases, together with the variability of genomic predictions, and the two classes get progressively more balanced (which consequently causes TPR and TNR to become more similar). However, while the classification accuracy of LR and SVMLin decreased linearly with the rate of phenotyping errors, KNN, RF and SVMRbf were more robust to noise and showed a similar pattern in their AUC curve.
KNN and RF are semi-parametric statistical methods which are inherently “local” in their behaviour, and therefore tend to be robust to outliers in the data. Neighbourhoods (in KNN) and branches (in RF) use subsets of the data and rely on the prevalent labels in the subset to classify observations. It is unlikely that all, or most, mislabelled observations happen to be in one neighbourhood or branch. Therefore KNN and RF would give good performance up to the point when the subset is dominated by mislabeled observations. When the fraction of mislabeled observations is 20 % or higher, the amount of noise is such that the odds reverse: it becomes unlikely to find local subsets with few or no mislabelled observations, and local methods also begin to fail [29]. In SVMRbf, training observations which are far (in terms of Euclidean distance) from a given test observation \(x^*\) play essentially no role in predicting the class label of \(x^*\) (\(K(x_i,x_{i'})=\exp \left( -\gamma \sum _{j=1}^p(x_{ij}-x_{i'j})^2\right)\) will in fact be very small [30]). This implies that the SVMRbf has a very local behavior, in the sense that only nearby training observations have an effect on the class label of a test observation, similarly to what happens with KNN and RF. This helps explain the similar performance of these three classification methods with increasing noise in the data.
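As a rough numerical illustration of this locality, assume SNP genotypes coded 0/1/2 and take \(\gamma = 0.0921\), the average value selected at 20 % noise in this study (used here purely as an example). If two plants differ by one allele copy at d loci, their squared Euclidean distance is d, and the kernel value is

\[K = \exp (-0.0921\, d): \qquad d=10 \Rightarrow K \approx 0.40, \qquad d=50 \Rightarrow K \approx 0.010, \qquad d=100 \Rightarrow K \approx 10^{-4}.\]

Under these assumptions, a training plant that differs at 50 or more loci contributes about 1 % or less of the weight of a plant with an identical genotype.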
On the other hand, LR and SVMLin work very well in the base scenario, when there are no mislabels. This is because in this classification problem the decision boundary is linear, and the two classes are linearly separable (see also phenotypic distribution in the Supplementary Figure SF1 in [18]). With noisy labels, though, LR and SVMLin tend to degrade faster than local methods because they build on general properties of the data.
Local classification methods proved to be robust to noise up to 20 % mislabeled observations in the dataset. At this proportion of errors, the average hyperparameters had the following values: for KNN, \(\overline{K}=4.49\) (with no weighting being the most frequently selected option, in 40 % of the cases); for RF, \(\overline{B}=37.7\); for SVMRbf, \(\overline{\gamma }=0.0921\). These hyperparameters control the bias/variance trade-off and their optimization depends heavily on the specific training dataset (e.g. size of the data, number of parameters relative to observations). Therefore, the values of the hyperparameters estimated here are not directly applicable to other datasets, but can provide a guide for the space to be explored in similar problems.
Biscarini et al. [18] previously showed that it was possible to reduce the set of markers down to as few as 30 SNPs without losing accuracy of classification. The parsimonious classifier thus developed was tested here with noisy labels. Based on the proportion of variance explained, two subsets with the 50 (SNP50) and 30 (SNP30) most informative SNP loci were extracted and used to classify high and low root vigor sugar beets. The two best performing methods were applied: KNN and RF. Figure 3 shows the AUC for increasing proportions of noise in the data when using all 175 SNPs or subsets with, respectively, 50 and 30 SNPs. The accuracy of classification is practically unaffected by the number of SNPs included in the model. The variability of predictions was also little affected: with fewer SNPs, predictions were only slightly less reliable (e.g. KNN for 5 and 7.5 % noise, see Additional file 1: Figure S1). These results indicate that the informativeness of SNPs appears to be more relevant than sheer SNP density for the accuracy of genomic predictions (e.g. [31]).
Robustness to noise is an aspect of genomic predictions which is currently overlooked, but may be desirable: a classifier that is robust to noisy labels is needed to extract useful information and produce meaningful results even in the presence of noise. Manual phenotyping is known to be prone to errors (e.g. in human medicine [32, 33]). Novel high-throughput phenotyping platforms [34–36], by which very large amounts of data are automatically generated, may alleviate the problem, at least partially. However, automatically generated data are not double-checked for errors, and are therefore liable to contain a residual amount of phenotyping errors. This highlights on the one hand the importance of accurate phenotyping for genomic predictions [37, 38], and on the other the need for prediction methods able to deal with noisy data.
Genomic classification for binary traits is highly relevant in plant breeding (e.g. resistance/susceptibility to diseases [39], which is often controlled by multiple loci, e.g. [40]). In sugar beet, besides root vigor, other binomial characteristics of plants are important: for instance bolting tendency (i.e. premature flowering, negatively related to sugar yield [41]), for which a polygenic nature is increasingly evident [42]; genome-enabled predictions therefore promise to be a valuable technique for breeding.
Conclusions
Noise (errors) is pervasive in scientific data, potentially also in the field of genomics applied to plant breeding. A specific type of error is mislabeled observations (wrongly assigned labels, flipped labels), which are relevant in the analysis of binary traits. The impact of noisy labels on the accuracy of genome-enabled predictions had not been investigated so far; this paper presented a first attempt at understanding what happens when binary phenotypes are incorrect, and how different classification methods respond to increasing proportions of noisy labels in the data. The results of this study indicate that local classification methods seem to be better suited to cope with noisy labels, with KNN outperforming all other classifiers. Overall, genomic predictions for binomial traits seem to be robust to small percentages of phenotyping errors, and the high variability between methods points at the possibility of selecting the best classifier for each problem, depending on the amount of noise and the nature of the decision boundary.
Availability of supporting data
SNP genotypes and high/low root vigor status of the 123 sugar beet samples used in this study are currently not hosted in any open access repository, but are available upon request to the authors.
Abbreviations
SNP: single nucleotide polymorphism
KNN: K-nearest neighbors
RF: random forest
SVM: support vector machines
SVMLin: SVM with linear kernel
SVMRbf: SVM with radial basis function kernel
AUC: area under the curve
TPR: true positive rate
TNR: true negative rate
References
1. Guillet F, Hamilton HJ. Quality measures in data mining, vol. 43. Heidelberg: Springer; 2007.
2. Schlimmer JC, Granger RH Jr. Incremental learning from noisy data. Mach Learn. 1986;1(3):317–54.
3. Rahm E, Do HH. Data cleaning: problems and current approaches. IEEE Data Eng Bull. 2000;23(4):3–13.
4. Cesa-Bianchi N, Shalev-Shwartz S, Shamir O. Online learning of noisy data. IEEE Trans Inform Theory. 2011;57(12):7907–31.
5. Chen Y. Learning with high-dimensional noisy data. PhD thesis, University of Texas, Austin (August 2013).
6. Natarajan N, Dhillon IS, Ravikumar PK, Tewari A. Learning with noisy labels. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in neural information processing systems 26. Proceedings of Neural Information Processing Systems; 2013. p. 1196–1204.
7. Biffani S, Dimauro C, Macciotta N, Rossoni A, Stella A, Biscarini F. Predicting haplotype carriers from SNP genotypes in Bos taurus through linear discriminant analysis. Genet Select Evol. 2015;47(1):1.
8. Pavli OI, Stevanato P, Biancardi E, Skaracis GN. Achievements and prospects in breeding for rhizomania resistance in sugar beet. Field Crops Res. 2011;122(3):165–72.
9. Edwards BJ, Haynes C, Levenstien MA, Finch SJ, Gordon D. Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Genet. 2005;6(1):1.
10. Buyske S, Yang G, Matise TC, Gordon D. When a case is not a case: effects of phenotype misclassification on power and sample size requirements for the transmission disequilibrium test with affected child trios. Human Hered. 2009;67(4):287–92.
11. Long T, Johnson R, Keele J. Effects of errors in pedigree on three methods of estimating breeding value for litter size, backfat and average daily gain in swine. J Anim Sci. 1990;68(12):4069–78.
12. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MP. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–45.
13. Singh A, Ganapathysubramanian B, Singh AK, Sarkar S. Machine learning for high-throughput stress phenotyping in plants. Trends Plant Sci. 2016;21(2):110–24.
14. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 2003;31(1):224–8.
15. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33(suppl 1):501–4.
16. Stevanato P, Saccomani M, Bertaggia M, Bottacin A, Cagnin M, De Biaggi M, Biancardi E. Nutrient uptake traits related to sugarbeet yield. J Sugar Beet Res. 2004;41:89–100.
17. Biscarini F, Stevanato P, Broccanello C, Stella A, Saccomani M. Genome-enabled predictions for binomial traits in sugar beet populations. BMC Genet. 2014;15(1):87.
18. Biscarini F, Marini S, Stevanato P, Broccanello C, Bellazzi R, Nazzicari N. Developing a parsimonius predictor for binary traits in sugar beet (Beta vulgaris). Mol Breed. 2015;35(1):1–12.
19. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Human Genet. 2009;84(2):210–23.
20. Stevanato P, Broccanello C, Biscarini F, Del Corvo M, Sablok G, Panella L, Stella A, Concheri G. High-throughput RAD-SNP genotyping for characterization of sugar beet genotypes. Plant Mol Biol Report. 2014;32(3):691–6.
21. Pi Z, Stevanato P, Yv LH, Geng G, Guo XL, Yang Y, Peng CX, Kong XS. Effects of potassium deficiency and replacement of potassium by sodium on sugar beet plants. Russ J Plant Physiol. 2014;61(2):224–30.
22. Stevanato P, Trebbi D, Biancardi E, Cacco G, McGrath JM, Saccomani M. Evaluation of genetic diversity and root traits of sea beet accessions of the Adriatic Sea coast. Euphytica. 2013;189(1):135–46.
23. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
24. Liu Z, Shen Y, Ott J. Multilocus association mapping using generalized ridge logistic regression. BMC Bioinform. 2011;12(1):1.
25. Vapnik VN, Vapnik V. Statistical learning theory, vol. 1. New York: Wiley; 1998.
26. Fawcett T. ROC graphs: notes and practical considerations for researchers. Mach Learn. 2004;31(1):1–38.
27. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explor Newslett. 2009;11(1):10–8.
28. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2014. http://www.R-project.org
29. Huang KZ, Yang H, Lyu MR. Machine learning: modeling data locally and globally. Heidelberg: Springer; 2008.
30. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning, vol. 112. Heidelberg: Springer; 2013.
31. Erbe M, Hayes B, Matukumalli L, Goswami S, Bowman P, Reich C, Mason B, Goddard M. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci. 2012;95(7):4114–29.
32. Vaughn L, Williams JD, Robertson G, Caglioti S. Reduced error rates with Rh and K phenotyping with automated testing. 2009. http://mycts.org/publications/pdfs/abstracts/AgAbstract.pdf. Accessed 24 Jun 2016
33. Kukhareva P, Staes CJ, Tippetts TJ, Warner PB, Shields DE, Müller H, Noonan K, Kawamoto K. Errors with manual phenotype validation: case study and implications. 2015. https://goo.gl/NnFFWj. Accessed 24 Jun 2016
34. Montes JM, Melchinger AE, Reif JC. Novel throughput phenotyping platforms in plant genetic studies. Trends Plant Sci. 2007;12(10):433–6.
35. Araus JL, Cairns JE. Field high-throughput phenotyping: the new crop breeding frontier. Trends Plant Sci. 2014;19(1):52–61.
36. Fahlgren N, Gehan MA, Baxter I. Lights, camera, action: high-throughput plant phenotyping is ready for a close-up. Curr Opin Plant Biol. 2015;24:93–9.
37. Jannink JL, Lorenz AJ, Iwata H. Genomic selection in plant breeding: from theory to practice. Brief Funct Genomics. 2010;9(2):166–77.
38. Bernal-Vasquez AM, Möhring J, Schmidt M, Schönleben M, Schön CC, Piepho HP. The importance of phenotypic data analysis for genomic prediction - a case study comparing different spatial models in rye. BMC Genomics. 2014;15(1):1.
39. Bernardo R. Molecular markers and selection for complex traits in plants: learning from the last 20 years. Crop Sci. 2008;48(5):1649–64.
40. Poland JA, Bradbury PJ, Buckler ES, Nelson RJ. Genome-wide nested association mapping of quantitative resistance to northern leaf blight in maize. Proc Natl Acad Sci. 2011;108(17):6893–8.
41. Jung C, Müller AE. Flowering time control and applications in plant breeding. Trends Plant Sci. 2009;14(10):563–73.
42. Broccanello C, Stevanato P, Biscarini F, Cantu D, Saccomani M. A new polymorphism on chromosome 6 associated with bolting tendency in sugar beet. BMC Genet. 2015;16(1):1.
Authors’ contributions
FB, NN and SM conceived the study and performed all statistical analyses. FB and NN wrote most of the paper. CB and PS contributed data for the analysis and information and insights on the binary trait used for illustration. All authors read and approved the final manuscript.
Acknowledgements
This study was partially supported by the Italian Ministry of University (MIUR 60 %). Simone Marini is an International Research Fellow of the Japan Society for the Promotion of Science.
Competing interests
The authors declare that they have no competing interests.
Author information
Additional information
Filippo Biscarini and Nelson Nazzicari contributed equally to this work
Additional file
Additional file 1: Figure S1.
TPR/TNR variability with all SNP and with subsets of 30 or 50 SNP. Distribution of TPR (red) and TNR (blue) in the validation set using KNN and RF with all 175 SNP and with subsets of 50 and 30 SNP. TPR and TNR as a function of mislabeled observations, from a 5-fold cross-validation repeated 100 times. Results are presented per method.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Biscarini, F., Nazzicari, N., Broccanello, C. et al. “Noisy beets”: impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris. Plant Methods 12, 36 (2016). https://doi.org/10.1186/s13007-016-0136-4
Keywords
 Noisy data
 Classification
 K-nearest neighbours (KNN)
 Random forest (RF)
 Support vector machines (SVM)
 Ridge logistic regression
 Sugar beet
 Binomial phenotype
 Robustness to errors
 Genomic predictions