DNA barcoding: query regarding tree-based (cluster) methods for species identification

I am working on the success rate of DNA barcoding for species identification using distance-based and tree-based methods.

For the distance-based method, I have used Adhoc and SpeciesIdentifier, and confirmed species identifications using the best close match (BCM) criterion.

Please suggest a program, software package, or method for tree-based identification (using NJ, parsimony, or Bayesian trees) that reports clustering (the number of clusters formed for a particular species) and flags singleton (= ambiguous) species.

(For more details, see the attached image.) (Please note: it is not possible to do this manually, as I have >4000 specimens.)

Answers detailing methods or software are appreciated! Thank you!

DNA barcoding a unique avifauna: an important tool for evolution, systematics and conservation

DNA barcoding utilises a standardised region of the cytochrome c oxidase I (COI) gene to identify specimens to the species level. It has proven to be an effective tool for identification of avian samples. The unique island avifauna of New Zealand is taxonomically and evolutionarily distinct. We analysed COI sequence data in order to determine if DNA barcoding could accurately identify New Zealand birds.


We sequenced 928 specimens from 180 species. Additional GenBank sequences expanded the dataset to 1416 sequences from 211 of the estimated 236 New Zealand species. Furthermore, to improve the assessment of genetic variation in non-endemic species, and to assess the overall accuracy of our approach, sequences from 404 specimens collected outside of New Zealand were also included in our analyses. Of the 191 species represented by multiple sequences, 88.5% could be successfully identified by their DNA barcodes. This is likely a conservative estimate of the power of DNA barcoding in New Zealand, given our extensive geographic sampling. The majority of the 13 groups that could not be distinguished contain recently diverged taxa, indicating incomplete lineage sorting and in some cases hybridisation. In contrast, 16 species showed evidence of distinct intra-species lineages, some of these corresponding to recognised subspecies. For species identification purposes, a character-based method was more successful than distance and phylogenetic tree-based methods.


DNA barcodes accurately identify most New Zealand bird species. However, low levels of COI sequence divergence in some recently diverged taxa limit the identification power of DNA barcoding. A small number of currently recognised species would benefit from further systematic investigations. The reference database and analysis presented will provide valuable insights into the evolution, systematics and conservation of New Zealand birds.


The Neotropics hold an estimated 78,800 flowering plant species, over a third of the world's total [1]. Yet, tropical forests are being degraded at a fast pace [2], [3], and over half of the estimated 11,000 Amazonian tree species may face a direct risk of extinction [4]. Thus large-scale biodiversity inventories are critically needed in order to develop informed conservation strategies for these diverse ecosystems [5], [6]. Significant progress in mapping the distribution of Neotropical plants has been achieved over the past decades [7]–[11], but many areas are still under-collected and species identification remains a challenging task in many plant families. An example was recently provided by Pitman et al. (2008), who conducted a tree species diversity survey along a 700-km transect that cuts across one of the most diverse parts of the Amazon, between Ecuador and Brazil [12]. Based on traditional botanical sampling, they were able to identify 97% of the sampled stems to the genus, and counted a total of 435 tree genera. Yet, in their statistical analyses, they decided to conservatively exclude the genera that were difficult to identify in the field when only sterile material was available. Their choice of excluding no less than 20.7% of the genera, and 15.7% of the sampled stems resulted in loss of information, the influence of which on their conclusions is unknown.

With the advent of high-throughput DNA sequencing, it has been suggested that universally amplified, short, and highly variable DNA markers (DNA barcodes) may help identify organisms to species with a high confidence, which would be useful in a wide array of applications, including biodiversity surveys [13]–[15]. DNA barcodes should be both variable enough to discriminate among closely related species and yet possess highly conserved regions so as to be easily sequenced with standard protocols. The mitochondrial marker cytochrome c oxidase I (CO1) has met with some success for animal groups [13], [16], but see [17]–[19]. In plants, the search for suitable genomic regions has proven more challenging. Several regions in the plastid genome (e.g. rbcL, rpoC1, rpoB, ycf5, psbA-trnH, trnL, atpF-atpH, psbK-psbI) as well as the internal transcribed spacer (ITS) of the ribosomal nuclear DNA have emerged as good candidates for plant DNA barcoding [20]–[27]. A consensus has recently emerged among the members of the Consortium for the Barcoding of Life (CBoL) Plant Working Group for using only two of these markers to barcode land plants, namely rbcL and matK [28], yet these authors point out that this combination will lead to a species-level identification in 72% of the cases only, and this resolution is unlikely to be evenly distributed across land plant species.

Echoing Chase et al. (2007) [29], the CBoL Plant Working Group pointed out that plant DNA barcoding should be useful in discriminating among forest seedlings, or in undertaking large-scale biodiversity surveys in situations where taxonomic expertise is limiting. Yet, we are unaware of any application in this research area thus far, and the present work fills this gap. Tropical plants present challenges to DNA barcoding that are much more pronounced than those encountered when barcoding temperate plants, and today the application of plant DNA barcoding in the tropics is still uncharted territory (the only exceptions being applications to the genus Compsoneura in the Myristicaceae, see Newmaster et al. 2008; the genus Inga in the Fabaceae [30]; and the orchid family [26]). DNA extraction is expected to be more difficult in tropical plants, due to the greater abundance of secondary metabolites [31], and this may compromise the overall performance of DNA barcoding [32]. In addition, the rate of lineage diversification is often high in the tropics, leading to the frequent occurrence of explosive radiations [33]–[34]. For recent lineages with great numbers of species, we thus expect that DNA barcoding will be less efficient, because species will tend to have many close relatives, reducing levels of interspecific divergence, as recently confirmed in the genus Inga [35], and as should be expected in other groups [36]. Finally, it has been shown that woody plant lineages show consistently lower rates of molecular evolution than herbaceous plant lineages [37], suggesting that the application of DNA barcoding concepts should be more difficult for tree floras than for non-woody floras [26], [38].

In the present study, we use a plot-based sampling strategy to test the applicability of the currently proposed DNA barcoding scheme. Specifically, we examine if consensus barcodes are sufficiently variable and universal to reliably identify co-occurring Amazonian tree species, and we implement this scheme to the identification of tropical juvenile plants.


Reference DNA sequence analysis

Data description and distance summary

Haplotype data analysis detected 170 unique haplotypes in the DNA reference libraries (Table 1). The average nucleotide frequencies for all 42 species were as follows: A (adenine), 28%; T (thymine), 40%; G (guanine), 15.2%; and C (cytosine), 16.8%. The analysis revealed that interspecific Kimura 2-parameter (K2P) genetic divergence ranged from 0.045 to 0.201, with a mean genetic distance (MGD) of 0.133, whereas intraspecific K2P genetic divergence ranged from 0 to 0.107, with an average of 0.009 (Table 1).
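For readers unfamiliar with the K2P divergences quoted above: the standard Kimura 2-parameter distance is d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of transitional differences and Q the proportion of transversional differences between two aligned sequences. The following is a minimal Python sketch of that formula, not the code used in the study. The function name `k2p_distance` is hypothetical, and skipping sites with gaps or ambiguity codes (pairwise deletion) is an assumption, since the text does not say how such sites were handled.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Hypothetical helper for illustration. Sites where either sequence has
    a character other than A, C, G or T are skipped (assumed behaviour).
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent
    # for the K2P correction to be defined
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A single transition in ten compared sites gives a corrected distance slightly above the raw 0.1 difference, which is the purpose of the correction.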

Identification success rates

In the simulations, the nearest-neighbour (NN) approach returned 97.39% correct and 2.61% incorrect identifications (Fig. 1). The threshold analysis (TA) returned the same results as best close match (BCM) at the threshold value 0.01 (79.56% correct and 20.44% incorrect identifications). With a threshold of 0.039 calculated by the function localMinima in SPIDER, the TA and BCM provided 94.68% correct and 5.32% incorrect identifications. With a threshold of 0.044 (Additional file 1: Figure S1) generated by the function threshVal in SPIDER, the TA and BCM provided 95.21% correct and 4.79% incorrect identifications. The proportion of monophyly on a neighbor joining (NJ) tree approach (Mono) showed a success rate at 100% (Fig. 1).

Barplots of measures of identification success. Abbreviations: NN, nearest-neighbour; TA, threshold analysis with 1% threshold; TA.threshVal, threshold analysis with 4.4% threshold; TA.localMinima, threshold analysis with 3.59% threshold; BCM, best close match (1% threshold); BCM.threshVal, best close match with 4.4% threshold; BCM.localMinima, best close match with 3.59% threshold; Mono, proportion of monophyly on a NJ tree

Barcode gap analysis

In our reference DNA sequences, we counted how often the maximum intraspecific distance exceeded the minimum interspecific distance. Using length and which functions in SPIDER to query how many times this occurred in our reference DNA sequences, we found that this was the case on 14 occasions (Additional file 2: Figure S2).
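The count described above can be sketched in a few lines of Python. This is an illustrative stand-in for the SPIDER workflow, not a reproduction of it: the function name `species_without_barcode_gap` is hypothetical, the comparison is made per species (the original analysis may have been per specimen), and the input is assumed to be a list of species labels plus a symmetric pairwise distance matrix in the same order.

```python
def species_without_barcode_gap(labels, dist):
    """Return species whose maximum intraspecific distance exceeds
    their minimum interspecific distance.

    labels: species name for each specimen
    dist:   symmetric matrix (list of lists) of pairwise distances
    Species with a single specimen have no intraspecific distance
    and are skipped.
    """
    no_gap = []
    for sp in sorted(set(labels)):
        members = [i for i, s in enumerate(labels) if s == sp]
        others = [i for i, s in enumerate(labels) if s != sp]
        if len(members) < 2 or not others:
            continue
        max_intra = max(dist[i][j] for i in members for j in members if i < j)
        min_inter = min(dist[i][j] for i in members for j in others)
        if max_intra > min_inter:
            no_gap.append(sp)
    return no_gap
```

The length of the returned list is the number of occasions on which the barcode gap is absent.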

Molecular identification for Culicoides larvae

DNA sequences of Culicoides larvae collected in the Niayes area of Senegal were successfully obtained for 958 out of 1632 larvae (58.6%). PCR amplifications failed for 99 out of 773 samples of stages L1-L2, while all selected samples of stage L3-L4 were successfully amplified (859/859 samples). This might be explained by the physical size of the different larval stages (L1 and L2 stages are < 2 mm). The sequences were edited in Geneious R11 [19] and 933 cox1 sequences of better quality were used in this study. The overall rate of cox1 sequences successfully matched within our reference DNA sequences, used as the Search Set in the BLAST search, was 97.1%. Thus, 906 out of 933 cox1 sequences of larvae were successfully identified to Culicoides species. However, 27 cox1 sequences were unmatched within our DNA barcode reference libraries. In order to find a match, these cox1 sequences were used as a query in NCBI. However, no matches were found for these sequences.

The sequences matched corresponded to eight Culicoides species (Table 2). Of these species, Culicoides oxystoma Kieffer had the highest percentage (66.8%), followed by Culicoides nivosus de Meillon (21.5%), Culicoides distinctipennis Austen and Culicoides similis Carter, Ingram & Macfie (both slightly above 3%) (Table 2).

DNA barcoding database analyses

A total of 1131 cox1 sequences were submitted to the BOLD database under the project code “AFCUL” (details see Additional file 3: Table S1). A hierarchical increase in mean divergence was observed according to two taxonomic levels: within species (mean = 1.92%, SE = 0.00) and within genus (mean = 17.82%, SE = 0.00). In the barcode gap analysis using the BOLD Management and Analysis System, situations where the distance to the nearest neighbour was less than the max intra-specific distance were encountered in seven species (Additional file 4: Table S2). Haplotype data analysis detected 360 haplotypes in 1131 cox1 sequences for 40 Afrotropical Culicoides species.

DNA barcoding and species delimitation of butterflies (Lepidoptera) from Nigeria

Accurate identification of species is a prerequisite for successful biodiversity management and further genetic studies. Species identification techniques often require both morphological diagnostics and molecular tools, such as DNA barcoding, for correct identification. In particular, the use of the subunit I of the mitochondrial cytochrome c oxidase (COI) gene for DNA barcoding has proven useful in species identification for insects. However, to date, no studies have been carried out on the DNA barcoding of Nigerian butterflies. We evaluated the utility of DNA barcoding applied for the first time to 735 butterfly specimens from southern Nigeria. In total, 699 DNA barcodes, resulting in a record of 116 species belonging to 57 genera, were generated. Our study sample comprised 807 DNA barcodes based on sequences generated from our current study and 108 others retrieved from BOLD. Different molecular analyses, including genetic distance-based evaluation (Neighbor-Joining, Maximum Likelihood and Bayesian trees) and species delimitation tests (TaxonDNA, Automated Barcode Gap Discovery, General Mixed Yule-Coalescent, and Bayesian Poisson Tree Processes) were performed to accurately identify and delineate species. The genetic distance-based analyses resulted in 163 well-separated clusters consisting of 147 described and 16 unidentified species. Our findings indicate that about 90.20% of the butterfly species were explicitly discriminated using DNA barcodes. Also, our field collections reported the first country records of ten butterfly species—Acraea serena, Amauris cf. dannfelti, Aterica galena extensa, Axione tjoane rubescens, Charaxes galleyanus, Papilio lormieri lormeri, Pentila alba, Precis actia, Precis tugela, and Tagiades flesus. Further, DNA barcodes revealed a high mitochondrial intraspecific divergence of more than 3% in Bicyclus vulgaris vulgaris and Colotis evagore. 
Furthermore, our result revealed an overall high haplotype (gene) diversity (0.9764), suggesting that DNA barcoding can provide information at a population level for Nigerian butterflies. The present study confirms the efficiency of DNA barcoding for identifying butterflies from Nigeria. To gain a better understanding of regional variation in DNA barcodes of this biogeographically complex area, future work should expand the DNA barcode reference library to include all butterfly species from Nigeria as well as surrounding countries. Also, further studies, involving relevant genetic and eco-morphological datasets, are required to understand processes governing mitochondrial intraspecific divergences reported in some species complexes.


The utility of cox1 DNA barcoding in blackflies has recently been debated [22,23,24, 35]. In our study, known species clustered together in the NJ tree based upon cox1 DNA barcode sequences (Fig. 2), which demonstrates the utility of this methodology in support of species identification. Most of the individuals of a given species were correctly placed in the NJ tree. Nonetheless, specimens identified morphologically as S. argyreatum, S. monticola and S. variegatum were placed in the same cluster, implying that they might be conspecific. This result was not surprising, as the adults of the three species are morphologically very similar. However, the three taxa can be readily identified based on the pupal gill configuration. Simulium variegatum is easily identified by having 1+1 prominent tubercles at the base of the gill [26], while the tubercles are absent in S. argyreatum and S. monticola. In S. monticola, the ventral gill filaments originate directly from the base, all filaments are prominently curved at mid length, and the cephalothorax is covered by areas of small tubercles [26]. In S. argyreatum, the gill is covered by tubercles, which are homogeneously distributed [26]. Thus, we advocate that different genetic markers, such as the elongator complex protein 1 gene (ECP1) or ITS2 [36,37,38,39], should be used to explore their taxonomic status.

We expected higher levels of genetic variation between members of known species complexes, even though cytological studies were not carried out in our study [3, 22,23,24, 40]. In this regard, most of the specimens grouped together, and high levels of genetic diversity were not identified between species complexes. In addition, there were no deep divisions in the NJ tree of the kind observed in previous studies [3, 22, 24]. This is likely due to the fact that most of the specimens originated from the same, or relatively close, localities. However, not all known species grouped as we anticipated. As a whole, we revealed high intraspecific genetic divergence not only in P. latimucro (s.l.) with 2.77%, but also in P. tomosvaryi with 2.93% and S. intermedium with 3.96%. In particular, S. intermedium was split into two distinct groups, named here I and II (Fig. 2). This could be indicative of the presence of a species complex, but further cytotaxonomic studies are required to validate this hypothesis. In this study, the values obtained for the intraspecific and interspecific genetic divergences are within the ranges obtained by other authors [22,23,24, 40,41,42].

Many authors (e.g. [34, 35]) have stated that the congruence found between morphologically recognized species and BINs could demonstrate the presence of cryptic genetic diversity. The subgroups detected in S. intermedium may therefore be indicative of such diversity. In contrast, the presence of the same BINs in other recognized species, such as S. argyreatum, S. monticola and S. variegatum, might be hard to explain. We therefore advocate further biosystematic studies of these taxa, not only in Spain but across their distribution range.

Materials and Methods

Sequences, Alignment, and Pairwise Distances

Similar to the approach of Barrett and Hebert (2005) and Hebert et al. (2003b), we used 1443 sequences from GenBank and aligned them using ClustalX. We eliminated sequences that (1) were too short (< 300 bp; 57 sequences), (2) were not identified to species (49 sequences), (3) came from species-hybridization experiments (Drosophila subquinaria GI:25990046), or (4) could not be aligned and/or translated into proteins or had > 30% sequence divergence to all other COI sequences for Diptera (Dyscritomyia robusta GI:19879668; Drosophila busckii GI:27657151; Drosophila affinis GI:27657153). Sequences with GenBank names that are synonyms according to the Biosystematic Database of World Diptera (Thompson, 2005) were renamed using the valid name (Anopheles arabiensis is a junior synonym of Anopheles gambiae; 3 sequences). The remaining 1333 COI sequences came from 449 species of Diptera, of which 127 species were represented by 1011 sequences (see supplementary material available online). Sequences not belonging to COI were removed and misplaced gaps were corrected to yield a 1539-bp gap-free alignment lacking stop codons (see supplementary material available online). Most analyses were carried out using a program developed for this purpose (“TaxonDNA”). We carried out separate analyses for all sequences with a minimum of 300, 400, 500, and 600 bp overlap. Due to large interspecific distances in the very speciose genus Drosophila, we treated the Drosophila subgenera as separate genera.

In order to test for overlap between intraspecific with interspecific genetic variability, we plotted all uncorrected pairwise distances for conspecific sequences and all distances for interspecific, congeneric sequences. In order to test whether all species have unique DNA barcodes, we first tested whether identical sequences were shared by individuals from different species. We then constructed species barcodes as the consensus sequence of all conspecific sequences and again tested for the uniqueness of these species barcodes. The intraspecific sequence variability was summarized using IUPAC codes and the consensus sequence had to be based on at least two sequences.
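The species-barcode construction described above (a consensus of all conspecific sequences, with variable sites summarized by IUPAC codes, requiring at least two sequences) can be sketched as follows. This is only an illustration of the idea, not TaxonDNA's implementation; the function name `consensus_barcode` is hypothetical, and the input is assumed to be gap-free, equal-length sequences containing only A, C, G and T.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_barcode(seqs):
    """Build a species barcode from aligned conspecific sequences.

    Hypothetical helper: raises ValueError for fewer than two sequences
    or unequal lengths, and KeyError for characters other than A/C/G/T.
    """
    if len(seqs) < 2:
        raise ValueError("a species barcode needs at least two sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    columns = zip(*(s.upper() for s in seqs))
    return "".join(IUPAC[frozenset(col)] for col in columns)


print(consensus_barcode(["ACGT", "ATGT", "ACGA"]))  # AYGW
```

Two species barcodes built this way can then be compared for uniqueness with a simple string or set comparison.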

DNA Barcoding: Tree-Based Query Identification

Using PAUP* (Swofford, 2002; Kimura 2-parameter model as recommended in Barrett and Hebert, 2005; ties broken randomly), we computed neighbor-joining trees and bootstrap trees for the largest sets of congeneric sequences with at least 300 bp overlap. The same data sets were analyzed using parsimony as implemented in TNT (Goloboff et al., 2003; New Technology Search = 15; find min. length = 3; bootstrap 250 replicates), and Bayesian analyses as implemented in MrBayes 3.1 (Huelsenbeck and Ronquist, 2003). All Bayesian searches were initiated from random starting trees. For all data sets with congeneric sequences, the GTR+I+G model was favored by the Akaike information criterion and hierarchical likelihood-ratio testing as implemented in MrModelTest version 2.2 (Nylander, 2004). The data set was run for 3,000,000 generations and a tree was sampled every 300 generations, resulting in 10,000 trees. Chain stationarity had been achieved for all data sets after 1,200,000 generations (burn-in) and 4000 trees were subsequently discarded. Three independently repeated analyses resulted in similar tree topologies and comparable clade probabilities and substitution model parameters.
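The tree counts in the Bayesian analysis follow directly from the run length and sampling interval. The short sketch below simply reproduces that arithmetic; the function name `mcmc_sample_counts` is hypothetical, and it ignores the extra tree that some programs save for generation 0, in line with the round numbers given in the text.

```python
def mcmc_sample_counts(generations, sample_every, burnin_generations):
    """Return (sampled, discarded, retained) tree counts for one run."""
    sampled = generations // sample_every
    discarded = burnin_generations // sample_every
    return sampled, discarded, sampled - discarded


# 3,000,000 generations sampled every 300, with a 1,200,000-generation burn-in
print(mcmc_sample_counts(3_000_000, 300, 1_200_000))  # (10000, 4000, 6000)
```

Under these settings, 6000 post-burn-in trees remain for estimating clade probabilities.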

For all trees, identification success was initially assessed as described in Hebert et al. (2003a) and Table 1; i.e., sequences were considered successfully identified as long as they formed species-specific clusters. Species with sequences at multiple positions in the tree were considered failures and those species with a single sequence were counted as ambiguous. Second, we used the revised identification criteria described in the introduction. We only considered queries to be correctly identified if they were found in a species-specific polytomy or at least one node into a clade exclusively consisting of sequences from one species. Ambiguous were all queries belonging to species with one or two sequences and those that formed a sister group to a cluster of conspecific sequences. We counted those sequences as misidentified that were assigned a definite but incorrect name (e.g., a query within an allospecific sequence cluster). A special case is polytomies of sequences from two species. If the query was from a different species than all remaining sequences in the polytomy, we counted the query as a misidentification because an identifier will assume that the query is conspecific with the remaining sequences. However, if the polytomy had at least two sequences each from two different species, then the query in the polytomy was considered ambiguous, because the identifier will be aware that a query in such a polytomy cannot be unambiguously identified.
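The first (cluster-based) criterion above lends itself to automation, which matters when a tree holds thousands of specimens. The sketch below is one possible way to apply it, under several assumptions that are not from the source: the tree is rooted and supplied as nested tuples, each leaf is a string of the hypothetical form "Species|specimen_id", and a species counts as a species-specific cluster when one clade contains all of its sequences and nothing else. The revised, per-query criteria are not implemented here.

```python
from collections import Counter


def leaf_species(leaf):
    """Species part of a leaf label such as 'Aus bus|0042' (assumed format)."""
    return leaf.split("|")[0]


def collect_clades(node, out):
    """Return the leaves under `node`, appending every clade's leaf list to `out`."""
    if isinstance(node, str):
        leaves = [node]
    else:
        leaves = []
        for child in node:
            leaves.extend(collect_clades(child, out))
    out.append(leaves)
    return leaves


def classify_species(tree):
    """Label each species 'success', 'failure' or 'ambiguous' (singleton)."""
    all_clades = []
    leaves = collect_clades(tree, all_clades)
    totals = Counter(leaf_species(leaf) for leaf in leaves)
    result = {}
    for sp, n in totals.items():
        if n == 1:
            result[sp] = "ambiguous"  # single sequence: cannot form a cluster
            continue
        monophyletic = any(
            len(clade) == n and all(leaf_species(leaf) == sp for leaf in clade)
            for clade in all_clades
        )
        result[sp] = "success" if monophyletic else "failure"
    return result


tree = ((("A|1", "A|2"), "B|1"), (("C|1", "D|1"), ("C|2", "D|2")))
print(classify_species(tree))
```

In this toy tree, species A forms its own cluster, B is a singleton, and C and D are interleaved. The recursion is kept simple for readability; a very deep, ladder-like tree could exceed Python's default recursion limit and would need an iterative traversal.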

DNA Barcoding: Identifying Species Based on Distances (see Table 1)

“Best match.” We used TaxonDNA to find for each query its closest barcode match. If both sequences were from the same species, the identification was considered a success, whereas mismatched names were counted as failures. Several equally good best matches from different species were considered ambiguous.
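As a rough illustration of the best match rule (this is not TaxonDNA's code), the sketch below scores one query against all other sequences in a pairwise distance matrix. The function name `best_match` and the input layout (a list of species labels plus a symmetric matrix in the same order) are assumptions, and ties are detected by exact equality of distances.

```python
def best_match(query_idx, labels, dist):
    """Score one query as 'success', 'failure' or 'ambiguous'.

    The query is compared with every other sequence; the name of the
    closest sequence is taken as the identification.
    """
    candidates = [
        (dist[query_idx][j], labels[j])
        for j in range(len(labels))
        if j != query_idx
    ]
    best = min(d for d, _ in candidates)
    names = {name for d, name in candidates if d == best}
    if len(names) > 1:
        return "ambiguous"  # equally good matches from different species
    return "success" if names == {labels[query_idx]} else "failure"
```

Looping this over every index gives the leave-one-out success, failure, and ambiguity rates for a data set.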

“Best close match.” We used TaxonDNA to plot the relative frequency of intraspecific distances in order to determine the threshold value below which 95% of all intraspecific distances are found. All queries without a barcode match below the threshold value remained unidentified. For the remaining queries, their identity was compared to the species identity of their closest barcode. If the name was identical, the query was considered an identification success. The identification was considered a failure when the names were mismatched and considered ambiguous when several equally good best matches were found that belonged to a minimum of two species.
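The two steps of best close match, deriving a threshold from the intraspecific distances and then applying it, can be sketched as follows. Both function names are hypothetical, the 95% cut-off is taken as the smallest observed intraspecific distance that at least 95% of such distances do not exceed (TaxonDNA's exact computation may differ), and a match at exactly the threshold is assumed to count as a match.

```python
import math


def intraspecific_threshold(labels, dist, quantile=0.95):
    """Distance at or below which `quantile` of intraspecific distances fall."""
    n = len(labels)
    intra = sorted(
        dist[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] == labels[j]
    )
    if not intra:
        raise ValueError("no intraspecific distances available")
    k = max(math.ceil(quantile * len(intra)) - 1, 0)
    return intra[k]


def best_close_match(query_idx, labels, dist, threshold):
    """Score one query: 'success', 'failure', 'ambiguous' or 'unidentified'."""
    candidates = [
        (dist[query_idx][j], labels[j])
        for j in range(len(labels))
        if j != query_idx
    ]
    best = min(d for d, _ in candidates)
    if best > threshold:
        return "unidentified"  # closest barcode is too distant
    names = {name for d, name in candidates if d == best}
    if len(names) > 1:
        return "ambiguous"
    return "success" if names == {labels[query_idx]} else "failure"
```

Compared with plain best match, the only change is the early exit for queries whose closest barcode lies beyond the threshold.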

“All species barcodes.” We assembled for each query a list of all barcodes sorted by similarity to the query using the same threshold as for best close match. Queries were considered a success when they were followed by all conspecific barcodes as long as there were at least two barcodes for the species. Queries were considered ambiguous when they were followed by only one conspecific barcode or only some of the conspecific sequences. Queries followed by all conspecific sequences from the “wrong” species were considered misidentified.
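The wording of this criterion leaves some room for interpretation, so the following Python sketch should be read as one possible reading rather than as TaxonDNA's behaviour. Assumptions: barcodes beyond the threshold are ignored, the query is "followed by" a species when the leading run of its sorted match list consists of that species, the query itself is not counted among the available barcodes, and ties between species are resolved by list order. The function name `all_species_barcodes` is hypothetical.

```python
def all_species_barcodes(query_idx, labels, dist, threshold):
    """Score one query under an 'all species barcodes' style rule."""
    hits = sorted(
        (dist[query_idx][j], j)
        for j in range(len(labels))
        if j != query_idx and dist[query_idx][j] <= threshold
    )
    if not hits:
        return "unidentified"
    lead_species = labels[hits[0][1]]
    # length of the leading run of barcodes belonging to the closest species
    run = 0
    for _, j in hits:
        if labels[j] != lead_species:
            break
        run += 1
    # barcodes of that species in the reference set, excluding the query
    available = sum(
        1 for j, s in enumerate(labels) if s == lead_species and j != query_idx
    )
    if available < 2 or run < available:
        return "ambiguous"
    return "success" if lead_species == labels[query_idx] else "misidentified"
```

This rule is stricter than best close match because a single close conspecific barcode is not enough for a definite identification.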

DNA Taxonomy: Profiles Based on Distance Thresholds

We tested the viability of threshold values for distinguishing intra- from interspecific variability for 2%, 3%, 4%, 5%, and 6% thresholds. For this purpose, TaxonDNA finds for each query a set of barcodes for which each sequence in the set has at least one other sequence within the threshold distance. For each cluster, we determined whether the largest observed distance exceeded the threshold distance and whether the cluster corresponded to a currently accepted species (= contains all sequences for one species). If not, we determined whether it contained sequences for several species (error 1) and/or not all sequences for the same species (error 2).
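The clustering rule described above amounts to single-linkage clustering at a fixed distance threshold. A minimal Python sketch follows; it is not TaxonDNA's implementation, the function names and the returned dictionary keys are hypothetical, and sequences at exactly the threshold distance are assumed to be linked.

```python
def threshold_clusters(labels, dist, threshold):
    """Group sequences so each has at least one neighbour within `threshold`."""
    n = len(labels)
    parent = list(range(n))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def profile_cluster(cluster, labels, dist, threshold):
    """Report the three checks described in the text for one cluster."""
    species = {labels[i] for i in cluster}
    max_d = max(
        (dist[i][j] for i in cluster for j in cluster if i < j), default=0.0
    )
    split = any(
        labels.count(sp) > sum(1 for i in cluster if labels[i] == sp)
        for sp in species
    )
    return {
        "exceeds_threshold": max_d > threshold,
        "error1_several_species": len(species) > 1,
        "error2_species_incomplete": split,
    }
```

Because linkage is transitive, a cluster can contain pairs of sequences further apart than the threshold, which is why the largest observed distance is checked separately.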


In conclusion, our study revealed that DNA barcoding based on the COI gene is an effective method to clarify species boundaries and quantitatively evaluate species diversity (e.g., taxa abundance and cryptic species). Population differentiation of benthic macroinvertebrates in four transboundary rivers was ascribed to geographical isolation. Geographical isolation and diversification events are two main factors that cause different populations to evolve in different directions and thus lead to a great increase in the diversity of benthic macroinvertebrates. Even so, in population genetics studies DNA barcoding should be supplemented with morphological, ecological, nuclear DNA, and other nonmolecular data when assessing the existence of cryptic species and intraspecific divergence.


Surveys of assemblages in three countries (Fig. 1) yielded a total of 2,801 sequences for northwestern Pacific molluscs, belonging to 91 families, 240 genera, and 569 species. The taxonomy, accession numbers and collection sites are available in Supplementary Table S1. For most species, multiple specimens (mean = 4.9 specimens per species) were analyzed to document intraspecific variability. A total of 182 species were represented by a single specimen, and one species (Cellana nigrolineata) was represented by 62 specimens. The average nucleotide frequencies for all 573 species are as follows: A = 22.97%, T = 39.41%, G = 20.96% and C = 16.66%. Mean GC content averaged 37.62% (SE = 0.06), but showed considerable variation (range 29.94–52.02%). A chi-square test of homogeneity demonstrated significant variation in nucleotide frequencies among species in each of five molluscan classes (P < 0.001). Mean nearest neighbour distances between congeneric species showed a significant (P < 0.001; R² = 0.167) positive correlation with mean GC content (Supplementary Fig. 1).

Distribution map for all sampling sites (magenta circles) in the region of the northwestern Pacific.

The countries surrounding the study area: Greater China, Japan, and Korea. The location details and a list of the number of samples collected per site are available in Supplementary Table S1. Both the map of the world and the map of the northwestern Pacific with Greater China, Japan, and Korea were rendered with ODV v4.7.3 (ref. 63) and modified in Microsoft Office.

Distance summary

We observed a hierarchical increase in mean divergence according to different taxonomic levels: within species (mean = 0.97%, SE = 0.023), within congeners (mean = 18.67%, SE = 0.004), within families (mean = 22.47%, SE = 0.003), within orders (mean = 25.3%, SE = 0.002) and within classes (mean = 30.60%, SE = 0.012) (Table 1). Therefore, there was ca 19.25× more variation among congeneric species than among conspecific individuals. A regression analysis revealed that the mean interspecific divergence appeared to increase with the number of species analyzed from a genus, but the regression was not significant (Fig. 2A; P = 0.049; R² = 0.138). The intraspecific divergence did not differ significantly with the number of individuals analyzed per species (Fig. 2B; P = 0.27 and P = 0.56).

(A) The relationship between interspecific divergence and sample size within genera. Mean interspecific divergence (% K2P) at COI plotted against the number of species sampled from each genus of marine molluscs with ≥2 species (N = 123). The correlation was insignificant (P = 0.052; R² = 0.08). (B) The relationship between intraspecific divergences and sample size within species. Mean and maximum intraspecific divergences (% K2P) at COI plotted against the number of individuals analyzed for 393 species of northwestern Pacific molluscs. The correlation between sample size and mean intraspecific divergence is insignificant (P = 0.27; R² = 0.024), as is that with the maximum intraspecific divergence (P = 0.56; R² = 0.085).

Barcode gap analysis

We counted how often the maximum sequence divergence among individuals of a species exceeded the minimum sequence divergence from another congeneric species. These situations, which may confound barcode-based taxonomic assignments, were encountered in 70 species (12.30%) (Fig. 3, Supplementary Table S2). In these species, the maximum intraspecific variation overlaps with the nearest neighbour (NN) distance, leading to the absence of a barcode gap, and in 36 cases NN distances were zero. A further 91 species showed a low distance to the NN (≤2%) that nevertheless exceeded the maximum intraspecific value.

Statistical results of DNA barcoding performance.

(A) Maximum intraspecific divergence compared with the nearest-neighbor distance for northwestern Pacific molluscs. Only species with multiple sequences are presented. Points above the line indicate species with a barcode gap. (B) Performance based on taxon clustering in Neighbor-joining analysis.

Success of sequence-based specimen identification techniques

In the simulations, the best match (BM) approach returned 89.15% true and 10.92% false identifications (Table 2). When singletons were removed, false identifications decreased to 4.73%. Details of the simulation results are available in Supplementary Table S3. With a threshold of 0.01, the best close match (BCM) analysis provided 68.62% true and 1.14% false identifications. For 14.28% of the queries, the result was ambiguous (more than one equally close match was found below the threshold of 0.01). A further 15.96% of the queries had no conspecific match below the threshold of 0.01, and almost half of these (40.72%) were singletons with no conspecific sequence available. The threshold optimization method (‘threshVal’ function in SPIDER) reported a threshold between 0.0135 and 0.0260 (Supplementary Fig. 2). The average value of 0.02 was selected as the optimized threshold for the analyses. Under this threshold, the BCM approach provided 74.94% true and 1.75% false identifications, and 15.42% of the queries were ambiguous. The remaining 7.89% of queries were unidentified. The ‘localMinima’ function in SPIDER returned a threshold of 0.053 as a possible transition between intra- and interspecific distances (Supplementary Fig. 3). With this threshold, the BCM approach provided 76.29% true, 2.46% false and 15.49% ambiguous identifications, while 5.75% of queries had no identification. When singletons were excluded, the false and unidentified queries decreased under each threshold. The all species barcodes (ASB) analysis returned the same results as BCM at the threshold value of 0.01, whereas the BCM approach returned a slightly higher success rate than the ASB approach at threshold values of 0.021 and 0.053.

BIN discordance report and the nearest neighbour analysis

The BIN analysis included 2591 of the 2801 records and generated 582 different BINs. A total of 387 BIN clusters were found to be taxonomically concordant with other barcode data on BOLD assigned to the same species name (Supplementary Table S4). Five records were flagged as singletons, meaning that the BIN refers to only one specimen (Supplementary Table S5). BIN discordance analysis returned 190 BINs as discordant with respect to our prior taxonomic assignments (Supplementary Table S6). The external (incl. BOLD data) incongruence occurred at different taxonomic levels: the highest rank of conflict was found in one BIN at the phylum level, followed by five at the order level and nine at the family level. At the genus level, 62 BINs were found to be discordant, which means that specimens belonging to different genera of the same family were grouped together in one BIN. Finally, 113 BINs incorporated specimens of at least two congeneric species. Within the BNPM data, 72.2% of BINs were found to be concordant with morphology-based identifications. The discrepancies fall into two groups: (i) 45 discordant BINs caused by haplotype sharing and low between-species divergence (Table 3), and (ii) 68 species clusters that were assigned to two or more BINs (Table 4).
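The categories used in this report (singleton, concordant, discordant, and species split across BINs) can be illustrated with a small within-dataset tally. This sketch is a simplification and not BOLD's discordance report, which also compares against other records on BOLD; the function name `bin_report` and the input format, one (BIN identifier, species name) pair per specimen, are assumptions.

```python
from collections import defaultdict


def bin_report(records):
    """Classify BINs and list species assigned to more than one BIN.

    records: iterable of (bin_id, species) pairs, one per specimen.
    Returns (status_by_bin, split_species).
    """
    names_in_bin = defaultdict(list)
    bins_of_species = defaultdict(set)
    for bin_id, species in records:
        names_in_bin[bin_id].append(species)
        bins_of_species[species].add(bin_id)

    status = {}
    for bin_id, names in names_in_bin.items():
        if len(names) == 1:
            status[bin_id] = "singleton"   # BIN with a single specimen
        elif len(set(names)) == 1:
            status[bin_id] = "concordant"  # one species name only
        else:
            status[bin_id] = "discordant"  # several species share the BIN
    split_species = sorted(
        sp for sp, bins in bins_of_species.items() if len(bins) > 1
    )
    return status, split_species
```

Discordant BINs correspond to group (i) above (shared or very similar haplotypes), while the returned list of split species corresponds to group (ii).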

The nearest neighbour (NN) of each BIN according to the data available in BOLD is available as Supplementary Table S3. The NN comparison revealed the under-representation of mollusc species on BOLD and the need for taxonomic reassessment of some species: 65% of BINs generated by our entries had a congeneric NN, 20% had an NN from the same family or a higher taxonomic rank, and 14% had an NN represented by an unidentified specimen.

Neighbour-joining analysis

The neighbour-joining (NJ) tree profile showed that sequence records for 2,320 (83.0%) queries, representing 355 (62.4%) of all species, formed distinct barcode clusters, allowing their successful identification. A further 299 sequences were involved in 31 cases of paraphyly or shared barcodes between closely related species pairs, leading to their misidentification. Owing to the lack of conspecific sequences in the data set, 31.9% of species were ambiguous and remained unidentified (Supplementary Fig. 4). Therefore, a large proportion of sequences (83.0%) and species (62.4%) were unambiguously distinguishable using the criterion of barcode clusters.
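For data sets too large to inspect by eye, the barcode-cluster criterion can be checked automatically once a tree is available. The following Python sketch makes several assumptions that are not from the study: the tree is already rooted and supplied as nested tuples, every leaf label has the hypothetical form `specimenID|species`, and the function names are invented for illustration. A species counts as monophyletic when some clade contains all of its specimens and nothing else.

```python
def leaf_species(label):
    """Extract the species part of a 'specimenID|species' leaf label."""
    return label.split("|", 1)[1]


def collect_clades(tree, clades):
    """Return the leaf set under `tree`, recording every internal clade."""
    if isinstance(tree, str):          # a leaf
        return frozenset([tree])
    leaves = frozenset().union(
        *(collect_clades(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def classify_species(tree):
    """Label each species 'monophyletic', 'non_monophyletic' or 'singleton'."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    clade_set = set(clades)

    by_species = {}
    for leaf in all_leaves:
        by_species.setdefault(leaf_species(leaf), set()).add(leaf)

    result = {}
    for species, members in by_species.items():
        if len(members) == 1:
            result[species] = "singleton"        # no conspecific: ambiguous
        elif frozenset(members) in clade_set:
            result[species] = "monophyletic"     # distinct barcode cluster
        else:
            result[species] = "non_monophyletic" # paraphyly or shared barcodes
    return result
```

Because an NJ tree is unrooted, the outcome depends on how the tree is rooted before it is converted to nested tuples; that step is left to the tree-building software.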

Thirteen of these 31 problematic cases involved species that formed paraphyletic clusters (Supplementary Fig. 4, groups highlighted in yellow, e.g., Patelloida pygmaea, Fig. 4A). As in P. pygmaea, some of the taxa exhibiting deep intraspecific divergence values were recovered as paraphyletic in phylogenetic trees; nevertheless, the haplotype networks of the paraphyletic species demonstrated that no haplotype was shared between any pair of species (e.g., Patelloida spp., Fig. 4B).

The phylogenetic analysis of four species of the genus Patelloida.

(A) Neighbour-Joining (NJ) tree showing the relationships of the Patelloida spp. based on the K2P model, with bootstrap values higher than 50% indicated. (B) The network connecting the haplotypes documented in the Patelloida spp. Haplotypes are represented by circles. The numbers on the internodes indicate mutation steps, and the other numbers are the frequencies of each haplotype. Color-coding represents distinct species. The black solid circle indicates missing intermediate steps between observed haplotypes.

Members of six species pairs and two species trios showed cases of barcode sharing, producing a mixed-species cluster in the NJ tree (Supplementary Fig. 4, framed clusters, e.g., Meretrix spp., Fig. 5A). For Meretrix spp., the sharing of COI haplotypes was found in the haplotype networks of the closely related species (Fig. 5B). Overall, all these eighteen species with undifferentiated barcodes formed only fifteen clusters in phylogenetic trees.

The phylogenetic analysis of five species of the genus Meretrix.

(A) Neighbour-Joining (NJ) tree of barcodes from individuals of the genus Meretrix based on the K2P model, with bootstrap values higher than 50% indicated. (B) Haplotype networks of Meretrix species. Haplotypes are represented by circles. The numbers on the internodes indicate mutation steps, and the other numbers are the frequencies of each haplotype. The size of each circle is proportional to the number of analyzed specimens with that haplotype. Color-coding represents distinct species. The black solid circle indicates missing intermediate steps between observed haplotypes.

Deeply divergent intraspecific clusters were found within 62 of the 569 analyzed species (10.9%), indicating the occurrence of cryptic diversity (Table 5, Supplementary Fig. 4, groups highlighted in magenta). Those divergent intraspecific clusters, which correspond to divergent evolutionary lineages, were restricted to 32 of the 91 analyzed families (Table 5). The number of lineages per species varied from 2 to 4, for a total of 137 divergent lineages among 62 named species, which suggests a 13% increase in species diversity. Deeply divergent intraspecific lineages (>2%) were always (19/62) found in different geographical locations (e.g., Echinolittorina vidua and Serratina capsoides, Fig. 6A–D). Notably, expanded geographical coverage changed the clustering pattern of conspecific individuals. In our data set, three species (Patelloida pygmaea, Thais luteostoma and Conus sanguinolentus) moved from monophyletic to paraphyletic after the inclusion of additional populations. Consequently, we examined how the inclusion of geographically separated populations influences DNA barcoding. As expected, expansion of geographical coverage significantly increased intraspecific variation. The mean value of maximum intraspecific genetic distance increased roughly eight-fold, from mean ± S.E. = 1.02 ± 0.06% (when one population per species was considered) to mean ± S.E. = 8.77 ± 0.17% (when individuals from distinct populations were included).
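The maximum intraspecific distance and the 2% cut-off discussed above can be computed directly from aligned sequences. The sketch below uses the standard Kimura 2-parameter formula, d = -0.5 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names and the record layout are hypothetical, and applying the 2% cut-off to the maximum pairwise distance is a simplification of lineage detection on a tree.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    n = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1
    if n == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / n, transversions / n
    # math.log raises ValueError for saturated pairs (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def max_intraspecific(records):
    """Map each species with >= 2 sequences to its largest pairwise K2P."""
    by_species = {}
    for species, seq in records:
        by_species.setdefault(species, []).append(seq)
    return {sp: max(k2p_distance(a, b) for a, b in combinations(seqs, 2))
            for sp, seqs in by_species.items() if len(seqs) > 1}


def deep_divergence(records, cutoff=0.02):
    """Species whose maximum intraspecific distance exceeds the cutoff."""
    return sorted(sp for sp, d in max_intraspecific(records).items()
                  if d > cutoff)
```

Computing `max_intraspecific` once on single-population samples and once on the pooled samples gives the kind of before-and-after comparison reported in this paragraph.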

Examples of taxa with deep intraspecific divergence.

(A) Sampling sites of the two COI lineages found in Echinolittorina vidua. The specimens of both lineages were present (scale bar, 400 km). (B) Neighbour-Joining (NJ) tree of COI barcodes of E. vidua with bootstrap values higher than 50% indicated. (C) Sampling sites of the two COI lineages found in Serratina capsoides. The specimens of one of the lineages were allopatric (scale bar, 400 km). (D) Neighbour-Joining (NJ) tree of COI barcodes of S. capsoides with bootstrap values higher than 50% indicated. The map of the northwestern Pacific with Greater China, Japan, and Korea was rendered with ODV v4.7.3 (ref. 63) and modified in Microsoft Office.

Complete example of application

In this section, we report a complete example of the BarcodingR software application.

File formats: input and output

Most functions in BarcodingR take objects of the DNAbin class as inputs. To obtain DNAbin objects, users should prepare two FASTA files, one for the reference sequences and one for the query sequences. These two data sets are read into R using the function read.dna in the R package ape (Paradis, Claude & Strimmer 2004) or the function fasta2DNAbin in the package adegenet. Please note that the reference data set must contain taxon information in a special format such as ‘>seqID, species_names’, for instance, ‘>seq1, Dendrolimus_punctatus’ or ‘>seq1, Lasiocampidae_Dendrolimus_punctatus’, whereas the query data set can be in general FASTA format. The main species identification functions in BarcodingR return a list containing queIDs, identified species, and bp.probilities/FMF values. Outputs can be written to a file using the function save.ids, which takes an object of the ‘BarcodingR’ class as input.
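Before handing a reference file to BarcodingR, it can help to check that every header follows the ‘>seqID, species_names’ convention described above. The Python sketch below is a stand-alone illustration of that header format only; it is not part of BarcodingR, ape or adegenet, and the function name `read_reference_fasta` is hypothetical.

```python
def read_reference_fasta(lines):
    """Parse FASTA lines with headers of the form '>seqID, taxon'.

    Returns a list of (seq_id, taxon, sequence) tuples and raises
    ValueError if a header has no ', taxon' part.
    """
    records = []
    seq_id = taxon = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, taxon, "".join(chunks)))
            header = line[1:]
            if "," not in header:
                raise ValueError(f"header lacks ', taxon' part: {line!r}")
            # Split on the first comma only; the taxon may contain underscores.
            seq_id, taxon = (part.strip() for part in header.split(",", 1))
            chunks = []
        else:
            if seq_id is None:
                raise ValueError("sequence data before first header")
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, taxon, "".join(chunks)))
    return records
```

Passing an open file object works as well as a list of strings, for example `read_reference_fasta(open("reference.fasta"))`, where the file name is a placeholder.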

How to run

Here, we provide examples of utilities of the main functions in BarcodingR using included data sets. First, one should load the package and example data.
