What is “contigs” in Picard's ReorderSAM?


I've used BWA to map my NGS reads against the hg38 genome, and I have a BAM file. I'm not doing genome assembly, and my reference genome file has the human chromosomes. Thus, I shouldn't have "contigs". But…

https://broadinstitute.github.io/picard/command-line-overview.html#ReorderSam

and quote:

ReorderSam reorders reads in a SAM/BAM file to match the contig ordering in a provided reference file, as determined by exact name matching of contigs

Q: What does "contig ordering" mean for my whole-genome-sequencing experiment? In particular, what does matching the contigs against a reference file mean?


I'm not familiar with Picard and its ReorderSam tool, but as far as I understand from the documentation, they mean this:
The ordering of the contigs in the reference sequence. For background, contigs arise in an assembly like this:

Figure 5: Anatomy of whole-genome assembly. In whole-genome assembly, the BAC fragments (red line segments) and the reads from five individuals (black line segments) are combined to produce a contig and a consensus sequence (green line). The contigs are connected into scaffolds, shown in red, by pairing end sequences, which are also called mates. If there is a gap between consecutive contigs, it has a known size. Next, the scaffolds are mapped to the genome (gray line) using sequence tagged site (STS) information, represented by blue stars. © 2001 American Association for the Advancement of Science Venter, C. et al. The sequence of the human genome. Science 291, 1304-1351 (2001). All rights reserved. (source)

ReorderSAM (Picard)
So in Picard you have your INPUT (File); the reads in this file are then reordered to follow the contigs listed in the REFERENCE (File). This can also be seen in their code:

    // write the reads in contig order
    for (final SAMSequenceRecord contig : refDict.getSequences()) {
        final SAMRecordIterator it = in.query(contig.getSequenceName(), 0, 0, false);
        writeReads(out, it, newOrder, contig.getSequenceName());
    }

(code source)
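The loop above can be pictured with a small stand-alone sketch. It does not use Picard or htsjdk: the `Read` class, the `reorder` method and the toy contig names are hypothetical stand-ins chosen for illustration. It assumes reads are simply grouped by the contig they map to, written out in the order the reference lists its contigs, and dropped when their contig is absent from the reference. Unmapped reads and coordinate sorting within a contig are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ContigOrderSketch {
    /** A toy aligned read: only a name and the contig it maps to (hypothetical type). */
    static final class Read {
        final String name;
        final String contig;

        Read(String name, String contig) {
            this.name = name;
            this.contig = contig;
        }
    }

    /** Returns the reads grouped by contig, in the order the reference lists its contigs. */
    static List<Read> reorder(List<Read> reads, List<String> referenceContigs) {
        // One bucket per reference contig; LinkedHashMap keeps the reference order.
        Map<String, List<Read>> buckets = new LinkedHashMap<>();
        for (String contig : referenceContigs) {
            buckets.put(contig, new ArrayList<>());
        }
        for (Read read : reads) {
            List<Read> bucket = buckets.get(read.contig);
            if (bucket != null) {
                bucket.add(read);
            }
            // Reads on contigs absent from the reference fall through and are dropped.
        }
        List<Read> ordered = new ArrayList<>();
        for (List<Read> bucket : buckets.values()) {
            ordered.addAll(bucket);
        }
        return ordered;
    }

    /** Convenience helper: the read names, in list order. */
    static List<String> names(List<Read> reads) {
        List<String> result = new ArrayList<>();
        for (Read read : reads) {
            result.add(read.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("chr1", "chr2", "chrX");
        List<Read> reads = Arrays.asList(
                new Read("r1", "chr2"),
                new Read("r2", "chr1"),
                new Read("r3", "chrUn_toy"), // contig not in the reference
                new Read("r4", "chr1"),
                new Read("r5", "chrX"));
        System.out.println(names(reorder(reads, reference))); // prints [r2, r4, r1, r5]
    }
}
```

Here the chromosomes themselves play the role of contigs: nothing is assembled, the reads are only regrouped by sequence name.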

ReorderSam reorders reads in a SAM/BAM file to match the contig ordering in a provided reference file
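In this sense every named sequence in the reference counts as a "contig", so for hg38 the chromosomes are the contigs. As a rough illustration of where the "contig ordering" comes from, the sketch below reads the sequence names off FASTA header lines in file order. The class and method names are hypothetical, the input is a toy FASTA held in memory, and it relies only on the FASTA convention that a header starts with `>` and the name is its first whitespace-delimited token. Picard itself works from the reference's sequence dictionary rather than from code like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FastaContigNames {
    /** Collects sequence names from FASTA header lines, preserving file order. */
    static List<String> contigNames(List<String> fastaLines) {
        List<String> names = new ArrayList<>();
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                // Drop the '>' and keep only the first whitespace-delimited token.
                String header = line.substring(1).trim();
                names.add(header.split("\\s+")[0]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> toyFasta = Arrays.asList(
                ">chr1 toy description",
                "ACGTACGT",
                ">chr2",
                "GGCCGGCC",
                ">chrM",
                "ATAT");
        System.out.println(contigNames(toyFasta)); // prints [chr1, chr2, chrM]
    }
}
```

Because matching is by exact name, a BAM whose header says `1` would not match a reference that says `chr1`.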

Some more background
There are two main approaches to obtaining a genome sequence: de novo assembly and reference mapping.

Second-generation sequencing technologies produce millions of short (a few hundred bp) strings of nucleotides (reads), which is ideal for resequencing, where reads are mapped to a reference genome (reference-based assembly). De novo genome assembly based on second-generation sequencing is challenging due to difficulties with GC- or AT-rich and homonucleotide DNA stretches, which are under-represented in the sequencing output (source)

The characteristics of these are:
de novo

  • no bias towards a reference genome
  • no template to adapt to
  • the assembly is normally more fragmented
  • it normally works better for large-scale/medium-scale differences (source)


reference mapping

  • fewer contigs
  • in most methods the reads that don't map are not used in the final sequence (this is also the case with ReorderSam: "Reads mapped to contigs absent in the new reference are dropped")
  • you look at what is similar to your reference genome
  • SNPs and very small variations are more easily positioned and compared among groups (source)

I would highly recommend watching this short animation to see the difference between these two approaches and to understand what reference genome mapping is.