Whole transcriptome sequencing is increasingly being used as a functional
genomics tool to study non-model organisms. However, when the reference
transcriptome used to calculate differential expression is incomplete,
significant error in the inferred expression levels can result. In this study,
we use simulated reads generated from real transcriptomes to determine the
accuracy of read mapping, and measure the error resulting from using an
incomplete transcriptome. We show that the two primary sources of counting
error are 1) alternative splice variants that share reads and 2) missing
transcripts from the reference. Alternative splice variants increase the false
positive rate of mapping while incomplete reference transcriptomes decrease
the true positive rate, leading to inaccurate transcript expression levels.
Grouping transcripts by gene or read sharing (similar to mapping to a reference
genome) significantly decreases false positives, but only by improving the
reference transcriptome itself can the missing transcript problem be addressed.
We also demonstrate that employing different mapping software does not yield
substantial increases in accuracy on simulated data. Finally, we show that read
lengths or insert sizes must increase past 1 kb to resolve mapping ambiguity.
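
As an illustration of grouping transcripts by read sharing, the following is a minimal sketch and not necessarily the procedure used in this study. It assumes the mapping results are available as a dictionary from read ID to the list of transcripts that read aligns to. The function names, the input layout, and the toy transcript IDs are all hypothetical. Transcripts linked by at least one shared read are merged into a cluster with a union-find structure, and each read is then counted once per cluster rather than once per transcript.

```python
from collections import defaultdict


def group_by_read_sharing(read_hits):
    """Cluster transcripts that share at least one read (union-find).

    read_hits: dict mapping read ID -> list of transcript IDs it maps to
    (hypothetical input layout, assumed for this sketch).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for transcripts in read_hits.values():
        if not transcripts:
            continue  # unmapped read
        first = transcripts[0]
        find(first)  # register transcripts hit only by unique reads
        for other in transcripts[1:]:
            union(first, other)

    clusters = defaultdict(set)
    for transcript in list(parent):
        clusters[find(transcript)].add(transcript)
    return list(clusters.values())


def cluster_counts(read_hits, clusters):
    """Count each mapped read once, for the cluster its transcripts belong to."""
    cluster_of = {t: i for i, members in enumerate(clusters) for t in members}
    counts = defaultdict(int)
    for transcripts in read_hits.values():
        if transcripts:
            counts[cluster_of[transcripts[0]]] += 1
    return dict(counts)


# Toy example: two reads are shared among splice variants of one gene.
read_hits = {
    "r1": ["geneA.1", "geneA.2"],
    "r2": ["geneA.2", "geneA.3"],
    "r3": ["geneB.1"],
}
clusters = group_by_read_sharing(read_hits)
counts = cluster_counts(read_hits, clusters)
```

In the toy example, the three geneA variants collapse into one cluster, so the ambiguous reads r1 and r2 no longer have to be assigned to a single variant. This removes the ambiguity among transcripts that are present in the reference, but it cannot recover reads from transcripts that are missing from it.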