February 8, 2017

Transcriptome analysis without an annotated genome: pretty fast and slightly dirty

tl;dr If you're doing RNASeq for a species without a well-annotated, contiguous genome and want useful differential expression results "quickly", run Trinity in genome-guided mode, then Kallisto+Sleuth. Annotate functionally using Diamond and a few Perl one-liners (below). It'll take less than a week, most of it hands-off Trinity run time.

_________________

An interesting RNA sequencing project came across my desk recently: a group was looking at a species of fish (with a published genome) that they'd corralled and raised at multiple sites upstream and downstream of a riparian urban center. The idea is to look for gene expression signatures of water quality stress in the fish.

These days there are a lot of genomes out there that were generated by groups that basically got funding to run a few HiSeq lanes, and they push out assemblies with thousands of contigs (if not more) to get a paper out of it. I don't blame them, as getting a nice genome of contiguous chromosomes is a lot of work, though it's becoming easier with long reads from PacBio and Oxford Nanopore instruments. The academic rewards these days are not that high for such an endeavour, especially if your bailiwick is molecular biology, ecology, physiology, etc. Even if you get a nice genome, gene structure and functional annotation in eukaryotes (except maybe fungal genomes, which I was involved in automating) takes a lot of manpower unless you can convince Ensembl your assembly is ripe for them to take on. The fish genome in question is in this dead space: no gene models, lots of contigs.

That being said, as soon as a "genome" sequence is available, researchers often assume you can now easily do RNASeq experiments with good functional analysis and the whole nine yards. I've done this before with Tophat + Cufflinks + Cuffdiff in de novo gene modelling mode for other organisms, which is okay but has issues with genes split across contigs, misassemblies, isoform detection, chimeric genes, etc. Since those tools were in vogue there have been a few advances in RNASeq processing, so I thought it was worth seeing what the alternatives can do.

First off, don't map back to NCBI's UniGene. I tried this since it's easy and got nothing, due to several factors: poor coverage of genes expressed in the tissue of interest (liver), and the fact that UniGene gives you the best representative sequence for each gene, not a consensus transcript. The representative is the longest good-quality read in dbEST for a gene, and because these mostly come from (poly-A-tail amplified) ESTs, they tend to have a 3' mRNA bias. That leads to a whole host of issues for differential expression analysis, especially in fish, which often have 8 copies of genes (they went through an extra whole genome duplication, so ohnologs that have 4 copies in most vertebrates have 8 in teleosts, i.e. most fish).

Given the volume of tissue specific transcript information you get these days for cheap, why not do a de novo transcript assembly, then run a quick differential expression workflow like I blogged in Using Kallisto & Sleuth for RNASeq analysis?  In my case, this yielded about 50 genes in logical functional categories that were candidates for further study in water quality stress response. Here are the details of how to proceed.

I downloaded the reference genome from the NCBI for the fish in question.

wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/JN/CD/JNCD01/JNCD01.1.fsa_nt.gz

Did the same for FastA parts 2 and 3. Concatenated the files to get a final FastA genome.

cat JNCD01.*.fsa_nt.gz | gzip -cd - > genome.fna

Indexed the genome with BWA (version 0.7+). If you need samtools, bwa, Trinity, etc., you may find it easiest to just install BioBuilds. Then...

bwa index genome.fna

Ran BWA against the genome for all the samples, in this case X = {site0, site1, site2, site3, site4}, and Y = {1,2,3,4} since there were four fish sampled from each site.

bwa mem -t 10 genome.fna <(gzip -cd siteX_sampleY.R1.fastq.gz) <(gzip -cd siteX_sampleY.R2.fastq.gz) | samtools sort -@ 4 -T siteX_sampleY -O bam > siteX_sampleY.bam
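Typing that twenty times gets old; assuming your FASTQs really are named siteX_sampleY.R1/R2.fastq.gz as above (that naming is my shorthand, adjust to taste), a bash loop covers all the site/sample combinations:

for site in site0 site1 site2 site3 site4; do
  for y in 1 2 3 4; do
    # map each sample's read pairs, then coordinate-sort straight to BAM
    bwa mem -t 10 genome.fna \
      <(gzip -cd ${site}_sample${y}.R1.fastq.gz) \
      <(gzip -cd ${site}_sample${y}.R2.fastq.gz) \
      | samtools sort -@ 4 -T ${site}_sample${y} -O bam > ${site}_sample${y}.bam
  done
done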

Merged all the BAMs into one big one (using 16 threads):

samtools merge -r -@ 16 -O BAM all_samples.bam *.bam
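If you want to sanity-check that -r really attached a read group per input file (it infers them from the file names), peek at the merged header; you should see one @RG line per sample:

samtools view -H all_samples.bam | grep '^@RG'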

Ran Trinity in reference-genome-guided mode on these data to generate consensus transcripts. NOTE: unlike most genome-guided methods, Trinity only uses the genome to pre-cluster the reads for the de novo assembly, so it doesn't inherit all the gaps and flaws of the "genome".

I have a lot of threads available on my machine, so I let it use 80.  YMMV.


Trinity --genome_guided_bam all_samples.bam   --genome_guided_max_intron 10000 --max_memory 450G --CPU 80 

If you didn't have a reference genome, you would just skip the bwa bit, the samtools bit and the previous line, going straight to this instead:

Trinity --seqType fq --max_memory 800G --single <(gzip -cd myreads.fastq.gz) --CPU 80 --trimmomatic

At this point, book a last-minute long weekend getaway. Trinity finished after 5 days, generating 510738 transcript contigs in trinity_output_dir. A normal de novo Trinity run on this volume of data would take 2 weeks or more. #winning

Next, I ran the assembled transcripts against the NCBI non-redundant protein database with Diamond (blastx mode).

Download the non-redundant protein dataset from the NCBI.

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz


Uncompress it, then index it with Diamond.

gzip -d nr.gz
diamond makedb --in nr -d nr

Run Diamond on the Trinity results (FastA).  It's hard to overstate how awesomely fast Diamond is at finding good protein homology for DNA sequences.

diamond blastx -d nr -q Trinity-GG.fasta -p 80 -a fathead.nr
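The -a flag writes Diamond's binary DAA archive (fathead.nr.daa), which MEGAN consumes directly below. We'll also want plain BLAST-style tabular text for the ID-mapping one-liners further down; diamond view does the conversion:

diamond view -a fathead.nr.daa -o fathead.nr.diamond.tab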

Then run MEGAN6's taxonomic classification algorithm (after being sure to install the "tools" during the setup script, and downloading the latest NCBI acc2tax file from the same site):


/export/common/programs/megan6/tools/blast2lca -i fathead.nr.daa -f DAA -m BlastX -o fathead.nr.diamond.tax -a2t /export/common/dbs/megan6/prot_acc2tax-Oct2017X1.abin

This yielded 142539 contigs with protein matches (the remaining ~370K are mostly rRNA, short nonsense fragments/assembly chimeras, low quality, etc.). Of these, 136764 match teleost proteins (the rest are Burkholderia, other contaminants, or unclassified). In my case the fish was evolutionarily fairly close to zebrafish, so most of the top scores had e-values < 10e-35. But you might want to pimp up this one-liner with an e-value or %ID filter if you've got an organism with no reasonable model organism in the evolutionary 'hood (a sketch of such a filter follows the next two commands).

perl -F\; -ane 'print "$F[0]\n" if /Actinopteri/' fathead.nr.diamond.tax > fathead.nr.diamond.tax.fish_matches.txt

perl -ne 'BEGIN{%keep=map {$_=>1} split /\s+/, `cat fathead.nr.diamond.tax.fish_matches.txt`; $/="\n>"}s/^>//; s/\n>$/\n/; print ">$_" if /^(\S+)/ and $keep{$1}' Trinity-GG.fasta > Trinity-GG.fish_keep.fna
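For instance, a stricter keep list that also demands the best Diamond hit beat an arbitrary e-value of 1e-10 could be built from the tabular output we made with diamond view earlier (column 11 is the e-value), then intersected with the fish taxonomy matches; a sketch, with made-up threshold and file names:

# contigs with at least one protein hit below 1e-10
perl -ane 'print "$F[0]\n" if $F[10] < 1e-10' fathead.nr.diamond.tab | sort -u > strict_evalue_ids.txt
# intersect with the fish matches (comm wants both inputs sorted)
comm -12 <(sort fathead.nr.diamond.tax.fish_matches.txt) strict_evalue_ids.txt > fish_matches.strict.txt

Feed fish_matches.strict.txt to the contig-keeping one-liner above instead, if you go this route.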

Index the kept contigs with kallisto:

kallisto index -i fishname.kallisto_index Trinity-GG.fish_keep.fna

Run kallisto against the fishy (that's a good thing) contigs for each sample.

kallisto quant -i fishname.kallisto_index -b 100 -l 180 -s 20 -o siteX_sampleY.kallisto <(gzip -cd siteX_sampleY.R1.fastq.gz) <(gzip -cd siteX_sampleY.R2.fastq.gz)
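As with the bwa step, a loop over the same (assumed) file naming runs them all:

for site in site0 site1 site2 site3 site4; do
  for y in 1 2 3 4; do
    # quantify each sample against the kept contigs, 100 bootstraps for sleuth
    kallisto quant -i fishname.kallisto_index -b 100 -l 180 -s 20 -o ${site}_sample${y}.kallisto \
      <(gzip -cd ${site}_sample${y}.R1.fastq.gz) <(gzip -cd ${site}_sample${y}.R2.fastq.gz)
  done
done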

Kallisto finished running the reads against the Trinity assembled transcripts.  


I then mapped the Trinity transcript assembly IDs to the best Diamond matches in the NCBI (RefSeq NP_ preferred, then XP_, then whatever):



perl -ane 'if(/(NP_.*?)\s/ and not $p{$F[0]}++){$f{$F[0]} = "$1\t$F[10]"}elsif(/(XP_.*?)\s/ and not $p{$F[0]} and not $x{$F[0]}++){$f{$F[0]} = "$1\t$F[10]"}elsif(not $p{$F[0]} and not $x{$F[0]}){$f{$F[0]} = "$F[1]\t$F[10]"}END{for(keys %f){print "$_\t$f{$_}\n"}}' fathead.nr.diamond.tab > trinity2refid
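The resulting trinity2refid file is three tab-separated columns: Trinity contig ID, best protein accession, and the Diamond e-value. Something like this (IDs made up for illustration):

TRINITY_GG_10057_c0_g1_i1	NP_001018443.1	2.1e-87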

Because it's quite likely that we only have partial coverage of genes, I want to consolidate our assembled transcripts to the gene symbol level as much as possible before the differential expression analysis. To this end, I mapped the RefSeq IDs to gene names in IPA (you could do this with DAVID, for example, if you don't have an IPA license) and saved them to refid2symbol_ipa.txt. I then mapped the IDs using the Entrez Gene database.

wget ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
gzip -cd gene2refseq.gz | perl -F\\t -ane 'print "$F[5]\t$F[15]"' > refid2symbol_entrez.txt

I also mapped the RefSeq IDs to UniProtKB (via UniProt's online ID mapping service), then extracted the UniProtKB IDs to gene names (saved together as refid2symbol_uniprot.txt).


perl -F\\t -ane 'print "$F[0]\t$1\n" if $F[6] =~ /^(\S+)/' uniprot-yourlist.tab > refid2symbol_uniprot.txt

In the end I had a file of Trinity IDs and their corresponding gene (if available from IPA, Entrez Gene or UniProt) or RefSeq transcript by running this:


perl -ane 'BEGIN{%r2i=split /\s/s,`cat refid2symbol_ipa.txt`; %r2e=split /\s/s,`cat refid2symbol_entrez.txt`; %r2u=split /\s/s,`cat refid2symbol_uniprot.txt`} print "$F[0]\t", ($r2i{$F[1]} || $r2e{$F[1]} || $r2u{$F[1]} || $F[1]), "\n"' trinity2refid > trinity2symbol.txt

I ran the following Sleuth (R) analysis which generates a table of the expression values for all sites where there is a significant log-ratio test qval (multiple testing corrected p-value) for the factor.  Most of the code from here down is a variation on my earlier post Using Kallisto & Sleuth for RNASeq analysis.
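For reference, metadata.tab is the standard sleuth sample table: one row per sample, with the required sample and path columns (path points at each kallisto output directory) plus our site factor. The exact layout below is an assumption to match the commands above:

sample          site  path
site0_sample1   0     site0_sample1.kallisto
site0_sample2   0     site0_sample2.kallisto
...
site4_sample4   4     site4_sample4.kallisto

If R reads site in as numbers, coerce it to a factor (meta$site <- factor(meta$site)) so the model gives the per-site coefficients (site1 ... site4, with site0's level as the base) used in the Wald tests below.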

library(sleuth)
meta <- read.table("metadata.tab", header=TRUE)
trinity2symbol <- read.table("trinity2symbol.txt", col.names=c("target_id","symbol"))
meta$path <- as.character(meta$path)
so <- sleuth_prep(meta, ~site, target_mapping=trinity2symbol, extra_bootstrap_summary=TRUE)
so <- sleuth_fit(so, ~site, "full")
so <- sleuth_fit(so, ~1, "reduced")
so <- sleuth_lrt(so, 'reduced', 'full')
results_table_lrt <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
results_table_lrt <- results_table_lrt[sort.list(results_table_lrt[,1]),]
lrt.sig_ids <- results_table_lrt$target_id[which(results_table_lrt$qval < 0.05)]

Now we want to perform the Wald test for each site vs. site0 (our factor base level) since this will give us the fold-changes and site-specific significance of the differential expression.

so <- sleuth_wt(so, 'site1')
so <- sleuth_wt(so, 'site2')
so <- sleuth_wt(so, 'site3')
so <- sleuth_wt(so, 'site4')
results_table_wt_1 <- sleuth_results(so, 'site1')
results_table_wt_2 <- sleuth_results(so, 'site2')
results_table_wt_3 <- sleuth_results(so, 'site3')
results_table_wt_4 <- sleuth_results(so, 'site4')

We'll want to report the beta (~fold change) and Wald test q-value at each site for each ID that passed the original log-ratio test.

shared_results_1 <- results_table_wt_1[results_table_wt_1$target_id %in% lrt.sig_ids,]
shared_results_2 <- results_table_wt_2[results_table_wt_2$target_id %in% lrt.sig_ids,]
shared_results_3 <- results_table_wt_3[results_table_wt_3$target_id %in% lrt.sig_ids,]
shared_results_4 <- results_table_wt_4[results_table_wt_4$target_id %in% lrt.sig_ids,]

It's important at this point to make sure we have all the rows of each sample's table in the same index positions for collating the output table. The easiest way to do this is to sort all the results alphabetically by target_id (transcript name), like I snuck in earlier for the LRT results in case you weren't paying attention.

shared_results_1 <- shared_results_1[sort.list(shared_results_1[,1]),]
shared_results_2 <- shared_results_2[sort.list(shared_results_2[,1]),]
shared_results_3 <- shared_results_3[sort.list(shared_results_3[,1]),]
shared_results_4 <- shared_results_4[sort.list(shared_results_4[,1]),]

Collate the results and write to a file. The first column is the significant (q-value < .05) log-ratio test score information for the factor for each gene across all the sites, while the subsequent columns are the Wald scores and log fold-changes (kinda) for each specific site. Note that because we have multiple values for the factor, a gene can pass the LRT yet have insufficient replicates at any given site to get a good (qval < .05) Wald score there. This is in contrast to the binary factors people usually test with RNASeq (e.g., treatment vs. control). Sequencing more replicates is left as an exercise for the reader.

combined_result <- data.frame(results_table_lrt[results_table_lrt$target_id %in% lrt.sig_ids,],
                              site1_qval=shared_results_1[,"qval"], site1_b=shared_results_1[,"b"],
                              site2_qval=shared_results_2[,"qval"], site2_b=shared_results_2[,"b"],
                              site3_qval=shared_results_3[,"qval"], site3_b=shared_results_3[,"b"],
                              site4_qval=shared_results_4[,"qval"], site4_b=shared_results_4[,"b"])
write.table(combined_result, "all_sites_lrt_passed_qval.05.txt", sep="\t")

Now that we have the diff-exp table, let's annotate the transcripts that had protein hits (note the literal ctrl-A in the command; at a bash prompt you can type it as Ctrl-V then Ctrl-A):

perl -ne 'BEGIN{%t=reverse(split /\s/s, `cut -f 2,1 trinity2refid`); $/=">"}print "$1\t$2\n" if /^(\S+)\s+(.*?)^A/ and exists $t{$1}' nr > trinity_descs.txt



Then stitch the descriptions onto the significant hits (tids is the sorted list of significant Trinity IDs cut from the diff-exp table; id_desc.txt is an intermediate ID-to-description mapping):

sort tids | perl -ane 'BEGIN{%d=split /[\t\n]/s, `cut -f 1,2 id_desc.txt`}print $d{$F[0]},"\n"' | perl -ane 'BEGIN{%d=split /[\t\n]/s, `cut -f 2,3 trinity_descs.txt`}print $d{$F[0]},"\n"' > descs

Let's also append the e-value of the protein hit used for the annotation:

perl -F\\t -lane 'BEGIN{%e=split /\s/s, `cut -f 1,3 trinity2refid`}print $_,"\t",$e{$F[0]}' all_sites_lrt_passed_qval_desc.05.txt > lrt_any_factor_qval_lt_0.05_desc_diamond_evalues.txt



That's it! Now you can go down the rabbit hole of functional analysis, a blog post I'll get to writing eventually...

1 comment:

  1. Hi,
    I found here (https://rpubs.com/kapeelc12/Sleuth) (step 10) that it is possible to generate the following text file:

    target_id ens_gene
    Sb10g000101.1 Sb10g000101
    Sb10g000200.1 Sb10g000200
    Sb10g000210.1 Sb10g000210

    t2g <- read.table("/home/kapeelc/sleuth_analysis/t2g.txt", header = TRUE, stringsAsFactors=FALSE)
    so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g, extra_bootstrap_summary = TRUE)

    I was just wondering whether you have tried creating the above table and loading it into sleuth?

    Thank you in advance.

    Michal
