RNAseq demo & notes
Yuxuan Wu Lv13

RNAseq

RNA-seq (RNA-sequencing) is a technique that can examine the quantity and sequences of RNA in a sample using next generation sequencing (NGS). It analyzes the transcriptome of gene expression patterns encoded within our RNA.

What are the applications of RNA-seq?

RNA-seq lets us investigate and discover the transcriptome, the total cellular content of RNAs including mRNA, rRNA and tRNA. Understanding the transcriptome is key if we are to connect the information on our genome with its functional protein expression.

RNA-seq can tell us which genes are turned on in a cell, what their level of expression is, and at what times they are activated or shut off. This allows scientists to more deeply understand the biology of a cell and assess changes that may indicate disease. Some of the most popular techniques that use RNA-seq are transcriptional profiling, SNP identification, RNA editing and differential gene expression analysis.

RNA-seq Short Reads

Figure1: RNA-seq data uses uses short reads of mRNA which is free of intronic non-coding DNA. These reads must then be aligned back to the reference genome.

This can give researchers vital information about the function of genes. For example, the transcriptome can highlight all the tissues in which a gene of unknown function is expressed, which might indicate what its role is. It also captures information about alternative splicing events (Figure 1), which produce different transcripts from one single gene sequence. These events would not be picked up by DNA sequencing. It can also identify post-transcriptional modifications that occur during mRNA processing such as polyadenylation and 5’ capping

How does RNA-seq work?

Early RNA-seq techniques used Sanger sequencing technology, a technique that although innovative at the time, was also low-throughput, costly, and inaccurate. It is only recently, with the advent and proliferation of NGS technology, have we been able to fully take advantage of RNA-seq’s potential

The first step in the technique involves converting the population of RNA to be sequenced into cDNA fragments (a cDNA library). This allows the RNA to be put into an NGS workflow. Adapters are then added to each end of the fragments. These adapters contain functional elements which permit sequencing; for example, the amplification element and the primary sequencing site. The cDNA library is then analyzed by NGS, producing short sequences which correspond to either one or both ends of the fragment. The depth to which the library is sequenced varies depending on techniques which the output data will be used for. The sequencing often follows either single-read or paired-end sequencing methods. Single-read sequencing is a cheaper and faster technique (for reference, about 1% of the cost of Sanger sequencing) that sequences the cDNA from just one end, whilst paired-end methods sequence from both ends, and are therefore more expensive and time-consuming5,6.

A further choice must be made between strand-specific and non-strand-specific protocols. The former method means the information about which DNA strand was transcribed is retained. The value of extra information obtained from strand-specific protocols make them the favorable option.

These reads, of which there will be many millions by the end of the workflow, can then be aligned to a genome of reference and assembled to produce an RNA sequence map that spans the transcriptome7.

The whole pipeline

img

聊聊转录组测序——2.数据分析与解读(上)

CHIP mechanisms

  • Chromatin ImmunoPrecipitation Sequencing (ChIP-Seq)

  • An approach to ==detect specific DNA regions/sequences associated with a protein of interest, in vivo.==

  • Became a powerful tool to analyze protein/DNA interactions, furthermore to detect any signal/modification associated with DNA/Chromatin.

• Protein-DNA interactions

• Chromatin States

• Transcriptional regulation

• DNA modifications (MeDIP)

image-20210326233143439

ChIP

•DNA and associated proteins on chromatin in living cells or tissues are crosslinked (this step is omitted in Native ChIP).

•The DNA-protein complexes (chromatin-protein) are then sheared into ~500 bp DNA fragments by sonication or nuclease digestion.

•Cross-linked DNA fragments associated with the protein(s) of interest are selectively immunoprecipitated from the cell debris using an appropriate protein-specific antibody.

•The associated DNA fragments are purified and their sequence is determined. Enrichment of specific DNA sequences represents regions on the genome that the protein of interest is associated with in vivo.

image-20210326233303656

image-20210327085931977


NGS flow chart & details

image-20210329113524383

FASTQ format

image-20210329113608444

SAM format:

image-20210329113732323

NGS reads Alignment

image-20210329113822691

Fasta format

  • https://en.wikipedia.org/wiki/FASTA
  • The first line in a FASTA file starts either with a “>” (greater-than) symbol
  • Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.

TDF format

  • A tiled data file (TDF) file (.tdf) is a binary file that contains data that has been preprocessed for faster display in IGV. There are other alternative formats, including wig, bigwig, etc.

image-20210329114133206

FASTAQC- NGS Data Quality

image-20210329114513821

如何阅读fastqc

非常详细

https://zhuanlan.zhihu.com/p/20731723

image-20210331144050385

Two main problems we need to pay attention to:

  • Base quality
  • Adaptor

Data format:

  • SRA: Sequencing data (compressed format)

  • FASTQ: Sequencing data

  • SAM: Aligned sequencing data

  • BAM: Aligned sequencing data (compressed format)

  • TDF: (Aligned) Tag density file for visualization of BAM

  • GTF: Gene annotation

  • BED: range format (chr1, 210-350). It can be used to store gene annotation, sequencing read alignment information, TF binding site information, etc.


Key_Info.bash

1
2
3
4
5
6
7
8
9
10
11
12
13
# Project #2
# BIO405
# 2021-03-15
# Bash Script

# S1. make your working directory ##################################
# connect to the Linux server
pwd # check where you are
ls # see what do you have
mkdir ~/P2 # make a folder for this project under your home directory ~
# "~" means your home directory on the linux server
ls ~ # to check whether you have the P2 folder generated under your home directory

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# S2. download one fastq data
# Alternative 1: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA486780&o=acc_s%3Aa (used example)
# Alternative 2: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA403831&o=acc_s%3Aa
# Most RNAseq datasets were deposited into GEO: https://www.ncbi.nlm.nih.gov/geo/
# try to find a dataset YOU are interested in. Download process can be done in parallel

# download the fastq file
mkdir ~/P2/fastq
(nohup fastq-dump --split-3 -v SRR7721730 -O ~/P2/fastq > ~/P2/sra/730.out)&
ls -l ~/P2/fastq # check what you have now

# Explanations
# (nohup)&: means run the command in background mode
# --splite-3: if datas are paired-end, they will be divided into two files
# SRRxxxxxx: the sample number in GEO
# -O: where to save the downloaded sample
# > ~/P2/fastq/730.out: where to save the messages received while downloading the file

ls -l ~/P2/fastq # check what you have now

Download the fastq file

Explanations

  • (nohup)&: means run the command in background mode
  • –splite-3: if datas are paired-end, they will be divided into two files
  • SRRxxxxxx: the sample number in GEO
  • -O: where to save the downloaded sample
  • > ~/P2/fastq/730.out: where to save the messages received while downloading the file
1
2
3
4
# S3. Top
htop -u "$USER" # check the load of the server and the tasks you are running, if you see too many tasks are running, you should not submit additional jobs.
q # to quit from htop command
killall -u "$USER" # to terminate all your running jobs

Quality control

1
2
3
4
5
6
# S4. quality control ####################################
mkdir ~/P2/fastqc_report
fastqc -o ~/P2/fastqc_report -f fastq ~/P2/fastq/SRR7721730_1.fastq ~/P2/fastq/SRR7721730_2.fastq
# for paired end of data, you have to check the data quality of both reads

# you could check the html file in the Rstudio

Obtain the index for alignment

1
2
3
4
5
6
7
8
9
10
11
12
13
# S6. get the whole genome fasta
mkdir ~/P2/index
# it takes a long time to download this genome annotation pacakge, so I have downloaded everything for you.
# all the genome annotation (human, mouse, etc) is available from: http://10.7.88.74/MyWeb/Genomes
wget -O ~/P2/index/hg38.fa http://10.7.88.74/MyWeb/Genomes/hg38/genome.fa # download genome
wget -O ~/P2/index/hg38.gtf http://10.7.88.74/MyWeb/Genomes/hg38/genes.gtf # download annotation

# S7. get the index, needed for alignment
# hisat2-build hg38.fa hg38
# build index, take a very long time
# with the assistance of index, we could increase the speed of query data
(nohup hisat2-build ~/P2/index/hg38.fa ~/P2/index/hg38 > ~/P2/index/index_message.out)& # let us do it in background mode
ls -l ~/P2/index/ # the index files are very larger

tophat alignment

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# S8. Align the fastq files to genome with tophat
cd ..
mkdir ~/P2/bam # make a folder to store aligned bam files

(nohup hisat2 -x ~/P2/index/hg38 -1 ~/P2/fastq/SRR7721730_1.fastq -2 ~/P2/fastq/SRR7721730_2.fastq -S ~/P2/bam/SRR7721730.sam > ~/P2/bam/hisat_730.out)&
head ~/P2/bam/SRR7721730/730.out # check the summary result of alignment

# covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR7721730.sam -b > ~/P3/bam/SRR7721730.bam)&

# sort alignment
(nohup samtools sort ~/P3/bam/SRR7721730.bam -o ~/P3/bam/SRR7721730_sorted.bam)& # sort alignment

# index all the bam files, which is required for visualization in IGV
(nohup samtools index ~/P3/bam/SRR7721730_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.

TDF file for visualization of bam files

  • IGV browser, downloading it when required
1
2
3
4
5
6
7
8
9
10
11
# S9. prepare TDF file for visualization of bam files
# download the chromesome size file
wget -O ~/P2/tdf/hg38.chrom.sizes http://10.7.88.74/MyWeb/Genomes/chrom.sizes/hg38.chrom.sizes # use a different genome size file if necessary
mkdir ~/P2/tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/P2/bam/SRR7721730_sorted.bam ~/P2/tdf/SRR7721730.tdf ~/P2/tdf/hg38.chrom.sizes > ~/P2/tdf/igv_730.out)&
ls ~/P2/tdf

# S10. genome browser
# download IGV browser: http://software.broadinstitute.org/software/igv/download
# use IGV to visualize the TDF files
# use IGV to visualize the BAM files (with the .bai index file)
  • use IGV to visualize the TDF files
  • use IGV to visualize the BAM files (with the .bai index file)

RNAseq data analysis

Data preparation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# RNAseq data analysis
# BIO316
# 2021-03-15
# Bash script

# S1. make your working directory ##################################
# connect to the Linux server
cd ~ # go to your home directory
mkdir P3 # or use whatever name you prefer
ls # check whether you have the new folder for project 3
pwd # check your current working directory


# S2. download the raw data in sra format ####################################
# Data from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53794
mkdir ~/P3/fastq # make a new directory to store raw sequencing data in SRA format
ls ~/P3/fastq # nothing is inside


# S3 download all the data in parellel
(nohup fastq-dump --split-3 SRR1068226 -O ~/P3/fastq > ~/P3/fastq/226.out)&
(nohup fastq-dump --split-3 SRR1068227 -O ~/P3/fastq > ~/P3/fastq/227.out)&
(nohup fastq-dump --split-3 SRR1068228 -O ~/P3/fastq > ~/P3/fastq/228.out)&
(nohup fastq-dump --split-3 SRR1068229 -O ~/P3/fastq > ~/P3/fastq/229.out)&
(nohup fastq-dump --split-3 SRR1068230 -O ~/P3/fastq > ~/P3/fastq/230.out)&
(nohup fastq-dump --split-3 SRR1068231 -O ~/P3/fastq > ~/P3/fastq/231.out)&

# the structure of this command is: (nohup command_to_be_executed > text_log_file.out)&
## --splite-3 for mate-pairs: biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq.
## If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.
## If the data is single-end, it will not save as two files.

Syntax explanation

  • the structure of this command is: ==(nohup command_to_be_executed > text_log_file.out)&==
  • ==–splite-3 for mate-pairs: biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq.==
  • If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.
  • If the data is single-end, it will not save as two files.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
top # check the download process
ps -u j # check you on-going processes, replace "j" with your own user ID "y18u1"
ls ~/P3/fastq # check downloaded fastq files
cat ~/P3/fastq/226.out # check whether there is any error during the data downloading

# S4. quality control ####################################
mkdir ~/P3/report
fastqc -o ~/P3/report -f fastq ~/P3/fastq/SRR1068226.fastq ~/P3/fastq/SRR1068227.fastq\
~/P3/fastq/SRR1068228.fastq ~/P3/fastq/SRR1068229.fastq ~/P3/fastq/SRR1068230.fastq ~/P3/fastq/SRR1068231.fastq
# notice, "\" indicates the line is not over yet, continue in the next line. It is equivalent to the following:
# this is different from R, in which you can change the line after any meaningful unit.
ls ~/P3/report # check the generated fastqc reports, see whether everything is OK.
# you may have to trim the data if necessary

# S5. copy the mouse genome information
mkdir ~/P3/index
# it takes a long time to download this genome annotation pacakge, so I have downloaded everything for you.
# all the genome annotation (human, mouse, etc) is available from: http://10.7.88.74/MyWeb/Genomes
wget -O ~/P3/index/mm9.fa http://10.7.88.74/MyWeb/Genomes/mm9/genome.fa
wget -O ~/P3/index/mm9.gtf http://10.7.88.74/MyWeb/Genomes/mm9/genes.gtf

# S7. Build index, needed for alignment
(nohup hisat2-build ~/P3/index/mm9.fa ~/P3/index/mm9 > ~/P3/index/build_index.out)& # build index, take a very long time
top # check the progress
cat ~/P3/index/build_index.out # check the notifications
ls -l ~/P3/index/ # check the index. the index files are very larger

# S8. Align the fastq files to genome with tophat
mkdir ~/P3/bam # make a folder to store aligned bam files
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068226.fastq -S ~/P3/bam/SRR1068226.sam > ~/P3/bam/226.out)&
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068227.fastq -S ~/P3/bam/SRR1068227.sam > ~/P3/bam/227.out)&
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068228.fastq -S ~/P3/bam/SRR1068228.sam > ~/P3/bam/228.out)&
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068229.fastq -S ~/P3/bam/SRR1068229.sam > ~/P3/bam/229.out)&
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068230.fastq -S ~/P3/bam/SRR1068230.sam > ~/P3/bam/230.out)&
(nohup hisat2 -x ~/P3/index/mm9 -U ~/P3/fastq/SRR1068231.fastq -S ~/P3/bam/SRR1068231.sam > ~/P3/bam/231.out)&
# for paired-end data, you can use below command
(nohup hisat2 -x ~/P2/index/hg38 -1 ~/P2/fastq/SRR7721730_1.fastq -2 ~/P2/fastq/SRR7721730_2.fastq -S ~/P3/bam/SRR7721730.sam> ~/P2/bam/hisat_730.out)&
top # check the progress of all tasks
cat ~/P3/bam/226.out # check the alignment efficiency

# conver to bam
(nohup samtools view -S ~/P3/bam/SRR1068226.sam -b > ~/P3/bam/SRR1068226.bam)& # covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR1068227.sam -b > ~/P3/bam/SRR1068227.bam)& # covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR1068228.sam -b > ~/P3/bam/SRR1068228.bam)& # covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR1068229.sam -b > ~/P3/bam/SRR1068229.bam)& # covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR1068230.sam -b > ~/P3/bam/SRR1068230.bam)& # covert to bam file to save space
(nohup samtools view -S ~/P3/bam/SRR1068231.sam -b > ~/P3/bam/SRR1068231.bam)& # covert to bam file to save space

# sort alignment
(nohup samtools sort ~/P3/bam/SRR1068226.bam -o ~/P3/bam/SRR1068226_sorted.bam)& # sort alignment
(nohup samtools sort ~/P3/bam/SRR1068227.bam -o ~/P3/bam/SRR1068227_sorted.bam)& # sort alignment
(nohup samtools sort ~/P3/bam/SRR1068228.bam -o ~/P3/bam/SRR1068228_sorted.bam)& # sort alignment
(nohup samtools sort ~/P3/bam/SRR1068229.bam -o ~/P3/bam/SRR1068229_sorted.bam)& # sort alignment
(nohup samtools sort ~/P3/bam/SRR1068230.bam -o ~/P3/bam/SRR1068230_sorted.bam)& # sort alignment
(nohup samtools sort ~/P3/bam/SRR1068231.bam -o ~/P3/bam/SRR1068231_sorted.bam)& # sort alignment

# index alignment
(nohup samtools index ~/P3/bam/SRR1068226_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/P3/bam/SRR1068227_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/P3/bam/SRR1068228_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/P3/bam/SRR1068229_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/P3/bam/SRR1068230_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/P3/bam/SRR1068231_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.

# S9. find transcribed regions of a single sample
mkdir ~/P3/gtf
(nohup cufflinks ~/P3/bam/SRR1068226_sorted.bam -o ~/P3/gtf/SRR1068226 > ~/P3/gtf/cufflinks.out)& # reference-based transcriptome reconstruction
head ~/P3/gtf/SRR1068226/transcripts.gtf
cufflinks --help # check how to do reference-based transcriptome assembly

# S10. visualization of bam files
# we use mm9 here, you may check the folder /home/jmeng/App/Genomes/chrom.sizes for other genome assembly numbers, such as, hg19 and mm10.
mkdir ~/P3/tdf
wget -O ~/P3/tdf/mm9.chrom.sizes http://10.7.88.74/MyWeb/Genomes/chrom.sizes/mm9.chrom.sizes # use a different genome size file if necessary
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068226.tdf ~/P3/tdf/mm9.chrom.sizes)&
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068227.tdf ~/P3/tdf/mm9.chrom.sizes)&
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068228.tdf ~/P3/tdf/mm9.chrom.sizes)&
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068229.tdf ~/P3/tdf/mm9.chrom.sizes)&
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068230.tdf ~/P3/tdf/mm9.chrom.sizes)&
(nohup igvtools count -z 5 -w 10 -e 0 ~/P3/bam/SRR1068226_sorted.bam ~/P3/tdf/SRR1068231.tdf ~/P3/tdf/mm9.chrom.sizes)&
ls tdf

# use IGV to visualize the TDF files
# compare /P3/tdf/SRR1068226.tdf with the gtf
# file generated ~/P3/gtf/SRR1068226/transcripts.gtf and with reference

# S11. differential analysis with cuffdiff
# copy the gene annotation file mm9.gtf to the same directory, or you may provide a full path
# gene annotation GTF of mm9 and hg38 are available at directory: /home/j/BIO405/Genomes, choose the one you need:
mkdir ~/P3/DGE
(nohup cuffdiff -o ~/P3/DGE -p 2 -L WT,CI994\
-b ~/P3/index/mm9.fa -u ~/P3/index/mm9.gtf\
~/P3/bam/SRR1068226_sorted.bam,~/P3/bam/SRR1068227_sorted.bam,~/P3/bam/SRR1068228_sorted.bam\
~/P3/bam/SRR1068229_sorted.bam,~/P3/bam/SRR1068230_sorted.bam,~/P3/bam/SRR1068231_sorted.bam\
> ~/P3/DGE/cuffdiff.out)&
# the above step will take a long time, so let us do it in background mode

ps -u $USER
ls ~/P3/DGE
head ~/P3/DGE/cuffdiff.out
head ~/P3/DGE/gene_exp.diff

# Use "Cuffdiff" to conduct differential expression analysis
# please check the meaning of "-p 4", "-L WT,CI994", "-b ~/P3/index/mm9.fa", "-u" with the following command
cuffdiff --help
# check the files generated in output folder DGE specified by "-o ~/P3/DGE"

Chipseq_MACS2.bash

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# S1. MACS2 peak calling from ChIP-seq data
# Download ChIPseq data
mkdir ~/P4
mkdir ~/P4/fastq
mkdir ~/P4/index
mkdir ~/P4/report
mkdir ~/P4/bam
mkdir ~/P4/tdf
mkdir ~/P4/macs

# download the data from GEO
# please use the data of your interests in the assignment
# it is better to find a data set related to your master thesis
(nohup fastq-dump SRR568477 -O ~/P4/fastq > ~/P4/fastq/477.out)&
(nohup fastq-dump SRR1768320 -O ~/P4/fastq > ~/P4/fastq/320.out)&

# please note that
# if the sample use pair-end sequencing, if you should use "fastq-dump --split-files ..." and will get two fastq files.
# if the reads quality of your data is weird, it might because you have two reads cancatenated together.
# more reads are no pair-end, it means you will see "--split-files" more often in the future.

# check the data quality
(nohup fastqc -o ~/P4/report -f fastq ~/P4/fastq/SRR568477.fastq > ~/P4/report/fastqc_477.out)& # quality control
(nohup fastqc -o ~/P4/report -f fastq ~/P4/fastq/SRR1768320.fastq > ~/P4/report/fastqc_320.out)& # quality control

# download the genome and annotation file
# it takes a long time to do that, so I have downloaded everything for you.
# all the genome annotation (human, mouse, etc) is available from: http://10.7.88.74/MyWeb/Genomes
wget -O ~/P4/index/mm10.fa http://10.7.88.74/MyWeb/Genomes/mm10/genome.fa # download the genome
wget -O ~/P4/index/mm10.gtf http://10.7.88.74/MyWeb/Genomes/mm10/genes.gtf # download the annotation
# If you need to work on other species, then you will have to find them from other places.


# you should usually go through the following
# build index, needed for alignment
(nohup hisat2-build ~/P4/index/mm10.fa ~/P4/index/mm10 > ~/P3/index/build_index.out)& # build index, take a very long time
top # check the progress
ps -u $USER # check your on-going processes
cat ~/P4/index/build_index.out # check the notifications
ls -l ~/P4/index/ # check the index. the index files are very larger

# alignment
(nohup hisat2 -x ~/P4/index/mm10 -U ~/P4/fastq/SRR568477.fastq -S ~/P4/bam/SRR568477.sam > ~/P4/bam/align_SRR568477.log)& # do alignment in background
(nohup hisat2 -x ~/P4/index/mm10 -U ~/P4/fastq/SRR1768320.fastq -S ~/P4/bam/SRR1768320.sam > ~/P4/bam/align_SRR1768320.log)& # do alignment in background

# convert SAM to BAM
(nohup samtools view -bS ~/P4/bam/SRR568477.sam -o ~/P4/bam/SRR568477.bam)& # convert sam to bam file
(nohup samtools view -bS ~/P4/bam/SRR1768320.sam -o ~/P4/bam/SRR1768320.bam)& # convert sam to bam file
ls -l ~/P4/bam/ # check the size of files. BAM is a lot smaller than SAM

# sort the alignments, which is needed for many tools
(nohup samtools sort ~/P4/bam/SRR568477.bam -o ~/P4/bam/SRR568477_sorted.bam > ~/P4/bam/sort_SRR568477.out)& # sort alignment by coordinates
(nohup samtools sort ~/P4/bam/SRR1768320.bam -o ~/P4/bam/SRR1768320_sorted.bam > ~/P4/bam/sort_SRR1768320.out)& # sort alignment by coordinates

# visualization
wget -O ~/P4/tdf/mm10.chrom.sizes http://10.7.88.74/MyWeb/Genomes/chrom.sizes/mm10.chrom.sizes # use a different genome size file if necessary
(nohup igvtools count -z 5 -w 10 -e 0 ~/P4/bam/SRR568477_sorted.bam\
~/P4/tdf/SRR568477.tdf ~/P4/tdf/mm10.chrom.sizes > ~/P4/tdf/tdf_SRR568477.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/P4/bam/SRR1768320_sorted.bam\
~/P4/tdf/SRR1768320.tdf ~/P4/tdf/mm10.chrom.sizes > ~/P4/tdf/tdf_SRR1768320.out)& # generated tdf

# peak calling
(nohup macs2 callpeak -t ~/P4/bam/SRR568477_sorted.bam -c ~/P4/bam/SRR1768320_sorted.bam --outdir ~/P4/macs -n H3K4me3 -g mm -f BAM -q 0.01 > ~/P4/macs/callpeak.out)& # peak calling
# load the .BED file in IGV
# compare the peaks called with tdf file and gene annotation information in IGV

# motif finding
(nohup bedtools getfasta -fi ~/P4/index/mm10.fa -bed ~/P4/macs/H3K4me3_peaks.narrowPeak -fo ~/P4/macs/H3K4me3_region.fa > ~/P4/macs/findmotif.out)& # extract sequence
# extract the sequence for motif finding: https://meme-suite.org/meme/tools/streme
# if the file H3K4me3_region.fa is too large, you may use the first 10000 lines
# head -n 10000 ~/P4/macs/H3K4me3_region.fa > ~/P4/macs/H3K4me3_region_top_10000.fa


### A little advanced bash programming.
# This part is not related to the ChIPseq data analysis practical session
# Use "Ctrl + c" to terminate a running job
# You can run tasks in the background. If the tasks are running the background, you can log out and leave with the tasks still running on the server.

Sequence data analysis

- Determination of binding sites from the sequence data is a challenge. Conceptually, genomic regions with an increased number of sequencing reads (tags) compared to control is considered to be a TFBS

- Statistical filtering criteria is used to determine if these putative sites represent true binding sites.

- After statistical analysis of binding sites a further analysis of data is required. These may include analysis of location of binding, relative to transcription factor binding sites or potential nearby target genes.

image-20210331160224048

cuffdiff doc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
cuffdiff v2.2.1 ()
-----------------------------
Usage: cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
General Options:
-o/--output-dir write all output files to this directory [ default: ./ ]
-L/--labels comma-separated list of condition labels
--FDR False discovery rate used in testing [ default: 0.05 ]
-M/--mask-file ignore all alignment within transcripts in this file [ default: NULL ]
-C/--contrast-file Perform the constrasts specified in this file [ default: NULL ]
-b/--frag-bias-correct use bias correction - reference fasta required [ default: NULL ]
-u/--multi-read-correct use 'rescue method' for multi-reads [ default: FALSE ]
-p/--num-threads number of threads used during quantification [ default: 1 ]
--no-diff Don't generate differential analysis files [ default: FALSE ]
--no-js-tests Don't perform isoform switching tests [ default: FALSE ]
-T/--time-series treat samples as a time-series [ default: FALSE ]
--library-type Library prep used for input reads [ default: below ]
--dispersion-method Method used to estimate dispersion models [ default: below ]
--library-norm-method Method used to normalize library sizes [ default: below ]

Advanced Options:
-m/--frag-len-mean average fragment length (unpaired reads only) [ default: 200 ]
-s/--frag-len-std-dev fragment length std deviation (unpaired reads only) [ default: 80 ]
-c/--min-alignment-count minimum number of alignments in a locus for testing [ default: 10 ]
--max-mle-iterations maximum iterations allowed for MLE calculation [ default: 5000 ]
--compatible-hits-norm count hits compatible with reference RNAs only [ default: TRUE ]
--total-hits-norm count all hits for normalization [ default: FALSE ]
-v/--verbose log-friendly verbose processing (no progress bar) [ default: FALSE ]
-q/--quiet log-friendly quiet processing (no progress bar) [ default: FALSE ]
--seed value of random number generator seed [ default: 0 ]
--no-update-check do not contact server to check for update availability[ default: FALSE ]
--emit-count-tables print count tables used to fit overdispersion [ DEPRECATED ]
--max-bundle-frags maximum fragments allowed in a bundle before skipping [ default: 500000 ]
--num-frag-count-draws Number of fragment generation samples [ default: 100 ]
--num-frag-assign-draws Number of fragment assignment samples per generation [ default: 50 ]
--max-frag-multihits Maximum number of alignments allowed per fragment [ default: unlim ]
--min-outlier-p Min replicate p value to admit for testing [ DEPRECATED ]
--min-reps-for-js-test Replicates needed for relative isoform shift testing [ default: 3 ]
--no-effective-length-correction No effective length correction [ default: FALSE ]
--no-length-correction No length correction [ default: FALSE ]
-N/--upper-quartile-norm Deprecated, use --library-norm-method [ DEPRECATED ]
--geometric-norm Deprecated, use --library-norm-method [ DEPRECATED ]
--raw-mapped-norm Deprecated, use --library-norm-method [ DEPRECATED ]
--poisson-dispersion Deprecated, use --dispersion-method [ DEPRECATED ]

Debugging use only:
--read-skip-fraction Skip a random subset of reads this size [ default: 0.0 ]
--no-read-pairs Break all read pairs [ default: FALSE ]
--trim-read-length Trim reads to be this long (keep 5' end) [ default: none ]
--no-scv-correction Disable SCV correction [ default: FALSE ]

Supported library types:
ff-firststrand
ff-secondstrand
ff-unstranded
fr-firststrand
fr-secondstrand
fr-unstranded (default)
transfrags

Supported dispersion methods:
blind
per-condition
poisson
pooled (default)

Supported library normalization methods:
classic-fpkm
geometric (default)
geometric
quartile

cuffdiff 生成的文件

==FPKM: isoform水平上的丰度, Fragments Per Kilobase of exon model per Million mapped fragments;==

产生了一大堆文件,大致分为以下几类

本次分析有3个样本每个样本两个重复,共6个文件

FPKM trcking files 计算的每个样本的FPKM值

其中又分为如下四种,下面的类似

isoforms.fpkm_tracking Transcript FPKMs
genes.fpkm_tracking Gene FPKMs. 因为一个基因可以有多个转录本,所以gene FPKM值是具有相同基因id的转录本的Fpkm值的总和
cds.fpkm_tracking Coding sequence FPKMs. 具有相同蛋白id转录本fpkm值的总和
tss_groups.fpkm_tracking Primary transcript FPKMs. 具有相同转录起始位点的转录本FPKM值的总和

Differential splicing tests - splicing.diff

对每个转录前体产生的可变剪接变体的数目

Differential coding output - cds.diff

在这个文件中只记录了能产生多个CDS的基因的差异表达数

Differential promoter use - promoters.diff

在这个文件中只记录了能产生不同转录前体(i.e. muti-promoter genes)的基因的差异表达

Read group info - read_groups.info

记录了样本重复之间计算定量的一些信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
cufflinks: unrecognized option '--help'
cufflinks v2.2.1
linked against Boost version 106501
-----------------------------
Usage: cufflinks [options] <hits.sam>
General Options:
-o/--output-dir write all output files to this directory [ default: ./ ]
-p/--num-threads number of threads used during analysis [ default: 1 ]
--seed value of random number generator seed [ default: 0 ]
-G/--GTF quantitate against reference transcript annotations
-g/--GTF-guide use reference transcript annotation to guide assembly 提供GFF文件,以此来指导转录子组装(RABT assembly)。此时,输出结果会包含reference transcripts和novel genes and isforms。
-M/--mask-file ignore all alignment within transcripts in this file
-b/--frag-bias-correct use bias correction - reference fasta required [ default: NULL ]
-u/--multi-read-correct use 'rescue method' for multi-reads (more accurate) [ default: FALSE ]
--library-type library prep used for input reads [ default: below ]
--library-norm-method Method used to normalize library sizes [ default: below ]

Advanced Abundance Estimation Options:
-m/--frag-len-mean average fragment length (unpaired reads only) [ default: 200 ]
-s/--frag-len-std-dev fragment length std deviation (unpaired reads only) [ default: 80 ]
--max-mle-iterations maximum iterations allowed for MLE calculation [ default: 5000 ]
--compatible-hits-norm count hits compatible with reference RNAs only [ default: FALSE ]
--total-hits-norm count all hits for normalization [ default: TRUE ]
--num-frag-count-draws Number of fragment generation samples [ default: 100 ]
--num-frag-assign-draws Number of fragment assignment samples per generation [ default: 50 ]
--max-frag-multihits Maximum number of alignments allowed per fragment [ default: unlim ]
--no-effective-length-correction No effective length correction [ default: FALSE ]
--no-length-correction No length correction [ default: FALSE ]
-N/--upper-quartile-norm Deprecated, use --library-norm-method [ DEPRECATED ]
--raw-mapped-norm Deprecated, use --library-norm-method [ DEPRECATED ]

Advanced Assembly Options:
-L/--label assembled transcripts have this ID prefix [ default: CUFF ]
-F/--min-isoform-fraction suppress transcripts below this abundance level [ default: 0.10 ]
-j/--pre-mrna-fraction suppress intra-intronic transcripts below this level [ default: 0.15 ]
-I/--max-intron-length ignore alignments with gaps longer than this [ default: 300000 ]
-a/--junc-alpha alpha for junction binomial test filter [ default: 0.001 ]
-A/--small-anchor-fraction percent read overhang taken as 'suspiciously small' [ default: 0.09 ]
--min-frags-per-transfrag minimum number of fragments needed for new transfrags [ default: 10 ]
--overhang-tolerance number of terminal exon bp to tolerate in introns [ default: 8 ]
--max-bundle-length maximum genomic length allowed for a given bundle [ default:3500000 ]
--max-bundle-frags maximum fragments allowed in a bundle before skipping [ default: 500000 ]
--min-intron-length minimum intron size allowed in genome [ default: 50 ]
--trim-3-avgcov-thresh minimum avg coverage required to attempt 3' trimming [ default: 10 ]
--trim-3-dropoff-frac fraction of avg coverage below which to trim 3' end [ default: 0.1 ]
--max-multiread-fraction maximum fraction of allowed multireads per transcript [ default: 0.75 ]
--overlap-radius maximum gap size to fill between transfrags (in bp) [ default: 50 ]

Advanced Reference Annotation Guided Assembly Options:
--no-faux-reads disable tiling by faux reads [ default: FALSE ]
--3-overhang-tolerance overhang allowed on 3' end when merging with reference[ default: 600 ]
--intron-overhang-tolerance overhang allowed inside reference intron when merging [ default: 30 ]

Advanced Program Behavior Options:
-v/--verbose log-friendly verbose processing (no progress bar) [ default: FALSE ]
-q/--quiet log-friendly quiet processing (no progress bar) [ default: FALSE ]
--no-update-check do not contact server to check for update availability[ default: FALSE ]

Supported library types:
ff-firststrand
ff-secondstrand
ff-unstranded
fr-firststrand
fr-secondstrand
fr-unstranded (default)
transfrags

Supported library normalization methods:
classic-fpkm

Cufflinks的使用

1. Cufflinks简介

Cufflinks程序主要根据Tophat的比对结果,依托或不依托于参考基因组的GTF注释文件,计算出(各个gene的)isoform的FPKM值,并给出trascripts.gtf注释结果(组装出转录组)。

注意:

  1. fragment的长度的估测,若为pair-end测序,则cufflinks自己会有一套算法,算出结果。若为single-end测序,则cufflinks默认的是高斯分布,或者你自己提供相关的参数设置。

  2. cufflinks计算multi-mapped reads,一般a read map到10个位置,则每个位置记为10%。a read mapping to 10 positions will count as 10% of a read at each position.

  3. 一般不推荐用cufflinks拼接细菌的转录组,推荐 Glimmer。但是,若有注释文件,可以用cufflinks和cuffdiff来检测基因的表达和差异性。

  4. cufflinks/cuffdiff不能计算出exon或splicing event的FPKM

  5. cuffdiff处理时间序列data:采用参数-t

  6. 当你使用cufflinks时,在最后出现了99%,然后一直不动。因为cuffdiff需要更多的CPU来处理一些匹配很多reads的loci。而这些位点一般要等其他位点全部解决了后,才由cuffdiff来处理。可以用参数-M来提供相关的文件,过滤掉rRNA或者线粒体RNA。

  7. 当使用cufflinks或cuffdiff出现了“crash with a ‘bad_alloc’ error”,cuffdiff和cufflinks运行了很长时间才结束————这表明计算机拼接一个高表达的基因或定量分析一个高表达的基因,运行的内存使用玩尽了!解决方法:修改选项“-max-bundle-frags”,可以先尝试500000,若错误依旧在,可以继续下调!

  8. cuffdiff报道的结果里面所有的基因和转录本的FPKM=0,这表明GTF中的染色体名字和BAM里的名字不匹配。

  9. cuffdiff和cufflinks的缺点:存在一定的假基因和转录本(原因:测序深度,测序质量,测序样本的测序次数,以及注释的错误)

  10. large fold change表达量不代表数据的明显性(这些基因的isform多或这些基因测序测到的少,整体较低的表达)。cuffdiff中明显表达倍数改变的基因,存在不确定性。

  11. 通过cufflinks产生的结果中transcript.gtf文件中cuff标识的转录本就是新的转录本。相应的,其他模块输出中CUFF标识代表着新的转录本。

  12. 若出现了如下错误:

You are using Cufflinks v2.2.1, which is the most recent release.
open: No such file or directory
File 30 doesn’t appear to be a valid BAM file, trying SAM…
Error: cannot open alignment file 30 for reading
这表明,你的参数有问题。例如“–min-intron-length”,你设置为了:“-min-intron-length”

Cufflinks输出结果

  • cufflinks的输入文件是sam或bam格式。并且sam或bam格式的文件必须排好序。(The SAM file supplied to Cufflinks must be sorted by reference position.)
  • Tophat的输出结果sam或bam已经排好了序。针对其他的未排序的sam或bam文件采用如下排序方式:
1
sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
  1. transcripts.gtf

该文件包含Cufflinks的组装结果isoforms。前7列为标准的GTF格式,最后一列为attributes。其每一列的意义:
列数 列的名称 例子 描述

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
1     序列名    chrX        染色体或contig名;
2 来源 Cufflinks 产生该文件的程序名;
3 类型 exon 记录的类型,一般是transcript或exon;
4 起始 1 1-base的值;
5 结束 1000 结束位置;
6 得分 1000 ;
7 链 + Cufflinks猜测isoform来自参考序列的那一条链,一般是'+','-'或'.';
8 frame . Cufflinks不去预测起始或终止密码子框的位置;
9 attributes ... 详见下
每一个GTF记录包含如下attributes:

gene_id CUFF.1 Cufflinks的gene id;
transcript_id CUFF.1.1 Cufflinks的转录子 id;
FPKM 101.267 isoform水平上的丰度, Fragments Per Kilobase of exon model per Million mapped fragments;
frac 0.7647 保留着的一项,忽略即可,以后可能会取消这个;
conf_lo 0.07 isoform丰度的95%置信区间的下边界,即 下边界值 = FPKM * ( 1.0 - conf_lo );
conf_hi 0.1102 isoform丰度的95%置信区间的上边界,即 上边界值 = FPKM * ( 1.0 + conf_hi );
cov 100.765 计算整个transcript上read的覆盖度;
full_read_support yes 当使用 RABT assembly 时,该选项报告所有的introns和exons是否完全被reads所覆盖
  1. isoforms.fpkm_tracking

isoforms(可以理解为gene的各个外显子)的fpkm计算结果

  1. genes.fpkm_tracking

gene的fpkm计算结果

在R里面接着处理:

https://www.jianshu.com/p/47b5ea646932

igv tools

什么是IGV

它是一款本地的探索基因组数据的可视化浏览器,有多个系统版本,支持多种不同类型的输入格式,包括芯片测序、二代测序、基因组注释文件等。推荐使用BAM与SAM格式,主要格式见下表

GTF 文件可以直接拖入igv中进行可视化

数据来源 文件格式
序列比对 SAM/BAM
显示覆盖率 TDF
拷贝数 SNP、CN
基因表达 GCT、RES
基因注释 GFF3/GTF、BED
突变数据 MUT
追踪参考基因组覆盖度、测序深度(UCSC) WIG、BW

这里推荐生信星球,所有和生物信息相关,测序相关的都可以在这里找到

https://www.jieandze1314.com/

https://www.jieandze1314.com/post/cnposts/3/

  • Post title:RNAseq demo & notes
  • Post author:Yuxuan Wu
  • Create time:2021-03-25 23:07:32
  • Post link:yuxuanwu17.github.io2021/03/25/2021-03-26-RNAseq-demo-&-notes/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.