RNAseq demo in BIO316 CW1
Yuxuan Wu Lv13

BIO316-CW1

Requirements

  • Briefly explain the results you obtained. You can use figures to illustrate the results, however, figures must be a readable size and of publication quality resolution. It is essential that you explain what you find; i.e. you need to explain the results using your own words. Figures and the outputs of the analysis software alone will only earn partial marks: the most important aspect is to demonstrate your understanding of the result.

  • Provide your code with proper annotations.

  • The answers for Tasks 1-3, including figures and scripts, must not exceed 2 pages for each task and for task 4 must not exceed 300 words. Please use font size 10 and double spacing.

如何使用并且观察igvtools

  • https://zhuanlan.zhihu.com/p/36703085

  • # use IGV to visualize the TDF files
    # use IGV to visualize the BAM files (with the .bai index file)
    
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21



    1. Download any RNAseq datasets for an experiment. You can use your own data or find one of interest from NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/ ). You must identify a dataset comparing at least two conditions (you may use the example shown in class), and align these to the appropriate reference genome. Perform basic data quality assessment using FastQC or other software, trim data if necessary. Please don’t be over-ambitious and discuss with the lecturer about the dataset you selected if necessary.



    The answers for Tasks 1-3, including figures and scripts, must not exceed 2 pages (Please use font size 10 and double spacing)

    ---

    The RNAseq datasets used in this experiments were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128699. The datasets were obtained from single-nucleotide resolution m6A mapping (miCLIP) on wild-type and METTL5 knockout HCT116 cells to identify the METTL5-dependent methylation events.

    ```bash
    # Wild Type
    (nohup fastq-dump --split-3 SRR8767363 -O $dir_fastq > $dir_fastq/226.out)&
    (nohup fastq-dump --split-3 SRR8767364 -O $dir_fastq > $dir_fastq/227.out)&

    # METTL5 knockout
    (nohup fastq-dump --split-3 SRR8767365 -O $dir_fastq > $dir_fastq/228.out)&
    (nohup fastq-dump --split-3 SRR8767366 -O $dir_fastq > $dir_fastq/229.out)&
    Once the datasets were successfully downloaded to the server, quality control method would be used by "fastqc". In this case the sequence data were paired-end, which required me to check the data quality of both ends.
1
2
3
4
5
6
# S4. quality control ####################
# for paired end of data, you have to check the data quality of both reads
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767363_1.fastq $dir_fastq/SRR8767363_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767364_1.fastq $dir_fastq/SRR8767364_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767365_1.fastq $dir_fastq/SRR8767365_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767366_1.fastq $dir_fastq/SRR8767366_2.fastq

All the data achieved high performance, but considering the page limits, I would randomly select one sample to illustrate the results.

image-20210330194453016

Task2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
# S5 obtain the reference genome

wget -O ~/cw1/P1/index/hg38.fa http://10.7.88.74/MyWeb/Genomes/hg38/genome.fa # download the genome
wget -O ~/cw1/P1/index/hg38.gtf http://10.7.88.74/MyWeb/Genomes/hg38/genes.gtf # download the annotation

# S6 build index, needed for alignment
(nohup hisat2-build ~/cw1/P1/index/hg38.fa ~/cw1/P1/index/hg38 > ~/cw1/P2/index/build_index.out)& # build index, take a very long time
#the index is necessary for the alignment and increase the query rate

# S7 alignment

(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767363_1.fastq -2 ~/cw1/P1/fastq/SRR8767363_2.fastq -S ~/cw1/P1/bam/SRR8767363.sam> ~/cw1/P2/bam/align_SRR8767363.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767364_1.fastq -2 ~/cw1/P1/fastq/SRR8767364_2.fastq -S ~/cw1/P1/bam/SRR8767364.sam> ~/cw1/P2/bam/align_SRR8767364.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767365_1.fastq -2 ~/cw1/P1/fastq/SRR8767365_2.fastq -S ~/cw1/P1/bam/SRR8767365.sam> ~/cw1/P2/bam/align_SRR8767365.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767366_1.fastq -2 ~/cw1/P1/fastq/SRR8767366_2.fastq -S ~/cw1/P1/bam/SRR8767366.sam> ~/cw1/P2/bam/align_SRR8767366.out)&

# S8 convert SAM to BAM

(nohup samtools view -bS ~/cw1/P1/bam/SRR8767363.sam -o ~/cw1/P1/bam/SRR8767363.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767364.sam -o ~/cw1/P1/bam/SRR8767364.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767365.sam -o ~/cw1/P1/bam/SRR8767365.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767366.sam -o ~/cw1/P1/bam/SRR8767366.bam)& # convert sam to bam file

# S9 sort alignment IGV Tools
(nohup samtools sort ~/cw1/P1/bam/SRR8767363.bam -o ~/cw1/P1/bam/SRR8767363_sorted.bam > ~/cw1/P2/bam/sort_SRR8767363.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767364_sorted.bam > ~/cw1/P2/bam/sort_SRR8767364.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767365_sorted.bam > ~/cw1/P2/bam/sort_SRR8767365.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767366_sorted.bam > ~/cw1/P2/bam/sort_SRR8767366.out)& # sort alignment by coordinates


# S10 index all the bam files, which is required for visualization in IGV
(nohup samtools index ~/cw1/P1/bam/SRR8767363_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767364_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767365_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767366_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.


# S11 visualization of bam files
wget -O ~/cw1/P1/tdf/hg38.chrom.sizes http://10.7.88.74/MyWeb/Genomes/chrom.sizes/hg38.chrom.sizes # use a different genome size file if necessary

(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767363_sorted.bam\
~/cw1/P1/tdf/SRR8767363.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767363.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767364_sorted.bam\
~/cw1/P1/tdf/SRR8767364.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767364.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767365_sorted.bam\
~/cw1/P1/tdf/SRR8767365.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767365.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767366_sorted.bam\
~/cw1/P1/tdf/SRR8767366.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767366.out)& # generated tdf

# use IGV to visualize the TDF files
# use IGV to visualize the BAM files (with the .bai index file)
# compare the tdf file generated with the gtf files

# S12. find transcribed regions of a single sample
(nohup cufflinks ~/cw1/P1/bam/SRR8767363_sorted.bam -o ~/cw1/P1/bam/SRR8767363 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767363_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767364_sorted.bam -o ~/cw1/P1/bam/SRR8767364 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767364_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767365_sorted.bam -o ~/cw1/P1/bam/SRR8767365 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767365_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767366_sorted.bam -o ~/cw1/P1/bam/SRR8767366 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767366_g.out)& # reference-based transcriptome reconstruction

image-20210402233857001

image-20210402234114271

image-20210402234233220

image-20210403093831911

Task3

1
2
3
4
5
6
7
8
9
wt=/home/yuxuan/cw1/P1/bam/SRR8767363_sorted.bam,/home/yuxuan/cw1/P1/bam/SRR8767364_sorted.bam
METTL5_KO=/home/yuxuan/cw1/P1/bam/SRR8767365_sorted.bam,/home/yuxuan/cw1/P1/bam/SRR8767366_sorted.bam
labels=WT,METTL5_KO
(nohup cuffdiff -o /home/yuxuan/cw1/P1/DGE -p 2 -L $labels\
-b /home/yuxuan/cw1/P1/index/hg38.fa \
-u /home/yuxuan/cw1/P1/index/hg38.gtf \
$wt\
$METTL5_KO\
> /home/yuxuan/cw1/P1/DGE/cuffdiff.out)&

image-20210403000843602

1
2
selected <- cuffdiff_result[cuffdiff_result$p_value<=0.05,]
nrow(selected)

第三题的终极疑惑,cuffdiff的结果怎么在igv里面显示

The full code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
# RNAseq data analysis
# BIO316
# 2021-03-29
# Bash script

# S1. make your working directory P1 ##################################
dir="cw1"
cd $dir

P1="P1"
if [ ! -d $P1 ]; then
mkdir $P1
else
echo "the file of $P1 exists!"
fi

# S2. download the raw data in sra format ####################################
# Data from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128699

# specify the directory inside P1 ~cw1/P1/fastq

dir_p1=$(pwd)/P1
fastq="fastq"
if [ ! -d "$dir_p1/$fastq" ]; then
mkdir "$dir_p1/$fastq"
else
echo "the file of $fastq exists!"
fi

dir_fastq="$dir_p1/$fastq"

# S3 download all the data in parallel

# Wild Type
(nohup fastq-dump --split-3 SRR8767363 -O $dir_fastq > $dir_fastq/226.out)&
(nohup fastq-dump --split-3 SRR8767364 -O $dir_fastq > $dir_fastq/227.out)&

# METTL5 knockout
(nohup fastq-dump --split-3 SRR8767365 -O $dir_fastq > $dir_fastq/228.out)&
(nohup fastq-dump --split-3 SRR8767366 -O $dir_fastq > $dir_fastq/229.out)&

# ZCCHC4 knockout (no use in this case)
(nohup fastq-dump --split-3 SRR8767367 -O $dir_fastq > $dir_fastq/230.out)&
(nohup fastq-dump --split-3 SRR8767368 -O $dir_fastq > $dir_fastq/231.out)&


# S4. quality control ####################################
fastqc_report="fastqc_report"
dir_fastqc_report="$dir_p1/$fastqc_report"
if [ ! -d $dir_fastqc_report ]; then
mkdir $dir_fastqc_report
else
echo "the file of $fastqc_report exists!"
fi

fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767363_1.fastq $dir_fastq/SRR8767363_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767364_1.fastq $dir_fastq/SRR8767364_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767365_1.fastq $dir_fastq/SRR8767365_2.fastq
fastqc -o $dir_fastqc_report -f fastq $dir_fastq/SRR8767366_1.fastq $dir_fastq/SRR8767366_2.fastq
# for paired end of data, you have to check the data quality of both reads

# S5 obtain the reference genome

wget -O ~/cw1/P1/index/hg38.fa http://10.7.88.74/MyWeb/Genomes/hg38/genome.fa # download the genome
wget -O ~/cw1/P1/index/hg38.gtf http://10.7.88.74/MyWeb/Genomes/hg38/genes.gtf # download the annotation

# S6 build index, needed for alignment
(nohup hisat2-build ~/cw1/P1/index/hg38.fa ~/cw1/P1/index/hg38 > ~/cw1/P2/index/build_index.out)& # build index, take a very long time
#the index is necessary for the alignment and increase the query rate

# S7 alignment

(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767363_1.fastq -2 ~/cw1/P1/fastq/SRR8767363_2.fastq -S ~/cw1/P1/bam/SRR8767363.sam> ~/cw1/P2/bam/align_SRR8767363.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767364_1.fastq -2 ~/cw1/P1/fastq/SRR8767364_2.fastq -S ~/cw1/P1/bam/SRR8767364.sam> ~/cw1/P2/bam/align_SRR8767364.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767365_1.fastq -2 ~/cw1/P1/fastq/SRR8767365_2.fastq -S ~/cw1/P1/bam/SRR8767365.sam> ~/cw1/P2/bam/align_SRR8767365.out)&
(nohup hisat2 -x ~/cw1/P1/index/hg38 -1 ~/cw1/P1/fastq/SRR8767366_1.fastq -2 ~/cw1/P1/fastq/SRR8767366_2.fastq -S ~/cw1/P1/bam/SRR8767366.sam> ~/cw1/P2/bam/align_SRR8767366.out)&

# S8 convert SAM to BAM

(nohup samtools view -bS ~/cw1/P1/bam/SRR8767363.sam -o ~/cw1/P1/bam/SRR8767363.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767364.sam -o ~/cw1/P1/bam/SRR8767364.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767365.sam -o ~/cw1/P1/bam/SRR8767365.bam)& # convert sam to bam file
(nohup samtools view -bS ~/cw1/P1/bam/SRR8767366.sam -o ~/cw1/P1/bam/SRR8767366.bam)& # convert sam to bam file

# S9 sort alignment IGV Tools需要使用按坐标sort好的BAM文件并配合相应的BAM index
(nohup samtools sort ~/cw1/P1/bam/SRR8767363.bam -o ~/cw1/P1/bam/SRR8767363_sorted.bam > ~/cw1/P2/bam/sort_SRR8767363.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767364_sorted.bam > ~/cw1/P2/bam/sort_SRR8767364.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767365_sorted.bam > ~/cw1/P2/bam/sort_SRR8767365.out)& # sort alignment by coordinates
(nohup samtools sort ~/cw1/P1/bam/SRR8767364.bam -o ~/cw1/P1/bam/SRR8767366_sorted.bam > ~/cw1/P2/bam/sort_SRR8767366.out)& # sort alignment by coordinates


# S10 index all the bam files, which is required for visualization in IGV
(nohup samtools index ~/cw1/P1/bam/SRR8767363_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767364_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767365_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.
(nohup samtools index ~/cw1/P1/bam/SRR8767366_sorted.bam)& # a file with the same name and ended with ".bai" is generated. This is the index of a BAM file.


# S11 visualization of bam files
wget -O ~/cw1/P1/tdf/hg38.chrom.sizes http://10.7.88.74/MyWeb/Genomes/chrom.sizes/hg38.chrom.sizes # use a different genome size file if necessary

(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767363_sorted.bam\
~/cw1/P1/tdf/SRR8767363.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767363.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767364_sorted.bam\
~/cw1/P1/tdf/SRR8767364.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767364.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767365_sorted.bam\
~/cw1/P1/tdf/SRR8767365.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767365.out)& # generated tdf
(nohup igvtools count -z 5 -w 10 -e 0 ~/cw1/P1/bam/SRR8767366_sorted.bam\
~/cw1/P1/tdf/SRR8767366.tdf ~/cw1/P1/tdf/hg38.chrom.sizes > ~/P2/tdf/tdf_SRR8767366.out)& # generated tdf

# use IGV to visualize the TDF files
# use IGV to visualize the BAM files (with the .bai index file)
# compare the tdf file generated with the gtf files

# S12. find transcribed regions of a single sample
(nohup cufflinks ~/cw1/P1/bam/SRR8767363_sorted.bam -o ~/cw1/P1/bam/SRR8767363 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767363_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767364_sorted.bam -o ~/cw1/P1/bam/SRR8767364 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767364_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767365_sorted.bam -o ~/cw1/P1/bam/SRR8767365 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767365_g.out)& # reference-based transcriptome reconstruction
(nohup cufflinks ~/cw1/P1/bam/SRR8767366_sorted.bam -o ~/cw1/P1/bam/SRR8767366 -g /home/yuxuan/cw1/P1/index/hg38.gtf> ~/cw1/P1/gtf/cufflinks_SRR8767366_g.out)& # reference-based transcriptome reconstruction

cufflinks --help # check how to do reference-based transcriptome assembly
# view the gtf file in IGV
# compare it with the standard gtf file provided in IGV


#S13 differential analysis with cuffdiff
wt=/home/yuxuan/cw1/P1/bam/SRR8767363_sorted.bam,/home/yuxuan/cw1/P1/bam/SRR8767364_sorted.bam
METTL5_KO=/home/yuxuan/cw1/P1/bam/SRR8767365_sorted.bam,/home/yuxuan/cw1/P1/bam/SRR8767366_sorted.bam
labels=WT,METTL5_KO
(nohup cuffdiff -o /home/yuxuan/cw1/P1/DGE -p 2 -L $labels\
-b /home/yuxuan/cw1/P1/index/hg38.fa \
-u /home/yuxuan/cw1/P1/index/hg38.gtf \
$wt\
$METTL5_KO\
> /home/yuxuan/cw1/P1/DGE/cuffdiff.out)&

  • Post title:RNAseq demo in BIO316 CW1
  • Post author:Yuxuan Wu
  • Create time:2022-03-11 11:37:32
  • Post link:yuxuanwu17.github.io2022/03/11/BIO316-CW1/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.