“Analysis of omics data” course
Skoltech Term 4
20 April 2020
This presentation can be found at https://slides.com/agalicina/epigenetics-practice1-2020
"Computational analysis of
Sanger
Illumina, Roche 454, SOLiD
Nanopore
Wasserman & Sandelin, 2004
Park 2009
Park 2009
Park 2009
The cross-correlation typically peaks at the shift corresponding to the fragment length and the shift corresponding to the read length.
NSC (normalized strand cross-correlation coefficient):
The ratio between the cross-correlation at the fragment length and the background crosscorrelation.
RSC (relative strand cross-correlation coefficient):
The ratio between cross-correlation at the fragment length and the crosscorrelation at the read length.
Very successful ChIP experiments generally have NSC>1.05 and RSC>0.8
Bailey 2013
Park 2009
Park 2009
ENCODE Standards:
https://www.encodeproject.org/chip-seq/transcription_factor/
MACS2 is the most common software:
Binding of protein is typically specific to DNA sequence:
Buenrostro et al., 2013
Meyer & Liu Nat. Rev. Gen. 2014
Meyer & Liu Nat. Rev. Gen. 2014
Meyer & Liu Nat. Rev. Gen. 2014
Bisulfite-Seq
Bisulfite-Seq
echo "Bioinformatics has never been easier! Copy&paste wisely, or rm -rf ~/ will strike you one day."
Review of some useful bash commands:
ls
pwd
less file.txt
some_command -h
some_command --help
man command
cp target_file destination_folder
cd directory_name/
cd ../
ls -hts
mkdir
gzip -d file.gz
unzip file.zip
1. Login to the server via terminal:
or use Putty on Windows.
(example instructions https://www.ssh.com/ssh/putty/windows/#sec-Configuration-options-and-saved-profiles)
2. Create the folder with your project:
3. Activate prepared environment:
(if you have problems with it, please, contact the TA)
4. Check your environment:
ssh username@servername
mkdir EpiPract1
cd EpiPract1
export PATH="/home/galitsyna/anaconda3/bin:$PATH"
printf "envs_dirs:\n - /home/galitsyna/anaconda3/envs/" > .condarc
conda activate epigenetics
ls # Should list all the filesin the directory
pwd # Should result in your currentl location
conda list # Should list all the programs installed, if conda imported successfully.
less /home/galitsyna/EpiPract1/students_spreadsheet.txt
cp /home/galitsyna/EpiPract1/FILES/<your_file.fa.gz> ~/EpiPract1/
mkdir ~/EpiPract1/GENOME/
cp /home/galitsyna/EpiPract1/GENOME/chr[5,6].fa ~/EpiPract1/GENOME/
cp /home/galitsyna/EpiPract1/GENOME/chrom_sizes.tsv ~/EpiPract1/GENOME/
ls -hts
ls -hts GENOME/
You should have:
~/EpiPract1/ directory:
- FASTQ file (<your_file.fa.gz>) from the list on the right
~/EpiPract1/GENOME/ directory:
- FASTA files with hg38 reference genome
for chr5, chr6 (chr5.fa, chr6.fa)
- TSV (Tab Separated Values) file with
chromosome sizes (chrom_sizes.tsv)
Name | File |
---|---|
Anastasia Pivnyuk | ENCFF001CWQ_1.fa.gz |
Nikita Sharaev | ENCFF001CWQ_2.fa.gz |
Artemy Shumskiy | ENCFF002ERE_1.fa.gz |
Dmitrii Kriukov | ENCFF002ERE_2.fa.gz |
Pletenev Ilya | ENCFF228WDF_1.fa.gz |
Konstantin Chernyshov | ENCFF228WDF_2.fa.gz |
Ivan Kuznetsov | ENCFF280UJR_1.fa.gz |
Anna Kalinina | ENCFF280UJR_2.fa.gz |
Sofya Kasatskaya | ENCFF288DXQ_1.fa.gz |
Julia Bocharkina | ENCFF288DXQ_2.fa.gz |
Vasily Borodin | ENCFF291MFN_1.fa.gz |
Sofia Kamalyan | ENCFF291MFN_2.fa.gz |
Anna Krasivskaya | ENCFF512FUR_1.fa.gz |
Aleksandra Ozerova | ENCFF512FUR_2.fa.gz |
Victoria Kobets | ENCFF586YZS_1.fa.gz |
Slesareva Anastasiia | ENCFF586YZS_2.fa.gz |
Mikhail Moldovan | ENCFF591XCX_1.fa.gz |
Viktor Mamontov | ENCFF591XCX_2.fa.gz |
Evgeniia Alekseeva | ENCFF601JYM_1.fa.gz |
Trofimova Anna | ENCFF601JYM_2.fa.gz |
gzip -d <file.fastq.gz> # This will result in <file.fastq> unzipped file
less <file.fastq>
wc -l <file.fastq>
fastqc <file.fastq>
Task 1 (0.5 points): report the number of entries in your FASTQ file, read length and passed/failed fastqc filters. Provide a brief explanation.
Task 2 (1 points): go to ENCODE (https://www.encodeproject.org/). Find ID of your file (e.g. ENCFF601JYM) there. Report the experiment type, sequencing platform, cell type (Biosample summary) for your NGS data.
Let's review the NGS data processing workflow:
fastqc <input.fastq>
ls
# input.fastq input_fastqc/
cd input_fastqc/
ls
# fastqc_data.txt fastqc_report.html Icons Images summary.txt
Don't run this code, it's FYI
(Manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml)
bowtie2-build -h
bowtie2-build chr1.fa,chr2.fa,chr3.fa,chr4.fa hg38
# Check the results:
ls -hs
# 938M hg38.1.bt2 701M hg38.2.bt2 12K hg38.3.bt2 701M hg38.4.bt2 3.0G hg38.fa 938M hg38.rev.1.bt2 701M hg38.rev.2.bt2
bowtie2-inspect -n hg38
Don't run this code, it's FYI
FASTQ
+ indexed database
-> SAM file
(Manual: http://www.htslib.org/doc/samtools.html)
samtools view file.sam
samtools view -F 4 input.sam
samtools stats file.sam
samtools view -b file.sam > file.bam
(Manual: https://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
BED format specification: https://genome.ucsc.edu/FAQ/FAQformat.html)
bedtools bamtobed -i file.bam > file.bed
map
intersect
Full set of compiled tools: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
bedGraphToBigWig -h
These steps are time-consuming, but we have to complete them:
Task 3 (1 points): report the number of unique, multiple mappings and unmapped reads in your FASTQ file. Report the major chromosome, on which your reads have mapped.
cd GENOME/
bowtie2-build chr5.fa,chr6.fa hg38_chr5-6_index
bowtie2-build -h
bowtie2 -h
bowtie2 --very-sensitive -x ./GENOME/hg38_chr5-6_index -U <file.fastq> -S <file.sam>
less <file.sam>
ls
cd ../
samtools view -h -F 4 <file.sam> | less
samtools --help
samtools view -h -q 10 <file.sam> > <file_filtered.sam>
samtools view -b -q 10 <file.sam> > <file_filtered.bam>
samtools sort <file_filtered.bam> > <file_filtered.sorted.bam>
bedtools makewindows -g GENOME/chrom_sizes.tsv -w 1000 > hg38_windows.1000.bed
grep "chr5" hg38_windows.1000.bed > chr5_windows.1000.bed
bedtools genomecov -ibam <file_filtered.sorted.bam> -bga -split > <file.genomecov.bedgraph>
bedtools genomecov -h
bedtools makewindows -h
bedtools coverage -h
bedtools coverage -counts -a chr5_windows.1000.bed -b <file_filtered.sorted.bam> > <file.binned.bedgraph>
ls
less <file.genomecov.bedgraph>
less <file.binned.bedgraph>
We will use UCSC browser for bedgraph files visualization.
Task 4 (1 point): Load both BEDGRAPH files with your dataset.
Select some gene from your chromosome and send the screenshot of 1 Kb, 10Kb and 1 Mb regions around it with visualization of your dataset.
Task 5 (1 points): Provide explanation of your observations. What do you expect from this type of experiment? What is the difference between genome coverage and window coverage?
mv <file.genomecov.bw> <YOUR_CELL_LINE-CHROMOSOME.bw>
bedGraphToBigWig <file.genomecov.bedgraph> ./GENOME/chrom_sizes.tsv <file.genomecov.bw>
Task 6 (0.5 point): Send the screenshot of your directory (as listed by ls command) with collected files.