UNIX and work with genomic data

Class at Faculty of Science | MB170C47

Syllabus

I. Introduction to Unix - Learn about the Unix philosophy.

II. Basic Unix - Learn to use the basic commands and concepts (cd, ls, ll, mkdir, mv, cp, pwd, htop, screen, grep, globbing, less, head, tail, cat, cut, sort, uniq, paste, join, pipes).

III. Advanced Unix - Learn the basics of awk, sed, regular expressions, shell scripting, shell variables, parallel, and subshells.

IV. Introduction to Genomics - Learn how ‘genomes’ are made.

V. Data visualization - Learn how to format your data for effective visualization and how to use RStudio, tidyr, dplyr and ggplot2 to explore your data visually.

VI. Read quality assessment - Learn how to use Unix to explore FASTQ files, calculate some basic statistics, assess read quality, and filter out low-quality reads.

VII. Genome assembly - Learn how to do a (small) genome assembly.

VIII. Variant calling - Learn how to use the original NGS reads and a genome assembly to call variants.

IX. Standard annotation formats - Learn how information on genes, variants and genome properties is stored (GFF, VCF, BED formats) and how to obtain quick summaries with impressive speed (bedtools, vcftools, etc.).

X. A lot of practice.

Annotation

As the field of biology evolves, biologists increasingly require advanced computational skills and expanded computational resources. An essential tool in this domain is the Unix command line, which also facilitates remote access to more powerful computing platforms. Furthermore, tools like git are indispensable for the reproducibility of research, ensuring consistency and reliability in findings.

We present an updated course with a focus on remote computing and code reproducibility. Participants will gain sufficient skills and confidence in Unix-like environments to use them for processing and analysing their own genomics data. Besides plenty of hands-on exercises, we will also provide an overview of the computational environments used in academic and commercial bioinformatics settings.