A structured, job-market-aligned curriculum built from the ground up. Follow the phases in order — each one builds directly on the last. Every module ends with a real GitHub repository you push as proof of skill.
Even if you know some R or Python, Phase 1 fills critical gaps — Bash, Git, and Conda are used in every single later module.
Each module is 1–2 weeks at 1 hour per day. Complete all lessons, do all exercises, then move on. Depth beats speed.
Every module ends with a real repo push. This is not optional — your GitHub is your portfolio. No repo = no proof.
All exercises use freely available datasets from GEO, SRA, or 10x Genomics. You never need to pay for data.
Each phase ends with a capstone project that ties all modules together. Complete the capstone before starting the next phase.
The command line is the backbone of all bioinformatics. Every tool — STAR, GATK, Snakemake — runs here. Learn to navigate, write scripts, handle files, and manage processes before touching any bioinformatics software.
Every repo you build in this curriculum gets pushed to GitHub as proof of skill. Learn Git properly once — branching, commit messages, .gitignore — and use it every single day from here on.
Bioinformatics tools often require conflicting versions of the same dependencies. Conda environments keep each tool's dependencies isolated from the others. This is a daily-use skill: every pipeline you build will depend on it.
You know R basics. Now master the tools used in genomics: tidyverse for data wrangling, ggplot2 for publication figures, R Markdown for reproducible reports, and Bioconductor for biology-specific packages.
Extend your Python basics into bioinformatics-specific tools. pandas and NumPy handle large genomics tables. Biopython reads FASTA and FASTQ files. argparse lets you build command-line tools like the pros.
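A minimal sketch of the kind of command-line tool this module builds toward. It uses only the standard library so it runs anywhere; in practice Biopython would normally handle the parsing. The names `read_fasta`, `gc_content`, `main`, and the `--min-length` option are illustrative assumptions, not part of the curriculum.

```python
import argparse


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def main(argv=None):
    """Print header, length, and GC fraction for each record in a FASTA file."""
    parser = argparse.ArgumentParser(description="Report GC content per FASTA record.")
    parser.add_argument("fasta", help="path to a FASTA file")
    parser.add_argument("--min-length", type=int, default=0,
                        help="skip records shorter than this many bases")
    args = parser.parse_args(argv)
    with open(args.fasta) as handle:
        for header, seq in read_fasta(handle):
            if len(seq) >= args.min_length:
                print(f"{header}\t{len(seq)}\t{gc_content(seq):.3f}")
    return 0
```

Calling `main(["genome.fa", "--min-length", "100"])` would process a hypothetical file named `genome.fa`; passing `argv` explicitly keeps the function easy to test.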
Every sequencing pipeline starts here — before you align a single read you need to check quality and remove adapter contamination. A bad QC step ruins every downstream result.
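To make the two QC ideas above concrete, here is a small sketch that decodes FASTQ quality characters (assuming the common Phred+33 encoding) and trims an exact adapter match. The function names and the example adapter string are assumptions for illustration; dedicated trimming tools also handle mismatches and partial adapter overlaps, which this sketch does not.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Average Phred score of a read (0.0 for an empty read)."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


def error_probability(phred):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def trim_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    cut = sequence.find(adapter)
    if cut == -1:
        return sequence, quality
    return sequence[:cut], quality[:cut]
```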
Map sequencing reads to a reference genome. Understand SAM/BAM formats, alignment flags, and how to extract useful statistics. You cannot do differential expression or variant calling without this step.
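The alignment flags mentioned above are a bit field in the second column of every SAM record. The sketch below decodes them using the bit values from the SAM specification; the label strings and the helper names `decode_flag` and `mapped_fraction` are my own choices for illustration.

```python
# Bit values follow the SAM specification; the label strings are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag integer."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def mapped_fraction(flags):
    """Fraction of primary records (not secondary/supplementary) that are mapped."""
    primary = [f for f in flags if not f & (0x100 | 0x800)]
    if not primary:
        return 0.0
    mapped = sum(1 for f in primary if not f & 0x4)
    return mapped / len(primary)
```

For example, `decode_flag(99)` lists the bits for a properly paired first mate whose partner maps to the reverse strand. For real BAM files, `samtools flagstat` reports similar statistics.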
Differential expression is the most common analysis in RNA-seq. Go deep: understand the negative binomial model, produce publication-quality volcano plots and heatmaps, and build a fully reproducible R Markdown report.
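One way to build intuition for the negative binomial model is its mean-variance relationship: variance = mean + dispersion × mean². The sketch below uses that standard parameterisation; the function names are illustrative assumptions.

```python
def nb_variance(mean, dispersion):
    """Negative binomial variance: Poisson noise plus a dispersion term."""
    return mean + dispersion * mean ** 2


def squared_cv(mean, dispersion):
    """Squared coefficient of variation, equal to 1/mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2
```

With a dispersion of zero the model reduces to Poisson (variance equals mean). For highly expressed genes the `1/mean` term vanishes and the squared coefficient of variation approaches the dispersion, which is why dispersion is often read as biological variability.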
Call SNPs and indels from sequencing data — the foundation of GWAS, population genomics, and precision medicine. Follow GATK best practices, understand VCF format, and filter variants correctly.
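As a rough illustration of VCF structure and filtering, the sketch below parses the eight fixed columns of a VCF data line and applies a simple quality and depth filter. The thresholds (`min_qual=30`, `min_depth=10`) are arbitrary placeholders, not GATK recommendations, and the SNP/indel classification is deliberately simplistic (it ignores symbolic alleles and multi-nucleotide variants).

```python
def parse_vcf_line(line):
    """Parse the fixed columns of one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flag entries have no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def variant_type(record):
    """Label a record 'SNP' if REF and every ALT are single bases, else 'indel'."""
    if len(record["ref"]) == 1 and all(len(a) == 1 for a in record["alt"]):
        return "SNP"
    return "indel"


def passes_filters(record, min_qual=30.0, min_depth=10):
    """Keep records with QUAL and INFO/DP at or above the (placeholder) thresholds."""
    if record["qual"] is None or record["qual"] < min_qual:
        return False
    return int(record["info"].get("DP", 0)) >= min_depth
```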
Sequence similarity search and long-read alignment. BLAST finds homologous sequences across species. Minimap2 aligns long Oxford Nanopore and PacBio reads. Both are used weekly in plant genomics.
Know where the data lives and how to fetch it programmatically. Download genomes, annotations, and raw reads from NCBI, Ensembl, and TAIR without clicking through web interfaces — write scripts instead.
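A sketch of fetching a sequence programmatically through the NCBI E-utilities `efetch` endpoint. The helper names and the example accession are assumptions for illustration; check NCBI's usage guidelines (rate limits, API keys) before running this in a loop.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def download_text(url, timeout=30):
    """Fetch a URL and return the response body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `download_text(efetch_url("NC_000932.1"))` would request that accession in FASTA format; the accession here is only an example value.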
SQL is listed in 55% of bioinformatics job postings. Write SELECT queries, JOINs, and aggregations to query genomics databases. Use SQLite locally and connect SQL to pandas. This is a short module with high career impact.
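The sketch below shows a JOIN plus an aggregation against an in-memory SQLite database, using Python's built-in `sqlite3` module. The table names, gene identifiers, and expression values are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE genes (gene_id TEXT PRIMARY KEY, chromosome TEXT);
        CREATE TABLE expression (gene_id TEXT, sample TEXT, tpm REAL);
    """)
    conn.executemany("INSERT INTO genes VALUES (?, ?)", [
        ("geneA", "chr1"), ("geneB", "chr1"), ("geneC", "chr2"),
    ])
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", [
        ("geneA", "s1", 10.0), ("geneA", "s2", 20.0),
        ("geneB", "s1", 5.0), ("geneB", "s2", 15.0),
        ("geneC", "s1", 8.0), ("geneC", "s2", 4.0),
    ])
    conn.commit()
    return conn


def mean_tpm_per_chromosome(conn):
    """JOIN genes to expression, then count genes and average TPM per chromosome."""
    query = """
        SELECT g.chromosome, COUNT(DISTINCT g.gene_id), AVG(e.tpm)
        FROM genes AS g
        JOIN expression AS e ON e.gene_id = g.gene_id
        GROUP BY g.chromosome
        ORDER BY g.chromosome
    """
    return conn.execute(query).fetchall()
```

The same query string can be passed to `pandas.read_sql_query(query, conn)` to get the result as a DataFrame.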
Turn your individual scripts into one reproducible, automated pipeline. Snakemake is increasingly required in job postings — not just nice-to-have. Build the full RNA-seq pipeline end-to-end in a Snakefile.
Genomic selection is your MSc thesis skill and your strongest differentiator. Implement genomic prediction models with rrBLUP and BGLR, calculate genomic estimated breeding values (GEBVs), run cross-validation, and produce Manhattan plots. Push your actual thesis code here.
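Cross-validation for genomic prediction usually means holding out a fold of individuals, predicting their phenotypes, and correlating predictions with observations. The sketch below shows only that scaffolding in plain Python; the model itself is left out, and the function names are assumptions rather than anything from rrBLUP or BGLR.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k shuffled, non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```

In each fold you would fit the model on the `train` individuals, predict the `test` individuals, and report `pearson(predicted, observed)` as the predictive ability.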
60% of life science companies are increasing ML investment. Learn classification, clustering, dimensionality reduction, and cross-validation with scikit-learn — applied directly to genomics datasets.
Package your entire analysis so it runs identically on any machine — your laptop, a colleague's server, or a cloud cluster. Docker is now a standard requirement in pharma and biotech roles.
Most bioinformatics compute happens on HPC clusters. Write SLURM job scripts, run array jobs across dozens of samples in parallel, and integrate Snakemake with your cluster's scheduler.
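In a SLURM array job, each task receives its index in the `SLURM_ARRAY_TASK_ID` environment variable, which a script can use to pick its sample. A minimal sketch, assuming the job was submitted with a zero-based range such as `#SBATCH --array=0-2`; the function name and sample names are hypothetical.

```python
import os


def sample_for_task(samples, environ=None):
    """Return the sample assigned to this array task.

    Assumes a zero-based array range (e.g. --array=0-2 for three samples).
    """
    environ = os.environ if environ is None else environ
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    return samples[task_id]
```

Passing `environ` explicitly makes the function testable on a laptop without a scheduler.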
Many pharma and biotech companies use Nextflow over Snakemake. nf-core provides 100+ ready-made pipelines including nf-core/rnaseq. Knowing both workflow managers makes you versatile in any team.
Jupyter notebooks are the standard format for shareable, reproducible analysis in Python, and increasingly in R via IRkernel. Recruiters often look for rendered notebooks on your GitHub. Learn best practices from day one.
edgeR and limma-voom are complementary to DESeq2, and many published papers use one of them. Knowing all three methods and when to use each makes you more versatile and credible in peer review discussions.
Epigenomics roles are growing fast. Learn peak calling with MACS2, differential binding analysis with DiffBind, and visualisation in IGV. Connects directly to transcriptomics — chromatin state drives gene expression.
Strong visualisation skills are immediately visible on GitHub and in papers. Build multi-panel publication figures, annotated heatmaps, Manhattan plots, and interactive HTML reports with Quarto.
Tie everything from Phases 1–3 together. One polished end-to-end pipeline on public plant data — Snakemake, DESeq2, Docker, Jupyter notebook, full README with workflow diagram. Tag a v1.0 release on GitHub.
Before writing any code, understand what makes single-cell data fundamentally different from bulk — the dropout problem, sparse matrices, AnnData and Seurat object structures, and which public datasets to use for practice.
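To see why sparse storage matters, the toy sketch below measures the fraction of zeros in a small count matrix and stores only the non-zero entries. The matrix values and function names are invented for illustration; real AnnData and Seurat objects use optimised sparse matrix formats rather than a Python dict.

```python
def sparsity(dense):
    """Fraction of entries in a list-of-lists matrix that are zero."""
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for value in row if value == 0)
    return zeros / total if total else 0.0


def to_sparse(dense):
    """Keep only non-zero entries as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }
```

On a 4 × 3 toy matrix with three non-zero counts, only three entries need storing instead of twelve; single-cell matrices with thousands of cells and genes benefit far more.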
Seurat is the dominant scRNA-seq R package. Master the full standard workflow on the PBMC3k dataset — QC filtering, SCTransform normalisation, PCA, UMAP, Leiden clustering, marker detection, and cell type annotation.
Real experiments have multiple samples from different batches. Learn to integrate them with Harmony and Seurat CCA — and understand when to use each. Remove batch effects without removing real biological signal.
Scanpy is the Python equivalent of Seurat and is increasingly preferred for large datasets. Reproduce the exact same PBMC analysis in Python — having both Seurat and Scanpy versions shows bilingual competence that few candidates demonstrate.
Model cell differentiation and developmental processes — especially relevant in plant biology. Order cells along a developmental path with Monocle3 in R, then model RNA velocity with scVelo in Python.
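Your DESeq2 knowledge transfers here. The pseudo-bulk approach (aggregate cells per sample per cluster, then run DESeq2) treats samples, not individual cells, as the replicates, which is why it is the recommended method for comparing conditions. Per-cluster Wilcoxon tests treat every cell as an independent replicate and tend to overstate significance, so use them for exploration only.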
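The aggregation step of the pseudo-bulk approach can be sketched in a few lines: sum raw counts over all cells that share a sample and a cluster. The input layout (a list of `(sample, cluster, counts)` tuples) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-gene counts over cells sharing the same (sample, cluster).

    cells: iterable of (sample, cluster, {gene: count}) tuples.
    Returns {(sample, cluster): {gene: summed_count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cluster, counts in cells:
        for gene, count in counts.items():
            totals[(sample, cluster)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}
```

Each resulting `(sample, cluster)` profile then plays the role of one bulk RNA-seq library in the downstream test.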
Your flagship GitHub project. Apply everything to a real plant scRNA-seq dataset — Arabidopsis root or rice. Full Seurat and Scanpy workflows, trajectory analysis, pseudo-bulk DE, and TAIR gene annotations. This is the crown jewel of your portfolio.
68% of hiring managers say the biggest gap in bioinformatics candidates is communication — not technical skill. Learn to write methods sections, craft compelling READMEs, present results to non-bioinformaticians, and write cover letters that get interviews.
At this point you will have 19 GitHub repos with real code, a published plant scRNA-seq analysis, a full RNA-seq pipeline, a genomic selection model from your thesis, and the communication skills to explain all of it in an interview.
Phase 1 is live. Boot into Ubuntu, open a terminal, and begin Lesson 1 right now. No sign-up is required.