Lesson 2: File Operations — Bash & Linux

01 Creating files

Before you can copy or move files you need some to work with. There are two main ways to create files from the terminal: touch creates an empty file, and echo combined with the > redirection operator creates a file with content.

`touch` — Create an empty file

📄 Why touch exists

touch was originally designed to update the timestamp of a file (to "touch" it so Linux thinks it was recently modified). But its most common use today is simply to create a new empty file instantly, without opening any editor.

In bioinformatics you might use touch to create placeholder files in a pipeline, for example an empty marker file that records that a step has finished, or just to quickly create a text file to start writing notes.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

cd ~/bash-linux-bioinformatics/module-1-foundations

touch sample.txt                # create one empty file
touch file1.txt file2.txt       # create multiple files at once
ls -lh                           # confirm — size will show 0 because they are empty
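
The placeholder idea can be sketched roughly as follows. This is only an illustration: the scratch directory from mktemp -d and the marker name step1.done are assumptions made for the example, not part of the lesson's folder layout.

```shell
# Work in a throwaway directory so nothing real is touched (assumes mktemp is available)
workdir=$(mktemp -d)
cd "$workdir"

# touch creates the file if it does not exist; the new file is empty
touch step1.done                 # hypothetical marker name, for illustration only
ls -l step1.done

# A later step can check for the marker before doing any work
if [ -e step1.done ]; then
    marker_status="step 1 already finished"
else
    marker_status="step 1 still needs to run"
fi
echo "$marker_status"

# touch on an existing file only updates its timestamp; the contents are left alone
echo "keep me" > notes.txt
touch notes.txt
notes_content=$(cat notes.txt)
echo "$notes_content"
```

Because touch never changes what is inside an existing file, it is safe to run it on a file that already has content.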

`echo` — Write text into a file

📝 How echo writes to files — and why > vs >> matters

echo on its own just prints text to your screen. But when you combine it with > or >>, it redirects that text into a file instead.

The single > means overwrite. It creates the file if it does not exist, and completely replaces the contents if it does. Think of it like taking a blank piece of paper and writing on it — anything that was there before is gone.

The double >> means append. It adds the new text to the end of the file without touching what was already there. Think of it like adding a new line at the bottom of an existing document.

Getting these two confused is one of the most common mistakes beginners make — and it can silently destroy data. Always ask yourself: "do I want to replace or add?"

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

echo "Hello bioinformatics"                   # just prints to screen — no file created
echo "Hello bioinformatics" > notes.txt        # creates notes.txt with this one line
echo "Second line" >> notes.txt               # appends — notes.txt now has two lines
echo "This overwrites everything" > notes.txt # DANGER: notes.txt now has only this one line

⚠

A single > destroys the previous contents of a file silently — no warning, no confirmation. A double >> is always safe because it only adds. When in doubt, use >>.
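
To see the difference for yourself without risking real files, something like the sketch below may help. It runs in a scratch directory (an assumption of this example) and also shows the shell's noclobber option, which the lesson does not cover: while it is switched on, a plain > refuses to overwrite an existing file.

```shell
# Scratch directory so the experiment cannot damage real notes
workdir=$(mktemp -d)
cd "$workdir"

echo "first line"  > notes.txt     # creates the file
echo "second line" >> notes.txt    # appends
lines_after_append=$(wc -l < notes.txt)

echo "replacement" > notes.txt     # overwrites: the two earlier lines are gone
lines_after_overwrite=$(wc -l < notes.txt)

# Optional safety net: with noclobber on, > refuses to overwrite an existing file
set -o noclobber
if echo "oops" > notes.txt; then
    clobber_result="overwritten"
else
    clobber_result="blocked"       # the shell also prints its own error message
fi
set +o noclobber                   # switch the protection back off

echo "$lines_after_append $lines_after_overwrite $clobber_result"
```

With noclobber on, >> still works as usual, so appending is unaffected.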

02 Copying & moving files

`cp` — Copy a file

📋 Why cp exists — and the golden rule of raw data

cp stands for copy. It copies a file from one location to another. The original file stays exactly where it was — untouched.

In bioinformatics, cp exists because of one golden rule: never work directly on your raw data. When you receive a FASTQ file from a sequencer, you copy it into a working directory and run your pipeline on the copy. The original stays in a safe location, untouched. If anything goes wrong — a script error, a disk failure, a bad trim — you still have your raw data and can start again.

The -r flag means recursive. Without it, cp refuses to copy a folder and gives an error. With -r, it copies the folder and everything inside it. You must use -r any time you are copying a directory.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

cp notes.txt notes_backup.txt                    # copy in same folder with new name
cp notes.txt ~/bash-linux-bioinformatics/data/     # copy to a different folder
cp notes.txt ~/bash-linux-bioinformatics/data/notes_v2.txt  # copy with a new name in a new folder
cp -r ~/bash-linux-bioinformatics/data/ ~/bash-linux-bioinformatics/data_backup/  # copy entire folder

💡

After copying, always run ls -lh in the destination folder to confirm the file arrived and has the right size. A copy that silently fails is worse than one that gives an error.
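
Checking the size with ls -lh is a quick sanity check. If you want a stricter one, a sketch like this compares the copy with the original byte for byte using cmp. The raw/ and work/ folders and the tiny FASTQ are made up for the example.

```shell
# Scratch layout standing in for a raw-data folder and a working folder (hypothetical names)
workdir=$(mktemp -d)
mkdir "$workdir/raw" "$workdir/work"
printf '@read1\nATCG\n+\nIIII\n' > "$workdir/raw/sample.fastq"

# Copy the raw file into the working folder; the original stays where it is
cp "$workdir/raw/sample.fastq" "$workdir/work/"
ls -lh "$workdir/work"

# cmp -s prints nothing; an exit status of 0 means the two files are identical
if cmp -s "$workdir/raw/sample.fastq" "$workdir/work/sample.fastq"; then
    copy_check="identical"
else
    copy_check="different"
fi
echo "copy is $copy_check"
```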

`mv` — Move or rename a file

🚚 How mv works — and why it does two things

mv stands for move. It moves a file from one location to another. Unlike cp, the original is removed from its starting location — only one copy exists at a time.

Here is why mv also renames files: when you "move" a file to the same folder but give it a different name, you are effectively renaming it. Linux treats "moving to the same location with a new name" and "renaming" as exactly the same operation — so mv does both.

An important difference from cp: mv works on both files and folders without needing -r. You can move an entire folder with a simple mv folder/ destination/.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

mv notes.txt notes_renamed.txt                   # rename a file (same folder, new name)
mv notes_renamed.txt ~/bash-linux-bioinformatics/data/  # move to a different folder
mv ~/bash-linux-bioinformatics/data/ ~/bash-linux-bioinformatics/data_v1/  # rename a folder

cp vs mv in one sentence: cp is a photocopier — you end up with two copies. mv is physically picking up the file and putting it somewhere else — only one copy ever exists.
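
The photocopier comparison can be checked with a small experiment. This is a minimal sketch run in a scratch directory (an assumption of the example), reusing the lesson's file names.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
echo "Hello bioinformatics" > notes.txt

cp notes.txt notes_backup.txt      # photocopy: both files now exist
mv notes.txt notes_renamed.txt     # rename: notes.txt is gone, notes_renamed.txt replaces it
mv notes_renamed.txt data/         # move: the file now lives only in data/

ls -R .

# Record what exists so the outcome is easy to read
[ -e notes.txt ] && original="present" || original="gone"
[ -e notes_backup.txt ] && backup="present" || backup="gone"
[ -e data/notes_renamed.txt ] && moved="present" || moved="gone"
echo "notes.txt: $original, notes_backup.txt: $backup, data/notes_renamed.txt: $moved"
```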

03 Deleting files safely

`rm` — Remove a file permanently

🗑 Why rm is dangerous — and how to use it safely

rm stands for remove. It permanently deletes files. Unlike a desktop file manager, rm does not use a recycle bin: when you run it, the file is gone immediately, with no warning and no built-in way to recover it.

This makes rm both powerful and dangerous. In bioinformatics you use it constantly — to clean up large intermediate files (trimmed FASTQ files, temporary BAM files) that can be several hundred gigabytes each. Keeping them wastes disk space. But deleting the wrong file can destroy months of work.

The -i flag (interactive) is your safety net. It makes rm ask "are you sure?" before deleting each file. You have to type y and press Enter to confirm. This one flag has saved countless researchers from disaster. Use it whenever you are not 100% certain.

The -r flag means recursive — needed to delete a folder and everything inside it. Combined with -f (force), it deletes everything silently. Never run rm -rf unless you are absolutely certain of what you are deleting.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

rm notes_backup.txt                 # delete one file — permanent, no confirmation
rm file1.txt file2.txt              # delete multiple files at once
rm -i sample.txt                  # -i asks "remove sample.txt?" — type y to confirm
rm -r ~/bash-linux-bioinformatics/data_backup/  # delete a folder and all its contents
rm -ri ~/bash-linux-bioinformatics/data_backup/ # same but confirms each file first

⚠

Never run rm -rf /. It targets the entire filesystem, including the operating system; modern GNU rm refuses this by default, but do not rely on that. Never run rm -rf * from a folder you are not 100% sure about: it silently deletes every non-hidden file and folder there. When in doubt, use rm -i.
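
One habit that may help with wildcards is to preview what a pattern matches before handing it to rm. The sketch below does this in a scratch directory; the file names are invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch keep.fastq trimmed_1.tmp trimmed_2.tmp .hidden.tmp

# 1. Preview: let the shell expand the pattern without deleting anything
echo *.tmp

# 2. Delete only after the preview shows exactly what you expect
rm *.tmp

# * does not match names that start with a dot, so .hidden.tmp survives
remaining=$(ls -A)
echo "$remaining"
```

The preview costs a few seconds and shows exactly which names rm would receive.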

04 Reading files without opening them

In bioinformatics, files are often enormous. A single compressed FASTQ file can be 10–50 GB. A VCF file with variants from a whole-genome sequencing experiment can have millions of lines. You never open these in a text editor — it would either crash or take minutes to load. Instead, you use command-line tools to see exactly the part you need.

`cat` — Print the entire file

📖 What cat does and when not to use it

cat stands for concatenate. Its original purpose was to join (concatenate) multiple files together and print the result. But its most common everyday use is simply printing a single small file to the screen.

The critical word here is small. If you run cat on a 50 GB FASTQ file, your terminal will scroll through millions of lines for hours — and you will have to force-quit it with Ctrl + C. Only use cat on files you know are small, like configuration files, short scripts, or text files you just created.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

cat notes.txt                   # print entire file — only use on small files
cat file1.txt file2.txt          # print two files back to back (concatenate)
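
Since joining files is what cat was built for, here is a small sketch that saves the joined result instead of only printing it. The output name combined.txt and the scratch directory are assumptions of this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "Sorghum bicolor" > file1.txt
echo "Oryza sativa"    > file2.txt

# Join both files, in the order given, into a new file
cat file1.txt file2.txt > combined.txt

cat combined.txt
combined_lines=$(wc -l < combined.txt)
echo "combined.txt has $combined_lines lines"
```

Always write the result to a new file name. Redirecting into one of the input files (for example cat file1.txt file2.txt > file1.txt) empties that file before cat gets to read it.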

`head` — See the beginning of a file

🔍 Why head is safer than cat for large files

head shows only the first lines of a file — 10 by default. It reads only what it needs and then stops, no matter how large the file is. This makes it completely safe to use on any file of any size.

In bioinformatics, head is one of the first things you run on any new file to understand its format. For example, running head sample.fastq immediately shows you the header format, sequence length, and quality score encoding of the file — enough to confirm the file is what you expect before running a long alignment job.

Since each read in a FASTQ file is exactly 4 lines, head -n 8 shows you the first 2 reads — enough to understand the file's structure.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

head notes.txt                   # first 10 lines (default)
head -n 4 notes.txt              # first 4 lines only
head -n 8 sample.fastq           # first 8 lines = first 2 FASTQ reads (4 lines each)
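
If you do not have a FASTQ file yet, the sketch below builds a tiny one in a scratch directory and inspects it with head. The three reads are made up purely for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny 3-read FASTQ (invented reads, for illustration only)
printf '@read1\nATCGATCG\n+\nIIIIIIII\n'  > sample.fastq
printf '@read2\nGCTAGCTA\n+\nIIIIIIII\n' >> sample.fastq
printf '@read3\nTTAACCGG\n+\nIIIIIIII\n' >> sample.fastq

head -n 8 sample.fastq                 # first 2 reads (4 lines each)

# Line 2 of the file is the first sequence; its length is the read length
first_seq=$(head -n 2 sample.fastq | tail -n 1)
echo "first read: $first_seq (${#first_seq} bases)"

shown_lines=$(head -n 8 sample.fastq | wc -l)
```

Piping head into tail like this is a common way to pull out one specific line near the top of a large file.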

`tail` — See the end of a file

📋 Why tail is essential for pipeline monitoring

tail shows the last lines of a file — 10 by default. It is the mirror image of head.

Its most powerful use in bioinformatics is the -f flag (follow). When you run a long pipeline — a STAR alignment that takes 30 minutes, a GATK variant calling job that takes hours — the pipeline writes its progress to a log file. tail -f pipeline.log keeps watching the log file and prints new lines as they appear, so you can monitor the pipeline's progress in real time from a second terminal window. Press Ctrl + C to stop following.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

tail notes.txt                   # last 10 lines (default)
tail -n 20 pipeline.log          # last 20 lines of a log file
tail -f pipeline.log             # live view — prints new lines as they appear
                                    # press Ctrl + C to stop

💡

When running a long pipeline: open a second terminal tab and run tail -f your_pipeline.log to watch progress live. This is standard practice in bioinformatics — you never just wait and guess whether something is still running.
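
To try tail -f without a real pipeline, the following sketch fakes one: a loop appends a line to a log every second while tail -f follows it in the background. The log name, the messages, and the scratch directory are all assumptions made for the example, and in real use you would run tail -f in a second terminal rather than in the background.

```shell
workdir=$(mktemp -d)
log="$workdir/pipeline.log"
seen="$workdir/seen.txt"
: > "$log"                                  # start with an empty log

# Follow the log in the background and save whatever tail prints
tail -f "$log" > "$seen" &
tail_pid=$!

# Stand-in for a pipeline writing progress messages
for step in 1 2 3; do
    echo "step $step finished" >> "$log"
    sleep 1
done

sleep 1                                     # give tail a moment to catch the last line
kill "$tail_pid"                            # stop following, like Ctrl + C in a terminal
wait "$tail_pid" 2>/dev/null || true

cat "$seen"
```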

05 Counting with wc

`wc` — Count lines, words, and characters

🔢 Why wc is so useful in bioinformatics

wc stands for word count, but the name is misleading — you will use it almost exclusively for counting lines with wc -l.

The most important bioinformatics use is counting reads in a FASTQ file. Every sequencing read in FASTQ format takes exactly 4 lines: a header line starting with @, the DNA sequence, a + separator, and the quality scores. So the number of reads = total lines ÷ 4.

For example, if wc -l sample.fastq returns 4000000, then you have exactly 1,000,000 reads in that file. This is one of the first checks you run after receiving sequencing data — to confirm you got the right number of reads before spending hours aligning them.

When you run wc notes.txt without any flag, it prints three numbers: lines, words, and bytes, in that order. The filename appears at the end. For plain ASCII text one character is one byte, so the byte count equals the character count; use wc -m if you need a true character count in text that contains multi-byte characters.

Run from: ~/bash-linux-bioinformatics/module-1-foundations

bash

wc notes.txt                   # prints: lines  words  bytes  filename
wc -l notes.txt                # lines only (the most commonly used form)
wc -w notes.txt                # words only
wc -c notes.txt                # bytes only (use -m for characters)
wc -l sample.fastq             # count lines in a FASTQ, then divide by 4 for the read count

# Example output of: wc notes.txt
  5  12  68 notes.txt
# 5 lines, 12 words, 68 bytes

FASTQ read count formula: wc -l sample.fastq gives you the total lines. Divide by 4 to get the number of reads. So 400000 lines = 100,000 reads.
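
The division can be done by the shell itself. Here is a minimal sketch, using a made-up 3-read FASTQ in a scratch directory; the variable names are arbitrary.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1\nATCGATCG\n+\nIIIIIIII\n'  > sample.fastq
printf '@read2\nGCTAGCTA\n+\nIIIIIIII\n' >> sample.fastq
printf '@read3\nTTAACCGG\n+\nIIIIIIII\n' >> sample.fastq

# Reading from stdin (<) makes wc print only the number, without the file name
total_lines=$(wc -l < sample.fastq)
reads=$((total_lines / 4))
echo "$total_lines lines / 4 = $reads reads"

# A line count that is not a multiple of 4 suggests a truncated or malformed file
if [ $((total_lines % 4)) -ne 0 ]; then
    echo "warning: line count is not a multiple of 4"
fi
```

The final check is optional, but it is a cheap way to spot a file that was cut off during transfer.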

06 Quick reference

Command	What it does	Key flags
touch [file]	Create an empty file (or update timestamp of existing file)	—
echo "text" > [file]	Write text into a file — overwrites existing content	`>>` appends instead of overwriting
cp [src] [dest]	Copy a file — original stays in place	`-r` required for folders
mv [src] [dest]	Move or rename a file/folder — original is removed	—
rm [file]	Permanently delete a file — no undo, no recycle bin	`-i` ask before deleting · `-r` delete folder
cat [file]	Print entire file to screen — use on small files only	—
head [file]	Show first 10 lines — safe on any file size	`-n N` show N lines
tail [file]	Show last 10 lines — great for log files	`-n N` show N lines · `-f` live follow
wc [file]	Count lines, words, and bytes	`-l` lines only · `-w` words · `-c` bytes · `-m` characters

07 Exercises

Work through all five exercises in your Ubuntu terminal. Type every command yourself — do not copy-paste. After each exercise, verify the result before moving on.

Exercise 1 · Create and inspect a file

Navigate to your ~/bash-linux-bioinformatics/module-1-foundations/ folder. Create a file called species.txt and write three lines into it one by one: Sorghum bicolor, Arabidopsis thaliana, and Oryza sativa. Then print the complete file to the screen to verify all three lines are there.

💬 Hint: use > for the first line only, then >> for the second and third — otherwise you will overwrite and only have one line.

Answer

cd ~/bash-linux-bioinformatics/module-1-foundations
echo "Sorghum bicolor" > species.txt          # creates the file with line 1
echo "Arabidopsis thaliana" >> species.txt    # appends line 2
echo "Oryza sativa" >> species.txt           # appends line 3
cat species.txt
Sorghum bicolor
Arabidopsis thaliana
Oryza sativa

Exercise 2 · Copy and move

Copy species.txt to ~/bash-linux-bioinformatics/data/raw/. Then rename the copy from species.txt to plant_species.txt using a single mv command. Finally, confirm the original still exists in module-1-foundations/ and the renamed copy exists in data/raw/.

Answer

cd ~/bash-linux-bioinformatics/module-1-foundations
cp species.txt ~/bash-linux-bioinformatics/data/raw/
mv ~/bash-linux-bioinformatics/data/raw/species.txt ~/bash-linux-bioinformatics/data/raw/plant_species.txt
# Confirm original exists
ls ~/bash-linux-bioinformatics/module-1-foundations/
species.txt    # still here
# Confirm renamed copy exists
ls ~/bash-linux-bioinformatics/data/raw/
plant_species.txt

Exercise 3 · head, tail, and wc on a system file

The file /etc/passwd lists every user account on your system — it has many lines. Without opening it, answer three questions using three separate commands: How many lines does it have? What are the first 3 lines? What is the very last line?

💬 Hint: wc -l /etc/passwd, head -n 3 /etc/passwd, tail -n 1 /etc/passwd. Note how you can read this file without navigating to /etc first — just give the full path.

Answer

wc -l /etc/passwd
45 /etc/passwd          # exact number varies by system

head -n 3 /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin

tail -n 1 /etc/passwd
shajedur:x:1000:1000:,,,:/home/shajedur:/bin/bash
# On a single-user system your own account is often the last entry, but this varies

Exercise 4 · Safe delete

Navigate to your home directory. Create a temporary file called temp_delete_me.txt. Now delete it using the -i flag so Linux asks you to confirm first. Type y and press Enter. Then verify the file is gone with ls.

💬 Hint: rm -i ~/temp_delete_me.txt will ask "remove temp_delete_me.txt?" — type y then Enter.

Answer

cd ~
touch temp_delete_me.txt
rm -i temp_delete_me.txt
rm: remove regular empty file 'temp_delete_me.txt'? y
ls temp_delete_me.txt
ls: cannot access 'temp_delete_me.txt': No such file or directory
# Confirmed — the file is gone

Exercise 5 · Challenge · Build and count a simulated FASTQ file

Navigate to ~/bash-linux-bioinformatics/module-1-foundations/. Create a file called sample.fastq with exactly 3 reads using echo and >>. Each read must have exactly 4 lines: a header starting with @, a DNA sequence, a + line, and a quality score line. Then count the total lines with wc -l and calculate how many reads that is.

💬 Hint: 3 reads × 4 lines each = 12 total lines. Use > for the very first line, then >> for all remaining 11 lines.

Answer

cd ~/bash-linux-bioinformatics/module-1-foundations
echo "@read1" > sample.fastq
echo "ATCGATCG" >> sample.fastq
echo "+" >> sample.fastq
echo "IIIIIIII" >> sample.fastq
echo "@read2" >> sample.fastq
echo "GCTAGCTA" >> sample.fastq
echo "+" >> sample.fastq
echo "IIIIIIII" >> sample.fastq
echo "@read3" >> sample.fastq
echo "TTAACCGG" >> sample.fastq
echo "+" >> sample.fastq
echo "IIIIIIII" >> sample.fastq

wc -l sample.fastq
12 sample.fastq
# 12 lines ÷ 4 lines per read = 3 reads ✓
