bash command combo
Yuxuan Wu

ls


ls -lh

The option -l of the command ls also reports the size of the file.

Adding the flag -h makes the sizes “human-readable” (i.e., printed using K, M, etc., instead of the raw number of bytes).

(base) yuxuan@Yuxuan sandbox % ls -lh
total 640
-rw-r--r-- 1 yuxuan staff 311K Jan 20 18:22 BodyM.csv
drwxr-xr-x 3 yuxuan staff 96B Jan 20 15:00 Papers and reviews
(base) yuxuan@Yuxuan sandbox % ls -l
total 640
-rw-r--r-- 1 yuxuan staff 318599 Jan 20 18:22 BodyM.csv
drwxr-xr-x 3 yuxuan staff 96 Jan 20 15:00 Papers and reviews

The size could also be calculated as

du -h ../data/Marra2014_data.fasta 

grep

# To count the lines containing a given string,
# use grep with the option -c
grep -c 'pattern' my_file.fasta

# To make grep match only full words, add the option -w
grep -w 'pattern' my_file.fasta


grep is a powerful command that finds all the lines of a file that match a given pattern.
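The difference between -c and -w is easy to see on a small test file (animals.txt below is made up, just for illustration):

```shell
# Hypothetical test file, for illustration only
printf 'cat\ncatalog\ncat cat\n' > animals.txt

# -c counts matching *lines*, not total occurrences
grep -c 'cat' animals.txt
# 3

# -w matches only full words, so "catalog" no longer matches
grep -c -w 'cat' animals.txt
# 2
```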

wc

wc returns the line, word, and byte (character) counts of a file. The option -l returns the line count only and is a quick way to get an idea of the size of a text file.
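A quick sketch (fish.txt is a made-up two-line file):

```shell
# Create a small test file (hypothetical contents)
printf 'one fish\ntwo fish\n' > fish.txt

wc fish.txt       # line, word, and byte counts, plus the file name
wc -l fish.txt    # line count only
wc -l < fish.txt  # redirecting the input omits the file name
```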

cut

cut is used to select columns, and the option -d specifies the delimiter.

The additional option -f lets us extract specific columns, either a single one (-f 1) or a range (-f 1-4).

grep '>' my_file.fasta | cut -d ',' -f 4 | head -n 2

sort

# Now we want to sort according to the number of reads.
# However, the number of reads is part of a more complex
# string. We can use -t '=' to split according to the =
# sign, and then take the second column (-k 2) to sort
# numerically (-n)

grep '>' my_file.fasta | cut -d ',' -f 1,3 | sort -t '=' -k 2 -n | head -n 5
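The effect of -t, -k, and -n can be seen on a toy example (the name=count lines below are made up):

```shell
# Split each line on '=' and sort numerically on the second field
printf 'a=10\nb=2\nc=33\n' | sort -t '=' -k 2 -n
# b=2
# a=10
# c=33
```

Without -n, sort would compare the counts as strings and put "10" before "2".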

tail

# Display last two lines of the file
tail -n 2 ....csv

# Display from line 2 onward
tail -n +2 ....csv
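The difference between -n 2 and -n +2 is easy to see with seq:

```shell
seq 5 | tail -n 2    # last two lines: 4 5
seq 5 | tail -n +2   # everything from line 2 onward: 2 3 4 5
```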

tr

tr is used to substitute (or delete) characters.

Substitute all characters a with b:

$ echo "aaaabbb" | tr "a" "b"

bbbbbbb

Substitute every digit in the range 1 through 5 with 0:

$ echo "123456789" | tr 1-5 0

000006789

Substitute lowercase letters with uppercase ones:

$ echo "ACtGGcAaTT" | tr actg ACTG

ACTGGCAATT

We obtain the same result by using bracketed expressions that provide a predefined set of characters. Here, we use the set of all lowercase letters [:lower:] and translate into uppercase letters [:upper:]:

$ echo "ACtGGcAaTT" | tr '[:lower:]' '[:upper:]'

ACTGGCAATT

We can also indicate ranges of characters to substitute:

$ echo "aabbccddee" | tr a-c 1-3

112233ddee

Delete all occurrences of a:

$ echo "aaaaabbbb" | tr -d a
bbbb

“Squeeze” all consecutive occurrences of a:

$ echo "aaaaabbbb" | tr -s a

abbbb

Bash_script_hw1

http://computingskillsforbiologists.com/downloads/exercises/#unix


# 2. Write a script taking as input the file name and
# the ID of the individual, and returning the number
# of records for that ID.

# We just need to read the input from the command
# line (using $1 for the file name and $2 for the ID),
# and then run the commands above.

# The script count_baboons.sh shows the solution


#!/bin/bash

# Count the number of records for a given ID
# Input:
# $1 is the file name
# $2 is the male ID

cut -f 1 "$1" | grep -c -w "$2"


bash count_baboons.sh ../data/Gesquiere2011_data.csv 27


###################################################
# Exercise 1.10.3
###################################################
# 1) Write a script that takes one of these files and
# determines the number of rows (pollinators) and
# columns (plants).

# Counting the number of rows amounts to counting the
# number of lines. This is easily done with wc -l.
# For example:

wc -l ../data/Saavedra2013/n10.txt

# Which returns
# 14 ../data/Saavedra2013/n10.txt

# To avoid printing the file name we
# can either use cat

cat ../data/Saavedra2013/n10.txt | wc -l
# 14

# or use the symbol < for the input

wc -l < ../data/Saavedra2013/n10.txt
# 14

# Counting the number of columns is more work.
# First, we need only the first line:

head -n 1 ../data/Saavedra2013/n10.txt

# 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0

# Now we can remove all spaces and the line terminator using tr:
head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n'

# 01000001000000000100

# Finally, we can use wc -c to count the number
# of characters in the string:

head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n' | wc -c
# 20

# Armed with this information, we can write our
# script (see the file netsize.sh).
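The file netsize.sh itself is not reproduced in this post; a minimal sketch, assuming only the commands above, could look like:

```shell
#!/bin/bash
# netsize.sh (sketch; hypothetical reconstruction)
# $1 is one of the Saavedra2013 matrix files
echo "Number of rows (pollinators):"
wc -l < "$1"
echo "Number of columns (plants):"
head -n 1 "$1" | tr -d ' ' | tr -d '\n' | wc -c
```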


#!/bin/bash

FILES=../data/Saavedra2013/*.txt
for f in $FILES
do
    myrow=`cat $f | wc -l`
    mycol=`head -n 1 $f | tr -d ' ' | tr -d '\n' | wc -c`
    echo $f $myrow $mycol
done
# 2) Write a script that prints the number of rows and columns for each network

# This requires some commands that we haven't covered
# in the text. We put this example to show how many
# problems can be solved by searching the web for
# an answer. You don't need to remember all of the
# commands, but you need to be able to find solutions
# online!

# In a script, if you write:

FILES=../data/Saavedra2013/*.txt

# You create a list of all the .txt files in the
# "variable" FILES. You can then iterate through
# all the files in the directory with a "for"
# loop (more on this when we start programming).
# For example, a script with the line above and

for f in $FILES
do
    echo $f
done

# would print the file names in the directory.

# From the point above, you know how to determine
# the number of rows. You can store the result in
# another variable:

for f in $FILES
do
    myrow=`cat $f | wc -l`
    echo $f $myrow
done

# Note the use of backticks `: the syntax of bash is
# very finicky, and you need to write the commands
# exactly as above.

# This script prints the file name as well as
# the number of rows. Using these concepts, you can
# write the script netsize_all.sh


###################################################
# Exercise 1.10.4 on Buzzard et al. (2016)
###################################################
# 1) Write a script that, for a given csv file and
# column number, prints:
# - Column name
# - Number of distinct values
# - Minimum value
# - Maximum value

# First, we need to extract the column name.
# For example, for the Buzzard data file, and col 7

cut -d ',' -f 7 ../data/Buzzard2015_data.csv | head -n 1

# biomass

# Second, we need to obtain the number of
# distinct values. We can sort the results
# (after removing the header), and use uniq

cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort | uniq | wc -l

# 285

# Third, to get the max/min value we can use
# the code above, sort using -n, and either
# head (for min) or tail (for max) the result.

# Minimum
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort -n | head -n 1

# 1.048466198

# Maximum
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort -n | tail -n 1

# 14897.29471

# The file explore.sh is a possible incarnation
# of the script.

# For example:

bash explore.sh ../data/Buzzard2015_data.csv 6

# Column name
# Abund.n
# Number of distinct values:
# 46
# Minimum value:
# 1
# Maximum value:
# 157

# This also works well for alphabetical order:

bash explore.sh ../data/Buzzard2015_data.csv 3

# Column name
# genus
# Number of distinct values:
# 85
# Minimum value:
# Acacia
# Maximum value:
# Zanthoxylum
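The file explore.sh is not shown in the post either; a possible reconstruction, consistent with the commands and the output above, is:

```shell
#!/bin/bash
# explore.sh (sketch; hypothetical reconstruction)
# $1 is the csv file, $2 is the column number
echo "Column name"
cut -d ',' -f "$2" "$1" | head -n 1
echo "Number of distinct values:"
cut -d ',' -f "$2" "$1" | tail -n +2 | sort | uniq | wc -l
echo "Minimum value:"
cut -d ',' -f "$2" "$1" | tail -n +2 | sort -n | head -n 1
echo "Maximum value:"
cut -d ',' -f "$2" "$1" | tail -n +2 | sort -n | tail -n 1
```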
  • Create time: 2022-01-20 19:38:03
  • Post link: yuxuanwu17.github.io/2022/01/20/2022-01-20-bash-command-combo/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.