The option -l of the command ls also reports the size of the file.
Adding the flag -h makes it “human-readable” (i.e., using K, M, instead of printing the number of bytes).
(base) yuxuan@Yuxuan sandbox % ls -lh
total 640
-rw-r--r-- 1 yuxuan staff 311K Jan 20 18:22 BodyM.csv
drwxr-xr-x 3 yuxuan staff 96B Jan 20 15:00 Papers and reviews
(base) yuxuan@Yuxuan sandbox % ls -l
total 640
-rw-r--r-- 1 yuxuan staff 318599 Jan 20 18:22 BodyM.csv
drwxr-xr-x 3 yuxuan staff 96 Jan 20 15:00 Papers and reviews
The size on disk can also be reported with du:
du -h ../data/Marra2014_data.fasta
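Note that du reports disk usage, which is rounded up to whole blocks, so it may differ from the byte count shown by ls -l. A minimal sketch of the difference, using a hypothetical temporary file rather than the data file above:

```shell
# Create a hypothetical 2000-byte sample file (not part of the original data).
tmpfile=$(mktemp)
head -c 2000 /dev/zero > "$tmpfile"

# Exact size in bytes; reading from standard input avoids printing the file name.
# tr -d ' ' strips the padding that some versions of wc add.
nbytes=$(wc -c < "$tmpfile" | tr -d ' ')
echo "$nbytes"

# Disk usage in human-readable units; the value depends on the file system.
du -h "$tmpfile"

rm "$tmpfile"
```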
grep
# To count the lines that contain a given string, use grep with the option -c
# To match only whole words, use grep with the option -w
grep is a powerful command that finds all the lines of a file that match a given pattern.
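The two options can be tried on a small file. The sketch below assumes a made-up four-line file created only for illustration; note that -c counts matching lines, not individual occurrences within a line.

```shell
# Hypothetical sample file, created only for this illustration.
tmpfile=$(mktemp)
printf 'cat\ncatalog\nthe cat sat\ndog\n' > "$tmpfile"

# Print all lines containing the pattern "cat".
grep "cat" "$tmpfile"

# -c counts the matching lines: "cat", "catalog", and "the cat sat".
n_lines=$(grep -c "cat" "$tmpfile")
echo "$n_lines"

# -w restricts matches to whole words, so "catalog" no longer matches.
n_words=$(grep -c -w "cat" "$tmpfile")
echo "$n_words"

rm "$tmpfile"
```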
wc
wc reports the line, word, and byte (character) counts of a file. The option -l returns the line count only (wc -l) and is a quick way to get an idea of the size of a text file.
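As a quick illustration, the sketch below runs wc on a hypothetical three-line file. The tr -d ' ' step is an addition that strips the leading padding some versions of wc print.

```shell
# Hypothetical sample file with three lines and six words.
tmpfile=$(mktemp)
printf 'one two\nthree four five\nsix\n' > "$tmpfile"

# Lines, words, and bytes, followed by the file name.
wc "$tmpfile"

# Line count only; reading from standard input omits the file name.
n_lines=$(wc -l < "$tmpfile" | tr -d ' ')
echo "$n_lines"

# Word count only.
n_words=$(wc -w < "$tmpfile" | tr -d ' ')
echo "$n_words"

rm "$tmpfile"
```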
cut
cut selects columns from a file; the option -d specifies the delimiter.
The additional option -f lets us extract specific columns, for example columns 1 through 4 (-f 1-4).
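A minimal sketch of both options, assuming a made-up comma-separated file (the column names are hypothetical and chosen only for illustration):

```shell
# Hypothetical CSV file with a header and two data rows.
tmpfile=$(mktemp)
printf 'id,species,biomass,site\n1,oak,2.5,A\n2,pine,1.2,B\n' > "$tmpfile"

# Columns 1 through 2, using the comma as the delimiter.
first_two=$(cut -d ',' -f 1-2 "$tmpfile")
echo "$first_two"

# Only the third column.
third=$(cut -d ',' -f 3 "$tmpfile")
echo "$third"

rm "$tmpfile"
```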
# Now we want to sort according to the number of reads. However, the number of reads is part of a more complex string. We can use sort with -t '=' to split each line at the = sign, and then take the second column (-k 2) to sort numerically (-n)
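The idea can be sketched on a few made-up lines. The contig names and the numreads label below are hypothetical stand-ins for the real headers, which are not reproduced here.

```shell
# Hypothetical lines in which the number of reads follows an = sign.
tmpfile=$(mktemp)
printf 'contig1 numreads=37\ncontig2 numreads=5\ncontig3 numreads=120\n' > "$tmpfile"

# Split at '=', take the second field as the key, and sort it numerically.
sorted=$(sort -t '=' -k 2 -n "$tmpfile")
echo "$sorted"

rm "$tmpfile"
```

Without -n the keys would be compared as text, and 120 would sort before 37 and 5.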
# Display the last two lines of the file
tail -n 2 ....csv
# Display from line 2 onward
tail -n +2 ....csv
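The difference between the two forms is easy to see on a small file. The sketch below assumes a hypothetical file with one header line and three rows.

```shell
# Hypothetical file: a header followed by three rows.
tmpfile=$(mktemp)
printf 'header\nrow1\nrow2\nrow3\n' > "$tmpfile"

# Last two lines of the file.
last_two=$(tail -n 2 "$tmpfile")
echo "$last_two"

# Everything from line 2 onward, a common way to skip a header.
no_header=$(tail -n +2 "$tmpfile")
echo "$no_header"

rm "$tmpfile"
```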
tr
tr is used to substitute (translate) or delete characters.
Substitute all characters a with b:
$ echo "aaaabbb" | tr "a" "b"
bbbbbbb
Substitute every digit in the range 1 through 5 with 0:
$ echo "123456789" | tr 1-5 0
000006789
Substitute lowercase letters with uppercase ones:
$ echo "ACtGGcAaTT" | tr actg ACTG
ACTGGCAATT
We obtain the same result by using bracketed expressions that provide a predefined set of characters. Here, we use the set of all lowercase letters [:lower:] and translate into uppercase letters [:upper:]:
$ echo "ACtGGcAaTT" | tr '[:lower:]' '[:upper:]'
ACTGGCAATT
We can also indicate ranges of characters to substitute:
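A minimal sketch of range substitution, with an input string chosen purely for illustration: each letter in the range a-c is mapped to the digit at the same position in the range 1-3, and all other characters are left unchanged.

```shell
# Map a->1, b->2, c->3; d and e are outside the range and stay as they are.
result=$(echo "aabbccddee" | tr 'a-c' '1-3')
echo "$result"
# 112233ddee
```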
###################################################
# Exercise 1.10.3
###################################################
# 1) Write a script that takes one of these files and
# determines the number of rows (pollinators) and
# columns (plants).
# Counting the number of rows amounts to counting
# the number of lines. This is easily done with
# wc -l. For example:
wc -l ../data/Saavedra2013/n10.txt
# Which returns
# 14 ../data/Saavedra2013/n10.txt
# To avoid printing the file name we
# can either use cat
cat ../data/Saavedra2013/n10.txt | wc -l
# 14
# or use the symbol < for the input
wc -l < ../data/Saavedra2013/n10.txt
# 14
# Counting the number of columns is more work.
# First, we need only the first line:
head -n 1 ../data/Saavedra2013/n10.txt
# 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
# Now we can remove all spaces and the line terminator using tr:
head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n'
# 01000001000000000100
# Finally, we can use wc -c to count the number
# of characters in the string:
head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n' | wc -c
# 20
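Putting the two counts together, the steps above can be sketched on a small made-up matrix. The temporary file and the variable names nrows and ncols are assumptions for illustration; the column count relies on every entry being a single character (0 or 1), as in the file above.

```shell
# Hypothetical 2 x 3 matrix of space-separated 0/1 entries.
tmpfile=$(mktemp)
printf '0 1 0\n1 0 1\n' > "$tmpfile"

# Rows: the number of lines (tr -d ' ' strips padding added by some wc versions).
nrows=$(wc -l < "$tmpfile" | tr -d ' ')

# Columns: characters left on the first line once spaces and the newline are removed.
ncols=$(head -n 1 "$tmpfile" | tr -d ' ' | tr -d '\n' | wc -c | tr -d ' ')

echo "$nrows rows, $ncols columns"

rm "$tmpfile"
```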
# 2) Write a script that prints the number of rows and columns for each network
# This requires some commands that we haven't covered
# in the text. We put this example to show how many
# problems can be solved by searching the web for
# an answer. You don't need to remember all of the
# commands, but you need to be able to find solutions
# online!
# In a script, if you write:
FILES=../data/Saavedra2013/*.txt
# You create a list of all the .txt files in the
# "variable" FILES. You can then iterate through
# all the files in the directory with a "for"
# loop (more on this when we start programming).
# For example, a script with the line above and
for f in $FILES
do
    echo $f
done
# would print the file names in the directory.
# From the point above, you know how to determine
# the number of rows. You can store the result in
# another variable:
for f in $FILES
do
    myrow=`cat $f | wc -l`
    echo $f $myrow
done
# Note the use of backticks `: the syntax of bash is
# very finicky, and you need to write the commands
# exactly as above.
# This script prints the file name as well as
# the number of rows. Using these concepts, you can
# write the script netsize_all.sh
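One possible shape for such a script is sketched below; it is not necessarily the intended solution. The function name netsize_all, the temporary directory, and the two tiny matrices are assumptions made so the example runs on its own. It uses $(...), which is equivalent to backticks but easier to nest, and it assumes single-character entries when counting columns.

```shell
# Print "file rows columns" for every .txt file in the directory given as $1.
netsize_all() {
    for f in "$1"/*.txt
    do
        myrow=$(wc -l < "$f" | tr -d ' ')
        mycol=$(head -n 1 "$f" | tr -d ' ' | tr -d '\n' | wc -c | tr -d ' ')
        echo "$f $myrow $mycol"
    done
}

# Hypothetical data: a 2 x 3 and a 3 x 2 matrix in a temporary directory.
tmpdir=$(mktemp -d)
printf '0 1 0\n1 0 1\n' > "$tmpdir/a.txt"
printf '1 0\n0 1\n1 1\n' > "$tmpdir/b.txt"

report=$(netsize_all "$tmpdir")
echo "$report"

rm -r "$tmpdir"
```

For the real data, the call would be netsize_all ../data/Saavedra2013.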
###################################################
# Exercise 1.10.4 on Buzzard et al. (2016)
###################################################
# 1) Write a script that, for a given csv file and
# column number, prints:
# - Column name
# - Number of distinct values
# - Minimum value
# - Maximum value
# First, we need to extract the column name.
# For example, for the Buzzard data file, and col 7
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | head -n 1
# biomass
# Second, we need to obtain the number of
# distinct values. We can sort the results
# (after removing the header), and use uniq