Whether you’re processing large datasets on remote servers or fixing bugs in a branch, command line skills are essential for working efficiently. Check out the commands that I’ve found to be useful for processing terabytes of Arctic data for visualization on the Permafrost Discovery Gateway.
Although the terminal lacks a friendly GUI, it processes commands faster than navigating your IDE’s file directory and it’s highly useful for working with large datasets across machines. It’s unforgiving about capitalization and typos, and most commands can’t be undone, but it’s so worth it.
The following are my favorite terminal commands for branching repositories, counting files in large data directories, running scripts in the background, and more.
General commands to navigate directories and shuffle them around. Modifying large dataset structures can overwhelm an IDE, so it’s best done in the terminal.
Additionally, running commands with tmux allows processes to continue running in the background, even if the connection to the server is lost from VS Code. It also allows multiple processes to be run simultaneously in different tmux sessions, so you can transfer a directory in one terminal while running scripts in another, or just close your laptop while a script runs with no concern about your laptop falling asleep.
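The detach-and-keep-running idea can be sketched in a few lines. This is only a minimal sketch, assuming tmux is installed: the session name `transfer_demo` and the `sleep 30` job are placeholders I made up to stand in for a real long-running script.

```shell
#!/usr/bin/env bash
# Sketch: start a job in a detached tmux session so it keeps running
# even if the SSH or VS Code connection drops.
# Assumptions: the session name and the `sleep 30` job are placeholders.
session="transfer_demo"
status="skipped"

if command -v tmux >/dev/null 2>&1; then
  # -d starts the session detached, -s gives it a name.
  if tmux new-session -d -s "$session" 'sleep 30'; then
    # Confirm the session exists while the job runs in the background.
    if tmux has-session -t "$session" 2>/dev/null; then
      status="running"
    fi
    # Clean up the demo session (leave this out to keep the job alive).
    tmux kill-session -t "$session"
  fi
else
  echo "tmux is not installed; skipping demo"
fi
echo "tmux demo status: $status"
```

With a real job, you would skip the `kill-session` line and reattach later with `tmux a -t transfer_demo`.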
Use | Command |
---|---|
list files and directories vertically in long format, with date last modified | ls -l |
list all files and directories, including hidden ones | ls -a |
remove directory and all contents recursively | rm -r {DIRECTORY} |
move file or directory | mv file_name new_path/ |
rename file or directory in current directory | mv file_name new_name |
check number of files and directories in the current directory | ls -1 | wc -l (pay attention to ‘l’ versus ‘1’ here) |
count files of any kind, recursively, from current dir | find . -type f | wc -l |
count files with a certain extension, recursively | find . -type f -name "*.{EXTENSION}" | wc -l |
check total data storage in documents directory | install Rust with: curl https://sh.rustup.rs -sSf | sh , then install the package with: cargo install dirstat-rs , then run: ds documents |
create symbolic link to folder | ln -s /path/to/folder {LINK NAME} |
create new tmux session | tmux |
exit tmux session & allow it to run in the background | ctrl + b, then d |
enter into a specific tmux session | tmux a -t {SESSION ID} |
check all active tmux sessions | tmux ls |
kill tmux session | tmux kill-session -t {SESSION ID} |
Count the files directly inside the current directory and inside each of its immediate subdirectories (run all as one command):
find . -maxdepth 1 -type d -print0 | sort -z | while IFS= read -r -d '' dir; do n=$(find "$dir" -maxdepth 1 -type f | wc -l); printf "%4d : %s\n" "$n" "$dir"; done
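To see how the counting commands from the table differ, here is a small sketch that runs them against a throwaway directory tree. The directory and file names are made up for illustration, not taken from a real dataset.

```shell
#!/usr/bin/env bash
# Build a small scratch tree (all names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/tiles/2021" "$demo/tiles/2022"
touch "$demo/readme.txt" "$demo/tiles/a.tif" \
      "$demo/tiles/2021/b.tif" "$demo/tiles/2022/c.tif" "$demo/tiles/2022/d.gpkg"

# Entries (files AND directories) directly inside the top level.
top_level=$(ls -1 "$demo" | wc -l)

# Regular files at any depth.
all_files=$(find "$demo" -type f | wc -l)

# Regular files with one extension at any depth.
tif_files=$(find "$demo" -type f -name "*.tif" | wc -l)

echo "top-level entries: $top_level"
echo "all files:         $all_files"
echo "tif files:         $tif_files"

# Remove the scratch tree.
rm -r "$demo"
```

Here the top level holds 2 entries (one file, one directory), while the recursive counts find 5 files in total and 3 with the `.tif` extension.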
Use | Command |
---|---|
remove requirement to enter GitHub credentials on server | git config --global credential.helper "cache --timeout=100000000" |
switch into branch develop | git checkout develop |
push to branch develop | git add {FILES} , then: git commit -m "{MESSAGE}" , then: git push origin develop |
create new branch | git checkout -b {NewBranchName} |
create new branch from the develop branch | switch to the develop branch with git checkout develop , pull updates with git pull , then: git checkout -b {NewBranchName} develop |
print all branches in repo | git branch -a |
check current branch, uncommitted changes, and whether the branch is ahead of or behind its remote | git status |
check recent commits | git log |
merge develop branch into main | push changes from develop , then: git checkout main , then: git merge develop |
display repo’s branching & commit history as a tree | git log --graph |
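The branching and merging rows above can be strung together in a scratch repository. This is a minimal sketch, assuming git is installed: the branch name `feature-x`, the file names, and the commit identity are placeholders, and because there is no remote here, `git pull` and `git push` are left out.

```shell
#!/usr/bin/env bash
# Sketch of a develop -> feature -> main flow in a scratch repository.
# Branch name, file names, and commit identity are placeholders.
git_demo="skipped"
if command -v git >/dev/null 2>&1; then
  repo=$(mktemp -d)
  start_dir=$PWD
  cd "$repo" || exit 1
  git init -q
  git symbolic-ref HEAD refs/heads/main    # name the first branch "main"
  git config user.name "Demo User"         # local identity for the demo commits
  git config user.email "demo@example.com"

  echo "v1" > data.txt
  git add data.txt
  git commit -q -m "initial commit"

  git checkout -q -b develop               # create develop from main
  git checkout -q -b feature-x develop     # create a new branch from develop
  echo "new step" > step.txt
  git add step.txt
  git commit -q -m "add processing step"

  git checkout -q develop
  git merge -q feature-x                   # bring the feature into develop
  git checkout -q main
  git merge -q develop                     # merge develop into main

  git branch -a                            # print all branches
  git log --graph --oneline                # history as a tree
  current_branch=$(git rev-parse --abbrev-ref HEAD)
  cd "$start_dir" || exit 1
  git_demo="ran"
else
  echo "git is not installed; skipping demo"
fi
```

On a real project you would add `git pull` before branching and `git push origin develop` before merging into main, as in the table.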
Options:
- scp: best for small numbers of files and minimum complexity in commands
- rsync: best for large numbers of files, especially with complex directory hierarchies; works well for transfers within a computer, between computers, or between a computer and Google Drive
- globus: has a UI; best for transferring large numbers of files or directories between servers without using the command line or a script, but both the source and destination need globus endpoints
Use scp to copy a file or directory from a local machine to a remote machine (add -r when copying a directory):
scp /path/to/local/file/or/directory/ username@server.host.ucsb.edu:/path/to/destination
Use scp to copy all feather files from the current directory to an account on the “Taylor” server:
scp ./*.feather jscohen@taylor.bren.ucsb.edu:/Users/jscohen/data_features
Use rsync to copy a directory from one location on a local machine to another location on the same machine:
rsync -av /path/to/source/directory /path/to/destination/directory
Note: The option -av combines -v, which sets rsync to communicate its progress throughout the transfer, and -a, the --archive option, which combines the options -rlptgoD. These stand for:
Option | Meaning |
---|---|
-r , --recursive | recurse into directories |
-l , --links | copy symlinks as symlinks |
-p , --perms | preserve permissions |
-t , --times | preserve modification times |
-g , --group | preserve group |
-o , --owner | preserve owner (super-user only) |
-D | same as --devices and --specials : preserve device files and special files such as named sockets and fifos (pipes) |
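Two effects of the rsync -av command above are easy to check on scratch directories: modification times survive the copy (the -t part of -a), and a trailing slash on the source changes what gets copied. This is a minimal sketch, assuming rsync is installed; every path and file name is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: what rsync -av preserves, and how a trailing slash on the
# source changes the result. All names here are placeholders.
rsync_demo="skipped"
times_preserved="no"
if command -v rsync >/dev/null 2>&1; then
  work=$(mktemp -d)
  mkdir -p "$work/source/sub" "$work/dest_a" "$work/dest_b"
  echo "x" > "$work/source/sub/file.txt"
  touch -t 202001010000 "$work/source/sub/file.txt"   # give it an old timestamp

  # No trailing slash: the directory itself lands inside the destination.
  rsync -av "$work/source" "$work/dest_a"

  # Trailing slash: only the directory's contents are copied.
  rsync -av "$work/source/" "$work/dest_b"

  # Show where the file ended up in each case.
  find "$work/dest_a" "$work/dest_b" -type f

  # -a includes -t, so neither copy should be newer than the other.
  src_file="$work/source/sub/file.txt"
  dst_file="$work/dest_b/sub/file.txt"
  if [ ! "$src_file" -nt "$dst_file" ] && [ ! "$dst_file" -nt "$src_file" ]; then
    times_preserved="yes"
  fi
  rsync_demo="ran"
else
  echo "rsync is not installed; skipping demo"
fi
```

The first copy ends up at `dest_a/source/sub/file.txt` and the second at `dest_b/sub/file.txt`, which is worth keeping in mind before pointing rsync at a large directory.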
Use rsync when you need to add complexity to the command, such as the following options:
- --exclude: omits certain files from the transfer
- --update: skips files that are newer in the destination location
- --remove-source-files: deletes the files from the source directory after they are transferred to the destination
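Here is a rough sketch of those three options on scratch directories, assuming rsync is installed. The paths, file names, and the `*.tmp` pattern are placeholders chosen for the demo.

```shell
#!/usr/bin/env bash
# Sketch: --exclude, --update, and --remove-source-files on scratch
# directories. All paths and file names are placeholders.
options_demo="skipped"
if command -v rsync >/dev/null 2>&1; then
  work=$(mktemp -d)
  mkdir -p "$work/src" "$work/dst" "$work/archive"
  echo "results" > "$work/src/keep.csv"
  echo "temporary" > "$work/src/scratch.tmp"
  echo "old copy" > "$work/src/report.txt"
  touch -t 202001010000 "$work/src/report.txt"    # make the source copy old
  echo "edited at destination" > "$work/dst/report.txt"

  # Skip *.tmp files, and leave report.txt alone because the
  # destination copy is newer than the source copy.
  rsync -av --exclude='*.tmp' --update "$work/src/" "$work/dst/"

  # Move-style transfer: the source file is deleted once it has arrived.
  rsync -av --remove-source-files "$work/src/scratch.tmp" "$work/archive/"

  options_demo="ran"
else
  echo "rsync is not installed; skipping demo"
fi
```

After the first command, `keep.csv` is in the destination, `scratch.tmp` is not, and the destination's `report.txt` keeps its own contents. After the second, `scratch.tmp` exists only in `archive`.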