Terminal Commands for Big Data Workflows & GitHub

Bash

Whether you’re processing large datasets on remote servers or fixing bugs in a branch, command-line skills are essential for working efficiently. Check out the commands that I’ve found useful for processing terabytes of Arctic data for visualization on the Permafrost Discovery Gateway.

Juliet Cohen
2023-01-22
Imagery Viewer with data layers on the Permafrost Discovery Gateway

The Command Line

Although the terminal lacks a friendly GUI, typing a command is faster than clicking through your IDE’s file directory, and it’s highly useful for working with large datasets across machines. It’s unforgiving about capitalization and typos, and most commands can’t be undone, but it’s so worth it.

The following are my favorite terminal commands for branching repositories, counting files in large data directories, running scripts in the background, and more.

General Commands & tmux

These general commands navigate directories and shuffle them around. Modifying the structure of a large dataset can overwhelm an IDE, so it’s best done in the terminal.

Additionally, running commands with tmux allows processes to continue running in the background, even if the connection to the server is lost from VS Code. It also allows multiple processes to run simultaneously in different tmux sessions, so you can transfer a directory in one terminal while running scripts in another, or just close your laptop while a script runs with no concern about your laptop falling asleep.
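As a rough sketch of that workflow, the snippet below starts a named tmux session that is detached from the start, lists the active sessions, and then kills it. The session name demo_job and the sleep command are placeholders I made up for illustration; swap in your own script. It assumes tmux is installed and simply reports if it isn't.

```shell
#!/usr/bin/env bash
# Hypothetical names for illustration only.
session_name="demo_job"
job_command="sleep 30"   # stand-in for a long-running script

if command -v tmux >/dev/null 2>&1; then
    # -d starts the session detached (in the background), -s gives it a name
    tmux new-session -d -s "$session_name" "$job_command"
    # the new session appears in the list of active sessions
    tmux ls
    # stop the session (and the command inside it) once it is no longer needed
    tmux kill-session -t "$session_name"
    tmux_status="ran"
else
    echo "tmux is not installed on this machine"
    tmux_status="skipped"
fi
```

Naming the session with -s means you can re-attach later with tmux a -t demo_job instead of looking up a numeric session ID.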

Use                                                         Command
list files and directories vertically, with date modified   ls -l
list all files and directories, including hidden ones       ls -a
remove directory and all contents recursively               rm -r {DIRECTORY}
move file or directory                                      mv file_name new_path/
rename file or directory in current directory               mv file_name new_name
count files and directories in the current directory        ls -1 | wc -l (note the digit 1 in ls -1 versus the letter l in wc -l)
count all files, recursively, from the current directory    find . -type f | wc -l
count files with a certain extension, recursively           find . -type f -name "*.{EXTENSION}" | wc -l
check total data storage in documents directory             ds documents (first install Rust with: curl https://sh.rustup.rs -sSf | sh, then install the package with: cargo install dirstat-rs)
create symbolic link to folder                              ln -s /path/to/folder {LINK NAME}
create new tmux session                                     tmux
detach from tmux session so it runs in the background       ctrl + b, then d
attach to a specific tmux session                           tmux a -t {SESSION ID}
list all active tmux sessions                               tmux ls
kill tmux session                                           tmux kill-session -t {SESSION ID}

Count the files directly inside the current directory and inside each of its immediate subdirectories (run all as one command): find . -maxdepth 1 -type d -print0 | sort -z | while IFS= read -r -d '' dir; do n=$(find "$dir" -maxdepth 1 -type f | wc -l); printf "%4d : %s\n" "$n" "$dir"; done
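To see what the counting commands report, here is a small sketch that builds a throwaway directory tree and runs them against it. The directory layout and file names are invented for the demo and everything lives in a temporary directory, so nothing on your machine is touched.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a small scratch tree (hypothetical file names, for illustration only)
scratch_dir="$(mktemp -d)"
mkdir -p "$scratch_dir/a" "$scratch_dir/b"
touch "$scratch_dir/readme.txt"
touch "$scratch_dir/a/1.feather" "$scratch_dir/a/2.feather"
touch "$scratch_dir/b/3.feather" "$scratch_dir/b/notes.txt"
cd "$scratch_dir"

# Entries (files and directories) at the top level only
top_level_entries=$(ls -1 | wc -l)
# All files at any depth
total_files=$(find . -type f | wc -l)
# Only files with a given extension, at any depth
feather_files=$(find . -type f -name "*.feather" | wc -l)

echo "top-level entries: $top_level_entries"
echo "all files:         $total_files"
echo "feather files:     $feather_files"

# Files directly inside the current directory and each immediate subdirectory
per_dir_counts=$(find . -maxdepth 1 -type d -print0 | sort -z |
    while IFS= read -r -d '' dir; do
        n=$(find "$dir" -maxdepth 1 -type f | wc -l)
        printf "%4d : %s\n" "$n" "$dir"
    done)
echo "$per_dir_counts"

# Clean up the scratch tree
cd - >/dev/null
rm -r "$scratch_dir"
```

With this layout there are 3 top-level entries, 5 files in total, and 3 feather files, and the per-directory loop reports 1 file in ., 2 in ./a, and 2 in ./b.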

GitHub

Use                                                   Command
avoid re-entering GitHub credentials on server        git config --global credential.helper "cache --timeout=100000000"
switch into branch develop                            git checkout develop
push to branch develop                                git add {FILES}, then: git commit -m "{MESSAGE}", then: git push origin develop
create new branch                                     git checkout -b {NewBranchName}
create new branch from the develop branch             git checkout develop, then: git pull, then: git checkout -b {NewBranchName} develop
print all branches in repo                            git branch -a
check current branch and uncommitted changes          git status
check recent commits                                  git log
merge develop branch into main                        push changes from develop, then: git checkout main, then: git merge develop
display repo’s branching & commit history as a tree   git log --graph
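The branching commands above can be tried safely in a throwaway repository. The sketch below is only an illustration: the branch name fix-typo, the file notes.txt, and the demo user identity are all made up, and because the repo has no remote, the git pull and git push steps are left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a scratch repository (no remote, so nothing is pushed anywhere)
repo_dir="$(mktemp -d)"
cd "$repo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch main
git config user.name "Demo User"           # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Initial commit on main"

# Create develop, then a working branch that starts from develop
git checkout -q -b develop
git checkout -q -b fix-typo develop
echo "second line" >> notes.txt
git add notes.txt
git commit -q -m "Add second line"

# Merge the working branch into develop, then develop into main
git checkout -q develop
git merge -q fix-typo
git checkout -q main
git merge -q develop

current_branch="$(git rev-parse --abbrev-ref HEAD)"
commit_count="$(git rev-list --count HEAD)"

git branch -a
git --no-pager log --graph --oneline
```

Both merges here are fast-forwards, so main ends up with the same two commits as fix-typo and no separate merge commit.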

File & Directory Transfers

Options: scp and rsync

Examples

  1. Use scp to copy a file or directory from a local machine to a remote machine (add -r after scp when copying a directory)

scp /path/to/local/file/or/directory/ username@server.host.ucsb.edu:/path/to/destination

  2. Copy all feather files from the current directory to an account on the “Taylor” server

scp ./*.feather jscohen@taylor.bren.ucsb.edu:/Users/jscohen/data_features

  3. Use rsync to copy a directory to another location on the same local machine

rsync -av /path/to/source/directory /path/to/destination/directory

Note: -av combines two options. -v (--verbose) makes rsync report its progress throughout the transfer, and -a (--archive) is shorthand for the options -rlptgoD, which stand for:

Option              Meaning
-r, --recursive     recurse into directories
-l, --links         copy symlinks as symlinks
-p, --perms         preserve permissions
-t, --times         preserve modification times
-g, --group         preserve group
-o, --owner         preserve owner
-D                  same as --devices --specials: preserve device files and special files such as named sockets and fifos (pipes)

rsync also accepts options that give finer control over a transfer, such as:
--exclude to omit certain files from the transfer
--update to skip files that are newer in the destination location
--remove-source-files to delete the files from the source directory after they are transferred to destination
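Here is a small sketch of how those options combine in practice, run against a scratch directory. The folder names and the .tmp extension are placeholders I chose for the demo. It assumes rsync is installed and skips the transfer if it isn't.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a scratch source and destination (hypothetical names, for illustration only)
work_dir="$(mktemp -d)"
mkdir -p "$work_dir/source/tiles" "$work_dir/destination"
touch "$work_dir/source/tiles/a.feather" "$work_dir/source/tiles/b.feather"
touch "$work_dir/source/tiles/scratch.tmp"

if command -v rsync >/dev/null 2>&1; then
    # -a preserves permissions, times, and so on; -v prints each file as it transfers
    # --exclude skips anything matching the pattern
    # --update skips files that are newer in the destination
    # The trailing slash on source/ copies its contents rather than the folder itself
    rsync -av --exclude "*.tmp" --update "$work_dir/source/" "$work_dir/destination/"
    rsync_status="ran"
else
    echo "rsync is not installed on this machine"
    rsync_status="skipped"
fi

# Count what arrived in the destination
copied_files=$(find "$work_dir/destination" -type f | wc -l)
echo "files in destination: $copied_files"
```

Only the two feather files arrive in the destination. Adding --remove-source-files to the same command would turn the copy into a move, so I'd test it on scratch data like this first.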