The course KIK-LG221 Command-Line Tools for Linguists is a regular intermediate-level undergraduate course at Uni Helsinki, worth 5 credits. Students are recommended to have previously taken the introductory course on language technology.
The course focusses on basic usage of Unix-like command-line tools, basic regular expressions, handling text files and different character encodings with the command line, handling text corpora using command-line tools, running and installing applications from the command line, writing basic scripts, working with remote servers, and using version-control tools.
This week we had a live lecture on how computers work, and learnt about Linux and Unix, about setting up a command-line environment, some basic Bash commands, how to quit applications, about command-line text editors, and about file formats and sizes.
I use Windows on my laptop, so in order to get a Unix-like command line up and running, I chose to install Windows Subsystem for Linux (WSL), and to try a few different distros for it, among these Debian, Ubuntu, Kali Linux, openSUSE, and Alpine. In the end, I settled on Debian, and have used it as my primary command-line environment ever since. During the course, I upgraded from Windows 10 to Windows 11, and from WSL1 to WSL2. I also learnt how to upgrade my Debian distro, and am now running Debian 12 Bookworm.
*(Image: Debian’s well-known swirl logo.)*
Among the basic shell commands we learnt this first week were:
Command | Usage
--- | ---
`ls` | Lists the contents of the current directory
`pwd` | Returns the current working directory
`whoami` | Returns the current user
`mv` | Moves (or renames) a file
`cat` | Concatenates files
`less` | Displays file contents with scrolling ability
`cp` | Copies a file
`rm` | Removes a file
`mkdir` | Creates a new directory
`cd` | Changes the current directory
We also learnt how to fetch contents from the internet using `wget`, and how to open and edit text files with the text editor `emacs`.
We also learnt various keys and key combinations for quitting applications: for example, `q` quits `less` and `top`, and Ctrl-x Ctrl-c quits `emacs`.
For example, if we, in Bash as the user ‘user’, wanted to create a new directory named ‘foo’, move into it, confirm where we are in the directory tree (i.e. print our current working directory), then move back up in the directory tree, and finally confirm who we are (i.e. print the current user), our terminal would look like this:
```
$ mkdir foo
$ cd foo
$ pwd
/home/user/foo
$ cd ..
$ whoami
user
```
This week we watched an interview with Tatu Ylönen, the creator of `ssh` and a PhD candidate at Uni Helsinki, and learnt more about Unix-like systems. In particular, we learnt how to copy, move, and remove directories, about the standard system directories in a Unix-like system, about permissions and privacy in Unix-like systems, and how to archive, extract, compress, and decompress files and directories using `tar` and `gzip`.
Furthermore, we learnt about processes and process management, and how to control processes from the command line.
Finally, we learnt how to work with remote servers using `ssh` and the related `scp`, working with the CSC server Puhti.
Among the shell commands we learnt this week were:
Command | Usage
--- | ---
`cp -R` | Copies recursively
`rm -R` | Removes recursively
`rmdir` | Removes an empty directory
`which` | Returns the pathname of the file a command would execute
`ls -F` | `ls` with indication of item type
`top` | Provides a real-time view of the system and its processes
`&` | Starts a process in the background
`fg` | Brings a process to the foreground
`ps` | Lists the current processes
`kill` | Sends a signal to a process, typically to end it
`ssh` | Connects securely to a remote server via SSH
`scp` | Copies files securely to or from a remote server via SSH
We also learnt the key combination Ctrl-z to suspend (stop) the currently running process.
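As a minimal sketch of how these job-control commands fit together (the job and process numbers are of course illustrative):

```
$ sleep 100 &      # start a long-running process in the background
[1] 1234
$ ps               # sleep now appears in the list of current processes
$ kill %1          # send the default TERM signal to job 1
[1]+  Terminated              sleep 100
```

Similarly, `fg` would bring the job to the foreground, where Ctrl-z would suspend it again.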
Understanding and using SSH and SCP was something I had wanted to learn for a long time, so I particularly appreciated these aspects of this week. I have since learnt a bit more, and I am now comfortable using RSA, ECDSA, and Ed25519 keys and SSH config and authorization files for logging onto and working with a variety of different remote servers in a variety of different ways.
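As an illustration of such a config file, a minimal `~/.ssh/config` entry might look like this (the host alias, server address, and key file are hypothetical examples, not from the course):

```
# ~/.ssh/config
Host myserver                       # shorthand: connect with 'ssh myserver'
    HostName server.example.com     # the actual address of the remote server
    User user                       # the remote user name
    IdentityFile ~/.ssh/id_ed25519  # the private key used for authentication
```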
For example, if we, in Bash, wanted to log onto the remote server Puhti as the user ‘user’ using SSH, we would give the command:
```
$ ssh user@puhti.csc.fi
```
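Similarly, to copy a local file to our home directory on the server using `scp`, we could give a command like the following (the file name is just an example):

```
$ scp text.txt user@puhti.csc.fi:~/
```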
This week we learnt about various fundamental tools for text processing in Unix-like environments. First, we learnt about different character encodings, in particular about ASCII, Latin-1, and UTF-8, and how to convert between these. We also learnt about the different line endings in Windows and Unix text files, and how to convert between these.
Furthermore, we learnt various useful commands for processing and searching text, about standard streams and their redirection, and the basics of regular expressions. Finally, we also learnt about structured text, such as tables in the form of CSV or TSV, and commands for processing such structured text.
Among the shell commands we learnt this week were:
Command | Usage
--- | ---
`file` | Returns information about a file, e.g. encoding and line endings
`dos2unix` | Converts CRLF line endings to LF line endings
`unix2dos` | Converts LF line endings to CRLF line endings
`iconv` | Converts text from one character encoding to another
`tr` | Translates, squeezes, or deletes characters
`sort` | Sorts lines in alphabetical or numerical order
`uniq` | Reports or omits repeated lines
`head` | Returns the beginning of a file
`tail` | Returns the end of a file
`wc` | Returns the number of lines, words, or bytes in a file
`cut` | Removes sections from each line
`grep` | Returns lines matching the given regular expression pattern
`egrep` | `grep` with extended regular expressions
I have always been interested in writing systems and their digital representation, and thus the themes of this week were particularly interesting to me. They were also particularly useful to me as an aspiring linguist and language technologist. I knew the fundamentals of regular expressions from before, but I appreciated the opportunity to revise them, as well as learning how to use them through `grep` and `egrep` in a command-line environment. I also appreciated learning about handling CSV and TSV files through the command line.
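For instance, to count the values in the second column of a tab-separated file, we could combine `cut`, `sort`, and `uniq` like this (the file `corpus.tsv` is a hypothetical example):

```
$ cut -f 2 corpus.tsv | sort | uniq -c | sort -nr
```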
For example, if we had the text file `text.txt` with CRLF line endings and Latin-1 encoding, and, in Bash, wanted to convert the line endings to LF, convert the encoding to UTF-8, and then tokenize the text, our terminal would look like this:
```
$ dos2unix text.txt
dos2unix: converting file text.txt to Unix format...
$ iconv -f latin1 -t utf-8 text.txt > text.utf8.txt
$ cat text.utf8.txt | tr -s "[:space:][:punct:]" "\n" > text.utf8.tok.txt
```
The option `-f latin1` in `iconv -f latin1 -t utf-8 text.txt > text.utf8.txt` indicates the encoding from which we want to convert, i.e. Latin-1, and `-t utf-8` the encoding to which we want to convert, i.e. UTF-8. The `>` redirects the output, and thus `> text.utf8.txt` redirects the result of the conversion into the file `text.utf8.txt`.
`cat text.utf8.txt` returns the text content of the text file, and the `|` pipes the output to the following command, i.e. to the translation command `tr -s "[:space:][:punct:]" "\n"`, in which `"[:space:][:punct:]"` indicates what we want to replace, i.e. the so-called POSIX classes of all whitespace and punctuation characters, and `"\n"` indicates what we want to translate these characters into, i.e. into newlines. The option `-s` squeezes repeats, i.e. drops repeated replacements, so that we always get just one newline, not multiple newlines. Finally, the output of the translation command is redirected into the file `text.utf8.tok.txt`.
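To see concretely what the translation step does, we can pipe a single sentence through the same `tr` command:

```
$ echo "Hello, world! Hello." | tr -s "[:space:][:punct:]" "\n"
Hello
world
Hello
```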
This week we learnt about more advanced tools for text processing in Unix-like environments. First, we learnt about the tool `sed`, or stream editor, a powerful command-line tool for quickly editing text files without opening them in an editor. It can be used for finding strings, replacing strings, insertion, deletion, and other similar tasks. It is commonly used with regular expressions, and thus we also revised and learnt more about regular expressions.
Finally, we also learnt more about combining simple commands into more complex pipelines in Unix-like environments. In particular, we learnt how to create more advanced pipelines for processing text. Using pipelines, we avoid generating unnecessary intermediate files.
Among the commands we learnt this week were:
Command | Usage
--- | ---
`sed` | Stream editor
`sed -n` | `sed` without default printing
`sed -E` | `sed` with extended regular expressions
`sed '//d'` | `sed` deletion: deletes the lines matching the pattern
`sed '//p'` | `sed` printing: prints the lines matching the pattern
`sed 's//'` | `sed` substitution
`sed 's//g'` | `sed` global substitution
The command `sed -n '/pattern/p'` would duplicate the function of `grep`, using the regular-expression pattern `pattern`.
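To see the equivalence concretely, we can run both commands on the same three lines of input:

```
$ printf "foo\nbar\nfoobar\n" | sed -n '/foo/p'
foo
foobar
$ printf "foo\nbar\nfoobar\n" | grep "foo"
foo
foobar
```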
For example, if we had the text file `text.txt` (with LF line endings) which we wanted to transform into one-sentence-per-line format, we could, in Bash, use the following command pipeline:
```
$ cat text.txt | sed 's/^$/#/' | tr '\n' ' ' | sed -E 's/([.?!]) ([[:upper:]])/\1# \2/g' | tr '#' '\n' | sed 's/^ *//' | sed 's/ *$//' | grep -v "^$" > text.sent.txt
```
`cat text.txt` returns the text content of the text file, and the `|` pipes the output to the following command. The stream-editor command `sed 's/^$/#/'` replaces empty lines (matched by the regular expression `^$`) with a placeholder character `#`. The following command `tr '\n' ' '` replaces newlines with a space. The following command `sed -E 's/([.?!]) ([[:upper:]])/\1# \2/g'` uses extended regular expressions to find final punctuation (a full stop, question mark, or exclamation mark), followed by a space, followed by an uppercase character, and uses capture groups to insert a placeholder `#` at the apparent end of the sentence. The following command `tr '#' '\n'` replaces the placeholders `#` with newlines. The following command `sed 's/^ *//'` removes whitespace at the beginning of a line and the following command `sed 's/ *$//'` at the end of a line. The final command `grep -v "^$"` inversely finds empty lines, and thus returns all non-empty lines, while the final `> text.sent.txt` redirects the output to the file `text.sent.txt`.
This week we learnt about Bash scripting as well as about Bash environment variables and configuration files. Scripting can be described as storing a command pipeline, possibly complex and branching, as a programme, a file which can be executed.
Specifically, we learnt the fundamentals of writing Bash scripts, about command-line variables and how to use them in scripts, about conditional execution with `if`, and about command substitution. Furthermore, we learnt about environment variables in Bash and how to alter the behaviour of the system by changing such variables. Finally, we learnt how to use Bash configuration files such as `.bashrc` or `.bash_profile` to effectively make changes to the environment permanent, by setting environment variables at login. This can, for example, be used to set a helpful custom prompt, or to set useful custom shortcut commands.
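As a sketch of such configuration, a few lines in `~/.bashrc` might look like this (the prompt and the alias are hypothetical examples, not from the course):

```
# Hypothetical additions to ~/.bashrc:
export PS1='\u@\h:\w\$ '   # a custom prompt showing user@host:directory
alias ll='ls -lF'          # a shortcut for a long listing with type indicators
```

After editing the file, the changes can be applied to the current session with `source ~/.bashrc`; at login they are applied automatically.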
Among the shell commands, statements, and variables we learnt this week were:
Command | Usage
--- | ---
`chmod u+x` | Grants the owner of a file execution rights
`printenv` | Returns all or part of the environment variables
`echo` | Returns a line of text
`export` | Exports a variable to the environment
`source` | Reads and executes commands from the specified file
`alias` | Creates an alias, i.e. a shortcut command
`if` | Begins a conditional statement
`$?` | Variable containing the exit status of the last command
`$#` | Variable containing the number of arguments given
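Command substitution, mentioned above, does not appear in the table, so here is a small sketch combining it with the `$?` variable (the file name and the pattern are illustrative):

```
today=$(date +%F)          # command substitution: capture the output of date
echo "Today is $today"
grep -q "foo" text.txt     # -q: quiet; only the exit status is set
if [ $? -eq 0 ]
then
    echo "text.txt contains 'foo'"
fi
```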
For example, if we wanted to write a simple Bash script for creating a word-frequency list based on a given text file, outputting the list to another file, we could write the following script:
```
#!/bin/bash
if [ $# -ne 2 ]
then
    echo "ERROR: Two command line arguments required."
    echo "$0 input_text_file output_freq_file"
    exit 1
fi
cat "$1" |
dos2unix |
tr -s "[:space:]" "\n" |
tr -d "[:punct:]" |
sort |
uniq -c |
sort -nr > "$2"
```
The very first line is a so-called shebang or hashbang, identifying what kind of script the script is. The subsequent `if` block tests that the user has given exactly two arguments (an input file and an output file), and aborts if the user has not. The final block is a pipeline taking the contents of the file that was the first argument, converting the line endings to LF, replacing all whitespace with newlines, deleting all punctuation, sorting the list, counting repeated occurrences, sorting the lines (now with the number of occurrences prepended) in reverse numerical order, and outputting to the file that was the second argument.
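Assuming we saved the script as `freqlist.sh`, we could make it executable (with `chmod u+x` from the table above) and run it like this (the file names are illustrative):

```
$ chmod u+x freqlist.sh
$ ./freqlist.sh text.txt text.freq.txt
```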
This week we learnt about becoming the superuser or root user, about installing and managing software, about Python packages and virtual environments, and about Makefiles.
We learnt that it is not recommended to stay logged in as the superuser for longer than absolutely necessary, which is why it is wise to use `sudo` for running individual commands requiring superuser privileges.
Furthermore, we learnt that software generally has complicated dependencies, which is why software tends to be managed with dedicated package managers, such as APT, which can be called with `apt` or `apt-get`. We also learnt some fundamentals about the programming language Python, and learnt how to use Python’s package manager Pip to manage Python packages, specifically. In connection with this, we also learnt about so-called virtual Python environments, in which package dependencies are handled separately from the rest of the system. I have since tried actively using both `venv` and `pipenv` virtual environments in other projects, generally preferring `venv`.
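As a minimal sketch of the `venv` workflow (the package `requests` is just an example):

```
$ python3 -m venv .venv           # create a virtual environment in .venv
$ source .venv/bin/activate       # activate it for the current shell session
(.venv) $ pip install requests    # packages are now installed into .venv only
(.venv) $ deactivate              # leave the virtual environment
```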
Finally, we learnt about Makefiles, a useful tool for building various projects in a command-line environment through automation of pipelines. The command used to build according to the Makefile is `make`.
Among the commands we learnt this week were:
Command | Usage
--- | ---
`sudo` | Executes a command as the superuser (or another user)
`su` | Runs an interactive shell as root (when no user is specified)
`passwd` | Changes a user’s password
`apt-get` | Calls the APT package manager
`locate` | Lists files in databases matching a pattern
`python` | Calls the Python interpreter
`pip` | Calls Python’s package manager Pip
`pip install` | Has Pip install a Python package
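For example, installing a package with APT could look like this (`tree` is just an example package, and `sudo` provides the required superuser privileges):

```
$ sudo apt-get update        # refresh the package lists
$ sudo apt-get install tree  # install the package and its dependencies
```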
This is an example of a Makefile which could be used in a project where we wanted to run the shell scripts `freqlist.sh` and `sent_per_line.sh` (located in the directory `bin`) on a possibly large number of text files (located in the directory `data`; in this example there are only the two files `foo.txt` and `bar.txt`), outputting to similarly named files in the directory `results`:
```
# Makefile
# NB: the recipe lines under each rule must be indented with a literal tab.

TEXTS = foo bar
RESULTDIR = results
DATADIR = data
BINDIR = bin

FREQLISTS = $(TEXTS:%=$(RESULTDIR)/%.freq.txt)
SENTEDTEXTS = $(TEXTS:%=$(RESULTDIR)/%.sent.txt)

.PHONY: all clean

all: $(FREQLISTS) $(SENTEDTEXTS)

$(RESULTDIR)/%.freq.txt: $(DATADIR)/%.txt
	$(BINDIR)/freqlist.sh $< $@

$(RESULTDIR)/%.sent.txt: $(DATADIR)/%.txt
	$(BINDIR)/sent_per_line.sh $< $@

clean:
	@echo "Cleaning up..."
	rm -f $(RESULTDIR)/*
```
To explain one part in detail, the `clean` block is a `make` rule for cleaning up what has been built, by (forcibly) removing all files in the `results` directory. When called, it first echoes, i.e. prints, the string ‘Cleaning up…’, and then calls the ordinary Bash command `rm -f`, expanding `$(RESULTDIR)` to `results`. The `clean` rule can be called from the command line with `make clean`.
This week we learnt about version or revision control, specifically with Git, a distributed version control system. We also learnt about GitHub, an online service for hosting Git repositories. We set up accounts on GitHub and created new Git repositories on GitHub. Furthermore, we learnt about configuring Git, about initializing and cloning Git repositories, about staging, committing and pushing, and about branches and merging. Finally, we also learnt about undoing common mistakes with Git.
I had personally used GitHub together with the GUI application GitHub Desktop before, but appreciated finally learning all the fundamentals about Git proper, and what is happening in the background, even when using a GUI. I now feel much more comfortable using Git and GitHub, and much better understand the concepts of cloning, pulling, staging, committing, pushing, merging, and pull-requesting.
Using Git and GitHub also made me revisit SSH keys, and led me to learn more about GPG keys, with which I now sign all my Git commits.
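For reference, telling Git to sign commits with a GPG key takes only a couple of configuration commands (the key ID is of course a placeholder):

```
$ git config --global user.signingkey <key-id>   # the GPG key to sign with
$ git config --global commit.gpgsign true        # sign all commits by default
```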
If we wanted to push our typo corrections to the remote branch, we would use the following sequence of commands in Bash:
```
$ git add --all
$ git commit -m "Correcting typos"
[master 1234567] Correcting typos
 1 file changed, 2 insertions(+), 2 deletions(-)
$ git pull
Authenticated to github.com...
Already up to date.
$ git push
Authenticated to github.com...
To github.com:user/repo.git
   1234568..1234567 master -> master
```
`git add --all` stages our changes. `git commit -m "Correcting typos"` creates a commit out of the staged changes with an explanatory message. `git pull` is, of course, the opposite of what we are about to do, but it is good practice to make sure we are up to date with the remote branch before pushing to it. `git push` finally pushes our commit to the remote branch (this assumes we are set up to track the correct remote branch).
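Branching and merging, which we also learnt about, follow a similar pattern; here is a minimal sketch (the branch name `fix-typos` is hypothetical):

```
$ git checkout -b fix-typos          # create and switch to a new branch
$ git add --all
$ git commit -m "Correcting typos"
$ git checkout master                # switch back to the main branch
$ git merge fix-typos                # merge the changes from fix-typos
```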
For the final assignment, we learnt how to build webpages using GitHub Pages, and thus learnt about Markdown, about the Ruby gems Bundler and Jekyll, and also how to create a LaTeX-based CV using Overleaf.
The final assignment also involved reviewing what we had learnt during the course, and plenty of practice with Git and GitHub. I particularly appreciated the opportunities to learn how to create pull requests on GitHub, to learn more about Ruby and Ruby gems, and to work with new Overleaf LaTeX templates.
If we wanted to create a table with Markdown, we could use code like this:
```
Left-aligned | Centered | Right-aligned
:--- | :---: | ---:
Foo | Foo | Foo
Bar | Bar | Bar
```
The colon in `:---` marks alignment, and at least three hyphens are needed in the separator row between the header and the table body. Column separators are marked by `|`.