theCodingInterface

Overview of Common Linux Text Utility Programs

By Adam McQuistan in Linux 04/24/2020

Introduction

This article reviews common text manipulation utility programs available within the Linux operating system. Each discussion is paired with simple examples that demonstrate the usefulness of the utility to Linux users.

File Viewing Utilities: cat more less

To aid in demonstration, I retrieved a plain-text copy of the famous book A Tale of Two Cities by Charles Dickens from Project Gutenberg. I also saved the first two paragraphs into separate files, as shown in the directory listing below.

$ ls -l
total 796
-rw-rw-r-- 1 tci tci    615 Apr 22 19:04 first-paragraph.txt
-rw-rw-r-- 1 tci tci    333 Apr 22 19:05 second-paragraph.txt
-rw-rw-r-- 1 tci tci 804335 Mar 19  2018 tail-of-two-cities.txt

I will begin with the cat command, a simple utility that concatenates the files passed to it and writes their contents to standard output.

$ cat first-paragraph.txt 
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

A useful enhancement to the cat command is the -n option, which adds line numbers to the output.

$ cat -n first-paragraph.txt 
 1	It was the best of times,
 2	it was the worst of times,
 3	it was the age of wisdom,
 4	it was the age of foolishness,
 5	it was the epoch of belief,
 6	it was the epoch of incredulity,
 7	it was the season of Light,
 8	it was the season of Darkness,
 9	it was the spring of hope,
10	it was the winter of despair,
11	we had everything before us,
12	we had nothing before us,
13	we were all going direct to Heaven,
14	we were all going direct the other way--
15	in short, the period was so far like the present period, that some of
16	its noisiest authorities insisted on its being received, for good or for
17	evil, in the superlative degree of comparison only.

To combine and output multiple files, specify a space-separated list of file names. For example, combining the first and second paragraph files is as simple as the following.

$ cat first-paragraph.txt second-paragraph.txt 
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.
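As a quick self-contained check of the concatenation behaviour described above, here is a minimal sketch. The file names a.txt and b.txt are hypothetical stand-ins created in a scratch directory, not files from this article.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small sample files (hypothetical names, for illustration only).
printf 'line one\nline two\n' > a.txt
printf 'line three\n' > b.txt

# cat writes the files to standard output in the order they are listed.
combined=$(cat a.txt b.txt)
echo "$combined"

# With -n the numbering continues across files rather than restarting.
numbered=$(cat -n a.txt b.txt)
echo "$numbered"
```

Swapping the order of the file names swaps the order of the output, which is all that "combining" means here.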

The cat command works well for viewing relatively small files, but when inspecting larger files that span several screens of content you are better off with either the more or less command.

The more command is followed by the name of a file. The output fills your screen one page at a time, and you can move down a full page with the space bar or a single line with the Enter key.

$ more tail-of-two-cities.txt

This results in the output shown below.

The Project Gutenberg EBook of A Tale of Two Cities, by Charles Dickens

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: A Tale of Two Cities
       A Story of the French Revolution

Author: Charles Dickens

Release Date: January, 1994 [EBook #98]
Posting Date: November 28, 2009
Last Updated: March 4, 2018

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***


Produced by Judith Boss


A TALE OF TWO CITIES

A STORY OF THE FRENCH REVOLUTION

By Charles Dickens


CONTENTS

     Book the First--Recalled to Life
--More--(0%)

The less command is used the same way: follow it with the name of a file to inspect. It is actually more feature rich than more because it allows scrolling both up and down a file using the arrow keys. You can still move down a page with the space bar or down a line with the Enter key. Both programs also let you search for text within a document. Type / followed by the text you want to find and press Enter to jump to the next match; pressing n repeats the search for subsequent matches. In less, replacing the / with a ? searches backwards.

Combining Commands with Pipes

The Linux shell has a wonderfully powerful construct known as a pipe, written with the | character, which connects utility programs by sending one program's standard output to the next program's standard input. For example, I can use a pipe to send the output of cat on the full tail-of-two-cities.txt document to the less command for better navigation like so.

$ cat tail-of-two-cities.txt | less

This is obviously a pretty strange thing to do, since I could have passed tail-of-two-cities.txt directly to less to accomplish the same thing, but the power of pipes will become more evident in later examples.
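Pipes earn their keep when the data comes from another command rather than a file. The sketch below is one hedged illustration: it pipes the output of ls through other programs. The file names are hypothetical, and wc and head are standard utilities not otherwise covered in this article.

```shell
# Scratch directory with three empty, hypothetically named files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch one.txt two.txt three.txt

# ls lists the directory; wc -l counts the lines it receives on stdin.
# tr strips any padding that some wc implementations add.
file_count=$(ls | wc -l | tr -d ' ')
echo "$file_count"

# Three programs chained: list, sort in reverse, keep only the first line.
first_reverse=$(ls | sort -r | head -n 1)
echo "$first_reverse"
```

Each program only sees a stream of text, so any command that writes lines can feed any command that reads them.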

Redirecting Command Output to Files

Similar in concept to a pipe capturing the output of a prior command, the > and >> redirection operators, followed by the name of a file, send a command's output to that file. The > operator creates the file, or overwrites it if it already exists, while the >> operator appends to an existing file (creating it if it does not exist).

For example, to create a file named first-two-paragraphs.txt, I pass first-paragraph.txt and second-paragraph.txt to cat and redirect the combined output to the new file.

$ cat first-paragraph.txt second-paragraph.txt > first-two-paragraphs.txt
$ cat first-two-paragraphs.txt 
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of Light,
it was the season of Darkness,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

There were a king with a large jaw and a queen with a plain face, on the
throne of England; there were a king with a large jaw and a queen with
a fair face, on the throne of France. In both countries it was clearer
than crystal to the lords of the State preserves of loaves and fishes,
that things in general were settled for ever.
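The difference between the two redirection operators is easiest to see side by side. This is a minimal sketch using a hypothetical notes.txt in a scratch directory, and it assumes the shell's noclobber option is off (the default).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# > creates notes.txt, or empties it first if it already exists.
echo 'first' > notes.txt
echo 'second' > notes.txt      # overwrites: the file now holds only "second"
after_overwrite=$(cat notes.txt)

# >> keeps the existing contents and adds the new output to the end.
echo 'third' >> notes.txt
after_append=$(cat notes.txt)

echo "$after_overwrite"
echo "$after_append"
```

Because > silently discards whatever was in the file, >> is the safer choice when adding to a file you want to keep.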

Manipulating File Contents: sort uniq cut

In this section I shift gears to focus on programs that manipulate text, as opposed to the simple viewing utilities discussed thus far. Again, to aid the discussion I have created another text file, which contains fictitious test scores for a group of people over the years 2018 and 2019, as shown below.

$ cat test-scores.txt 
Year	Name	Score
2018	Jim	90
2018	Sally	98
2018	Bob	76
2018	Tim	87
2019	Jim	95
2019	Sally	98
2019	Bob	88
2019	Tim	92

The first command I would like to introduce is cut, which is commonly used to extract specific fields (columns), bytes, or characters from each line of text. It can pull out the list of people in the second column of test-scores.txt using the -f2 option, which selects the second field, like so.

$ cat test-scores.txt | cut -f2
Name
Jim
Sally
Bob
Tim
Jim
Sally
Bob
Tim
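Two related points can be sketched quickly: cut can read a file directly without cat, and the -c option selects character positions rather than fields. The scores.txt file below is a hypothetical, shortened stand-in for test-scores.txt.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate a small tab-delimited sample resembling test-scores.txt.
printf 'Year\tName\tScore\n2018\tJim\t90\n2018\tSally\t98\n' > scores.txt

# cut accepts a file name, so piping from cat is optional.
names=$(cut -f2 scores.txt)

# -c selects character positions: 1-4 covers the four-character year column.
years=$(cut -c1-4 scores.txt)

echo "$names"
echo "$years"
```

Selecting by character position only works when the column has a fixed width, as the year does here; fields are the more robust choice for variable-width data.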

By default, fields are assumed to be delimited by tabs, which is the case in my test-scores.txt example file. GNU cut also accepts the --output-delimiter option, which sets the delimiter written between the selected fields. Using it, I can issue the following to transform my test-scores.txt file from tab delimited to a comma-delimited file named test-scores.csv.

$ cat test-scores.txt | cut -f1-3 --output-delimiter=',' > test-scores.csv
$ cat test-scores.csv 
Year,Name,Score
2018,Jim,90
2018,Sally,98
2018,Bob,76
2018,Tim,87
2019,Jim,95
2019,Sally,98
2019,Bob,88
2019,Tim,92

The commands above read the test-scores.txt file, select all three fields, replace the tab delimiter with a comma, and redirect the result to a new file. With a CSV version of the data I can show how the -d option specifies a comma as the input delimiter to again select just the names.

$ cat test-scores.csv | cut -f2 -d ','
Name
Jim
Sally
Bob
Tim
Jim
Sally
Bob
Tim
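Two small variations on the delimiter handling above are sketched here with hypothetical sample files. The first uses tr, a standard utility not covered in this article, as an alternative way to turn tabs into commas when --output-delimiter is unavailable. The second selects several comma-delimited fields at once.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Year\tName\tScore\n2018\tJim\t90\n2019\tSally\t98\n' > scores.txt

# tr translates every tab character on stdin into a comma.
tr '\t' ',' < scores.txt > scores.csv
csv=$(cat scores.csv)

# -f1,3 keeps fields 1 and 3; the output reuses the -d delimiter.
year_score=$(cut -d ',' -f1,3 scores.csv)

# -f2- keeps field 2 through the last field.
name_onward=$(cut -d ',' -f2- scores.csv)

echo "$csv"
echo "$year_score"
echo "$name_onward"
```

Note that neither approach understands quoted CSV values containing commas, so this only suits simple data like the example file.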

Building on this example I can add in the sort command to sort the list of names like so.

$ cat test-scores.csv | cut -f2 -d ',' | sort
Bob
Bob
Jim
Jim
Name
Sally
Sally
Tim
Tim

The sort command has a number of useful options, but probably the most used is -r for sorting in reverse order.

$ cat test-scores.csv | cut -f2 -d ',' | sort -r
Tim
Tim
Sally
Sally
Name
Jim
Jim
Bob
Bob
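A few of the other sort options combine nicely for ranking rows by a numeric column. The following is a sketch under assumptions: scores.csv is a hypothetical, shortened copy of the CSV data, and tail -n +2 (a standard utility not covered in this article) is used to skip the header line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Year,Name,Score\n2018,Jim,90\n2018,Sally,98\n2018,Bob,76\n2018,Tim,87\n' > scores.csv

# tail -n +2 starts output at line 2, so the header is not sorted with the data.
# -t ',' sets the field separator, -k3,3 sorts on the third field only,
# -n compares numerically, and -r puts the highest score first.
ranked=$(tail -n +2 scores.csv | sort -t ',' -k3,3 -n -r)
echo "$ranked"
```

Without -n the scores would be compared as text, which gives the wrong order as soon as the values differ in length (for example 100 and 90).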

A natural next manipulation would be to remove the duplicates from the list of names with the uniq command.

$ cat test-scores.csv | cut -f2 -d ',' | sort | uniq
Bob
Jim
Name
Sally
Tim

It's worth noting that the uniq command only removes duplicates that occur on adjacent lines, which is why the list is sorted first.
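The adjacency rule is easy to confirm with a small experiment. In this sketch names.txt is a hypothetical file whose duplicates are deliberately not next to each other.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'Bob\nJim\nBob\nJim\n' > names.txt

# No two identical lines are adjacent, so uniq removes nothing.
unsorted=$(uniq names.txt)

# Sorting first brings duplicates together so uniq can collapse them.
deduped=$(sort names.txt | uniq)

# sort -u does the sort and the de-duplication in one step.
deduped_u=$(sort -u names.txt)

# uniq -c prefixes each surviving line with how many times it occurred.
counts=$(sort names.txt | uniq -c)

echo "$unsorted"
echo "$deduped"
echo "$counts"
```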

The All Powerful grep Utility and Regular Expressions

In this final section I introduce the grep text utility. The grep program searches text and prints the lines that match a pattern, and it is especially useful when paired with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).

The basic syntax is the grep command followed by a pattern to match and then one or more files to search. For example, if I search the first-paragraph.txt file from A Tale of Two Cities for the word 'epoch' I get the following lines in the output.

$ grep 'epoch' first-paragraph.txt 
it was the epoch of belief,
it was the epoch of incredulity,

Similar to the other commands I've covered, grep can be used in conjunction with the pipe operator to feed the result of an earlier command into the grep program. In the previous example I could have used the cat command to read the first-paragraph.txt file and pipe its output into grep then search for the word epoch like so.

$ cat first-paragraph.txt | grep 'epoch'
it was the epoch of belief,
it was the epoch of incredulity,
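Beyond printing matching lines, grep has a few standard options that change what it reports. Here is a brief sketch using a hypothetical sample.txt built from three lines of the first paragraph.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'It was the best of times,\nit was the epoch of belief,\nit was the epoch of incredulity,\n' > sample.txt

# -c prints the number of matching lines instead of the lines themselves.
epoch_count=$(grep -c 'epoch' sample.txt)

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -n 'epoch' sample.txt)

# -i ignores case, so 'it' matches both "It" and "it".
any_case=$(grep -c -i 'it was' sample.txt)

echo "$epoch_count"
echo "$numbered"
echo "$any_case"
```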

As mentioned previously, the grep program supports regular expression pattern matching, which significantly enhances the power and flexibility of its searching capabilities. Regular expressions are a large and fairly involved topic, so my goal is only to scratch the surface enough for the reader to feel comfortable exploring the topic further at a later date.

The first regular expression topic to cover is the pair of anchor symbols ^ and $. The ^ symbol matches at the start of a line, so a pattern beginning with ^ only matches lines that start with that pattern. The $ symbol matches at the end of a line, so a pattern ending with $ only matches lines that end with it.

For example, to find all lines in the first-paragraph.txt file starting with 'we' I use the following.

$ grep '^we' first-paragraph.txt 
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--

Conversely, to find all lines ending with 'us,' I use the following.

$ grep 'us,$' first-paragraph.txt 
we had everything before us,
we had nothing before us,
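To see that the anchors constrain position rather than content, the sketch below uses a hypothetical sample.txt in which 'we' and 'us,' appear at different places, plus one empty line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'we had everything before us,\n\nus, we were there\nwe were all going direct the other way--\n' > sample.txt

# ^we: only lines where "we" is at the very start.
starts_we=$(grep '^we' sample.txt)

# us,$: only lines where "us," is at the very end.
ends_us=$(grep 'us,$' sample.txt)

# ^$ with nothing in between matches empty lines; -c counts them.
blank_lines=$(grep -c '^$' sample.txt)

echo "$starts_we"
echo "$ends_us"
echo "$blank_lines"
```

The line "us, we were there" contains both 'we' and 'us,' yet matches neither anchored pattern, because neither sits at the required end of the line.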

Another symbol that is heavily used in regular expressions is the wildcard '.', which matches any single character. This is especially useful when combined with a quantifier such as *, which matches the preceding expression zero or more times. A demonstration will likely help here, so if I wanted to find all lines that start with 'we' and end with a comma ',' I could use the following.

$ grep '^we.*,$' first-paragraph.txt 
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,

The above syntax of '^we.*,$' means search for lines starting with 'we', followed by any character '.' occurring zero or more times '*', and ending with a comma ',$'.
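The "zero or more" part is worth testing on an edge case. In this sketch the hypothetical sample.txt includes the line "we," where nothing at all sits between 'we' and the comma.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'we had nothing before us,\nwe were all going direct the other way--\nwe,\nit was the age of wisdom,\n' > sample.txt

# .* may match an empty string, so the bare "we," line is included.
zero_or_more=$(grep '^we.*,$' sample.txt)

# Adding one extra . requires at least one character before the comma.
one_or_more=$(grep '^we..*,$' sample.txt)

echo "$zero_or_more"
echo "$one_or_more"
```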

The last thing I want to cover in the context of regular expressions is the bracket expression. Brackets [...] match any single character from the set or range they enclose, and can be followed by a quantifier to match a run of such characters. If I wanted to search first-paragraph.txt for all lines that start with an uppercase letter I could use the pattern ^[A-Z] as shown below.

$ grep '^[A-Z]' first-paragraph.txt 
It was the best of times,
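A few more bracket expression forms are sketched below on a hypothetical sample.txt. Setting LC_ALL=C is an assumption made here so that ranges such as A-Z follow plain ASCII order regardless of the system locale; the POSIX class [[:upper:]] is shown as an alternative to the range.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'It was the best of times,\nit was the worst of times,\n2018 was a year\n' > sample.txt

# A range inside brackets matches one character from that range.
upper=$(LC_ALL=C grep '^[A-Z]' sample.txt)

# [[:upper:]] is a named class for uppercase letters.
upper_class=$(grep '^[[:upper:]]' sample.txt)

# A leading ^ inside the brackets negates the set.
not_upper=$(LC_ALL=C grep '^[^A-Z]' sample.txt)

# Digits work the same way as letters.
digit=$(LC_ALL=C grep '^[0-9]' sample.txt)

echo "$upper"
echo "$not_upper"
echo "$digit"
```

Note the two meanings of ^ here: outside the brackets it anchors to the start of the line, while as the first character inside them it means "anything except".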

Now let's say I wanted to find all lines that start with a word of exactly three lowercase letters. To accomplish this I would start with the [a-z] range, append the quantifier \{3\}, and end with \b, the word boundary symbol (a GNU grep extension), which requires that the three letters are not followed by another word character.

$ grep '^[a-z]\{3\}\b' first-paragraph.txt 
its noisiest authorities insisted on its being received, for good or for

Note the use of lowercase a-z as compared to the previous uppercase range specifier A-Z.

Similarly, if I wanted to find all lines in first-paragraph.txt that start with a word of three or more lowercase letters I could use '^[a-z]\{3,\}\b' like so. Leaving the slot after the comma empty means there is no upper limit on the number of occurrences.

$ grep '^[a-z]\{3,\}\b' first-paragraph.txt 
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.
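The backslashes around the braces belong to grep's default basic regular expression syntax. With the -E option (extended syntax) the same quantifiers are written without them. The sketch below assumes GNU grep, since \b is a GNU extension, and uses a hypothetical sample.txt made of three lines from the first paragraph.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'it was the age of wisdom,\nits noisiest authorities insisted on its being received, for good or for\nevil, in the superlative degree of comparison only.\n' > sample.txt

# Basic syntax: braces are escaped.
bre=$(LC_ALL=C grep '^[a-z]\{3,\}\b' sample.txt)

# Extended syntax (-E): the same pattern with plain braces.
ere=$(LC_ALL=C grep -E '^[a-z]{3,}\b' sample.txt)

# {3} means exactly three; {2,3} means two or three.
exactly_three=$(LC_ALL=C grep -E '^[a-z]{3}\b' sample.txt)
two_or_three=$(LC_ALL=C grep -E '^[a-z]{2,3}\b' sample.txt)

echo "$ere"
echo "$exactly_three"
echo "$two_or_three"
```

Which syntax to use is mostly a matter of taste; -E tends to be easier to read once a pattern has several quantifiers.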

Resources For Learning More

Unix and Linux System Administration Handbook (5th Edition) is my go-to resource for both in-depth Linux topics and short, to-the-point refreshers, and it is always within arm's reach of my desk.
Linux Pocket Guide: Essential Commands is an abbreviated, low-cost, and useful resource that is great for a quick overview of the many useful command line utility programs in Linux.

thecodinginterface.com earns commission from sales of linked products such as the books above. This enables providing continued free tutorials and content, so thank you for supporting the authors of these resources as well as thecodinginterface.com.

Conclusion

In this tutorial I have covered around a half dozen common text viewing and manipulation utility programs used by Linux users. The use cases for these programs are vast, and once the pipe operator is used to string multiple operations together the possibilities grow even further. Through several examples and simple explanations, I hope readers feel comfortable performing similar tasks and investigating these tools further.

As always, thanks for reading and don't be shy about commenting or critiquing below.
