Unix command line tricks for Linux, Mac, and Windows: grep


The Unix operating system was the granddaddy of Linux and OS X, and most of its commands for exploring and pulling information out of text files are available on those systems and, with a little setup, on Windows too. This is part of a series looking at some of the most useful Unix command line tools. As data science keeps growing, these old-school tools are proving more valuable than ever for dealing with the many kinds of data files that show up.

Today, we'll look at grep, which searches through files.

On Windows 10, you'll want to install the bash command line, and for older versions of Windows, install Cygwin; on OS X and Linux you'll want the Terminal app. (In the old days, before computer screens had windows, that terminal was the whole interface!)

Imagine that we have a text file called sampledata.csv exported as a CSV file from a spreadsheet program like Excel or LibreOffice Calc, and it looks like this:

Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643

When I tell grep to search that file for a certain string of characters,

grep ker sampledata.csv

it outputs all the lines with that string:

1003,Baker,Debbie,03/23/2017,x8834

grep has many, many command line options to customize its behavior. For example, adding -i tells it to ignore the difference between upper and lower case when searching:

grep -i ker sampledata.csv

This gives us the line with "ker" and also the one with "Ker":

1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643

The -c switch tells grep to count how many lines have the string:

grep -c 2017 sampledata.csv

Instead of outputting any lines from the file it searched, this command shows us the number of lines in which it found the string:

3
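One thing worth knowing is that -c counts matching lines, not individual matches. Here is a small sketch of the difference, assuming your grep supports the -o option (GNU and BSD grep both do, but it is not guaranteed everywhere). It recreates the sample file in the current directory, overwriting any existing sampledata.csv, and the variable names are just for illustration:

```shell
# Recreate the article's sample file so this sketch runs on its own.
cat > sampledata.csv <<'EOF'
Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643
EOF

# -c counts matching lines: a row counts once no matter how many 0s it has.
line_count=$(grep -c 0 sampledata.csv)

# -o prints every match on its own line, so wc -l counts the matches themselves.
# tr strips the padding that some versions of wc add around the number.
match_count=$(grep -o 0 sampledata.csv | wc -l | tr -d ' ')

echo "lines containing a 0: $line_count"
echo "total number of 0s:   $match_count"
```

With the sample data, the first number is 4 (every row except the header) and the second is 18.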

The -v switch tells it to "invert the search": in other words, to show the lines that do not have that string. In the next command, we are asking grep for all the lines that do not have 2017 in them,

grep -v 2017 sampledata.csv

and here is the result:

Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
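Notice that the header row shows up too, because it has no 2017 in it either. If you only want data rows, one possible approach is to pipe the output through a second grep -v. This sketch assumes the word "Employee" appears only in the header, which is true for this sample file but might not be for yours:

```shell
# Recreate the article's sample file so this sketch runs on its own.
cat > sampledata.csv <<'EOF'
Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643
EOF

# The first grep drops every row containing 2017; the second drops the
# header row, the only remaining line containing the word "Employee".
result=$(grep -v 2017 sampledata.csv | grep -v Employee)

echo "$result"
```

That leaves just the row for employee 1001.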

The name "grep" means "globally search a regular expression and print." (In the earliest days of computers, a terminal was not a screen but a printer where all of your scrolled-up output got noisily printed. The word "print" is still used in most programming languages today to make something appear on the screen.) Regular expressions use special syntax to indicate patterns to look for. In a regular expression, a period means "any character at all," so the following command searches for lines with a date that has 03 before the first slash, 2017 after the second one, and any two characters between the slashes:

grep 03/../2017 sampledata.csv

Here is the result:

1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
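Because a period matches anything, that pattern would also accept a date with letters between the slashes. A slightly stricter sketch uses [0-9], which matches a single digit, and puts the pattern in single quotes so the shell leaves the brackets alone. The extra row and the file name withbadrow.csv below are made up for illustration and are not part of the sample data:

```shell
# Recreate the article's sample file so this sketch runs on its own.
cat > sampledata.csv <<'EOF'
Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643
EOF

# Hypothetical malformed row, appended to a copy to show the difference.
cp sampledata.csv withbadrow.csv
echo '1005,Doe,Pat,03/xx/2017,x0000' >> withbadrow.csv

# Two periods accept any two characters, including the "xx".
loose=$(grep -c '03/../2017' withbadrow.csv)

# [0-9][0-9] accepts only two digits, so the malformed row is skipped.
strict=$(grep -c '03/[0-9][0-9]/2017' withbadrow.csv)

echo "loose pattern matched $loose lines, strict pattern matched $strict"
```

The loose pattern matches 3 lines here and the strict one matches 2.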

(If you really want to look for a period in the file that grep is searching, put a backslash before it to "escape" it, and wrap the pattern in single quotes so that the shell passes the backslash through to grep instead of swallowing it.) Regular expressions can get much fancier than this, and entire books have been written about them.
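To see escaping in action, here is a minimal sketch. The sample data has no periods in it, so this uses a made-up two-line file called versions.txt. It also shows the -F option, which tells grep to treat the whole pattern as plain text rather than as a regular expression:

```shell
# Hypothetical two-line file, not part of the article's sample data.
printf 'v1.5\nv105\n' > versions.txt

# Unescaped: the period matches any character, so both lines match.
unescaped=$(grep -c 'v1.5' versions.txt)

# Escaped: the backslash makes the period literal, so only v1.5 matches.
escaped=$(grep -c 'v1\.5' versions.txt)

# -F: the pattern is a fixed string, so no escaping is needed.
fixed=$(grep -c -F 'v1.5' versions.txt)

echo "unescaped: $unescaped, escaped: $escaped, fixed string: $fixed"
```

The unescaped search counts 2 lines, while the other two count only 1.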

The Unix man command (short for "manual") shows documentation about Unix commands, so you can enter the following to learn more about the grep command's many other options:

man grep

I can't find an edit button for my post, so I'm adding a link to the table of contents for the series here: Unix command line tricks for Linux, Mac, and Windows: table of contents
