Unix command line tricks for Linux, Mac, and Windows: cut
This is part of a series on Unix command line tools that are available on Linux, OS X, and Windows machines. We'll be looking at some of the most useful Unix commands, because as the idea of Data Science gets bigger and bigger, it turns out that these old school Unix tools are more valuable than ever for dealing with the different kinds of data files that may show up. See Unix command line tricks for Linux, Mac, and Windows: grep for more on this series.
Today, we'll look at cut. Last week we looked at head and tail, which you can combine to take horizontal slices of files: for example, to take lines 145,234 through 145,238 of a 300,000-line file.
cut lets you take vertical slices of files, which is especially useful when working with spreadsheet data or data exported from relational tables.
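As a quick refresher on the horizontal slice mentioned above, here is a minimal sketch. It assumes the seq command is available, and biglines.txt is a hypothetical stand-in file whose lines simply contain their own line numbers:

```shell
# Generate a stand-in file with 300,000 numbered lines (hypothetical name).
seq 1 300000 > biglines.txt

# head keeps lines 1 through 145,238; tail then keeps the last 5 of those,
# which are lines 145,234 through 145,238.
slice=$(head -n 145238 biglines.txt | tail -n 5)
echo "$slice"
```

Because each line holds its own number, the output makes it easy to confirm that the slice starts and ends where you expect.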
For our examples, we'll use this sampledata.csv file, exported as CSV from a spreadsheet program like Excel or LibreOffice Calc:
Employee Number,Family Name,Given Name,Hire Date,Phone Extension
1001,Johnson,Emily,11/13/2016,x0023
1002,Smith,John,03/16/2017,x7225
1003,Baker,Debbie,03/23/2017,x8834
1004,Morales,Kermit,06/09/2017,x2643
The following command tells cut that our input file is delimited with commas and that we only want the third field:
cut -d ',' -f3 sampledata.csv
The command returns that third field for all the rows:
Given Name
Emily
John
Debbie
Kermit
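The -d option matters because cut splits on tabs when no delimiter is given. The sketch below uses a small made-up tab-separated sample (not part of sampledata.csv) to show the same kind of field selection without -d:

```shell
# A hypothetical two-column, tab-separated sample built inline with printf.
tsv=$(printf 'Given Name\tPhone Extension\nEmily\tx0023\nJohn\tx7225\n')

# With no -d option, cut treats the tab character as the field delimiter.
second=$(printf '%s\n' "$tsv" | cut -f2)
echo "$second"
```

This is handy when a spreadsheet program exports tab-separated text instead of CSV.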
We can also ask for multiple fields with a list of field numbers separated by commas. The following asks for the third and fifth fields:
cut -d ',' -f3,5 sampledata.csv
And here they are:
Given Name,Phone Extension
Emily,x0023
John,x7225
Debbie,x8834
Kermit,x2643
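Field lists can also contain ranges, which saves typing when the columns you want are adjacent. The following sketch feeds one row of the sample data to cut through printf; the variable names are only for illustration:

```shell
# One data row from the sample file, held in a variable for the demo.
row='1001,Johnson,Emily,11/13/2016,x0023'

# -f2-4 selects fields 2 through 4; -f3- selects field 3 through the last field.
middle=$(printf '%s\n' "$row" | cut -d ',' -f2-4)
rest=$(printf '%s\n' "$row" | cut -d ',' -f3-)

echo "$middle"   # Johnson,Emily,11/13/2016
echo "$rest"     # Emily,11/13/2016,x0023
```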
Last week we also learned about combining commands into a pipeline where each command sends its output to be used as input by the next command. The following uses the head command to take the first four lines of the file named as input, then sends its output to the tail command, which passes along the last three lines of its input to the cut command, which outputs the first and fourth columns of its input:
head -n 4 sampledata.csv | tail -n 3 | cut -d ',' -f1,4
Here is the result:
1001,11/13/2016
1002,03/16/2017
1003,03/23/2017
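The pipeline above drops the header row by counting lines from both ends. When you want every data row regardless of how long the file is, one alternative is tail's -n +2 form, which starts output at the second line. This sketch assumes your tail supports that form (it is part of POSIX) and uses the first rows of the sample data inline:

```shell
# Feed a header and two data rows through the pipeline.
# tail -n +2 skips the header; cut keeps the first and fourth columns.
rows=$(printf '%s\n' \
  'Employee Number,Family Name,Given Name,Hire Date,Phone Extension' \
  '1001,Johnson,Emily,11/13/2016,x0023' \
  '1002,Smith,John,03/16/2017,x7225' |
  tail -n +2 | cut -d ',' -f1,4)
echo "$rows"
```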
Again, this may not seem useful with an input file that is five lines long, but when you've exported a table from a massive database and have hundreds of thousands of lines that you can't just pull up in a text editor, you can combine these commands to perform a lot of very useful tasks. For example, you can pull a subset that has the parts that are most interesting to you and that will fit into a text editor or visualization tool.
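To make that concrete, here is a rough sketch of pulling an editor-sized subset out of a large export. The file names bigexport.csv and subset.csv, the column names, and the generated contents are all made up for illustration, with awk standing in for the database export:

```shell
# Build a hypothetical 300,000-row CSV to stand in for a large database export.
awk 'BEGIN {
  print "id,name,score"
  for (i = 1; i <= 300000; i++) print i ",name" i "," (i % 100)
}' > bigexport.csv

# Keep the header plus the first 1,000 data rows, and only columns 1 and 3,
# then redirect the result to a smaller file.
head -n 1001 bigexport.csv | cut -d ',' -f1,3 > subset.csv

# Count the lines in the subset to confirm its size.
wc -l < subset.csv
```

The resulting subset.csv is small enough to open in a text editor or load into a visualization tool.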
As with all Unix commands, you can learn more about cut with the manual (