5 indispensable bash tricks

Don’t mind the lame Buzzfeed title… Here are a few handy bash tricks and tips that people either use every day or never knew existed. Hopefully I can help move some of you into the first camp!

Introductory notes

A few of these commands involve working with bash history. On the advice of a coworker, I dropped this in my .bashrc to keep tons of history:

HISTSIZE=100000 # keep 100k commands in a session history (memory)
HISTFILESIZE=200000 # store 200k commands in my history file (on disk)

Disk space is cheap, as is memory. The number of times (prior to this change) that I wanted a command that had aged out of my bash history is much greater than the number of times I’ve found bash cumbersome because my history file is almost 1MB in size (when I have a 500GB SSD and 16GB RAM in my 2-year-old laptop).

Meta key

A number of bash commands reference a Meta key. In general, on a Mac, the Escape key fills that role. On Linux, it’s generally the Alt key. You can change that, but if you’ve done so, you don’t need me to tell you about it. My examples will use Esc for these commands, but if you’re on a Linux box, you will likely want to substitute Alt for it.

Esc-. | Insert last argument

Described in the docs as insert-last-argument (M-., M-_), this keyboard shortcut will spit out the last argument to the previous command.

Example usage:

$ mkdir -p long/directory/name/that/would_suck_to/type
$ cd Esc .

The Esc + . will be expanded into long/directory/name/that/would_suck_to/type.

Note that Esc + _ is bound to the same function, but is a bit tougher to type.
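Esc + . only works at an interactive prompt, but bash has a scriptable cousin: the special parameter $_, which expands to the last argument of the previous command. A quick sketch (the path under mktemp is just for demonstration):

```shell
base=$(mktemp -d)
mkdir -p "$base/long/directory/name/that/would_suck_to/type"
cd "$_"   # $_ holds the last argument of the previous command
pwd       # the long path, without retyping it
```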

Ctrl+R | Reverse history search

This one is tough to explain, but magical. Have you ever hit the up arrow a bunch of times to scroll through history, trying to find something you ran recently? Ctrl + r will open up an interactive search, or reverse-i-search in bash parlance.

Recently used vim on a file with a long filename? Press Ctrl + r and start typing vim. The most recent command matching vim will be shown. Keep typing to make your search more specific, or press Ctrl + r again to scroll to the next-newest one. When you find what you want, press Enter to run it, or the right arrow to start moving the cursor through the command. (Or something like Ctrl + e to jump to the end of the line.)

If you want to be really nutty, you can start commenting your commands at the end. vim /etc/X11/xorg.conf # fix video settings will allow a Ctrl + r search for video to match it, for example. I’ve been known to throw in random keywords I think I might try looking for later on.

cd - | Return to previous directory

pushd and popd are awesome and you should use them. But sometimes you forget. bash has got your back. cd - will return you to the previous directory you were in. (This is stored in the OLDPWD environment variable.)
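A minimal illustration (note that cd - also prints the directory it switches to):

```shell
mkdir -p /tmp/first /tmp/second
cd /tmp/first
cd /tmp/second
cd -              # prints /tmp/first and takes you there
echo "$OLDPWD"    # the "previous" directory is now /tmp/second
```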

git checkout - | Switch back to the previous branch

If you use git, you’ll be delighted to know that it does something similar. git checkout - will check out the previous branch you were on. I’m often bad at cleaning up topic branches, and will git checkout master to do some catching up, and then realize I don’t remember what my topic branch was called. Sure, it would probably take me all of 30 seconds to figure it out, but checking out - is so much easier.
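Here’s a throwaway demo of that round trip (the repo, branch names, and user config are all made up for the example; the default branch name depends on your git version, so we ask git for it):

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
default=$(git symbolic-ref --short HEAD)   # master or main, depending on git version
git -c user.email=you@example.com -c user.name=you \
    commit -q --allow-empty -m 'initial commit'
git checkout -q -b topic       # wander off to a topic branch
git checkout -q "$default"     # go do some catching up...
git checkout -q -              # ...then hop straight back
git rev-parse --abbrev-ref HEAD   # prints: topic
```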

!! | Re-run the previous command

!! will re-run the command you just ran. Why not just hit the up arrow? Because !! can be combined. The most common usage:

$ cat /root/whatever
Permission denied
$ sudo !!
sudo cat /root/whatever
whatever


Hope you learned something useful! What other neat tricks should I know about?

Counting open files by process

A site I host is offline, throwing the error “Too many open files.” The obvious solution would be to bounce the webserver to release all the file handles, but I wanted to figure out what was using all of them and see if I could figure out why they were leaking in the first place.

I had a few hunches, so I ran lsof -p PID on a few of them. But none of them had excessive files open. After a couple of minutes of guessing, I realized this was stupid, and set out to script things.

I hacked this quick-and-dirty script together:


pids = Dir.entries('/proc/').select { |p| p.match(/\A\d+\z/) }
puts "Found #{pids.size} processes..."

pfsmap = {}
pids.each do |pid|
  begin
    # Subtract 2 for the '.' and '..' entries
    files = Dir.entries("/proc/#{pid}/fd").size - 2
    # cmdline is NUL-delimited; make it printable
    cmdline = File.read("/proc/#{pid}/cmdline").tr("\0", ' ').strip
  rescue Errno::ENOENT, Errno::EACCES
    next # the process exited (or we can't read it); skip it
  end
  pfsmap[pid] = {
    :files => files,
    :name => cmdline
  }
end

puts pfsmap.sort_by { |pid, info| info[:files] }

There’s got to be a better way to get a process list from procfs than regexp-matching the purely numeric directory names in /proc. But I do that, and, for each process, count how many entries are in /proc/PID/fd and sort by that. So that the output isn’t just a giant mess of numbers, I also read /proc/PID/cmdline.

This is hardly a polished script, but it did the job — it identified a script that was hitting the default 1024 FD limit. I was then able to lsof that and find… that they’re all UNIX sockets, so it’s anyone’s guess what they go to. So I just rolled Apache like a chump. Oh well. Maybe it’ll help someone else—or maybe someone knows of a less-ugly way to do some of this?
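On that note, here’s a rough shell-only sketch of the same idea. Same caveats: processes can exit mid-count, and without root you can only read your own processes’ fd directories (everything else just shows 0):

```shell
for fd_dir in /proc/[0-9]*/fd; do
  pid=${fd_dir#/proc/}; pid=${pid%/fd}
  # Count open file descriptors; unreadable dirs come out as 0
  count=$(ls "$fd_dir" 2>/dev/null | wc -l)
  # cmdline is NUL-delimited; kernel threads have an empty one
  name=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline" || true)
  printf '%6d  %6s  %s\n' "$count" "$pid" "$name"
done | sort -n
```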

Fixing Gluster’s replicate0: Unable to self-heal permissions/ownership error

I helped recover a Gluster setup that had gone bad today, and wanted to write up what I did because there’s precious little information out there on what’s going on. Note that I don’t consider myself a Gluster expert by any stretch.

The problem

The ticket actually came to me as an Apache permissions error:

[Thu Mar 03 07:37:58 2016] [error] [client 192.168.1.1] (5)Input/output error: file permissions deny server access: /var/www/html/customer/header.jpg

(Full disclosure: I’ve mucked with all the logs here to not reveal any customer data.)

We suspected that it might have something to do with Gluster, which turned out to be correct. This setup is a pair of servers running Gluster 3.0.

We looked at Gluster’s logs, where we saw a ton of stuff like this:

[2016-03-03 15:54:53] W [fuse-bridge.c:862:fuse_fd_cbk] glusterfs-fuse: 97931: OPEN() /customer/banner.png => -1 (Input/output error)
[2016-03-03 16:00:26] E [afr-self-heal-metadata.c:566:afr_sh_metadata_fix] replicate0: Unable to self-heal permissions/ownership of '/customer/style.css' (possible split-brain). Please fix the file on all backend volumes

Those are separate errors for separate files, but both share a good degree of brokenness.

The cause

For reasons I haven’t yet identified, but that I’m arbitrarily assuming is a Gluster bug, Gluster got into a split-brain situation on the metadata of those files. Debugging this was a bit of an adventure, because there’s little information out there on how to proceed.

getfattr

After a lot of digging and hair-pulling, I eventually came across this exchange on gluster-users that addresses an issue that looked like ours. The files appeared the same and to have the same permissions, but Gluster thought they mismatched.

Gluster stores some information in extended attributes, or xattrs, on each file. This happens on the brick, not on the mounted Gluster filesystem. You can examine that with the getfattr tool. Some of the attributes are named trusted.afr.<brickname> for each host. As Jeff explains in that gluster-users post:

The [trusted.afr] values are arrays of 32-bit counters for how many updates we believe are still pending at each brick.

In that example, their values were:

[root@ca1.sg1 /]# getfattr -m . -d -e hex /data/gluster/lfd/techstudiolfc/pub
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/lfd/techstudiolfc/pub
trusted.afr.shared-client-0=0x000000000000000000000000
trusted.afr.shared-client-1=0x000000000000001d00000000
trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2

[root@ca2.sg1 /]# getfattr -m . -d -e hex /data/gluster/lfd/techstudiolfc/pub
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/lfd/techstudiolfc/pub
trusted.afr.shared-client-0=0x000000000000000300000000
trusted.afr.shared-client-1=0x000000000000000000000000
trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2

Note that their values disagree. One sees shared-client-1 with a "1d" value in the middle; the other sees shared-client-0 with a "03" in the middle. Jeff explains:

Here we see that ca1 (presumably corresponding to client-0) has a count of 0x1d for client-1 (presumably corresponding to ca2). In other words, ca1 saw 29 updates that it doesn’t know completed at ca2. At the same time, ca2 saw 3 operations that it doesn’t know completed at ca1. When there seem to be updates that need to be propagated in both directions, we don’t know which ones should supersede which others, so we call it split brain and decline to do anything lest we cause data loss.

Red Hat has a knowledgebase article on this, though it’s behind a paywall.

If you run getfattr and have no output, you’re probably running it on the shared Gluster filesystem, not on the local machine’s brick. Run it against the brick.

Fixing it

Don’t just skip here; this isn’t a copy-and-paste fix. :)

To fix this, you want to remove the offending xattrs from one of the split-brain node’s bricks, and then stat the file to get it to automatically self-heal.

Use the trusted.afr.whatever values. I unset all of them, one per brick—but remember, only do this on one node! Don’t run it on both!

In our case, we had one node that looked like this:

trusted.afr.remote1315012=0x000000000000000000000000
trusted.afr.remote1315013=0x000000010000000100000000

And the other looked like this:

trusted.afr.remote1315012=0x000000010000000100000000
trusted.afr.remote1315013=0x000000000000000000000000

(Note here that the same two values appear on both hosts, but not for the same keys. One has the 1s on remote1315013, and one sees them on remote1315012.)

Since it’s not like one is ‘right’ and the other is ‘wrong’, on one of the two nodes, I unset both xattrs, using setfattr:

setfattr -x trusted.afr.remote1315012 /mnt/brick1315013/customer/header.jpg
setfattr -x trusted.afr.remote1315013 /mnt/brick1315013/customer/header.jpg

I ran the getfattr command again to make sure the attributes had disappeared. (Remember: this is on the brick, not the presented Gluster filesystem.)

Then, simply stat the file on the mounted Gluster filesystem on that node, and it should automatically correct the missing attributes, filling them in from the other node. You can verify again with getfattr against the brick.

If this happens for a bunch of files, you can simply script it.
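A hypothetical sketch of what that script might look like. To be clear: heal_afr_xattrs, the brick path, mount point, and file-list name are all made up here, and the same warning applies: run this on ONE node only.

```shell
# Hypothetical helper: for each affected file (paths relative to the
# brick root), strip the trusted.afr.* xattrs on THIS node's brick,
# then stat the file through the Gluster mount to trigger self-heal.
heal_afr_xattrs() {
  local brick=$1 mount=$2 list=$3 relpath attr
  while IFS= read -r relpath; do
    # List just the trusted.afr.* attribute names on the brick copy
    getfattr --absolute-names -m 'trusted\.afr\.' "$brick/$relpath" 2>/dev/null |
      grep '^trusted.afr' |
      while IFS= read -r attr; do
        setfattr -x "$attr" "$brick/$relpath"
      done
    stat "$mount/$relpath" > /dev/null   # self-heal kicks in here
  done < "$list"
}
# e.g.: heal_afr_xattrs /mnt/brick1315013 /var/www/html affected-files.txt
```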

Run a UNIX command for a limited amount of time

In the past couple of weeks, I’ve repeatedly found myself wishing for a UNIX command that would run a command for a while, and then stop it. For example, I might want to sample tcpdump output for 60 seconds, or tail the output of a log and search for a string to see if any errors occurred over a 5-minute period. So I begrudgingly set out to write one. And then I realized:

There is totally already a command that does this. It’s called timeout. Somehow, despite using Linux for about 15 years, I had never heard of it. (Not enough time writing shell scripts in bash? Is that actually a bad thing?) It’s part of coreutils.
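If you’ve never used it, the shape is just timeout DURATION COMMAND: when the duration expires, timeout kills the command and exits with status 124. A trivial demo:

```shell
timeout 1 sleep 10 || status=$?   # sleep gets killed after ~1 second
echo "timeout exited with: ${status:-0}"   # 124 means "I had to kill it"
```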

For example, I ended up writing this gem:

sudo timeout 60 tcpdump -n net 191.247.228.0/24 and \
 dst port 123 -B 32000 | awk '{print $3}' | \
 cut -d "." -f 1-4 - | sort | uniq

Because it actually contains a lot of things I had to look up to get just right, I figure I’ll describe a bunch of those commands for my future-self:

sudo timeout

As I learned, you can’t just stick timeout in front of the whole thing (timeout 60 sudo tcpdump ...): sudo will launch the command with elevated privileges, but timeout itself stays unprivileged and then can’t kill the root-owned process when time is up. Put sudo first, as above.

tcpdump -n net 191.247.228.0/24 and dst port 123 -B 32000

It annoys me that -n (don’t resolve hostnames) isn’t the default. Since name resolution is blocking, unless every host you’re resolving is on a network with functional, and fast, reverse-resolvers, you’re going to have a bad time.

I’m used to matching on host 1.2.3.4, but you can use net 1.2.3.0/24 or whatnot to match a network instead. You (this is the part I always get wrong) combine conditions with the and keyword (which seems so simple once you remember). dst port 123 matches traffic to port 123. (And even though it’s tcpdump, I’m using it to capture UDP port 123—NTP.)

-B 32000 is another fun one I just learned about. Ever seen this?

17 packets captured
37 packets received by filter
0 packets dropped by kernel

But with “packets dropped by kernel” as a non-zero number? It happens when there are so many packets coming in that they fall out of the buffer before tcpdump can process them. -B 32000 tries to set it to 32,000 kB. (The man page on my system doesn’t explain units, but this one does.)

awk '{print $3}' | cut -d "." -f 1-4 - |

My awk is pretty terrible, though apparently awk itself is quite powerful. But with a bunch of lines like this:

17:16:40.791327 IP 191.247.228.xxx.39440 > 10.252.153.236.ntp: NTPv3, Client, length 48

I just want the third column, with the IP. awk '{print $3}' achieves that. (It’s not zero-based. I get this wrong about 50% of the time.)

I use cut much less frequently. tcpdump shows the port number on the end, separated by a dot: "191.247.228.xxx.39440" is IP 191.247.228.xxx, port 39440. So I want to split on the dots, and print only columns 1-4.

-d "." sets the . as a delimiter, and -f 1-4 says to print fields 1-4. (Like awk, it starts with column 1.) The part I struggled with most, actually, is remembering the trailing - to tell it to read from the pipe, versus expecting a filename.
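Putting those two together on a sample line (with documentation-range IPs standing in for the real ones):

```shell
line='17:16:40.791327 IP 192.0.2.5.39440 > 198.51.100.7.123: NTPv3, Client, length 48'
echo "$line" | awk '{print $3}' | cut -d '.' -f 1-4 -
# prints: 192.0.2.5
```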

sort | uniq

This burned me all the time, until I came to just always use them together: uniq doesn’t really detect duplicates. Per the man page:

Note: ‘uniq’ does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use ‘sort -u’ without ‘uniq’.

See for yourself:

$ echo -e "a\nb\na\na\nc" | uniq
a
b
a
c

(I should probably just do sort -u, but by now sort | uniq is etched into my brain.)
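And the fixed versions, for completeness:

```shell
printf 'a\nb\na\na\nc\n' | sort | uniq   # prints a, b, c (one per line)
printf 'a\nb\na\na\nc\n' | sort -u       # same result, one command
```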