Fixing Gluster’s replicate0: Unable to self-heal permissions/ownership error

I helped recover a Gluster setup that had gone bad today, and wanted to write up what I did because there’s precious little information out there on what’s going on. Note that I don’t consider myself a Gluster expert by any stretch.

The problem

The ticket actually came to me as an Apache permissions error:

[Thu Mar 03 07:37:58 2016] [error] [client 192.168.1.1] (5)Input/output error: file permissions deny server access: /var/www/html/customer/header.jpg

(Full disclosure: I’ve mucked with all the logs here to not reveal any customer data.)

We suspected that it might have something to do with Gluster, which turned out to be correct. This setup is a pair of servers running Gluster 3.0.

We looked at Gluster’s logs, where we saw a ton of stuff like this:

[2016-03-03 15:54:53] W [fuse-bridge.c:862:fuse_fd_cbk] glusterfs-fuse: 97931: OPEN() /customer/banner.png => -1 (Input/output error)
[2016-03-03 16:00:26] E [afr-self-heal-metadata.c:566:afr_sh_metadata_fix] replicate0: Unable to self-heal permissions/ownership of '/customer/style.css' (possible split-brain). Please fix the file on all backend volumes

Those are separate errors for separate files, but both share a good degree of brokenness.

The cause

For reasons I haven’t yet identified, but that I’m arbitrarily assuming are a Gluster bug, Gluster got into a split-brain situation on the metadata of those files. Debugging this was a bit of an adventure, because there’s little information out there on how to proceed.

getfattr

After a lot of digging and hair-pulling, I eventually came across this exchange on gluster-users that addresses an issue that looked like ours. The files appeared the same and to have the same permissions, but Gluster thought they mismatched.

Gluster stores some information in extended attributes, or xattrs, on each file. This happens on the brick, not on the mounted Gluster filesystem. You can examine that with the getfattr tool. Some of the attributes are named trusted.afr.<brickname> for each host. As Jeff explains in that gluster-users post:

The [trusted.afr] values are arrays of 32-bit counters for how many updates we believe are still pending at each brick.

In that example, their values were:

[root at ca1.sg1 /]# getfattr -m . -d -e hex /data/gluster/lfd/techstudiolfc/pub
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/lfd/techstudiolfc/pub
trusted.afr.shared-client-0=0x000000000000000000000000
trusted.afr.shared-client-1=0x000000000000001d00000000
trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2

[root at ca2.sg1 /]# getfattr -m . -d -e hex /data/gluster/lfd/techstudiolfc/pub
getfattr: Removing leading '/' from absolute path names
# file: data/gluster/lfd/techstudiolfc/pub
trusted.afr.shared-client-0=0x000000000000000300000000
trusted.afr.shared-client-1=0x000000000000000000000000
trusted.gfid=0x3700ee06f8f74ebc853ee8277c107ec2

Note that their values disagree. One sees shared-client-1 with a “1d” value in the middle; the other sees shared-client-0 with a “03” in the middle. Jeff explains:

Here we see that ca1 (presumably corresponding to client-0) has a count of 0x1d for client-1 (presumably corresponding to ca2). In other words, ca1 saw 29 updates that it doesn’t know completed at ca2. At the same time, ca2 saw 3 operations that it doesn’t know completed at ca1. When there seem to be updates that need to be propagated in both directions, we don’t know which ones should supersede which others, so we call it split brain and decline to do anything lest we cause data loss.
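To make those hex blobs easier to read, here is a minimal sketch that splits a trusted.afr value into its three 32-bit counters. The function name decode_afr is mine, and the labels assume the counters are ordered data, metadata, entry, which is my understanding of Gluster's layout rather than something stated in the post above:

```shell
# Sketch: split a trusted.afr.* value (24 hex digits) into three 32-bit counters.
# Assumption: the counters are ordered data, metadata, entry.
decode_afr() {
    hex=${1#0x}                      # drop the leading 0x
    if [ "${#hex}" -ne 24 ]; then
        echo "expected 24 hex digits, got ${#hex}" >&2
        return 1
    fi
    data=$(printf '%s' "$hex" | cut -c1-8)
    meta=$(printf '%s' "$hex" | cut -c9-16)
    entry=$(printf '%s' "$hex" | cut -c17-24)
    # printf converts the 0x-prefixed hex strings to decimal
    printf 'data=%d metadata=%d entry=%d\n' "0x$data" "0x$meta" "0x$entry"
}

decode_afr 0x000000000000001d00000000
# prints: data=0 metadata=29 entry=0
```

The 29 is the same 0x1d that Jeff points out above.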

Red Hat has a knowledgebase article on this, though it’s behind a paywall.

If you run getfattr and have no output, you’re probably running it on the shared Gluster filesystem, not on the local machine’s brick. Run it against the brick.

Fixing it

Don’t just skip here; this isn’t a copy-and-paste fix. :)

To fix this, you want to remove the offending xattrs from one of the split-brain nodes’ bricks, and then stat the file to get it to automatically self-heal.

Use the trusted.afr.whatever values. I unset all of them, one per brick—but remember, only do this on one node! Don’t run it on both!

In our case, we had one node that looked like this:

trusted.afr.remote1315012=0x000000000000000000000000
trusted.afr.remote1315013=0x000000010000000100000000

And the other looked like this:

trusted.afr.remote1315012=0x000000010000000100000000
trusted.afr.remote1315013=0x000000000000000000000000

(Note here that the same two values appear on both hosts, but not for the same keys. One has the 1’s on remote1315013, and one sees them on remote1315012.)

Since it’s not like one is ‘right’ and the other is ‘wrong’, on one of the two nodes, I unset both xattrs, using setfattr:

setfattr -x trusted.afr.remote1315012 /mnt/brick1315013/customer/header.jpg
setfattr -x trusted.afr.remote1315013 /mnt/brick1315013/customer/header.jpg

I ran the getfattr command again to make sure the attributes had disappeared. (Remember: this is on the brick, not the presented Gluster filesystem.)

Then, simply stat the file on the mounted Gluster filesystem on that node, and it should automatically correct the missing attributes, filling them in from the other node. You can verify again with getfattr against the brick.

If this happens for a bunch of files, you can simply script it.
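A rough sketch of what such a script might look like. Everything here is an assumption to adapt: the brick path and xattr names come from the example above, /var/www/html is my guess at where the Gluster volume is mounted, and the helper names are made up. It defaults to a dry run that only prints the commands, and like the manual fix it must only ever be run on one node:

```shell
# Sketch: clear the trusted.afr.* xattrs for each listed file on ONE node's
# brick, then stat the file through the Gluster mount to trigger self-heal.
# BRICK, MOUNT and AFR_KEYS are assumptions; set them for your own setup.
BRICK=${BRICK:-/mnt/brick1315013}
MOUNT=${MOUNT:-/var/www/html}
AFR_KEYS=${AFR_KEYS:-"trusted.afr.remote1315012 trusted.afr.remote1315013"}
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands, 0 = actually run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

heal_file() {
    rel=$1    # path relative to the brick root, e.g. customer/header.jpg
    for key in $AFR_KEYS; do
        run setfattr -x "$key" "$BRICK/$rel" || return 1
    done
    run stat "$MOUNT/$rel"
}

# Read one relative path per line from stdin.
heal_list() {
    while IFS= read -r rel; do
        [ -n "$rel" ] || continue
        heal_file "$rel" || echo "failed: $rel" >&2
    done
}
```

Feed it a list of affected files, for example heal_list < broken-files.txt (a hypothetical file name), check the printed commands, and only then rerun with DRY_RUN=0.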

Run a UNIX command for a limited amount of time

In the past couple of weeks, I’ve repeatedly found myself wishing for a UNIX command that would run a command for a while, and then stop it. For example, I might want to sample tcpdump output for 60 seconds, or tail the output of a log and search for a string to see if any errors occurred over a 5-minute period. So I begrudgingly set out to write one. And then I realized:

There is totally already a command that does this. It’s called timeout. Somehow, despite using Linux for about 15 years, I had never heard of it. (Not enough time writing shell scripts in bash? Is that actually a bad thing?) It’s part of coreutils.
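A tiny illustration, using sleep as a stand-in for a long-running command. With GNU coreutils, timeout exits with status 124 when it had to stop the command:

```shell
# Let "sleep 5" run for at most 1 second, then check how timeout exited.
timeout 1 sleep 5
status=$?
echo "exit status: $status"
```

That exit status is handy in scripts for telling "the command finished on its own" apart from "the time ran out".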

For example, I ended up writing this gem:

sudo timeout 60 tcpdump -n net 191.247.228.0/24 and \
 dst port 123 -B 32000 | awk '{print $3}' | \
 cut -d "." -f 1-4 - | sort | uniq

Because it actually contains a lot of things I had to look up to get just right, I figure I’ll describe a bunch of those commands for my future-self:

sudo timeout

You can’t put timeout in front of a sudo command (timeout 60 sudo tcpdump ...), as I learned. It’ll launch the command with elevated privileges, but then try to kill it without them. Putting sudo first, so that timeout itself runs as root, avoids that.

tcpdump -n net 191.247.228.0/24 and dst port 123 -B 32000

It annoys me that -n (don’t resolve hostnames) isn’t the default. Since name resolution is blocking, unless every host you’re resolving is on a network with functional, and fast, reverse-resolvers, you’re going to have a bad time.

I’m used to matching on host 1.2.3.4, but you can use net 1.2.3.0/24 or whatnot to match a network instead. You (this is the part I always get wrong) combine conditions with the and keyword (which seems so simple once you remember). dst port 123 matches traffic to port 123. (And even though it’s tcpdump, I’m using it to capture UDP port 123—NTP.)

-B 32000 is another fun one I just learned about. Ever seen this?

17 packets captured
37 packets received by filter
0 packets dropped by kernel

But with “packets dropped by kernel” as a non-zero number? It happens when packets arrive faster than tcpdump can process them, so they fall out of the capture buffer. -B 32000 tries to set that buffer to 32,000 KiB. (The man page on my system doesn’t explain units, but this one does.)

awk '{print $3}' | cut -d "." -f 1-4 - |

My awk is pretty terrible, and it’s apparently quite powerful. But with a bunch of lines like this:

17:16:40.791327 IP 191.247.228.xxx.39440 > 10.252.153.236.ntp: NTPv3, Client, length 48

I just want the third column, with the IP. awk '{print $3}' achieves that. (It’s not zero-based. I get this wrong about 50% of the time.)

I use cut much less frequently. tcpdump shows the port number on the end, separated by a dot: “191.247.228.xxx.39440” is IP 191.247.228.xxx, port 39440. So I want to split on the dots, and print only columns 1-4.

-d "." sets the . as a delimiter, and -f 1-4 says to print fields 1-4. (Like awk, it starts with column 1.) The part I struggled with most, actually, was remembering the trailing - to tell it to read from the pipe. (Strictly speaking, cut reads standard input whenever no filename is given, so the - is optional.)
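To see the two steps working together without running tcpdump, here they are applied to the sample line from above. The function name extract_src_ip is just something I made up for the example:

```shell
# Pull the source IP (without the port) out of tcpdump-style lines on stdin.
extract_src_ip() {
    awk '{print $3}' | cut -d "." -f 1-4 -
}

line='17:16:40.791327 IP 191.247.228.xxx.39440 > 10.252.153.236.ntp: NTPv3, Client, length 48'
printf '%s\n' "$line" | extract_src_ip
# prints: 191.247.228.xxx
```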

sort | uniq

This burned me all the time, until I came to just always use them together: uniq doesn’t really detect duplicates. Per the man page:

Note: ‘uniq’ does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use ‘sort -u’ without ‘uniq’.

See for yourself:

$ echo -e "a\nb\na\na\nc" | uniq
a
b
a
c

(I should probably just do sort -u, but by now sort | uniq is etched into my brain.)
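For comparison, the same input run through sort -u, which collapses the duplicates regardless of where they sit:

```shell
# Same input as the uniq example; sorting first makes the duplicates adjacent.
printf 'a\nb\na\na\nc\n' | sort -u
# prints a, b and c, one per line
```

If you want a count per line as well (say, packets per source IP), sort | uniq -c still earns its keep.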

Using msmtp with Rackspace Email

When using mutt on my Mac, I use msmtp (to continue the alliteration) as an SMTP agent. It sends mail through an actual authenticated SMTP server, rather than trying to deliver directly from my laptop, which not many mailservers will accept.

I’m either missing something, or it’s a real pain with keys when using TLS, especially on the Mac, where the CA certs aren’t present except in the Keychain. I found some guides to getting this working with Gmail, but not Rackspace’s email service.

This is the .msmtprc file I ended up using:

account default
port 587
tls on
tls_starttls on
tls_fingerprint CD:E1:CD:60:FC:8C:8F:3B:6F:17:62:70:61:51:75:3D
auth on
host smtp.emailsrvr.com
user "you@example.com"
password "maybe you do not want it here"

Don’t trust me on the tls_fingerprint line. (I’m not up to anything, but you don’t know that.)

This page documents their SMTP settings, including the hostname. It doesn’t give you TLS fingerprints or a CA cert file, because no one on the Internet does that.

Following this advice concerning Gmail, I adapted it to find the fingerprint for Rackspace:

echo -n | openssl s_client -connect smtp.emailsrvr.com:587 \
 -starttls smtp -showcerts > x.tmp

That will save the exchange, which includes the server’s certificate. You could probably extract it from there, but it was easier for me to go on and just get the fingerprint:

openssl x509 -noout -fingerprint -md5 -in x.tmp

Take the bit after MD5 Fingerprint= and drop that into .msmtprc on the tls_fingerprint line.

There’s got to be an easier way…
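One slightly shorter route, as a sketch: pipe s_client straight into openssl x509, which reads the first certificate it finds on stdin, and strip the label with sed. The function names are made up, and I believe the label's exact text can differ between OpenSSL versions, so the sed pattern only keys on "Fingerprint=":

```shell
# "MD5 Fingerprint=AA:BB:..." -> "AA:BB:..."
strip_label() {
    sed 's/^.*Fingerprint=//'
}

# Fetch the server certificate over STARTTLS and print its MD5 fingerprint.
# Nothing is written to disk. Needs network access when actually called.
smtp_fingerprint() {
    host=$1
    port=${2:-587}
    openssl s_client -connect "$host:$port" -starttls smtp </dev/null 2>/dev/null |
        openssl x509 -noout -fingerprint -md5 |
        strip_label
}

# Usage:
#   smtp_fingerprint smtp.emailsrvr.com
```

The same caveat applies as before: check the fingerprint you get from a network you trust before pasting it into .msmtprc.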

‘coding’ and ‘hacker’ make the State of the Union

Twitter user @benmschmidt tweeted last night a list of words that have never appeared in a State of the Union address before:

[Image: screenshot of the tweeted word list]

It’s case-sensitive, so things like “internet” (previously spelled as a proper noun in official transcripts) and “Understand” (surely used before, but maybe not at the start of a sentence) are probably erroneous. (And it’s also been suggested that “healthcare” has always been used as two words.)

But it’s exciting to see that the words “coding” and “hacker” make an appearance in a State of the Union address. (Plus references to eBay, Tesla, and Instagram. I’m hoping “CVS” was not a reference to the version control system.)

The appearance of “lesbian”, “bisexual”, and “transgender” in the address also seems like a sign of progress.