Update:
Just a few hours after posting this, Joyent committed the fix to the Illumos kernel on github. Kindly, Mike Zeller informed me of that fact on Twitter. I love the internet and open source software.
Original post:
Recently, I upgraded my home SmartOS server to 20140919T024804Z. Generally this is a simple process where I download the latest USB image from smartos.org, image it to a 2GB USB stick, and boot from it. This process usually takes no more than 10 minutes to complete.
This time, however, was a bit different. I wasn’t able to reach the apache server on one of my zones (megaman). Running svcs -xv quickly told me that yes, apache was not running.
[root@megaman ~]# svcs -xv
svc:/pkgsrc/apache:default (Apache 2 HTTP server)
State: maintenance since September 27, 2014 11:22:26 PM UTC
Reason: Start method failed repeatedly, last dumped core on Abort (6).
See: http://illumos.org/msg/SMF-8000-KS
See: man -M /opt/local/apache2/man -s 8 httpd
See: http://httpd.apache.org
See: /var/svc/log/pkgsrc-apache:default.log
Impact: This service is not running.
Tailing the service log, I found that the httpd binary was returning an ABRT signal.
[root@megaman ~]# tail /var/svc/log/pkgsrc-apache:default.log
[ Sep 27 23:14:25 Method "start" failed due to signal ABRT. ]
[ Sep 27 23:14:25 Executing start method ("/opt/local/sbin/httpd -k start"). ]
[ Sep 27 23:14:25 Method "start" failed due to signal ABRT. ]
[ Sep 27 23:22:24 Enabled. ]
[ Sep 27 23:22:26 Executing start method ("/opt/local/sbin/httpd -k start"). ]
[ Sep 27 23:22:26 Method "start" failed due to signal ABRT. ]
[ Sep 27 23:22:26 Executing start method ("/opt/local/sbin/httpd -k start"). ]
[ Sep 27 23:22:26 Method "start" failed due to signal ABRT. ]
[ Sep 27 23:22:26 Executing start method ("/opt/local/sbin/httpd -k start"). ]
[ Sep 27 23:22:26 Method "start" failed due to signal ABRT.
Attempting to run the httpd binary manually aborted almost immediately:
[root@megaman ~]# cd /opt/local/sbin/
[root@megaman /opt/local/sbin]# ./httpd
Abort (core dumped)
This was very strange considering apache has worked perfectly in megaman for years. At this point I had only updated SmartOS but none of the pkgsrc packages on the zone. My guess was that something was amiss between the application on the zone and the new system libraries. I used pstack to glance that the process stack of the core file.
[root@megaman /opt/local/sbin]# pstack core
core 'core' of 37746: ./httpd
fffffd7fff1fbd3a _lwp_kill () + a
fffffd7fff191160 raise (6) + 20
fffffd7ffec47e14 umem_do_abort () + 44
fffffd7ffec48065 ???????? ()
fffffd7ffec4ad80 process_free (59c810, 1, 0) + a0
fffffd7ffec4aecd umem_malloc_free (59c810) + 1d
fffffd7ffed48425 __IPv6_cleanup (5867c0) + 25
fffffd7ffed46e35 getipnodebyname (fffffd7fffdff420, 1a, 8007, fffffd7fffdff53c) + 115
fffffd7ffea597ea get_addr (0, 53e9e8, fffffd7fffdff5f0, fffffd7fffdff5f0, 0, 0) + 10a
fffffd7ffea59044 _getaddrinfo (53e9e8, 0, fffffd7fffdff700, fffffd7fffdff6f8, 0) + 2a4
fffffd7ffea596a0 getaddrinfo (53e9e8, 0, fffffd7fffdff700, fffffd7fffdff6f8) + 20
fffffd7ffb4dc7b5 call_resolver.isra.0 () + 65
0000000000467e9f ap_set_listener () + 1bf
000000000045cf3d invoke_cmd () + 61d
000000000045f37f ap_walk_config () + af
0000000000460560 ap_process_config_tree () + c0
0000000000477d16 main () + 6a6
000000000043ef2c _start () + 6c
getipnodebyname appeared to be causing the ABRT of the httpd process. I decided to capture parameter information for getipnodebyname to determine what caused it to go haywire. MDB to the rescue!
[root@megaman /opt/local/sbin]# mdb ./httpd
> ::run
mdb: stop on SIGABRT
mdb: target stopped at:
libc.so.1`_lwp_kill+0xa:jae +0xc <libc.so.1`_lwp_kill+0x18> mdb: You've got symbols!
Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 ]
Cool, ABRT in mdb too. Now on to the interesting part. Run the ::stack dcmd and find the hostname that was sent to getipnodebyname.
> ::stack
libc.so.1`_lwp_kill+0xa()
libc.so.1`raise+0x20(6)
libumem.so.1`umem_do_abort+0x44()
0xfffffd7ffec48065()
libumem.so.1`process_free+0xa0(59e810, 1, 0)
libumem.so.1`umem_malloc_free+0x1d(59e810)
libnsl.so.1`__IPv6_cleanup+0x25(5877c0)
libnsl.so.1`getipnodebyname+0x115(fffffd7fffdff420, 1a, 8007, fffffd7fffdff53c)
libsocket.so.1`get_addr+0x10a(0, 5409e8, fffffd7fffdff5f0, fffffd7fffdff5f0, 0, 0)
libsocket.so.1`_getaddrinfo+0x2a4(5409e8, 0, fffffd7fffdff700, fffffd7fffdff6f8, 0)
libsocket.so.1`getaddrinfo+0x20(5409e8, 0, fffffd7fffdff700, fffffd7fffdff6f8)
libapr-1.so.0.5.1`call_resolver.isra.0+0x65()
ap_set_listener+0x1bf()
invoke_cmd+0x61d()
ap_walk_config+0xaf()
ap_process_config_tree+0xc0()
main+0x6a6()
_start+0x6c()
> fffffd7fffdff420::dump
\/ 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef
fffffd7fffdff420: 3a3a00ff 7ffdffff 3eb33bff 7ffdffff ::……>.;…..
See the first two bytes of the string? In hexidecimal it’s 3a3a. getipnodebyname appears to be failing while trying to look up “::”. A double colon is shorthand in IPv6 that means the equivolent of 0.0.0.0 in IPv4.
My next thought was to disable the IPv6 interface to see if that made any difference. First run ipadm show-addr to find any IPv6 addresses.
[root@megaman /opt/local/sbin]# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/_a static ok 127.0.0.1/8
net0/_a from-gz ok 192.168.1.201/24
lo0/_b static ok ::1/64
Ah, an IPv6 localhost entry. Disable the IPv6 interface lo0/_b.
[root@megaman /opt/local/sbin]# ipadm disable-addr -t lo0/_b
Test apache again.
[root@megaman /opt/local/sbin]# ./httpd
It works! No abort and the daemon started without trouble. svcadm enable apache also worked without a problem.
Obviously, disabling IPv6 is decidedly not a permanent solution. A few months from now if httpd dies again, I don’t want to be scratching my head trying to remember what could be causing it. This is a good example of Technical Debt. Trading long term difficulties for short term “just getting it to work”. Best avoided.
Luckily, this httpd daemon is running a simple web application that’s trivial to migrate to a new smart machine. After a simple imgadm import and vmadm create, I had a brand new zone. It just took a few minutes to migrate the application to it (now with LZ4 compression!).