Sunday, April 03, 2011

Reboot? This is Linux!

Just had a scary few minutes. This morning I was unable to log on to the home mail server (running Mandriva 2010.1) across the local network: the process connected, opened a terminal window and just hung, never getting to a prompt! With Sunday lunch over, I tried opening a terminal while sitting at the problem machine itself, and there too, no xterm! Fortunately I was already logged in there and had root (admin) access. Looking back through the logs I saw a crash in ssh (which I use to get onto the mail server from other machines), in ssh-block-Allow. I could understand ssh problems breaking remote access, but why was local access a problem? Even Ctrl-C failed to interrupt whatever was running, and I checked that nothing was hogging the CPU.
I looked through the recent updates to the machine and saw nothing likely, and googling only turned up my own tweet for help(!), so I hurriedly connected the backup drive and did a machine backup (glad I already had root access), then tried a reboot. That failed too: the machine wouldn't shut down, giving the message:
could not log bootup - already in use
so I resorted to a magic key shutdown (using the SysRq key).
A few deep breaths later the system rebooted and I logged in, and then I realised the consequence of the ssh crash. One of the system scripts that run when you open a terminal checks whether the machine is running ssh and, if so, asks you to unlock your key for passwordless logins. If ssh was confused, as it evidently was, perhaps that script was simply hanging there.
So I guess killing and restarting sshd would have avoided the reboot. As to why ssh was broken, that's another issue!
At least the server had been up for a month, so it wasn't too unstable!


Jake said...

The machine had been up for a month? While that might be considered remarkable uptime for a Windows PC, I suspect something is wrong when a Linux box goes down, ever. Many of the Linux mail servers at my day job have a current uptime of over 1600 days, and I can't recall ever seeing sshd crash on any of them, so that sounds suspicious. Check for foul play or flaky hardware.

Anonymous said...

Not all distros are created equal.

rajm said...

Yes, I'll be giving the logs a good look through today. The machine was up to date; I think there was a recent openssh update, and I had long ago disabled password authentication.

rajm said...

The ssh version is:

Name : openssh Relocations: (not relocatable)
Version : 5.5p1 Vendor: Mandriva
Release : 2.1mdv2010.1 Build Date: Tue 23 Nov 2010 12:53:22 GMT
Install Date: Thu 20 Jan 2011 21:34:29 GMT Build Host:
Group : Networking/Remote access Source RPM: openssh-5.5p1-2.1mdv2010.1.src.rpm

It looks like a kernel oops:
Apr 3 04:10:51 faure kernel: BUG: unable to handle kernel paging request at fffffff4
kernel: IP: [] 0xf481e434
kernel: *pde = 00017067 *pte = 00000000
kernel: Oops: 0002 [#1] SMP
kernel: last sysfs file: /sys/devices/virtual/hwmon/hwmon0/temp1_input
kernel: Modules linked in:
kernel: Pid: 5081, comm: ssh-block-Allow Tainted: P B #1 SiS-661/
kernel: EIP: 0060:[] EFLAGS: 00010292 CPU: 1
kernel: EIP is at 0xf481e434
kernel: EAX: fffffff4 EBX: f5553d74 ECX: f481e400 EDX: f55649f8
kernel: ESI: b780b000 EDI: f55649f8 EBP: f481e400 ESP: f4a0bf94
kernel: DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
kernel: Process ssh-block-Allow (pid: 5081, ti=f4a0a000 task=f5bc1940 task.ti=f4a0a000)
kernel: Stack:
kernel: b780a000 f4a0bfac c01e8972 b780a000 0989ef70 00000000 f4a0a000 c0103bdf
kernel: <0> b780a000 00001000 b767bff4 0989ef70 00000000 bfc4c980 0000005b 0000007b
kernel: <0> 0000007b 00000000 00000033 0000005b ffffe424 00000073 00000206 bfc4c95c
kernel: Call Trace:
kernel: [] ? sys_munmap+0x42/0x60
kernel: [] ? sysenter_do_call+0x12/0x28
kernel: Code: 9d 7d f5 70 6a 1e c0 b0 6a 1e c0 00 a0 83 b7 00 00 00 c0 00 00 00 00 00 a0 80 b7 00 20 7a f6 01 00 00 00 01 00 00 00 2b 00 00 00 <01> 00 ff ff 00 00 00 00 3c e4 81 f4 3c e4 81 f4 28 28 00 00 48
kernel: EIP: [] 0xf481e434 SS:ESP 0068:f4a0bf94
kernel: CR2: 00000000fffffff4
kernel: ---[ end trace 6625c8823095f49f ]---
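For anyone wanting to pick apart a trace like this, here is a minimal Python sketch (my own hypothetical helpers, not a standard tool) that pulls out the headline fields and decodes the "Oops:" value. It assumes each line carries the "kernel: " prefix as above, and that the oops came from a page fault, in which case the x86 kernel prints the page-fault error code in hex after "Oops:".

```python
import re

# A few lines copied from the oops above (hypothetical sample for the sketch)
SAMPLE_OOPS = """Apr 3 04:10:51 faure kernel: BUG: unable to handle kernel paging request at fffffff4
kernel: Oops: 0002 [#1] SMP
kernel: Process ssh-block-Allow (pid: 5081, ti=f4a0a000 task=f5bc1940 task.ti=f4a0a000)
kernel: CR2: 00000000fffffff4"""


def summarise_oops(lines):
    """Pull a few headline fields out of kernel oops log lines (hypothetical helper)."""
    summary = {}
    for line in lines:
        # Only look at the text after the first 'kernel: ' prefix
        _, sep, msg = line.partition("kernel: ")
        if not sep:
            continue
        if msg.startswith("BUG: "):
            summary["bug"] = msg[len("BUG: "):].strip()
        m = re.match(r"Oops: ([0-9a-f]+) \[#(\d+)\]", msg)
        if m:
            summary["error_code"] = m.group(1)
            summary["oops_number"] = int(m.group(2))
        m = re.match(r"Process (\S+) \(pid: (\d+)", msg)
        if m:
            summary["process"] = m.group(1)
            summary["pid"] = int(m.group(2))
        m = re.match(r"CR2: ([0-9a-f]+)", msg)
        if m:
            summary["fault_address"] = m.group(1)
    return summary


def decode_x86_page_fault(error_code):
    """Decode the low bits of an x86 page-fault error code (the hex after 'Oops:')."""
    code = int(error_code, 16)
    return {
        "protection_violation": bool(code & 1),  # False means the page was not present
        "write": bool(code & 2),                 # True = write access, False = read
        "user_mode": bool(code & 4),             # False means the fault happened in kernel mode
    }


if __name__ == "__main__":
    info = summarise_oops(SAMPLE_OOPS.splitlines())
    print(info)
    print(decode_x86_page_fault(info["error_code"]))
```

Run on these lines, it names ssh-block-Allow (pid 5081) and decodes 0002 as a write, in kernel mode, to a page that was not present, which fits the "unable to handle kernel paging request" message.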

ssh-block-Allow is a Python script I'm running (I'd forgotten what it was yesterday) which blocks IP addresses that make systematic attempts to log in via ssh, adding them to iptables via Shorewall.
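I won't reproduce ssh-block-Allow itself, but the core idea behind a script like it can be sketched in Python: count failed ssh login lines per source address and flag any address that reaches a threshold. This is a hypothetical illustration, not the script I run: the log-line pattern (OpenSSH wording varies between versions), the threshold of 5 and the name `find_offenders` are all assumptions, and the step that actually blocks an address through Shorewall/iptables is deliberately left out.

```python
import re
from collections import Counter

# Assumed sshd failure wording; each matching line counts as one attempt
FAILURE_RE = re.compile(
    r"(?:Failed \S+ for (?:invalid user )?\S+|Invalid user \S+) from (\d{1,3}(?:\.\d{1,3}){3})"
)


def find_offenders(lines, threshold=5):
    """Return, sorted, the IPv4 addresses with at least `threshold` failed-login lines."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Hypothetical auth-log lines using documentation-only IP addresses
SAMPLE_AUTH = [
    f"Apr  3 04:00:0{i} faure sshd[{100 + i}]: Failed password for invalid user admin from 192.0.2.7 port 4000{i} ssh2"
    for i in range(5)
] + [
    "Apr  3 04:01:00 faure sshd[200]: Invalid user guest from 198.51.100.9",
    "Apr  3 04:02:00 faure sshd[300]: Accepted publickey for rajm from 192.0.2.50 port 5000 ssh2",
]

if __name__ == "__main__":
    print(find_offenders(SAMPLE_AUTH))
```

A real blocker would also need to remember addresses it has already blocked and cope with log rotation; the sketch only shows the counting step.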