[nSLUG] Hard drive or power supply?

Joel Maxuel j.maxuel at gmail.com
Wed Dec 17 20:40:25 AST 2014


Update.  Doing the final backup before the hardware replacement.  New parts
have arrived.  Bit of a pinch in the wallet though.  At least this stuff is
meant to last...

I can hear the hard drive start and stop now, and this is what the tail of
my /var/log/messages looks like:

Dec 17 20:33:15 cybaryme kernel: [686118.816044] ata3: lost interrupt (Status 0x50)
Dec 17 20:33:15 cybaryme kernel: [686118.816068] ata3.01: limiting speed to UDMA/33:PIO4
Dec 17 20:33:15 cybaryme kernel: [686118.816116] ata3: soft resetting link
Dec 17 20:33:15 cybaryme kernel: [686119.004324] ata3.01: configured for UDMA/33
Dec 17 20:33:15 cybaryme kernel: [686119.004334] ata3: EH complete

Since I have already replaced the SATA cable, it looks like a good thing
I'm pulling the PSU as well.
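
A rough way to tally these libata messages, rather than eyeballing the
tail, is sketched below in Python.  The function name summarize_ata_events,
the category labels, and the regular expression are illustrative
assumptions and not part of any existing tool; the sample lines are the
ones quoted above, rejoined onto single lines as they would appear in the
log file.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... kernel: [686118.816044] ata3.01: limiting speed to UDMA/33:PIO4"
# and captures the port (without the ".01" device suffix) and the message.
ATA_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] (ata\d+)(?:\.\d+)?: (.*)")

# Hypothetical category labels keyed by a substring of the message text.
CATEGORIES = {
    "lost interrupt": "lost_interrupt",
    "failed command": "failed_command",
    "limiting speed": "speed_limited",
    "soft resetting link": "soft_reset",
    "hard resetting link": "hard_reset",
}


def summarize_ata_events(lines):
    """Count categorized ATA error-handling messages per port."""
    counts = Counter()
    for line in lines:
        match = ATA_RE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for needle, label in CATEGORIES.items():
            if needle in message:
                counts[(port, label)] += 1
                break
    return counts


# The Dec 17 lines from this message, one log entry per line.
SAMPLE_LINES = [
    "Dec 17 20:33:15 cybaryme kernel: [686118.816044] ata3: lost interrupt (Status 0x50)",
    "Dec 17 20:33:15 cybaryme kernel: [686118.816068] ata3.01: limiting speed to UDMA/33:PIO4",
    "Dec 17 20:33:15 cybaryme kernel: [686118.816116] ata3: soft resetting link",
    "Dec 17 20:33:15 cybaryme kernel: [686119.004324] ata3.01: configured for UDMA/33",
    "Dec 17 20:33:15 cybaryme kernel: [686119.004334] ata3: EH complete",
]

if __name__ == "__main__":
    for (port, label), n in sorted(summarize_ata_events(SAMPLE_LINES).items()):
        print(port, label, n)
```

To run it over the real log, pass an open file instead of the sample list,
for example summarize_ata_events(open("/var/log/messages")); reading that
file usually needs root or membership in the adm group on Debian.  Note
that the Dec 14 excerpt further down shows the device configured for
UDMA/133, while here the kernel has limited it to UDMA/33.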

Now let's hope I can burn the new Debian install CD (as I don't have one for
amd64).  Think I will try my hand at Jessie.  As for partitioning, I am
going to try ext4 for the SSD, and LVM2 for the new HDD.



--
Cheers,
Joel Maxuel

"One should strive to achieve, not sit in bitter regret."
 - Ronan Harris / Mark Jackson

On Mon, Dec 15, 2014 at 10:12 AM, George N. White III <gnwiii at gmail.com>
wrote:

> On Sun, Dec 14, 2014 at 11:17 AM, Joel Maxuel <j.maxuel at gmail.com> wrote:
>
>> Another hardware question, this time having to do with a hard drive!
>>
>> Last week I was browsing on my PC when I opened a page and my browser
>> locked up.  I had a heck of a time killing iceweasel to restart it, as it
>> had turned zombie.  I ended up logging out and logging back in.
>>
>> A day later I was still having troubles, so I switched to console mode,
>> and I found (as I have a mining script running there) a bunch of DRDY
>> statuses!
>>
>> Rebooted my PC, and the 0.5TB HDD wouldn't come back from the dead.  A few
>> reboots and a shutdown later, having replaced the SATA cable and tried a
>> different power connector off the PSU, I was back in business: no more
>> freezes or DRDY statuses.  I suspected it was a bad connector on my power
>> supply.
>>
>> That was a week ago.  Today, fed up with iceweasel complaining that Adobe
>> was out of date, I did an aptitude update && aptitude upgrade.  Halfway
>> through the upgrade I got another freeze.  The full transcript of the
>> errors is below.
>>
>> I don't like the smartctl stats, but could the pre-fail values be due to
>> the fact that the PSU was messing up for a day and a half (at least) before
>> anything was done about it?  Which part is to blame at this point, or would
>> it be both the PSU and HDD?
>>
>> Running backups more often in the meantime.
>>
>> My tail of kern.log:
>>
>> Dec 14 10:40:52 cybaryme kernel: [391375.808041] ata3: lost interrupt (Status 0x50)
>> Dec 14 10:40:52 cybaryme kernel: [391375.808062] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> Dec 14 10:40:52 cybaryme kernel: [391375.808129] ata3.01: failed command: WRITE DMA
>> Dec 14 10:40:52 cybaryme kernel: [391375.808192] ata3.01: cmd ca/00:08:48:c9:ad/00:00:00:00:00/f0 tag 0 dma 4096 out
>> Dec 14 10:40:52 cybaryme kernel: [391375.808194]          res 40/00:01:09:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
>> Dec 14 10:40:52 cybaryme kernel: [391375.808329] ata3.01: status: { DRDY }
>> Dec 14 10:40:52 cybaryme kernel: [391375.808385] ata3: soft resetting link
>> Dec 14 10:40:52 cybaryme kernel: [391375.996347] ata3.01: configured for UDMA/133
>> Dec 14 10:40:52 cybaryme kernel: [391375.996367] ata3: EH complete
>>
>> SmartCTL:
>>
>> joel at cybaryme:~$ sudo smartctl --all /dev/sda
>> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
>> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Seagate Barracuda 7200.12
>> Device Model:     ST3500418AS
>> Serial Number:    9VMTNDK4
>> LU WWN Device Id: 5 000c50 02d3772ce
>> Firmware Version: CC46
>> User Capacity:    500,107,862,016 bytes [500 GB]
>> Sector Size:      512 bytes logical/physical
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   8
>> ATA Standard is:  ATA-8-ACS revision 4
>> Local Time is:    Sun Dec 14 10:54:27 2014 AST
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>
> SMART catches maybe 50% of drive failures, so this is only
> mildly encouraging.
>
> Whenever a drive has problems you should run the long self-test
> "sudo smartctl -t long /dev/sda"
>
>
>
>
>> General SMART Values:
>> Offline data collection status:  (0x82)    Offline data collection activity
>>                     was completed without error.
>>                     Auto Offline Data Collection: Enabled.
>> Self-test execution status:      (   0)    The previous self-test routine completed
>>                     without error or no self-test has ever
>>                     been run.
>> Total time to complete Offline
>> data collection:         (  600) seconds.
>> Offline data collection
>> capabilities:              (0x7b) SMART execute Offline immediate.
>>                     Auto Offline data collection on/off support.
>>                     Suspend Offline collection upon new
>>                     command.
>>                     Offline surface scan supported.
>>                     Self-test supported.
>>                     Conveyance Self-test supported.
>>                     Selective Self-test supported.
>> SMART capabilities:            (0x0003)    Saves SMART data before entering
>>                     power-saving mode.
>>                     Supports SMART auto save timer.
>> Error logging capability:        (0x01)    Error logging supported.
>>                     General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:      (   1) minutes.
>> Extended self-test routine
>> recommended polling time:      (  85) minutes.
>> Conveyance self-test routine
>> recommended polling time:      (   2) minutes.
>> SCT capabilities:            (0x103f)    SCT Status supported.
>>                     SCT Error Recovery Control supported.
>>                     SCT Feature Control supported.
>>                     SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 10
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       229176154
>>   3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
>>   4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1150
>>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139281993
>>   9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27401
>>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       575
>> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
>> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
>> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
>> 188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       363
>> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
>> 190 Airflow_Temperature_Cel 0x0022   074   059   045    Old_age   Always       -       26 (Min/Max 26/27)
>> 194 Temperature_Celsius     0x0022   026   041   000    Old_age   Always       -       26 (0 11 0 0)
>> 195 Hardware_ECC_Recovered  0x001a   029   014   000    Old_age   Always       -       229176154
>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
>> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
>> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       228397770895185
>> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1399011614
>> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3373996330
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
>>
>>
>> SMART Selective self-test log data structure revision number 1
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Not_testing
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>
> When a drive is failing you often get error reports here.  If possible you
> should run the vendor's diagnostics (these often come as a bootable CD
> with FreeDOS).
>
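
On reading the attribute table quoted above: the TYPE column (Pre-fail or
Old_age) names the category of each attribute and is not a status.  An
attribute counts as failed only when its normalized VALUE drops to or
below THRESH, in which case WHEN_FAILED shows something other than "-".
A minimal Python sketch of that check follows, assuming the table has been
saved with each attribute on one line, as smartctl prints it.  The names
SmartAttr, parse_attributes and failing are made up for illustration, and
only a few rows from the table are reproduced.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized current value
    worst: int        # lowest normalized value recorded
    thresh: int       # vendor failure threshold
    attr_type: str    # "Pre-fail" or "Old_age": a category, not a status
    when_failed: str  # "-" if the attribute has never failed
    raw: str          # vendor-specific raw value, kept as text


def parse_attributes(text: str) -> List[SmartAttr]:
    """Parse attribute rows from `smartctl -A` style output."""
    attrs = []
    for line in text.splitlines():
        # Ten columns; the raw value may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skips the header and any non-attribute lines
        attrs.append(SmartAttr(
            int(fields[0]), fields[1], int(fields[3]), int(fields[4]),
            int(fields[5]), fields[6], fields[8], fields[9].strip(),
        ))
    return attrs


def failing(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Attributes at or below threshold, or flagged as having failed."""
    return [a for a in attrs
            if a.when_failed != "-" or a.value <= a.thresh]


# A few rows copied from the table in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       229176154
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139281993
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       363
190 Airflow_Temperature_Cel 0x0022   074   059   045    Old_age   Always       -       26 (Min/Max 26/27)
"""

if __name__ == "__main__":
    for attr in parse_attributes(SAMPLE):
        print(attr.name, "margin:", attr.value - attr.thresh)
    print("failing:", [a.name for a in failing(parse_attributes(SAMPLE))])
```

None of the sampled rows meets the failure condition.  The raw
Command_Timeout count of 363 is at least consistent with the timeouts in
kern.log, though raw values are vendor-specific and should be read with
caution.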
>
> _______________________________________________
> nSLUG mailing list
> nSLUG at nslug.ns.ca
> http://nslug.ns.ca/mailman/listinfo/nslug
>
>