[nSLUG] Hard drive or power supply?

George N. White III gnwiii at gmail.com
Mon Dec 15 10:12:00 AST 2014


On Sun, Dec 14, 2014 at 11:17 AM, Joel Maxuel <j.maxuel at gmail.com> wrote:

> Another hardware question, be it having to do with a hard drive!
>
> Last week I was browsing on my PC, I opened a page, and my browser locked
> up, I had a heck of a time to kill iceweasel, to restart it, as it turned
> zombie.  I ended up logging out and logging back in.
>
> A day later I was still having troubles, so I threw into console mode, and
> I found (as I have a mining script running there) a bunch of DRDY statuses!
>
> Rebooted my PC, 0.5TB HDD wouldn't come up from the dead.  A few reboots
> later, a shutdown, replaced the SATA cable and tried a different power
> connector off the PSU, I was back in business, no more freezes or DRDY
> statuses.  I suspected it was a bad connector on my power supply.
>
> That was a week ago.  Today, fed up with iceweasel complaining that adobe
> was out of date I did an aptitude update && aptitude upgrade.  Halfway
> through the second portion I got another freeze.  My full transcript of the
> errors are below.
>
> I don't like the smartctl stats, but could the pre-fail values be due to
> the fact that the PSU was messing up for a day and a half (at least) before
> anything was done about it?  Which part is to blame at this point, or would
> it be both the PSU and HDD?
>
> Running backups more often in the meantime.
>
> My tail of kern.log:
>
> Dec 14 10:40:52 cybaryme kernel: [391375.808041] ata3: lost interrupt (Status 0x50)
> Dec 14 10:40:52 cybaryme kernel: [391375.808062] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> Dec 14 10:40:52 cybaryme kernel: [391375.808129] ata3.01: failed command: WRITE DMA
> Dec 14 10:40:52 cybaryme kernel: [391375.808192] ata3.01: cmd ca/00:08:48:c9:ad/00:00:00:00:00/f0 tag 0 dma 4096 out
> Dec 14 10:40:52 cybaryme kernel: [391375.808194]          res 40/00:01:09:4f:c2/00:00:00:00:00/10 Emask 0x4 (timeout)
> Dec 14 10:40:52 cybaryme kernel: [391375.808329] ata3.01: status: { DRDY }
> Dec 14 10:40:52 cybaryme kernel: [391375.808385] ata3: soft resetting link
> Dec 14 10:40:52 cybaryme kernel: [391375.996347] ata3.01: configured for UDMA/133
> Dec 14 10:40:52 cybaryme kernel: [391375.996367] ata3: EH complete
>
> SmartCTL:
>
> joel@cybaryme:~$ sudo smartctl --all /dev/sda
> smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.12
> Device Model:     ST3500418AS
> Serial Number:    9VMTNDK4
> LU WWN Device Id: 5 000c50 02d3772ce
> Firmware Version: CC46
> User Capacity:    500,107,862,016 bytes [500 GB]
> Sector Size:      512 bytes logical/physical
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Sun Dec 14 10:54:27 2014 AST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>

SMART catches maybe 50% of drive failures, so a PASSED result is only
mildly encouraging.

Whenever a drive has problems you should run the long self-test:

    sudo smartctl -t long /dev/sda
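
A minimal sketch of that workflow, assuming /dev/sda is the suspect drive
as in the report. The function names and the SMARTCTL variable are made up
for illustration; the smartctl options used (-t long, -l selftest, -H) are
the standard ones.

```shell
# SMARTCTL is an illustrative variable, not something smartctl itself reads.
# Set it to "sudo smartctl" if you are not running as root.
SMARTCTL="${SMARTCTL:-smartctl}"

# Start the extended self-test. The drive runs it in the background,
# so this returns right away.
start_long_test() {
    $SMARTCTL -t long "$1"
}

# Print the self-test log, then the overall health verdict.
show_test_result() {
    $SMARTCTL -l selftest "$1"
    $SMARTCTL -H "$1"
}
```

Run "start_long_test /dev/sda", wait about as long as the drive's
extended self-test polling time (85 minutes in the report), then run
"show_test_result /dev/sda" to see whether the test completed or stopped
at a bad LBA.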




> General SMART Values:
> Offline data collection status:  (0x82)    Offline data collection activity
>                     was completed without error.
>                     Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0)    The previous self-test routine
> completed
>                     without error or no self-test has ever
>                     been run.
> Total time to complete Offline
> data collection:         (  600) seconds.
> Offline data collection
> capabilities:              (0x7b) SMART execute Offline immediate.
>                     Auto Offline data collection on/off support.
>                     Suspend Offline collection upon new
>                     command.
>                     Offline surface scan supported.
>                     Self-test supported.
>                     Conveyance Self-test supported.
>                     Selective Self-test supported.
> SMART capabilities:            (0x0003)    Saves SMART data before entering
>                     power-saving mode.
>                     Supports SMART auto save timer.
> Error logging capability:        (0x01)    Error logging supported.
>                     General Purpose Logging supported.
> Short self-test routine
> recommended polling time:      (   1) minutes.
> Extended self-test routine
> recommended polling time:      (  85) minutes.
> Conveyance self-test routine
> recommended polling time:      (   2) minutes.
> SCT capabilities:            (0x103f)    SCT Status supported.
>                     SCT Error Recovery Control supported.
>                     SCT Feature Control supported.
>                     SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       229176154
>   3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1150
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       139281993
>   9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27401
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       575
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       363
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   074   059   045    Old_age   Always       -       26 (Min/Max 26/27)
> 194 Temperature_Celsius     0x0022   026   041   000    Old_age   Always       -       26 (0 11 0 0)
> 195 Hardware_ECC_Recovered  0x001a   029   014   000    Old_age   Always       -       229176154
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       228397770895185
> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1399011614
> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3373996330
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
>
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>

When a drive is failing you often get error reports here.  If possible you
should run the vendor's diagnostics (these often come as a bootable CD with
FreeDOS).
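
While the logs stay empty, it may help to watch a few raw counters from the
attribute table between backups. A rough sketch follows; smart_watch is a
hypothetical helper name, and the choice of IDs (5, 187, 188, 197, 198, 199)
is my assumption about which counters are worth tracking, not a vendor
recommendation.

```shell
# Read `smartctl -A` output on stdin and print "ID NAME RAW_VALUE" for a
# handful of attributes. Assumes one attribute per line, as smartctl
# prints it, with RAW_VALUE starting in field 10.
smart_watch() {
    awk '$1 ~ /^(5|187|188|197|198|199)$/ { print $1, $2, $10 }'
}
```

Typical use would be "sudo smartctl -A /dev/sda | smart_watch", run now and
again after the next freeze, to see whether any of the counts have moved.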