Google Whitepaper on Disk Failures
December 14, 2010
http://labs.google.com/papers/disk_failures.pdf
Well, after reading the Google study, I have to question how the drives were contained or how the temperature was measured. I am 100% convinced that temperature does indeed affect hard drives. The question at this point is how and when.
I have had hard drives come in with obvious heat damage: arms and heads deformed by heat. I have seen burnt chips and physical damage to the platters caused by heat. I know temperature greatly affects recovery as well. I think this requires more review, and there may be something wrong with the way the temperature was collected. It appears they used SMART to collect that data. What if SMART is so flawed that the data is bad? The report itself acknowledges that some data reported by the devices was false, yet they still used SMART to gather it. I would question that!
On the other items, I certainly know that SMART is worthless, and I am not even sure the attributes it tracks hold correct data. In addition, my understanding is that the SMART data is occasionally cleared simply because only so much space is allocated for it in the service area (SA), so old data has to be cleared to store more. So how can it be accurate? It is just a complete waste.
I am also certain there are things missing. For instance, every drive has errors all the time, and if ECC can compensate for an error, the drive will not mark the sector bad. Did they export G-Lists and compare them over time? I am not sure how they did this, but wouldn't exporting the bad block tables and comparing them over time give more precise results than relying on the reallocation flag in the sectors? This data again looks like it came from SMART. If SMART is lying, I want the data in the tables that is not being reported.
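To make the idea of comparing exported tables concrete, here is a minimal sketch. It assumes each G-List export has already been reduced to a plain list of LBAs with the date it was taken; the export step itself is vendor-specific and not shown. The function names (`diff_glists`, `growth_per_day`) and the sample numbers are made up for illustration and are not from the Google study.

```python
from datetime import date


def diff_glists(older, newer):
    """Return (added, removed) LBAs between two G-List snapshots."""
    older_set, newer_set = set(older), set(newer)
    added = sorted(newer_set - older_set)
    removed = sorted(older_set - newer_set)
    return added, removed


def growth_per_day(snapshots):
    """Given (date, lbas) pairs sorted by date, return the number of
    newly listed LBAs per day for each consecutive pair of snapshots."""
    rates = []
    for (d0, lbas0), (d1, lbas1) in zip(snapshots, snapshots[1:]):
        days = (d1 - d0).days
        added, _ = diff_glists(lbas0, lbas1)
        rates.append(len(added) / days if days > 0 else 0.0)
    return rates


# Hypothetical snapshots of one drive (made-up LBAs and dates).
snapshots = [
    (date(2010, 1, 1), [1024, 2048]),
    (date(2010, 1, 11), [1024, 2048, 4096, 4097]),
    (date(2010, 1, 21), [1024, 2048, 4096, 4097, 4098, 9000, 9001, 9002]),
]

rates = growth_per_day(snapshots)
print(rates)
```

A rising rate between snapshots would be the kind of trend a single reallocation count cannot show. An LBA that disappears from a later export (the `removed` list) might hint that the table was reset or rewritten, which would be worth checking before trusting any totals.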
Ultimately, I guess the point is prediction of failure using SMART. And yes, on that they are correct: SMART fails to predict and possibly provides false data, maybe on purpose by the manufacturer. We already know the manufacturers lie, so why not report data wrong too? Manufacturers do not want you returning a drive every two months because SMART flagged it, and certainly not before the warranty runs out.