Which computer hard drives are most reliable? And how often should you replace a hard drive? Those are some of the questions I hope to answer in exploring and analyzing data about hard drive tests. A company called Backblaze, which offers online data backup services, generously makes this data available to the public via its website. According to the description of the Backblaze Hard Drive Test Data, the company needs to have adequate hard drives that are both reasonably reliable and economically feasible. Over the last two years, Backblaze collected daily data on hard drives that were in service. Company researchers have published some conclusions on the best drive and a replacement schedule for drives, but they also were curious about what else could be found in the data. So let's see if my analysis agrees, disagrees, and/or finds something else.
My first impression of the data is: It is big. How big? There are 631 daily files, each about 4 MB or 9 MB, and the newer files are the bigger ones. The total size of all the CSV files is around 3.5 GB. Each CSV file seems to have the same format, which is very good. They all look like this:
Instead of concatenating the files in SQLite, I do that in JMP. The resulting JMP table has more than 17 million rows. The columns are: “date”, when the row was recorded; “serial_number”, which is the hard drive ID; “capacity_bytes”, which is the size of the hard drive; “failure”, which indicates whether that hard drive failed on that day; 40 columns of raw SMART statistics about the hard drive on that day; and 40 columns of normalized SMART statistics.
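For readers without JMP, the same stacking step can be sketched in Python. This is only a rough sketch: the function name concatenate_daily_files and the folder layout are my own assumptions, and it relies on the observation above that every daily file shares one header.

```python
import csv
import glob
import os


def concatenate_daily_files(input_dir, output_path):
    """Stack every daily CSV in input_dir into a single CSV at output_path.

    Assumes all files share the header of the first file (hypothetical helper,
    not the JMP procedure used in this post). Returns the number of data rows.
    """
    # Sorting by file name keeps the rows in date order when files are named by date.
    paths = sorted(glob.glob(os.path.join(input_dir, "*.csv")))
    rows_written = 0
    writer = None
    with open(output_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip completely empty files
                if writer is None:
                    # The first file defines the column order of the output.
                    writer = csv.DictWriter(
                        out,
                        fieldnames=reader.fieldnames,
                        restval="",
                        extrasaction="ignore",
                    )
                    writer.writeheader()
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Write the output file to a folder other than input_dir, so that a later run does not pick up the combined file as one more daily file.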
And the resulting JMP table is more than 12 GB! Hmm, what just happened? The description of the data source mentions that many SMART statistics in the data are missing. A missing value in a JMP table is still stored as a double-precision floating-point number, but in a CSV file it takes up no space at all between two commas. I guess there are a lot of missing values in the CSV files. That explains why the binary data file ends up much larger than the total of all the CSV files. We will come back to the missing value issue later.
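A quick back-of-the-envelope check supports this explanation. The sketch below assumes exactly 17 million rows, the 84 columns listed earlier, and 8 bytes per cell; the 8-byte figure is my assumption and is only approximate for character columns such as the serial number.

```python
ROWS = 17_000_000        # the table has "more than 17 million rows"
COLUMNS = 4 + 40 + 40    # date, serial_number, capacity_bytes, failure, raw and normalized SMART
BYTES_PER_CELL = 8       # assumption: every cell costs 8 bytes, missing or not


def binary_table_bytes(rows, columns, bytes_per_cell=BYTES_PER_CELL):
    """Size of a table in which every cell, missing or not, has a fixed width."""
    return rows * columns * bytes_per_cell


def csv_cell_bytes(value):
    """Bytes one cell adds to a CSV line, not counting the comma separator."""
    return len(value.encode("utf-8"))


estimate = binary_table_bytes(ROWS, COLUMNS)
print(f"{estimate / 1e9:.1f} GB")   # prints 11.4 GB
print(csv_cell_bytes(""))           # a missing value costs 0 bytes in the CSV
```

The estimate is in the same range as the observed size of more than 12 GB, while an empty CSV cell costs nothing beyond its comma.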
Now I want to look at the life distribution of a randomly selected hard drive available in the Backblaze warehouse, regardless of manufacturer and model. (I am assuming the recorded hard drives form a good sample that represents all the hard drives that they have.)
From the resulting data, we can collect the number of days to failure or censoring for every hard drive. In case you are not familiar with the term “censoring,” it means that the hard drive had not failed by the time its last record was saved. We usually say that failure and censoring are events, and we use “time-to-event” to refer to the time to failure or censoring. The time-to-event is calculated by computing the date range for each serial number. Using the calculated time-to-event values, we can compute a nonparametric estimate of the failure distribution from all the hard drives that they have used.
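The two steps described above can be sketched in Python. This is a minimal sketch under my own assumptions: the function names are hypothetical, the clock for each drive starts at its first record, and the nonparametric estimate is taken to be the Kaplan-Meier (product-limit) estimator, which is one common choice for right-censored data and not necessarily what JMP computed here.

```python
from collections import Counter
from datetime import date


def time_to_event(records):
    """Map serial_number -> (days between first and last record, failed flag).

    records: iterable of (serial_number, date, failure) tuples, one per drive per day.
    """
    first, last, failed = {}, {}, {}
    for serial, day, failure in records:
        if serial not in first or day < first[serial]:
            first[serial] = day
        if serial not in last or day > last[serial]:
            last[serial] = day
        failed[serial] = failed.get(serial, False) or bool(failure)
    return {s: ((last[s] - first[s]).days, failed[s]) for s in first}


def kaplan_meier_failure_probability(durations):
    """Return [(days, estimated failure probability)] at each observed failure time.

    durations: iterable of (days, failed); failed=False means right-censored.
    """
    failures, exits = Counter(), Counter()
    for days, was_failure in durations:
        exits[days] += 1
        if was_failure:
            failures[days] += 1
    at_risk = sum(exits.values())
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Product-limit step: survive time t with probability 1 - d / at_risk.
            survival *= 1.0 - d / at_risk
            curve.append((t, 1.0 - survival))
        # Drives that failed or were censored at t leave the risk set.
        at_risk -= exits[t]
    return curve


# Tiny made-up example: two drives, one of which fails on its third day of records.
toy = [
    ("X1", date(2013, 4, 10), 0), ("X1", date(2013, 4, 11), 0), ("X1", date(2013, 4, 12), 1),
    ("X2", date(2013, 4, 10), 0), ("X2", date(2013, 4, 14), 0),
]
durations = time_to_event(toy)
curve = kaplan_meier_failure_probability(durations.values())
```

Each point in the returned curve corresponds to one dot in a failure probability plot: the time in days and the estimated probability of failing by that time.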
The next screenshot shows the estimate. How do we interpret the plot? Each dot in the plot has two coordinates: the x-coordinate is the time in days, and the y-coordinate is the estimated probability that a randomly chosen hard drive will fail before that time. For example, a point around 300 days has a probability around 0.03. So if 100 randomly chosen hard drives start running on day one, we should expect about 3 of them to have failed by day 300.
Put as a statement, the failure probability (sometimes also called the failure rate) at 300 days is 0.03. Notice that the failure probability (failure rate) here is different from the failure rate that the Backblaze Hard Drive Test Data web page discusses; see their best hard drive blog. Their failure rate is related to, but does not seem to be exactly the same as, the recurrence rate of a renewal process in our terminology. I believe that I can use my failure rate to derive something similar to their rate, but not the other way around. I will look closer later.
If life were always easy, we could probably fit the data using a parametric distribution, e.g., Weibull, Lognormal, etc. If so, we should see the nonparametric estimate riding along a smooth curve without bumps. But we're not so lucky this time!
However, it is not surprising to see bumps or turning points in the estimate. According to the Wikipedia page on S.M.A.R.T., hard drive failures are of two types: predictable failures and unpredictable failures. It sounds like they are talking about failure modes: wear-out, and everything else. We could assume that each failure mode can be modeled by a single distribution, e.g., Weibull or Lognormal. Then we can apply the famous bathtub failure rate.
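To make the idea of one distribution per failure mode concrete, here is a small sketch with two independent Weibull failure modes. The shape and scale values are made up purely for illustration and are not fitted to the Backblaze data. A shape below 1 gives a decreasing hazard (early failures), a shape above 1 gives an increasing hazard (wear-out), and their sum has the bathtub form.

```python
import math


def weibull_cdf(t, shape, scale):
    """Probability that a single Weibull failure mode has struck by time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of one Weibull mode; requires t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def competing_risk_cdf(t, modes):
    """Failure probability when the drive fails at the first of several independent modes."""
    survival = 1.0
    for shape, scale in modes:
        survival *= 1.0 - weibull_cdf(t, shape, scale)
    return 1.0 - survival


def competing_risk_hazard(t, modes):
    """With independent modes, the overall hazard is the sum of the mode hazards."""
    return sum(weibull_hazard(t, shape, scale) for shape, scale in modes)


# Hypothetical (shape, scale in days) pairs: an early-failure mode and a wear-out mode.
MODES = [(0.5, 20000.0), (3.0, 1500.0)]

for days in (10, 300, 1500):
    print(days, competing_risk_hazard(days, MODES), competing_risk_cdf(days, MODES))
```

With these made-up parameters the hazard is high at 10 days, lower at 300 days, and high again at 1500 days, which is the pattern the bathtub curve describes.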
This is not that easy, either. It is surprising to see at least two obvious turning points in the nonparametric estimate. One is around 150 days, and the other is around 550 days. If the latter divides the unpredictable failures from the predictable ones, then does the first turning point tell us that we may need to classify the unpredictable failures into more than one mode?
If we take the turning point around 550 days as the place where failures become more predictable, it is surprising that failures accumulate at a gradually increasing pace before that point. Isn’t that counterintuitive? I would expect the unpredictable failures to slow down over time, which is what the bathtub failure rate phenomenon suggests.
Before we dive deeper, I want to take a look at the SMART statistics in the data. I chose to look at the statistic labelled SMART 187, because it is highlighted in a Backblaze blog post, Hard Drive SMART Stats, as one of the most promising variables for deciding whether a hard drive needs replacement.
I’ve made a simple scatterplot of the SMART 187 raw values against the ages of the hard drives when the corresponding SMART 187 values were recorded. The left panel shows the good hard drives, which had not failed as of their last records. The right panel shows the failed hard drives. According to the explanations in Hard Drive SMART Stats, the difference between the two plots is expected, i.e., the SMART 187 values in the right-hand panel appear to be larger than those in the left-hand panel. That is consistent with the claim that the higher the SMART 187 value is, the more likely the hard drive is to fail.
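The split behind the two panels can be sketched as follows. This is a rough sketch, not the JMP steps I used: the function name and the tuple layout are hypothetical, and a drive counts as failed if any of its daily records has the failure flag set.

```python
def split_smart_187_by_outcome(records):
    """Split (age_days, smart_187_raw) points into good-drive and failed-drive panels.

    records: iterable of (serial_number, age_days, failure, smart_187_raw) tuples,
    where smart_187_raw is None when the value is missing.
    """
    records = list(records)
    # A drive belongs to the failed panel if it failed on any recorded day.
    failed_serials = {serial for serial, _, failure, _ in records if failure}
    good_points, failed_points = [], []
    for serial, age_days, _, value in records:
        if value is None:
            continue  # missing SMART values cannot be plotted
        panel = failed_points if serial in failed_serials else good_points
        panel.append((age_days, value))
    return good_points, failed_points
```

Each returned list holds the points of one panel, ready to be passed to any plotting tool.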
No hard-to-understand surprises so far. Good. But those are probably well-known facts. Now what? We can’t be afraid to get our hands dirty, I guess.