Exploring the Backblaze Hard Drive Data

Abstract

This paper describes the steps we took to understand hard drive performance data from a large data set. The data set is provided by Backblaze, a cloud storage company. It captures the daily performance of approximately 50,000 hard drives over a period of nearly two years, with one record per day for each hard drive in service. Each record includes a time stamp, several demographics, a failure indicator, and eighty Self-Monitoring, Analysis and Reporting Technology (SMART) indicators. The first challenge is the sheer size of the data table. The second is to evaluate the data quality and ensure that it is adequate before analysis. The third is to organize the data into the different shapes required by the existing analytical and graphical tools in JMP. Finally, and of utmost importance, we analyze the data with the appropriate tools to understand what it reveals. We attempt to answer two key questions: which hard drive is the best, and whether hard drive failure is predictable. This work showcases JMP's far-reaching capabilities, including its data manipulation, exploration, and reliability platforms.

Bio

Peng Liu, PhD, is a Principal Research Statistician Developer for JMP. He is responsible for maintaining and developing the reliability, survival, and time series platforms. His work includes research, data analysis, and software architecture.

Leo Wright, CQE, is Principal Product Manager of Reliability and Quality for JMP. He has extensive experience in manufacturing, focused on quality and reliability engineering. Prior to joining SAS, he managed the quality organizations of several Fortune 500 companies.

Discovery Summit Europe 2016 Resources

Discovery Summit 2016 is over, but it's not too late to participate in the conversation!