With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on RAS logs alone has proved insufficient for understanding failures and system behaviors. To overcome this limitation of existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From this co-analysis, we have identified a dozen important observations about failure characteristics and job interruption characteristics on Blue Gene/P systems. These observations can significantly facilitate research in fault resilience of large-scale systems.
