An overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that could be catastrophic.
On March 18, 2018, the first pedestrian fatality due to the operation of an autonomous vehicle occurred in Tempe, Arizona. Since then, almost 10,000 articles have been published on this accident, most of them espousing an opinion on what it all means for the future of Uber, autonomous vehicles, public-road AV testing, and even society at large.
What is missing from this cauldron of debate are the lessons that designers of autonomous sensor, software, and platform technologies can extract from this tragic event. Learning from it will be pivotal to the financial success of autonomous vehicles.
A fundamental challenge in learning from the Tempe fatality and in determining the value of ISO 26262 (the functional safety standard for road vehicles) lies in identifying the complementary and contradictory roles of reliability and safety. This is not a matter of semantics: every manager realizes that process, authority, and responsibility are the core of every software and hardware design cycle. Who does what, who reports to whom, and when they do it can result in dramatically different outcomes.
What is reliability, what is safety, and how should they relate to each other in a corporate environment? From the perspective of reliability engineers, safety is a subset of reliability. Why? While reliability focuses on the probability that a failure will occur, safety focuses on the probability that a failure will occur and result in a catastrophic event (loss, injury, or death).
Catastrophic events are just a small portion of the overall outlook being managed and tracked by the reliability team. Thus, in a reliability-centric world, safety engineers are managed by the reliability team and do not act until a thorough design-for-reliability (DfR) activity is complete.
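One way to read the "safety is a subset of reliability" argument is as a conditional probability: the chance of a catastrophic failure is the chance of any failure multiplied by the chance that a failure, once it occurs, is catastrophic. The sketch below is illustrative only; the function name and all numbers are hypothetical and are not taken from any standard or data set.

```python
def catastrophic_probability(p_failure: float, p_catastrophic_given_failure: float) -> float:
    """P(failure and catastrophic) = P(failure) * P(catastrophic | failure)."""
    for p in (p_failure, p_catastrophic_given_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_failure * p_catastrophic_given_failure


# Hypothetical numbers chosen only for illustration.
p_fail = 0.02    # reliability view: probability that the part fails at all
p_severe = 0.05  # assumed share of those failures that are catastrophic
p_cat = catastrophic_probability(p_fail, p_severe)
print(f"any failure: {p_fail:.4f}, catastrophic failure: {p_cat:.4f}")
```

Because the second factor can never exceed 1, the catastrophic probability can never exceed the overall failure probability, which is the sense in which the reliability engineer sees safety as a subset.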
Reliability and Safety interact
As one would expect, safety engineers do not share the same vision. From their viewpoint, reliability analyses only provide the probability of failure for a particular failure mechanism (reliability physics) or part (empirical approach). Reliability analyses have no context as to the consequence of failure: will it be catastrophic? Such analyses are therefore most effective when performed at the lowest level of the system. Because consequences are only clear at the system level, where the response of the system or the user to the failure can be considered, reliability engineers should report to the safety team.
In this view, the key function of reliability engineers is to calculate failure rates and identify basic failure modes. And since, sometimes, these failure rates are only numbers, why have a reliability engineer at all?
A third viewpoint is that reliability and safety are not as related as one would expect. A prime example of this philosophy is how the two disciplines would address fan performance. From a reliability perspective, the actions might be to ensure the fan meets failure rate goals for the expected environment through reliability physics analysis (RPA), derating, or accelerated life testing (ALT). From a safety perspective, the actions might be to determine whether fan failure would induce a catastrophic event (based on how the fan interacts with the rest of the system) and then introduce potential mitigations, such as redundancy or prognostics using drift or change in key parameters (current draw, tachometer, noise).
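The fan example can be sketched in a few lines. The code below assumes a constant failure rate and fully independent fans, which is a simplification (common-cause failures would reduce the benefit of redundancy), and it uses a plain relative-drift threshold as a stand-in for prognostics. All function names, rates, and thresholds are hypothetical.

```python
import math


def survival_probability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability a fan survives `hours`, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def redundant_survival(single_survival: float, n_fans: int) -> float:
    """Survival of n independent parallel fans (system fails only if all fail)."""
    return 1.0 - (1.0 - single_survival) ** n_fans


def drift_alarm(baseline: float, reading: float, tolerance: float = 0.15) -> bool:
    """Flag a reading whose relative drift from baseline exceeds `tolerance`."""
    return abs(reading - baseline) / baseline > tolerance


# Hypothetical values for illustration only.
r_single = survival_probability(failure_rate_per_hour=2e-6, hours=50_000)
r_pair = redundant_survival(r_single, n_fans=2)
print(f"single fan: {r_single:.4f}, redundant pair: {r_pair:.4f}")

# Current draw in amperes: 0.50 A when new, 0.60 A now (a 20% drift).
print(drift_alarm(baseline=0.50, reading=0.60))
```

The first two functions reflect the reliability question (how likely is the fan to fail?), while the redundancy and drift checks reflect the safety question (what limits the consequence if it does?).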
These different viewpoints highlight the uncertainty among technology companies on how to handle reliability and safety. One major consumer technology company that is transitioning to autonomous vehicles has Reliability and Safety reporting to the same Director. A second company, a leader in the autonomous field, has Safety and Reliability reporting to two different organizations, even though the leaders of both departments have roughly equivalent titles. A third company, a mainstay in automotive electronics that is aggressively targeting autonomous control units, also has Safety and Reliability in two different organizations, but clearly has a favorite, as shown by the numerous executive titles assigned to Safety (while the highest-ranking reliability staffer is either Manager or Leader).
Without a clear and consistent construct for how reliability and safety interact and build upon each other, the automotive industry is creating avoidable conflict and potential miscommunication that will either put customers at unnecessary risk, create autonomous systems that are excessively expensive, or both. One autonomous vehicle manufacturer had such uncertain confidence in reliability, or such unlimited authority of the safety team, that it introduced redundancy throughout the vehicle (including sensing, control, power, braking, etc.). Given that the average car has, by some estimates, over $12,000 of electronics, this introduces significant costs without necessarily making the occupants, or the traffic around them, that much safer.
A perfect example of this issue is the divergence between safety and reliability in how to calculate failure rates. From the 1950s through the 1990s, most reliability practitioners in electronic hardware organizations used empirical handbooks to calculate failure rates. These handbooks were simply aggregations of field failure data, sorted by part technology (resistor, capacitor, diode, etc.). While simple in concept and execution, repeated studies demonstrated that these handbooks were wildly inaccurate when applied to actual products, with the error leaning toward the conservative side, that is, over-predicting the failure rate.
The reason was straightforward: these handbooks were not based on the actual mechanisms that cause failure. Fast forward to the 21st century and most skilled reliability practitioners no longer rely exclusively on empirical field data to predict failure rates. Reliability physics analysis (RPA) and accelerated life testing (ALT) replaced these outmoded approaches, and nowhere was this truer than in the automotive industry. Until ISO 26262 came along.
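A handbook-style prediction of the kind described above can be sketched as a simple parts count: look up a rate per part type, multiply by quantity, and sum. The per-part rates below are placeholders and are not taken from any real handbook; the function names are likewise hypothetical.

```python
# Hypothetical per-part failure rates in FIT (failures per 10^9 device-hours).
# These are placeholders, not values from any real handbook.
HANDBOOK_FIT = {"resistor": 1.0, "capacitor": 2.0, "diode": 5.0}


def parts_count_fit(bill_of_materials: dict[str, int]) -> float:
    """Sum quantity * per-part FIT over the bill of materials."""
    return sum(HANDBOOK_FIT[part] * qty for part, qty in bill_of_materials.items())


def mtbf_hours(total_fit: float) -> float:
    """Convert a constant failure rate in FIT to mean time between failures."""
    return 1e9 / total_fit


bom = {"resistor": 120, "capacitor": 80, "diode": 10}
total = parts_count_fit(bom)
print(f"predicted rate: {total:.0f} FIT, MTBF: {mtbf_hours(total):,.0f} hours")
```

Note that the result depends only on part type and quantity. Nothing in the calculation reflects how each part is actually stressed in the product, which is the core of the criticism of this approach.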
Avoiding the disconnect
As a functional safety standard, ISO 26262 requires the computation of failure rates and the appropriate mitigations to demonstrate that the assigned safety integrity level is met (in ISO 26262 terms, the Automotive Safety Integrity Level, or ASIL). The safety community, unlike reliability engineers, strongly encourages or even requires empirical prediction handbooks as the basis of these calculations. This disconnect is driven by the lack of a universal construct between reliability and safety. Creating separate organizations reporting to separate management has led to a breakdown in communication, causing safety engineers to use outmoded approaches for failure rate calculations.
In addition, without a balance between the two groups, safety teams will tend to prefer higher failure rates, which in turn require additional safety analyses and safety mitigations, including redundancy. The safety team's focus on simple handbook calculations will also result in critical failure modes being overlooked, such that safety mitigations are no longer effective.
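The cost effect of an over-conservative failure rate can be illustrated with a toy decision rule: if the predicted rate for a single channel exceeds the budget, redundancy is added. The rule, the budget, and both predicted rates below are hypothetical and serve only to show how the same design can lead to different conclusions depending on the prediction method.

```python
def needs_redundancy(predicted_fit: float, target_fit: float) -> bool:
    """Return True when a single channel's predicted rate misses the target."""
    return predicted_fit > target_fit


# Hypothetical numbers for illustration only.
TARGET_FIT = 100.0    # assumed failure rate budget for one function
handbook_fit = 330.0  # conservative handbook-style prediction
physics_fit = 60.0    # assumed result of a reliability physics analysis

for label, fit in (("handbook", handbook_fit), ("reliability physics", physics_fit)):
    if needs_redundancy(fit, TARGET_FIT):
        verdict = "add redundancy"
    else:
        verdict = "single channel suffices"
    print(f"{label}: {fit:.0f} FIT -> {verdict}")
```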
There is still an opportunity for improvement. Players in autonomous technology, from semiconductors to electronic modules to overall systems, must realize that an overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that will be difficult to untangle.
A good first step is to make sure that reliability and safety are within the same organization, reporting to a neutral observer. Both sides should agree to implement best practices, including the use of state-of-the-art simulation, modeling, and reliability physics, to lay the groundwork for appropriate and effective risk identification and mitigation.
Author: Craig Hillman
Source: SAE Automotive Vehicle Engineering Magazine