
Knowledge sources for RE

From AI KR (Artificial Intelligence Knowledge Representation) Community Group

Literature Review: Reliability Engineering — Current Themes and Advances

--- PLEASE HELP COMPILING/EDITING/FORMATTING/STYLING THIS DOC

Please use the suggested sources or add new sources (must have an open license and relate to reliability engineering and knowledge/natural language/AI/LLM).


Towards a Science of AI Agent Reliability 2026 https://arxiv.org/pdf/2602.16666

Consistency (repeatability), Robustness (stability under perturbation), Predictability (calibrated failure), and Safety (bounded error severity).

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents 2026 https://arxiv.org/pdf/2603.29231

Reliability Decay (performance vs. task duration), Variance Amplification (stochastic failure scaling), and Graceful Degradation.


LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance 2025 https://arxiv.org/pdf/2510.11905 Knowledge Reliability (robustness of internal truth representations) and Generalizability.

Evaluating large language models: a systematic review...

2025 https://www.frontiersin.org/journals/computer-science/articles/10.3389/fcomp.2025.1523699/full

Auditability (governance, model, and application layers), Ethical Standards, and Temporal Accuracy.

FUTURE-AI: trustworthy and deployable AI in healthcare

2025 https://www.bmj.com/content/388/bmj-2024-081554

Fairness, Universality, Traceability, Usability, Robustness, and Explainability.

Human–AI Complementarity in Peer Review

2026 https://www.mdpi.com/2304-6775/14/1/1

Semantic Convergence (consistency of themes) and Syntactic Complexity.

Synthesized Main Categories

1. Technical Reliability Dimensions

- **Consistency & Reproducibility**: ability to produce repeatable, stable outcomes.
- **Robustness & Generalization**: stability under prompt perturbations or distribution shifts.
- **Predictability**: the AI's ability to self-calibrate and signal likely failure points.
- **Knowledge Stability (Anti-Brittleness)**: ensuring truthfulness is based on reasoning rather than superficial resemblance.

2. Socio-Technical & Ethical Dimensions

- **Safety**: ensuring failures remain within bounded, non-severe limits.
- **Fairness & Equity**: absence of bias across demographic groups.
- **Explainability & Transparency**: capacity to provide understandable logic for decisions.
- **Accountability & Traceability**: audit logs and versioning for failure identification.

Agent-Specific Reliability Metrics (Rabanser et al., 2026)

Focusing specifically on agentic systems, research now formally decomposes reliability into four primary mathematical components:

- **Consistency (R_Con)**: measures repeatability of outcomes across identical scenarios.
- **Robustness (R_Rob)**: measures stability when faced with variations in prompts or environments.
- **Predictability (R_Pred)**: assesses how well a system's confidence aligns with its actual success rate.
- **Safety (R_Saf)**: quantifies the risk and severity of harmful failure modes.




Reliability engineering has evolved considerably since its origins in the 1950s aerospace and defence industries. At its core, the field addresses a deceptively persistent problem: quantifying and managing uncertainty about whether a system will perform its intended function under specified conditions over a given time. A large-scale bibliometric analysis covering over 30,000 reliability papers found that **the fundamental challenge of uncertainty in system representation has remained essentially unchanged** for seven decades, even as the mathematical tools available to address it have grown dramatically (Brown, 2024). What has shifted is the scale, complexity, and data richness of the systems being studied — and consequently, the methods deployed to analyse them.

This review synthesises recent open-access literature across five interrelated themes: probabilistic modelling and distributions; system reliability function and configuration; availability and maintainability; statistical inference and parameter estimation; and the emerging frontier of data-driven and intelligent reliability methods.

---

2. Probabilistic Foundations: Distributions and Failure Modelling

The mathematical backbone of reliability engineering rests on probability distributions that describe time-to-failure behaviour. The **Weibull distribution** remains the workhorse of the field due to its flexibility across the full failure-rate bathtub curve: decreasing hazard (infant mortality), constant hazard (useful life), and increasing hazard (wear-out). The **log-normal** and **normal distributions** are widely used for fatigue and degradation phenomena, while the **exponential distribution** underpins the memoryless failure models central to Markov chain analysis.
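As an illustrative sketch (not drawn from any of the cited sources), the Weibull survival and hazard functions, with the conventional shape parameter beta and scale parameter eta, can be written directly:

```python
import math

def weibull_reliability(t, beta, eta):
    """Survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """Hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# The shape parameter beta selects the bathtub regime:
#   beta < 1  decreasing hazard (infant mortality)
#   beta == 1 constant hazard (exponential special case)
#   beta > 1  increasing hazard (wear-out)
```

At t = eta the survival probability is exp(-1) for any beta, which is why eta is often called the characteristic life.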

Structural reliability analysis, particularly in civil and mechanical engineering, relies heavily on **load-strength interference** models. Here the probability of failure is computed as the probability that the applied load exceeds the structural capacity — a problem requiring joint distributional assumptions. Modern work increasingly couples these with **digital twin frameworks**, where a Bayesian dynamic model is continuously updated from sensor observations to reduce parameter uncertainty over the system's operational life (dynamic Bayesian network approach, validated on fatigue crack growth in metallic structures; cf. structural digital twin literature, 2023–2024).

The **Fisher-Tippett theorem** and extreme value theory underpin reliability assessments in safety-critical domains such as flood defence, wind loading, and offshore structures, where the distribution of the maximum (or minimum) load over a long period is the key quantity of interest.

---

3. System Reliability: Configurations and Complexity

A fundamental task in system reliability engineering is computing the **system reliability function** — the probability that the system functions — given knowledge of component-level reliabilities and the system architecture.

3.1 Series and Parallel Configurations

Series systems, where every component must function for the system to succeed, exhibit reliability that is strictly lower than any individual component. Parallel systems (active redundancy) tolerate component failure, raising system reliability at the cost of added components. Real engineering systems are overwhelmingly **combinations of series and parallel sub-systems**, sometimes requiring decomposition methods such as the **conditional probability method** or the **delta-star transformation** to evaluate architectures that do not reduce to simple series-parallel trees.
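These two rules are simple enough to state as code. The sketch below (illustrative only; the function names are ours) composes series and active-parallel blocks for components with reliability 0.9:

```python
from math import prod

def r_series(rs):
    """Series: the system works only if every component works."""
    return prod(rs)

def r_parallel(rs):
    """Active-parallel: the system works if at least one component works."""
    return 1 - prod(1 - r for r in rs)

# Two active-parallel pairs placed in series, each component at 0.9:
pair = r_parallel([0.9, 0.9])     # 1 - 0.1*0.1 = 0.99
system = r_series([pair, pair])   # 0.99**2 = 0.9801
```

Architectures that do not reduce to nested calls of these two functions (bridge networks, for example) are exactly the cases that require conditional-probability decomposition or the delta-star transformation.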

The **reliability block diagram (RBD)** is the canonical graphical tool for encoding system structure. Its modern descendant, the **universal generating function (UGF)**, extends analysis to multi-state systems — components that can operate at partial capacity — enabling more realistic modelling of degraded operation rather than binary function/failure.

3.2 Redundancy Strategies

Redundancy is the primary engineering lever for enhancing system reliability beyond what individual component quality can achieve. The literature distinguishes:

- **Active (hot) redundancy**: all redundant units operate simultaneously and share the load.
- **Standby (cold) redundancy**: backup units remain dormant until activated by a switching mechanism.
- **Shared-load parallel configurations**: load redistributes among surviving units, altering each survivor's failure rate.

The **redundancy allocation problem (RAP)** — choosing the type and number of redundant components per subsystem to maximise reliability subject to cost, weight, or volume constraints — has attracted sustained optimisation research. A comprehensive survey of the RAP history (Tandfonline, 2024) documents progression from early mathematical programming formulations through metaheuristic approaches (genetic algorithms, particle swarm optimisation) to recent multi-objective and multi-state extensions. Open-access work demonstrates improved particle swarm optimisation for classic series, series-parallel, and bridge configurations (Marouani, 2021), while more recent contributions handle time-dependent failure rates using Erlang distributions and genetic algorithms (MDPI Mathematics, 2023).
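A minimal brute-force formulation makes the RAP concrete. The sketch below is illustrative only (it is none of the surveyed algorithms, and all names and values are ours): it enumerates active-parallel unit counts per subsystem and keeps the most reliable allocation within a cost budget.

```python
from itertools import product
from math import prod

def rap_brute_force(r, c, budget, max_units=4):
    """Exhaustive search over unit counts per subsystem: maximise the
    series reliability of active-parallel groups subject to a cost budget.
    r[i] is the single-unit reliability of subsystem i, c[i] its unit cost."""
    best_rel, best_counts = 0.0, None
    for counts in product(range(1, max_units + 1), repeat=len(r)):
        cost = sum(n * ci for n, ci in zip(counts, c))
        if cost > budget:
            continue
        rel = prod(1 - (1 - ri) ** n for ri, n in zip(r, counts))
        if rel > best_rel:
            best_rel, best_counts = rel, counts
    return best_rel, best_counts

# Three subsystems, a budget of 10 cost units:
rel, alloc = rap_brute_force(r=[0.8, 0.9, 0.85], c=[2, 3, 1], budget=10)
```

Exhaustive search scales exponentially in the number of subsystems, which is precisely why the literature turned to mathematical programming and then to metaheuristics such as genetic algorithms and particle swarm optimisation.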

3.3 Complex System Analysis

For systems whose architecture cannot be expressed as series-parallel combinations, methods such as the **conditional probability decomposition**, **binary decision diagrams**, and **Monte Carlo simulation** are standard. The **system-of-systems (SoS)** paradigm introduces a further layer of complexity: constituent systems are themselves independently managed, and reliability properties emerge from network-level interactions, propagation of disturbances, and dynamic reconfiguration. A Springer Nature review (2025) identifies modelling of SoS reliability, network reliability within SoS, disturbance propagation, and SoS reliability management as the key open challenges in this domain.

---

4. Availability and Maintainability

Reliability — the probability of failure-free operation over a fixed mission duration — is distinct from **availability**, which captures the proportion of time a repairable system is in an operational state. The relationship between the two depends on the **maintainability function**, which describes the distribution of repair times.

Three availability metrics are commonly distinguished:

- **Intrinsic availability**: considers only active repair time (excludes logistics and administrative delays).
- **Mission availability**: the probability of being operational throughout a specified mission.
- **Steady-state availability**: the long-run fraction of time spent operational.

**Operational readiness** and **system effectiveness** are composite measures combining availability, dependability, and capability — critical for defence and aerospace procurement.

4.1 Markov Chain Methods

When components transition among multiple states (e.g., fully operational, degraded, failed) at exponentially distributed rates, **Markov chains** provide an exact framework for computing steady-state and transient availability. For systems with complex maintenance policies and age-dependent failure rates, the memoryless property of the exponential distribution becomes restrictive. Recent work addresses this through **semi-Markov processes** and **phase-type distributions**, which allow non-exponential dwell times while preserving the tractability of the Markov framework.
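For the simplest two-state (up/down) model with constant failure rate lam and repair rate mu, both the steady-state and transient availability have closed forms, sketched below (illustrative parameter values; a0 is the initial probability of starting in the up state):

```python
import math

def steady_state_availability(lam, mu):
    """Two-state Markov model with failure rate lam and repair rate mu:
    A_inf = mu / (lam + mu), i.e. MTBF / (MTBF + MTTR)."""
    return mu / (lam + mu)

def transient_availability(t, lam, mu, a0=1.0):
    """Pointwise availability A(t), decaying exponentially from the
    initial up-probability a0 toward the steady-state value."""
    a_inf = mu / (lam + mu)
    return a_inf + (a0 - a_inf) * math.exp(-(lam + mu) * t)

# One failure per 1000 h, one repair per 10 h:
a = steady_state_availability(0.001, 0.1)   # ~0.9901
```

Larger state spaces replace the closed form with the stationary equations pi Q = 0 solved numerically; the semi-Markov and phase-type extensions mentioned above relax the exponential dwell-time assumption.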

4.2 Integrating Reliability and Operations Management

A 2023 open-access review (Jin, Frontiers of Engineering Management / PMC) argues that reliability-redundancy allocation, preventive maintenance scheduling, and spare parts logistics are too often optimised in isolation — producing local optima. The paper advocates an integrated **product-service system** perspective, where the manufacturer retains responsibility for system availability throughout its operational life, creating incentives to co-optimise all three decision domains jointly.

---

5. Statistical Inference and Parameter Estimation

Translating reliability theory into engineering practice requires estimating distribution parameters from failure data — often sparse, expensive, and subject to censoring.

5.1 Maximum Likelihood Estimation and Confidence Intervals

**Maximum likelihood estimation (MLE)** is the dominant approach, providing consistent and asymptotically efficient parameter estimates for parametric families (Weibull, log-normal, exponential). **Confidence intervals** for reliability metrics — critical for engineering decisions and regulatory compliance — are typically derived from the Fisher information matrix or via bootstrap resampling. The **chi-squared distribution** plays a central role in constructing exact confidence bounds for exponential lifetime data.

**Failure censoring** (Type I and Type II) is ubiquitous: test programmes are stopped after a fixed time or number of failures, and field returns are reported only up to the observation date. MLE methods are well-adapted to censored data; Bayesian approaches offer additional flexibility through informative prior distributions.
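For the exponential model, censored-data MLE reduces to a one-line estimator: observed failures divided by total time on test. A minimal sketch with hypothetical data:

```python
def exp_mle_censored(failure_times, censor_times):
    """MLE of an exponential failure rate under right censoring:
    lambda_hat = observed failures / total time on test."""
    r = len(failure_times)
    total_time = sum(failure_times) + sum(censor_times)
    return r / total_time

# Five observed failures plus three units still running at 500 h:
lam_hat = exp_mle_censored([120, 310, 405, 480, 490], [500, 500, 500])
mtbf_hat = 1 / lam_hat
# Under Type II censoring, 2*r*mtbf_hat/MTBF follows a chi-squared
# distribution with 2r degrees of freedom, giving exact confidence
# bounds on the true MTBF.
```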

5.2 Non-Parametric Methods

When distributional assumptions are uncertain, **non-parametric reliability analysis** — using the Kaplan-Meier estimator and Nelson-Aalen cumulative hazard estimator — provides model-free survival estimates. **Q-Q plots** serve as graphical diagnostics for distributional fit, plotting empirical quantiles against theoretical quantiles. The **statistical tolerance** and **tolerance interval** framework, addressing the proportion of a population covered by a statistical bound, is distinct from (but complementary to) point estimation and is relevant to design-for-reliability specifications.
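The Kaplan-Meier estimator itself fits in a few lines. A minimal sketch (hypothetical data; tie-handling conventions and variance estimation omitted):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate; events: 1 = failure, 0 = censored.
    Returns (time, S(t)) pairs at each observed failure time."""
    n_at_risk = len(times)
    s, curve = 1.0, []
    for t, event in sorted(zip(times, events)):
        if event:
            s *= (n_at_risk - 1) / n_at_risk  # multiply in survival fraction
            curve.append((t, s))
        n_at_risk -= 1  # censored units leave the risk set without a step
    return curve

# Failures at 2 and 5, one unit censored at 6:
curve = kaplan_meier([2, 5, 6], [1, 1, 0])   # S(2) = 2/3, S(5) = 1/3
```

The censored unit contributes to the risk set before its censoring time but produces no downward step, which is exactly how censoring is used without being treated as failure.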

5.3 Accelerated Life Testing

**Accelerated life testing (ALT)** applies elevated stress (temperature, voltage, vibration) to induce failures rapidly, then extrapolates to normal operating conditions using a physically motivated model (Arrhenius, inverse power law). The **Fisher-Tippett theorem** underpins extreme-value extrapolation in this context. Reliability **acceptance sampling plans** operationalise these ideas for production testing: specifying sample sizes and decision rules that control producer and consumer risks at stated confidence levels.
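For temperature stress, the extrapolation step is often an Arrhenius acceleration factor. A minimal sketch (the 0.7 eV activation energy is an assumed, illustrative value, not taken from the sources above):

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(t_use_c, t_stress_c, ea_ev):
    """Arrhenius acceleration factor between a stress temperature and the
    use temperature (Celsius inputs, activation energy in eV)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1 / t_use - 1 / t_stress))

# With an assumed activation energy of 0.7 eV, each hour at 125 C
# corresponds to arrhenius_af(55, 125, 0.7) hours at 55 C.
af = arrhenius_af(55, 125, 0.7)
```

The acceleration factor is extremely sensitive to the assumed activation energy, which is why ALT programmes invest heavily in validating the physical failure mechanism before extrapolating.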

---

6. Data-Driven and Intelligent Reliability Methods

The most dynamic frontier in contemporary reliability engineering is the integration of machine learning and sensor data into the classical probabilistic framework.

6.1 Prognostics and Health Management

**Prognostics and Health Management (PHM)** aims to diagnose current system health and predict **remaining useful life (RUL)** in real time. A comprehensive 2024 review (Su & Lee, University of Maryland / International Journal of PHM, open access) surveyed ML approaches demonstrated in PHM Data Challenge Competitions from 2018 to 2023. Key findings include: deep learning models — particularly convolutional neural networks (CNN), recurrent architectures (LSTM), and transformer-based models — increasingly dominate RUL prediction; ensemble methods (XGBoost, LightGBM) remain competitive for classification tasks; and feature engineering retains high importance even in deep learning pipelines.

6.2 Physics-Informed and Hybrid Models

A tension exists between pure data-driven models and the physics-based models that reliability engineers have traditionally used. The emerging consensus favours **hybrid approaches**: fusing physics-based degradation models (crack growth, fatigue accumulation) with ML to improve extrapolation beyond the training distribution and enforce physical consistency. Such models are particularly important in safety-critical domains where training data are scarce.

6.3 Digital Twins

The **digital twin (DT)** — a continuously updated virtual replica of a physical system — has become a unifying concept connecting sensor data, physics models, and decision support. In structural applications, a DT assimilates real-time strain or vibration data using Bayesian inference to update a probabilistic structural model, which then informs optimal maintenance decisions. Dynamic Bayesian networks (DBN) provide a natural probabilistic graphical model for this assimilation loop. A 2025 PMC open-access review documents the rapid growth of DT research in civil infrastructure, covering roads, bridges, railways, and dams, with particular emphasis on remaining service life prediction and maintenance optimisation.
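The assimilation loop can be illustrated in its simplest conjugate form: a Gamma prior on an exponential failure rate, updated each monitoring cycle. This is a stand-in sketch for a full dynamic Bayesian network, with all values hypothetical:

```python
def gamma_update(alpha, beta, n_failures, exposure_time):
    """Conjugate Bayesian update for an exponential failure rate:
    Gamma(alpha, beta) prior -> Gamma(alpha + n, beta + T) posterior."""
    return alpha + n_failures, beta + exposure_time

# Prior belief: mean rate alpha/beta = 1e-3 per hour.
alpha, beta = 2.0, 2000.0

# Each monitoring cycle feeds (observed failures, exposure hours) back in:
for n, t in [(0, 500.0), (1, 500.0), (0, 500.0)]:
    alpha, beta = gamma_update(alpha, beta, n, t)

posterior_mean_rate = alpha / beta   # 3/3500 after the three cycles
```

The same update-as-data-arrives pattern, scaled up to structural parameters and sensor likelihoods, is what the dynamic Bayesian network formulation provides in the digital twin literature.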

6.4 Industry 4.0 and IoT-Enabled Monitoring

The broader Industry 4.0 context — IoT sensors, cloud computing, edge processing — has transformed the data availability landscape for reliability engineers. Real-time condition monitoring, automated fault detection, and self-diagnosis capabilities that were previously confined to high-value aerospace assets are becoming economically viable across manufacturing and infrastructure sectors. An arXiv review (2024) identifies sensor-based monitoring supported by ML as the most impactful innovation in reliability engineering over the past decade, alongside advances in redundant and fault-tolerant system architectures.

---

7. Open Challenges and Future Directions

Several challenges recur across the reviewed literature:

1. **Uncertainty quantification at scale**: as systems grow in complexity, propagating and communicating uncertainty through multi-level models remains computationally and conceptually demanding.
2. **Data scarcity and class imbalance**: failures are rare by design; ML models trained on imbalanced datasets require careful handling (oversampling, cost-sensitive learning, transfer learning from related systems).
3. **System-of-systems reliability**: existing theory handles well-defined, static architectures poorly when constituent systems are independently governed and dynamically reconfigurable.
4. **Integration across the product lifecycle**: the gap between reliability-in-design (allocation, block diagram analysis) and reliability-in-operation (PHM, maintenance optimisation) remains wider than it should be; the integrated product-service system paradigm is a promising bridge.
5. **Open access to data and methods**: as Brown (2024) notes, the fragmented, publisher-controlled nature of the academic literature itself hinders meta-analysis and cumulative knowledge-building in the field.

---

8. Conclusion

Reliability engineering in the 2020s stands at an inflection point. The classical probabilistic toolkit — Weibull analysis, Markov chains, redundancy allocation, MLE — retains its central importance as the language through which system dependability is defined and communicated. But the field is being rapidly augmented by data-driven methods, digital twins, and physics-informed ML models that leverage the unprecedented sensor data now available from operational systems. The greatest opportunity lies in connecting these traditions: using classical structural knowledge to regularise data-hungry models, and using real operational data to calibrate and update probabilistic models that would otherwise rest on sparse laboratory testing.

---

Key Open-Access Sources


[1] Bazargan-Harandi, H. — Reliability Engineering: Theory and Practice

   BCcampus Open Textbook, 2023
   https://opentextbc.ca/oerdiscipline/wp-content/uploads/sites/213/2018/09/Reliability-Engineering.pdf

[2] Jiang et al. — Forty Years of QREI: Trends and Patterns 1985–2024

   Quality & Reliability Engineering International (Wiley), 2024
   https://onlinelibrary.wiley.com/doi/full/10.1002/qre.70114

[3] Jin, T. — Bridging Reliability and Operations Management for Superior System Availability

   Frontiers of Engineering Management / PMC, 2023
   https://pmc.ncbi.nlm.nih.gov/articles/PMC9990580/

[4] Su, H. & Lee, J. — ML Approaches for Diagnostics and Prognostics of Industrial Systems

   arXiv 2312.16810 (Int. Journal of PHM), 2024
   https://arxiv.org/abs/2312.16810

[5] Marouani, H. — Optimization for the Redundancy Allocation Problem Using Improved PSO

   Journal of Optimization (Wiley OA), 2021
   https://onlinelibrary.wiley.com/doi/10.1155/2021/6385713

[6] Zio, E. & Gholinezhad, H. — Redundancy Allocation with Time-Dependent Failure Rates

   MDPI Mathematics 11(16), 2023
   https://www.mdpi.com/2227-7390/11/16/3534

[7] Si, S., Zhao, J., Cai, Z., Dui, H. — System Reliability Optimisation Driven by Importance Measures

   Frontiers of Engineering Management, 2020
   https://link.springer.com/article/10.1007/s42524-020-0112-6

[8] Sun, Z. et al. (RMIT) — Digital Twin for Structural Health Monitoring of Civil Infrastructure

   Sensors / PMC, 2025
   https://pmc.ncbi.nlm.nih.gov/articles/PMC11723349/

[9] Tordeux, A. et al. — System Reliability Engineering in the Age of Industry 4.0

   arXiv 2411.08913, 2024
   https://arxiv.org/abs/2411.08913

[10] Li, Y.-F. — Review of the Redundancy Allocation Problem to Optimise System Reliability

    Engineering Optimisation (Tandfonline), 2024
    https://doi.org/10.1080/0305215X.2024.2447078