Infusing Knowledge into Data-Driven Modelling of Complex Systems for Improved Quality and Interpretability

Christel Sirocchi

Across various domains, specialised applications are deployed to collect and generate data that model or measure the behaviour of complex systems, whether natural, such as biological systems, or engineered, such as power grids. While numerous data analytics tools have been developed to uncover trends within the data, ranging from traditional pattern mining techniques to advanced Machine Learning (ML) architectures, challenges remain regarding interpretability, robustness, and adherence to domain-specific knowledge. Beyond collected data, nearly all sectors of human expertise have a reservoir of established knowledge, specific to the system being studied or general to the domain, which often remains underutilised in data-driven approaches. This thesis focuses on enhancing data processing and analysis by integrating available knowledge into data-driven methods. Specifically, two types of knowledge are explored: relational knowledge, which captures relationships between entities and is typically represented using graphs, and declarative knowledge, which expresses facts about the domain and is formalised using logic formulæ. On the one hand, graph-based encoding enables the use of techniques from graph theory and network science; on the other, rule-based knowledge representation harnesses the formalism and reasoning capabilities of computational logic.

The first half of the thesis examines how relational knowledge can be applied to data processing tasks, targeting two domains where the structural relationships between entities can predict system properties or are integral to understanding system behaviour but remain underutilised: distributed computing and biological systems. In distributed computing systems, the focus is on distributed averaging algorithms and the relationship between the topology of the communication network (structure) and the convergence rate of the algorithm operating on that network (function). Key contributions include modelling the effect of graph topology on algorithmic convergence across four graph families and confirming the predictive power of both global and local metrics. A novel distributed approach is introduced, enabling nodes to estimate convergence rates, with applications in dynamic topologies and outlier detection. Additionally, the impact of network modularity on performance is thoroughly explored, inspiring the design of a novel community-aware gossip scheme that outperforms traditional schemes. In biological systems, metabolomics emerges as a promising research area, as traditional pathway-based methods often fall short due to inaccuracies, incompleteness, or inconsistencies in annotations. Metabolic data analysis is augmented by leveraging the chemical structure of metabolites, extending techniques from drug discovery to metabolomics. Initially, ML models are trained on graph-based and vector-based encodings of metabolites, achieving satisfactory performance and demonstrating the predictive power of metabolite structures for metabolic outcomes. Then, feature importance analysis is conducted on the trained models to identify the most predictive chemical configurations, providing insights into affected metabolic pathways that corroborate previous findings while opening new avenues for exploration. Furthermore, the limitations of existing structural encodings are examined and addressed by proposing novel representations tailored to metabolomics that enhance both resolution and interpretability.
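
The link between network topology and convergence rate in distributed averaging can be illustrated with a minimal sketch. It is not taken from the thesis: the weight matrix W = I - L / (d_max + 1), the two example graphs, and all function names are assumptions chosen for illustration. For linear averaging x <- W x with a symmetric, doubly stochastic W, the asymptotic convergence rate is governed by the second-largest eigenvalue modulus (SLEM) of W, so a smaller SLEM means faster convergence.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of a ring: each node is linked to its two neighbours."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1
        A[(i + 1) % n, i] = 1
    return A

def complete_adjacency(n):
    """Adjacency matrix of a complete graph: every pair of nodes is linked."""
    return np.ones((n, n)) - np.eye(n)

def averaging_matrix(A):
    """Illustrative weight choice (an assumption): W = I - L / (d_max + 1).
    W is symmetric and doubly stochastic for an undirected graph."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A          # graph Laplacian
    return np.eye(len(A)) - L / (deg.max() + 1)

def slem(W):
    """Second-largest eigenvalue modulus of a symmetric matrix W."""
    moduli = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return moduli[-2]

def run_averaging(W, x0, steps):
    """Iterate x <- W x; each node only combines its neighbours' values."""
    x = x0.copy()
    for _ in range(steps):
        x = W @ x
    return x

if __name__ == "__main__":
    n = 10
    for name, A in [("ring", ring_adjacency(n)), ("complete", complete_adjacency(n))]:
        W = averaging_matrix(A)
        print(name, "SLEM:", round(float(slem(W)), 4))
```

Under these assumptions the sparse ring has a SLEM close to 1 and converges slowly, whereas the complete graph reaches the average in a single step, which is the kind of structure-function dependence the paragraph above refers to.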

The second half of the thesis shifts to domains where knowledge is encoded as rules. The clinical domain emerges as an ideal field for knowledge-driven augmentation due to the abundance of rule-based protocols and the limitations of ML in terms of accuracy, interpretability, and robustness. First, available integration strategies from the literature are mapped onto the corresponding phases of the ML pipeline, offering structured guidelines for knowledge integration. Several of these methods are applied to a benchmark dataset and compared in terms of accuracy, data efficiency, interpretability, and adherence to domain knowledge. To address the lack of metrics for comparing fully data-driven models and hybrid models, new evaluation metrics are introduced, focusing on alignment with established knowledge. Evaluation using these metrics demonstrates that hybrid models offer more coherent predictions and explanations, making them better suited for clinical practice. Symbolic knowledge injection and extraction frameworks are also explored, showing potential for improving performance and explainability in clinical applications. A novel combination of the two in a feedback loop further enhances performance.
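
To make the idea of measuring alignment with established knowledge concrete, the following is a hypothetical sketch only. It does not reproduce the metrics introduced in the thesis, and the toy rule, the `marker` feature, and the function name `rule_adherence` are all invented for illustration. It computes the share of records covered by a rule's premise for which the model's prediction agrees with the rule's conclusion.

```python
def rule_adherence(records, predictions, premise, expected):
    """Fraction of records satisfying `premise` whose prediction equals `expected`.

    Returns None when the rule covers no record, since adherence is undefined.
    This is an illustrative measure, not the metric defined in the thesis.
    """
    covered = [p for r, p in zip(records, predictions) if premise(r)]
    if not covered:
        return None
    return sum(p == expected for p in covered) / len(covered)

if __name__ == "__main__":
    # Toy data: `marker` is a made-up feature, not a real clinical variable.
    records = [{"marker": 2.0}, {"marker": 0.5}, {"marker": 3.0}, {"marker": 1.5}]
    predictions = [1, 0, 0, 1]
    # Made-up rule: if marker > 1.0, the expected prediction is class 1.
    score = rule_adherence(records, predictions, lambda r: r["marker"] > 1.0, 1)
    print("adherence:", score)
```

A measure of this kind can be computed for a purely data-driven model and for a hybrid model on the same records, which allows the two to be compared on adherence to a rule independently of accuracy.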

The thesis also presents additional contributions in two main areas. First, the potential for integrating graph-based and rule-based approaches is explored in unsupervised learning settings, particularly in disease subtyping. In this application, rule-based learning enhances clustering performance by grouping patients into meaningful subgroups, while graph-based methods improve interpretability by constructing feature graphs that capture relationships between features, thus providing insights into relevant biomarkers. Second, positional information is leveraged to augment traditional data analysis, particularly in crowdsourcing applications where data often include geographic coordinates. Two case studies are presented: a crowdsensing application for road quality mapping and a crowdmapping initiative for computing education. In these contexts, integrative approaches are developed to achieve key objectives, including evaluating data quality and measuring user retention and satisfaction.

Keywords: Data Processing, Knowledge Integration, Complex Systems, Informed Machine Learning, Interpretability

thesis talk

Infusing Knowledge into Data-Driven Modelling of Complex Systems for Improved Quality and Interpretability (Final PhD Discussion, 20/02/2025), Christel Sirocchi
