Giovanni Ciatto and Matteo Magnini
Dipartimento di Informatica — Scienza e Ingegneria (DISI), Sede di Cesena,
Alma Mater Studiorum—Università di Bologna
Mini-school @ WOA 2025, the 26th Workshop From Objects to Agents
2nd July 2025, Trento, Italy
Quick overview on symbolic vs. sub-symbolic AI
Two broad categories of AI approaches:
Local $\approx$ “symbolic”: each symbol has a clear, distinct meaning
- e.g. "bear" is a symbol denoting a crisp category (either the animal is a bear or not)

Distributed $\approx$ “non-symbolic”: symbols do not have a clear meaning per se, but the whole representation does
- e.g. "swim" is a fuzzy capability: one animal may be (un)able to swim to some extent

Let’s say we need to represent $N$ classes: how many columns would the tables have?
According to Tim van Gelder in 1990:
Symbolic representations of knowledge
- involve a set of symbols
- which can be combined (e.g., concatenated) in (possibly) infinitely many ways,
- following precise syntactical rules,
- where both elementary symbols and any admissible combination of them can be assigned with meaning
There exist approaches where symbols are combined with numbers, e.g.:
These approaches are not purely symbolic, but they are not purely numeric either, so we call the overall category “sub-symbolic”
grab(X)
: grabs block X from the table

put(X)
: puts block X on the table

stack(X, Y)
: stacks block X on top of block Y

unstack(X, Y)
: un-stacks block X from block Y
Structured representations: knowledge (I/O data) is represented in a structured, formal way (e.g., logic formulas, ontologies)
Algorithmic manipulation of representations: each approach relies on algorithms that manipulate these structured representations following exact rules
Crisp semantics: the meaning of the representations is well-defined, and the algorithms produce exact results
Model-driven: algorithms commonly work in zero- or few-shot settings, but humans must first model and encode the knowledge in the target structure
Clear computational complexity: the decidability, complexity, and tractability of the algorithms are well understood
Machine learning: supervised, unsupervised, and reinforcement learning
Probabilistic reasoning: Bayesian networks, Markov models, probabilistic logic programming
Data separation vs. curve fitting (figure omitted)
Focus on the target feature (figure omitted)
Numeric representations: knowledge (I/O data) is represented in a less structured way, often as vectors/matrices/tensors of numbers
Differentiable manipulation of representations: algorithms rely on mathematical operations involving these numeric representations, most commonly undergoing some optimization process
Fuzzy/continuous semantics: representations are from continuous spaces, where similarities and distances are defined in a continuous way, and algorithms may yield fuzzy results
Data-driven (usage vs. training): algorithms are first trained on data, then re-used to draw inferences on new data
Unclear computational complexity: strong reliance on greedy or time-limited optimization methods, and a lack of theoretical guarantees on the quality of the results
Symbolic Neuro-Symbolic
: symbols $\rightarrow$ vectors $\rightarrow$ NNs $\rightarrow$ vectors $\rightarrow$ symbols

Symbolic[Neuro]
: symbolic module $\xrightarrow{\text{invokes}}$ NN $\rightarrow$ output

Neuro | Symbolic
: NN $\xrightarrow{\text{cooperates}}$ symbolic module $\xrightarrow{\text{cooperates}}$ NN $\rightarrow$ …

Neuro: Symbolic → Neuro
: symbolic knowledge $\xrightarrow{\text{influences}}$ NN

NeuroSymbolic
: symbolic knowledge $\xrightarrow{\text{constrains}}$ NN

Neuro[Symbolic]
: symbolic module $\xrightarrow{\text{embedded in}}$ NN

(cf. Ciatto et al., 2024)
Symbolic Knowledge Extraction (SKE): extracting symbolic knowledge from sub-symbolic models
Symbolic Knowledge Injection (SKI): injecting symbolic knowledge into sub-symbolic models
Both require some basic understanding of how supervised machine learning works
Remark: because of the “no free lunch” theorem, no single algorithm is the best in the general case
- i.e., each learning problem may be better addressed by a different learning algorithm
Single neuron (figure omitted)
(Feed-forward) neural network $\equiv$ cascade of layers (figure omitted)
Remark: NNs are universal approximators, provided that they have enough neurons
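To make the picture concrete, here is a minimal NumPy sketch of ours (not from the original demo): a single neuron computes an activation of a weighted sum, and a feed-forward network just cascades such layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron: activation of (weighted sum of inputs + bias)."""
    return sigmoid(np.dot(w, x) + b)

def feed_forward(x, layers):
    """A feed-forward NN: the output of each layer feeds the next one."""
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

rng = np.random.default_rng(42)
x = rng.normal(size=4)                          # 4 input features
layers = [
    (rng.normal(size=(8, 4)), np.zeros(8)),     # hidden layer: 8 neurons
    (rng.normal(size=(1, 8)), np.zeros(1)),     # output layer: 1 neuron
]
print(feed_forward(x, layers))                  # a value in (0, 1)
```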
NNs (and other ML algorithms) explore the hypothesis space by using a gradient descent algorithm
Depending on its starting point, training may yield different results (e.g., it may end up in different local optima).
The solution is cross-validation.
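Below is a small scikit-learn sketch of ours illustrating both points: the same network trained from different random initialisations may score differently, while k-fold cross-validation averages performance over several train/test splits.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# same architecture, different starting points for gradient descent
for seed in (0, 1, 2):
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=seed)
    print(f"seed={seed}: accuracy={net.fit(X, y).score(X, y):.3f}")

# 5-fold cross-validation: a more robust estimate of generalisation
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
scores = cross_val_score(net, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```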
Categorical features can be turned into numbers via ordinal encoding or one-hot encoding (figures omitted).
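A quick sketch of ours of the two encodings with scikit-learn; note how one-hot encoding answers the earlier question: $N$ classes yield $N$ columns.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

animals = [["bear"], ["dog"], ["swan"], ["dog"]]

# ordinal: each category becomes an arbitrary integer (1 column)
print(OrdinalEncoder().fit_transform(animals).ravel())   # [0. 1. 2. 1.]

# one-hot: each category gets its own 0/1 column (N categories -> N columns)
print(OneHotEncoder(sparse_output=False).fit_transform(animals))  # sklearn >= 1.2
```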
Non-linearly-separable data in $(x, y)$ … may be separable in another space, e.g. $(\rho, \theta)$ with $\rho = \sqrt{x^2 + y^2}$ and $\theta = \arctan(y/x)$
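For instance (our sketch): two concentric rings cannot be separated by a line in $(x, y)$, but become trivially separable along the $\rho$ axis.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

print("accuracy in (x, y):      ",
      LogisticRegression().fit(X, y).score(X, y))    # ~ 0.5 (chance level)

rho = np.hypot(X[:, 0], X[:, 1])                     # radius
theta = np.arctan2(X[:, 1], X[:, 0])                 # angle
P = np.column_stack([rho, theta])
print("accuracy in (rho, theta):",
      LogisticRegression().fit(P, y).score(P, y))    # ~ 1.0
```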
How to extract symbolic knowledge from sub-symbolic predictors
Any algorithmic procedure accepting trained sub-symbolic predictors as input and producing symbolic knowledge as output, so that the extracted knowledge reflects the behaviour of the predictor with high fidelity.
Explainable AI (XAI): SKE methods are often used to provide explanations for the decisions made by sub-symbolic predictors, making them more interpretable and understandable to humans (a.k.a. post-hoc explainability)
Knowledge discovery: SKE methods can help discover patterns and relationships in the data that may not be immediately apparent, thus providing insights into the underlying processes
Model compression: SKE methods can simplify complex sub-symbolic models by extracting symbolic rules that approximate their behaviour, thus reducing the model’s size and complexity
Interpretation vs. explanation: they are not synonyms, in spite of the fact that they are often used interchangeably!

Explanation:
- elicits relevant aspects of objects (to ease their interpretation)
- it is an operation that transforms poorly interpretable objects into more interpretable ones
- e.g., the search for a surrogate interpretable model

Interpretation:
- binds objects with meaning (it is what the human mind does)
- it is subjective
- it does not need to be measurable; only comparisons are needed
Main entities and how to extract symbolic knowledge from sub-symbolic predictors
Logic Rule |
---|
Class = setosa ← PetalWidth ≤ 1.0 |
Class = versicolor ← PetalLength > 4.9 ∧ SepalWidth ∈ [2.9, 3.2] |
Class = versicolor ← PetalWidth > 1.6 |
Class = virginica ← SepalWidth ≤ 2.9 |
Class = virginica ← SepalLength ∈ [5.4, 6.3] |
Class = virginica ← PetalWidth ∈ [1.0, 1.6] |
Decompositional: if the method needs to inspect (even partially) the internal parameters of the underlying black-box predictor, e.g., neuron biases or connection weights for NNs, or support vectors for SVMs
Pedagogical: if the algorithm does not need to take any internal parameter into account, but extracts symbolic knowledge by relying only on the predictor’s outputs
SKE methods: theory and practice
Classification and regression trees (cf. Breiman et al., 1984)
An example decision tree estimating the probability of kyphosis after spinal surgery, given the age of the patient and the vertebra at which surgery was started (cf. wiki:dt-learning). Notice that every decision tree induces a partition of the input space, and that the tree itself provides an intelligible representation of how predictions are attained.
1. generate a synthetic dataset by using the predictions of the sub-symbolic predictor
2. train a decision tree on the synthetic dataset
3. compute the fidelity and repeat from step 2 until satisfied
4. [optional] rewrite the tree as a set of symbolic rules
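A minimal sketch of ours of this loop with scikit-learn’s CART; here `black_box` stands for any trained sub-symbolic predictor, and fidelity is measured as the tree’s agreement with the predictor’s own labels.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_tree(black_box, X, target_fidelity=0.95, max_depth=10):
    y_synthetic = black_box.predict(X)           # step 1: synthetic labels
    for depth in range(1, max_depth + 1):        # steps 2-3: train, check fidelity
        tree = DecisionTreeClassifier(max_depth=depth).fit(X, y_synthetic)
        fidelity = accuracy_score(y_synthetic, tree.predict(X))
        if fidelity >= target_fidelity:
            break
    return tree, fidelity

# step 4 (optional): print the tree as nested if-then rules, e.g.:
#   tree, fidelity = extract_tree(trained_nn, X_train)
#   print(export_text(tree))
```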
The Adult dataset (cf. Barry Becker and Ronny Kohavi, 1996)
The Adult dataset contains the records (48,842) of individuals based on census data (this dataset is also known as Census Income). The dataset has many features (14) related to the individuals’ demographics, such as age, education, and occupation. The target feature is whether the individual earns more than $50,000 per year.
age | workclass | education | … | hours-per-week | native-country | income |
---|---|---|---|---|---|---|
39 | State-gov | Bachelors | … | 40 | United-States | <=50K |
50 | Self-emp-not-inc | Bachelors | … | 13 | United-States | <=50K |
38 | Private | HS-grad | … | 40 | United-States | <=50K |
53 | Private | 11th | … | 40 | United-States | <=50K |
28 | Private | Bachelors | … | 40 | Cuba | <=50K |
37 | Private | Masters | … | 40 | United-States | <=50K |
49 | Private | 9th | … | 16 | Jamaica | <=50K |
52 | Self-emp-not-inc | HS-grad | … | 45 | United-States | >50K |
31 | Private | Masters | … | 50 | United-States | >50K |
42 | Private | Bachelors | … | 40 | United-States | >50K |
Download the Adult dataset and preprocess it
Train a sub-symbolic predictor – a neural network – on the Adult dataset
Use a pedagogical SKE method – CART – to extract symbolic knowledge from the trained neural network
Visualise the extracted symbolic knowledge as a decision tree and as a set of rules
GitHub repository at github.com/MatteoMagnini/demo-2025-woa-nesy/blob/master/notebook/extraction.ipynb
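A rough scikit-learn sketch of ours of steps 1 and 2; the linked notebook is the authoritative version and may use different libraries and preprocessing.

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. download Adult (a.k.a. Census Income) and preprocess it:
#    impute + one-hot encode categorical features, standardise numeric ones
adult = fetch_openml("adult", version=2, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    adult.data, adult.target, random_state=0)
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     make_column_selector(dtype_include=["category", "object"])),
    (StandardScaler(), make_column_selector(dtype_include="number")),
)

# 2. train the sub-symbolic predictor (a small NN, kept small for speed)
nn = make_pipeline(preprocess,
                   MLPClassifier(hidden_layer_sizes=(32,), max_iter=100))
nn.fit(X_train, y_train)
print("test accuracy:", nn.score(X_test, y_test))

# 3-4. extraction and visualisation proceed as in the CART sketch above,
#      using nn.predict(...) to label the synthetic dataset
```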
classification
: $f: \mathcal{X} \subseteq \mathbb{R}^n \rightarrow \mathcal{Y}$ s.t. $|\mathcal{Y}| = k$

regression
: $f: \mathcal{X} \subseteq \mathbb{R}^n \rightarrow \mathcal{Y} \subseteq \mathbb{R}^m$

binary
: $\mathcal{X} \equiv \{0, 1\}^n$

discrete
: $\mathcal{X} \equiv \{x_1, \ldots, x_n\}^n$

continuous
: $\mathcal{X} \subseteq \mathbb{R}^n$
rule list, ordered sequences of if-then-else rules
decision tree, hierarchical set of if-then-else rules involving a comparison between a variable and a constant
decision table, 2D tables summarising decisions for each possible assignment of the input variables
propositional, boolean statements + logic connectives, including arithmetic comparisons between variables and constants
fuzzy, like propositional rules, but statements may hold to a degree rather than being crisply true or false
oblique, boolean statements + logic connectives + arithmetic comparisons involving multiple variables (e.g., linear combinations)
M-of-N, any of the above + statements of the form “at least $k$ of the following statements are true”
discretisation of the input space
discretisation of the output space
features should have semantic meaning
rules constitute global explanations
tabular data as input, no images
high-dimensional datasets could lead to poorly readable rules
highly variable input spaces could do the same
target images, or high-dimensional data in general
target reinforcement learning (when based on NN)
target unsupervised learning
design and prototype your own extraction algorithm
How to inject symbolic knowledge into sub-symbolic predictors
Any algorithmic procedure affecting how sub-symbolic predictors draw their inferences in such a way that predictions are either computed as a function of, or made consistent with, some given symbolic knowledge.
Improve predictive performance: by injecting symbolic knowledge, we can compensate for what the training data fails to capture, thus improving predictions
Enhance interpretability: with SKI we can make predictors that are, by construction, more consistent with human-understandable knowledge
Robustness to data degradation: symbolic knowledge can help sub-symbolic models maintain performance even in the presence of noisy or scarce data
Enhance fairness: by incorporating symbolic knowledge about fairness constraints, we can ensure that sub-symbolic models make decisions that align with ethical considerations
And more: SKI can simplify the predictor’s architecture; in particular, it can reduce the number of weights in a neural network, thus improving its efficiency and reducing the risk of overfitting
Main entities and how to inject symbolic knowledge into sub-symbolic predictors
Predictor: a sub-symbolic model that makes predictions based on input data, usually a neural network
Symbolic knowledge: structured, formal knowledge that can be represented in a symbolic form; the most common form is logic rules, such as those in the examples below
Fuzzification: the process of converting symbolic knowledge into a form that can be used by sub-symbolic predictors, e.g. by assigning degrees of truth to symbolic statements
Injector: the main component that injects symbolic knowledge into the predictor, by modifying its architecture, its training process or by other means
SKI methods: theory and practice
(ref. Magnini et al., 2023)
Formula | Continuous interpretation | Formula | Continuous interpretation |
---|---|---|---|
$[[ \neg \phi ]]$ | $\eta(1 - [[ \phi ]])$ | $[[ \phi \le \psi ]]$ | $\eta(1 + [[ \psi ]] - [[ \phi ]])$ |
$[[ \phi \wedge \psi ]]$ | $\eta(\min([[ \phi ]], [[ \psi ]]))$ | $[[ class(\bar{X}, {y}_i) \leftarrow \psi ]]$ | $[[ \psi ]]^{*}$ |
$[[ \phi \vee \psi ]]$ | $\eta(\max([[ \phi ]], [[ \psi ]]))$ | $[[ \text{expr}(\bar{X}) ]]$ | $\text{expr}([[ \bar{X} ]])$ |
$[[ \phi = \psi ]]$ | $\eta([[ \neg( \phi \ne \psi ) ]])$ | $[[ \mathtt{true} ]]$ | $1$ |
$[[ \phi \ne \psi ]]$ | $\eta(| [[ \phi ]] - [[ \psi ]]|)$ | $[[ \mathtt{false} ]]$ | $0$ |
$[[ \phi > \psi ]]$ | $\eta(\max(0, \frac{1}{2} + [[ \phi ]] - [[ \psi ]]))$ | $[[ X ]]$ | $x$ |
$[[ \phi \ge \psi ]]$ | $\eta(1 + [[ \phi ]] - [[ \psi ]])$ | $[[ k ]]$ | $k$ |
$[[ \phi < \psi ]]$ | $\eta(\max(0, \frac{1}{2} + [[ \psi ]] - [[ \phi ]]))$ | $[[ p(\bar{X}) ]]^{**}$ | $[[ \psi_1 \vee \ldots \vee \psi_k ]]$ |
$^{*}$ encodes the value for the $i^{\text{th}}$ output $^{**}$ assuming $p$ is defined by $k$ clauses of the form: ${p}(\bar{X}) \leftarrow \psi_1,\ \ldots,\ {p}(\bar{X}) \leftarrow \psi_k$
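A tiny Python sketch of ours of this table, assuming (our assumption) that $\eta$ clips its argument into $[0, 1]$: computing the truth degree of a rule body such as PetalWidth ≤ 1.0.

```python
def eta(x):
    """Assumed here: eta clips real values into [0, 1]."""
    return min(1.0, max(0.0, x))

def leq(phi, psi):
    """[[ phi <= psi ]] = eta(1 + [[psi]] - [[phi]])"""
    return eta(1 + psi - phi)

def land(phi, psi):
    """[[ phi and psi ]] = eta(min([[phi]], [[psi]]))"""
    return eta(min(phi, psi))

# truth degree of "PetalWidth <= 1.0" for two flowers:
print(leq(0.2, 1.0))   # petal width 0.2 -> 1.0 (fully true)
print(leq(1.4, 1.0))   # petal width 1.4 -> 0.6 (partially true)
```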
(ref. Magnini et al., 2022)
Formula | Continuous interpretation | Formula | Continuous interpretation |
---|---|---|---|
$[[\neg \phi]]$ | $\eta(1 - [[\phi]])$ | $[[\phi \le \psi]]$ | $\eta([[\phi]] - [[\psi]])$ |
$[[\phi \wedge \psi]]$ | $\eta(\max([[\phi]], [[\psi]]))$ | $[[\mathrm{class}(\bar{X}, {y}_i) \leftarrow \psi]]$ | $[[\psi]]^{*}$ |
$[[\phi \vee \psi]]$ | $\eta(\min([[\phi]], [[\psi]]))$ | $[[\text{expr}(\bar{X})]]$ | $\text{expr}([[\bar{X}]])$ |
$[[\phi = \psi]]$ | $\eta(\left\lvert [[\phi]] - [[\psi]] \right\rvert)$ | $[[\mathtt{true}]]$ | $0$ |
$[[\phi \ne \psi]]$ | $[[\neg(\phi = \psi)]]$ | $[[\mathtt{false}]]$ | $1$ |
$[[\phi > \psi]]$ | $\eta(\frac{1}{2} - [[\phi]] + [[\psi]])$ | $[[X]]$ | $x$ |
$[[\phi \ge \psi]]$ | $\eta([[\psi]] - [[\phi]])$ | $[[{k}]]$ | $k$ |
$[[\phi < \psi]]$ | $\eta(\frac{1}{2} + [[\phi]] - [[\psi]])$ | $[[\mathrm{p}(\bar{X})]]^{**}$ | $[[\psi_1 \vee \ldots \vee \psi_k]]$ |
$^{*}$ encodes the penalty for the $i^{\text{th}}$ neuron $^{**}$ assuming predicate $p$ is defined by $k$ clauses of the form: ${p}(\bar{X}) \leftarrow \psi_1,\ \ldots,\ {p}(\bar{X}) \leftarrow \psi_k$
Cost function: whenever the neural network wrongly predicts a class and violates the prior knowledge, a cost proportional to the violation is added.
In this way, the output of the network differs more from the expected one, and this affects the back-propagation step.
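A minimal PyTorch sketch of ours of this strategy (this is not PSyKI’s actual API): the usual loss is augmented with a term proportional to how much each prediction violates the knowledge; `penalty` is a user-supplied fuzzification, like the pair-rule sketch shown after the demo steps below.

```python
import torch
import torch.nn.functional as F

def injected_loss(logits, y_true, x, penalty, weight=1.0):
    """Cross-entropy plus a knowledge-violation cost.

    `penalty(x, logits)` (user-supplied, hypothetical) must return one value
    per sample in [0, 1], measuring how strongly the prediction violates
    the symbolic knowledge.
    """
    base = F.cross_entropy(logits, y_true)
    violation = penalty(x, logits).mean()   # average violation over the batch
    return base + weight * violation        # weight trades accuracy vs. knowledge
```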
The poker hand data set (PHDS) (cf. Robert Cattral and Franz Oppacher, 2002)
Each record represents one poker hand
5 cards, each identified by 2 values: suit and rank
Classes: 10
Training set: 25,010
Test set: 1,000,000
id | S1 | R1 | S2 | R2 | S3 | R3 | S4 | R4 | S5 | R5 | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 10 | 1 | 11 | 1 | 13 | 1 | 12 | 1 | 1 | 9 |
2 | 2 | 11 | 2 | 13 | 2 | 10 | 2 | 12 | 2 | 1 | 9 |
3 | 3 | 12 | 3 | 11 | 3 | 13 | 3 | 10 | 3 | 1 | 9 |
4 | 4 | 10 | 4 | 11 | 4 | 1 | 4 | 13 | 4 | 12 | 9 |
5 | 4 | 1 | 4 | 13 | 4 | 12 | 4 | 11 | 4 | 10 | 9 |
6 | 1 | 2 | 1 | 4 | 1 | 5 | 1 | 3 | 1 | 6 | 8 |
7 | 1 | 9 | 1 | 12 | 1 | 10 | 1 | 11 | 1 | 13 | 8 |
8 | 2 | 1 | 2 | 2 | 2 | 3 | 2 | 4 | 2 | 5 | 8 |
9 | 3 | 5 | 3 | 6 | 3 | 9 | 3 | 7 | 3 | 8 | 8 |
10 | 4 | 1 | 4 | 4 | 4 | 2 | 4 | 3 | 4 | 5 | 8 |
11 | 1 | 1 | 2 | 1 | 3 | 9 | 1 | 5 | 2 | 3 | 1 |
12 | 2 | 6 | 2 | 1 | 4 | 13 | 2 | 4 | 4 | 9 | 0 |
13 | 1 | 10 | 4 | 6 | 1 | 2 | 1 | 1 | 3 | 8 | 0 |
14 | 2 | 13 | 2 | 1 | 4 | 4 | 1 | 5 | 2 | 11 | 0 |
15 | 3 | 8 | 4 | 12 | 3 | 9 | 4 | 2 | 3 | 2 | 1 |
Class: Pair

class(R₁, ..., S₅, pair) ← pair(R₁, ..., S₅)
pair(R₁, ..., S₅) ← R₁ = R₂
pair(R₁, ..., S₅) ← R₁ = R₃
pair(R₁, ..., S₅) ← R₁ = R₄
pair(R₁, ..., S₅) ← R₁ = R₅
pair(R₁, ..., S₅) ← R₂ = R₃
pair(R₁, ..., S₅) ← R₂ = R₄
pair(R₁, ..., S₅) ← R₂ = R₅
pair(R₁, ..., S₅) ← R₃ = R₄
pair(R₁, ..., S₅) ← R₃ = R₅
pair(R₁, ..., S₅) ← R₄ = R₅
Class: Two Pairs

class(R₁, ..., S₅, two) ← two(R₁, ..., S₅)
two(R₁, ..., S₅) ← R₁ = R₂ ∧ R₃ = R₄
two(R₁, ..., S₅) ← R₁ = R₃ ∧ R₂ = R₄
two(R₁, ..., S₅) ← R₁ = R₄ ∧ R₂ = R₃
two(R₁, ..., S₅) ← R₁ = R₂ ∧ R₃ = R₅
two(R₁, ..., S₅) ← R₁ = R₃ ∧ R₂ = R₅
two(R₁, ..., S₅) ← R₁ = R₅ ∧ R₂ = R₃
two(R₁, ..., S₅) ← R₁ = R₂ ∧ R₄ = R₅
two(R₁, ..., S₅) ← R₁ = R₄ ∧ R₂ = R₅
two(R₁, ..., S₅) ← R₁ = R₅ ∧ R₂ = R₄
two(R₁, ..., S₅) ← R₁ = R₃ ∧ R₄ = R₅
two(R₁, ..., S₅) ← R₁ = R₄ ∧ R₃ = R₅
two(R₁, ..., S₅) ← R₁ = R₅ ∧ R₃ = R₄
two(R₁, ..., S₅) ← R₂ = R₃ ∧ R₄ = R₅
two(R₁, ..., S₅) ← R₂ = R₄ ∧ R₃ = R₅
two(R₁, ..., S₅) ← R₂ = R₅ ∧ R₃ = R₄
Class: Three of a Kind

class(R₁, ..., S₅, three) ← three(R₁, ..., S₅)
three(R₁, ..., S₅) ← R₁ = R₂ ∧ R₁ = R₃
three(R₁, ..., S₅) ← R₁ = R₂ ∧ R₁ = R₄
three(R₁, ..., S₅) ← R₁ = R₂ ∧ R₁ = R₅
three(R₁, ..., S₅) ← R₁ = R₃ ∧ R₁ = R₄
three(R₁, ..., S₅) ← R₁ = R₃ ∧ R₁ = R₅
three(R₁, ..., S₅) ← R₁ = R₄ ∧ R₁ = R₅
three(R₁, ..., S₅) ← R₂ = R₃ ∧ R₂ = R₄
three(R₁, ..., S₅) ← R₂ = R₃ ∧ R₂ = R₅
three(R₁, ..., S₅) ← R₂ = R₄ ∧ R₂ = R₅
three(R₁, ..., S₅) ← R₃ = R₄ ∧ R₃ = R₅

Class: Flush

class(R₁, ..., S₅, flush) ← flush(R₁, ..., S₅)
flush(R₁, ..., S₅) ← S₁ = S₂ ∧ S₁ = S₃ ∧ S₁ = S₄ ∧ S₁ = S₅
Download the PHDS dataset and preprocess it
Define the symbolic knowledge to inject
Train a sub-symbolic predictor – a neural network – on the PHDS dataset
Train a second neural network with the symbolic knowledge injected
We will inject the knowledge into the loss function
Visualise and compare the results of the two predictors
GitHub repository at github.com/MatteoMagnini/demo-2025-woa-nesy/blob/master/notebook/injection.ipynb
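As a taste of the fuzzification step (our sketch, using the penalty semantics of the table above, with $\eta$ assumed to clip into $[0, 1]$): the ten `pair` clauses collapse into a single penalty that is 0 when some two ranks match and 1 otherwise. Such a function can be plugged into the knowledge-augmented loss sketched earlier.

```python
from itertools import combinations

def eta(x):
    return min(1.0, max(0.0, x))

def pair_penalty(ranks):
    """Penalty of pair(R1, ..., R5): [[Ri = Rj]] = eta(|ri - rj|), and the
    disjunction of the ten clauses is their minimum."""
    return eta(min(abs(ri - rj) for ri, rj in combinations(ranks, 2)))

print(pair_penalty([10, 11, 13, 12, 10]))   # a pair of 10s -> 0.0 (satisfied)
print(pair_penalty([2, 4, 5, 3, 6]))        # no pair       -> 1.0 (violated)
```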
input knowledge: how is the knowledge to-be-injected represented?
target predictor: which predictors can knowledge be injected into?
strategy: how does injection actually work?
purpose: why is knowledge injected in the first place?
Knowledge should express relations about input-output pairs
embedding implies extensional representation of knowledge
guided learning and structuring support intensional knowledge
propositional knowledge implies binarising the I/O space
Recursive data structures are not natively supported
extensional representations cost storage
guided learning works poorly when data is scarce
foundational
: address recursion

practical
: find a language that strikes a good balance between expressiveness and ease of use

target
: apply to large language models
The talk is over, I hope you enjoyed it