Co-Constructing Explanations for AI Systems using Provenance

Prof. Paul Groth | @pgroth | pgroth.com | indelab.org
Thanks to Jan-Christoph Kalo, Fina Polat, Shubha Guha, Enrico Daga
XAI-KG Workshop - ESWC 2025

Explanations increasingly required for AI systems

“While the explanation process involves the assimilation of knowledge,
it also transforms the learner’s knowledge and is thus an
accommodative process.
Thus, rather than referring to "explanations" (and assuming that the
property of being an explanation is a property of statements), it might
be prudent to refer to explaining, and regard explaining an active
exploration process.”
Explanation by Exploration or Self Explanation
S.T. Mueller, R.R. Hoffman, W. Clancey, A. Emrey and G. Klein, Explanation in
Human-AI Systems: A Literature Meta-Review, Synopsis
of Key Ideas and Publications, and Bibliography for Explainable AI, arXiv, 2019.
https://arxiv.org/abs/1902.01876

“The process of explaining, and human-human communication
broadly, is a co-adaptive ‘tuning’ process, which requires that
the explainer and learner have a capacity to take each other's
perspective.”
A collaborative and co-adaptive process
S.T. Mueller, R.R. Hoffman, W. Clancey, A. Emrey and G. Klein, Explanation in
Human-AI Systems: A Literature Meta-Review, Synopsis
of Key Ideas and Publications, and Bibliography for Explainable AI, arXiv, 2019.
https://arxiv.org/abs/1902.01876

Trace-Based Explanation
Shruthi Chari, Daniel M Gruen, Oshani Seneviratne, and Deborah L McGuinness. 2020.
Directions for explainable knowledge-enabled systems. In Knowledge Graphs for eXplainable
Artificial Intelligence: Foundations, Applications and Challenges. IOS Press, 245–261.

Provenance is all you need!
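import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import label_binarize
# KerasClassifier is a scikit-learn wrapper for Keras models; the slide omits its import.

def load_data(…):
    input1 = pd.read_csv(…)
    input1 = input1[input1['attr'] > 10]  # keep rows with attr > 10
    input2 = pd.read_csv(…)
    return input1.join(input2, …)

def featurise(…):
    return ColumnTransformer([
        ('categorical', …, …),
        ('numerical', …, …)])

all_data = load_data(…)
train = all_data[all_data['date'] < …]   # earlier rows for training
test = all_data[all_data['date'] >= …]   # later rows for testing
y_train = label_binarize(train['label'])
y_test = label_binarize(test['label'])
pipeline = Pipeline([
    ('features', featurise(…)),
    ('learner', KerasClassifier(…))])
model = pipeline.fit(train, y_train)
quality = model.score(test, y_test)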
[Figure, three panels: (1) User-defined ML pipeline; (2) Extracted DAG representation; (3) Materialised artifacts with their provenance. The DAG takes inputs D1 (id, attr, label, date) and D2 (fk_id, val), applies σ attr>10 to input1, joins it with input2 (⋈ id=fk_id), and splits on date (σ date<… and σ date>=…). In each branch, π attr is one-hot encoded, π val is scaled, and the two are concatenated, while π label is binarized. The results are Xtrain and ytrain for FitClassifier and Xtest and ytest for Score. Each materialised row is annotated with provenance pairs such as {(1,1), (2,3)}.]
Grafberger, et al. Data distribution debugging in ML pipelines. VLDBJ (2022).
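
The idea of carrying row-level provenance through a pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from Grafberger et al.: the toy rows and the helper names `with_provenance`, `select`, and `join` are assumptions made for this example. Each row carries a set of (table id, row id) pairs, in the spirit of annotations such as {(1,1), (2,3)} in the figure.

```python
def with_provenance(rows, table_id):
    """Pair every row with the provenance set {(table_id, row_number)} (1-based)."""
    return [(dict(row), frozenset({(table_id, i)}))
            for i, row in enumerate(rows, start=1)]


def select(annotated, predicate):
    """Selection (sigma): surviving rows keep their provenance unchanged."""
    return [(row, prov) for row, prov in annotated if predicate(row)]


def join(left, right, left_key, right_key):
    """Join: an output row depends on both inputs, so provenance sets are unioned."""
    return [({**l_row, **r_row}, l_prov | r_prov)
            for l_row, l_prov in left
            for r_row, r_prov in right
            if l_row[left_key] == r_row[right_key]]


# Hypothetical toy inputs using the column names shown in the figure.
d1 = with_provenance([
    {"id": 1, "attr": 5, "label": "a"},
    {"id": 2, "attr": 20, "label": "b"},
    {"id": 3, "attr": 30, "label": "a"},
], table_id=1)
d2 = with_provenance([
    {"fk_id": 3, "val": 0.46},
    {"fk_id": 2, "val": 0.17},
], table_id=2)

filtered = select(d1, lambda row: row["attr"] > 10)
joined = join(filtered, d2, "id", "fk_id")
provenance_by_id = {row["id"]: set(prov) for row, prov in joined}

for row, prov in joined:
    print(row["id"], row["val"], sorted(prov))
```

With annotations like these, a suspicious output row can be traced back to the exact input rows that produced it, which is the raw material for a trace-based explanation.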

Data Journeys
- Importance of the entire pipeline
- Data Journey: a multi-layered, semantic representation of a data processing activity, linked to the digital assets involved (code, components, data).
- Can we provide a compact representation?
Enrico Daga and Paul Groth. "Data journeys: Explaining AI workflows
through abstraction." Semantic Web 15.4 (2024): 1057-1083.

DAG representation of a Random Forests Python Notebook

Initial user survey

Exploration to co-construction
Katharina J. Rohlfing et al. 2020. Explanation as a social practice: toward a
conceptual framework for the social design of AI systems. IEEE Transactions on
Cognitive and Developmental Systems, 13, 3, 717–728.

What would it look like for this system?

XAI tools and techniques are available
Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2021. A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI Systems.
ACM Trans. Interact. Intell. Syst. 11, 3–4, Article 24 (December 2021), 45 pages. https://doi.org/10.1145/3387166

XAI - Evaluation of Explanations

▫ Metrics [1]:
▿ functionally-grounded metrics - do not require human feedback and measure properties of the explanation itself (e.g. faithfulness: how accurately does the explanation correspond to the thing being explained);
▿ human-grounded metrics - involve human participation through feedback, observation or proxy tasks (e.g. how interpretable is an explanation to an end user);
▿ application-grounded metrics - measure explanations through their usage in an application (e.g. does the performance of the human-AI system improve on a downstream task).
▫ Challenges:
▿ Interactivity
▿ Many personas
[1] Gesina Schwalbe and Bettina Finzel. A comprehensive taxonomy for
explainable artificial intelligence: a systematic survey of surveys on
methods and concepts. Data Mining and Knowledge Discovery,
38(5):3043–3101, 2024.
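
To make the functionally-grounded category concrete, here is a minimal sketch of one possible faithfulness check that needs no human feedback. It is an assumption for illustration, not a metric taken from Schwalbe and Finzel: the toy linear model, its weights, and all function names are hypothetical. The check sets one feature to zero at a time and compares the resulting change in model output with the importance the explanation assigned to that feature.

```python
from math import sqrt


def model(x, weights):
    """Toy linear model standing in for the system being explained."""
    return sum(w * v for w, v in zip(weights, x))


def deletion_effects(x, weights):
    """Change in model output when each feature is set to zero in turn."""
    base = model(x, weights)
    effects = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = 0.0
        effects.append(base - model(ablated, weights))
    return effects


def pearson(a, b):
    """Pearson correlation between two equally long, non-constant sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def faithfulness(attributions, x, weights):
    """Correlation between claimed importances and measured deletion effects."""
    return pearson(attributions, deletion_effects(x, weights))


weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]

faithful = [w * v for w, v in zip(weights, x)]  # matches the model exactly
misleading = [0.0, 5.0, 0.0]                     # credits the wrong direction

score_faithful = faithfulness(faithful, x, weights)
score_misleading = faithfulness(misleading, x, weights)
```

A score near 1 means the explanation ranks features the way the model actually uses them; a low or negative score flags an explanation that does not correspond to the model's behaviour.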

Virtual Personas
Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph
Suh, Widyadewi Soedarmadji, Eran Kohen Behar, and
David Chan. 2024. Virtual Personas for Language Models
via an Anthology of Backstories. In Proceedings of the 2024
Conference on Empirical Methods in Natural Language
Processing, pages 19864–19897, Miami, Florida, USA.
Association for Computational Linguistics.

Virtual Personas - Critique
Lindia Tjuatja, Valerie Chen, Tongshuang Wu,
Ameet Talwalkar, Graham Neubig. Do LLMs
Exhibit Human-like Response Biases? A Case
Study in Survey Design. Transactions of the
Association for Computational Linguistics 2024;
12: 1011–1026. doi: https://doi.org/10.1162/tacl_a_00685

LLMs as judges
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. 2024.
Leveraging Large Language Models for NLG Evaluation: Advances and Challenges. In Proceedings of
the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16028–16045,
Miami, Florida, USA. Association for Computational Linguistics.

LLMs as judges
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang,
Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric
P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023.
Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In
Proceedings of the 37th International Conference on Neural
Information Processing Systems (NIPS '23). Curran Associates Inc.,
Red Hook, NY, USA, Article 2020, 46595–46623.

https://github.com/JanKalo/enexa_explanation/

1. Clarity & Structure: Does the explanation flow logically? Is it easy to follow?
2. Depth & Completeness: Does the explanation offer sufficient detail without
omitting crucial points?
3. Correctness & Fidelity: Are facts accurate, and does the explanation
remain faithful to the original query/context?
4. Relevance & Focus: Does the content stay on-topic and address user
queries directly?
5. Appropriateness for the Persona: Is the style/tone appropriate for the
user’s persona (e.g., an AI engineer, business strategist, etc.)?
6. Transparency: Does the explanation clarify its reasoning or highlight
uncertainties?
7. Engagement & Intuition: Is the conversation engaging, and does it address
the user’s interests intuitively?
Criteria
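
A minimal sketch of how the seven criteria could be turned into a scoring rubric for an LLM judge. The 1-to-5 scale, the prompt wording, and the `build_judge_prompt` and `aggregate_scores` helpers are assumptions for illustration; they are not taken from the enexa_explanation repository, and no model is called here.

```python
from statistics import mean

# The seven criteria from the slide, keyed by name.
CRITERIA = {
    "Clarity & Structure": "Does the explanation flow logically? Is it easy to follow?",
    "Depth & Completeness": "Does it offer sufficient detail without omitting crucial points?",
    "Correctness & Fidelity": "Are facts accurate and faithful to the original query/context?",
    "Relevance & Focus": "Does the content stay on-topic and address user queries directly?",
    "Appropriateness for the Persona": "Is the style/tone appropriate for the user's persona?",
    "Transparency": "Does the explanation clarify its reasoning or highlight uncertainties?",
    "Engagement & Intuition": "Is it engaging, and does it address the user's interests intuitively?",
}


def build_judge_prompt(persona, explanation):
    """Assemble a rubric prompt; the 1-5 scale is an assumed convention."""
    rubric = "\n".join(
        f"{i}. {name}: {question}"
        for i, (name, question) in enumerate(CRITERIA.items(), start=1)
    )
    return (
        f"You are judging an explanation written for this persona: {persona}\n"
        f"Rate each criterion from 1 (poor) to 5 (excellent).\n\n"
        f"{rubric}\n\nExplanation:\n{explanation}"
    )


def aggregate_scores(scores):
    """Validate one judge response and return the unweighted mean score."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score out of range for {name}: {scores[name]}")
    return mean(scores[name] for name in CRITERIA)


prompt = build_judge_prompt("AI engineer", "The model was trained on ...")
# Hypothetical judge output, hard-coded here instead of calling a model.
example_scores = dict(zip(CRITERIA, [4, 3, 5, 4, 4, 3, 5]))
overall = aggregate_scores(example_scores)
```

Keeping per-criterion scores rather than only the mean makes it possible to see which dimension an explanation fails on for a given persona.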

Steps Forward
• A persona database
• Validation of LLM-as-a-judge
• Multi-modal provenance explanations
• Leverage detailed provenance information with visual and interactive outputs
• Interactive AI agents
• Develop agents that can illustrate, gesture to diagrams, and walk users through data flow across AI system components in real-time
• Enhanced provenance systems:
• Implement retrieval augmented generation over provenance stores
• Integrate LLMs directly into provenance collection and preparation
processes
• Appropriate abstraction levels for user interaction
• Interactive explanation environments


from https://www.dagstuhl.de/25051

Complex Use Cases & Common Evaluation Pitfalls
Complex use cases:
- Law
- Clinical Trials
- Science
Common evaluation pitfalls:
- Missing or Incorrect Ground Truth Data
- Data Leakage
- Confirmation Bias
- Deployment Mismatch

Desiderata for Evaluation Approaches
An evaluation approach should:
1. be able to evaluate such multifaceted and complex outputs
2. work when no ground truth is available
3. run in a continuous manner
4. cope with changes in outputs
5. make efficient use of human effort
6. be readily applicable to new problems, tasks and domains with a minimal amount of effort
7. cope with variation in the tailored outputs

Defining performance in terms of explanation quality
AI System performance:
Given a set of tasks and the corresponding outputs and explanations created by
an AI system, AI System performance is the aggregation of the quality of those
explanations.
Claim:
By evaluating through explanations we cover the desiderata
on the prior slide
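
The definition can be written down directly. A minimal sketch, assuming each explanation already has a quality score in [0, 1] from some assessment step and that the aggregation defaults to a plain mean. The slide does not fix the scoring scale or the aggregation function, so both choices, along with the `TaskResult` type, are assumptions.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class TaskResult:
    task: str
    output: str
    explanation: str
    explanation_quality: float  # assumed to lie in [0, 1]


def system_performance(results, aggregate=fmean):
    """Aggregate explanation quality over all tasks; the mean is an assumed default."""
    if not results:
        raise ValueError("at least one task result is required")
    return aggregate([r.explanation_quality for r in results])


# Hypothetical results for three tasks.
results = [
    TaskResult("task-1", "output-1", "explanation-1", 0.5),
    TaskResult("task-2", "output-2", "explanation-2", 1.0),
    TaskResult("task-3", "output-3", "explanation-3", 0.75),
]

mean_quality = system_performance(results)
worst_case = system_performance(results, aggregate=min)
```

Passing a different `aggregate` (here `min`) shows that the choice of aggregation is itself a design decision: a mean rewards average explanation quality, while a minimum penalises any single poorly explained task.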

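[Figure: evaluation workflow in three stages. Design and Development: the objective, task, and context feed Explanation Dimension Selection and AI System Development. Deployment: AI System Execution of the developed AI system produces execution results and execution traces, and the traces feed Explanation Generation. Evaluation: the explanation goes to Explanation Assessment, which applies the selected dimensions and metrics and yields an evaluation result.]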

Conclusion
• AI systems build explanations together with users. It’s a process.
• We need better ways to evaluate such processes
• LLMs as proxy personas and LLMs as judges may allow for extensive
and reproducible evaluations of explanations.
• Explanation as evaluation for AI Systems
Paul Groth | @pgroth | pgroth.com | indelab.org
