Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions held on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Please note that all times are shown in the time zone of the conference.

 
 
Session Overview
Session
PSG 1 - e-Government_B
Time:
Thursday, 28/Aug/2025:
8:30am - 10:30am

Session Chair: Dr. Shirley KEMPENEER, Tilburg University

"Large language models and chatbots in the public sector"

 


Presentations

Evaluating Public Servants’ Perceptions and Regulatory Implications of LLM-Based Chatbots in Street-Level Organizations

Raimund LEHLE

University of Applied Sciences – Public Administration and Finance, Ludwigsburg, Germany

The integration of generative artificial intelligence (genAI) into public administration is a contemporary focus, driven by rapid advances in AI technology and the increasing demand for more efficient and responsive public services. This study examines the application and perceptions of large language model (LLM)-based chatbots among public servants in an exemplary street-level organization, which is doubly bound by the obligation to also protect citizens from potential algorithmic harm (Kuziemski and Misuraca 2020). We conducted an experimental study to explore the use cases, as well as the perceived opportunities and risks surrounding this technology from the perspective of these key public administrators, and supplement it with expert interviews on the gathered responses.

This study employs categories derived from public value theory (Andersen et al. 2012) to explore the sentiments expressed regarding AI-enhanced service provision. This theoretical framework is complemented by principles of responsible AI (Papagiannidis et al. 2025), which add explicit ethical implications of AI integration in public services. The study involved a sample of public servants from a municipal-level organization. To gather in-depth insights into the regulatory and ethical dimensions of AI deployment, expert interviews were conducted with legal scholars and municipal AI strategists. Data were collected through a preceding survey, an experiment, and semi-structured interviews, and were subsequently analyzed using a combination of qualitative and quantitative methods.

Preliminary findings and the literature indicate a mixed response: while many public servants recognize the potential efficiency gains of LLM-based chatbots, in line with Cantens (2024), concerns persist regarding data privacy, decision-making transparency, and potential biases inherent in AI systems. We identify regulatory gaps and inconsistencies concerning AI deployment in public sector contexts, highlighting the need for coherent guidelines to govern the integration and operation of AI tools in public service and to enable case workers to use these tools proficiently.

This work contributes to the discourse on the practical implications of AI in public administration, providing practical insights into the use cases and implications for public value generation. Preliminary findings suggest that while AI has the potential to significantly enhance service delivery, it is crucial to address the identified risks through coherent regulatory frameworks and targeted training programs. We recommend specific policy adjustments, such as the establishment of AI ethics boards and the implementation of regular audits to ensure compliance with ethical standards (Desouza et al. 2020), as well as coherent digital procedural legislation. Additionally, training initiatives for public servants should focus on enhancing their understanding of AI technologies and their ethical implications, thereby fostering a more informed and responsible approach to AI integration (Ahn and Chen 2022) and ensuring sensible AI deployment in public services.



Towards a Benchmark for LLM-Based Agents in Public-Sector Institutions

Jonathan Rystrøm1, Chris Schmitz2, Karolina Korgul1, Jan Batzner3,4

1Oxford Internet Institute, University of Oxford, UK; 2Centre for Digital Governance, Hertie School, Germany; 3Weizenbaum Institute, Germany; 4Technical University of Munich, Germany

Large Language Model (LLM)-based agents offer significant potential for public sector organizations by streamlining processes, improving processing speed, and increasing consistency and transparency when implemented effectively (Straub et al., 2024). However, current work examining their impact on the public sector is insufficient to guide both research and practical implementation. Empirical analyses of AI adoption lag behind the technological frontier and focus too narrowly on the small group of early-adopting institutions, paying insufficient attention to what is technologically possible. Theoretical approaches lack grounding in actual technological capabilities. Neither adequately addresses the "jagged frontier" of progress: what is theoretically automatable today versus what is not. We posit that this knowledge gap severely inhibits meaningful analysis and forward-looking policy formulation, particularly regarding "downstream" effects of agent integration, such as organizational change and the shifting role of human bureaucrats.

We argue here that benchmarking, the systematic evaluation of LLM-based agents against sets of tasks (Wang et al., 2024), is a promising avenue of research to remedy these problems. We derive essential criteria for effective public sector agent benchmarking from theories of public management and automation. First, benchmarks must be based on authentic public sector work and reflect the wide variety of subject knowledge, media formats, and administrative freedom this work may entail (Zacka, 2022). Second, they should reflect processes with several interdependent subtasks that feed into one another, as proposed in leading models of automation (Acemoglu and Restrepo, 2018). These tasks should require interaction with complex systems and precise interpretation of regulations. Third, benchmarks must allow for meaningful translation of technical performance metrics into human-compatible metrics (Thomas & Uminsky, 2022). Evaluation must extend beyond simple performance metrics to include robustness to environmental changes, cost-effectiveness compared to human baselines, and fairness assessments to identify potential biases.

Using these criteria, we evaluate 874 existing agent benchmarks through LLM-assisted distant reading. We employ a systematic approach in which LLMs analyze whether each benchmark's title and abstract satisfy our specified criteria, providing written justifications followed by binary valid/invalid determinations. Our comprehensive review reveals significant gaps: a complete lack of realistic public sector-relevant processes, no conceptualization of fairness metrics, very limited measurement beyond simple performance, and almost no translation to human-relevant metrics. These findings highlight the need for benchmarks that enable more direct comparisons with human performance, allow better assessment of automation potential, and guide AI development toward solutions more beneficial for actual public sector tasks. This approach will provide researchers and policymakers with tools to better understand the current and future impacts of AI in public administration, supporting evidence-based workforce planning and organizational development.
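
To illustrate what such an LLM-assisted distant-reading pass could look like in practice, the sketch below assumes the OpenAI Python client with JSON-mode output; the model name, criterion wording, and field names are illustrative assumptions rather than the authors' actual pipeline.

```python
# Hypothetical sketch: ask an LLM whether a benchmark's title and abstract
# satisfy a screening criterion, returning a written justification and a
# binary valid/invalid decision. Model name, prompts, and field names are
# assumptions for illustration, not the authors' actual setup.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = {
    "authentic_work": "Is the benchmark grounded in authentic public sector work?",
    "interdependent_subtasks": "Does it cover several interdependent subtasks that feed into each other?",
    "human_compatible_metrics": "Can its performance metrics be translated into human-compatible terms?",
}

def judge(title: str, abstract: str, criterion: str) -> dict:
    """Return {'justification': str, 'valid': bool} for one benchmark and one criterion."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Benchmark title: {title}\n"
        f"Abstract: {abstract}\n\n"
        'Reply as a JSON object with fields "justification" (string) and "valid" (boolean).'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # machine-readable output
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Screen one benchmark entry against all three criteria.
verdicts = {
    name: judge("ExampleAgentBench", "A benchmark of simulated web tasks for LLM agents.", question)
    for name, question in CRITERIA.items()
}
```

In a full pipeline, a loop of this kind would run over all screened benchmarks and the binary verdicts would then be aggregated per criterion.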



Ground-truth is law: A Systematic Review of Evaluation Methods for Legal Case Retrieval Systems

Julian Michael Quintijn LEEFFERS

Tilburg University, The Netherlands

The digitalisation of public sector information is making large volumes of legal decisions publicly available, creating opportunities for Legal Case Retrieval (LCR) systems to enhance transparency and consistency in judicial and administrative decision-making. Yet assessing whether these systems work effectively depends on well-constructed ground-truth datasets: labelled collections of legal documents indicating which cases are considered relevant for a given query or reference case. Current practices vary widely and often fail to reflect the nuanced legal information needs of practitioners. This study systematically reviews 28 academic works covering 31 datasets, examining evaluation frameworks, labelling methods, and the relevance dimensions they embody. Findings reveal a dominance of topical and algorithmic relevance, with situational, cognitive, and domain-specific aspects underrepresented. The paper calls for transparent, multidimensional, and legally grounded evaluation practices to ensure LCR systems align with the broader goals of public administration and legal information-seeking behaviour. Recommendations include leveraging large language models for explainable annotations and incorporating diverse relevance dimensions.
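
To make concrete how such ground-truth labels are used in evaluation, here is a minimal sketch in plain Python; the case identifiers are toy values and the precision/recall-at-k metric is chosen for simplicity, not prescribed by the paper.

```python
# Minimal sketch of scoring a legal case retrieval system against a
# ground-truth dataset: each query case is paired with the decisions that
# annotators labelled as relevant, and the system's ranked output is compared
# with that list. Case identifiers and the metric choice are illustrative.

def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Compare the top-k retrieved case IDs against the labelled relevant cases."""
    hits = sum(1 for case_id in retrieved[:k] if case_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy ground truth: query case -> set of cases judged relevant by annotators.
ground_truth = {
    "ECLI:NL:HR:2020:0001": {"ECLI:NL:HR:2018:0002", "ECLI:NL:RBAMS:2019:0003"},
}
# Toy system output: ranked list of retrieved case IDs per query.
system_output = {
    "ECLI:NL:HR:2020:0001": ["ECLI:NL:HR:2018:0002", "ECLI:NL:HR:2015:0009", "ECLI:NL:RBAMS:2019:0003"],
}

for query, relevant in ground_truth.items():
    p, r = precision_recall_at_k(system_output[query], relevant, k=3)
    print(f"{query}: P@3={p:.2f}, R@3={r:.2f}")
```

How the relevant sets are constructed, and whether relevance is treated as more than a binary topical match, is precisely the evaluation practice the review examines.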



How Generative LLMs Work and Their Strengths and Limitations for Policy Consultation

Gerald Zhiyong LAN, Dongquan Li

HK University of Science and Technology (GZ), People's Republic of China

AI is the catchword of our times. Among the piles of AI-related publications on AI's role in policy and governance support, few discuss the logic of how LLMs actually work. They are essentially language models. These language models, however, attempt to exhaust the potential of language for describing and articulating human ideas, which in one way or another can affect behavior and action. Using the theories of Ludwig Wittgenstein, Michel Foucault, and Jürgen Habermas on power, knowledge, language, and communicative discourse, the paper discusses how large language models could significantly transform human life despite their lack of real-life logic and human feelings. At the same time, the paper shows how the limitations of LLMs can be deadly if they are not understood and used properly. With theoretical reasoning and empirical arguments, the paper attests to the utmost importance of AI governance and makes suggestions on how such governance can be achieved. After all, AI is a tool, and the tool needs to be used properly to help achieve ultimate human goals such as integrity, trust, peace, justice, and human dignity.