Toolkits for evaluating clinical artificial intelligence systems
Hosted at the University of Alberta

Tools
AQAT:RoB (Alberta Quality Assessment Tool: Risk of Bias)
An evaluation framework designed for risk-of-bias assessment of medical diagnostic AI systems, focusing on reliability and clinical safety standards.
Authors
Carrie Ye, J Ross Mitchell, Daniel C Baumgart, Zechen Ma, Angela Lim Fung, Daniela Garcia Orellana, Juel Chowdhury, Abass Abdullah, Steven Katz, Jacob L Jaremko, Pierre Boulanger, Claire E H Barber, Gillian Lemermeyer, Hosna Jabbari, Lili Mou, Maryam Mirzaei, Mary Waithera Beckett Githumbi, Puneeta Tandon, Randy Goebel, Rhys Clark, Whitney Hung, Marjan Abbasi, Farhad Maleki, Scott Klarenbach, Mohamed Abdalla
CIRE (LLM-based Clinical Information Retention Evaluation)
An LLM-based metric that measures whether clinically meaningful information has been changed, removed, or preserved in clinical texts.
Authors
Kiana Aghakasiri, Noopur Zambare, JoAnn Thai, Carrie Ye, Mayur Mehta, J Ross Mitchell, Mohamed Abdalla
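CIRE's released implementation defines its own interface; purely as an illustration of the underlying idea, the sketch below uses a stand-in LLM judge to classify each clinically meaningful fact as preserved, changed, or removed between two versions of a note. All names here (llm_judge, score_retention, RetentionResult) are hypothetical, not CIRE's actual API.

```python
# Hypothetical sketch of an LLM-judged retention check (not CIRE's actual API).
from dataclasses import dataclass

@dataclass
class RetentionResult:
    preserved: int
    changed: int
    removed: int

    @property
    def retention_score(self) -> float:
        total = self.preserved + self.changed + self.removed
        return self.preserved / total if total else 1.0

def llm_judge(prompt: str) -> str:
    """Stand-in for a call to an LLM; replace with your model client."""
    raise NotImplementedError

def score_retention(source_note: str, revised_note: str,
                    facts: list[str]) -> RetentionResult:
    """Classify each clinically meaningful fact as preserved, changed, or removed."""
    counts = {"preserved": 0, "changed": 0, "removed": 0}
    for fact in facts:
        label = llm_judge(
            f"Fact: {fact}\nOriginal note: {source_note}\n"
            f"Revised note: {revised_note}\n"
            "Answer with one word: preserved, changed, or removed."
        ).strip().lower()
        counts[label if label in counts else "changed"] += 1
    return RetentionResult(**counts)
```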
BEAM
A large-scale benchmark of long, coherent multi-domain conversations (up to 10 million tokens) with probing questions designed to measure diverse long-term memory abilities in LLMs.
Authors
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell
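BEAM's conversations and probing questions are constructed with far more variety than this, but as a rough needle-in-a-haystack-style illustration of memory probing, the sketch below plants a fact early in a long conversation and asks about it many turns later. All names are hypothetical.

```python
# Hypothetical needle-in-a-haystack-style memory probe (not BEAM's harness).
import random

def build_probe(turns: list[str], fact: str, min_gap: int,
                seed: int = 0) -> tuple[list[str], str]:
    """Plant a fact at least `min_gap` turns before the end of a long
    conversation, then probe for it, testing recall across that gap."""
    rng = random.Random(seed)
    position = rng.randrange(0, max(1, len(turns) - min_gap))
    conversation = list(turns)
    conversation.insert(position, f"Please remember this: {fact}")
    probe = "Earlier I asked you to remember a fact. What was it?"
    return conversation, probe
```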
A cognitively inspired memory-augmentation framework that equips LLMs with episodic, working, and scratchpad memory systems to improve performance on long-context tasks evaluated by BEAM.
Authors
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell
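The released framework defines its own interfaces; as a minimal sketch of the three-store idea only, the class below keeps an episodic log, a bounded working memory, and a scratchpad, and assembles context for a prompt with a deliberately naive keyword retriever. Class and method names are illustrative assumptions.

```python
# Hypothetical sketch of episodic, working, and scratchpad memory stores.
from collections import deque

class MemoryStores:
    def __init__(self, working_capacity: int = 8):
        self.episodic: list[str] = []                   # long-term log of past events
        self.working = deque(maxlen=working_capacity)   # recent context only
        self.scratchpad: list[str] = []                 # intermediate reasoning notes

    def observe(self, event: str) -> None:
        """Record an event in both long-term and working memory."""
        self.episodic.append(event)
        self.working.append(event)

    def note(self, thought: str) -> None:
        """Jot an intermediate thought onto the scratchpad."""
        self.scratchpad.append(thought)

    def context_for_prompt(self, query: str, k: int = 3) -> str:
        # Naive retrieval: surface episodic events sharing words with the query.
        words = set(query.lower().split())
        hits = [e for e in self.episodic if words & set(e.lower().split())][:k]
        return "\n".join([*hits, *self.working, *self.scratchpad])
```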
AQAT:RoB Assessment Form
Fill in the judgements below and export when complete. (A sketch of one possible export record follows the form.)
For each item below, record the requested Support for Judgement, weigh the listed Potential Sources of Bias, and assign a Judgement (RoB) for that Type of Bias.
Questions
Question Source
Support for Judgement:
• If questions were created or generated specifically for the study, describe the method used to create the question dataset, including who created the questions and whether the questions reflect the intended study objective
• If questions were selected from an existing question source, describe the source in enough detail to allow an assessment of whether it addresses the intended research question
Potential Sources of Bias:
• Expectancy bias due to creation of questions that reflect the researchers' expectations
• Representation bias due to the question source not being representative of the target population
• Construct-validity bias due to questions not matching the research aim
Question Selection
Support for Judgement:
• If questions were selected from an existing question source, describe the method used to select the questions from the original source (eg, random, consecutive, all, or by certain factors); a seeded-sampling sketch follows this list
Potential Sources of Bias:
• Sampling bias due to questions not being representative of the intended setting
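A minimal sketch of what a reproducible random selection might look like, assuming seeded sampling is the chosen method; reporting the seed and scheme lets readers reproduce the selection. Function and parameter names are illustrative.

```python
# Minimal sketch of reproducible question selection via seeded sampling.
import random

def select_questions(question_bank: list[str], n: int, seed: int = 42) -> list[str]:
    """Draw a reproducible random sample of n questions from the source."""
    rng = random.Random(seed)
    return rng.sample(question_bank, n)
```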
Question Manipulation
Support for Judgement:
• If any questions were manipulated from the original source, describe and justify the rationale for the manipulation
• If any prompting was provided in addition to the index question, report the exact wording of the prompt(s); a record-keeping sketch follows this list
Potential Sources of Bias:
• Construct-validity bias due to questions not matching the research aim
• Expectancy bias due to question manipulation that may reflect the researchers' expectations
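One way to keep manipulations and prompt wording reportable verbatim is to store them alongside each question; the record below is a minimal sketch with illustrative field names and example content, not a format the tool prescribes.

```python
# Minimal record-keeping sketch: keep the original item, the manipulated item,
# the rationale, and the exact prompt wording together for verbatim reporting.
from dataclasses import dataclass, asdict
import json

@dataclass
class QuestionRecord:
    original_question: str
    manipulated_question: str   # identical to original if unmanipulated
    rationale: str              # why the manipulation was made
    prompt_template: str        # exact wording sent to the model

records = [QuestionRecord(
    original_question="What is the first-line treatment for condition X?",
    manipulated_question="What is the first-line treatment for condition X? "
                         "Answer in one sentence.",
    rationale="Constrain answer length for evaluator workload.",
    prompt_template="You are a clinician. {question}",
)]
print(json.dumps([asdict(r) for r in records], indent=2))
```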
Reference Answers
Reference Answer Source
Support for Judgement:
• If reference answers were generated specifically for the study, describe the method used to create the reference answer dataset, including who created the reference answers and whether the answers reflect a true reference standard
• If reference answers were selected from an existing reference answer source, describe the source in enough detail to allow an assessment of whether it reflects a true reference standard
Potential Sources of Bias:
• Construct-validity bias due to answers not matching the true reference standard the study was designed to evaluate
• Representation bias due to the reference answer source not being representative of the target population
Reference Answer Selection
Support for Judgement:
• If not all reference answers to a given question were used, describe the method by which reference answers were selected
Potential Sources of Bias:
• Sampling bias due to answers not being representative of the intended setting
LLM Answers
LLM Answer Selection
Support for Judgement:
• Describe how many answers were generated for each question and, if not all answers were assessed, describe how answers were selected for assessment; a selection sketch follows this list
Potential Sources of Bias:
• Sampling bias due to the assessed answers not being representative of all LLM answers
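A minimal sketch, assuming several answers are sampled per question and a seeded subset is chosen for assessment; generate_answer is a stand-in for a model call, and both counts and the seed would be reported.

```python
# Minimal sketch of generating several answers and selecting a subset to assess.
import random

def generate_answer(question: str) -> str:
    """Stand-in for one sampled model completion."""
    raise NotImplementedError

def answers_for_assessment(question: str, n_generated: int = 5,
                           n_assessed: int = 3, seed: int = 0) -> list[str]:
    answers = [generate_answer(question) for _ in range(n_generated)]
    rng = random.Random(seed)
    return rng.sample(answers, n_assessed)  # report both counts and the seed
```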
Evaluators
Evaluator Selection
Support for Judgement:
• Describe the method used to select evaluators and to assign evaluators to specific LLM qualities
Potential Sources of Bias:
• Sampling bias due to evaluators not being representative of the intended setting
• Observation bias due to inadequate or inappropriate evaluator expertise for a specific LLM quality
Blinding of Evaluators
Support for Judgement:
• Describe all measures used, if any, to blind evaluators and researchers to the answer source, and report whether the intended blinding was effective; a blinding sketch follows this list
Potential Sources of Bias:
• Detection bias due to knowledge of the answer source
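One common blinding measure is to shuffle answers and replace source labels with opaque IDs before evaluators see them, keeping the unblinding key with a researcher who is not scoring. The sketch below illustrates this under those assumptions; the function name is hypothetical.

```python
# Minimal blinding sketch: shuffle answers and replace source labels with
# opaque IDs; the key linking IDs to sources is kept away from evaluators.
import random

def blind_answers(answers: list[tuple[str, str]], seed: int = 0):
    """answers: (source_label, answer_text) pairs. Returns blinded items
    for evaluators and a separate key for later unblinding."""
    rng = random.Random(seed)
    shuffled = list(answers)
    rng.shuffle(shuffled)
    blinded = [(f"item-{i:03d}", text) for i, (_, text) in enumerate(shuffled)]
    key = {f"item-{i:03d}": source for i, (source, _) in enumerate(shuffled)}
    return blinded, key
```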
Outcomes
Performance Metrics
Support for Judgement:
• Describe the specific metrics used for each outcome quality
• Describe whether the desired outcomes were pre-specified prior to conducting the study
Potential Sources of Bias:
• Measurement bias due to the LLM qualities or evaluation metrics not matching the research aim
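The form above is exported when complete; as a minimal sketch of what an exported judgement record might look like, assuming one row per assessed domain, the snippet below writes a CSV. Column names and example values are illustrative only, not the tool's actual export format.

```python
# Minimal export sketch: one row per bias domain, written to CSV.
import csv

judgements = [
    {"domain": "Question Source", "judgement": "low",
     "support": "Questions drawn consecutively from a public exam bank."},
    {"domain": "Blinding of Evaluators", "judgement": "high",
     "support": "Evaluators could identify model answers by formatting."},
]

with open("aqat_rob_judgements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["domain", "judgement", "support"])
    writer.writeheader()
    writer.writerows(judgements)
```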
Contact Us
Interested in clinical validation, research partnership, or technical implementation?