i4Health - Research

Position: Beyond Assistance – Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care

This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centred assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.

npj Digital Medicine

The FAIIR conversational AI agent assistant for youth mental health service provision

Frontline crisis support plays a critical role in youth mental health services, where Crisis Responders (CRs) engage in conversations and assign issue tags to guide interventions. To enhance this process, we introduce FAIIR (Frontline Assistant: Issue Identification and Recommendation), an ensemble of domain-adapted transformer models trained on 780,000 conversations. FAIIR aims to reduce CR’s cognitive burden, enhance issue identification accuracy, and streamline post-conversation administrative tasks. Evaluated on retrospective data, FAIIR achieves an average AUC ROC of 94%, an average F1-score of 64%, and an average recall score of 81%. During the silent testing phase, its performance remained robust, with less than a 2% drop in all metrics. CRs exhibited 90.9% agreement with its predictions, and expert agreement with FAIIR exceeded their agreement with original labels. These findings highlight FAIIR’s potential to assist CRs in prioritizing urgent cases and ensuring appropriate resource allocation in crisis interventions.

Collaboration work with Vector Institute and Kids Help Phone

[arxiv] (accepted at MICCAI 2025)

Advancing Medical Representation Learning Through High-Quality Data

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains underexplored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality-not just size-drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC , along with the trained models and our codebase.

Collaboration work with Vector Institute

[arxiv]

Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, and train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features. Finally, we make our code publicly available.

Collaboration work with Vector Institute

[arxiv]

Academic case reports lack diversity: Assessing the presence and diversity of sociodemographic and behavioral factors related to Post COVID-19 Condition

Understanding disparities in the prevalence of Post COVID-19 Condition (PCC) amongst vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by leveraging natural language processing (NLP) techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following the construction of a PCC Case Report Corpus, comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports were annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve quality, diversity and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), trigram and frequency analyses were developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models outperformed traditional RNN-based models in generalizability to distinct sentence structures and greater class sparsity. Exploratory analysis revealed variability in entity richness, with prevalent entities like condition, age, and access to care, and underrepresentation of sensitive categories like race and housing status. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition. The NLI objective (entailment and contradiction analysis) showed attributes like ”Experienced violence or abuse” and ”Has medical insurance” had high entailment rates (82.4%-80.3%), while attributes such as ”Is female-identifying,” ”Is married,” and ”Has a terminal condition” exhibited high contradiction rates (70.8%-98.5%). We conclude that transformer-based NER models are effective in extracting SDOH information from PCC case reports; however, there is a scarcity of multiple sociodemographic factors in PCC-related academic case reports and an imbalanced representation of mentions aligned with groups of interest within dimensions such as gender, insurance status, and age

Collaboration work with Vector Institute

[medrxiv]

Diagnosing and remediating harmful data shifts for the responsible deployment of clinical AI models

Harmful data shifts occur when the distribution of data used to train a clinical AI system differs significantly from the distribution of data encountered during deployment, leading to erroneous predictions and potential harm to patients. We evaluated the impact of data shifts on an early warning system for in-hospital mortality that uses electronic health record data from patients admitted to a general internal medicine service, across 7 large hospitals in Toronto, Canada. To explore the robustness of the model, we evaluated potentially harmful data shifts across demographics, hospital types, seasons, time of hospital admission, and whether the patient was admitted from an acute care institution or nursing home, without relying on model performance. Interestingly, many of these harmful data shifts were unidirectional. Overall, our study is a crucial step towards the deployment of clinical AI models, by providing strategies and workflows to ensure the safety and efficacy of these models in real-world settings.

Collaboration work with Vector Institute and GEMINI

[IEEE Access]

Evaluating knowledge transfer in the neural network for medical images

In this study, we implement different experiments for standard transfer learning approaches as our baseline and introduce the adoption of a novel knowledge transfer approach, called teacher-student learning framework, to improve the performance of diagnostic predictive models in medical imaging. Specifically, we investigate various configurations in the teacher-student learning framework inspired by the activation attention transfer in computer vision models to help address some of the challenges faced in medical imaging, such as the limited availability of annotated data and limited computing resources. We show that the teacher-student learning approach has the potential to significantly improve the performance of diagnostic predictive models. Our findings could have a positive impact on healthcare accessibility and affordability, as they may enable the development of more cost effective and widely available medical imaging technologies under a limited data environment.

[NATURE]

Transitioning sleeping position detection in late pregnancy using computer vision from controlled to real-world settings: an observational study

Sleeping on the back after 28 weeks of pregnancy has recently been associated with giving birth to a small-for-gestational-age infant and late stillbirth, but whether a causal relationship exists is currently unknown and difficult to study prospectively. This study was conducted to build a computer vision model that can automatically detect sleeping position in pregnancy under real-world conditions. Real-world overnight video recordings were collected from an ongoing, Canada-wide, prospective, four-night, home sleep apnea study and controlled-setting video recordings were used from a previous study. Images were extracted from the videos and body positions were annotated. Five-fold cross validation was used to train, validate, and test a model using state-of-the-art deep convolutional neural networks. The dataset contained 39 pregnant participants, 13 bed partners, 12,930 images, and 47,001 annotations.

[YouTube],

Collaboration work with UHN

[PLoS Digital Health]

Sleep in Late Pregnancy: Artificial Intelligence for the Detection and Diagnosis of Disturbances and Disorders (SLeeP AID4)

Recognizing that approximately one-third of the pregnancy is spent asleep, there has been a surge of clinical and research interest over the last two decades in the potential roles that sleep (and poor sleep) during pregnancy might play in adverse pregnancy outcomes. In this project, we are conducting multi-night, multi-participant, in-home sleep apnea studies in late pregnancy and developing video-based sleep apnea diagnosis technology. We believe the old adage, "diagnosis is treatment." This technology will eventually be used to streamline the diagnosis of sleep apnea in pregnancy so that it can be urgently triaged for appropriate management. This technology will also be able to diagnose sleep apnea in the bed partner simultaneously.

[YouTube]

Collaboration work with UHN

[JMIR]

Using Social Media to Help Understand Patient-Reported Health Outcomes of Post–COVID-19 Condition: Natural Language Processing Approach

In response to the emergence of long COVID, we developed an NLP pipeline to facilitate extracting information from user-reported experiences on social media platforms. In this study, we examined the validity and effectiveness of our NLP pipeline to provide insights into patient-reported long COVID-related health outcomes across two popular social media platforms, Twitter and Reddit. In doing so, we extracted symptoms and conditions (SyCo) and estimated their occurrence frequency. We compared the outputs with human annotations and highly utilized clinical outcomes grounded in the medical literature. Lastly, we tracked occurrences of SyCo terms over time and geographies to explore the pipeline's potential to be used as a surveillance tool reflecting users’ opinions and experiences. The outcome of our social media-derived pipeline is comparable with the results of

peer-reviewed articles relevant to long COVID symptoms. Overall, this study provides unique

insights into patient-reported health outcomes from long COVID and valuable information about

the patient’s journey that can help healthcare providers anticipate future needs.

Collaboration work with Vector Institute, Roche, Deloitte, TELUS, and UHN

[JMIR]

Natural Language Processing for Clinical Laboratory Data Repository Systems: Implementation and Evaluation for Respiratory Viruses

This study explores the feasibility of using a deep learning-based natural language processing (NLP) model for information extraction from unstructured laboratory reports. The NLP model, trained on a large corpus of annotated laboratory reports, demonstrated strong performance in extracting clinically meaningful medical concepts. The model's stability and generalizability were evaluated across different test sets, showing consistently high accuracy. The study highlights the potential of deep learning-based NLP models to automate the parsing of laboratory data, enabling scalable and efficient access to valuable information for decision support and analysis.

Collaboration work with Vector Institute and ICES

[EMNLP - ACL Anthology]

Bringing the State-of-the-Art to Customers: A Neural Agent Assistant Framework for Customer Service Support

Building Agent Assistants that can help improve customer service support requires inputs from industry users and their customers, as well as knowledge about state-of-the-art Natural Language Processing (NLP) technology. We combine expertise from academia and industry to bridge the gap and build task/domain-specific Neural Agent Assistants (NAA) with three high-level components for: (1) Intent Identification, (2) Context Retrieval, and (3) Response Generation. In this paper, we outline the pipeline of the NAA's core system and also present three case studies in which three industry partners successfully adapt the framework to find solutions to their unique challenges. Our findings suggest that a collaborative process is instrumental in spurring the development of emerging NLP models for Conversational AI tasks in industry.

Collaboration work with Vector Institute, KPMG, PwC, and CIBC

[IEEE] Journal of Translational Engineering in Health and Medicine

A Computer Vision Approach to Identifying Ticks Related to Lyme Disease

In this work, we build an automated detection tool that can differentiate blacklegged ticks from other tick species using advanced computer vision approaches in real-time. Specially, we use convolution neural network models, trained end-to-end, to classify tick species. Also, advanced knowledge transfer techniques are adopted to improve the performance of convolution neural network models. Our best convolution neural network model achieves 92% accuracy on unseen tick species. Our proposed vision-based approach simplifies tick identification and contributes to the emerging work on public health surveillance of ticks and tick-borne diseases. In addition, it can be integrated with the geography of exposure and potentially be leveraged to inform the risk of Lyme disease infection. This is the first report of using deep learning technologies to classify ticks, providing the basis for automation of tick surveillance, and advancing tick-borne disease ecology and risk management.

Collaboration work with Vector Institute and Public Health Ontario

Page updated

Google Sites

Report abuse