Name: Toronto Machine Learning Summit (TMLS) 10th Annual Conference & Expo 2026
Start: 2026-06-17T08:30:00-04:00
End: 2026-06-18T16:00:00-04:00
Location: CIBC SQUARE

Toronto Machine Learning Summit (TMLS) 10th Annual Conference & Expo 2026

A unique experience to upskill and learn from industry practitioners and academics in our community.

About this Event

10th Annual Toronto Machine Learning Summit (TMLS)

TMLS is in its 10th year, serving as a long-standing community conference bringing together academic research, industry applications, and business strategy in a safe, welcoming, and constructive environment for those working across ML, AI and agents.

Together, our events support a global community of 15,000+ practitioners, researchers, and industry leaders, exploring best practices, methodologies, and principles across three tracks: Research, Business Strategy, and Technical Applications.

This year's program structure

This year, TMLS is organized into three core categories, each with focused topic areas designed to reflect how teams actually work, not how conference agendas are usually labeled.

ML/AI Technical / Engineering Talks/Workshops

Hands-on ML and GenAI implementation in real production environments. This is for practitioners building and operating systems, not slideware.

Topic areas include:

Agent design patterns & architecture
Agentic workflow automation & orchestration
AI-assisted development tools and workflows
Building and extending coding agents
Data engineering & RAG pipelines
Enterprise adoption & team design
Evaluation methods & capability benchmarking
Fine-tuning & training
Inference serving & optimization
LLM prompt engineering & evals
Monitoring & drift detection
Safety / governance / auditability
Search / Recommendation systems
Tooling extensions & model augmentation techniques

Business / Executive / Product Strategy Talks/Workshops

How organizations decide if, when, and how to deploy AI, and what it takes to make it stick.

Topic areas include:

AI Adoption & Organizational Change
AI Strategy & Executive Decision-Making
Operating Model & Governance
Product & Go-to-Market Strategy
Risk, Compliance & Trust
ROI, Value & Business Impact
Translating Emerging AI Capability into Deployable Products

Fundamental Research (Capability advancement or novel methods) Talks/Workshops

Novel research and technical exploration that pushes the field forward, without needing an immediate product outcome.

Topic areas include:

Agentic behavior
Evaluation frameworks
Model architecture
Optimization / search
Reinforcement learning & control
Safety / interpretability
Training methods

What to Expect

2 days of hands-on workshops, breakout sessions, and keynotes
60+ expert speakers
400+ attendees
Pre-event, app-based networking
Food, drinks, and social events
High-quality conversations with people who’ve actually built these systems

This is a working conference. You’ll leave with patterns, tradeoffs, and mental models you can apply immediately, not just inspiration.

Who TMLS Is For

TMLS is designed for data scientists, ML engineers, researchers, product leaders, and executives who are:

Building or deploying models and agents in production
Navigating real organizational and regulatory constraints
Looking to learn from peers, not pitches

Whether you want to deepen your technical judgment, pressure-test strategic decisions, or connect with others who’ve “been there,” TMLS is built to support your growth.

Community First

With a community of 20,000+ ML researchers, practitioners, and leaders, TMLS is rooted in long-term relationships and shared context. Geography matters. Constraints matter. And learning happens faster when people speak honestly with others who operate in the same environment.

We believe real AI progress comes from sharing what actually happened, especially when things were hard.

Accessibility & Values

We believe these conversations should be accessible. Our ticket pricing reflects that commitment.

TMLS is dedicated to advancing the responsible, effective deployment of AI and ML across industries, and helping practitioners fast-track their learning while building meaningful, durable careers in this fast-moving field.

Visit: www.torontomachinelearning.com

June 16

🕑: 09:10 AM - 09:40 AM
Humans and Continual Learning AI Agents: The Journey
Host: Manuela Veloso, Herbert A. Simon University Professor E

Info: I will talk about AI agents, and multiagent systems in particular. I will focus on the agent's perception as the robust processing and sharing of information, the agent's cognition as its planning and memory-based reasoning abilities, and the agent's action as the capabilities to execute in its environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI, with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.

🕑: 09:45 AM - 10:15 AM
Keynote
Host: Michael Levin, Professor / PI, Tufts University

🕑: 10:20 AM - 10:50 AM
From Single Agent Evolution to Multi-Agent Synergy
Host: Bang Liu, Associate Professor, University of Montreal &

Info: This talk explores how agentic intelligence can improve along two axes: evolving a single agent and coordinating multiple agents. First, I introduce Programmatic Skill Networks, a framework in which agents continually acquire, repair, stabilize, and refactor reusable skills, turning a static skill library into an evolving structure for long-horizon learning. Second, I examine when multi-agent systems are actually preferable to scaling up a single agent under the same budget. The key message is that scale-out only helps when communication is reliable, shared failures are low, and the organization gains exceed single-agent compute scaling. These results suggest that future agents should be not only larger, but self-improving, reusable, and principled in how they coordinate.

🕑: 10:30 AM - 12:00 PM
From Detection to Resolution: Multi-Head LSTM Anomaly Detection and Agentic Ex
Host: Ramin Mardani, Machine Learning Engineer, TELUS

Info: Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I'll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session's second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture gener
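
The Wilson-smoothed win-rate re-ranking described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the TELUS implementation: it assumes the smoothing is the lower bound of the Wilson score interval, and the action names and outcome counts are hypothetical.

```python
import math

def wilson_lower_bound(wins: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a win rate.

    Penalizes actions with few recorded outcomes, so a 1-for-1 action
    does not outrank a 90-for-100 action.
    """
    if trials == 0:
        return 0.0
    p = wins / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom

def rank_actions(outcomes: dict) -> list:
    """Order actions by smoothed win rate, best first.

    `outcomes` maps an action name to (times the KPI recovered, times tried).
    """
    return sorted(outcomes, key=lambda a: wilson_lower_bound(*outcomes[a]), reverse=True)

# Hypothetical outcome store contents.
OUTCOMES = {
    "restart_cell": (1, 1),
    "retune_antenna": (90, 100),
    "reset_counters": (40, 100),
}
```

Calling `rank_actions(OUTCOMES)` places `retune_antenna` first even though `restart_cell` has a raw win rate of 100%, because a single observation carries little evidence.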

🕑: 11:30 AM - 12:00 PM
INSPIRE: Intent-aware Neural Sponsored Product Retrieval for E-commerce
Host: Shasvat Desai, Staff Machine Learning Scientist, Walmart

Info: Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like "schar white bread" implicitly encodes a gluten-free preference through brand association, while queries such as "chickpea pasta" or "oatmilk" reflect underlying dietary preferences like gluten-free, plant-based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective

🕑: 11:30 AM - 12:00 PM
Why Your RAG Agent Is Confidently Wrong: Retrieval Choices That Actually Matte
Host: David vonThenen, Sr AI/ML Engineer, Office of the CTO,

Info: Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.

We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather
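
One common way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The abstract does not say which fusion method the talk uses, so RRF is an assumption here, and the document IDs are hypothetical. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) in every list it appears in.
    Documents ranked well by more than one retriever rise to the top.
    Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from two retrievers for the same query.
vector_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Here `d1` and `d3` appear in both lists and end up ahead of `d2` and `d4`, which each appear in only one.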

🕑: 12:05 PM - 12:35 PM
Evaluating AI in Production - A Practical Guide
Host: Mengying Li, Head of Data, Braintrust

Info: This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they're a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.

🕑: 12:05 PM - 12:35 PM
Emulating Real-World PII with a Large-Scale Synthetic Dataset to Audit LLM Mem
Host: Sriram Selvam, Senior Software Engineer, Microsoft

Info: To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.

🕑: 12:40 PM - 01:10 PM
Keynote
Host: Dawn Song, Professor, Computer Science @ UC Berkeley

🕑: 01:15 PM - 01:45 PM
Encrypted Inferences Over Decision Trees
Host: Alex Shpurov, CTO, 01 Quantum

Info: This virtual talk will introduce the 01 Quantum AI Marketplace concept and demonstrate an example of an FHE-encrypted decision tree. The demo will show a client logging in with private data, encrypting that data using fully homomorphic encryption, sending it to a model owner, and receiving encrypted inference results.
Decision trees are a useful starting point for encrypted inference because they are deterministic and interpretable: for a given input, the model follows a clear path to a prediction. In an FHE setting, however, that path cannot be followed with normal plaintext branching. Instead, the tree logic is transformed into encrypted comparisons, path scoring, and leaf selection, allowing the model owner to evaluate the decision tree without accessing the client’s raw data.
The client then decrypts the result and obtains the final prediction.
The second part of the talk will explain why FHE development is fundamentally different from regular AI software development.
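
The shift from plaintext branching to comparisons, path scoring, and leaf selection can be illustrated without any encryption. The sketch below is plain Python, not FHE and not the 01 Quantum implementation; the tree, thresholds, and leaf values are hypothetical. In a real FHE setting each comparison bit would be an encrypted value produced by a homomorphic comparison circuit, and the arithmetic would run over ciphertexts.

```python
DEPTH = 2
# Internal nodes in level order: (feature_index, threshold). Go right if x[f] > t.
NODES = [(0, 5.0), (1, 2.0), (1, 7.0)]
# Leaf values, left to right.
LEAVES = [10, 20, 30, 40]

def oblivious_predict(x):
    """Evaluate the tree without following a data-dependent path."""
    # Step 1: compute every comparison, whether or not it lies on the true path.
    bits = [1 if x[f] > t else 0 for f, t in NODES]
    total = 0
    for leaf_index, value in enumerate(LEAVES):
        # Step 2: path score is 1 only if every decision on the way to this leaf matches.
        score, node = 1, 0
        for level in reversed(range(DEPTH)):
            go_right = (leaf_index >> level) & 1
            score *= bits[node] if go_right else (1 - bits[node])
            node = 2 * node + 1 + go_right
        # Step 3: leaf selection as a sum of score * value over all leaves.
        total += score * value
    return total

def branching_predict(x):
    """Ordinary plaintext traversal, used only as a reference."""
    node = 0
    for _ in range(DEPTH):
        f, t = NODES[node]
        node = 2 * node + 1 + (1 if x[f] > t else 0)
    return LEAVES[node - len(NODES)]
```

Both functions return the same prediction for any input. The oblivious version does the same amount of work regardless of the input, which is what allows the evaluator to run it on data it cannot read.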

🕑: 01:15 PM - 01:45 PM
Evaluating Netflix Show Synopses with LLM-as-a-Judge
Host: Cameron Wolfe, Staff Research Scientist, Netflix

🕑: 01:50 PM - 02:20 PM
Deploying Multi-Agent Systems in Regulated Enterprise Workflows
Host: Zeke Miller, Director, Machine Learning Engineering

Info: Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.

🕑: 01:50 PM - 02:20 PM
Agentic Financial Reasoning with Knowledge Graphs and LLMs
Host: Abhinav Arun, Senior AI Research Scientist, Domyn

Info: Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliabi

🕑: 02:25 PM - 02:55 PM
Fine-Tuning LLMs for Real-World Tool Calling: Lessons from tau2-bench
Host: Kai Wei Tan, Senior Forward Deployed Engineer, CoreWeave

Info: Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem, yet most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.

🕑: 02:25 PM - 02:55 PM
Optimizing Vector Search: Why You Should Flatten Structured Data
Host: Oleg Tereshin, Senior Software Engineer, Independent sof

Info: When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on T
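
The preprocessing step described in this abstract, converting structured JSON into natural-language text before embedding, might look roughly like the sketch below. The field names and the template wording are hypothetical and are not taken from the talk or from the Amazon ESCI schema.

```python
def flatten_record(record: dict) -> str:
    """Render a flat JSON-like record as plain sentences for embedding.

    Drops empty fields, turns snake_case keys into words, and joins list
    values with commas, so the text has no braces, quotes, or brackets.
    """
    parts = []
    for key, value in record.items():
        if value in (None, "", []):
            continue  # skip empty fields instead of embedding "null"
        label = key.replace("_", " ")
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{label}: {value}")
    return ". ".join(parts) + "."

# Hypothetical product record.
product = {
    "product_title": "Trail Running Shoes",
    "brand": "Acme",
    "color": "blue",
    "bullet_points": ["lightweight", "waterproof"],
    "size": None,
}
text_for_embedding = flatten_record(product)
```

The resulting string is what would be passed to the embedding model in place of the raw JSON.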

🕑: 03:00 PM - 03:30 PM
The Vicious Loop: Why Stateless Agents Fail in Production and How We Built Ep
Host: Dippu Kumar Singh, Leader Of Emerging Technologies (Apps

Info: Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.

To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.

The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background "Critic" process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline ap

🕑: 03:35 PM - 04:05 PM
Automated and Scalable RAG: Vector Stores, MCP, Clustering
Host: Matthew Mazzarell, AI Lead, Financial Services, America

Info: All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.

🕑: 04:10 PM - 04:40 PM
From Inventory to Accelerator: What Happens When a Large AI Org Builds Data Pr
Host: Mendelsohn Chan, Staff Deployment Architect, Applied AI,

Info: Most large AI organizations have a data problem that isn't what they think it is. The problem isn't missing data — it's data that exists, was built deliberately, is maintained by real people, and still isn't being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn't automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We'll walk through what the product-centric model solved, what it didn't, and what the discovery and trust gap actually costs

🕑: 04:10 PM - 04:40 PM
Jailbreaking the Blockchain: How I Used Game Theory to Map Prompt Injection At
Host: Naga Sujitha Vummaneni, Sr Security Engineer, Ripple

Info: Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this ses

🕑: 10:20 AM - 10:50 AM
Hallucination in LLMs: Detection, Mitigation, and Root Cause Awareness
Host: Ahmad Pesaranghader, Applied AI Research Scientist, CIBC

Info: Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.

June 17

🕑: 09:45 AM - 10:25 AM
Keynote
Host: Ion Stoica, Professor, UC Berkeley

🕑: 10:55 AM - 11:25 AM
Leading AI Change — The Human Side of Responsible Deployment
Host: Thena Sasitharan, Director, People Change Management - A

Info: As financial institutions accelerate AI adoption, the most persistent barriers to success aren't technical; they are organizational, cultural, and human. Governance frameworks stall without cultural buy-in. Risk policies go unimplemented when frontline teams don't understand them. Budgets are approved for AI initiatives that quietly fail at the adoption layer.

This session explores how CIBC is approaching AI transformation through a people-first change management lens—bridging the gap between executive strategy and ground-level execution. Drawing on the enterprise-wide rollout of CIBC's GenAI platform and the lessons now being applied to tailored AI solutions across the organization, this talk will address:

- The human gap in AI governance: where accountability breaks down and how leaders can close it
- How to build organizational readiness for AI adoption across regulated, risk-sensitive environments
- The intersection of policy, procurement, and people - navigating competing constr

🕑: 10:55 AM - 11:25 AM
dLLMs: Rethinking Generation Beyond Autoregressive Models
Host: Suhas Pai, CTO, Hudson Labs

Info: In this talk, we will rethink language generation beyond the autoregressive paradigm. We will introduce diffusion LLMs as a new family of models that generate through iterative denoising instead of strict left-to-right decoding. We will explain how masked diffusion corrupts text, learns to recover it, and enables parallel token generation. We will compare diffusion decoding with autoregressive decoding across speed, infilling, controllability, and planning. We will examine the practical challenges, including length control, denoising schedules, and blockwise generation. We will end by asking whether the future of language models is autoregressive, diffusion-based, or a hybrid of both.

🕑: 11:30 AM - 12:00 PM
Don't Fine-Tune Yet: When Prompt Optimization Wins (and When It Doesn't)
Host: Travis DePuy, AI Solution Engineer, Weights & Biases

Info: Fine-tuning feels like the natural next step when your model isn't performing — but it's often the wrong one. Before committing to a training run, it's worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we'll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We'll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There's no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.

🕑: 10:30 AM - 12:00 PM
Deploying with Purpose: Embedding Economic Evaluation Across the AI Lifecycle
Host: Dhari Gandhi, AI Project Manager, Vector Institute AI

Info: As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.

This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.

Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts

🕑: 12:05 PM - 12:35 PM
The Orchestration Stack for Observable, Debuggable, and Durable Agents
Host: Ketan Umare, Co-Founder and CEO, Union.ai

Info: Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.

🕑: 12:05 PM - 12:35 PM
The Meaning Gap: Your Agent Is Correct. Your Deployment Is Not.
Host: Mario Lazo, Principal Solution Architect for Data and AI

Info: Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren't technical, they were organizational.

This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.

LLMs optimize for statistical similarity; humans require meaning. "Top 10," "Best 10," and "Highest 10" can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expe

🕑: 12:05 PM - 12:35 PM
Fireside Chat: Quantum as part of the AI Stack.
Host: Dr. Christian Weedbrook, CEO, Xanadu

Info: Fireside Chat: Quantum as part of the AI Stack. A conversation with the CEO of Xanadu.
We will be discussing:

Quantum as part of the AI stack
Hardware-software co-design and what PennyLane looks like as a developer platform.
Real applications: chemistry, materials, drug discovery, optimization, and honesty about where quantum doesn’t help yet.
The Canadian deep-tech story: built in Toronto, dual-listed on Nasdaq and TSX.
Founder journey: from evictions and a failed first chip to a $16B valuation.
Regulated industries - finance, pharma, defence, where Xanadu has partnerships

🕑: 01:35 PM - 02:05 PM
Encrypted AI for Optimized Security and Performance
Host: Tyson Macaulay, COO, 01 Quantum

Info: This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.

🕑: 01:35 PM - 02:05 PM
What it takes to build production-grade foundation models in Finance
Host: Freddy Lecue, Managing Director, Head of Frontier AI Mod

🕑: 02:10 PM - 02:40 PM
Event Sequence Classification and Generation
Host: Karthik Guruswamy, Financial AI Strategy Lead, Teradata

Info: Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the "customer task" or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren't traditional NLP, they carry their own vocabulary, context, and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core

🕑: 02:10 PM - 02:40 PM
Expanding the Capabilities of Tabular Foundation Models
Host: Anthony Caterini, Senior Research Machine Learning Scie

Info: Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity has hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, pow

🕑: 02:10 PM - 02:40 PM
Leading Trustworthy AI Engineering in Legal: Alignment, Trade-offs, and the Gl
Host: Zahra Shekarchi, Lead Research Engineer, Thomson Reuters

Info: In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.

This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.

We will show how establishing clear metrics and progressive target values defines what 'Good' means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress bey

🕑: 02:45 PM - 03:15 PM
Reasoning Robots: Open World Navigation and Memory for Agentic Robots
Host: Steven Waslander, Professor, University of Toronto

Info: Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory - spatial, descriptive and visual - to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.

🕑: 02:45 PM - 03:15 PM
PANEL
Host: Ozge Yeloglu, VP, Advanced Analytics & AI, CIBC

🕑: 02:45 PM - 03:15 PM
Lean by Design: How a Three-Person Nonprofit AI Team Shipped a Production-Gra
Host: Luis Ticas, Head of AI, Climate Resilient Communities

Info: Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don't speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure.

🕑: 03:45 PM - 04:15 PM
The Agentic Flow I Designed Versus the Actual Flow: And How I Discovered It Us
Host: Michael Havey, Principal Data Architect, OpsGuru

Info: An agent has a flow, and getting the flow right is critical. We can trust the agent's result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.

Most agent tools provide observability traces of the agent's execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?

I present results from an agent I built on AWS's AgentCore service.
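
As a rough illustration of the idea described above, the sketch below builds a directly-follows graph, one of the basic building blocks of process mining, from a trace log and compares it with an intended flow. It is a minimal sketch under stated assumptions, not the speaker's implementation: the step names, the trace format (one ordered list of step names per agent run), and the helper names `directly_follows` and `compare_flows` are all hypothetical.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often step a is immediately followed by step b across all traces."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def compare_flows(traces, intended_edges):
    """Return observed edge counts, edges not in the design, and designed edges never seen."""
    observed = directly_follows(traces)
    unexpected = {edge: n for edge, n in observed.items() if edge not in intended_edges}
    unused = set(intended_edges) - set(observed)
    return observed, unexpected, unused


# Hypothetical trace log: each inner list is one agent run, in execution order.
traces = [
    ["plan", "search", "answer"],
    ["plan", "search", "search", "answer"],  # repeated tool call
    ["plan", "answer"],                      # skipped the search step
]

# Hypothetical intended flow: plan -> search -> answer.
intended = {("plan", "search"), ("search", "answer")}

observed, unexpected, unused = compare_flows(traces, intended)
for edge, n in sorted(unexpected.items()):
    print(f"unexpected transition {edge[0]} -> {edge[1]} seen {n} time(s)")
```

Unexpected edges such as a repeated tool call are candidates for the superfluous executions the abstract mentions; edge counts give a first view of the path the agent usually takes.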

🕑: 03:45 PM - 04:15 PM
Long Context Training and Inference on AMD GPUs
Host: Mehdi Rezagholizadeh, Principal Research Scientist, AMD

Info: Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving tra

🕑: 03:45 PM - 04:15 PM
A Multi-Stage Framework for Instruction-Based Evaluation of LLM Outputs
Host: Hannah Arjmand, Lead AI Engineer, Chubb

Info: Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.

We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.

The framework addresses three core problems. First, instruction-aware label

🕑: 04:20 PM - 04:50 PM
From SaaS to Agentic Platforms: Where the Next Software Advantage Lies
Host: Afsaneh Fazly, Founder and Principal, Astaria AI

Info: For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects

🕑: 04:20 PM - 04:50 PM
From Day 2 to Day 10: Operationalizing Evals for Real-World LLM Systems
Host: Korede Adegboye, Machine Learning Engineer, Priceline

Info: Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.

This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.

The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLM

🕑: 04:20 PM - 04:50 PM
Is Your Eval Lying to You? Catching Hidden Failures in Agent Evaluation
Host: Abhimanyu Anand, Sr. Data Scientist, Elastic

Info: Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you're evaluating AI agents, you've likely encountered hidden failures such as:

1. Improving accuracy with a tool change also increases tool calls and latency, and a single positive metric masks overall degradation.
2. LLM-based evaluators are nondeterministic, so a score increase may only reflect sensitivity to a prompt change, not an improved user experience.
3. Without robust testing, you might be shipping coin-flip gains that will disappear on the next run.
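
One common way to check whether a gain like the one in item 3 is real or noise is a paired bootstrap over per-example scores. The sketch below is an assumption about how such a check could look, not the evaluation setup used at Elastic; the function name `paired_bootstrap_ci` and the example scores are hypothetical.

```python
import random


def paired_bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(candidate - baseline) on paired examples."""
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must score the same examples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-example correctness (1 = correct) for two agent versions.
baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
candidate = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

gain, low, high = paired_bootstrap_ci(baseline, candidate)
print(f"mean gain {gain:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the interval includes zero, the observed gain is compatible with run-to-run noise on this eval set, which is the coin-flip case described above.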

In this session, I'll walk through how we addressed these at Elastic. Using a real experiment as an example, I'll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (developed in-house) validated aga

June 18

🕑: 10:55 AM - 11:25 AM
How Software Companies Become AI Companies
Host: Alet Blanken, Vice President, AI Engineering, Workday

Info: Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.

🕑: 10:55 AM - 11:25 AM
SONIC-O1: A Real-World Benchmark for Evaluating Multimodal LLMs on Audio-Video
Host: Ahmed Radwan, Machine Learning Specialist, Vector Instit

Info: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source mod

🕑: 10:55 AM - 11:25 AM
Artificial Intelligence in Healthcare: From Promise to Practice
Host: Muhammad Mamdani, Professor and Director

Info: Artificial intelligence has the potential to transform healthcare, yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real-world examples and discuss the challenges in its adoption.

🕑: 11:30 AM - 12:00 PM
Scaling AI Impact: A Two-Pronged Operating Model for Enterprise Transformation
Host: Swanand Gupte, Director, Artificial Intelligence, TELUS

Info: As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the "Bottom-Up" approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the "Top-Down" approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.

Attendees will learn how TELUS integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the "How" (technical build) to the "What" (problem selection and change management) to bridge the "value gap" in enterprise AI.

🕑: 11:30 AM - 12:30 PM
Pre-RFP Pension Fund Prospect Ranking: Proxy Targets on Noisy Mandate Data,
Host: Anshuman Panwar, TD Asset Management, Director of AI

Info: Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
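
For readers unfamiliar with the top-K metric mentioned above, precision@K is the fraction of the K highest-ranked items that turn out to be relevant. The sketch below uses the standard definition; the prospect identifiers and the notion of "relevant" (here, allocators that later issued a mandate) are hypothetical and not taken from the talk.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k


# Hypothetical model ranking of prospects, best first.
ranked = ["fund_A", "fund_B", "fund_C", "fund_D", "fund_E"]
# Hypothetical ground truth, known only after a delay.
relevant = {"fund_B", "fund_E", "fund_Z"}

p_at_3 = precision_at_k(ranked, relevant, 3)
p_at_5 = precision_at_k(ranked, relevant, 5)
print(f"precision@3 = {p_at_3:.2f}, precision@5 = {p_at_5:.2f}")
```

With a small coverage team, K is usually set to the number of prospects the team can realistically contact, so the metric reflects the part of the ranking that is actually acted on.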

🕑: 11:30 AM - 12:00 PM
LLM-Guided Calibration of Causal Discovery Models for Macroeconomic Analysis
Host: Nima Safaei, Sr. Data Scientist, Scotiabank

Info: Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.

We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LL

🕑: 12:05 PM - 12:35 PM
Squeezing More Juice Out of Your LLM API: Performance Optimizations and How to
Host: Hagay Lupesko, Senior Vice President of Engineering, Cer

Info: Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.

🕑: 12:05 PM - 12:35 PM
Scaling Production-Grade LLMs: Diagnosing Hidden Bottlenecks in Training and I
Host: Deepkamal Gill, Senior AI/ML Scientist, The Vanguard Gro

Info: While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.

In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, and numerical instability during training, as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.

Rather than introducing new modeling techniques, this session presents a prac

🕑: 01:35 PM - 02:05 PM
Reasoning Over Complex Documents
Host: Denys Linkov, Head of ML, Wisedocs

Info: The world runs on the data and processes stored in unstructured documents. Models are becoming incredibly capable of summarizing and extracting contents, but how well do they perform fuzzy, human-style synthesis tasks? This talk will cover our experience building complex internal document processing evals

🕑: 01:35 PM - 02:05 PM
One Does Not Simply Build a Quantum Computer Without AI
Host: Zy Niu, Head of AI, Xanadu Quantum Technologies

Info: Building a fault-tolerant quantum computer is one of the hardest engineering challenges of our time — spanning chip design and fabrication, system engineering and control, error correction, quantum architecture, quantum software, and quantum algorithms. At Xanadu, we found that the bottleneck isn't just physics; it's the sheer cognitive overhead of orchestrating research across dozens of interdependent teams, simulators, codebases, and datasets.

AI, with its recent advancements, has become a natural part of how we tackle this complexity -- not as a side project, but as core infrastructure. Agentic systems now operate in the background across our R&D pipeline, autonomously navigating codebases, running analyses, and connecting insights across teams. Combined with ML-driven tools built in-house for tasks like photonic chip characterization and design optimization, these systems are compressing feedback loops that once took days into minutes.

This talk explores how the convergence o

🕑: 01:35 PM - 02:05 PM
Human AI Coordination - A Progressive Trust Framework
Host: Mahsa Rouzbahman, Director, Advanced Analytics & AI

Info: The session explores the evolving landscape of human AI coordination, focusing on the Progressive Trust Framework, a structured approach to building, calibrating, and scaling trust between humans and AI systems. As AI transforms business operations, organizations must navigate the inherent uncertainty in AI outcomes, characterized by aleatoric and epistemic factors, and ensure confidence scores are properly calibrated.

Three stages of human-AI interaction are detailed: Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Agentic AI, each with distinct roles, controls, and risk profiles. Technical guardrails, performance monitoring, and feedback loops enable safe progression toward greater AI autonomy.

The session bridges classic and modern AI paradigms, offering actionable strategies for technical leaders to advance trustworthy automation in complex enterprise environments.

🕑: 02:10 PM - 02:40 PM
Beyond NLP: Technical Challenges in Building a Foundation Model for Sequential
Host: Lin Liu, Director, Data Science, Wealthsimple

Info: Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.

In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:

- Tokenization for non-language sequences: Event data mixes categorical fields, continuous values, and irregular timestamps. We explore representations from naive text serialization to structured entity encodings, and the surprising impact tokenization strategy has on downstream performance
- Architecture trade-offs: Head-to-head comparisons across three approaches — off-the-shelf LLM embeddings, a custom set-aware transformer, and a hybrid fine-tuned LLM — and when each breaks down
- Multi-objective training: Combining ne

🕑: 02:10 PM - 02:40 PM
Model-Agnostic Feature Importance with Dependent Features: A Conditional Subgr
Host: Javeria Ahmed, Senior Manager, Retail Risk Modelling, Ro

Info: Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don't reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), "unrestricted permutation forces extrapolation."
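
The extrapolation problem can be seen on toy data. The sketch below is an illustration only and is not the conditional subgroup method presented in the talk: it compares an unrestricted shuffle of one feature with a shuffle restricted to subgroups, where the subgroups are assumed (hypothetically) to be deciles of a second, fully dependent feature.

```python
import random


def permute_unrestricted(values, rng):
    """Shuffle a feature column across all rows, as standard permutation importance does."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def permute_within_groups(values, groups, rng):
    """Shuffle a feature column only among rows that share the same group label."""
    rows_by_group = {}
    for idx, label in enumerate(groups):
        rows_by_group.setdefault(label, []).append(idx)
    result = list(values)
    for idxs in rows_by_group.values():
        pool = [values[i] for i in idxs]
        rng.shuffle(pool)
        for i, v in zip(idxs, pool):
            result[i] = v
    return result


def max_gap(a, b):
    """Largest row-wise distance between two columns."""
    return max(abs(x - y) for x, y in zip(a, b))


# Toy data (hypothetical): x2 is an exact copy of x1, so the features are fully dependent.
x1 = list(range(100))
x2 = list(x1)
groups = [v // 10 for v in x2]  # assumption: subgroups are deciles of x2

rng = random.Random(0)
gap_unrestricted = max_gap(permute_unrestricted(x1, rng), x2)
gap_restricted = max_gap(permute_within_groups(x1, groups, rng), x2)
print(f"max |x1 - x2| after unrestricted shuffle: {gap_unrestricted}")
print(f"max |x1 - x2| after within-group shuffle: {gap_restricted}")
```

In the original data x1 and x2 never differ, so large gaps after the unrestricted shuffle are rows the model never saw during training. The within-group shuffle keeps every permuted value in the same decade as x2, so the gap cannot exceed 9 by construction.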

This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:

1. **Conditional Imputation**: Using Gaussian Copula and other statistical models to replace masked features while preserving the joint feature distribution, avoiding impossible feature combinations.

2. **Restricted Permutations**: Partitioning samples

🕑: 02:45 PM - 03:15 PM
High-precision targeted audience discovery with GenAI
Host: Nataliya Portman, Senior Data Scientist, CBC/Radio-Canad

Info: In this talk, I will showcase an AI system that integrates seamlessly into our CDP platform and classifies users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine, uncovering consistent interaction patterns even when a user's true interest is buried under other content. The prompt's true value lies in its ability to find genuine enthusiasts who don't engage through traditional channels like email, allowing us to reach out through a different channel.

🕑: 02:45 PM - 03:15 PM
Why Agentic AI Evaluations Break in Production
Host: Olivier Blais, VP of AI, Moov AI

Info: Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the "Agentic Shift" is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of "good," processes fail, user trust evaporates, and compliance teams hit the brakes.

Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare), this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn't enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.

🕑: 02:45 PM - 03:15 PM
Implementing Retrieval Augmented Generation Technique on Unstructured and Stru
Host: Shariyar Murtaza, AVP AI and Applied Research, Manulife

Info: Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2

June 19

🕑: 09:30 AM - 11:00 AM
Accelerate Software Delivery with IBM Bob: Hands‑On Agentic AI for Real Engine
Host: Sofia Jia

Info: Modern engineering teams are slowed by legacy code, manual workflows, and growing operational complexity. IBM Bob is built to change that. As an enterprise‑grade, agentic AI engineering partner, Bob understands your existing systems, plans work, writes production‑quality code, and automates the tedious tasks that drain developer time.

In this 90‑minute hands‑on workshop, you’ll experience how Bob moves beyond code snippets to deliver true end‑to‑end assistance—modernizing outdated applications, building AI‑powered workflows, and orchestrating multi‑agent systems. Whether you're migrating Java 8 code, automating business processes, or assembling domain‑specific AI agents, Bob helps you deliver working solutions faster, smarter, and with enterprise‑level governance.

🕑: 09:10 AM - 11:00 AM
Enhancing Training Data Pipelines with Lance and the Multimodal Lakehouse
Host: Prashanth Rao, AI Engineer, LanceDB

Info: Modern machine learning and model training pipelines depend on petabytes of multimodal data — images, videos, point clouds, text and more — yet data I/O and storage remain critical bottlenecks when experimenting and doing research. This workshop session addresses that gap by introducing Lance, an open-source columnar format designed for ML workloads, and LanceDB, the multimodal retrieval library built on top of Lance. We begin with Lance's architecture and what makes it uniquely suited to multimodal training (fast random access, native blob storage, and built-in versioning), and then move into live integration examples with PyTorch and Hugging Face Datasets. From there, we work through a 3D world-model dataset case study, discuss benchmarks on I/O performance during data loading, and show how to add derived features like embeddings and annotations without rewriting existing data and how to scale data loading for distributed training. Attendees will leave with working-level knowledge of how

🕑: 09:30 AM - 12:45 PM
Reinforcement Learning for Large Language Models: A Modern View
Host: David Rosenberg, Head of Machine Learning Strategy, CTO

Info: Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.

The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emph

🕑: 11:15 AM - 12:45 PM
Meta-Governance architectures for multi-agent system safety, alignment, govern
Host: Himanshu Joshi, AI Safety and Alignment Researcher

Info: Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce meta-governance, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managi

🕑: 01:00 PM - 02:30 PM
Building a Trusted API Marketplace for the AI Economy: Lessons from TELUS on P
Host: Areeb Khawaja, Technical Product Manager, TELUS

Info: As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it a

🕑: 01:00 PM - 02:30 PM
NemoClaw: Building a More Secure Runtime for Long-Running Autonomous Agents
Host: Chris Alexiuk, Sr. Product Research Engineer, NVIDIA

Info: In this workshop we'll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We'll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.

🕑: 02:45 PM - 04:15 PM
Humans + AI: Collaborative Intelligence for Complex Decision-Making
Host: Naman Goyal, Senior Machine Learning Engineer, Google De

Info: As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.