Tutorial 14 Apr 2024

- Overview of traditional (non-LLM) trustworthy machine learning, based on the presenter's book "Trustworthy Machine Learning"
  - Definitions of trustworthiness and safety in terms of aleatoric and epistemic uncertainty
  - AI fairness
  - Human-centered explainability
  - Adversarial robustness
  - Control-theoretic view of transparency and governance
- What are the new risks
  - Information-related risks
    - Hallucination, lack of factuality, and lack of faithfulness
    - Lack of source attribution
    - Leakage of private information
    - Copyright infringement and plagiarism
  - Interaction-related risks
    - Hateful, abusive, and profane language
    - Bullying and gaslighting
    - Inciting violence
    - Prompt injection attacks
- Brief discussion of moral philosophy
- How to change the behavior of LLMs
  - Data curation and filtering
  - Supervised fine-tuning
  - Parameter-efficient fine-tuning, including low-rank adaptation (illustrated in the first sketch after this outline)
  - Reinforcement learning from human feedback (see the reward-model sketch after this outline)
  - Model reprogramming and editing
  - Prompt engineering and prompt tuning (see the soft-prompt sketch after this outline)
- How to mitigate risks in LLMs and make them safer
  - Methods for training data source attribution based on influence functions (see the influence-function sketch after this outline)
  - Methods for in-context source attribution based on post hoc explainability methods
  - Equi-tuning, the fair infinitesimal jackknife, and fairness reprogramming
  - Aligning LLMs to unique user-specified values and constraints stemming from use-case requirements, social norms, laws, industry standards, etc., via policy elicitation, parameter-efficient fine-tuning, and red-team audits
  - Orchestrating multiple, possibly conflicting, values and constraints
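
To make the parameter-efficient fine-tuning item concrete, here is a minimal sketch of low-rank adaptation (LoRA) in plain PyTorch. The `LoRALinear` wrapper, the rank, and the scaling values are illustrative assumptions, not code from the tutorial: the pretrained weight is frozen, and only the low-rank factors A and B are trained.

```python
# A minimal LoRA sketch, assuming plain PyTorch; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no update at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: only the low-rank factors receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 768 parameters instead of 768 * 768
```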
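
For reinforcement learning from human feedback, the first stage is typically training a reward model on human preference pairs. Below is a minimal sketch of that stage's pairwise (Bradley-Terry style) loss; the tiny feed-forward scorer and fixed-size feature vectors are stand-ins for a language-model backbone and are assumptions for illustration only.

```python
# A minimal sketch of the pairwise preference loss used to train an RLHF reward model.
# The scorer is a toy stand-in for an LLM backbone; the loss is -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 1))

def preference_loss(chosen_feats: torch.Tensor, rejected_feats: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen_feats)
    r_rejected = reward_model(rejected_feats)
    # Push the score of the human-preferred response above the dispreferred one.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One training step on a batch of 8 preference pairs (random stand-in features).
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
loss = preference_loss(torch.randn(8, 16), torch.randn(8, 16))
opt.zero_grad()
loss.backward()
opt.step()
```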
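
Prompt tuning differs from prompt engineering in that the prompt is a small set of trainable embedding vectors rather than text. A minimal sketch, assuming a frozen embedding table as a stand-in for a pretrained model's input embeddings (dimensions are illustrative):

```python
# A minimal prompt-tuning sketch: trainable "soft prompt" vectors are prepended
# to frozen input embeddings. The frozen table is a stand-in, not a specific LLM's API.
import torch
import torch.nn as nn

d_model, n_prompt = 64, 10
frozen_embed = nn.Embedding(1000, d_model)
frozen_embed.weight.requires_grad = False  # pretrained embeddings stay fixed
soft_prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)  # the only trainable part

def embed_with_prompt(token_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the soft prompt to each sequence in the batch."""
    tok = frozen_embed(token_ids)                                   # (batch, seq, d_model)
    prompt = soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)   # broadcast over the batch
    return torch.cat([prompt, tok], dim=1)                          # (batch, n_prompt + seq, d_model)

x = embed_with_prompt(torch.randint(0, 1000, (2, 5)))
print(x.shape)  # torch.Size([2, 15, 64])
```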
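
For training data source attribution, influence functions estimate how much each training example contributed to a given prediction via I(z_i, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_i). A minimal sketch of that formulation, assuming a logistic-regression model small enough to invert the Hessian exactly; at LLM scale this inverse must be approximated.

```python
# A minimal influence-function sketch (Koh & Liang style) for training data attribution.
# Exact Hessian inversion only works because the model here is tiny.
import torch

torch.manual_seed(0)
X = torch.randn(20, 3)
y = (X[:, 0] > 0).float()
w = torch.zeros(3, requires_grad=True)

def loss_fn(w_, X_, y_):
    return torch.nn.functional.binary_cross_entropy_with_logits(X_ @ w_, y_)

# Fit with a few gradient steps (stand-in for training).
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss_fn(w, X, y).backward()
    opt.step()

# Influence of training example i on the loss at a test point:
#   I(z_i, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z_i)
x_test, y_test = torch.randn(1, 3), torch.tensor([1.0])
g_test = torch.autograd.grad(loss_fn(w, x_test, y_test), w)[0]
H = torch.autograd.functional.hessian(lambda w_: loss_fn(w_, X, y), w.detach())
H_inv = torch.linalg.inv(H + 1e-3 * torch.eye(3))  # damping for numerical stability

influences = []
for i in range(len(X)):
    g_i = torch.autograd.grad(loss_fn(w, X[i:i+1], y[i:i+1]), w)[0]
    influences.append(-(g_test @ H_inv @ g_i).item())
print(max(range(len(X)), key=lambda i: influences[i]))  # index of the most influential example
```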