Model Integrity

Executive Summary

We propose "model integrity" as an overlooked challenge in aligning LLM agents. Model integrity refers to an AI system acting consistently on coherent, well-structured, and inspectable values, which makes it predictable even in unforeseen circumstances where rules cannot be specified beforehand. The distinction between mere compliance and integrity becomes crucial as AI systems grow more powerful: many would prefer a compliant assistant, but a co-founder with integrity. Current trends follow two troubling paths: accumulating complex rulebooks to handle every edge case (as seen with ChatGPT), or relying on vague values like "curiosity" that can be interpreted in problematic ways (as seen with Claude). Both approaches have concerning failure modes as market and legal pressures push toward either rigid compliance or engagement-maximizing behavior. We demonstrate a prototype, WiseLLaMa-8B, fine-tuned on values-laden conversations, which generates responses guided by explicit value considerations. Initial user studies suggest the model's values are legible and likable, and that they provide the kind of predictability described above. How to train models with integrity at scale, and how to reliably measure and evaluate model integrity, remain open research questions with low-hanging fruit.
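For concreteness, below is a minimal sketch of what fine-tuning on values-laden conversations could look like, assuming a standard causal-LM setup in PyTorch and Hugging Face Transformers. The base model name, the bracketed value annotation, and the example conversation are all illustrative assumptions, not details of WiseLLaMa-8B's actual training recipe.

```python
# Hypothetical sketch: supervised fine-tuning on a values-laden conversation.
# Assumption: each training example pairs a user prompt with a response that
# surfaces the value the model acts from before answering.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base model; the actual base of WiseLLaMa-8B is not stated here.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# One invented training example: the assistant makes its value consideration
# explicit (bracketed prefix), then responds in line with it.
conversation = [
    {"role": "user", "content": "My competitor is struggling. Should I poach their staff?"},
    {"role": "assistant", "content": (
        "[Value: fair dealing - I weigh honesty toward all parties over short-term advantage.]\n"
        "Recruiting openly is fine; targeting them specifically to harm their business is not."
    )},
]

# Standard causal-LM fine-tuning step: tokenize the chat and minimize
# next-token cross-entropy on it (labels mirror the input ids).
inputs = tokenizer.apply_chat_template(conversation, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

optimizer.zero_grad()
loss = model(input_ids=inputs, labels=inputs).loss
loss.backward()
optimizer.step()
```

In practice this single step would run over a full dataset of such conversations; the point of the sketch is only that the value considerations are part of the supervised targets, so the model learns to state and act on them rather than follow an external rulebook.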
