Testing and Validation
Testing is continuous, risk-tailored and documented. Test for accuracy, robustness, reliability, privacy, interpretability, safety, security and bias. Watch for the AI-specific failure modes, brittleness and hallucinations, and match testing resources to risk.
Continuous, risk-tailored, documented. Test during development AND after deployment, with the depth determined by the system's purpose, algorithm type, third-party integrations and sector regulation.
What to test for. Accuracy, robustness, reliability, privacy, interpretability, safety, security and bias → bias comes in three kinds: computational, cognitive and societal. In practice:
- Include edge cases and "unseen" data not in the training set
- Include potentially malicious data
- Run repeatability assessments → does it produce consistent outcomes?
- Adversarial testing and threat modelling → how does the system behave on harmful input, and what are the threats?
- Build multiple layers of mitigation to stop failures at different modules
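A minimal sketch of two of the checks above, repeatability and behaviour on perturbed or edge-case input. Everything here is an illustrative assumption: `toy_model` is a hypothetical stand-in for a real system, and `is_repeatable` and `flip_rate` are made-up helper names, not part of any standard test suite.

```python
import random
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], int]

def toy_model(features: Sequence[float]) -> int:
    """Hypothetical model: returns 1 when a weighted score exceeds 0.5."""
    score = 0.7 * features[0] + 0.3 * features[1]
    return 1 if score > 0.5 else 0

def is_repeatable(model: Model, inputs: Sequence[Sequence[float]], runs: int = 5) -> bool:
    """Repeatability: every input must give the same output on every run."""
    for x in inputs:
        outputs = {model(x) for _ in range(runs)}
        if len(outputs) != 1:
            return False
    return True

def flip_rate(model: Model, inputs: Sequence[Sequence[float]],
              noise: float = 0.05, trials: int = 20, seed: int = 0) -> float:
    """Robustness: share of slightly perturbed inputs whose prediction
    differs from the prediction on the clean input."""
    rng = random.Random(seed)  # fixed seed so the test itself is repeatable
    flips = 0
    total = 0
    for x in inputs:
        baseline = model(x)
        for _ in range(trials):
            noisy = [v + rng.uniform(-noise, noise) for v in x]
            flips += model(noisy) != baseline
            total += 1
    return flips / total if total else 0.0

# Typical inputs sit far from the decision threshold; edge cases include
# the extremes and a point sitting right on the threshold.
typical_cases = [[0.1, 0.2], [0.9, 0.8]]
edge_cases = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

A non-zero `flip_rate` on typical inputs would be a brittleness warning sign. Near-threshold edge cases are expected to flip more often, which is why they are worth testing separately.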
Not every system gets equal scrutiny → allocate by risk → an airplane widget needs heavy testing, validation and security · an algorithm picking cat pictures for clicks needs little. Match testing to risk tolerance and acceptance.
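One way to make risk-based allocation concrete is a lookup from risk tier to required test suites. This is a sketch under stated assumptions: the three tiers, the suite names and the `required_tests` helper are hypothetical, and a real organisation would derive them from its own risk tolerance and sector rules.

```python
from enum import Enum

class RiskTier(Enum):
    LOW = 1      # e.g. picking cat pictures for clicks
    MEDIUM = 2
    HIGH = 3     # e.g. a component used in an aircraft

# Hypothetical mapping: every tier gets the baseline, higher tiers add more.
BASELINE = ["accuracy", "reliability"]
TEST_PLAN = {
    RiskTier.LOW: BASELINE,
    RiskTier.MEDIUM: BASELINE + ["robustness", "bias", "privacy"],
    RiskTier.HIGH: BASELINE + ["robustness", "bias", "privacy",
                               "interpretability", "safety", "security",
                               "adversarial"],
}

def required_tests(tier: RiskTier) -> list:
    """Return a copy of the test suites required for the given tier."""
    return list(TEST_PLAN[tier])
```

Keeping the plan in one table also helps with documentation, since the reasoning behind each system's test depth can be reviewed in a single place.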
Learn from incidents. Review databases of known AI incidents (e.g. the AI Incident Database) to grasp the breadth of potential problems → revisit the organisation's own documented analyses → tailor future testing to regulatory and industry requirements.
AI-specific failure modes. Brittleness → performing successfully in one instance yet failing in another · hallucinations → GenAI creating content that contradicts the source or is factually incorrect while presented as fact · plus embedded bias, uncertainty and false positives. Privacy-enhancing technologies (PETs) can protect training and testing data → homomorphic encryption, differential privacy, deidentification and obfuscation, federated learning.
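To illustrate one of the PETs listed above, here is a minimal sketch of differential privacy using the Laplace mechanism on a counting query. The function name `dp_count` and the sample data are assumptions for illustration only. Production work would normally rely on a vetted differential privacy library, since naive floating-point noise sampling like this has known weaknesses.

```python
import random

def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate`.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy. Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Illustrative data: how many test subjects are over 40?
ages = [34, 51, 29, 62, 45]
noisy = dp_count(ages, lambda age: age > 40, epsilon=1.0, rng=random.Random(7))
```

The released value is the true count plus random noise, so any single answer may be off by a few units. That inaccuracy is the price paid for limiting what the output reveals about any one individual in the test data.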
"You should test for accuracy, robustness, reliability, privacy, interpretability, as well as security and bias." - Jacqueline Acker, AIGP, CIPP/US, CIPM