AIGP Study Guide
Module 8: AI Governance Vocabulary

Data terms

Everything the model eats, before and after cooking. Know which dataset does which job: training data teaches, validation data tunes and checks generalisation, and input data is what the live system receives.

The exam's favourite trap is confusing the three datasets, so fix their roles first.

The three datasets - which does which job
Dataset | Role
Training data | Teaches the model; must be representative, fair and compliant
Validation data | Held out to tune the model and check it generalises before final testing
Input data | What the live system receives at use time and bases its output on
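To make the three roles concrete, here is a minimal sketch using only the Python standard library. The function name split_dataset, the 20% validation fraction and the toy records are assumptions chosen for illustration; they do not come from the syllabus.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction as validation data."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]  # held out: used to tune and check
    training = shuffled[n_validation:]    # used to teach the model
    return training, validation


# Toy labelled records: (feature, label). Purely illustrative.
records = [(x, x % 2) for x in range(10)]
training_data, validation_data = split_dataset(records)

# Input data arrives later, at use time, and has no label attached.
input_data = 11
```

The point of the sketch is that training data and validation data come from the same labelled pool but never overlap, while input data only appears once the system is live.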
  • Ground truth → the verified, real-world correct answer that labels are checked against and accuracy is scored against.
  • Corpus → a large structured collection of text or speech for training language models.
  • Data quality → accurate, complete, relevant, representative and fit-for-purpose; garbage in, garbage out.
  • Data provenance → the documented history and origin of data; supports integrity and identifies applicable laws.
  • Data drift → input data's statistical properties change over time vs the training data, quietly degrading performance.
  • Synthetic data → artificially generated data mimicking real data's statistical properties; helps with privacy and scarcity but can inherit the source's bias.
  • Variables → the measurable attributes in the data used as inputs or predicted outputs.
  • Preprocessing (before training) vs Post-processing (after inference), with Exploratory data analysis (EDA) as the upfront investigation.
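The ground truth entry in the list above can be illustrated with a small accuracy calculation. This is a sketch under assumed names and toy labels (accuracy, "spam", "ham"), not a prescribed scoring method.

```python
def accuracy(predictions, ground_truth):
    """Return the fraction of predictions that match the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Hypothetical labels: ground truth is the verified answer,
# predictions are what the model produced.
ground_truth = ["spam", "ham", "spam", "ham"]
predictions = ["spam", "ham", "ham", "ham"]

score = accuracy(predictions, ground_truth)  # 3 of 4 match, so 0.75
```

If the ground truth itself is wrong, the score is wrong too, which is why the definition stresses that it must be verified.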
Drift has no attacker

Data drift is natural statistical change over time - no attacker. That is exactly what separates it from data poisoning (covered under risks), where an attacker deliberately corrupts the training data.
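As a rough illustration of what "statistical properties change" can mean, the sketch below compares the mean of recent input data with the mean of the training data. The function name mean_shift, the age values and the threshold of 2.0 are all assumptions made for this example; real monitoring would normally use a proper statistical test and look at more than the mean.

```python
from statistics import mean, stdev


def mean_shift(training_values, input_values):
    """Distance between the two means, in training standard deviations."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training values have no spread")
    return abs(mean(input_values) - mean(training_values)) / spread


# Hypothetical feature: customer age at training time vs in recent inputs.
training_ages = [30, 32, 34, 36, 38]
recent_ages = [40, 42, 44, 46, 48]

shift = mean_shift(training_ages, recent_ages)
drift_suspected = shift > 2.0  # assumed threshold, not a standard value
```

Nothing in this check looks for an attacker. It only measures how far live inputs have moved from what the model was trained on, which is the distinction between drift and poisoning.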

Key terms - quick answers

What is “Training data”?
The dataset the model learns from; must be representative, fair and compliant.
What is “Validation data”?
Held-out data to tune the model and check generalisation before final testing.
What is “Input data”?
Data fed into the system at use time, on which it bases output.
What is “Ground truth”?
Verified real-world correct answer used to check labels and score accuracy.