AIGP Study Guide
Module 6: Governing AI Development · BoK III.B

Wrangling the Data

Five considerations turn raw data into model-ready data without trampling privacy: cleansing, labelling, anonymisation, purpose specification and minimisation. Master the two PETs (privacy-enhancing technologies): differential privacy blurs the data; federated learning moves the training, not the data.

The considerations in detail:

  • Data cleansing → removing erroneous and irrelevant data → protects performance and reliability, and cuts privacy risk by stripping unnecessary personal data → note that complete anonymisation is difficult because combining datasets can re-identify individuals
  • Data labelling → tagging or annotating data → e.g., categorising images, transcribing audio, tagging text → labelling accuracy directly affects what the model learns → keep labelling consistent: use trained personnel and collaborative tools for quality control
  • Anonymisation → removing items that could identify individuals → e.g., names, addresses
  • Purpose specification & minimisation → data that the application does not need should not be used to train the model → minimising personal data protects privacy
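To make the list above concrete, here is a minimal sketch of cleansing, identifier removal and minimisation applied to a few records. Everything in it is an assumption made for illustration: the field names, the validity rule and the choice of "needed" fields are hypothetical, and dropping direct identifiers alone does not guarantee anonymity, since combining datasets can still re-identify individuals.

```python
# Hypothetical raw records; every field name here is an assumption for illustration.
RAW_RECORDS = [
    {"name": "Ana Diaz", "address": "1 Main St", "age": 34, "symptom": "cough", "shoe_size": 38},
    {"name": "Bo Chen", "address": "2 Oak Ave", "age": -5, "symptom": "fever", "shoe_size": 43},
    {"name": "Cy Patel", "address": "3 Elm Rd", "age": 51, "symptom": None, "shoe_size": 41},
    {"name": "Di Okoye", "address": "4 Pine Ln", "age": 47, "symptom": "fever", "shoe_size": 40},
]

DIRECT_IDENTIFIERS = {"name", "address"}  # anonymisation: items that identify individuals
NEEDED_FIELDS = {"age", "symptom"}        # minimisation: only what the stated purpose needs


def is_valid(record):
    """Cleansing rule (assumed): drop rows with a missing symptom or an impossible age."""
    return record["symptom"] is not None and 0 <= record["age"] <= 120


def prepare(records):
    """Cleanse first, then keep only the needed, non-identifying fields."""
    keep = NEEDED_FIELDS - DIRECT_IDENTIFIERS
    return [{k: r[k] for k in sorted(keep)} for r in records if is_valid(r)]


PREPARED = prepare(RAW_RECORDS)
print(PREPARED)
```

The unneeded `shoe_size` field never reaches training, which is the minimisation point: data the application does not need stays out of the model.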

The two PETs to know in depth:

PET 1 · Differential privacy

Blurs information within datasets by adding calibrated statistical noise to the data or to query results → data stays meaningful for aggregate analysis but becomes nonspecific enough to prevent identifying individuals → valuable in healthcare and finance.
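One common way to do this "blurring" is the Laplace mechanism, sketched below for a simple counting query. This is an illustrative sketch, not a production implementation: the function names, the sample ages and the choice of epsilon are all assumptions. A smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count. A counting query has sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
ages = [34, 47, 51, 29, 62, 45, 38, 70]  # hypothetical data; 5 people are aged 45 or over
noisy = private_count(ages, lambda a: a >= 45, epsilon=1.0, rng=rng)
print(round(noisy, 2))  # close to 5, but not exactly 5
```

The analyst still learns roughly how many people are 45 or over, yet adding or removing any one person changes the answer by less than the noise typically does.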

PET 2 · Federated learning

Local systems each train a copy of a shared model on their own datasets → only the resulting model updates, not the raw data, are aggregated centrally (e.g., in the cloud) → the cycle repeats until the global model is trained → suits diagnosing illnesses across multiple locations.
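The train-locally, aggregate-centrally loop can be sketched with weighted averaging of model weights (often called federated averaging). This is a toy sketch under stated assumptions: the one-parameter model, the three sites and their data, the learning rate and the round count are all hypothetical.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """One site's training: gradient descent on its own (x, y) pairs for y ≈ w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, sites):
    """Each site trains locally; the server averages the weights, weighted by site size."""
    total = sum(len(d) for d in sites)
    return sum(local_update(global_w, d) * len(d) for d in sites) / total


# Hypothetical sites whose private data roughly follow y = 3x.
SITES = [
    [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)],
    [(1.5, 4.4), (2.5, 7.6)],
    [(0.5, 1.4), (4.0, 12.1), (2.0, 6.0), (3.5, 10.4)],
]

w = 0.0
for _ in range(50):  # iterate until the global model has settled
    w = federated_round(w, SITES)
print(round(w, 2))  # ends up near 3
```

Note what crosses the boundary in `federated_round`: only each site's trained weight, never its `(x, y)` records. That is the "moves the training, not the data" idea.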

Exam flash

Definition pair to keep straight → Differential privacy blurs the data · Federated learning moves the training, not the data. If the scenario says "data never leaves the local site" → federated learning.

Key terms - quick answers

What is “Data cleansing”?
Removing erroneous and irrelevant data to protect performance and reduce privacy risk.
What is “Data labelling”?
Tagging or annotating data (images, audio, text); accuracy directly affects learning.
What is “Anonymisation”?
Removing items that could identify individuals, such as names and addresses; complete anonymisation is difficult.
What is “Purpose specification & minimisation”?
Keeping data unnecessary for the application out of model training to protect privacy.