Module 6: Governing AI Development · BoK III.B

Wrangling the Data

Five considerations turn raw data into model-ready data without trampling privacy: cleansing, labelling, anonymisation, purpose specification and minimisation. Master the two privacy-enhancing technologies (PETs): differential privacy blurs the data; federated learning moves the training, not the data.

Data cleansing → removing erroneous and irrelevant data → protects performance and reliability, and cuts privacy risk by stripping unnecessary personal data
Data labelling → tagging or annotating data → categorising images, transcribing audio, tagging text → accuracy directly affects learning → keep labelling consistent, done by trained personnel, with collaborative tools for quality control
Anonymisation → removing items that could identify individuals → names, addresses → note that complete anonymisation is difficult because combining datasets can reidentify individuals
Purpose specification & minimisation → data the application does not need should not be used to train the model → minimising personal data protects privacy
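
The cleansing, anonymisation and minimisation steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the records, the field names, the validity rule and the `DIRECT_IDENTIFIERS` / `NEEDED_FIELDS` sets are all hypothetical and are not taken from the module.

```python
# Hypothetical raw records; every field name and value is an illustrative assumption.
RAW_RECORDS = [
    {"name": "Ana Ruiz", "address": "1 Elm St", "age": 34, "blood_pressure": 120, "favourite_colour": "red"},
    {"name": "Ben Okoye", "address": "2 Oak Ave", "age": -5, "blood_pressure": 135, "favourite_colour": "blue"},
    {"name": "Cy Lee", "address": "3 Pine Rd", "age": 51, "blood_pressure": None, "favourite_colour": "green"},
    {"name": "Dee Park", "address": "4 Ash Ln", "age": 47, "blood_pressure": 142, "favourite_colour": "red"},
]

DIRECT_IDENTIFIERS = {"name", "address"}   # assumed identifying fields
NEEDED_FIELDS = {"age", "blood_pressure"}  # assumed purpose: a blood-pressure model


def is_valid(record):
    """Assumed validity rule: plausible integer age and a recorded blood pressure."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120 and record.get("blood_pressure") is not None


def cleanse(records):
    """Data cleansing: drop erroneous or incomplete records."""
    return [r for r in records if is_valid(r)]


def anonymise(records):
    """Anonymisation: remove fields that directly identify an individual."""
    return [{k: v for k, v in r.items() if k not in DIRECT_IDENTIFIERS} for r in records]


def minimise(records):
    """Minimisation: keep only the fields the stated purpose needs."""
    return [{k: v for k, v in r.items() if k in NEEDED_FIELDS} for r in records]


def prepare(records):
    return minimise(anonymise(cleanse(records)))


if __name__ == "__main__":
    for row in prepare(RAW_RECORDS):
        print(row)
```

Dropping names and addresses does not guarantee anonymity: the remaining fields may still reidentify someone once combined with another dataset, which is why minimisation is applied as well.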

The two PETs to know in depth:

PET 1 · Differential privacy

Blurs information within datasets using algorithms that add calibrated random noise → data stays meaningful for analysis but becomes nonspecific enough to prevent identifying individuals → valuable in healthcare and finance.
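
One common way to do this blurring is the Laplace mechanism, sketched below for a simple counting query. The module does not name a specific algorithm, so treat the mechanism, the `epsilon` privacy parameter (smaller means more noise and stronger privacy) and the sample ages as assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws, each with mean
    `scale`, follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [34, 47, 51, 29, 62, 45, 38]  # hypothetical patient ages
    # The exact count of people aged 40+ is 4; the released value is blurred.
    print(private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng))
```

Any single answer is off by a random amount, so it reveals little about one individual, yet the noise averages out to zero, which is what keeps the data meaningful for analysis.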

PET 2 · Federated learning

Local systems each train a copy of a shared model on their own datasets → only the results (model updates, not raw data) are aggregated centrally (e.g., in the cloud), so individual data is never exposed → the process iterates until the global model is trained → suits diagnosing illnesses across multiple locations.
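
The train-locally, aggregate-centrally loop can be sketched as follows. This is a minimal sketch that assumes a one-parameter linear model and sample-weighted averaging of the local weights (in the style of federated averaging); the module does not specify an aggregation rule, and the three sites and their data are hypothetical.

```python
def local_train(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for y = w * x on one site's own data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, sites):
    """One round: every site trains locally, then the server averages the results."""
    # Only the trained weight and the sample count leave each site, never the records.
    updates = [(local_train(global_w, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(sites, rounds=50):
    """Iterate rounds until the global model is (approximately) trained."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, sites)
    return w


# Hypothetical (x, y) measurements held at three separate locations.
SITES = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2), (5.0, 9.9)],
    [(1.5, 3.0), (2.5, 5.1)],
]

if __name__ == "__main__":
    print(train(SITES))
```

The global weight settles close to the slope shared by all three sites even though the central server never sees a single data point.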

Exam flash

Definition pair to keep straight → Differential privacy blurs the data · Federated learning moves the training, not the data. If the scenario says "data never leaves the local site" → federated learning.

Key terms - quick answers

What is “Data cleansing”?

Removing erroneous and irrelevant data to protect performance and reduce privacy risk.

What is “Data labelling”?

Tagging or annotating data (images, audio, text); accuracy directly affects learning.

What is “Anonymisation”?

Removing items that could identify individuals, such as names and addresses; complete anonymisation is difficult.

What is “Purpose specification & minimisation”?

Keeping data that the application does not need out of model training, which protects privacy.
