Wrangling the Data
Five considerations turn raw data into model-ready data without trampling privacy: cleansing, labelling, anonymisation, purpose specification and minimisation. Master the two PETs (privacy-enhancing technologies): differential privacy blurs the data; federated learning moves the training, not the data.
The considerations in detail:
- Data cleansing → removing erroneous and irrelevant data → protects performance and reliability, and cuts privacy risk by stripping unnecessary personal data → note that complete anonymisation is difficult because combining datasets can reidentify individuals
- Data labelling → tagging or annotating data → categorising images, transcribing audio, tagging text → accuracy directly affects learning → keep labelling consistent, by trained personnel, with collaborative tools for quality control
- Anonymisation → removing items that could identify individuals → names, addresses
- Purpose specification & minimisation → data unnecessary for the application should not train the model → minimising personal data protects privacy
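A minimal sketch of how cleansing, anonymisation and minimisation might chain together, and why the re-identification caveat matters. Everything here is an illustrative assumption, not taken from the notes: the records, the field names, the age-range rule, and the helper names `cleanse`, `anonymise`, `minimise` and `unique_combinations`.

```python
from collections import Counter

# Hypothetical raw records; all field names and values are made up for illustration.
RAW = [
    {"name": "Ana Ruiz", "address": "1 High St", "postcode": "AB1",
     "birth_year": 1980, "age": 44, "diagnosis": "flu"},
    {"name": "Ben Okoro", "address": "2 Low Rd", "postcode": "AB1",
     "birth_year": 1980, "age": 44, "diagnosis": "cold"},
    {"name": "Cy Tan", "address": "3 Mid Ln", "postcode": "ZZ9",
     "birth_year": 1955, "age": -5, "diagnosis": "flu"},  # erroneous age
    {"name": "Di Novak", "address": "4 Elm Ct", "postcode": "CD2",
     "birth_year": 1999, "age": 25, "diagnosis": "flu"},
]

DIRECT_IDENTIFIERS = {"name", "address"}
# Assumed purpose: the model only needs these fields.
NEEDED_FIELDS = {"postcode", "birth_year", "diagnosis"}


def cleanse(records):
    """Data cleansing: drop records with an obviously erroneous age."""
    return [r for r in records if 0 <= r["age"] <= 120]


def anonymise(records):
    """Anonymisation: remove direct identifiers such as names and addresses."""
    return [{k: v for k, v in r.items() if k not in DIRECT_IDENTIFIERS}
            for r in records]


def minimise(records):
    """Minimisation: keep only the fields the stated purpose needs."""
    return [{k: v for k, v in r.items() if k in NEEDED_FIELDS} for r in records]


def unique_combinations(records, quasi_identifiers):
    """Return quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


prepared = minimise(anonymise(cleanse(RAW)))
risky = unique_combinations(prepared, ("postcode", "birth_year"))
print(prepared)
print("Unique (postcode, birth_year) combinations:", risky)
```

Any combination reported as unique is a record that could be singled out by joining on postcode and birth year against another dataset, even though names and addresses are gone.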
The two PETs to know in depth:
Differential privacy → blurs information within datasets by adding carefully calibrated random noise → data stays meaningful for analysis but becomes nonspecific enough to prevent identifying individuals → valuable in healthcare and finance.
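One common way to get this blurring is the Laplace mechanism, sketched below for a simple counting query. This is a sketch under stated assumptions, not the only way to achieve differential privacy: the `ages` data, the `epsilon` value and the helper names `laplace_noise` and `private_count` are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the illustration repeatable
ages = [34, 45, 29, 61, 52, 38, 70, 41]  # hypothetical patient ages

noisy = private_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
print("Noisy count of patients aged 50+:", noisy)
```

A smaller `epsilon` means more noise and stronger privacy; a larger `epsilon` means a more accurate answer and weaker privacy.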
Federated learning → local systems each train the shared model independently on their own datasets → only the results (model updates) are aggregated centrally (e.g., in the cloud), without exposing individual data → iterates until the global model is trained → suits diagnosing illnesses across multiple locations.
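The train-locally, aggregate-centrally loop can be sketched with a toy one-parameter model. This assumes a federated-averaging style of aggregation (a weighted mean of local weights), which is one common choice rather than something the notes specify. The `sites` data, the learning rate and the function names `local_update` and `federated_round` are hypothetical.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps for y = w * x on one site's own data."""
    w = weight
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_weight, sites):
    """One round: each site trains locally, the server averages the weights.

    Only the updated weight and the sample count leave each site;
    the raw (x, y) pairs are never sent to the server.
    """
    updates = [(local_update(global_weight, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    # Weighted mean, so sites with more samples count for more.
    return sum(w * n for w, n in updates) / total


# Hypothetical local datasets, e.g. three clinics; all follow y = 3 * x.
sites = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
    [(1.5, 4.5)],
]

w = 0.0
for _ in range(50):  # iterate until the global model is trained
    w = federated_round(w, sites)

print("Global weight after training:", w)
```

The server only ever sees weights and sample counts, which is the sense in which the training moves and the data does not.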
Definition pair to keep straight → Differential privacy blurs the data · Federated learning moves the training, not the data. If the scenario says "data never leaves the local site" → federated learning.