Data Questions, Quality, Jurisdiction and Lineage
Without the right data, enough of it, and accurate data, the system won't perform: garbage in, garbage out. Anticipate jurisdictional requirements (privacy, data localisation laws, regulatory disclosures such as KYC), and keep data lineage and data provenance charted and documented.
Four questions come first, then the quality and traceability layer.
- WHAT data is required?
- HOW MUCH data is needed?
- HOW is data collected?
- WHERE is data stored?
Quality checks. "Garbage in, garbage out" → is the data accurate? Is it representative of the data the system will encounter in practice? Is it free from bias? Statistical sampling helps identify gaps.
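One way to picture the sampling check: draw a random sample, compare the share of each category against the share expected in practice, and flag large deviations. This is a minimal sketch, not a prescribed method. The names `category_shares` and `find_gaps`, the 5% tolerance, and the region figures are all made up for illustration.

```python
import random
from collections import Counter


def category_shares(values):
    """Return the share of each category in values as a dict."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def find_gaps(sample, expected_shares, tolerance=0.05):
    """Return categories whose sampled share differs from the expected
    share by more than tolerance (hypothetical threshold).

    Only categories listed in expected_shares are checked.
    """
    observed = category_shares(sample)
    gaps = {}
    for category, expected in expected_shares.items():
        actual = observed.get(category, 0.0)
        if abs(actual - expected) > tolerance:
            gaps[category] = {"expected": expected, "observed": actual}
    return gaps


# Illustrative data only: region labels attached to training records.
training_regions = ["north"] * 700 + ["south"] * 250 + ["east"] * 50

# Assumed shares of each region among real-world users.
expected = {"north": 0.40, "south": 0.25, "east": 0.35}

random.seed(0)  # fixed seed so the sample is repeatable
sample = random.sample(training_regions, 200)
gaps = find_gaps(sample, expected)

for category, shares in gaps.items():
    print(category, shares)
```

With these made-up figures, "north" is over-represented and "east" under-represented, which is the kind of gap that would prompt further collection or re-weighting.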
Jurisdictional requirements. Anticipate → privacy requirements · data localisation laws · regulatory disclosures like KYC ("Know Your Customer"), the process by which financial institutions verify customers and check that funding sources are legitimate. Investigate compliance obligations now and build them into development.
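Building obligations into development can be as simple as an automated check that runs against a dataset register. The sketch below assumes a hypothetical policy table, `ALLOWED_STORAGE`, that maps where data was collected to where it may be stored. The jurisdiction and region names are placeholders, not real legal rules; the actual mapping has to come from legal and compliance review.

```python
from dataclasses import dataclass

# Hypothetical policy table: placeholder names, not real legal rules.
ALLOWED_STORAGE = {
    "jurisdiction_a": {"region_a"},
    "jurisdiction_b": {"region_a", "region_b"},
}


@dataclass(frozen=True)
class Dataset:
    name: str
    collected_in: str  # jurisdiction where the data was collected
    stored_in: str     # region where the data is stored


def localisation_issues(datasets):
    """Return a human-readable issue for each dataset whose storage
    region is not covered by the policy table."""
    issues = []
    for ds in datasets:
        allowed = ALLOWED_STORAGE.get(ds.collected_in)
        if allowed is None:
            issues.append(f"{ds.name}: no policy recorded for {ds.collected_in}")
        elif ds.stored_in not in allowed:
            issues.append(
                f"{ds.name}: stored in {ds.stored_in}, "
                f"not permitted for {ds.collected_in}"
            )
    return issues


# Illustrative register entries.
datasets = [
    Dataset("crm_export", "jurisdiction_a", "region_a"),
    Dataset("kyc_documents", "jurisdiction_a", "region_b"),
    Dataset("web_logs", "jurisdiction_c", "region_a"),
]

issues = localisation_issues(datasets)
for issue in issues:
    print(issue)
```

A missing policy entry is reported rather than silently passed, so an unreviewed jurisdiction surfaces as an open question.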
| Data lineage | Data provenance |
|---|---|
| Tracks the flow of data over time → origin, how it changed, destination, across the life cycle | Tracks and logs the history and origin → creation and collection through transformation, incl. sources, processes, actors, methods |
| Used for historical context and tracing issues to a root cause | Used to ensure integrity and quality and to identify applicable laws tied to the data's origins |
Both data lineage and data provenance must be charted and documented → use datasheets or model inventory templates to record them.
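A datasheet entry could record the two side by side: provenance as a description of origin, lineage as an ordered list of transformations. The sketch below is one possible shape, not a standard template. The class names, field names, and example values are all assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Origin and history: source, actor, method, jurisdiction."""
    source: str
    collected_by: str
    collection_method: str
    jurisdiction: str


@dataclass
class LineageStep:
    """One movement or transformation of the data."""
    step: str
    input_name: str
    output_name: str


@dataclass
class DatasheetEntry:
    dataset: str
    provenance: ProvenanceRecord
    lineage: list = field(default_factory=list)

    def add_step(self, step, input_name, output_name):
        self.lineage.append(LineageStep(step, input_name, output_name))

    def trace(self):
        """Return the chain of names from origin to current destination."""
        if not self.lineage:
            return [self.dataset]
        return [self.lineage[0].input_name] + [s.output_name for s in self.lineage]


# Illustrative values only.
entry = DatasheetEntry(
    dataset="training_set",
    provenance=ProvenanceRecord(
        source="customer application forms",
        collected_by="onboarding team",
        collection_method="web form",
        jurisdiction="jurisdiction_a",
    ),
)
entry.add_step("remove duplicates", "raw_applications", "deduplicated")
entry.add_step("split for training", "deduplicated", "training_set")

print(" -> ".join(entry.trace()))
print(json.dumps(asdict(entry), indent=2))
```

The lineage chain supports tracing an issue back to a root cause, and the provenance fields carry the origin details needed to identify which laws apply.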
"You really need to look at the quality of the data that is going into your AI design and your overall system and model." - Julie McEwen, AIGP, CIPM, CIPT, FIP