Module 6: Governing AI Development · BoK III.B
Data Formats and the Five V's
Know the three structure types (structured, unstructured, semi-structured), the static/streaming split, and the five V's of data preparation: Volume, Velocity, Variety, Veracity, Value. Unstructured data fuels GenAI.
| | Structured | Unstructured | Semi-structured |
|---|---|---|---|
| Structure | Fixed fields → rows and columns | No specific structure → doesn't fit database fields | Partially structured → tags, elements or markers describe content |
| Use | Easier to analyse → business intelligence, quantitative | Predictive analytics, qualitative insights → fuels GenAI | Easier to process than unstructured → good for diverse or evolving sources |
| Examples | Customer records · transaction dates in a ledger | Social media posts · video, audio, images | Email (standard format + free text) · XML files |
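
To make the three columns concrete, here is a minimal Python sketch using only the standard library. The sample ledger, XML order and social post are invented for illustration and are not from the module. It shows that structured data can be aggregated directly by field name, semi-structured data can be navigated by its tags while the free-text part stays schema-free, and unstructured data offers no fields at all.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields map directly onto rows and columns.
# (Hypothetical ledger extract, made up for this example.)
ledger_csv = (
    "customer_id,transaction_date,amount\n"
    "101,2024-03-01,49.90\n"
    "102,2024-03-02,15.00\n"
)
rows = list(csv.DictReader(io.StringIO(ledger_csv)))
total = sum(float(r["amount"]) for r in rows)  # quantitative analysis is straightforward

# Semi-structured: tags describe the content, but the note itself is free text.
# (Hypothetical XML order record.)
order_xml = "<order id='7'><customer>101</customer><note>Please deliver after 5pm.</note></order>"
order = ET.fromstring(order_xml)
order_id = order.get("id")      # reachable through the markup
note = order.findtext("note")   # free text inside a known element

# Unstructured: no fields to query; only generic text operations apply
# until some model or parser extracts meaning from it.
post = "Loved the new store layout, but checkout took forever!"
word_count = len(post.split())
```

The design point is the amount of work needed before analysis: none for the ledger, a parse step for the XML, and substantially more (not shown here) for the post.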
Static data → does not change, e.g., records of past sales · Streaming data → changes frequently, e.g., customer visits to a website updating with each visit.
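The static/streaming split can be sketched in Python as follows. This is an illustrative toy, not a production pattern: the sales figures, customer names and the `running_visit_counts` helper are all assumptions made up for this example. Static data is summarised once; streaming data forces the result to be updated with each new event.

```python
from typing import Dict, Iterable, Iterator, List

# Static data: a closed set of past sales, so one pass gives a final answer.
past_sales: List[float] = [120.0, 80.0, 100.0]
static_total = sum(past_sales)

def running_visit_counts(visits: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield an updated per-customer visit count after each incoming visit."""
    counts: Dict[str, int] = {}
    for customer in visits:
        counts[customer] = counts.get(customer, 0) + 1
        yield dict(counts)  # snapshot of the state after this visit

# Streaming data: each website visit changes the current picture.
snapshots = list(running_visit_counts(["ana", "ben", "ana"]))
latest = snapshots[-1]
```

Here a short list stands in for the stream; in a real system the visits would arrive over time and the loop would not have a known end.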
The five V's of data preparation
Volume · Velocity · Variety · Veracity · Value → data preparation (data wrangling) turns raw data into valuable information; check all five.
- Volume → the sheer amount of data → plan storage, processing power, tooling
- Velocity → the speed at which data is generated and updated → static vs dynamic, real-time streams vs periodic refresh
- Variety → the different types → structured spreadsheets to unstructured images and video → drives tool selection
- Veracity → accuracy and trustworthiness → reliable sources, free of errors and bias → validation and cleansing protect integrity
- Value → usefulness toward the system's goals → low-value data wastes resources → prioritise datasets aligned to objectives
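
The checklist above could be turned into a rough dataset profile, as in the sketch below. Everything here is hypothetical: the `FiveVReport` class, the `profile` function, the field names and the sample records are assumptions for illustration, and the proxies are deliberately crude. In particular, veracity is approximated only by completeness of required fields, which says nothing about bias or source reliability.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Set

@dataclass
class FiveVReport:
    volume_rows: int                  # Volume: how much data there is
    velocity_updates_per_day: float   # Velocity: how often it is refreshed
    variety_formats: Set[str]         # Variety: which data types are involved
    veracity_valid_share: float       # Veracity: share of rows passing a basic check
    value_relevant_fields: Set[str]   # Value: fields that serve the system's goal

def profile(
    records: List[Dict[str, Any]],
    formats: Iterable[str],
    updates_per_day: float,
    required_fields: Iterable[str],
    goal_fields: Iterable[str],
) -> FiveVReport:
    required = list(required_fields)
    # A row counts as valid only if every required field is present and non-empty.
    valid = [r for r in records if all(r.get(f) not in (None, "") for f in required)]
    present: Set[str] = set()
    for r in records:
        present.update(r.keys())
    return FiveVReport(
        volume_rows=len(records),
        velocity_updates_per_day=updates_per_day,
        variety_formats=set(formats),
        veracity_valid_share=len(valid) / len(records) if records else 0.0,
        value_relevant_fields=present & set(goal_fields),
    )

# Hypothetical sample: two customer records, one with a missing amount.
report = profile(
    records=[
        {"customer_id": 101, "amount": 49.9},
        {"customer_id": 102, "amount": None},
    ],
    formats=["csv", "image"],
    updates_per_day=24,
    required_fields=["customer_id", "amount"],
    goal_fields=["amount", "region"],
)
```

A profile like this only flags where to look; deciding whether the data is fit for purpose still requires validation, cleansing and a judgement about the system's objectives.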
Key terms - quick answers
What is “Structured data”?
Data in fixed fields (rows and columns); easier to analyse, used for business intelligence.
What is “Unstructured data”?
Data with no specific structure (social posts, video, audio, images) that fuels GenAI.
What is “Semi-structured data”?
Partially structured data using tags, elements or markers (e.g. email, XML).
What is “Static data”?
Data that does not change, e.g. records of past sales.