AIGP Study Guide
Module 6: Governing AI Development · BoK III.B

Data Formats and the Five V's

Know the three structure types (structured, unstructured, semi-structured), the static/streaming split, and the five V's of data preparation: Volume, Velocity, Variety, Veracity, Value. Unstructured data fuels GenAI.

Three data structure types
Structured
  • Structure: fixed fields → rows and columns
  • Use: easier to analyse → business intelligence, quantitative
  • Examples: customer records · transaction dates in a ledger

Unstructured
  • Structure: no specific structure → doesn't fit database fields
  • Use: predictive analytics, qualitative insights → fuels GenAI
  • Examples: social media posts · video, audio, images

Semi-structured
  • Structure: partially structured → tags, elements or markers describe content
  • Use: easier to process than unstructured → good for diverse or evolving sources
  • Examples: email (standard format + free text) · XML files
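
To make the three structure types concrete, here is a minimal Python sketch that handles one sample of each. All of the sample data (the ledger rows, the email, the social post) is invented for illustration, and the standard-library parsers are just one convenient way to show the difference in how much structure each type offers.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields, so every row has the same columns.
# (Hypothetical ledger data, invented for this example.)
ledger_csv = "customer_id,date,amount\n101,2024-01-05,20.50\n102,2024-01-06,5.25\n"
rows = list(csv.DictReader(io.StringIO(ledger_csv)))
total = sum(float(row["amount"]) for row in rows)  # quantitative analysis is direct

# Semi-structured: tags describe the content, but the body is still free text.
# (Hypothetical email rendered as XML.)
message_xml = (
    "<email><to>support@example.com</to><subject>Refund</subject>"
    "<body>Hi, my order arrived damaged.</body></email>"
)
email = ET.fromstring(message_xml)
subject = email.findtext("subject")  # tagged fields can be looked up by name

# Unstructured: no fields at all, so any meaning has to be extracted from the content.
post = "Loved the new release, but setup took me all afternoon!"
word_count = len(post.split())  # only crude measures are available without further modelling
```

The structured sample can be summed immediately, the semi-structured sample can be queried by tag, and the unstructured sample offers nothing to query until some model or heuristic interprets the text.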

Static data → does not change, e.g., records of past sales · Streaming data → changes frequently, e.g., customer visits to a website updating with each visit.
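
The static/streaming split can be sketched in a few lines of Python. This is an illustration only: the sales figures and visitor names are made up, and a real stream would normally arrive from a message queue or event source rather than the small generator assumed here.

```python
from typing import Dict, Iterator

# Static: a fixed record of past sales. Reading it again gives the same answer.
past_sales = [120.0, 80.0, 100.0]
static_total = sum(past_sales)

def website_visits() -> Iterator[str]:
    """Hypothetical stand-in for a stream: yields one visit event at a time."""
    for visitor in ["alice", "bob", "alice"]:
        yield visitor

# Streaming: the result is updated with each event as it arrives.
visit_counts: Dict[str, int] = {}
for visitor in website_visits():
    visit_counts[visitor] = visit_counts.get(visitor, 0) + 1
```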

The five V's of data preparation

Volume · Velocity · Variety · Veracity · Value → data preparation (data wrangling) turns raw data into valuable information; check all five.

  • Volume → the sheer amount of data → plan storage, processing power, tooling
  • Velocity → the speed data is generated and updated → static vs dynamic, real-time streams vs periodic refresh
  • Variety → the different types → structured spreadsheets to unstructured images and video → drives tool selection
  • Veracity → accuracy and trustworthiness → reliable sources, free of errors and bias → validation and cleansing protect integrity
  • Value → usefulness toward the system's goals → low-value data wastes resources → prioritise datasets aligned to objectives
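
One way to practise "check all five" is to treat the five V's as a checklist over a dataset's basic profile. The sketch below is an assumption-laden illustration, not a standard method: the `DatasetProfile` fields, the `five_v_flags` function and both thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class DatasetProfile:
    """Hypothetical summary of a candidate dataset, one field per V."""
    row_count: int            # Volume
    updates_per_day: float    # Velocity (0 means static)
    formats: Set[str]         # Variety
    error_rate: float         # Veracity (share of records failing validation)
    supports_objective: bool  # Value

def five_v_flags(profile: DatasetProfile,
                 max_rows: int = 1_000_000,       # assumed threshold
                 max_error_rate: float = 0.05     # assumed threshold
                 ) -> List[str]:
    """Return the V's that need attention before the data is used."""
    flags = []
    if profile.row_count > max_rows:
        flags.append("Volume: plan storage, processing power and tooling")
    if profile.updates_per_day > 0:
        flags.append("Velocity: choose between real-time streams and periodic refresh")
    if len(profile.formats) > 1:
        flags.append("Variety: select tools that handle every format present")
    if profile.error_rate > max_error_rate:
        flags.append("Veracity: validate and cleanse before use")
    if not profile.supports_objective:
        flags.append("Value: deprioritise data not aligned to objectives")
    return flags

# Example: a large, hourly-updated mix of spreadsheets and images with few errors.
profile = DatasetProfile(
    row_count=5_000_000,
    updates_per_day=24,
    formats={"csv", "jpeg"},
    error_rate=0.01,
    supports_objective=True,
)
flags = five_v_flags(profile)
```

For this example profile the checklist raises Volume, Velocity and Variety, while Veracity and Value pass under the assumed thresholds.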

Key terms - quick answers

What is “Structured data”?
Data in fixed fields (rows and columns); easier to analyse, used for business intelligence.
What is “Unstructured data”?
Data with no specific structure (social posts, video, audio, images) that fuels GenAI.
What is “Semi-structured data”?
Partially structured data using tags, elements or markers (e.g. email, XML).
What is “Static data”?
Data that does not change, e.g. records of past sales.