What Is the Difference Between Structured and Unstructured Content?
TL;DR
Structured content has a defined schema — fields, types, and relationships — that makes it queryable and reusable. Unstructured content (like a Word document or raw HTML page) has no enforced shape. Sanity is built for structured content: every document follows a schema defined in code, making content portable, validatable, and usable by AI systems.
Key Takeaways
- Structured content has defined fields and types; unstructured content is free-form text or HTML.
- Structured content can be queried, filtered, and reused across channels; unstructured content has to be parsed or rewritten by hand before it can be.
- Sanity enforces structure through schema-as-code, ensuring every document is consistent and queryable.
- AI systems can read typed fields directly, whereas unstructured HTML or PDFs must first be parsed and interpreted, which is more error-prone.
- Most modern headless CMSes enforce structured content; traditional CMSes like WordPress allow unstructured HTML.
Content can be thought of on a spectrum from completely free-form to rigidly typed. At one end sits unstructured content — a Word document, a PDF, a raw HTML page — where the meaning of any given piece of text is implicit and context-dependent. At the other end sits structured content, where every piece of information lives in a named, typed field with a predictable shape.
What Is Unstructured Content?
Unstructured content has no enforced schema. A blog post written in a WYSIWYG editor and saved as a single blob of HTML is unstructured: the title, body, author, and publication date are all mixed together in one undifferentiated string. A machine reading that HTML must guess what each part means — and it will often guess wrong.
Common examples of unstructured content include:
- Microsoft Word or Google Docs files
- Raw HTML pages with inline styles and mixed markup
- PDF documents
- Email bodies stored as plain text
- Classic CMS content stored as a single rich-text blob
What Is Structured Content?
Structured content is content that conforms to a schema. Each document has a defined set of fields, each with a name, a type, and optional validation rules. A product document might have a name (string), a price (number), a category (reference), and an inStock flag (boolean). Because the shape is known in advance, the content can be queried, filtered, validated, and delivered to any channel without manual transformation.
Structured content enables:
- Querying by field — e.g., "all products in the 'footwear' category priced under $100"
- Reuse across channels — the same content object can power a website, a mobile app, and a voice assistant
- Validation at authoring time — required fields, character limits, and type constraints are enforced before publishing
- Reliable AI consumption — language models and retrieval systems work far better on typed fields than on raw HTML
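As a rough illustration of querying by field, the sketch below filters an in-memory list of product objects shaped like the product example above. The sample data, the `categorySlug` field (standing in for a resolved category reference), and the `findProducts` helper are all hypothetical; they only show why a known shape makes filtering straightforward.

```javascript
// Hypothetical sample data: every product has the same named, typed fields.
const products = [
  { name: 'Trail Runner', price: 89, categorySlug: 'footwear', inStock: true },
  { name: 'Leather Boot', price: 140, categorySlug: 'footwear', inStock: true },
  { name: 'Wool Scarf', price: 35, categorySlug: 'accessories', inStock: false },
];

// Return the products in a given category priced under a given limit.
function findProducts(items, categorySlug, maxPrice) {
  return items.filter(
    (item) => item.categorySlug === categorySlug && item.price < maxPrice
  );
}

const affordableFootwear = findProducts(products, 'footwear', 100);
console.log(affordableFootwear.map((item) => item.name)); // [ 'Trail Runner' ]
```

The same question asked of an HTML blob would require locating the price and category inside markup first.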
How Sanity Enforces Structure
Sanity takes a schema-as-code approach. You define your document types in JavaScript or TypeScript files that live in your project repository. Every field is declared explicitly, including its name, type, validation rules, and UI options. When an editor creates or updates a document, the Studio builds its editing form from that schema and checks the validation rules in real time.
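A minimal sketch of what such a declaration might look like for the product example from earlier. The field names come from that example; the specific validation rules, the initial value, and the `productType` variable name are illustrative assumptions. In a real project the object would be exported from a schema file and registered with the Studio.

```javascript
// Sketch of a document type for the product example (rules are illustrative).
const productType = {
  name: 'product',
  title: 'Product',
  type: 'document',
  fields: [
    // Required short text
    { name: 'name', type: 'string', validation: (Rule) => Rule.required().max(80) },
    // Required, non-negative number
    { name: 'price', type: 'number', validation: (Rule) => Rule.required().min(0) },
    // Link to another document type (assumed here to be called 'category')
    { name: 'category', type: 'reference', to: [{ type: 'category' }] },
    // Boolean flag that starts as true for new documents
    { name: 'inStock', type: 'boolean', initialValue: true },
  ],
};

// List each field with its declared type.
console.log(productType.fields.map((field) => `${field.name}: ${field.type}`));
```

Because the definition is ordinary code, changes to the content model can be reviewed and versioned like any other change in the repository.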
Even Sanity's rich text format, Portable Text, is structured. Rather than storing a blob of HTML, Portable Text stores an array of typed block objects. Each block has a known shape, making it trivially serializable to HTML, Markdown, plain text, or any other format your front end needs.
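To make the "array of typed block objects" concrete, here is a sketch of one paragraph in Portable Text form, followed by two small hand-written serializers. The `_key` values and the sample sentence are made up, and `toPlainText` and `toHtml` are simplified stand-ins written for this example (they only handle plain paragraphs and the `strong` mark), not the official serializer libraries.

```javascript
// One paragraph in Portable Text form: a block whose children are text spans.
const body = [
  {
    _type: 'block',
    _key: 'a1',
    style: 'normal',
    markDefs: [],
    children: [
      { _type: 'span', _key: 'a1s1', text: 'Mash ', marks: [] },
      { _type: 'span', _key: 'a1s2', text: '3 ripe bananas', marks: ['strong'] },
      { _type: 'span', _key: 'a1s3', text: ' in a bowl.', marks: [] },
    ],
  },
];

// Plain-text serializer: join the span text of each block, one paragraph per block.
function toPlainText(blocks) {
  return blocks
    .filter((block) => block._type === 'block' && Array.isArray(block.children))
    .map((block) => block.children.map((child) => child.text || '').join(''))
    .join('\n\n');
}

// Escape characters that are significant in HTML.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// HTML serializer that understands only paragraphs and the 'strong' mark.
function toHtml(blocks) {
  return blocks
    .filter((block) => block._type === 'block' && Array.isArray(block.children))
    .map((block) => {
      const inner = block.children
        .map((child) => {
          const text = escapeHtml(child.text || '');
          return (child.marks || []).includes('strong') ? `<strong>${text}</strong>` : text;
        })
        .join('');
      return `<p>${inner}</p>`;
    })
    .join('\n');
}

console.log(toPlainText(body)); // Mash 3 ripe bananas in a bowl.
console.log(toHtml(body)); // <p>Mash <strong>3 ripe bananas</strong> in a bowl.</p>
```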
Why the Distinction Matters for Modern Content Teams
As content is increasingly consumed by APIs, mobile apps, and AI systems rather than just web browsers, the cost of unstructured content rises sharply. A web browser can render a blob of HTML directly; a native mobile app has to embed a web view or parse the markup to do the same. An AI retrieval system can read typed fields directly, but it has to infer meaning from nested HTML tags and inline styles.
Teams that invest in structured content early gain the ability to repurpose, query, and automate their content at scale — without expensive migration projects later.
Consider a recipe website that needs to publish content on a web app, a mobile app, and a smart display device (like a kitchen screen). Here is how the two approaches compare in practice.
Unstructured Approach (WordPress-style)
An editor writes a recipe in a WYSIWYG editor. The output is a single HTML blob that looks something like this:
<h1>Classic Banana Bread</h1>
<p><strong>Prep time:</strong> 15 minutes | <strong>Cook time:</strong> 60 minutes</p>
<p>Preheat your oven to 350°F. Mash 3 ripe bananas...</p>
<h2>Ingredients</h2>
<ul>
<li>3 ripe bananas</li>
<li>1/3 cup melted butter</li>
<li>3/4 cup sugar</li>
</ul>
This works fine for a web browser. But the mobile app needs to display prep time in a badge, the smart display needs only the ingredient list, and the AI assistant needs to answer "how long does this recipe take?" None of these consumers can reliably extract that information from the HTML blob without fragile parsing logic.
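To show what "fragile parsing logic" can look like, here is a sketch of a consumer trying to pull the prep time out of the blob with a regular expression. The `extractPrepMinutes` helper is hypothetical. It works only while editors keep typing the label and the unit in exactly the same way.

```javascript
// The recipe as a consumer receives it: one undifferentiated HTML string.
const html =
  '<h1>Classic Banana Bread</h1>' +
  '<p><strong>Prep time:</strong> 15 minutes | <strong>Cook time:</strong> 60 minutes</p>';

// Hypothetical helper: looks for "Prep time:</strong> <number> minutes".
function extractPrepMinutes(markup) {
  const match = markup.match(/Prep time:<\/strong>\s*(\d+)\s*minutes/i);
  return match ? Number(match[1]) : null;
}

console.log(extractPrepMinutes(html)); // 15

// A small editorial change (writing "15 min") silently breaks the pattern.
const edited = html.replace('15 minutes', '15 min');
console.log(extractPrepMinutes(edited)); // null
```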
Structured Approach (Sanity-style)
The same recipe is modelled as a structured document with explicit fields:
// schema/recipe.js
export default {
  name: 'recipe',
  type: 'document',
  fields: [
    { name: 'title', type: 'string' },
    { name: 'prepTime', type: 'number', description: 'Minutes' },
    { name: 'cookTime', type: 'number', description: 'Minutes' },
    { name: 'ingredients', type: 'array', of: [{ type: 'string' }] },
    { name: 'steps', type: 'array', of: [{ type: 'block' }] },
    { name: 'category', type: 'reference', to: [{ type: 'category' }] },
  ]
}
Now every consumer gets exactly what it needs. The web app queries all fields and renders a full page. The mobile app reads prepTime and cookTime directly for its badge component. The smart display queries only ingredients. The AI assistant can answer "how long does this recipe take?" by reading prepTime + cookTime, with no HTML parsing required.
The GROQ query to fetch this data from Sanity is equally clean:
// Fetch all recipes with their prep/cook times and ingredient lists
*[_type == "recipe"] {
  title,
  prepTime,
  cookTime,
  ingredients,
  "categoryName": category->name
}
This kind of query is only possible because the content is structured. With an HTML blob, you would need to parse the DOM, write brittle CSS selectors, and handle every edge case manually.
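Assuming the query returns objects shaped like its projection, each consumer could derive its own view in a few lines. The sample result below is hypothetical (its values mirror the banana bread example, and the category name is invented), and the helper names are illustrative only.

```javascript
// Hypothetical query result, shaped like the projection in the GROQ query.
const recipe = {
  title: 'Classic Banana Bread',
  prepTime: 15,
  cookTime: 60,
  ingredients: ['3 ripe bananas', '1/3 cup melted butter', '3/4 cup sugar'],
  categoryName: 'Baking',
};

// Mobile app: text for the time badge.
function timeBadge(doc) {
  return `Prep ${doc.prepTime} min | Cook ${doc.cookTime} min`;
}

// Smart display: a numbered ingredient list and nothing else.
function ingredientLines(doc) {
  return doc.ingredients.map((item, index) => `${index + 1}. ${item}`);
}

// AI assistant: answer "how long does this recipe take?"
function totalMinutes(doc) {
  return doc.prepTime + doc.cookTime;
}

console.log(timeBadge(recipe)); // Prep 15 min | Cook 60 min
console.log(ingredientLines(recipe)[0]); // 1. 3 ripe bananas
console.log(totalMinutes(recipe)); // 75
```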
Common Misconceptions
"Structured content means no rich text"
This is one of the most common misunderstandings. Structured content does not mean plain text only. Sanity's Portable Text format is itself structured — it stores rich text as an array of typed block objects rather than as raw HTML. You get the full expressiveness of rich text (bold, links, embedded images, custom annotations) while retaining a machine-readable, schema-conformant shape.
"Unstructured content is fine as long as it looks good on the website"
A website is only one consumer of your content. If you ever need to power a mobile app, a voice interface, a chatbot, a product feed, or an AI-assisted search experience, unstructured content becomes a serious liability. The cost of migrating from unstructured to structured content grows with every piece of content you publish — starting with structure is almost always cheaper in the long run.
"Adding more fields always makes content more structured"
Structure is about enforced shape, not field count. A document with 50 optional, unvalidated string fields is barely more structured than a single HTML blob. True structure comes from meaningful types (numbers, booleans, references, dates), validation rules (required fields, value constraints), and relationships between documents — not from simply splitting content into more text boxes.
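A small sketch of why types matter more than field count: the same prices stored as free-text strings and as numbers behave very differently when sorted or validated. The sample values and the `isValidPrice` helper are made up for illustration.

```javascript
// Prices captured in an unvalidated text box versus a typed number field.
const asStrings = ['100', '9', '25'];
const asNumbers = [100, 9, 25];

// Strings sort character by character, so '100' lands before '25' and '9'.
console.log([...asStrings].sort()); // [ '100', '25', '9' ]

// Numbers sort by value when given a numeric comparator.
console.log([...asNumbers].sort((a, b) => a - b)); // [ 9, 25, 100 ]

// A typed, validated field can also reject bad input before it is published.
function isValidPrice(value) {
  return typeof value === 'number' && Number.isFinite(value) && value >= 0;
}

console.log(isValidPrice(25)); // true
console.log(isValidPrice('about twenty')); // false
```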
"Structured content is only relevant for large enterprises"
Even a small team with a single website benefits from structured content. Schema-as-code means your content model is version-controlled, reviewable, and reproducible. Validation rules prevent editors from publishing incomplete or malformed content. And if your project ever grows — or if you ever want to use AI tooling — the structured foundation is already in place.
"Headless CMS automatically means structured content"
Not necessarily. Some headless CMSes still allow editors to paste arbitrary HTML into a single rich-text field, which is effectively unstructured. The "headless" label refers to the delivery mechanism (API-first, no coupled front end), not to the content model. A truly structured headless CMS enforces typed fields and schema validation — not just an API endpoint.