What is Unstructured Data?

A conversation with a technology executive last week got me on my high horse. “Unstructured data isn’t unstructured at all,” I said. “Tell me more,” they said.

This distinction is worth arguing over. IT conventionally sorts data into three categories, and understanding them helps explain why an often-cited 80% of enterprise data stays largely invisible to traditional analytics.

Structured data is the world of databases: Oracle, SQL Server, PostgreSQL. By design, databases play traffic cop to every piece of information they hold, enforcing rigid schemas with no exceptions. You get extraordinary performance and scalability, but if data doesn’t fit the preordained mold, it simply doesn’t get in.
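To make the traffic cop concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `clips` table, its columns, and the `try_insert` helper are made up for illustration; a production database would enforce far more than this, but the principle is the same: rows that don't match the schema are turned away.

```python
import sqlite3

def try_insert(conn, sql, params):
    """Return True if the database accepted the row, False if it rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")

# Hypothetical schema: every clip must have an id and a positive frame rate.
conn.execute(
    "CREATE TABLE clips ("
    " clip_id TEXT NOT NULL,"
    " frame_rate REAL NOT NULL CHECK (frame_rate > 0)"
    ")"
)

# A row that fits the mold gets in.
accepted = try_insert(
    conn,
    "INSERT INTO clips (clip_id, frame_rate) VALUES (?, ?)",
    ("A001", 23.976),
)

# A row with a missing required value is rejected.
missing_value = try_insert(
    conn,
    "INSERT INTO clips (clip_id, frame_rate) VALUES (?, ?)",
    ("A002", None),
)

# A row carrying a field the schema never planned for is rejected too.
extra_field = try_insert(
    conn,
    "INSERT INTO clips (clip_id, frame_rate, lens) VALUES (?, ?, ?)",
    ("A003", 24.0, "50mm"),
)

row_count = conn.execute("SELECT COUNT(*) FROM clips").fetchone()[0]
print(accepted, missing_value, extra_field, row_count)
```

Only the first insert survives. The other two fail for different reasons (a constraint violation and an unknown column), but the outcome is identical: the data stays outside.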

Semi-structured data occupies the middle ground. The file itself becomes the container; think of a JSON or XML sidecar file carrying metadata, or an SRT file that sits alongside every deliverable. There are patterns to the data inside the file, but those patterns aren’t set in stone, and any application that understands the format can read from or write to it.
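A small sketch of what "patterns that aren't set in stone" looks like in practice. The two sidecar documents and the `summarize` helper below are invented for illustration; the point is that both parse with the same JSON reader even though they share only one field.

```python
import json

# Two hypothetical sidecar files. Same format, different shapes.
sidecar_a = '{"clip": "A001", "camera": "ARRI", "frame_rate": 23.976}'
sidecar_b = '{"clip": "B014", "notes": ["pickup shot"], "color": {"lut": "show_v2.cube"}}'

def summarize(text):
    """Parse a sidecar and report the fields we expect plus whatever else it carries."""
    record = json.loads(text)
    known = ("clip", "frame_rate")
    return {
        "clip": record.get("clip"),              # present in both examples
        "frame_rate": record.get("frame_rate"),  # None when the file omits it
        "extra_keys": sorted(k for k in record if k not in known),
    }

summary_a = summarize(sidecar_a)
summary_b = summarize(sidecar_b)
print(summary_a)
print(summary_b)
```

Nothing rejects the second file for lacking a frame rate or for adding nested color information. The reader has to cope with what it finds, which is the trade-off that separates this category from a database.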

Then there’s unstructured data, the vast ocean where most production information lives: camera raw footage, EXR plates from compositing, render files from VFX, color grading LUTs. Every file is its own container, but every container has its own internal layout. ARRI raw and RED raw files might sit in the same folder, but they store the same live footage completely differently inside.
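Here is a toy illustration of two containers holding the same information in different internal layouts. The `FMTA` and `FMTB` formats are entirely hypothetical and bear no relation to how ARRI or RED files are actually laid out; they exist only to show that each layout is exact, and that a reader has to know which one it is looking at.

```python
import struct

def write_a(width, height):
    # Hypothetical layout A: 4-byte magic, then little-endian uint32 width, height.
    return b"FMTA" + struct.pack("<II", width, height)

def write_b(width, height):
    # Hypothetical layout B: 4-byte magic, then big-endian uint16 height, width
    # (note the swapped field order and the smaller integer size).
    return b"FMTB" + struct.pack(">HH", height, width)

def read_dimensions(blob):
    """Pick the right parser from the magic bytes and return (width, height)."""
    magic = blob[:4]
    if magic == b"FMTA":
        width, height = struct.unpack_from("<II", blob, 4)
    elif magic == b"FMTB":
        height, width = struct.unpack_from(">HH", blob, 4)
    else:
        raise ValueError(f"unknown format: {magic!r}")
    return width, height

file_a = write_a(4096, 2160)
file_b = write_b(4096, 2160)

print(file_a != file_b)         # the bytes on disk differ
print(read_dimensions(file_a))  # yet both describe the same frame size
print(read_dimensions(file_b))
```

Both files are rigorously structured. What they lack is a shared structure, so software that knows only one layout sees the other as an opaque blob.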

“Unstructured” as a name is a legacy of database- or IT-centric thinking. Despite the name, it by no means equals disorganized. It simply means that the data doesn’t conform to the rows-and-columns logic that relational databases want. A 4K camera file has incredibly precise structure inside it, but it’s considered “unstructured” because a database can’t parse it.

I’ll admit I get animated about taxonomy. But naming shapes budgeting; call something “unstructured” and finance treats it as chaos to be tolerated rather than as an asset to be managed. The fact that this distinction makes people lean in suggests we’re ready to retire some inherited vocabulary.
