Structured & Unstructured Data

Data can be in three different forms: unstructured, semi-structured, and structured.

Unstructured Data

Unstructured data is, essentially, a collection of content or data points that has not been organized or categorized. A folder full of images or digitized texts is a form of unstructured data. (In both cases, however, steps can be taken to structure the content.)

Examples of unstructured data:

  • Text files: such as Word documents, PDFs, TXT files

  • Multimedia content: image files, such as TIFF and JPEG; audio/video files, such as MP3 and MP4

  • Qualitative data: such as survey responses, interview transcripts

Semi-structured Data

Semi-structured data lies midway between structured and unstructured data. It doesn't follow a specific relational or tabular data model, but it includes tags and semantic markers that organize the data into records and fields. Common formats for semi-structured data are JSON and XML.

The following is an example of semi-structured data using JSON. The data describes an author's work.

{
    "name": {
        "surname": "Lee",
        "given-name": "Julia",
        "viaf_id": 49595329
    },
    "role": "author",
    "degrees": "Ph.D.",
    "affiliation": {
        "class": "academic institution",
        "institution": "Acadimia as Colonialism"
    }
}
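To see how the keys in the JSON above work as tags, here is a minimal sketch in Python that loads the same record with the standard `json` module and reads nested values by name. The variable names (`record_text`, `record`, and so on) are chosen for illustration and are not part of the original example.

```python
import json

# The author record from the example above, held as a JSON string.
record_text = """
{
    "name": {
        "surname": "Lee",
        "given-name": "Julia",
        "viaf_id": 49595329
    },
    "role": "author",
    "degrees": "Ph.D.",
    "affiliation": {
        "class": "academic institution",
        "institution": "Acadimia as Colonialism"
    }
}
"""

# json.loads turns the text into nested Python dictionaries.
record = json.loads(record_text)

# Keys act as the tags: nested values are reached by name, not by position.
surname = record["name"]["surname"]
viaf_id = record["name"]["viaf_id"]
institution = record["affiliation"]["institution"]

print(surname, viaf_id)
print(sorted(record.keys()))
```

Because the structure travels with the data, another record could add or omit a key (a second degree, for instance) without breaking this format. That flexibility is what separates semi-structured data from a fixed table.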

Structured Data

Structured data is data that is organized and categorized so that it can be analyzed more effectively, in particular by tools such as databases and data visualization applications.

Understanding a little about structured data provides a lot of insight into how data works in various data tools. Data can be structured in tabular form (as in spreadsheets) or in tables created with coding and markup languages. For the sake of simplicity, we will look at structured data through the lens of tabular data.

Structured Data Example

Tabular data, what we think of as spreadsheets, is structured data organized in rows and columns. Each row represents a record (or unit of analysis), and each column represents a different attribute (also referred to as a variable or field).

An attribute describes everything that falls within it or, in this case, underneath it. Think of it like tagging. Everything in a column is tagged by the attribute. Each horizontal line is a row, and a single row makes up what is called a record, meaning a series of data points that go together.

To put tabular data or a spreadsheet into a more relatable context, here is an imaginary DMV database spreadsheet.

Notice that each data point falls under the appropriate attribute and each row represents a single driver's license (a record). Also notice that none of the driver's license numbers repeat. These are unique identifiers that help distinguish records from one another when the rest of the information is the same or very similar. Moreover, the unique identifier is a data point by which the record can be searched.
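The role of the unique identifier can be sketched in a few lines of Python. The records, column names, and license numbers below are invented for illustration and are not taken from the spreadsheet described above; the sketch only assumes a table with one row per license and one column per attribute.

```python
import csv
import io

# Hypothetical DMV records, invented for illustration only.
# The first line names the attributes (columns); each later line is a record (row).
csv_text = """license_number,last_name,first_name,state,expiration_year
D1000001,Garcia,Ana,CA,2027
D1000002,Smith,John,CA,2026
D1000003,Smith,John,CA,2029
"""

# DictReader tags every value in a row with its column's attribute name.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Index the records by their unique identifier.
by_license = {row["license_number"]: row for row in rows}

# If any license number repeated, the index would be shorter than the table.
assert len(by_license) == len(rows)

# Two drivers share the same name, so the name alone cannot tell them apart.
john_smiths = [
    r for r in rows
    if r["last_name"] == "Smith" and r["first_name"] == "John"
]

# The license number still picks out exactly one record.
record = by_license["D1000003"]

print(len(john_smiths))
print(record["expiration_year"])
```

Note that the `csv` module reads every value as text, so `expiration_year` comes back as the string `"2029"` and would need converting before any arithmetic.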

As shown in the example, structured data is highly organized and easily processed by machines. Those working within relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS).
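As a rough illustration of inputting, searching, and manipulating structured data in an RDBMS, the sketch below uses SQLite through Python's built-in `sqlite3` module. The `licenses` table, its columns, and its rows are hypothetical and chosen only to match the driver's license theme; a real DMV system would look quite different.

```python
import sqlite3

# An in-memory SQLite database stands in for a full RDBMS.
conn = sqlite3.connect(":memory:")

# The table definition fixes the attributes; the primary key is the unique identifier.
conn.execute("""
    CREATE TABLE licenses (
        license_number  TEXT PRIMARY KEY,
        last_name       TEXT NOT NULL,
        first_name      TEXT NOT NULL,
        expiration_year INTEGER NOT NULL
    )
""")

# Input: insert hypothetical records (invented for illustration).
conn.executemany(
    "INSERT INTO licenses VALUES (?, ?, ?, ?)",
    [
        ("D1000001", "Garcia", "Ana", 2027),
        ("D1000002", "Smith", "John", 2026),
        ("D1000003", "Smith", "John", 2029),
    ],
)

# Search: look up one record by its unique identifier.
found = conn.execute(
    "SELECT first_name, last_name FROM licenses WHERE license_number = ?",
    ("D1000002",),
).fetchone()

# Manipulate: renew that license by updating a single attribute.
conn.execute(
    "UPDATE licenses SET expiration_year = ? WHERE license_number = ?",
    (2031, "D1000002"),
)
updated = conn.execute(
    "SELECT expiration_year FROM licenses WHERE license_number = ?",
    ("D1000002",),
).fetchone()[0]

# The database itself refuses a second record with an existing identifier.
try:
    conn.execute(
        "INSERT INTO licenses VALUES (?, ?, ?, ?)",
        ("D1000001", "Doe", "Jane", 2030),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.close()
print(found, updated, duplicate_rejected)
```

Declaring `license_number` as the primary key is a design choice in this sketch: it moves the "no repeats" rule out of the user's head and into the database, which enforces it on every insert.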
