Data can be in three different forms: unstructured, semi-structured, and structured.
Unstructured data is, essentially, a bucket of content or data points that are not organized or categorized. A folder full of images or a collection of digitized texts is a form of unstructured data. (In both cases, however, steps can be taken to structure them.)
Text files: Word documents, PDFs, TXT files
Multimedia content: image files such as TIFF and JPEG; audio/video files such as MP3 and MP4
Qualitative data: survey responses, interview transcripts
Semi-structured data lies midway between structured and unstructured data. It does not follow a specific relational or tabular data model, but it includes tags and semantic markers that separate data into records and fields. Common examples of semi-structured formats are JSON and XML.
The following is an example of semi-structured data using JSON. The data describes an author's work.
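A record of this kind might look like the minimal Python sketch below. The author, titles, and key names are invented for illustration and are not drawn from a real dataset. The point is that keys act as the tags that mark out fields, and that records can nest and need not all share the same fields.

```python
import json

# Hypothetical record: the author, titles, and keys are made up for illustration.
record_text = """
{
  "author": {"name": "Jane Doe", "born": 1850},
  "works": [
    {"title": "An Example Novel", "year": 1880, "genres": ["fiction"]},
    {"title": "Collected Letters", "year": 1891}
  ]
}
"""

record = json.loads(record_text)

# Keys work like tags: each one labels the value that follows it.
titles = [work["title"] for work in record["works"]]

# Unlike rows in a table, records need not share the same fields:
# the second work has no "genres" key, so supply a default when reading it.
genres = [work.get("genres", []) for work in record["works"]]

print(titles)
print(genres)
```

The missing "genres" key in the second work is what makes the data semi-structured rather than tabular: a spreadsheet would require that column to exist for every row.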
Structured data is data that is organized and categorized so that it can be analyzed more effectively, in particular by tools such as databases and data visualization applications.
Understanding a little about structured data provides a lot of insight into how data works in various data tools. Data can be structured in tabular form (as in spreadsheets) or in tables created using programming and markup languages. For the sake of simplicity, we will look at structured data through the lens of tabular data.
Tabular data, what we think of as spreadsheets, is structured data organized in rows and columns. Each row represents a record (or unit of analysis), and each column represents a different attribute (also referred to as a variable or field).
An attribute describes everything that falls within it or, in this case, underneath it. Think of it like tagging. Everything in a column is tagged by the attribute. Each horizontal line is a row, and a single row makes up what is called a record, meaning a series of data points that go together.
To put tabular data or a spreadsheet into a more relatable context, here is an imaginary DMV database spreadsheet.
Notice that each data point falls under the appropriate attribute and each row represents a single driver's license (a record). Also notice that none of the driver's license numbers repeat. These are unique identifiers that help distinguish records from one another when their other information is the same or very similar. Moreover, the unique identifier is a data point by which the record can be searched.
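The ideas of records, attributes, and unique identifiers can be sketched in Python as follows. This is only an illustration: the column names and every value in the table are invented, in the spirit of the imaginary DMV spreadsheet.

```python
import csv
import io

# Hypothetical DMV-style table: every name, number, and date is invented.
table_text = """license_number,last_name,first_name,state,expiration_date
D1000001,Smith,Alex,WA,2027-05-01
D1000002,Smith,Alex,WA,2026-11-15
D1000003,Garcia,Maria,WA,2028-02-20
"""

# DictReader uses the header row as attribute names; each later row is a record.
records = list(csv.DictReader(io.StringIO(table_text)))

# The unique identifier must not repeat across records.
license_numbers = [row["license_number"] for row in records]
assert len(license_numbers) == len(set(license_numbers))

# Index records by the identifier so a single record can be looked up,
# even when other attributes (here, the name) are identical.
by_license = {row["license_number"]: row for row in records}
print(by_license["D1000002"]["expiration_date"])
```

Two of the records share the same first and last name, so only the license number tells them apart.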
As shown in the example, structured data is highly organized and easily processed by machines. Those working with relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS).
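To make input, search, and manipulation concrete, here is a small sketch that uses SQLite through Python's standard sqlite3 module as a stand-in for an RDBMS. The table name, column names, and rows are hypothetical and simply continue the imaginary DMV example.

```python
import sqlite3

# In-memory SQLite database; table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE licenses (
        license_number TEXT PRIMARY KEY,  -- unique identifier
        last_name      TEXT NOT NULL,
        state          TEXT NOT NULL,
        expiration     TEXT NOT NULL
    )
    """
)

# Input: insert records.
conn.executemany(
    "INSERT INTO licenses VALUES (?, ?, ?, ?)",
    [
        ("D1000001", "Smith", "WA", "2027-05-01"),
        ("D1000002", "Smith", "WA", "2026-11-15"),
        ("D1000003", "Garcia", "WA", "2028-02-20"),
    ],
)

# Search: find every record matching an attribute value.
smiths = conn.execute(
    "SELECT license_number FROM licenses WHERE last_name = ? ORDER BY license_number",
    ("Smith",),
).fetchall()

# Manipulate: update one record, located by its unique identifier.
conn.execute(
    "UPDATE licenses SET expiration = ? WHERE license_number = ?",
    ("2030-11-15", "D1000002"),
)
updated = conn.execute(
    "SELECT expiration FROM licenses WHERE license_number = ?", ("D1000002",)
).fetchone()[0]

print(smiths, updated)
conn.close()
```

Declaring the license number as the primary key means the database itself rejects a duplicate identifier, rather than leaving that check to the person entering data.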
While data-oriented scholarship is perhaps more often associated with the sciences and social sciences, it has as much purpose and relevance in the humanities.
Data visualization can be used to illustrate social networks, how information spreads over time and place, historical, literary, and intellectual trends, and much more. Existing projects, for example, visualize literary networks and the spread of the US Postal Service in the nineteenth century.
Database creation also makes up a considerable amount of humanities data-related scholarship. Such databases often incorporate primary sources and facilitate the asking and answering of research questions. Examples include a database created from a nineteenth-century Puget Sound Customs District ledger and a highly collaborative, grant-funded database created from slavery-related records provided by different archives and from datasets of existing projects.
There are two types of data: quantitative and qualitative. Generally speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. There are also different types of quantitative and qualitative data. (See also the Qualitative vs Quantitative Data article.)
Qualitative data is used to characterize objects or observations and is collected in a non-numerical, non-binary form, such as language. Qualitative data can include:
Text
Audio and video recordings
Experiment notes, lab reports
Interview transcripts
Two types of qualitative data are categorical, meaning data that can be organized into groups, and ordinal, meaning data that follows a natural order.
Quantitative data, as the name suggests, relates to the quantity of something, and typical examples of quantitative data are numbers. Quantitative data can include:
Survey data, including longitudinal and cross-sectional studies
Counts and frequencies
Calculations, such as monthly gross margin
Quantification: converting descriptive data to numbers, such as a satisfaction rating from 1 to 4
Two types of quantitative data are continuous, meaning numbers that can be divided and measured more precisely (e.g., a magnitude 4.3 earthquake), and discrete, meaning numbers that cannot be divided (e.g., the number of people in a household cannot be a fraction such as 3.5).
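The distinction can be sketched in Python, where continuous values are naturally stored as floating-point numbers and discrete counts as integers. The measurements and the helper function `is_discrete` are invented for illustration.

```python
# Hypothetical measurements, invented for illustration.
earthquake_magnitudes = [4.3, 4.31, 5.0]   # continuous: can be refined further
household_sizes = [3, 4, 1, 2]             # discrete: whole counts only

def is_discrete(values):
    """Return True if every value is a whole-number count."""
    return all(isinstance(v, int) and not isinstance(v, bool) for v in values)

print(is_discrete(household_sizes))
print(is_discrete(earthquake_magnitudes))

# A summary of discrete data can still be fractional: the average
# household size here is 2.5, although no household has 2.5 people.
average_size = sum(household_sizes) / len(household_sizes)
print(average_size)
```

The last step is worth noting: an individual discrete value is always whole, but a figure calculated from many of them does not have to be.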
Categorical: states (e.g., New York, Massachusetts, Arizona); people's names (e.g., Matt, Emily, Maria); brands (e.g., Coke, Pepsi, Dr. Pepper)
Ordinal: economic class (e.g., lower class, middle class, upper class); satisfaction scale (e.g., extremely dislike, dislike, neutral, like, extremely like); sports medals (e.g., gold, silver, bronze)
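A short Python sketch can show why the difference between categorical and ordinal data matters in practice. The sample responses are invented, and the numeric ranks assigned to the satisfaction scale are an assumption made for this example.

```python
from collections import Counter

# Categorical: groups with no inherent order, so counting is the natural summary.
brands = ["Coke", "Pepsi", "Coke", "Dr. Pepper", "Coke"]
brand_counts = Counter(brands)

# Ordinal: categories with a natural order, made explicit here as a rank.
# The numeric ranks are an assumption chosen for this sketch.
SCALE = ["extremely dislike", "dislike", "neutral", "like", "extremely like"]
rank = {label: position for position, label in enumerate(SCALE, start=1)}

responses = ["like", "neutral", "extremely like", "dislike", "like"]
sorted_responses = sorted(responses, key=rank.__getitem__)

# Ordered data supports order-based summaries such as the median.
median_response = sorted_responses[len(sorted_responses) // 2]

print(brand_counts.most_common(1))
print(sorted_responses)
print(median_response)
```

Brands can be counted but not meaningfully sorted from low to high, whereas satisfaction responses can be ranked, which is what allows a middle (median) response to be identified.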
Data is a collection of facts, statistics, measurements, and the like that are recorded (or should be recorded) using standardized methods. It is the smallest or rawest form of information and, as such, requires analysis and interpretation. A variety of means are used to collect data, including questionnaires, interviews, document analysis, machine measurements, and web scraping.
The terms "data" and "statistics" are often used interchangeably; in scholarly research, however, there is an important distinction between them. Data are individual pieces of factual information recorded and used for the purpose of analysis. They are the raw information from which statistics are created. Statistics are the results of data analysis, meaning its interpretation and presentation.
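The distinction can be illustrated with a few lines of Python using the standard statistics module. The household sizes below are hypothetical raw data; the summary values are the statistics derived from them.

```python
import statistics

# Raw data: hypothetical household sizes, one value per surveyed household.
household_sizes = [1, 2, 2, 3, 4, 4, 4, 6]

# Statistics: summaries derived from the data through analysis.
summary = {
    "count": len(household_sizes),
    "mean": statistics.mean(household_sizes),
    "median": statistics.median(household_sizes),
    "mode": statistics.mode(household_sizes),
}
print(summary)
```

The list is the data; the count, mean, median, and mode are statistics, and none of them can be used to reconstruct the original list.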
The following questions would benefit from a data-oriented analysis and DS methods such as data visualization.
Where in the texts and how often do children speak in Virginia Woolf's novels?
What does Rodolfo Gonzales' correspondence reveal about his political networks?
How closely does the rate of heart disease in adults correlate with economic class, race, gender, and area type (i.e., urban, suburban, or rural)?
How do the rates of African American population increase in Philadelphia and Los Angeles between 1916 and 1940 correlate with changes in housing laws and redlining practices in both cities?