
Introduction to Data

Introduction to Data explains fundamental concepts that inform data-related research, the use of data manipulation tools, and data project creation.

¶ What is Data?

Data is a collection of facts, statistics, measurements, and the like that are recorded (or should be recorded) using standardized methods. It is the smallest or rawest form of information and, as such, requires analysis and interpretation. Data is collected by a variety of means, including questionnaires, interviews, document analysis, machine measurements, and web scraping.

The terms "data" and "statistics" are often used interchangeably; in scholarly research, however, there is an important distinction between them. Data are individual pieces of factual information recorded and used for the purpose of analysis: the raw information from which statistics are created. Statistics are the results of data analysis, meaning its interpretation and presentation.

What Do Data Research Questions Look Like?

The following questions would benefit from data-oriented analysis and DS methods, e.g., data visualization.

Where in the texts and how often do children speak in Virginia Woolf's novels?
What does Rodolfo Gonzales' correspondence reveal about his political networks?
How closely does the rate of heart disease in adults correlate with economic class, race, gender, and area type (i.e., urban, suburban, or rural)?

Structured & Unstructured Data

Data can be in three different forms: unstructured, semi-structured, and structured.

Unstructured Data

Unstructured data is, essentially, a bucket of content or data points that are not organized or categorized. A folder full of images or digitized texts is a form of unstructured data. (In both cases, however, steps can be taken to structure them.)
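As a rough illustration of what "taking steps to structure" unstructured content might look like, the sketch below turns a few loose texts into rows with named columns and writes them out as CSV. The file names, the text snippets, and the `structure_texts` helper are all invented for this example; a real project would choose columns that fit its research question.

```python
import csv
import io

# Hypothetical unstructured content: invented snippets standing in for
# a folder of digitized texts.
texts = {
    "letter_01.txt": "Dear friend, the harvest was good this year.",
    "letter_02.txt": "The ship arrived late. We unloaded at dawn.",
}


def structure_texts(documents):
    """Describe each text with the same set of named fields (one row per text)."""
    rows = []
    for filename, content in sorted(documents.items()):
        rows.append(
            {
                "filename": filename,
                "word_count": len(content.split()),
                "char_count": len(content),
            }
        )
    return rows


rows = structure_texts(texts)

# Write the rows as CSV, a plain-text tabular format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["filename", "word_count", "char_count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Once every text is described by the same fields, the collection can be sorted, filtered, and compared in ways the original folder could not.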

Quantitative & Qualitative Data

There are two types of data: quantitative and qualitative. Generally speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. There are also different types of quantitative and qualitative data. (Also see the Qualitative vs Quantitative Data article.)

Qualitative Data

Qualitative data characterizes objects or observations and is collected in a non-numerical, non-binary form, such as language. Qualitative data can include:

Text
Audio and video recordings
Experiment notes, lab reports
Interview transcripts

Two types of qualitative data are categorical, meaning data that can be organized into groups, and ordinal, meaning qualitative data that follows a natural order.

Quantitative Data

Quantitative data, as the name suggests, relates to the quantity of something and is typically expressed in numbers. Quantitative data can include:

Survey data, including longitudinal and cross-sectional studies
Count frequency
Calculations, such as monthly gross margin

Two types of quantitative data are continuous, meaning numbers that can be divided and made more precise, e.g., a magnitude 4.3 earthquake, and discrete, meaning numbers that cannot be divided, e.g., the number of people in a household cannot include a fraction such as 3.5.
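The four types described above (categorical, ordinal, discrete, and continuous) each support different operations. The following is a minimal sketch using a few invented survey records; the field names and values are hypothetical and only meant to show which kind of calculation suits which kind of data.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records, invented purely for illustration.
records = [
    {"area": "urban", "satisfaction": "high", "household_size": 3, "commute_km": 4.3},
    {"area": "rural", "satisfaction": "low", "household_size": 5, "commute_km": 18.75},
    {"area": "urban", "satisfaction": "medium", "household_size": 2, "commute_km": 7.1},
]

# Categorical (qualitative): groups with no inherent order, so we count them.
area_counts = Counter(r["area"] for r in records)

# Ordinal (qualitative): categories with a natural order, so we can rank them.
SATISFACTION_ORDER = ["low", "medium", "high"]
ranked = sorted(records, key=lambda r: SATISFACTION_ORDER.index(r["satisfaction"]))

# Discrete (quantitative): whole-number counts that cannot be divided.
total_people = sum(r["household_size"] for r in records)

# Continuous (quantitative): measurements that can take fractional values.
average_commute = mean(r["commute_km"] for r in records)

print(area_counts)
print([r["satisfaction"] for r in ranked])
print(total_people, round(average_commute, 2))
```

Note that averaging makes sense for the continuous commute distances but not for the categorical area labels, which is one practical reason to identify data types before choosing an analysis.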

Humanities & Data

While data-oriented scholarship is perhaps more often associated with the sciences and social sciences, it has as much purpose and relevance in the humanities.

Data visualization can be used to illustrate social networks, how information spreads over time and place, historical, literary, and intellectual trends, and much more. The Belfast Group Poetry visualizes literary networks and Geography of the Post visualizes the spread of the US Postal Service in the nineteenth century.

Database creation also makes up a considerable amount of humanities data-related scholarship. Such databases often incorporate primary sources and facilitate the asking and answering of research questions. They Came on Waves of Ink is a database created from a nineteenth-century Puget Sound Customs District ledger. Enslaved.org, a highly collaborative and grant-funded project, is a database created from slavery-related records provided by different archives and from datasets from existing projects like Voyages: The Trans-Atlantic Slave Trade Database.

¶ What is Data Visualization?

Data visualization refers to representing data in a visual context, like a chart or a map, to help people understand the significance of that data. Visualization is a frequent final output of research. Putting some time and strategic thought into data visualization at the beginning of a research project can help you create more effective visualizations. (For more on data visualization, see the "Data Visualization" section in DS Methodologies Overview.)

Three Types of Data Visualizations

Data visualization is usually one of three types:

Scientific visualization, meaning the representation of scientific phenomena that tend to be tied to real-world objects with spatial properties, e.g., modeling airflow over an airplane.
Information visualization, which covers most statistical charts and graphs as well as other visual and spatial representations.
Infographic, meaning a specific sort of visualization that combines information visualization with narrative.

In the video below, David McCandless talks about how we can use visualizations to make data more meaningful. He explains how he turns complex data sets (like worldwide military spending, media buzz, and Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections.

¶ DS Data Projects

What is a DS Data Project?

Digital scholarship data projects usually involve data visualization and/or the creation of databases for the purposes of making data more manageable, navigable, and intelligible. Depending on the tools and methods used, different types of visualizations can be achieved and queries run for asking and answering scholarly questions. (See examples.)

What Do DS Data Projects Involve?

Among other tasks, data projects require planning, the acquisition of existing data or the collection of new data, data cleaning and structuring, and, of course, analysis.

Where can I get DS Data Project Support at BC?

BC Libraries facilitates, supports, and consults on data acquisition, management, curation, and visualization, and designs and provides data-related in-class instruction. BC also provides support for platforms like ArcGIS.

Getting Started Questions

The following questions are helpful to consider when beginning a data project.

Data Acquisition

Are you looking for data & statistics with a time period or geography focus?
Are you looking for a specific data type? e.g., quantitative, qualitative, GIS, multimedia
Are you collecting your own data for your research?
Have you started searching for data sources?
Do you need support on data management (DMP), preservation, or sharing?

Data Manipulation

What data format are you using? e.g., Excel, Stata, SPSS
What data tasks do you need to conduct?
- Data cleaning: the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
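To make the definition of data cleaning above more concrete, here is a minimal sketch that trims improperly formatted values, drops incomplete or incorrect rows, and removes duplicates. The records, the column names, and the `clean_rows` helper are hypothetical; real cleaning rules depend on the dataset and the research question.

```python
def clean_rows(rows):
    """Return rows with formatting fixed, bad rows dropped, and duplicates removed."""
    seen = set()
    cleaned = []
    for row in rows:
        # Improperly formatted: trim whitespace and normalize capitalization.
        name = (row.get("name") or "").strip().title()
        age_text = (row.get("age") or "").strip()

        # Incomplete or incorrect: skip rows without a name or a whole-number age.
        if not name or not age_text.isdigit():
            continue

        # Duplicated: keep only the first occurrence of each record.
        record = (name, int(age_text))
        if record in seen:
            continue
        seen.add(record)
        cleaned.append({"name": name, "age": record[1]})
    return cleaned


# Hypothetical raw records, invented for illustration.
raw_rows = [
    {"name": "  maria lopez ", "age": "36"},
    {"name": "Maria Lopez", "age": " 36"},  # duplicate once normalized
    {"name": "Sam Okoro", "age": ""},       # incomplete
    {"name": "Lee Park", "age": "n/a"},     # incorrect value
    {"name": "Wei Chen", "age": "52"},
]

cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```

Dropping rows is only one possible choice; depending on the project, incomplete values might instead be flagged or corrected from another source, and that decision is worth recording in the project documentation.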

Data Visualization

What format does your data come in? e.g., Excel, text file, JSON, PDF, spatial
Do you have any preferred visualization tool you want to use?
Do you need help with choosing the visualization tools?

Project Examples

The following are examples of data-related projects. Note that more complex projects, especially ones with custom platforms, are grant or institutionally funded, which enables scholars to create more robust public-facing works.

Humanities

Divisions of the Bible - a data visualization project that uses the tool
- a network visualization project that uses the tool .
- a mapping project that uses the tools , , and
Also see, for more projects

Social Sciences

- a visualization project that shows the historical immigration to the U.S. (1830-2015)
- A visualization project that displays fertility rate, life expectancy, and population of countries in six world regions using
- brings technologists, government, and communities to rapidly prototype digital products—powered by federal open data—that solve real-world problems for communities across the country.

Sciences

- an international, collaborative research program whose goal was the complete mapping and understanding of all the genes of human beings
- a scientific collaboration of international physics institutes and research groups dedicated to the search for gravitational waves
- a BC science project that shows how much the ground moves in Weston, Massachusetts

Visualization Tools

There are a variety of data visualization tools available, many of them open source, to help you explore existing data visualizations or create your own. Below are a few examples.

Excel

Excel is a powerful tool for getting meaning out of vast amounts of data and offers a library of chart and graph types to help users visualize their spreadsheet data.

Tableau

Tableau is a data visualization and analytics platform that enables users to connect to a variety of data sources and explore the data in a simplified way. The drag-and-drop interface makes it very easy to visualize data and create interactive dashboards without any programming skills. (Browse the Tableau to see examples of visuals and dashboards.)

Palladio

Palladio is a web-based data visualization tool for analyzing relationships across time and visualizing historical or cultural networks.

Gephi

Gephi is free software for visualizing networks. The main website hosts official tutorials and also links to popular community-developed tutorials.

D3.JS

D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It offers great power and flexibility and is ideal for people who want to develop some JavaScript programming skills.

¶ Research Data Lifecycle

Research data has a "life cycle" that describes and identifies the steps to be taken at the different stages of the research cycle to ensure successful data curation and preservation. The research data lifecycle can be divided into two main parts, Active Research Stage and Post-Active Research Stage.

During the active research stage, research activities mainly include data planning, acquisition, and analysis, while during the post-active research stage, the focus is on long-term data preservation, sharing, and re-use. (Also see Data Management Best Practices.)

Active Research Stage

Planning - The stage at which it is determined how data will be managed. Typical considerations include:

The type and format of data that will be used.
Whether any collected data will involve human subjects.
Where the data will be stored and whether it will be re-used or shared at the end of the project.

Acquire (or "Find") - The stage at which data is found or collected. There are a few steps that can help you develop your approach:

Define your topic as specifically as possible. For example:
- What is the average SAT score by race for the last 10 years?
Identify the unit of analysis, meaning what you will specifically be analyzing and by what measure. For example:

Collaborate and Analyze - The stage of your (and your collaborators') active use of the research data.

What data processing tool(s) are you using? e.g., Excel, Stata, SPSS, Python, R
What kind of data are you working on? e.g., numerical, categorical, text
What kind of data tasks are you performing? e.g., data cleaning, descriptive statistics.

Post-Active Research Stage

Store and Preserve - The planning stage for how the data will be archived for long-term preservation. Considerations include:

What archive/repository/database have you identified as a place to deposit data? e.g., Dataverse
How long will data be kept beyond the life of the project?
What metadata schema will you use? Established domain-specific repositories will usually only accept data that meet their standards for file formats, documentation and metadata, e.g.,

Share (or Publish) - The stage in which data is shared (or re-used) after a project. Some considerations include:

Through what resources/platforms the data will be made available, e.g., a server or data repository
When the data will be made available, e.g., immediately or after a 12-month embargo
If the dataset was collected by the researchers, how it will be licensed to others, e.g., a Creative Commons license

Discovery and Re-Use - The stage of facilitating data sharing, which refers to publicly sharing data from completed research (or completed parts of it) and making the data reusable outside your project or research team. Some considerations include:

Whether any permission restrictions need to be placed on the data, e.g., non-commercial use
What the intended or foreseeable uses of the data are and who the users will be

The following video explains the data management activities that can take place at different stages of the research process.

Data Management Best Practices

The following are some best practices that should be considered prior to starting a data project and provide guidance for managing data in the Research Data Lifecycle's post-active research stage.

Data Storage

To prevent data from being lost to incompatibility, store it in formats and on hardware that follow open standards rather than proprietary ones.

Data Documentation

In your documentation, record details about the data collection process (e.g., a study) such as:

its context
the dates of data collections
data collection methods, etc.
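One possible way to keep such documentation alongside the data is a small plain-text record saved with the dataset. The sketch below uses JSON; the field names and study details are invented for illustration and do not follow any particular metadata standard.

```python
import json

# Hypothetical study details, invented for illustration only.
documentation = {
    "title": "Example household survey",
    "context": "Pilot study for a class project",
    "collection_dates": {"start": "2024-01-15", "end": "2024-02-15"},
    "collection_methods": ["online questionnaire", "follow-up interviews"],
    "file_formats": ["csv"],
}

# JSON is an open, plain-text format, so the record can be read
# without specialized software.
documentation_json = json.dumps(documentation, indent=2, sort_keys=True)
print(documentation_json)

# Reading the text back recovers the same details.
restored = json.loads(documentation_json)
```

If you plan to deposit the data in a repository, check its requirements first, since established repositories often expect documentation in a specific schema.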

Sharing

Sharing data makes it possible for researchers to validate research results and to reuse data for teaching and further research. Sharing is also required by an increasing number of funders and publishers. Funders seek to maximize the impact of the research they fund by encouraging or requiring data sharing.

Depositing to an established repository will help to ensure that data are consistently available, accessible, and preserved for future use. The choice of a data repository depends on various factors, such as discipline, accepted data formats, and data sharing policies. Assistance is available to help you identify a repository in which to publish your research data.

¶ Glossary

Attributes are the descriptive characteristics or properties that define all items in a certain category; in a table, an attribute applies to all cells of a column.

Data is a collection of facts, statistics, measurements, and the like that are recorded (or should be recorded) using standardized methods.

Data collection is a systematic process of gathering observations or measurements.

Data visualization is a graphical representation of data.

Metadata is often simply defined as "data about data" or "information about information".
