Data visualization refers to representing data in a visual context, like a chart or a map, to help people understand the significance of that data. Visualization is a frequent final output of research. Putting some time and strategic thought into data visualization at the beginning of a research project can help you create more effective visualizations. (For more on data visualization, see the "Data Visualization" section in DS Methodologies Overview.)
Data visualization is usually one of three types:
Scientific visualization, meaning the representation of scientific phenomena that tend to be tied to real-world objects with spatial properties, e.g., modeling airflow over an airplane.
Information visualization, which covers most statistical charts and graphs as well as other visual and spatial representations.
Infographic, meaning a specific sort of visualization that combines information visualization with narrative.
In the video below, David McCandless talks about how we can use visualizations to make data more meaningful. He explains how he turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections.
Digital scholarship data projects usually involve data visualization and/or the creation of databases for the purposes of making data more manageable, navigable, and intelligible. Depending on the tools and methods used, different types of visualizations can be achieved and queries run for asking and answering scholarly questions. (See examples.)
Among other tasks, data projects require planning, the acquisition of existing data or the collection of new data, data cleaning and structuring, and, of course, analysis. (Also see Research Data Lifecycle.)
BC Libraries' Data Services facilitates, supports, and consults on data acquisition, management, curation, and visualization, and designs and provides data-related in-class instruction and workshops. BC's Research Services also provides support as well as licenses for platforms like ArcGIS.
While data-oriented scholarship is perhaps more often associated with the sciences and social sciences, it has as much purpose and relevance in the humanities.
Data visualization can be used to illustrate social networks, how information spreads over time and place, historical, literary, and intellectual trends, and much more. For example, one project visualizes literary networks, and another visualizes the spread of the US Postal Service in the nineteenth century.
Database creation also makes up a considerable amount of humanities data-related scholarship. Such databases often incorporate primary sources and facilitate the asking and answering of research questions. One example is a database created from a nineteenth-century Puget Sound Customs District ledger; another, a highly collaborative and grant-funded project, is a database created from slavery-related records provided by different archives and datasets from existing projects.
Data can be in three different forms: unstructured, semi-structured, and structured.
Unstructured data is, essentially, a bucket of content or data points that are not organized and categorized. A folder full of images or digitized texts is a form of unstructured data. (In both cases, steps can be taken to structure them, however.)
Text files: such as Word documents, PDFs, TXT files
Multimedia content: image files, such as TIFF and JPEG; audio/video files, such as MP3 and MP4
Qualitative data: such as survey responses, interview transcripts
Semi-structured data lies midway between structured and unstructured data. It doesn't have a specific relational or tabular data model but includes tags and semantic markers that separate data into records and fields in a dataset. Common examples of semi-structured data are JSON and XML.
The following is an example of semi-structured data using JSON. The data describes an author's work.
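A minimal sketch of such a record, parsed with Python's standard `json` module, might look like this. The field names (`author`, `works`, `title`, `year`, `genres`) are illustrative assumptions, not a required schema.

```python
import json

# Illustrative JSON record; the field names are assumptions, not a standard schema.
raw = """
{
  "author": {"first_name": "Virginia", "last_name": "Woolf"},
  "works": [
    {"title": "Mrs Dalloway", "year": 1925, "genres": ["novel"]},
    {"title": "To the Lighthouse", "year": 1927}
  ]
}
"""

record = json.loads(raw)

# Keys act as the "tags" that separate the data into fields.
titles = [work["title"] for work in record["works"]]

# Semi-structured records need not share the same fields,
# so optional fields are read with a default value.
genres = [work.get("genres", []) for work in record["works"]]

print(titles)
print(genres)
```

Note that the second work has no `genres` field. A tabular model would need an empty cell there, while a semi-structured record can simply omit it.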
Structured Data is data that is organized and categorized so that it can be more effectively analyzed, in particular by tools like databases and data visualization applications.
Understanding a little about structured data provides a lot of insight into how data works in various data tools. Data is structured in a tabular form (spreadsheets) or tables created using coding and markup languages. For the sake of simplicity, we will look at structured data through the lens of tabular data.
Tabular data, what we think of as spreadsheets, is structured data organized in rows and columns. Each row represents a record (or unit of analysis) and each column represents a different attribute (also referred to as a variable or field).
An attribute describes everything that falls within it or, in this case, underneath it. Think of it like tagging. Everything in a column is tagged by the attribute. Each horizontal line is a row, and a single row makes up what is called a record, meaning a series of data points that go together.
To put tabular data or a spreadsheet into a more relatable context, here is an imaginary DMV database spreadsheet.
Notice that each data point falls under the appropriate attribute and each row represents a single driver's license (a record). Also notice that none of the driver's license numbers repeat. These are unique identifiers that help distinguish records from one another when information is the same or very similar. Moreover, the unique identifier is a data point by which the record can be searched.
As shown in the example, structured data is highly organized and easily processed by machines. Those working within relational databases can input, search, and manipulate structured data relatively quickly using a relational database management system (RDBMS).
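As a rough sketch of these ideas, the snippet below reads a small tabular dataset with Python's standard `csv` module, checks that the identifier column never repeats, and looks up one record by its identifier. The column names and all values are hypothetical and invented for illustration.

```python
import csv
import io

# Hypothetical DMV records; every value here is invented for illustration.
table = """license_number,last_name,first_name,state,expiration_year
D1000001,Smith,Jordan,MA,2027
D1000002,Smith,Jordan,MA,2026
D1000003,Lee,Avery,MA,2027
"""

# Each row becomes one record; each column header names an attribute.
records = list(csv.DictReader(io.StringIO(table)))

# The unique identifier must not repeat, even when other attributes match.
ids = [r["license_number"] for r in records]
ids_are_unique = len(ids) == len(set(ids))

# Index the records by identifier so a single record can be looked up directly.
by_license = {r["license_number"]: r for r in records}
match = by_license["D1000002"]

print(ids_are_unique)
print(match["expiration_year"])
```

The first two records share the same name and state, so only the license number tells them apart. This is the role a unique identifier plays in a database.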
Introduction to Data explains fundamental concepts that inform data-related research, the use of data manipulation tools, and data project creation.
The following questions are helpful to consider when beginning a data project.
Are you looking for data & statistics with a time period or geography focus?
Are you looking for a specific data type? e.g., quantitative, qualitative, GIS, multimedia
Are you collecting your own data for your research?
Have you started searching for data sources?
Do you need support on data management (DMP), preservation, or sharing?
What data format are you using? e.g., Excel, Stata, SPSS
What data tasks do you need to conduct?
Data cleaning: the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
Data merging: the process of combining two or more data sets into a single data set.
Aggregation (Summarization): the process of gathering data and presenting it in a summarized format.
What format does your data come in? e.g., Excel, text file, JSON, PDF, spatial
Do you have any preferred visualization tool you want to use?
Do you need help with choosing the visualization tools?
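The three data tasks named above (cleaning, merging, and aggregation) can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the survey rows, the `region_names` lookup table, and the cleaning rules are all hypothetical, and real projects often use tools such as Excel, R, or a Python data library instead.

```python
from collections import defaultdict

# Hypothetical survey rows; names and values are invented for illustration.
responses = [
    {"id": "1", "region": " north ", "score": "4"},
    {"id": "2", "region": "South", "score": ""},      # incomplete: missing score
    {"id": "1", "region": " north ", "score": "4"},   # duplicate of the first row
    {"id": "3", "region": "south", "score": "2"},
]
region_names = {"north": "Northern District", "south": "Southern District"}

# Cleaning: normalize formatting, drop incomplete rows, and drop duplicate ids.
cleaned, seen = [], set()
for row in responses:
    if not row["score"] or row["id"] in seen:
        continue
    seen.add(row["id"])
    cleaned.append({"id": row["id"],
                    "region": row["region"].strip().lower(),
                    "score": int(row["score"])})

# Merging: combine each cleaned row with the lookup table on the shared region key.
merged = [{**row, "region_name": region_names[row["region"]]} for row in cleaned]

# Aggregation: summarize the scores as an average per region.
scores = defaultdict(list)
for row in merged:
    scores[row["region_name"]].append(row["score"])
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

print(averages)
```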
There are two types of data, quantitative and qualitative. Generally speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. There are also different types of quantitative and qualitative data. (Also see, Qualitative vs Quantitative Data article.)
Qualitative data is used to characterize objects or observations, which can be collected in a non-numerical and non-binary way, such as languages. Qualitative data can include:
Text
Audio and video recordings
Experiment notes, lab reports
Interview transcripts
Two types of qualitative data include categorical, meaning data that can be organized in groups, and ordinal, meaning qualitative data that follows a natural order.
Quantitative data, as the name suggests, relates to the quantity of something, and typical examples of quantitative data are numbers. Quantitative data can include:
Survey data, including longitudinal and cross-sectional studies
Count frequency
Calculations such as calculating monthly gross margin
Quantification: converting descriptive data to numbers, such as a satisfaction rating from 1-4
Two types of quantitative data include continuous, meaning numbers that can be made more precise and divided, e.g., a 4.3 earthquake, and discrete, meaning numbers that cannot be divided, e.g., the number of people in a household cannot include a fraction such as 3.5.
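The four data types described above tend to be handled differently in practice. The following sketch shows one possible way to represent each in Python; the sample values are illustrative and the variable names are assumptions.

```python
from collections import Counter

# Categorical data: groups with no inherent order, so counting is the natural summary.
brands = ["Coke", "Pepsi", "Coke", "Dr. Pepper"]
brand_counts = Counter(brands)

# Ordinal data: categories with a natural order, made explicit here as a rank.
medal_rank = {"gold": 1, "silver": 2, "bronze": 3}
medals = ["bronze", "gold", "silver"]
medals_sorted = sorted(medals, key=medal_rank.get)

# Discrete data: whole-number counts, stored as an int.
household_size = 3

# Continuous data: measurements that can be made more precise, stored as a float.
earthquake_magnitude = 4.3

print(brand_counts["Coke"], medals_sorted)
```

Sorting the brands alphabetically would carry no meaning, while sorting the medals by rank does. That difference is what separates categorical from ordinal data.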
The following are examples of data-related projects. Note that more complex projects, especially ones with custom platforms, are grant or institutionally funded, which enables scholars to create more robust public-facing works.
- a data visualization project that uses the tool
- a network visualization project that uses the tool .
- a mapping project that uses the tools , , and
Also see, for more projects
- a visualization project that shows historical immigration to the U.S. (1830-2015)
- A visualization project that displays fertility rate, life expectancy, and population of countries in six world regions using
- brings technologists, government, and communities to rapidly prototype digital products—powered by federal open data—that solve real-world problems for communities across the country.
- uses Facebook data to provide insights on topics including social connections, relative wealth, COVID-19 impact, climate change, etc.
- an international, collaborative research program whose goal was the complete mapping and understanding of all the genes of human beings
- a scientific collaboration of international physics institutes and research groups dedicated to the search for gravitational waves
- a BC science project that shows how much the ground moves in Weston, Massachusetts
- a science project for showcasing astronomical data and knowledge
Attributes are the characteristics or properties that describe all items in a certain category. In a table, an attribute applies to all cells of a column.
Data is a collection of facts, statistics, measurements, and the like that are recorded (or should be recorded) using standardized methods.
Data collection is a systematic process of gathering observations or measurements.
Data Visualization is a graphical representation of data.
Metadata is often simply defined as "data about data" or "information about information".
Data points are single units of data or single observations, e.g., a single measurement or a single geolocation point.
A Database is a systematic collection of data.
Dataset (or data set) is a collection of data. Typically, it is structured and housed in a tabular form (e.g., a spreadsheet).
The data life cycle represents all of the stages of data throughout its life from its creation for a study to its distribution and reuse. The data lifecycle begins with a researcher(s) developing a concept for a study; once a study concept is developed, data is then collected for that study.
Data Literacy is the ability to read, understand, create, and communicate data as information.
Geospatial data is defined in the series of standards as data and information having an implicit or explicit association with a location relative to Earth.
Quantitative data relates to the quantity of something, and typical examples of quantitative data are numbers.
Qualitative data is used to characterize objects or observations, which can be collected in a non-numerical and non-binary way, such as languages.
Structured data refers to data that resides in a fixed field within a file or record, e.g., a spreadsheet.
Unstructured data refers to a bucket of content or data points that are not organized and categorized, e.g., PDF files, image files.
The following are some best practices that should be considered prior to starting a data project and provide guidance for managing data in the post-active research stage.
To prevent data from being lost to incompatibility, store it in formats and on hardware that are open standard, not proprietary.
In your documentation, use to record details about the data collection process (e.g., a study) such as:
its context
the dates of data collections
data collection methods, etc.
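One possible way to capture these details is a small plain-text file stored alongside the data. The sketch below writes such a record as JSON; the field names and every value are hypothetical placeholders, and an established metadata standard for your discipline would normally take precedence over an ad hoc layout like this one.

```python
import json

# Hypothetical study details; every value is a placeholder for illustration.
documentation = {
    "title": "Example household survey",
    "context": "Collected for a course project on commuting habits",
    "collection_dates": {"start": "2024-01-15", "end": "2024-03-01"},
    "collection_methods": ["online questionnaire", "follow-up interviews"],
}

# Serialize to plain-text JSON, an open format, so it can be stored next to the data files.
readme_text = json.dumps(documentation, indent=2)
print(readme_text)
```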
Sharing data makes it possible for researchers to validate research results and to reuse data for teaching and further research. Sharing is also required by an increasing number of funders and publishers. Funders seek to maximize the impact of the research they fund by encouraging or requiring data sharing.
Research data has a "life cycle" that describes and identifies the steps to be taken at the different stages of the research cycle to ensure successful data curation and preservation. The research data lifecycle can be divided into two main parts, Active Research Stage and Post-Active Research Stage.
During the active research stage, research activities mainly include data planning, acquiring, and analysis; while during the post-active research stage, the focus is on long-term data preservation, sharing, and re-use (Also see, ).
Planning - The stage in which it is determined how data will be managed. Typical considerations include:
The type and format of data that will be used.
Whether any collected data will involve human subjects.
Where the data will be stored and whether it will be re-used or shared at the end of the project.
Acquire (or "Find") - The stage in which data is found or collected. There are a few steps that can help you develop your approach:
Define your topic as specifically as possible. For example:
What is the average SAT score by race for the last 10 years?
Identify the unit of analysis, meaning what you will specifically be analyzing and by what measure. For example:
Geographic unit, e.g., local, national, international
Frequency, e.g., annual, quarterly, daily
Unit of analysis, e.g., individual, institution
Time series, e.g., cross-sectional, longitudinal (or panel)
Identify data sources. For example:
Government agencies, e.g., census
Organization, e.g., International Monetary Fund (IMF)
Commercial Subscription Services, e.g., Inter-University Consortium for Political and Social Research (ICPSR), Statista
Collaborate and Analyze - The stage of your (and your collaborators') active use of the research data.
What data processing tool(s) are you using? e.g., Excel, Stata, SPSS, Python, R
What kind of data are you working on? e.g., numerical, categorical, text
What kind of data tasks are you performing? e.g., data cleaning, descriptive statistics.
Are you working in a team and is there a designated project manager?
Are you looking for a web-based tool for working on your data?
Store and Preserve - The planning stage for how the data will be archived for long-term preservation. Considerations include:
What archive/repository/database have you identified as a place to deposit data? e.g., Dataverse
How long will data be kept beyond the life of the project?
Share (or Publish) - The stage in which data is shared (or re-used) after a project. Some considerations include:
Through what resources/platforms the data will be made available, e.g., a server or data repository
When the data will be made available, e.g., immediately or after a 12-month embargo
If the dataset was collected by the researchers, how it will be licensed to others, e.g., a Creative Commons license
Discovery and Re-Use - The stage that involves facilitating data sharing, which refers to publicly sharing data from completed research (or completed parts of it) and making data reusable outside your project or research team.
Whether any permission restrictions need to be placed on the data, e.g., non-commercial use
What are the intended or foreseeable uses of the data and who are the users
The following video explains the data management activities that can take place at different stages of the research process.
Depositing to an established repository will help to ensure that data are consistently available and accessible, and preserved for future use. The choice of a data repository depends on various factors, such as discipline, accepted data formats, and data sharing policies. You can obtain assistance from to identify a repository to publish your research data.
What metadata schema will you use? Established domain-specific repositories will usually only accept data that meet their standards for file formats, documentation and metadata, e.g.,
Categorical | Ordinal |
States (e.g., New York, Massachusetts, Arizona) | Economic class (e.g., lower class, middle class, higher class) |
People names (e.g., Matt, Emily, Maria) | Satisfaction scale (e.g., extremely dislike, dislike, neutral, like, extremely like) |
Brands (e.g., Coke, Pepsi, Dr. Pepper) | Sports medals (e.g., gold, silver, bronze) |
Type of Data | Recommended Formats | Acceptable Formats |
Plain Text | txt, pdf/A, xml | docx, doc, rtf |
Tabular Text | csv, tsv | xlsx, xls, sav, dta |
Image | tiff, JPEG 2000 | jpg, psd, png, gif, bmp |
Audio | wave, aiff | mp3, wma, aac, ogg |
Archiving | zip | rar |
Video | Motion JPEG 2000, mov, avi | mpeg-4 |
There are a variety of data visualization tools available, many of them open source, to help you explore existing data visualizations or to create your own. Below are a few examples.
Excel is a powerful tool for getting meaning out of vast amounts of data and offers a library of chart and graph types to help users visualize their spreadsheet data.
Tableau is a data visualization and analytics platform that enables users to connect to a variety of data sources and explore the data in a simplified way. The drag and drop interface makes it very easy to visualize and create interactive dashboards without any programming skills. (Browse the Tableau public gallery to see examples of visuals and dashboards.)
Palladio is a web-based data visualization tool for analyzing relationships across time and visualizing historical or cultural networks.
Gephi is free software for visualizing networks. The main website hosts official tutorials and also links to popular community-developed tutorials.
D3.js is a JavaScript library for producing dynamic, interactive data visualizations in web browsers. It offers great power and flexibility and is ideal for people who want to develop some JavaScript programming skills.
Data is a collection of facts, statistics, measurements, and the like that are recorded (or should be recorded) using standardized methods. It is the smallest or rawest form of information and, as such, requires analysis and interpretation. A variety of means are used to collect data, some of which include questionnaires, interviews, document analysis, machine measurements, and web scraping.
The terms "data" and "statistics" are often used interchangeably; however, in scholarly research, there is an important distinction between them. Data are individual pieces of factual information recorded and used for the purpose of analysis. They are the raw information from which statistics are created. Statistics are the results of data analysis, meaning its interpretation and presentation.
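The distinction can be illustrated with a short sketch using Python's standard `statistics` module. The list of values is hypothetical raw data, and the summary computed from it is the statistics.

```python
import statistics

# Hypothetical raw data: individual recorded observations (e.g., test scores).
data = [72, 85, 90, 66, 85, 78]

# Statistics: summaries produced by analyzing the data.
summary = {
    "count": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}
print(summary)
```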
The following are questions that would benefit from a data-oriented analysis and DS methods, e.g., data visualization.
Where in the texts and how often do children speak in Virginia Woolf's novels?
What does Rodolfo Gonzales' correspondence reveal about his political networks?
How closely does the rate of heart disease in adults correlate with economic class, race, gender, and area type (i.e., urban, suburban, or rural)?
How do the rates of African American population increase in Philadelphia and Los Angeles between 1916-1940 correlate with changes in housing laws and redlining practices in both cities?
Subject/Discipline | Example Archive/Repository |
Ecology |
DNA Sequences |
Chemistry |
Social Sciences |