The Data Analysis Process
At the start of 2020, I took stock of my web developer career and made a hard pivot. Data analysis was a topic I had heard a lot about over the past year, so I decided to explore it. Because @treehouse tutorials reign supreme, I began this adventure by taking lessons offered by @bendeitch and @sfgergasi that covered the basics of using spreadsheets to analyze, summarize and visually present data in a meaningful way. Naturally, I queried a few search engines to learn more and stumbled upon a wealth of resources suited to my study. The information below demonstrates the knowledge gained thus far and will serve as the foundation for my journey into the world of data analytics.
What is Data Analysis?
Data Analysis is the process of aggregating data and performing conditional and statistical evaluations on it to help professionals from all walks of life make well-informed decisions and improve business and organizational efficiency. Advancements in technology have allowed this craft to flourish in ways that meaningfully impact everyday life, and it can become an ideal niche for a transitioning web developer interested in data science!
There are multiple facets to the analysis process as a whole: data is first collected, then formatted. This involves confirming the accuracy of data by checking for anomalies, duplicate values or empty cells, and ensuring columns and rows have headers. With data in clear view, outline strategic goals designed to produce a specific insight by defining questions that qualify potential solutions or results and quantify how to measure them. Much of this is simple math: spreadsheets offer functions that calculate data to confirm totals, identify relationships and generally answer initial questions.
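The formatting step above can be sketched outside a spreadsheet too. Here is a minimal Python sketch, with hypothetical values, that flags duplicate rows and empty cells before any analysis begins:

```python
# Sketch of the "format and vet" step: flag duplicate rows and empty
# cells in a small, hypothetical dataset before any analysis begins.
rows = [
    ["Widget", "12", "3.50"],
    ["Gadget", "", "4.25"],     # empty cell (missing quantity)
    ["Widget", "12", "3.50"],   # duplicate of the first row
]

seen = set()
duplicates, empties = [], []
for i, row in enumerate(rows):
    key = tuple(row)
    if key in seen:
        duplicates.append(i)    # this row repeats an earlier row
    seen.add(key)
    if any(cell == "" for cell in row):
        empties.append(i)       # this row contains an empty cell

print(duplicates)  # indexes of repeated rows
print(empties)     # indexes of rows with empty cells
```

In a spreadsheet, the same checks are done with features like Remove Duplicates and filtering for blanks.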
Probability and statistical analysis do much to further evaluate data (e.g. A/B testing, correlation, distribution), with the most common statistical methods involving mean, standard deviation, linear regression, variance, percentile and hypothesis testing. In business, Key Performance Indicators (KPIs) are used to evaluate business activity performance: it should be stressed that formulating a list of questions based on KPIs prior to analyzing data is a best practice! Other relevant analyses include cohort, ethnographic, narrative, univariate, bivariate, nominal, ordinal, hermeneutic, discourse and phenomenological.
Once data has been thoroughly vetted and prepped for presentation, it should be charted to help users visually interpret the relationships between targeted variables. Chart-type selection should be taken seriously, as graphic choice plays a pivotal role in how effectively a story is told. Thus, storytellers must strongly consider the analytical purpose of an inquiry when outlining an approach for presenting data in a meaningful way. Data stories begin by allowing users to readily access the data, in which case users should be able to filter and sort it to make comparisons based on chosen conditions. Providing options to ascertain anomalies, correlations or the range takes those charts to the next level. These are the concepts that help create the stories you want the data to tell!
Collect and Organize Data
Data is obtained from spreadsheets that can be modified using programs like Microsoft Excel. Data files obtained from external sources typically use the .csv file extension for downloads, but data saved in Excel should use the .xlsx extension (be aware of the .xlsm extension, which is used solely for macro-enabled workbooks). In this regard, use a date format with a version number when saving work so files remain in chronological order.
While most analysts receive data directly from their employer, researchers must depend on external sources for data collection. Regardless of how the data was procured, it is considered either structured or unstructured. Whereas structured data is organized under a defined data model and resides in a relational database, unstructured data is not. Here are some open-source datasets that can help the most novice of intellects brainstorm first-class solutions for mankind's most complex issues.
The process of using spreadsheets to import data is quite simple:
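For comparison, the same import can be sketched in Python with the standard csv module. The data below is a hypothetical stand-in for a downloaded .csv file:

```python
import csv
import io

# A minimal sketch of importing .csv data; the string below stands in
# for a downloaded file (the columns and values are hypothetical).
raw = "item,quantity,price\nWidget,12,3.50\nGadget,8,4.25\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)       # the first row holds the column headers
dataset = list(reader)      # the remaining rows are the data

print(header)               # ['item', 'quantity', 'price']
print(len(dataset))         # 2 data rows
```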
Working with Data
The data stored in an individual sheet is called a dataset; the values (text or numeric) stored in rows or columns are referred to as data-series. Datasets are separated using tabs located at the bottom of the sheet, which are especially useful for testing scenarios when forecasting. It is a best practice to use tabs to separate original input data from forecasted output assumptions.
Data is labeled by its column and/or row headers, so one of the first things to do when organizing data is to freeze the pane to keep column headers visible. Transpose the headers to switch the placement of column and row labels, and convert values before starting analysis. Leave comments to collaboratively exchange information, and notes to define data within a cell, to help analysts stay abreast of file changes. Sort data using specified criteria such as quantity, the price of an item, the order of an item, etc. Filter a targeted dataset when dealing with several categories, then sort specified columns within the filter module. Keep in mind that filtered data is temporarily hidden and must be restored before continuing analysis.
There are many keyboard shortcuts that can be used to make organizing data less complicated as well.
Spreadsheets have features that make handling data a cinch. Most of the heavy lifting is conducted via formulas and functions, which are used during the summarizing phase to test a hypothesis. While a formula is a math expression made up of cell addresses and arithmetic operators, a function is a built-in formula that enables analysts to perform calculations using values based on a specified condition. Excel Easy offers high-quality examples to aid when using Microsoft Excel spreadsheets. Before diving into what these tools can do, it is useful to know how to reference data and duplicate formulas for other calculations.
Functions use all values from a single column: these values are known as a range / variable and can be applied to other functions to expound upon initial calculations. While the mean is the most common summary, the median is less affected by outliers and often produces more realistic numbers. When a variable / range is used to measure something, it becomes a metric that can be used for additional calculations. Metrics are aggregated values stored in cells; thus, how they are referenced is vital to the integrity of the data being analyzed. In their initial state, all cells are relative, identified by a column letter and row number. Individual cells can be referenced by separating them with commas; reference a cell group by separating its first and last cells with a colon. Cells can also be referenced absolutely (with dollar signs) so that the value in one cell can be used to calculate the value in another: such a reference never shifts when copied and is useful when performing multiple calculations using the same dataset. In cases where an absolute reference is implemented, that range of data can be duplicated for use in another cell to perform an entirely separate calculation. Keep an eye on absolutely referenced cells and how they are referenced, as either the column, the row or both can be locked. If only one is locked, the reference may shift when copied, resulting in poor analysis.
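The mean-versus-median point is easy to demonstrate. In this sketch, with hypothetical prices, a single outlier drags the mean far from where most values sit, while the median stays put:

```python
import statistics

# One outlier can pull the mean far from most of the data;
# the median stays closer to a "realistic" central value.
prices = [4, 5, 5, 6, 7, 95]   # hypothetical values with one outlier

print(statistics.mean(prices))    # pulled up by the 95
print(statistics.median(prices))  # unaffected by the outlier
```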
Implement a Function
The aforementioned tactics are imperative for working with formulas and functions, which are the stimulus behind summarizing data. All functions begin with an equal sign (=) followed by the function name in all caps; parentheses enclose the parameters, which define the calculation to be performed on the referenced cells.
There are a plethora of formulas and functions that can be used to modify, summarize and validate data depending on the problem at hand. Implementing formulas and functions requires a bit of practice for which I created an article to demonstrate conceptual usage.
Adding Data between Worksheets
Lookup functions provide a conditional search that gets data from a column or row in one worksheet and adds it to a column or row in another worksheet, using VLOOKUP() or HLOOKUP(), respectively. When using these functions it is important to delete empty rows or columns and check that the same data type is being used in the first column of the range.
This function takes four arguments: a lookup_value, a table_array, a col_index_num and a range_lookup value. The lookup_value is the cell of data to match: it must be in the leftmost column of the range. The table_array is the range of cells to search, wherein the headers represent columns of data to retrieve: absolutely reference the cells to use entire columns of data! The col_index_num represents the header column number for the column of data to retrieve: this argument is always a number. The range_lookup is an optional value: use FALSE to return an exact match or TRUE to return an approximate match. It is a best practice to use FALSE when working with unique identifiers, and to sort data in ascending order otherwise.
=VLOOKUP(A2, 'worksheet - dataset'!$A:$C, 3, FALSE)
In the example above, the cell A2 is the lookup value to match with data from the 'worksheet - dataset' tab, whose range is absolutely referenced as $A:$C. This matches all data between the targeted columns in both the input and output tabs, where the 3 indicates the column targeted for retrieval, returning an exact match. The HLOOKUP() function works the same way, but searches horizontally across rows instead of down columns.
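VLOOKUP()'s exact-match behavior can also be sketched in Python: the leftmost column becomes dictionary keys, and col_index_num picks which column of the matched row to return. The table and IDs here are hypothetical:

```python
# A sketch of VLOOKUP()'s exact-match behavior: the leftmost column
# is the key, and col_index_num (1-based, like Excel) selects the
# column to return. The table and IDs below are hypothetical.
table_array = [
    ["A100", "Widget", 3.50],
    ["A200", "Gadget", 4.25],
]

def vlookup(lookup_value, table, col_index_num):
    index = {row[0]: row for row in table}   # leftmost column as keys
    row = index.get(lookup_value)
    return row[col_index_num - 1] if row else "#N/A"  # Excel's no-match error

print(vlookup("A200", table_array, 3))  # price from the matched row
print(vlookup("A999", table_array, 2))  # no match found
```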
Because data is not always distributed purposefully, it is important to know how to measure the spread of data: Standard Deviation (σ) is a function that can do just that.
This function calculates the typical distance between values and the mean value. For example, a group of identical numbers would have no deviation. To calculate the standard deviation by hand:
- Subtract the Average from Each Value
- Square Each Difference, then Add for a Total Sum
- Divide the Total Sum by the # of Values
(This result is the Variance)
- Apply the Square Root to the Variance
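The by-hand steps above can be written out directly. This sketch uses a hypothetical data-series and computes the population standard deviation (dividing by the number of values):

```python
import math

# The four by-hand steps above, applied to a hypothetical data-series
# (population standard deviation: divide by n).
values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(values) / len(values)               # the average
squared = [(v - mean) ** 2 for v in values]    # subtract, then square
variance = sum(squared) / len(values)          # divide total sum by n
std_dev = math.sqrt(variance)                  # square root of variance

print(variance)  # 4.0
print(std_dev)   # 2.0
```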
When combining data of different sample sizes, the returned metrics may be skewed. Data is considered to have a Normal Distribution whenever it does not skew. One property of normal distribution is how often values occur within a given standard deviation: roughly 68% of values fall within one standard deviation of the mean, 95% within two and 99.7% within three, which is how distribution is graded.
There are two functions that can be used to ascertain how data is distributed, SKEW() and COUNTIFS():
=COUNTIFS('Range', ">"&Cell2-Cell3, 'Range', "<"&Cell2+Cell3)/Cell1
The first condition checks whether a value is greater than the mean minus one standard deviation, and the second condition checks whether it is less than the mean plus one standard deviation. The count is then divided by the total number of values to determine the distribution. In the example below, the first condition checks if the range is > the mean minus one standard deviation (&B6-B10), the second condition checks if the range is < the mean plus one standard deviation (&B6+B10), and the result is divided by the total count of values:
=COUNTIFS('2017'!S2:S50, ">"&B6-B10, '2017'!S2:S50, "<"&B6+B10)/B1
Ascertain distribution at grade-2 or -3 by absolutely referencing the range and multiplying the standard deviation by 2 or 3, respectively.
"&B6-B10×2">
=COUNTIFS('2017'!$S$2:$S$50, ">"&B6-B10*2, '2017'!$S$2:$S$50, "<"&B6+B10*2)/B1
=COUNTIFS('2017'!$S$2:$S$50, ">"&B6-B10*3, '2017'!$S$2:$S$50, "<"&B6+B10*3)/B1
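"&B6-B10×3">
The same distribution check can be sketched in Python: count how many values fall within k standard deviations of the mean and divide by the total count. The data-series here is hypothetical:

```python
import statistics

# A sketch of the COUNTIFS() distribution check: the share of values
# within k standard deviations of the mean (hypothetical data-series).
values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(values)
std = statistics.pstdev(values)   # population std dev, as in the by-hand steps

def within(k):
    hits = sum(1 for v in values if mean - k * std < v < mean + k * std)
    return hits / len(values)     # divide the count by the total count

print(within(1))  # share within one standard deviation
print(within(2))  # "grade-2": share within two standard deviations
```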
The COUNTIFS() function can also be used to calculate the number of values within a specified interval, which can be useful when charting large datasets. In the example below, A1 represents the lower bound of the interval and A2 the upper bound:
"&A1">
=COUNTIFS('2017'!$S$2:$S$50, ">"&A1, '2017'!$S$2:$S$50, "<"&A2)
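"&A1">
This per-interval counting is what a histogram does. A Python sketch, with hypothetical values and a hypothetical bin width, tallies how many values land in each interval:

```python
# A sketch of interval (bin) counting for charting: tally how many
# values land in each interval, like one COUNTIFS() per bin.
values = [3, 7, 12, 14, 18, 21, 26]   # hypothetical data-series
bin_size = 10                          # hypothetical interval width

counts = {}
for v in values:
    lower = (v // bin_size) * bin_size   # e.g. 12 falls in the 10-19 bin
    counts[lower] = counts.get(lower, 0) + 1

print(counts)  # bin lower bound -> number of values in that bin
```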
Errors occur with input, output, logic or in a formula itself. When data is missing or inputted incorrectly, and/or a formula is misspelled, uses the wrong order of operations or contains bad math, problems occur! During analysis, and even before charting data, it is imperative to be wary of flawed data that could impair statistical significance, which consists of qualitative measures that ascertain the meaningfulness of the analysis. Hypothesis testing is important to validate the results of such measurements. A good hypothesis statement should (a) include an "if" and "then", (b) contain both the independent and dependent variables, (c) be testable by experiment, survey or other scientifically sound technique, (d) be based on information from prior research (either yours or someone else's), and (e) have design criteria (for engineering or programming projects). Aside from a hypothesis test, analysts can use built-in features to check for errors as well as implement measures that prevent users from making errors.
Prevent Errors using Data Validation
Data validation is a good tool for lessening the likelihood of human input errors: an example is a company forecasting revenue, in which all store managers input data into a shared file. Validation controls how data is entered while providing users with instructions to avoid errors; any error will produce a message notifying the user why input is being denied. Here are a few commands that help:
Checking for Errors
There are also commands that assist with checking for errors that are invisible to the naked eye:
How data is collected and summarized determines the quality with which that data can be articulated. Using charts helps articulate data in a manner the audience can visually grasp.
There are many ways that data can be charted, all of which depend on the given situation or scenario and how you want the data presented.
Column and Bar charts are commonly used to compare different values among items and across multiple categories, or to show changes in data over a period of time. These charts work well to highlight differences between a limited number of data-series.
Pie charts are used to quantify the proportional value of an item to the sum total of all items. Often displayed as a percentage, this chart focuses on an individual data-series and works well to highlight market share, a snapshot allocation or wealth management.
Line charts are used to compare different values of similar data over a continuous period of time. This chart works well to illustrate trends in data at equally spaced time intervals such as months, quarters or years; it is particularly useful for understanding seasonality or cannibalization in business.
Scatter charts are used to identify correlations between two sets of scientific data, specifically, to ascertain whether a relationship exists between two variables. Such charts are used to compare numeric values and/or perform statistical analysis only: to determine relationships that help make future predictions. A linear trend demonstrates a correlation, which is the measurement of the interdependence of variable quantities expressed as a value between -1 (perfectly negative) and 1 (perfectly positive), where 0 indicates no correlation. The CORREL() function is used in unison with this chart to calculate correlations between two data-series.
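What CORREL() computes is the Pearson correlation coefficient, which can be sketched directly. The data-series below are hypothetical, chosen to show the two extremes of the -1 to 1 scale:

```python
import math

# A sketch of what CORREL() computes: the Pearson correlation
# coefficient between two data-series (hypothetical values).
def correl(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(correl([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))   # perfectly positive
print(correl([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))   # perfectly negative
```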
100% Stacked Area charts are used in similar fashion to line charts and are great for illustrating trends as percentages. Combo charts can be created using the aforementioned chart types, and the line chart works well in this regard. There are strategies that can be implemented to create a range of charts: use a Speedometer to capture the value of one data-series within a range, show progress with a Thermometer, track schedule efficiency with a Gantt chart or use the Pareto method to describe how 80% of an event's effects derive from 20% of the causes. More elaborate schemes can be used to create Maps and reflect Stock prices.
With a clear idea of how to visualize a problem, analysts can be confident in their chart selection and proceed to implementation. Creating charts is a relatively simple process:
Chart examples are provided below along with tools to learn more about chart types and usage cases.
A PivotTable offers analysts a quick synopsis of data: it is a means of formatting data beforehand so it can be filtered within a PivotChart. Data used to create PivotTables should be kept separate so new data can be added later (be sure to refresh when adding data and tinker with output values).
This process will open a new sheet with a PivotTable Fields pane, which reveals fields that accept categorical data. Because the order in which data is added affects how it is displayed, target larger categories first, using this order: [ Rows > Columns > Values > Filters ], where rows and columns serve as headers, values reflect dollar figures and filters represent a drop-down list of categorized data.
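The core of the Rows > Values step is a group-and-aggregate: rows sharing a category collapse into one total. A Python sketch with hypothetical sales rows:

```python
from collections import defaultdict

# A sketch of a PivotTable's Rows > Values step: rows of
# (category, amount) collapse into one total per category.
sales = [
    ("East", 100.0), ("West", 50.0),
    ("East", 25.0),  ("West", 75.0),
]

totals = defaultdict(float)
for region, amount in sales:
    totals[region] += amount     # each row's value is summed into its group

print(dict(totals))  # one aggregated value per region
```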
Once a chart has been made, it can be formatted to include and modify various chart components. Both the chart and individual data-series can be formatted to add an aesthetic flair that contributes to the overall story being told.
The first method for editing a chart targets text styles and enables analysts to 'shape' data-series. The latter two methods are used to apply components to a chart: a brief description of key elements follows. Note: not all charts will accept all components.
Aside from the actual chart title, Axis Titles can be used to define values. Data Labels can be used to display exact values on a chart, while a Data Table can be used to display the dataset from which the chart was created. Legends are used to define the relationship of a data-series with the data being graphed, while Trend lines are used to forecast.
The Right-Click method provides options to target data-series and text using fill, glow, outline, shadow, size and 3d properties. Use this method to add Line Markers to a data-series.
Another formatting feature is the Sparkline, which is a mini-graph of sorts that is displayed in a cell.
Analysts format charts to 'arrange' data in a meaningful way, but Conditional Formatting can be used for presentation purposes to help the audience 'make sense' of the data. When telling the story, it is imperative to introduce a problem, share how you formulated a hypothesis and the steps taken to test results before finally sharing conclusions. Conditional formatting instills confidence when sharing conclusions, as these features highlight differences based on specified conditions being met, helping to visualize data trends or variances and analyze data more efficiently.
The best part about Conditional Formatting is that it adds color to life, literally: its properties implement a standard Green-Yellow-Red color scheme to distinguish high/low and positive/negative values from one another. Choose the Highlight Cell Rules feature to invoke greater/less than, equal to or between x and y conditions, or to check for duplicate values. The Top/Bottom Rules feature can be used to highlight values above/below average as well as the top or bottom 10 items or percent. Data Bars can be used to compare and highlight the value of a cell relative to other values within that data-series.
The Color Scales feature is awesome: use it to make a Heat Map, as it distinguishes values along a color spectrum. To this end, Icon Sets are similar to color scales with additional context in the form of arrows, shapes and ratings that clearly demonstrate fixed points along a data-series' spectrum of values. All of these features are formatting rules: analysts have the ability to manage these rules, reset all data rules and create rules for a specific purpose. For example, the icon-set spectrum uses a default ratio of >=67%, >=33% and <33%, but these numbers can be modified to fit the needs of a scenario.
Basic Visualized Examples
Presented below are basic examples and explanations for how to visualize data.
That covers the basic principles of how to visualize data during the analysis process. Please take the time to review how I process data at length: find that article here.