# The Data Analysis Process

At the start of 2020, I took stock of my web developer career and made a hard pivot toward data analysis: a topic I had heard a lot about over the past year and decided to explore. Because @treehouse tutorials reign supreme, I began this adventure by taking lessons offered by @bendeitch and @sfgergasi that covered the basics of using spreadsheets to analyze, summarize and visually present data in a meaningful way. Naturally, I queried a few search engines to learn more and stumbled upon a plethora of resources suited to my study. The information below demonstrates the knowledge gained thus far and will serve as the foundation for my journey into the world of data analytics.

## What is Data Analysis?

*Data Analysis* is the process of aggregating data and performing conditional and statistical evaluations of it to help professionals from all walks of life make well-informed decisions and improve business and organizational efficiency. Advancements in technology have allowed this craft to flourish in ways that meaningfully impact everyday life, and it can become an ideal niche for a transitioning web developer interested in data science!

There are multiple facets to the analysis process as a whole: data is first collected, then formatted. Formatting involves confirming the accuracy of the data by checking for anomalies, duplicate values or empty cells, and ensuring column and row headers are titled. With the data in clear view, outline strategic goals designed to produce a specific insight by defining questions that qualify potential solutions or results and quantify how to measure them. Much of this is simple math: spreadsheets offer functions that calculate data to confirm totals, identify relationships and generally answer initial questions.

Probability and statistical analysis do much to further evaluate data (e.g. A/B testing, correlation, distribution), with the most common statistical methods involving the mean, standard deviation, linear regression, variance, percentiles and hypothesis testing. In business, Key Performance Indicators (KPIs) are used to evaluate business activity performance: it should be stressed that formulating a list of questions based on KPIs prior to analyzing data is a best practice! Other relevant analyses include cohort, ethnographic, narrative, univariate, bivariate, nominal, ordinal, hermeneutic, discourse and phenomenological.
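As a quick sketch of those common methods, here is how the mean, median, standard deviation and variance look when computed with Python's standard `statistics` module (the sales figures are made up for illustration):

```python
# Common summary statistics computed with Python's stdlib.
import statistics

monthly_sales = [120, 135, 128, 150, 142, 138]  # hypothetical values

mean = statistics.mean(monthly_sales)
median = statistics.median(monthly_sales)
variance = statistics.variance(monthly_sales)  # sample variance (divide by n-1)
stdev = statistics.stdev(monthly_sales)        # sample standard deviation

print(f"mean={mean} median={median} variance={variance:.2f} stdev={stdev:.2f}")
```

These are the same quantities a spreadsheet returns with AVERAGE(), MEDIAN(), VAR() and STDEV().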

Once data has been thoroughly vetted and prepped for presentation, it should be charted to help users visually interpret the relationships between targeted variables. Chart-type selection should be taken seriously, as graphic choice plays a pivotal role in how effectively a story is told. Thus, storytellers must strongly consider the analytical purpose of an inquiry when outlining an approach for presenting data in a meaningful way. Data stories begin by letting users readily access the data, in which case they should be able to filter and sort it to make comparisons based on chosen conditions. Providing options to ascertain anomalies, correlations or the range takes those charts to the next level. These are the concepts that help create the stories you want the data to tell!

## Collect and Organize Data

Data is obtained from spreadsheets that can be modified using programs like Microsoft Excel. Data files obtained from external sources typically use the `.csv` file extension for downloads, but data saved in Excel should use the `.xlsx` extension (be aware of the `.xlsm` extension, which is reserved for macro-enabled workbooks). In this regard, use a date format with a version number to save work in chronological order, like so: `YYYYMMDDversion#`.

While most analysts receive data directly from their employer, researchers must depend on external sources for data collection. Regardless of how the data was procured, it is considered either structured or unstructured. Whereas structured data is properly organized under a defined *data model* and resides in a relational database, unstructured data is not. Here are some open-source datasets that can help the most novice of intellects brainstorm first-class solutions for mankind's most complex issues.

#### Dataset Resources

The process of using spreadsheets to import data is quite simple:

**w/ Excel** **>>** Get External Data **>>** Choose Source

**w/ Google** **>>** Import **>>** Upload **>>** Import Action **>>** Create New **>>** Import Data

#### Working with Data

The data stored in an individual sheet is called a *dataset*: the values (text or numeric) stored in rows or columns are referred to as *data-series*. Datasets are separated using tabs located at the bottom of the sheet, which are especially useful for testing *scenarios* when *forecasting*. It is a best practice to use tabs to separate the original input data from forecasted output assumptions.

Data is labeled by its column and/or row headers, so one of the first things to do when organizing data is to *freeze the pane* to keep column headers visible. *Transpose* the headers to switch the placement of column and row labels and *convert values* before starting analysis. Leave *comments* to collaboratively exchange information and *notes* to define data within a cell, helping analysts stay abreast of file changes. *Sort* data using specified criteria such as a quantity, the price of an item, the order of an item, etc. *Filter* a targeted dataset when dealing with several categories, then sort specified columns within the filter module. Keep in mind that filtered data is temporarily hidden and must be restored before continuing analysis.

**Freeze Panes** **>>** Freeze Panes

**Transpose** **>>** Edit **>>** Paste Special **>>** Transpose

**Convert Values** **>>** `123` Button **>>** Choose Value

**Sort Data** **>>** Data **>>** Sort

**Filter Data** **>>** Data **>>** Filter **>>** Choose Categories to Ignore

**Restore Data** **>>** Select All

There are many keyboard shortcuts that can be used to make organizing data less complicated as well.

## Summarize Data

Spreadsheets have features that make handling data a cinch. Most of the heavy lifting is conducted via formulas and functions, which are used during the summarizing phase to test a hypothesis. While a *formula* is a math expression made up of cell addresses and arithmetic operators, a *function* is a built-in formula that enables analysts to perform calculations using values based on a specified condition. Excel Easy offers high-quality examples to aid when using Microsoft Excel spreadsheets. Before diving into what these tools can do, it is useful to know how to reference data and duplicate formulas for other calculations.

Functions use all values from a single column: these values are known as a *range* / *variable* and can be applied to other functions to expound upon initial calculations. Although the mean is the most common measure, the median is more robust to outliers and often produces more realistic numbers. When a variable / range is used to measure something, it becomes a metric that can be used for additional calculations. *Metrics* are aggregated values stored in cells; thus, how they are referenced is vital to the integrity of the data being analyzed. In its initial state, every cell is *relative*, identified by a column letter and row number. Individual cells can be referenced by separating them with commas; reference a group of cells by separating the first and last cells with a colon. Cells can also be referenced *absolutely* so that the value in that cell can be used to calculate the value in another cell: such a reference never shifts and is useful when performing multiple calculations against the same dataset. Where an absolute reference is implemented, that range of data can be *duplicated* for use in another cell to perform an entirely separate calculation. Keep an eye on absolutely referenced cells and how they are referenced, as either the column, the row or both can be locked. If only one of the two is locked, the reference may shift when copied, resulting in poor analysis.
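To see why the median can feel more "realistic" than the mean, here is a small Python sketch (the salary figures are hypothetical) showing how a single outlier drags the mean while the median barely moves:

```python
# Demonstrates the median's robustness to outliers vs. the mean.
import statistics

salaries = [48_000, 52_000, 50_000, 51_000, 49_000]   # hypothetical team salaries
with_outlier = salaries + [400_000]                   # add one extreme value

# Without the outlier, mean and median agree.
print(statistics.mean(salaries), statistics.median(salaries))

# With the outlier, the mean more than doubles while the median barely moves.
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```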

**Reference Cells**

**Absolute Reference** **>>** F4


**Duplicate a Formula**

**Add Intervals** **>>** Cell2: Base + # **>>** Drag Cells

#### Implement a Function

The aforementioned tactics are imperative for working with formulas and functions, which are the stimulus behind summarizing data. All functions begin with an equal sign (=) followed by the function name in all caps; parentheses enclose the parameters, which define the calculation to be performed on the referenced cells.

`=FUNCTION_NAME(cell_reference, condition)`

There is a plethora of formulas and functions that can be used to modify, summarize and validate data depending on the problem at hand. Implementing formulas and functions requires a bit of practice, for which I created an article to demonstrate conceptual usage.
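For a feel of what a conditional function like SUMIF does under the hood, here is a rough Python sketch (the `orders` data and its column names are made up for illustration):

```python
# A sketch of what a conditional spreadsheet function like SUMIF does:
# filter a range by a condition, then aggregate the matching values.
orders = [
    {"region": "East", "amount": 250},
    {"region": "West", "amount": 180},
    {"region": "East", "amount": 320},
]

# Rough equivalent of =SUMIF(region_range, "East", amount_range)
east_total = sum(row["amount"] for row in orders if row["region"] == "East")
print(east_total)  # 570
```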

#### Adding Data between Worksheets

*Lookup()* functions provide a conditional search that gets data from a column or row in one worksheet and adds it to a column or row in another worksheet, using VLOOKUP() or HLOOKUP(), respectively. When using these functions, it is important to delete empty rows or columns and to check that the same data type is used in the first column of the range.

**Process to Implement VLOOKUP() Function**

**>>** Lookup & Reference **>>** VLOOKUP **>>** Input Args

This function takes four arguments: a lookup_value, a table_array, a col_index_num and a range_lookup value. The **lookup_value** is the cell of data to match: it must be in the farthest-left column. The **table_array** is the range of cells to search, wherein the headers represent columns of data to retrieve: absolutely reference the cells to use entire column data! The **col_index_num** represents the header column number for the column of data to retrieve: this argument is always a number. The **range_lookup** is an optional value: use FALSE to return an exact match or TRUE to return an approximate match. It is a best practice to use FALSE when working with unique identifiers, and to sort data in ascending order otherwise.

`=VLOOKUP(A2, 'worksheet - dataset'!$A:$C, 3, FALSE)`

In the example above, the cell `A2` holds the lookup value that we want to match with data from `'worksheet - dataset'!`, which is absolutely referenced like so: `$A:$C`. This matches all data between the targeted columns in both the input and output tabs, where the `3` indicates the column targeted for retrieval, returning an exact match. The HLOOKUP() function works the same way, searching across rows instead of down columns.
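For intuition, an exact-match VLOOKUP can be sketched in Python as a keyed lookup (the product table and its column layout here are hypothetical):

```python
# A rough Python equivalent of an exact-match VLOOKUP: build a lookup
# table keyed on the leftmost column, then pull the matching field.
products = [
    ("P-100", "Widget", 9.99),
    ("P-200", "Gadget", 14.50),
]  # hypothetical 'table_array' with columns: id, name, price

lookup = {row[0]: row for row in products}

def vlookup(key, col_index):
    """Return column col_index (1-based, like Excel) for an exact key match."""
    row = lookup.get(key)
    return row[col_index - 1] if row else None  # None plays the role of #N/A

print(vlookup("P-200", 3))  # 14.5
```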

#### Identifying Patterns

Because data is not always distributed purposefully, it is important to know ways by which to measure the spread of data: the *Standard Deviation* (σ) is a function that can do just that.

`=STDEV(range)`

This function calculates the typical distance between values and the mean value. For example, a group of identical numbers would have no deviation. To calculate the standard deviation by hand:

- Subtract the Average from All Values
- Square All Values, then Add the Total Sum
- Divide the Total Sum by the # of Values (this result is the Variance)
- Apply the Square Root to the Variance
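Those hand-calculation steps translate directly into Python (the sample values are arbitrary; note this divides by n, i.e. the population standard deviation):

```python
# The four hand-calculation steps above, written out in Python.
import math

values = [4, 8, 6, 5, 3, 7]                      # hypothetical data
avg = sum(values) / len(values)                   # the average
deviations = [v - avg for v in values]            # subtract the average
squared_sum = sum(d ** 2 for d in deviations)     # square and total
variance = squared_sum / len(values)              # divide by count -> variance
stdev = math.sqrt(variance)                       # square root -> std deviation

print(round(variance, 4), round(stdev, 4))
```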

When combining data of different sample sizes, the returned metrics may have skewed results. Data is considered to have *Normal Distribution* whenever it does not skew. One property of normal distribution is how often values occur within a given standard deviation, where distribution is graded.

**Grade-1** — roughly 68% of values fall within one standard deviation of the mean

**Grade-2** — roughly 95% fall within two standard deviations

**Grade-3** — roughly 99.7% fall within three standard deviations

There are two functions that can be used to ascertain how data is distributed, *Skew()* and *Countifs()*:

`=SKEW(range)`

`=COUNTIFS(range, ">"&Cell2-Cell3, range, "<"&Cell2+Cell3)/Cell1`

The first condition checks if the value is greater than the mean minus one standard deviation. The second condition checks if the value is less than the mean plus one standard deviation. The count is then divided by the total number of values to determine the distribution. In the example below, the first condition checks if the range `'2017'!S2:S50` is `>` the mean **minus** one standard deviation `&B6-B10`, and the second condition checks if the range is `<` the mean **plus** one standard deviation. This is then divided by the total number of values `B1`.

`=COUNTIFS('2017'!S2:S50, ">"&B6-B10, '2017'!S2:S50, "<"&B6+B10)/B1`

Ascertain distribution at grade-2 or -3 by absolutely referencing the range and multiplying the standard deviation by 2 or 3, respectively.

`=COUNTIFS('2017'!$S$2:$S$50, ">"&B6-B10*2, '2017'!$S$2:$S$50, "<"&B6+B10*2)/B1`

`=COUNTIFS('2017'!$S$2:$S$50, ">"&B6-B10*3, '2017'!$S$2:$S$50, "<"&B6+B10*3)/B1`

The COUNTIFS() function can also be used to calculate the number of values within a specified interval, which can be useful when charting large datasets. In the example below, `A1` represents the lower bound of the interval and `A2` the upper bound.

`=COUNTIFS('2017'!$S$2:$S$50, ">"&A1, '2017'!$S$2:$S$50, "<"&A2)`
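The same distribution check can be sketched in Python: count the values that fall within k standard deviations of the mean and divide by the total count (the sample data is made up):

```python
# The COUNTIFS distribution check, sketched in Python: the share of values
# strictly within k standard deviations of the mean.
import statistics

def share_within(values, k=1):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    inside = [v for v in values if mean - k * sd < v < mean + k * sd]
    return len(inside) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population stdev 2
print(share_within(data, k=1))   # 0.625
```

For a normally distributed dataset, the result should land near 0.68 at k=1, 0.95 at k=2 and 0.997 at k=3.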

## Validate Data

Errors occur with input, output, logic or in a formula itself. When data is missing or entered incorrectly, and/or a formula is misspelled, uses the wrong order of operations or bad math, problems occur! During analysis, and even before charting data, it is imperative to be wary of flawed data that could impair *statistical significance*, which consists of qualitative measures that ascertain the meaningfulness of the analysis. *Hypothesis testing* is important to validate the results of such measurements. A good hypothesis statement should (a) include an “if” and a “then”, (b) name both the independent and dependent variables, (c) be testable by experiment, survey or another scientifically sound technique, (d) be based on information from prior research (either yours or someone else’s), and (e) have design criteria (for engineering or programming projects). Aside from a hypothesis test, analysts can use built-in features to check for errors as well as implement measures that prevent users from making errors.

#### Prevent Errors using Data Validation

Data validation is a good tool for lessening the likelihood of human input errors: an example is a company forecasting revenue, where all store managers input data into a shared file. Validation controls how data is entered while providing users w/ instructions to avoid errors; any invalid entry will produce a message notifying the user why the input is being denied. Here are a few commands that help:

**Validate Function** **>>** Data Validation (x2) **>>** Settings **>>** Allow (whole #s) **>>** Data Dropdown **>>** Between **>>** Set min / max

**Circle Invalid** **>>** Data Validation **>>** Select Circle Invalid Data

**Clear Invalid** **>>** Data Validation (x2) **>>** Error Message **>>** Clear

**Configure Error MSG** **>>** Data Validation (x2) **>>** Error (the default setting 'Stop' prevents data input; 'Warning' gives the user the option to enter input or not)

**Configure Input MSG** **>>** Data Validation (x2) **>>** Input Message
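The "whole numbers between a min and max" rule can be sketched in plain Python, returning a denial message much like a validation error message would (the bounds and the wording are hypothetical):

```python
# A sketch of a "whole number between min and max" validation rule:
# reject bad input, with a reason, before it reaches the dataset.
def validate_whole_number(raw, minimum=0, maximum=1_000_000):
    """Return (ok, message) the way a validation rule would."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return False, "Entry must be a whole number."
    if not minimum <= value <= maximum:
        return False, f"Entry must be between {minimum} and {maximum}."
    return True, ""

print(validate_whole_number("250"))   # (True, '')
print(validate_whole_number("abc"))   # (False, 'Entry must be a whole number.')
```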

#### Checking for Errors

There are also commands that assist with checking for errors that are invisible to the naked eye:

**Check Sum Values** — *Add all values for a product, then compare the two sets of totals.*

**Highlight Formulas** **>>** Find & Select **>>** Go To Special **>>** Formulas

**Display Formulas** **>>** Select Show Formulas (*click again to remove*)

**Evaluate Formulas** **>>** Select Evaluate Formula **>>** Select (*helps solve order-of-operations errors*)

**Find and Replace** **>>** CTRL-F **>>** Enter Find What + Replace With

**Trace Error Origin** **>>** Select Error Checking **>>** Select Trace Error

**Trace Dependents** — *Cells that contain a formula can be traced back to the cells with the originating values. Keep the pointer on the cell with the formula. Useful when working with a large file with multiple tabs.*

## Visualize Data

How data is collected and summarized determines the quality with which that data can be articulated. Using charts helps articulate data in a manner the audience can visually grasp.

#### Chart Types

There are many ways that data can be charted, but all of which depend on a given situation or *scenario* in relation to how you want that data presented.

*Column* and *Bar* charts are commonly used to *compare different values among items and across multiple categories* **or** *to show changes in data over a period of time*. These charts work well to highlight differences between a limited number of data-series.

*Pie* charts are used to *quantify the proportional value of an item to the sum total of all items*. Often displayed as a percentage, this chart focuses on an individual data-series and works well to highlight market share, a snapshot allocation or wealth management.

*Line* charts are used to *compare different values of similar data over a continuous period of time*. This chart works well to illustrate trends in data at equally spaced time intervals such as months, quarters or years; it is particularly useful for understanding seasonality or cannibalization in business.

*Scatter* charts are used to *identify correlations between two sets of scientific data*, specifically, to ascertain whether a relationship exists between two variables. Such charts are used to compare numeric values and/or perform statistical analysis only: to determine relationships that help make future predictions. A linear trend demonstrates a correlation, which is a measurement of the interdependence of variable quantities expressed in a range from -1 (perfect negative) through 0 (none) to 1 (perfect positive), with values around 0.5 indicating a moderate relationship. The *CORREL* function is used in unison with this chart to *calculate the correlation between two data-series*.
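For intuition, what CORREL computes is the Pearson correlation coefficient, which can be written out by hand in Python (the minutes/points data is invented for illustration):

```python
# The Pearson correlation coefficient, as computed by a CORREL-style
# function: covariance divided by the product of the spreads.
import math

def correl(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

minutes = [10, 20, 30, 40, 50]   # hypothetical playing time
points = [5, 11, 14, 22, 26]     # hypothetical points scored

print(round(correl(minutes, points), 3))  # close to +1: strong positive
```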

*100% Stacked Area* charts are used in a similar fashion to line charts and are great for illustrating trends as percentages. *Combo* charts can be created using the aforementioned chart types, and the line chart works well in this regard. There are strategies that can be implemented to create a range of charts: use a *Speedometer* to capture a value of one data-series within a range, show progress with a *Thermometer*, track schedule efficiency with a *Gantt* chart or use the *Pareto* method to describe how 80% of an event's effects derive from 20% of its causes. More elaborate schemes can be used to create *Maps* and reflect *Stock* prices.

With a clear idea of how to visualize a problem, analysts can be confident in their chart selection and proceed to implement it. Creating charts is a relatively simple process:

**Create a Chart** **+** Insert **>>** Choose Chart

*--or--*

**Convert Chart** **>>** Move **>>** Rename **>>** Ok

**Save Template** **>>** Save as Template

**Use Template** **>>** Add a Chart/Change Chart **>>** Select Template

**Change Type** **>>** Change Chart Type
Chart examples are provided below along with tools to learn more about chart types and usage cases.

#### PivotTable

A *PivotTable* offers analysts a quick synopsis of data: it is a means of formatting data beforehand that can then be filtered within a *PivotChart*. Data used to create PivotTables should be kept separate so that new data can be added later (be sure to refresh when adding data, and tinker with the output values).

**Create PivotTable** **>>** Select Table **>>** Name Table **+** Insert **>>** Pivot Table **>>** Choose New

**Change Calculation** **>>** Field Settings **>>** Choose

This process will open a new sheet with a PivotTable Fields pane, which reveals *fields* that accept categorical data. Because the order in which data is added affects how it is displayed, target larger categories first, using this order: **[ Rows > Columns > Values > Filters ]**, where rows and columns serve as headers, values reflect dollar figures and filters represent a drop-down list of categorized data.
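At its core, a PivotTable groups records by a row field and a column field and aggregates a value field per cell, which can be sketched in Python (the `records` dataset is hypothetical):

```python
# A minimal sketch of what a PivotTable does: group records by a row
# field and a column field, then aggregate the value field per cell.
from collections import defaultdict

records = [
    {"region": "East", "year": 2019, "sales": 100},
    {"region": "East", "year": 2020, "sales": 150},
    {"region": "West", "year": 2019, "sales": 80},
    {"region": "East", "year": 2019, "sales": 60},
]  # hypothetical dataset

pivot = defaultdict(float)
for r in records:
    # (region, year) plays the role of Rows > Columns; sales is the Value.
    pivot[(r["region"], r["year"])] += r["sales"]

print(dict(pivot))  # {('East', 2019): 160.0, ('East', 2020): 150.0, ('West', 2019): 80.0}
```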

#### Format Charts

Once a chart has been made, it can be formatted to include and modify various chart components. Both the chart and individual data-series can be formatted to add an aesthetic flair that contributes to the overall story being told.

**Edit Chart** **>>** Choose Style **>>** Add Chart Element **>>** Choose Option **>>** Format Chart Area **>>** Choose Option

**Add Sparklines** **>>** Insert **>>** Sparklines

The first method of editing a chart targets text styles and enables analysts to 'shape' data-series. The latter two methods are used to apply components to a chart: a brief description of key elements follows. Note: not all charts will accept all components.

Aside from the actual chart title, *Axis Titles* can be used to *define values*. *Data Labels* can be used to *display exact values* on a chart, while a *Data Table* can be used to *display the dataset* from which the chart was created. *Legends* are used to *define the relationship of a data-series with the data being graphed*, while *Trend lines* are used to *forecast*.

The *Right-Click* method provides options to target data-series and text using fill, glow, outline, shadow, size and 3D properties. Use this method to *add Line Markers to a data-series*.

Another formatting feature is the *Sparkline*, which is a mini-graph of sorts that is displayed in a cell.

#### Conditional Formatting

Analysts format charts to 'arrange' data in a meaningful way, but *Conditional Formatting* can be used for presentation purposes to help the audience 'make sense' of the data. When telling the story, it is imperative to *introduce a problem, share how you formulated a hypothesis and the steps taken to test it, before finally sharing conclusions*. Conditional formatting instills confidence when sharing conclusions, as these features highlight differences based on specified conditions being met, helping to visualize data trends or variances and analyze data more efficiently.

**Add Condition** **>>** Home **>>** Conditional Formatting
The best part about **CF** is that it adds color to life, literally: all properties of this feature implement a standard Green-Yellow-Red color scheme to distinguish high/low and positive/negative values from one another. Choose the *Highlight Cell Rules* feature to invoke *greater/less than, equal to or between x and y conditions, or to check for duplicate values*. The *Top/Bottom Rules* feature can be used to *highlight values above/below average as well as the top or bottom 10(%)*. *Data Bars* can be used to *compare and highlight the value of a cell relative to other values within that data-series*.

The *Color Scales* feature is awesome: use it to make a heat map, as it is designed to *distinguish values along a color spectrum.* To this end, *Icon Sets* are similar to color scales, with additional context in the form of arrows, shapes and ratings to clearly demonstrate fixed points along that data-series' value spectrum. All of these features are formatting *rules*: analysts have the ability to manage these rules, reset all data rules and create rules for a specific purpose. For example, the spectrum uses a default ratio of >68-<67-<33, but these numbers can be modified to fit the needs of a scenario.

## Basic Visualized Examples

Presented below are basic examples and explanations for how to visualize data.

This *Bar Chart* highlights the *total sum of email signatures collected by ten teams* of volunteers while raising awareness about youth homelessness. Before inserting a chart, I __sorted__ values along the vertical axis to reflect the rank order by highest value instead of by team name. Recognizing that all values fall between the range of 40 and 90, I modified the horizontal axis' __bounds and units__ to eliminate whitespace outside the range, then added __data labels__ to the outside end of each value for clarity.

This *Pie Chart* *compares total donations earned among five groups as a proportional percentage*. The groups used here are made up of colleagues from different departments at a non-profit organization that provides home-based, no-cost, after-school tutoring services to low-income youth while operating solely on donations. I used the `design` tab setting to give each group a certain color, employed a __legend__ and formatted the __data labels__ to match groups with their respective pieces of the pie, by which each slice reflects a specific dollar amount and the percentage of all donations that group contributed.

This *Line Chart* *compares membership data for seven regional health club franchises*. In contrast with the bar and pie charts, line charts are used to compare the values of multiple variables simultaneously: the data used to create this chart included each franchise's location and its membership numbers by month over the course of a year. I used a __legend__ to make it easy to match each franchise with its line. I formatted the __data series__ and used __line markers__ to highlight the membership count at each franchise for each month.

This *Column Chart* *compares Business Expenditures over a Three-Year period* wherein the expenditures are based on four categories: raw materials, tools, labor and facilities. After inserting a chart, I realized that the header titles were misplaced, so I __transposed__ the column and row headers and this had an immediate impact on the chart. Then, I added __gridlines__ and centered __data labels__ before tossing the __legend__ directly beneath the title to make it easy to interpret the expenditures of each category during each year under review. For good measure, I formatted the bounds and units on the __vertical axis__.

These *Scatter Plot* charts were made to *identify correlations in the statistical results of the top NBA scorers during the 2018-2019 playoffs*. The data for this chart consisted of two dozen statistical categories for the top 50 scorers. I used two statistical pairings (points / field goal percentage and points / minutes) to illustrate a negative and a mid-level correlation among the stats. The points / minutes pairing showed a positive correlation within the test parameters, but it would still be considered a weak correlation in general. I formatted the __axis titles__ and __bounds and units__, and added a __trend line__.

That covers the basic principles of how to visualize data during the analysis process. Please take the time to review how I process data at length: find that article here.