Follow the 16-step 1.5(IQR) Outlier Tutorial below to find statistical outliers in your own data:
1. Buy and download the 1.5(IQR) Outlier AddIn to get custom, interactive messages and outlier reports that automatically pop-up on screen as the 16 outlier analysis steps are run.
OR
2. Copy, paste and run the FREE 1.5(IQR) Statistical Outlier and Anomaly Detection Code in the VBA Code Editor of a PC or Mac Microsoft Excel workbook to find outliers in your data, then manually create descriptive statistics and visuals for your outliers report.
3. Learn how to set up your PC or Mac (laptop or desktop) before using the code (Option 2) or the AddIn (Option 1).
This 1.5(IQR) Outlier Tutorial shows descriptive statistics and visuals before and after outliers are removed or replaced. With the AddIn, only steps 1-4 require your input to create your own outliers scenario on your datasets; different datasets will give different output and on-screen messages.
Outliers are extreme data points based on the box & whisker plot. The box & whisker plot uses key statistics (Median, 25th Percentile, 75th Percentile, Interquartile Range (IQR)) to calculate its UPPER and LOWER limits: the IQR is the 75th Percentile minus the 25th Percentile, the UPPER limit is the 75th Percentile plus 1.5 × IQR, and the LOWER limit is the 25th Percentile minus 1.5 × IQR. Statistical outliers are values greater than the UPPER limit or less than the LOWER limit. After replacing or transforming statistical outliers, create a 2-period moving average forecast chart and visually compare your data's statistical story moving forward with and without outliers skewing results. You will also learn whether you might have to use special statistical formulas on your data if you want to keep outliers in your data analysis.
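The 1.5(IQR) rule described above can be sketched in a few lines of Python. This is only an illustration, not the AddIn's or the free VBA code's actual implementation: the function names, the sample values, and the choice of quartile method (interpolation in the style of Excel's QUARTILE.INC) are all assumptions.

```python
from statistics import quantiles

def iqr_limits(values, k=1.5):
    """Return the (LOWER, UPPER) box & whisker limits for a list of numbers."""
    # Assumption: quartiles are interpolated like Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Label each value 'High Outlier', 'Low Outlier', or '' (not an outlier)."""
    lower, upper = iqr_limits(values, k)
    flags = []
    for v in values:
        if v > upper:
            flags.append("High Outlier")
        elif v < lower:
            flags.append("Low Outlier")
        else:
            flags.append("")
    return flags

# Hypothetical sample standing in for Column 'A'.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]
lower, upper = iqr_limits(data)
print("LOWER limit:", lower, "UPPER limit:", upper)
print(list(zip(data, flag_outliers(data))))
```

A different quartile convention would shift the limits slightly, so borderline values may be flagged differently from one tool to another.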
Step 1: Add the AddIn in Excel
Whether you use the Interactive Statistics Education AddIn or not, flagging and visualizing outliers in your data can help you learn a lot about how to successfully apply statistics as a data science or analytics professional with or without the use of algorithms.
Step 2: Select the 'Outlier Analysis' Tab
If you do choose to buy, download and install the Interactive Statistics Education AddIn, you will see a 'Find Outliers' button on an 'Outlier Analysis' tab at the top of every Microsoft Excel workbook that you open. Or, you can use the Free Outlier and Anomaly Detection Source Code referred to in Step 3 to select and find outliers in any data placed in Column 'A'.
Step 3: Click the 'Find Outliers' Button
To use the Interactive Statistics Education AddIn, all you must do is put your column of numeric data (e.g. #, $, %) in Column 'A' on a worksheet labeled 'Sheet1' in your Excel workbook so that your report can be generated automatically. Without the AddIn, you can use the Free Outlier and Anomaly Detection Source Code on your own datasets to flag and visualize statistical outliers. You can use the Free Time Series Outlier and Anomaly Detection Source Code on your own datasets if you want to see forecasts (e.g. moving average, linear regression) with and without statistical outliers across days, weeks, months, quarters, or even years.
Step 4: Select Your Column of Data
Once you click the 'Find Outliers' button of the Interactive Statistics Education AddIn, you will be prompted to select your one column of data in Column 'A' to be analyzed for outliers, and this column will be used to generate interactive on-screen messages and a full outlier analysis report automatically in the next steps.
Step 5: Outlier Analysis Conclusion
A new column of data has been added in the first stage of the outliers report creation: it flags which data points in Column 'A' are 'High Outlier' or 'Low Outlier' values based on the statistical theory of the box & whisker plot. You can also see a summary conclusion pop-up message on-screen explaining which dataset values are outliers. Click the 'OK' button to move on to the next step automatically. If you perform a single factor Analysis of Variance (ANOVA) on the five segments created in column B, you should find that the segments differ significantly from one another. This is the case when the p-Value generated by your ANOVA output is less than the level of significance (e.g. alpha = .05) at which you choose to do your analysis. This reinforces the value of using outlier analysis to define meaningful segments in your datasets.
Step 6: Descriptives Report and Charts
Now, you can begin to see a box & whisker plot created to visually display data points which are 'High Outlier' (above the UPPER box plot whisker) or 'Low Outlier' (below the LOWER box plot whisker) in your dataset. A pie chart also summarizes the percentage of your dataset's records that are considered 'High Outlier' and 'Low Outlier'. If you do not see any dots above or below the 'whiskers' of a box plot - the horizontal lines at the top and bottom of it - there are no 'High Outlier' or 'Low Outlier' values in the data.
Step 7: Descriptives Report and Charts
As the box & whisker plot and pie chart are being completed, a detailed descriptive statistics summary report is generated from Column 'A' of data, and you will begin to see this report in the next step. The UPPER and LOWER data values seen below reflect the values for the box plot 'whiskers': data points above and below these values indicate 'High Outlier' and 'Low Outlier' values, respectively, in the dataset.
Step 8: Descriptives Report and Charts
At this stage, you can analyze the simple descriptive statistics for a dataset to understand whether 'High Outlier' or 'Low Outlier' values are pulling your 'Average', 'Median' and 'Mode' descriptive statistics apart. The extent to which this is true will be reflected in the 'Kurtosis' and 'Skewness' statistics and, as you will see later, your histogram and scatter diagram. 'Skewness' measures how lopsided the data is: it is positive when a long tail of high values pulls the 'Average' above the 'Median', negative when a long tail of low values pulls it below, and close to '0' when the data is roughly symmetrical. 'Kurtosis' measures how heavy the tails of the data are, with '0' matching a normal distribution (the convention used by Excel's KURT function); large positive values point to extreme data points. Both metrics can be negative, positive, or '0'. As a rule of thumb, the further the 'Skewness' is from '0', the larger the gap between the 'Average', 'Median', and 'Mode' tends to be. When the 'Skewness' and 'Kurtosis' are both close to '0' and the data has a single peak, the 'Average', 'Median', and 'Mode' will be close to the same value.
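To make the two statistics concrete, here is a minimal Python sketch of sample skewness and excess kurtosis using the formulas documented for Excel's SKEW and KURT functions. Whether the AddIn computes them exactly this way is an assumption, and the function names and sample values are illustrative only.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Sample skewness (same formula as Excel's SKEW); needs at least 3 values."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Sample excess kurtosis (same formula as Excel's KURT); needs at least 4 values."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# Hypothetical samples: one symmetrical, one with a single high outlier.
symmetric = [1, 2, 3, 4, 5]
with_high_outlier = [1, 2, 3, 4, 50]
print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(with_high_outlier), sample_excess_kurtosis(with_high_outlier))
```

The symmetrical sample has a skewness of essentially zero, while the single high value in the second sample pushes both statistics well above zero.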
Step 9: Descriptives Report and Charts
At this point, the full descriptive statistics report is generated. A histogram has also been plotted to help interpret the dataset's distribution based on the 'Skewness' and 'Kurtosis' descriptive statistics in the analysis. If the 'Skewness' and 'Kurtosis' values for a dataset are both '0', you should see a roughly symmetrical, bell-shaped histogram with records evenly distributed above and below the 'Median' (the '50th Percentile').
Step 10: Histogram Analysis Conclusion
The Histogram Analysis Conclusion on-screen pop-up message displays a customized message explaining the interpretation of the 'Skewness' and 'Kurtosis' descriptive statistics and how they shape the histogram you see, which in turn reflects the 'Average', 'Median', and 'Mode' descriptive statistics. Click the 'OK' button to move to the next step automatically. Starting in Step 11, you will see the last 2 rows in 'Column I' of the spreadsheet, which automatically tell you how many categories (bins) you need, and the range of each category, in order to manually bin and create a proper histogram of your data.
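The tutorial does not say which rule the AddIn uses to pick the number of bins and the bin range. As one possible illustration only, the sketch below applies the Freedman-Diaconis rule, which fits this tutorial because it is also based on the IQR: bin width = 2 × IQR / n^(1/3). The function name, sample values, and quartile method are assumptions.

```python
import math
from statistics import quantiles

def freedman_diaconis_bins(values):
    """Return (number_of_bins, bin_width) using the Freedman-Diaconis rule."""
    n = len(values)
    # Assumption: quartiles are interpolated like Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    width = 2 * (q3 - q1) / n ** (1 / 3)
    if width == 0:
        raise ValueError("IQR is zero; this rule cannot suggest a bin width")
    bins = math.ceil((max(values) - min(values)) / width)
    return bins, width

# Hypothetical sample standing in for Column 'A'.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]
bins, width = freedman_diaconis_bins(data)
print(f"bins={bins}, width={width:.2f}")
```

Other common rules (for example Sturges' rule or the square root of n) would give different answers, so treat the result as a starting point for your histogram rather than a fixed requirement.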
Step 11: Trend Analysis Conclusion
At this point, the Trend Analysis Conclusion on-screen pop-up message provides an analysis of whether the trend in a dataset is positive, negative or neutral. It also displays the 'R' and 'R-squared' coefficients to support the conclusions. 'High Outlier' and 'Low Outlier' values in data will have an impact on whether an accurate trend can be viewed and predicted, and this is reflected in lower than desirable 'R' and 'R-squared' coefficients. The value of 'R' ranges from -1 to 1, and the value of 'R-squared' ranges from 0 to 1. An 'R' of '-1' indicates a perfectly consistent downward trend in the data: as time goes on, new values in the data become lower and lower. An 'R' of '1' indicates a perfectly consistent upward trend. An 'R-squared' value of '1' indicates that a straight trend line fits the current data perfectly, whether the trend is upward or downward; the sign of 'R' tells you the direction. Under the 'Histogram Statistics' in Column 'J', you will see the 'Optimal # of Bins' and 'Optimal Histogram Bin Width' values. These metrics will help you analyze and change the distribution of your data with a histogram before using it to forecast or in a business analytics report. The 'Optimal # of Bins' describes how many categories your data should be organized into on a histogram. 'Optimal Histogram Bin Width' indicates how wide each bin should be (in whatever unit of measurement your data is in) to help you create the lower and upper limits for each category on your histogram. Click the 'OK' button to move on to the next step automatically.
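As a rough sketch of what the 'R' and 'R-squared' values measure, the Python below correlates each value with its position in time (period 1, 2, 3, ...). It assumes the rows of the column are in time order; the function name and sample values are hypothetical, and the AddIn's own calculation may differ.

```python
import math
from statistics import mean

def trend_correlation(values):
    """Return (R, R-squared) between the period number and the data values."""
    periods = list(range(1, len(values) + 1))
    mean_x, mean_y = mean(periods), mean(values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(periods, values))
    sxx = sum((x - mean_x) ** 2 for x in periods)
    syy = sum((y - mean_y) ** 2 for y in values)
    if syy == 0:
        raise ValueError("all values are identical; R is undefined")
    r = sxy / math.sqrt(sxx * syy)
    return r, r * r

# Hypothetical series: a steady decline, then a rise disturbed by one high value.
falling = [10, 8, 6, 4, 2]
rising_with_outlier = [2, 4, 6, 80, 10]
print(trend_correlation(falling))
print(trend_correlation(rising_with_outlier))
```

The steadily falling series gives an 'R' of -1 and an 'R-squared' of 1, while the single extreme value in the second series drags 'R-squared' well below 1 even though the underlying trend is upward.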
Step 12: With vs. Without Outliers Analysis Conclusion
Now you can understand what a dataset - as measured by descriptive statistics - looks like with and without the 'High Outlier' and 'Low Outlier' data values flagged in previous steps. The summary 'With vs. Without Outliers' report can help you understand to what extent data outliers cause a problem when forecasting or reporting Key Performance Indicators (KPIs). Compare the 's' statistic, or Standard Deviation, 'With vs. Without Outliers': it is typically smaller once outliers are excluded. Click the 'OK' button to move on to the next step automatically.
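The 'With vs. Without Outliers' comparison can be approximated with a short Python sketch. It assumes the 1.5(IQR) limits are used to decide which values to exclude; the function name, sample values, and quartile method are illustrative assumptions rather than the AddIn's code.

```python
from statistics import mean, stdev, quantiles

def without_outliers(values, k=1.5):
    """Return only the values that lie inside the box & whisker limits."""
    # Assumption: quartiles are interpolated like Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    return [v for v in values if q1 - spread <= v <= q3 + spread]

# Hypothetical sample standing in for Column 'A'.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]
trimmed = without_outliers(data)

for label, column in (("With Outliers", data), ("Without Outliers", trimmed)):
    print(f"{label}: n={len(column)}, Average={mean(column):.2f}, s={stdev(column):.2f}")
```

In this sample a single high value is enough to raise both the 'Average' and the 's' statistic noticeably.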
Step 13: With vs. Without Outliers Box & Whisker Plot
As the scatter diagram and a new segmented box & whisker plot are generated, the impact of 'High Outlier' and 'Low Outlier' values in the dataset becomes visually clear. Two new columns of data show up in Columns 'W' and 'X' of 'Sheet1': one column holds the entire dataset ('With Outliers'), and the other holds only the data points that are not outliers ('Without Outliers').
Step 14: Student's t-Test Analysis Conclusion
In this stage, we see an on-screen pop-up message summarizing and explaining a t-Test analysis that checks whether the values in Column 'W' - 'With Outliers' and Column 'X' - 'Without Outliers' represent two significantly different sets of behaviour. If the p-Value of the t-Test is less than or equal to the level of significance (e.g. alpha = .05), then the two columns ('W' and 'X') of data represent distinct patterns of behaviour. Click the 'OK' button to end the report.
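The tutorial does not state which variant of the t-Test the AddIn runs. As a hedged illustration, the sketch below computes the two-sample t statistic in its unequal-variance (Welch) form, together with its degrees of freedom. The function name and sample columns are hypothetical. Turning the statistic into a p-Value requires the t distribution, which is not in Python's standard library; in Excel, the T.TEST function returns the p-Value directly.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for two samples, unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical stand-ins for Column 'W' (With Outliers) and Column 'X' (Without Outliers).
with_outliers = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]
without = [12, 14, 15, 15, 16, 17, 18, 19, 20]
t_stat, dof = welch_t(with_outliers, without)
print(f"t={t_stat:.3f}, df={dof:.1f}")
```

The p-Value, not the t statistic itself, is what gets compared with the level of significance.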
Step 15: Replace Outliers With Box & Whisker Plot UPPER and LOWER Limits
At this point, 'High Outlier' and 'Low Outlier' data points are automatically replaced with the UPPER and LOWER whisker limits determined by analyzing the dataset. The summary on-screen pop-up message below indicates what those UPPER and LOWER whisker values are, and you can see a complete dataset with outliers replaced in Column 'Y' of the worksheet.
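Replacing outliers with the UPPER and LOWER limits amounts to clamping each value into the allowed range. The Python below is a minimal sketch of that idea under the same 1.5(IQR) assumptions as the rest of this tutorial; the function name, sample values, and quartile method are illustrative and not taken from the AddIn.

```python
from statistics import quantiles

def replace_outliers(values, k=1.5):
    """Replace values beyond the box & whisker limits with the limits themselves."""
    # Assumption: quartiles are interpolated like Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    # Values inside the limits pass through unchanged.
    return [min(max(v, lower), upper) for v in values]

# Hypothetical sample standing in for Column 'A'.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 60]
print(replace_outliers(data))
```

Unlike removing outliers, this keeps the number of records the same, which matters when each row represents a time period in a forecast.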
Step 16: Outlier Replacement Conclusions
The on-screen pop-up message below summarizes the key descriptive statistics for your data so that you can compare it 'With Outliers', 'Without Outliers' and after you 'Replace Outliers With Box & Whisker Plot UPPER and LOWER Limits'. The 'Kurtosis' and 'Skewness' statistics should typically both be much closer to '0' after replacing outliers. After outlier replacement, the 'Average' and 's' statistic (Standard Deviation) will usually also sit closer to their 'Without Outliers' values than the original 'With Outliers' values did.
Post Outlier Analysis: Forecasting Before and After Statistical Outliers Replaced
The Interactive Statistics Education AddIn contains code that lets you automatically create 2-period moving average forecasts of your data before and after statistical outliers have been replaced with box & whisker plot UPPER and LOWER limits. Visually compare what your performance will look like moving forward with and without statistical outliers skewing your results.
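A 2-period moving average forecast simply averages the two most recent values to predict the next one. The Python below sketches that calculation; the function name and the sample series are hypothetical, and the AddIn's charting is not reproduced here.

```python
def moving_average_forecast(values, window=2):
    """Return forecasts where each one is the average of the previous `window` values.

    The first forecast is for period window + 1; the last is for the
    period immediately after the end of the data.
    """
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]

# Hypothetical series before and after a high outlier is replaced by an UPPER limit of 24.375.
before = [12, 14, 15, 60, 16]
after = [12, 14, 15, 24.375, 16]
print(moving_average_forecast(before))
print(moving_average_forecast(after))
```

Because each forecast depends on only two values, a single outlier distorts the two forecasts that include it, which is why the before and after charts can look so different.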
You may decide that you wish to keep statistical outliers in your dataset and apply one of several data transformation techniques using common statistical formulas. You may wish to visualize what your dataset would look like when you take the SQUARE ROOT or LOG of its values or set them to the POWER of some value. The four box & whisker plots below visually compare a set of values before and after they have been transformed. The 'Box & Whisker Square Root' box plot graph visualizes how taking the square root of the '# Customers' column of values results in no statistical outliers. Rather than replacing statistical outliers, it is possible to transform data and bring those data points closer in value to the rest of a dataset before developing forecasts from the data.
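To see how a transformation can pull extreme values back toward the rest of a dataset, the Python sketch below counts 1.5(IQR) outliers before and after taking the square root. The sample values are invented for illustration (they are not the '# Customers' data from the charts), and the function name and quartile method are assumptions.

```python
import math
from statistics import quantiles

def count_outliers(values, k=1.5):
    """Count values outside the box & whisker UPPER and LOWER limits."""
    # Assumption: quartiles are interpolated like Excel's QUARTILE.INC.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    return sum(1 for v in values if v < q1 - spread or v > q3 + spread)

# Hypothetical right-skewed sample.
data = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 225]
square_roots = [math.sqrt(v) for v in data]

print("Outliers before:", count_outliers(data))
print("Outliers after square root:", count_outliers(square_roots))
```

The same function can be applied to `[math.log(v) for v in data]` to try a LOG transformation, provided every value is greater than zero.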