Outlier Analysis.
[Audio] Hey guys! In this video, we will talk about what outlier analysis is. An outlier is nothing but an abnormality among the data objects: one that does not follow the general behavior. Suppose everybody is walking in the forward direction; if any one person in that group is walking in the reverse direction, then that person is an outlier compared to the common behavior. So, in a nutshell, anything that behaves against the general pattern is called an outlier, and the analysis of these outliers is called outlier analysis.
[Audio] There are a variety of ways to find outliers. All of these methods employ different approaches to finding values that are unusual compared to the rest of the dataset. Here we'll look at just a few of these techniques: sorting, graphing your data to identify outliers, using Z-scores to detect outliers, and using the interquartile range to create outlier fences.
[Audio] Sorting is the easiest technique for outlier analysis. Load your dataset into any kind of data manipulation tool, such as a spreadsheet, and sort the values by their magnitude. Then look at the range of values across the data points. If any data points are significantly higher or lower than the others in the dataset, they may be treated as outliers. Let's look at an actual example of sorting. Your dataset for a pilot experiment consists of the following values: 180, 156, 9, 176, 1827, 166, 171. When we sort these in either ascending or descending order, the highlighted numbers 9 and 1827 stand out as the outliers.
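If you would rather do the sorting in code than in a spreadsheet, here is a minimal Python sketch. It uses only the values listed above; the variable names are just for illustration.

```python
# Values from the pilot example, as listed in the transcript.
values = [180, 156, 9, 176, 1827, 166, 171]

# Sort by magnitude so that extreme values end up at either end of the list.
sorted_values = sorted(values)

print(sorted_values)      # [9, 156, 166, 171, 176, 180, 1827]
print(sorted_values[0])   # smallest value: a candidate low outlier
print(sorted_values[-1])  # largest value: a candidate high outlier
```

Sorting only puts the candidates where you can see them. Deciding whether 9 and 1827 really are outliers is still a judgment you make by comparing them with their neighbors.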
[Audio] Graphing your data to identify outliers: boxplots, histograms, and scatterplots can highlight outliers. Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers, which I explain later. The boxplot below displays our example dataset. It's clear that the outlier is quite different from the typical data value.
[Audio] Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. A Z-score is the number of standard deviations above or below the mean that a value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean. To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, the formula for that process is the following: Z = (X - μ) / σ, where μ is the mean and σ is the standard deviation. Below, I display the values in the example dataset along with their Z-scores. This approach identifies the same observation as being an outlier. Indeed, our Z-score of about 3.6 is right near the maximum possible value for a sample size of 15. Samples of 10 or fewer observations cannot have Z-scores that exceed a cutoff value of +/- 3.
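Here is a rough Python sketch of the Z-score calculation. The 15-value dataset behind the Z-score of about 3.6 is not reproduced in this transcript, so the sketch reuses the seven values from the sorting example instead. It assumes the sample standard deviation (the n - 1 version), and the helper `max_possible_z` is an addition of mine: with that definition of the standard deviation, the largest Z-score a sample of size n can contain is (n - 1) / sqrt(n).

```python
import math
import statistics

def z_scores(values):
    """Return (x - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / sd for x in values]

def max_possible_z(n):
    """Largest |Z| any single point can reach in a sample of size n,
    assuming the sample standard deviation is used."""
    return (n - 1) / math.sqrt(n)

# Values reused from the sorting example (an assumption; not the 15-value dataset).
values = [180, 156, 9, 176, 1827, 166, 171]
scores = z_scores(values)

for x, z in zip(values, scores):
    print(f"{x:>5}  z = {z:+.2f}")

print(f"max possible |z| for n = {len(values)}: {max_possible_z(len(values)):.2f}")
print(f"max possible |z| for n = 15: {max_possible_z(15):.2f}")
```

With only seven values, even 1827 cannot reach a Z-score of 3, which is exactly the small-sample limitation mentioned above. On datasets this small, a fixed cutoff of +/- 3 will never flag anything.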
[Audio] You can use the interquartile range (IQR), several quartile values, and an adjustment factor to calculate boundaries for what constitutes minor and major outliers. Minor and major denote the unusualness of the outlier relative to the overall distribution of values. Major outliers are more extreme. Analysts also refer to these categorizations as mild and extreme outliers. The IQR covers the middle 50% of the dataset. It's the range of values between the third quartile and the first quartile (Q3 - Q1). We can take the IQR, Q1, and Q3 values to calculate the following outlier fences for our dataset: lower outer, lower inner, upper inner, and upper outer. These fences determine whether data points are outliers and whether they are mild or extreme.
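To make the fences concrete, here is a minimal Python sketch. The video does not state the adjustment factors, so this assumes the conventional ones: 1.5 x IQR for the inner fences and 3 x IQR for the outer fences. The function names are hypothetical, the data are the seven values from the sorting example, and the quartiles come from Python's `statistics.quantiles` with its default method.

```python
import statistics

def outlier_fences(values, inner_factor=1.5, outer_factor=3.0):
    """Compute the four fences from Q1, Q3, and the IQR.
    The factors 1.5 and 3.0 are the conventional choices (an assumption here)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "lower_outer": q1 - outer_factor * iqr,
        "lower_inner": q1 - inner_factor * iqr,
        "upper_inner": q3 + inner_factor * iqr,
        "upper_outer": q3 + outer_factor * iqr,
    }

def classify(value, fences):
    """Label a value as an extreme outlier, a mild outlier, or neither."""
    if value < fences["lower_outer"] or value > fences["upper_outer"]:
        return "extreme"
    if value < fences["lower_inner"] or value > fences["upper_inner"]:
        return "mild"
    return "not an outlier"

# Values reused from the sorting example.
values = [180, 156, 9, 176, 1827, 166, 171]
fences = outlier_fences(values)

print(fences)
for v in sorted(values):
    print(v, classify(v, fences))
```

Different tools compute quartiles in slightly different ways, so the fences from a spreadsheet or a statistics package may not match these exactly. Unlike the Z-score cutoff, this method still flags both 9 and 1827 on a sample this small.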
[Audio] The first step is knowing what type of anomaly you're up against. Being able to accurately categorize outliers sharpens the focus of automated anomaly detection and yields much better results. So let's talk about the three categories that outliers fall into: the first is global outliers, the second is contextual outliers, and the third is collective outliers.
[Audio] The first category is global outliers, so the next question that comes to mind is: what is a global outlier? Well, a data point or set of points is considered a global outlier if its values are far outside everything else in the dataset. Think of it this way: suppose a water pipe breaks in your neighborhood, causing your area's water consumption to go through the roof. If you compare it to the water consumption on all other days of the year, this would be a global outlier. But let's look at a global outlier in an actual business setting. Think about Zoom at the start of the pandemic: within a matter of days, the number of people using Zoom spiked. That was a global outlier when you compare those numbers to their pre-COVID user base. This is the ultimate example of a global outlier in a business context, and any business would love that.
[Audio] The second category of outliers is contextual, or conditional. A data point is considered a contextual outlier if its value deviates quite a lot from the rest of the data points that are in the same context. Note that the same value may not be considered an outlier if it occurs in a different context. For example, say that you live in a climate where your city gets snowstorms in the winter months. If a heavy snowstorm happens in the middle of summer, that would be considered a contextual outlier: the event is anomalous compared to the seasonal pattern, where it typically snows in winter but not in summer. We also see contextual outliers in business. Consider a sudden surge in order volume at an e-commerce company in the middle of the night. It's a contextual outlier because you wouldn't expect this high volume to occur outside daytime. Could this rush of sales be due to a pricing glitch? Well, this scenario has actually happened several times with airlines that offered flight tickets at wildly discounted prices. Though the offerings were a mistake, the airlines usually honored the prices and lost significant revenue on those seats.
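One simple way to picture the "same value, different context" idea is to compare a value only against the history of its own context. The Python sketch below does that with a Z-score per context. The hourly order counts are made-up numbers for illustration, and the function name and the cutoff of 3 are assumptions, not something taken from the video.

```python
import statistics

def is_contextual_outlier(value, context_history, cutoff=3.0):
    """Flag a value whose Z-score, measured against its own context only,
    exceeds the cutoff in either direction."""
    mean = statistics.mean(context_history)
    sd = statistics.stdev(context_history)
    return abs(value - mean) / sd > cutoff

# Hypothetical hourly order counts for an e-commerce site, grouped by context.
history = {
    "day":   [480, 510, 495, 530, 505, 490, 520, 500, 515, 485, 525, 498],
    "night": [20, 25, 18, 22, 30, 24, 19, 27, 21, 23, 26, 28],
}

new_order_count = 500  # the same value is checked against each context

for context, past_counts in history.items():
    flagged = is_contextual_outlier(new_order_count, past_counts)
    print(f"{new_order_count} orders during the {context}: outlier = {flagged}")
```

An hour with 500 orders is ordinary during the day but far outside the night-time pattern, so only the night-time check flags it.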
[Audio] This one's a little tricky to explain. We see collective outliers when a group of data points within a larger dataset is significantly different from the entire dataset, but each data point on its own wouldn't be considered anomalous in either a contextual or a global sense. We often see this when two time series that are related to each other are combined into a single anomaly. For each time series, the individual behavior doesn't deviate significantly from the normal range; however, the combined anomaly indicates a bigger issue. A simple example illustrates this kind of outlier. Suppose a whole block of people moved out of your neighborhood on the same day. This is a collective outlier because, although individual households move out from time to time, it's very unusual for an entire block to relocate at the same time. And now for a business case involving collective outliers. Imagine you're running an ad campaign. As your budget increases, you normally see an increase in both impressions and ad clicks. Suppose you increase your budget and you see the number of impressions increase, but you also see a decrease in the number of clicks. Neither the increase in impressions nor the drop in clicks is anomalous on its own, but when they happen together, it means that you have an issue with your campaign. Perhaps the ad exchange is serving an empty ad, or you're serving to the wrong audience. This is a pretty common example of a collective outlier in the ad world.
[Audio] Outlier detection improves business analysis in many ways. Outlier data points can represent either items that are so far outside the norm that they need not be considered, or a unique and singular category or variable that is worth exploring, either to capitalize on a niche or to find an area where an organization can offer a unique focus. When considering the use of outlier analysis, a business should first think about why it wants to find the outliers and what it will do with that data. That focus will help the business select the right method of analysis, graphing, or plotting to reveal the results it needs to see and understand. It is also important to recognize that, for certain datasets, the results will indicate that the outliers should be discounted, while in other cases the results will indicate that the organization should focus solely on those outliers. For example, if an outlier indicates a risk or a mistake, that outlier should be identified and the risk or mistake should be addressed. If an outlier indicates an exceptional result, such as a person who recovered from a particular disease even though most other patients did not survive, the organization will want to perform further analysis on the outlier to identify the unique aspects that may be responsible for the patient's recovery.
[Audio] So finally, we have come to the end of this series. Whether it's a drop in application usage, a price glitch, or a glitch in a marketing campaign, outliers of all kinds exist all around us. It's important to detect them because of the impact they have on our day-to-day life and in business. Thank you.