Continuation of Part 3.
Central Tendency
Given a histogram, how would you choose a number (or numbers) that accurately represents the typical salary?
The value at which the frequency is highest is called the mode. It certainly works for describing the distribution, since it is the most common value.
The value in the middle is called the median, and this will also work.
The average (mean) is a statistic that rests at the balance point of the distribution.
Mode
- Mode occurs with the highest frequency
- So what is the mode of [2, 5, 5, 9, 8, 3]?
- Answer = 5 because it occurs twice in the dataset
- When a histogram holds thousands of data points, the mode is the range (bin) that occurs with the highest frequency, because we cannot see the individual values but we can see which bin is tallest.
- When the entire histogram is at the same level, it is called a uniform distribution; such distributions have no mode.
- A distribution with two distinct peaks has two modes and is called bimodal; one with more than two peaks is multimodal.
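As a small illustration of the points above, here is a minimal Python sketch using `multimode` from the standard library's `statistics` module (Python 3.8+). The first list is the one from these notes; the second list is made-up data chosen to have two peaks.

```python
from statistics import multimode

# Dataset from the notes: 5 is the only value that repeats.
data = [2, 5, 5, 9, 8, 3]
modes = multimode(data)
print(modes)  # [5]

# Made-up data with two equally frequent values (a bimodal case).
bimodal = [1, 2, 2, 3, 7, 8, 8, 9]
bimodal_modes = multimode(bimodal)
print(bimodal_modes)  # [2, 8]
```

`multimode` returns every value tied for the highest frequency, so a list in which all values occur equally often comes back whole, which matches the idea that a uniform distribution has no single mode.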
Mode Explained
- Mode can be used to describe any type of data we have, whether it is numerical or categorical.
- Not all scores in the dataset affect the mode, only the repeated ones.
- If we take a lot of samples from the same population, the mode can be different for each sample.
- Mode changes with change in bin size.
- There is no equation for the mode. There is a procedure to find it, but we cannot describe it with an equation, since it really depends on how we present the data.
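The dependence on bin size can be seen with a short Python sketch. The helper name `modal_bin` and the values are hypothetical, chosen only for illustration, and the bins are assumed to start at 0.

```python
from collections import Counter

def modal_bin(values, width):
    """Return (start, end) of the bin with the highest count.

    Assumes non-negative integer values and bins of equal width starting at 0.
    """
    counts = Counter(v // width for v in values)
    index, _ = counts.most_common(1)[0]
    return (index * width, (index + 1) * width)

# Made-up data: 4 is the single most common value,
# but most of the values fall between 5 and 9.
values = [1, 3, 4, 4, 4, 6, 7, 8, 8, 9, 9, 12]

print(modal_bin(values, 1))  # (4, 5)
print(modal_bin(values, 5))  # (5, 10)
```

With a bin width of 1 the modal bin holds only the value 4, while a width of 5 moves the modal bin to the range 5 to 10, even though the data has not changed.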
- Mean
- Sum of all the values divided by the number of values
- If data is [1, 2, 3, 4, 5] then mean = (1 + 2 + 3 + 4 + 5)/5 = 3
- For a sample we write x̄ (x bar) = Σx / n, where lowercase n is the sample size
- For a population we write μ (mu) = Σx / N, where capital N is the population size
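The formula translates directly into code. This is a minimal Python sketch using the example data from the notes; the variable names are only illustrative.

```python
# Example data from the notes, treated here as a sample.
sample = [1, 2, 3, 4, 5]

n = len(sample)           # number of values (lowercase n for a sample)
x_bar = sum(sample) / n   # sum of x divided by n
print(x_bar)  # 3.0
```

The same arithmetic gives the population mean when the list holds every member of the population; only the symbols change.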
- Properties of Mean
- All scores in the distribution affect the mean.
- Think of the mean as a pivot keeping a scale balanced: if we add or remove a score, the scale goes off-balance and the mean has to be recalculated to re-balance it.
- The mean can be described with a formula.
- Many samples from the same population will have roughly similar means.
- The mean of the sample can be used to make inferences about the population it came from.
- The mean will change if we add an extreme value to the dataset.
- Such an extreme value is known as an outlier: a value that is unexpectedly different from the other observed values.
- Outliers create skewed distributions by pulling the mean towards the outlier, which makes the mean misleading as a typical value.
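The pivot idea and the pull of an outlier can both be checked numerically. This is a minimal Python sketch with made-up scores, not data from the course.

```python
from statistics import mean

# Made-up scores for illustration.
scores = [40, 45, 50, 55, 60]
m = mean(scores)

# Balance point: deviations below and above the mean cancel out.
deviations = [x - m for x in scores]
print(m, sum(deviations))  # 50 0

# Adding one extreme value pulls the mean far towards it.
with_outlier = scores + [500]
m_outlier = mean(with_outlier)
print(m_outlier)  # 125
```

After the outlier is added, the mean of 125 is larger than every score except the outlier itself, which is why it stops being a good description of a typical value.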
- Median
- Sort the data
- Find the middle value of the data
- Median of an even number of values is calculated by
- First sorting them
- Then we select the two middle values
- Take average of these two middle values
- When the data has an outlier, the median is not affected much by such departures from the norm; this property is called robustness, and the median is said to be robust.
Median Formula (X(i) is the value at position i in the sorted data, and n is the number of values)
- For even n: (X(n/2) + X(n/2 + 1)) / 2, i.e. find the two middle values and then take the average of those two values
- For odd n: X((n + 1)/2)
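The procedure and formula above can be sketched in Python as follows. The helper name `median_of` is hypothetical, and the function assumes a non-empty list of numbers. Note that Python lists are 0-based, so position n/2 in the formula becomes index n/2 - 1.

```python
def median_of(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)   # step 1: sort the data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, X((n + 1)/2) in 1-based terms.
        return ordered[mid]
    # Even n: average of the two middle values, X(n/2) and X(n/2 + 1).
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([1, 2, 3, 4, 5]))     # 3
print(median_of([2, 5, 5, 9, 8, 3]))  # 5.0
print(median_of([1, 2, 3, 4, 500]))   # 3
```

The last call shows the robustness mentioned above: replacing 5 with the outlier 500 leaves the median at 3.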
- Positively Skewed (high frequency towards the left, long tail towards the right)
- Mode will be towards the left, because that is where the highest frequency is
- Mean will be pulled towards the right by the few large, infrequent values in the right tail
- Median will lie between the mode and the mean
- So Mode is less than Median which is less than Mean (Mode < Median < Mean)
- Normally Distributed (frequency in centre)
- Mean will be equal to Median which will be equal to Mode (Mean = Median = Mode)
- Mode will occur in the centre bin where the frequency is the highest.
- Since the distribution is also symmetrical, the mean and the median will both occur pretty much right in the centre.
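Both orderings can be checked on small datasets. This is a minimal Python sketch using the standard `statistics` module; both lists are made-up values, one with a long right tail and one symmetric around its centre.

```python
from statistics import mean, median, mode

# Made-up positively skewed data: most values are low, a few are very high.
skewed = [20, 30, 30, 30, 40, 40, 50, 60, 90, 200]
skewed_order = mode(skewed) < median(skewed) < mean(skewed)
print(skewed_order)  # True

# Made-up symmetric data: frequencies mirror each other around 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
symmetric_equal = mode(symmetric) == median(symmetric) == mean(symmetric)
print(symmetric_equal)  # True
```

In the skewed list the mode is 30, the median is 40 and the mean is 59, so the single value 200 moves the mean well to the right while the mode and median stay near the bulk of the data.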
- Measure Of Centre
- Mean:
- Mean has simple equation
- Mean will always change if any data value changes
- Mean is not affected by change in bin size; it will always be the same, no matter how we visualize the data with the histogram
- Mean is affected severely by outliers
- Mean is not easy to find just by looking at the histogram
- Median
- Median does not have a simple equation
- Median will not always change if any data value changes
- Median is not affected by change in bin size
- Median is not affected severely by outliers
- Median is not easy to find just by looking at the histogram
- Mode
- Mode does not have an equation
- Mode will not always change if any data value changes
- Mode is affected by change in bin size
- Mode is not affected severely by outliers
- Mode is easy to find just by looking at the histogram, because it is the highest frequency
- It can be used to describe categorical data, such as gender or country of origin
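A few of the comparisons above can be verified with a short Python sketch. All values here are made up for illustration, including the list of countries.

```python
from statistics import mean, median, mode

# Change a single value (10 becomes 100) and compare the three measures.
before = [1, 2, 2, 3, 10]
after = [1, 2, 2, 3, 100]

print(mean(before), mean(after))      # 3.6 21.6  (mean always changes)
print(median(before), median(after))  # 2 2       (median unchanged here)
print(mode(before), mode(after))      # 2 2       (mode unchanged here)

# Mode also works on categorical data, where a mean is not defined.
countries = ["India", "Kenya", "India", "Brazil"]
top_country = mode(countries)
print(top_country)  # India
```

Here the changed value is also an outlier, so the same example shows the mean being affected severely while the median and mode stay put.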
- In an introductory statistics course, the same number of students scored below 75% as above 75% on the final exam. What shape(s) could the distribution of final exam scores have?
- This is another way of saying that 75% was the median score on the exam. All of these distributions can have a median of 75%:
- Uniform
- Normal
- Bimodal
- Positively Skewed
- Negatively Skewed
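To make the answer concrete, here is a minimal Python sketch with five tiny made-up sets of exam scores, one per shape. They are hypothetical values chosen so that each set has three scores below 75 and three above (or the equivalent balance), not real exam data.

```python
from statistics import median

# Hypothetical exam scores, one small dataset per distribution shape.
shapes = {
    "uniform": [55, 65, 75, 85, 95],
    "normal-like": [65, 70, 75, 75, 75, 80, 85],
    "bimodal": [60, 60, 60, 75, 90, 90, 90],
    "positively skewed": [70, 70, 72, 75, 80, 90, 100],
    "negatively skewed": [40, 55, 70, 75, 78, 80, 80],
}

medians = {name: median(scores) for name, scores in shapes.items()}
for name, value in medians.items():
    print(name, value)  # every shape prints a median of 75
```

The median only fixes the point that splits the scores in half, so it says nothing on its own about how the scores are spread on either side.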