Video Transcript
Hello everybody, welcome back to This Not That. This is a vlog where we talk about best practices and common mistakes that we see in the data/BI realm. I’m excited to be doing an interview today. We’re going to talk about measures of central tendency. So I’m excited to see what Spencer has to say about that.
Hobbs: Welcome back, everybody. As I said, today I am fortunate enough to have my friend and coworker Spencer joining us on TNT to talk about some of the ways we think about measures of centrality. So, Spencer, tell us a little bit about yourself.
Spencer: Thanks, Hobbs. I’m Spencer Cook. I’m a data scientist here at Valorem Reply. I am really interested in the intersection of statistics and computer science. Lately, I’ve also been pursuing Internet of Things (IoT), AI and specifically the democratization of AI and ML.
Hobbs: Ok. So I invited Spencer on to talk about a best practice from his perspective, in his corner of the data world, and the mistakes that he thinks it accounts for. So Spencer, what are your thoughts on this?
Spencer: So I want to peel back the layers of a pretty common topic, which is the Mean or Average. In almost every dashboard I see, there’s some form of average being measured or communicated to the user, and I feel like we can do better. If you think back to your elementary school math classes, there are two other measures of central tendency that they teach you that often get forgotten. Those are the Median and the Mode. So the Mean is obviously really great for communicating trends within a data set, but it has a couple of shortcomings.
A big one is that it has a tendency to absorb outliers. So if you have a bunch of data that’s in agreement and then you have a highly skewed point of data on one end, the Mean is going to move towards that outlier, and oftentimes that can distort what’s actually happening within the data set. The other issue [the Mean] has is that it doesn’t always represent an actual element within the data set. So let’s say that you have a very simple data set with the number 1 and the number 3. Well, your Mean is going to come out to be 2, which isn’t actually present. So if you sell a product that is one dollar and another that’s three, your average sale is two dollars, which doesn’t tell you your most popular product.
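As a minimal sketch of the two shortcomings described here, the snippet below uses Python's standard `statistics` module. The sales figures and variable names are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical daily sales figures that are broadly in agreement.
typical_sales = [10, 11, 9, 10, 12, 8]
# The same figures with one extreme value added on the high end.
with_outlier = typical_sales + [100]

mean_typical = mean(typical_sales)       # 10
mean_with_outlier = mean(with_outlier)   # pulled up to roughly 22.9

# Two products priced at $1 and $3: the Mean is 2,
# which is not the price of any product actually sold.
prices = [1, 3]
mean_price = mean(prices)
mean_is_real_price = mean_price in prices  # False

print(mean_typical, mean_with_outlier, mean_price, mean_is_real_price)
```

A single extreme value more than doubles the Mean here, even though six of the seven observations sit close to 10.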
So the Median, instead of being a calculation on top of the data set, is actually just a measure of the centermost point within the ordered set of data. So if you have products that are one dollar, three dollars and five dollars, and your median is three or your median is five, that tells you something meaningful about where your data is split. Similarly, the Mode is just telling you which is the most frequent element within your data set, and that is also going to be an element that’s present. Oftentimes, those two things in addition to the Mean can tell you more about what’s going on within the data set than just the Mean on its own.
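The Median and Mode Spencer describes can be sketched with the same standard-library module. The list of sale prices below is a made-up example for products costing $1, $3 and $5.

```python
from statistics import median, mode

# Hypothetical individual sales, one entry per item sold.
sale_prices = [1, 1, 1, 3, 3, 5, 5, 5, 5]

median_price = median(sale_prices)  # middle value of the sorted list
mode_price = mode(sale_prices)      # most frequently occurring value

# Both results are real prices from the data set.
median_is_real_price = median_price in sale_prices
mode_is_real_price = mode_price in sale_prices

print(median_price, mode_price)
print(median_is_real_price, mode_is_real_price)
```

One caveat: with an even number of values, `statistics.median` averages the two middle values, so the result may not be an element of the data. `statistics.median_low` always returns an actual element.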
Hobbs: I do a lot of the visual side of things, so let me think about this in a visual way. If in my head, I’m picturing data of some kind, a distribution of data, can you help me visually picture what the difference is between these different measures?
Spencer: Yeah. So let’s say that we have a data set that has 3 bell shapes. On the left you have a smaller bell, in the middle you have a medium bell and on the right you have another small bell. The Average is sort of going to take all of those, smash them together and find the average bell size and communicate that back. Whereas the Median is going to find a balance point within that data set that splits it evenly. So, if most of your data is in that central bell, you’re not getting skewed by the two on either edge.
Similarly, for Mode, let’s imagine that we have one big bell that comprises most of the data and then on the right we have a very narrow bell with a high peak. So that’s the most popular item, but it doesn’t really happen all that frequently compared to the overall scale of the data. Well, we’re not going to lose that information when we calculate the central tendency. It will still be present but also not affecting the rest of the data. So in that case having both the Mean and the Mode tells you more about what’s actually going on with your data.
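The picture of one broad bell plus a narrow, tall peak can be approximated with a small constructed data set. This is only a sketch: the shapes, centres and counts are assumptions picked to make the effect visible, not real data.

```python
from statistics import mean, median, mode

# Hypothetical broad bell centred on 50: counts rise from 1 at the
# edges (40 and 60) to 11 at the centre.
broad_bell = [v for v in range(40, 61) for _ in range(11 - abs(v - 50))]
# Hypothetical narrow spike on the right: one value, 90, that occurs
# more often than any single value in the broad bell.
narrow_spike = [90] * 15

values = broad_bell + narrow_spike

data_mean = mean(values)      # nudged right of 50 by the spike
data_median = median(values)  # stays close to the centre of the big bell
data_mode = mode(values)      # lands on the spike itself

print(round(data_mean, 1), data_median, data_mode)
```

The Mean and Median stay near the large bell, while the Mode reports the narrow peak, so showing the Mode alongside the Mean keeps the "most popular item" visible.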
Hobbs: Ok. When you think about presenting these to end users, are you imagining that you would display all three or just Median or just Mode? When would you want more than one available to someone?
Spencer: The Mode is very helpful whenever you’re looking at let’s say like a day or a month in time and you’re interested in trends. "What’s the most popular basket for your product", that kind of thing. The Median can often be misleading on its own because it’s so similar to the Mean. So, what I like to do is actually create a KPI that is the Mean minus the Median and that tells you the direction of the skew in your data set. So you could make that be sort of a two dimensional KPI where you have the sign, a plus or minus and that could be say red or green, and then the intensity of the skew sort of dictates how much opacity that KPI has. And so that lets you know at a glance, we’re trending in this specific direction. You can now say, 'we’re skewing toward higher sales or lower sales' and react accordingly.
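A rough sketch of the Mean-minus-Median KPI follows. The function name `skew_kpi`, the green/red mapping, and the choice to scale opacity by the standard deviation are all assumptions made for illustration; the interview does not specify how the intensity is calculated.

```python
from statistics import mean, median, pstdev

def skew_kpi(values):
    """Return (gap, color, opacity) for a Mean-minus-Median KPI."""
    gap = mean(values) - median(values)
    # Assumed mapping: a positive gap (skew toward higher values) is green,
    # a negative gap is red.
    color = "green" if gap >= 0 else "red"
    # Assumed intensity: the size of the gap relative to the population
    # standard deviation, capped at 1.0 so it can be used as an opacity.
    spread = pstdev(values)
    opacity = min(abs(gap) / spread, 1.0) if spread > 0 else 0.0
    return gap, color, opacity

# Hypothetical sales with a few unusually high values.
high_gap, high_color, high_opacity = skew_kpi([10, 10, 11, 12, 50])
# Hypothetical sales with one unusually low value.
low_gap, low_color, low_opacity = skew_kpi([1, 40, 41, 42, 43])

print(high_color, round(high_opacity, 2))
print(low_color, round(low_opacity, 2))
```

Dividing by the standard deviation keeps the opacity comparable across data sets with different scales; other normalisations would work equally well for a dashboard.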
Hobbs: I like that, because then you are stating what the center is but also which direction do the outliers tend to lie in. And you can base your business decisions around those tendencies.
Spencer: Yep, exactly. Instead of getting frustrated with your salespeople because the data isn’t in the direction you want it, you can actually measure trends so the outliers are properly accounted for.
Hobbs: Thank you so much Spencer.
Spencer: Thank you.
Hobbs: To everyone out there in the audience, this is TNT. We got to talk some today about measures of central tendency and how to think about what each of those numbers represents: a sort of mathematical truth in your data, and then how to compare and contrast them. We even talked about changes in those tendencies and relationships over time.
If you enjoyed this, we would love to have you follow us on social media or leave comments below about things that you’d like us to talk about or topics that are of interest to you.
Additionally, if you’re working on a data project and you’d like to get together with us and talk about what we could do as a company for you, we would be very interested in that. We can come in as consultants and do advising, or we can help you out with projects that are having some difficulties. That’s the sort of thing that we love to help with.
Hope you have a good week and we will see you next time.