What is the only thing that stands between you and your wildest dreams? If it's a constant stream of small donations from strangers worldwide, then this blog post is for you. Thanks to a Kaggle dataset amalgamating 392,000 completed projects, we've broken down the 5 easy steps between you and your unsustainable dream.
And if this freeze frame doesn't line up with your situation, then here is an interactive, choose-your-own-destiny version.
My wife and I have been spending more time than usual looking up baby names. Going from website to website, hoping to find some inspiration.
This didn't appease my analytical, data-visualization desires. To combat it, I did some digging. The Social Security Administration has interesting data on male and female name popularity in the US dating back to 1880. I used this not only to see a list of the most popular baby names over the last 150 years, but also to see how my name has stood the test of time.
In Part one (which you can see on LinkedIn) I explained how to get this data and what to do with the important information.
This is part two, a look at some of the fun stuff.
The first thing in the file that I was immediately curious about was my friend count. Facebook lists the added and removed dates of all your friends.
One interesting thing in the data is that your added friends list doesn't include people who were added and then deleted, which is unfortunate. But oh well. Life goes on. Below you can see a common trend: friends added gradually, and big days where friendships are ended. Also note the chart has two different axes. I'm not a big fan of doing that, but sometimes it just makes sense.
Messaging was a bit trickier but thanks to an awesome tool that can be found here I was able to get all the messages into a mostly useable format. Besides looking over some past antics, I wanted to see a few interesting things.
One theory I had was that, in my infinite maturity, I would now be typing sophisticated messages instead of quick quips. But that turned out to be a miss.
I underestimated something. In 2011 I got my first smartphone (late to the game, I know). Which means my primary messaging device suddenly sucked to type on. And thus, the length plummets.
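A check like this one can be sketched in a few lines. This is only a rough illustration, not the workflow used here (the charts were built in Tableau): it assumes the messages have already been parsed into (timestamp, text) pairs, and both the function name and the sample messages are made up.

```python
from collections import defaultdict
from datetime import datetime


def average_length_by_year(messages):
    """Return {year: mean character count} for (timestamp, text) pairs."""
    totals = defaultdict(lambda: [0, 0])  # year -> [total chars, message count]
    for sent_at, text in messages:
        bucket = totals[sent_at.year]
        bucket[0] += len(text)
        bucket[1] += 1
    return {year: chars / count for year, (chars, count) in sorted(totals.items())}


# Made-up sample messages, purely for illustration.
sample = [
    (datetime(2010, 3, 1), "Are you coming to the game tonight?"),
    (datetime(2010, 6, 9), "Sounds good, see you there at seven."),
    (datetime(2012, 1, 5), "k"),
    (datetime(2012, 2, 7), "omw"),
]

by_year = average_length_by_year(sample)
for year, mean_length in by_year.items():
    print(year, round(mean_length, 1))
```

Plotting the resulting yearly averages as a line is enough to see whether the length climbs with maturity or drops with a new phone.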
What I really wanted to look at was my history with individual people, and one of the most interesting patterns was the one between my wife and me. We see each other a bunch (obviously) and primarily use Messenger to send each other images when we're lying on opposite couches. But there are some noticeable exceptions.
There's a lot more you could do with the data, but finding any other pattern was pretty challenging. The tool's creators have a really great view that you can see and use at this link.
It's probably obvious by now that my go-to tool is Tableau, but I have dabbled in a few others. G2Crowd has an excellent annual report that surveys end users and aggregates the results. See how Domo, Board, QlikView and Tableau rank in the BI landscape. Less exciting, but also see if you can spot my first ever viz-in-a-tooltip.
Shut up and show me the dashboard
I've had an interesting use case over the winter.
Next week I'm competing at the Quebec Winter Triathlon. It's my first one. I've spent almost as much time looking at last year's results as I have training. And as I'm looking at last year's results I can't help but compare my current training to the field and figure out how I'll fare (spoiler - not great).
And I don't think this is a totally crazy use case of analytics: looking at a dataset and wanting to input your own data.
Let's see what that might look like.
First - the winter triathlon.
It's a 25 KM race that involves Snowshoeing, Skating and Skiing - in that order.
There's an elite and an age class category. Obviously I'm not elite, so let's look at the Age Class. Last year there were 75 racers, the majority were men and, somewhat surprisingly, the 40-60 year olds dominated.
I took a look at each of the disciplines to see how people ranked compared to their overall finish time. A fun side note - the finish time also includes transition time, so Snowshoe + Skate + Ski does not equal the Finish time. That threw me for a loop when doing data validation.
Somewhat interesting is that the Ski is most correlated to the finish. But that kind of makes sense considering it takes the longest to complete.
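For anyone who wants to reproduce this kind of check outside Tableau, here is a minimal sketch. The racer splits below are invented for illustration (the real results are not reproduced here), and the field names are my own assumptions. It computes a Pearson correlation between each discipline and the finish time, and also backs out the transition time as the finish minus the three splits.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented splits in minutes; "finish" includes transition time.
racers = [
    {"snowshoe": 28.0, "skate": 27.0, "ski": 30.0, "finish": 89.0},
    {"snowshoe": 31.0, "skate": 29.5, "ski": 34.0, "finish": 99.0},
    {"snowshoe": 35.0, "skate": 33.0, "ski": 41.0, "finish": 115.0},
    {"snowshoe": 30.0, "skate": 31.0, "ski": 36.0, "finish": 101.5},
]

finish = [r["finish"] for r in racers]
legs = ("snowshoe", "skate", "ski")

# Correlation of each discipline's split with the overall finish time.
correlations = {leg: pearson([r[leg] for r in racers], finish) for leg in legs}

# Transition time is whatever is left after the three splits.
transitions = [r["finish"] - sum(r[leg] for leg in legs) for r in racers]

for leg, value in correlations.items():
    print(leg, round(value, 3))
```

Computing the transition column first is also a quick way to catch the validation surprise mentioned above.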
I can only spend so much time looking at this before selfishly thinking of myself. How would I have finished based on these results? Where would I rank in last year's race?
In Tableau you need a row in order to do anything. With this database I got lucky, as it has a dummy participant's row, so I've simply hijacked that. You can also use a Tableau union to append a blank txt or csv file, which will generate a new row.
My first bit of code replaces the dummy row:
IF CONTAINS([Name], "Tirage au Sort")
THEN [Name Param]
ELSE [Name]
END
So if it's the dummy row, replace it with the value of the name parameter; otherwise use the name. You need to do this for every field you want to input. For me it looks like this:
And now to enter my abysmal times. The Snowshoe and Ski are easy. My snowshoe is brutal and I'm coming in at about 31 minutes. Hopefully that's due to my courses and fatigue, or else I'll be starting well back. The Ski is significantly more competitive. I did 11K the other day in 39 minutes, which translates to a 33:10 9K. The Skate I have no idea. I've been skating, but it's hard to do 11K on a 200 meter track in a thick crowd. But I used to be good at skating, so I'll generously give myself a mid-pack time of 29:30.
So those times put me 13th. Which is ludicrous. But a couple of things that I'm sure you're noticing as well:
It doesn't include transition time, which is why my distance ranks seem funky compared to the overall. That could easily tack on 5 minutes.
I haven't factored in fatigue of doing the three events back to back, which will surely be significant.
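The same what-if can be sketched outside Tableau. This is a rough illustration only: the field of finish times is invented, the helper names are hypothetical, and the 5 minute transition is just the guess mentioned above. Only the three splits (31:00, 29:30, 33:10) come from my own estimates.

```python
def to_minutes(mm_ss):
    """Convert an 'MM:SS' string to minutes as a float."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) + int(seconds) / 60


def rank_in_field(my_total, field_totals):
    """1-based rank: one more than the number of racers who were faster."""
    return 1 + sum(1 for total in field_totals if total < my_total)


# My estimated splits from the post.
my_splits = {"snowshoe": "31:00", "skate": "29:30", "ski": "33:10"}
my_total = sum(to_minutes(value) for value in my_splits.values())

# Assumption: transitions "could easily tack on 5 minutes".
transition_guess = 5.0

# Invented finish times in minutes, standing in for last year's field.
field_totals = [85.0, 90.5, 92.0, 95.0, 97.5, 101.0, 110.0]

rank_without_transition = rank_in_field(my_total, field_totals)
rank_with_transition = rank_in_field(my_total + transition_guess, field_totals)

print(rank_without_transition, rank_with_transition)
```

Against this made-up field, adding the transition guess costs two places, which is the kind of sensitivity the parameters make easy to explore.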
With this kind of format you can play with it as much as you want, tweaking your inputs to see how you compare to a field. It's a very interesting way to compare your current results to a historical dataset.
Hey! The Winter Olympics kick off this week, and in the spirit of gold medals, here are three quick charts that look at the history of Winter Olympic success.
All Time Ranks
Word maps really interest me, and with the country abbreviations being so common I thought this was a perfect use case for one. You can fit a lot of obvious info into each cell. Here the colours represent the rank of the country and the size is the % of medals won that year.
It's hard to remember year over year which countries are the favorites in each event. Here's a simple text table that shows the # of all medals won by country for each event.
Host Country Performance
Lastly - does hosting make a significant difference in medal count? Check out all the host countries' medal totals over time. The answer, as with any great stats question, is "kind of".
- Don't force an Area chart. Line charts are often the more legible option.
- Help your consumers with consistent reference marks.
- The most powerful aspect of an area chart is its ability to give a total, but this can distract from the individual variables.
Area charts emphasize the combined performance of a cohort rather than its individual pieces. If that is your key takeaway for an audience, then an Area chart can be put to good use. However, Area charts often put too much onus on the end users while carrying some pretty subtle weaknesses, which makes them, in my opinion, one of the more challenging charts to use effectively.
Here’s an area chart of total volume of alcohol sold in Canada from 2002 to 2016.
It’s showing the growth of three common alcoholic beverage types (+ other, which is coolers, ciders etc.) from 2002 to 2016.
At a glance we can tell that total consumption has increased since 2002, having plateaued a bit between 2009 and 2014. We can see that Other is quite obviously the lowest type and beer is the highest.
At a glance, those may be the only things we can say for certain based on this one view.
Let’s quickly compare this to a line chart showing the same information.
The line chart, in this circumstance, drives your attention to some of the big stories. Wines have closed a big gap on spirits. We can also see there's been a slight uptick in Other, and Beer isn't growing much at all. This offers a much deeper analysis than the Area chart. You can also see that all sectors are growing, which leads you to conclude that the industry is growing.
When to actually use Area Charts
Area charts start to add value if you're forced to compare many variables. Take a look at the beer by province chart below. It's showing a lot while being just shy of too busy. You can quickly see that:
Ontario has 65% of Beer sales
British Columbia, Quebec and Ontario make up 75% of Beer Sales
It also shows the relative insignificance in size of the Maritimes and territories (despite our best efforts).
This becomes total chaos on a line chart, which has a bunch of clutter, shows the smaller provinces as dwarfed by the bigger players and, overall, is a big mess.
Let's take the scenario one step further. Let's now compare Provincial Sales of all alcohol side by side.
The area charts, side by side, give a lot of interesting information. They all use the same scale, and with the reference lines following from one to the other it's easy to have a dialogue with this chart. Look at the jump for Quebec Wine compared to Saskatchewan, which drops off completely.
How about the higher variability of "other" which sees significantly more peaks and troughs than the other types.
What's most interesting is the amount of information you can convey with this in a relatively small space. The impact of each mark means that you can get more with less to tell the story.
Most important is what this doesn't tell. There's no way to compare total volume by type from province to province. You don't know year over year growth in any of these provinces. The Area chart forces a certain narrative, and it's necessary to determine if the chart matches the message (which is a great rule of thumb).
Area charts are so frequently included in my rough sketches of dashboards and then fail to make the final cut. They unfortunately don't have a wide range of use cases that can't be better done by line or stacked bar charts. But if you can find the right place to intuitively use one they can be a powerful accent in your analysis.
I've created a full dashboard if you're interested in diving into this information further. It is best viewed on my Tableau Public Site.
One of my customers said to me the other day, "You data scientists are all the same, you hate Pie Charts". And we do. Rightfully so. They are an executive's favorite terrible method to display data. "I want to see sales by province in a pie chart" ... but that's 10 values (13 if you want territories, which the BA was vague about), so it will be barely legible. Oh well. You're told to make it happen anyway.
But this generalization about a data scientist's hate for Pie Charts caught me off guard. It got me thinking. Could you make an executive dashboard with only pie charts? The ultimate corner office dashboard.
Could you? Yes.
Should you? If you really have to ask, then get out.
Feel free to find me on twitter and let me know your favorite Pie Chart use! @OffTheChartsC
I built out a totally fictitious pie franchise based in Central/Western Canada. The dataset is an aggregation of sales and profits for the last 12 months. My Pie franchise keeps things simple. We sell the five best flavours of Pie: Cherry, Lemon (meringue), Pumpkin, Blueberry and Apple. Sales are sporadic (almost totally random, oddly) across the provinces, months and types. There's also Target sales that are tracked provincially and by flavour, so that the VP can hammer on someone if sales trail off.
View 1 - Executive Summary.
First thing we need is an overall view of which type of Pie is meeting its Target, and by how much. To do this I made two calculated fields. One was Target minus Sales for when Sales were less than Target. The other was Target minus Sales for when Sales were greater than Target. Basically, this creates separate fields for above and below target. Then I added Target itself, put Measure Values on as the pie slice, and Measure Names as the colour. Because of the separate fields for below and above, only one or the other would show, along with Target. The last little thing was ordering the measures correctly so that "Above" would trend to the right and "Below" to the left. The final result:
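The logic of the two calculated fields can be sketched in Python. This is an approximation of the idea rather than the actual Tableau calculations: it assumes each gap is stored as a positive size (so it can be drawn as a slice) and is zero when it doesn't apply. The function name and the sales figures are made up.

```python
def target_slices(sales, target):
    """Split a sales-vs-target comparison into pie slice sizes.

    Assumption: each gap is a positive size and is 0 when it doesn't apply,
    so only one of "Below" / "Above" is ever visible next to "Target".
    """
    below = max(target - sales, 0)  # shortfall when Sales < Target
    above = max(sales - target, 0)  # surplus when Sales > Target
    # Order matters for the pie: Below first, Target in the middle, Above last.
    return {"Below": below, "Target": target, "Above": above}


# Made-up figures for two flavours.
flavours = {
    "Cherry": {"sales": 80, "target": 100},
    "Apple": {"sales": 120, "target": 100},
}

slices = {name: target_slices(f["sales"], f["target"]) for name, f in flavours.items()}
for name, parts in slices.items():
    print(name, parts)
```

With one of the two gap fields always at zero, each pie shows Target plus a single extra slice, and the fixed ordering keeps "Below" and "Above" on opposite sides.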