<h1>A Detailed Guide to ggplot colors</h1>
<p>Michael Toth &middot; 2019-05-14</p>
<p>Once you've figured out how to create the standard <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, <a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">bar charts</a>, and <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in <code>ggplot</code>, the next step to really elevate your graphs is to master working with color.</p>
<p>Strategic use of color can really help your graphs to stand out and <a href="https://michaeltoth.me/10-steps-to-better-graphs-in-r.html">make an impact</a>.</p>
<p>In this guide, you'll learn how to incorporate your own custom color palettes into your graphs by modifying the base <code>ggplot</code> colors.</p>
<h2>By the end of this tutorial, you’ll know how to:</h2>
<ul>
<li>Change all items in a graph to a static color of your choice</li>
<li>Differentiate between <strong>setting a static color</strong> and <strong>mapping a variable in your data to a color palette</strong> so that each color represents a different level of the variable</li>
<li>Customize your own continuous color palettes using the <code>scale_color_gradient</code>, <code>scale_fill_gradient</code>, <code>scale_color_gradient2</code>, and <code>scale_fill_gradient2</code> functions</li>
<li>Customize your own color palettes for categorical data using the <code>scale_color_manual</code> and <code>scale_fill_manual</code> functions</li>
</ul>
<h2>Introducing video tutorials!</h2>
<p>I'm also excited to try something new in this guide! I'll be adding video tutorials to accompany the content, so please let me know what you think about these and if you find them helpful. I'd love to do more of this in the future if you find them valuable!</p>
<h2>Get my free workbook to build a deeper understanding of <code>ggplot</code> colors!</h2>
<p>Have you ever read a tutorial or guide and learned a bunch of interesting things, only to forget them shortly after you finished reading? </p>
<p>Me too. And it's really annoying!</p>
<p>Unfortunately, our brains aren't good at remembering what we read. We need to think critically and be engaged in solving problems to learn information so it sticks.</p>
<p>That's why I've created a free workbook to accompany this post. The workbook is an R file that includes additional questions and exercises to help you engage with this material. </p>
<p><a href="https://mailchi.mp/2c40b4d25a09/ggplot-colors">Get your free workbook to master working with colors in <code>ggplot</code></a></p>
<h2>A high-level overview of <code>ggplot</code> colors</h2>
<p>By default, <code>ggplot</code> graphs use a black color for lines and points and a gray color for shapes like the rectangles in bar graphs.</p>
<p>Sometimes this is fine for your purposes, but often you'll want to modify these colors to something different. </p>
<p>Depending on the type of graph you're working with, there are two primary attributes that affect the colors in a graph. </p>
<p>You use the <code>color</code> attribute to change the <em>outline</em> of a shape and you use the <code>fill</code> attribute to fill the <em>inside</em> of a shape. </p>
<p>Specifically, we use the <code>color</code> attribute to change the color of any points and lines in your graph. This is because points and lines are 0- and 1-dimensional objects, so they don't have any inside to fill! </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>iris<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> Sepal.Width<span class="p">,</span> y <span class="o">=</span> Sepal.Length<span class="p">),</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In contrast, bars and other 2-dimensional shapes <em>do</em> have an inside to fill, so you will be using the <code>fill</code> attribute to change the color of these items in your graph:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Side note: technically you can also use the <code>color</code> attribute to change the outline of shapes like bars in a bar graph. I use this functionality very rarely, and for the sake of simplicity I will not go into this in further detail in this guide.</p>
<p>Except for the difference in naming, <code>color</code> and <code>fill</code> operate very similarly in <code>ggplot</code>. As you'll see, the functions that exist for modifying your <code>ggplot</code> colors all come in both <code>color</code> and <code>fill</code> varieties. But before we get to modifying the colors in your graphs, there's one other thing we need to touch on first.</p>
<h2>Modifying <code>ggplot</code> colors: static color vs. color mapping</h2>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/c7Smep_qXfA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>We need to distinguish between two different ways of modifying colors in a <code>ggplot</code> graph. The two things we can do are:</p>
<ol>
<li><em>setting a static color</em> for our entire graph</li>
<li><em>mapping a variable to a color</em> so each level of the variable is a different color in our graph</li>
</ol>
<p>In the earlier examples, we used a static color (red) to modify all of the points and bars in the two graphs that we created. </p>
<p>It's often the case, however, that we want to use color to convey additional information in our graph. Usually, we do this by mapping a variable in our dataset to the <code>color</code> or <code>fill</code> aesthetic, which tells <code>ggplot</code> to use a different color for each level of that variable in the data. </p>
<p>Setting a static color is pretty straightforward, and you can use the two examples above as references for how to accomplish that. </p>
<p>In the rest of this guide, I'm going to show you how you can map variables in your data to colors in your graph. You'll learn about the different functions in <code>ggplot</code> to set your own color palettes and how they differ for continuous and categorical variables.</p>
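<p>To make that distinction concrete before we move on, here's a minimal sketch using the built-in <code>iris</code> data. A static color goes <em>outside</em> <code>aes</code>, while a color mapping goes <em>inside</em> <code>aes</code>:</p>

```r
library(ggplot2)

# Static color: specified outside aes(), so every point is red
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = 'red')

# Color mapping: specified inside aes(), so each species gets its own color
p <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = Species))
p
```

<p>One common pitfall: writing <code>aes(color = 'red')</code> does not give you red points. It maps the constant string <code>"red"</code> as a one-level variable, producing a legend and a default color instead.</p>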
<h2>Working with Color Palettes for Continuous Data</h2>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/7cQqA5ibXj4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Let's start with a simple example of the default continuous color palette in <code>ggplot</code>. First, we'll generate some random data that we'll use for our graph.</p>
<div class="highlight"><pre><span></span>df <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>
x <span class="o">=</span> runif<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 uniformly distributed random values</span>
y <span class="o">=</span> runif<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 uniformly distributed random values</span>
z1 <span class="o">=</span> rnorm<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 normally distributed random values</span>
z2 <span class="o">=</span> <span class="kp">abs</span><span class="p">(</span>rnorm<span class="p">(</span><span class="m">100</span><span class="p">))</span> <span class="c1"># 100 normally distributed random values mapped to positive</span>
<span class="p">)</span>
</pre></div>
<h4>On sequential color scales and <code>ggplot</code> colors</h4>
<p>When we map a continuous variable to a color scale, we map the values for that variable to a color gradient. You can see the default <code>ggplot</code> color gradient below. </p>
<p><center>
<img alt="Sequential color gradient from dark to light blue" src="../images/20190416_ggplot_colors/gradient_1.png" width="600px" />
</center></p>
<p>This is called a sequential color scale, because it maps data sequentially from one color to another. The minimum value in your dataset will be mapped to the left side (dark blue) of this sequential color gradient, while the maximum value will be mapped to the right side (light blue).</p>
<p>You can imagine stretching a number line across this gradient of colors. Then, for every value in your data, you find it on the number line, take the color at that location, and graph using that resulting color.</p>
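<p>You can mimic this number-line lookup yourself in base R with <code>colorRamp</code>. As a sketch (assuming <code>ggplot</code>'s default dark and light blue endpoints, <code>#132B43</code> and <code>#56B1F7</code>), rescale your values to [0, 1] and read off the color at each position:</p>

```r
# Build a color ramp between ggplot's (assumed) default gradient endpoints
ramp <- colorRamp(c("#132B43", "#56B1F7"))

vals <- c(0.2, 1.4, 2.6)                                # example data values
scaled <- (vals - min(vals)) / (max(vals) - min(vals))  # rescale to [0, 1]

# Look up each position on the gradient and convert back to hex colors
cols <- rgb(ramp(scaled), maxColorValue = 255)
cols  # the minimum maps to dark blue, the maximum to light blue
```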
<p>Let's see how this works in practice. Using the random data we generated above, we'll graph a <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plot</a> of the x and y variables. To illustrate the color gradient, we'll map the z2 variable to the <code>color</code> aesthetic: </p>
<div class="highlight"><pre><span></span><span class="c1"># Default colour scale colours from dark blue to light blue</span>
g1 <span class="o"><-</span> ggplot<span class="p">(</span>df<span class="p">,</span> aes<span class="p">(</span>x<span class="p">,</span> y<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> z2<span class="p">))</span>
g1
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/default_continuous-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Note the contrast between this syntax and the syntax before where we set a static color for our graph. Here, we aren't specifying the color to use, we're simply telling <code>ggplot</code> to map the <code>z2</code> variable to the color aesthetic by including the mapping <code>color = z2</code> within the <code>aes</code> function. </p>
<p>In the dataset that I created, the minimum value for the <code>z2</code> variable is 0.0024422, while the maximum value is 2.6241346. All values--and therefore all colors--fall between these minimum and maximum levels. </p>
<h4>Modifying our <code>ggplot</code> colors for continuous data using scale_color_gradient</h4>
<p>Now that you understand how <code>ggplot</code> can map a continuous variable to a sequential color gradient, let's go into more detail on how you can modify the specific colors used within that gradient. </p>
<p>Instead of the default blue gradient that <code>ggplot</code> uses, we can use any color gradient we want! To modify the colors used in this scale, we'll be using the <code>scale_color_gradient</code> function to modify our <code>ggplot</code> colors.</p>
<p>Side note: if we were instead graphing bars or other fillable shapes, we would use the <code>scale_fill_gradient</code> function. For brevity, I won't be including an example of this function. It operates in exactly the same way as the <code>scale_color_gradient</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>Using the same graph from before, we simply add a call to the <code>scale_color_gradient</code> function to modify our color palette. Here, we can specify our own values for <code>low</code> and <code>high</code> to customize the gradient of colors in our graph. In this case, we'll be mapping low values to greenyellow and high values to forestgreen. </p>
<div class="highlight"><pre><span></span>g1 <span class="o">+</span>
scale_color_gradient<span class="p">(</span>low <span class="o">=</span> <span class="s">'greenyellow'</span><span class="p">,</span> high <span class="o">=</span> <span class="s">'forestgreen'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/scale_color_gradient-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>By modifying the values you're passing to the <code>scale_color_gradient</code> function, you can create a sequential color scale between any two colors! </p>
<p>Under the hood, <code>ggplot</code> was already using this color scale with the dark blue and light blue colors that show up by default. By adding this color scale to the graph and specifying your own colors, you're simply overriding the default values that <code>ggplot</code> was already using.</p>
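<p>If you do need the <code>fill</code> variant, here's a quick sketch using <code>faithfuld</code>, a dataset bundled with <code>ggplot2</code> whose <code>density</code> column is continuous:</p>

```r
library(ggplot2)

# scale_fill_gradient works just like scale_color_gradient, but for fills
p <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_tile() +
  scale_fill_gradient(low = 'greenyellow', high = 'forestgreen')
p
```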
<h4>On diverging gradient scales and <code>ggplot</code> colors</h4>
<p>Sequential color scales are great when you want to easily differentiate between low and high values in a dataset.</p>
<p>Sometimes, however, that's not what you want. Sometimes you want to look at deviations from a certain baseline value, and you care about distinguishing both positive and negative deviations. For this type of data, we use what's called a diverging color scale. </p>
<p>A diverging color scale creates a gradient between three different colors, allowing you to easily identify low, middle, and high values within your data. You can see an example of a diverging color scale below.</p>
<p><center>
<img alt="Continuous color gradient from blue to red" src="../images/20190416_ggplot_colors/gradient_2.png" width="600px" />
</center></p>
<p>In this color scale, we see that blue is associated with values on the low end, white with values in the middle, and red with values on the high end. Among other things, this type of scale is often used when presenting United States presidential election results.</p>
<p>Instead of the <code>scale_color_gradient</code> function that we used for a sequential color palette, we're now going to use the <code>scale_color_gradient2</code> to produce a diverging palette.</p>
<p>Side note: Again, there is a similar function called <code>scale_fill_gradient2</code> that we would use if we were instead graphing bars or other fillable shapes. I won't be including an example of this function, but it operates in exactly the same way as the <code>scale_color_gradient2</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>As before, we tell the <code>scale_color_gradient2</code> function which colors to map to low and high values of our variable. In addition, we also specify a color to map the mid values to. As in the color scale we just reviewed, we'll use the blue-white-red color palette for this example:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>df<span class="p">,</span> aes<span class="p">(</span>x<span class="p">,</span> y<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>colour <span class="o">=</span> z1<span class="p">))</span> <span class="o">+</span>
scale_color_gradient2<span class="p">(</span>low <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> mid <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> high <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>While you can technically specify any 3 colors for a diverging color scale, the convention is to use a light color like white or light yellow in the middle and darker colors of different hues for both low and high values, like we've done here. </p>
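<p>One extra knob worth knowing: <code>scale_color_gradient2</code> also takes a <code>midpoint</code> argument (default 0) that controls which data value maps to the middle color. A sketch with simulated data:</p>

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(100), y = runif(100), z1 = rnorm(100))

# midpoint = 0 means values near 0 plot as white,
# negative values shade toward blue, positive values toward red
p <- ggplot(df, aes(x, y)) +
  geom_point(aes(color = z1)) +
  scale_color_gradient2(low = 'blue', mid = 'white', high = 'red',
                        midpoint = 0)
p
```

<p>Shifting <code>midpoint</code> to, say, the mean of your variable recenters the whole diverging scale around that baseline.</p>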
<h2>Working with Color Palettes for Categorical Data</h2>
<p>When working with continuous data, each value in your dataset was automatically mapped to a value on a 2-color sequential gradient or 3-color diverging gradient, as we just saw. The goal was to show a smooth transition between colors, highlighting low and high values or low, middle, and high values in the data.</p>
<p>When working with categorical data, each distinct level in your dataset will be mapped to a distinct color in your graph. With categorical data, the goal is to have highly differentiated colors so that you can easily identify data points from each category.</p>
<p>There are built-in functions within <code>ggplot</code> to generate categorical color palettes. That said, I've always preferred the control I get from generating my own, and that's what I'm going to show you how to do here.</p>
<h4>My favorite tool for building categorical color palettes: Color Picker for Data</h4>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/QBROdDKzQoY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My favorite way of generating beautiful color palettes is to use Tristen Brown's tool <a href="http://tristen.ca/hcl-picker/#/hlc/6/1.05/603548/D4E966">Color Picker for Data</a>. It offers an intuitive visual interface to build and export a color palette that you can use directly within <code>ggplot</code>.</p>
<h4>Mapping Categorical Data to Color in <code>ggplot</code></h4>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/h8dn6nbCznQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>For this example, we'll be working with the mtcars dataset. We're going to create a scatter plot of weight and miles per gallon. Then, we'll use the color aesthetic to map 4-, 6-, and 8-cylinder engines each to a different color using the default <code>ggplot</code> colors:</p>
<div class="highlight"><pre><span></span>g2 <span class="o"><-</span> ggplot<span class="p">(</span>mtcars<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)))</span>
g2
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/categorical_colors-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Similar to how we worked with continuous data, we simply map a variable to the color aesthetic by including the code <code>color = your_variable</code> within the <code>aes</code> function of our <code>geom_point</code> call. </p>
<p>The one caveat is that here we're converting the <code>cyl</code> variable to a factor before we do this mapping. Because <code>cyl</code> is recorded as a numerical variable, it will by default map to the color gradients we saw before, which isn't what we want in this case, as we're treating <code>cyl</code> as a categorical variable with 3 levels. Remember, just because a variable happens to have numeric values does not necessarily mean it should be mapped as a continuous scale! </p>
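<p>You can see the difference by building the graph both ways. Here's a sketch contrasting the numeric and factor versions of <code>cyl</code>:</p>

```r
library(ggplot2)

# Numeric cyl: ggplot assumes continuous data and uses a blue gradient
p_cont <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = cyl))

# factor(cyl): ggplot treats it as categorical, one distinct hue per level
p_disc <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# The discrete version draws exactly three distinct colors
length(unique(ggplot_build(p_disc)$data[[1]]$colour))
```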
<p>We know how we can map a categorical variable to the color aesthetic to produce different colors in our graph for each level in our dataset. How can we modify those colors to a color palette of our choice? </p>
<h4>Modifying our <code>ggplot</code> colors for categorical data using scale_color_manual</h4>
<p>Once you have your color palette, you can use the <code>scale_color_manual</code> function to map the levels in your dataset to different colors in your generated color palette.</p>
<p>Side note: Can you guess? Yes, again, there is a similar function called <code>scale_fill_manual</code> that we would use if we were instead graphing bars or other fillable shapes. I won't be including an example of this function, but it operates in exactly the same way as the <code>scale_color_manual</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>Here, we start by creating a vector that maps the different levels in our data, in this case "4", "6", and "8", to different colors. </p>
<p>We then use the <code>scale_color_manual</code> function and specify the mapping by passing our <code>colors</code> vector to the <code>values</code> argument of <code>scale_color_manual</code>. It will then go through each entry in the <code>cyl</code> column, mapping it to the relevant color in our <code>colors</code> vector.</p>
<div class="highlight"><pre><span></span>colors <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"4"</span> <span class="o">=</span> <span class="s">"#D9717D"</span><span class="p">,</span> <span class="s">"6"</span> <span class="o">=</span> <span class="s">"#4DB6D0"</span><span class="p">,</span> <span class="s">"8"</span> <span class="o">=</span> <span class="s">"#BECA55"</span><span class="p">)</span>
g2 <span class="o">+</span>
scale_color_manual<span class="p">(</span>values <span class="o">=</span> colors<span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/scale_color_manual-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
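<p>And if you're filling shapes rather than coloring points, the same pattern works with <code>scale_fill_manual</code>. A sketch mapping the three drive types in <code>mpg</code> ("4", "f", and "r") to the same palette on a bar chart:</p>

```r
library(ggplot2)

# Named vector mapping each level of drv to a color from our palette
fills <- c("4" = "#D9717D", "f" = "#4DB6D0", "r" = "#BECA55")

p <- ggplot(mpg, aes(x = drv, fill = drv)) +
  geom_bar() +
  scale_fill_manual(values = fills)
p
```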
<h2>A Summary of Working with <code>ggplot</code> Colors</h2>
<p>Congratulations! You now know how to work with colors in your <code>ggplot</code> graphs.</p>
<p>In this guide, you learned:</p>
<ul>
<li>How to change all items in a graph to a static color of your choice</li>
<li>To distinguish between two ways of modifying color in your <code>ggplot</code> graph:<ol>
<li>Setting a static color for all elements in your graph </li>
<li>Mapping a variable in your data to a color palette so that each color represents a different level of the variable</li>
</ol>
</li>
<li>How you can customize your sequential color scales using the <code>scale_color_gradient</code> and <code>scale_fill_gradient</code> functions for continuous data</li>
<li>How to customize your diverging color scales using the <code>scale_color_gradient2</code> and <code>scale_fill_gradient2</code> functions for continuous data</li>
<li>How to customize your color palettes for categorical data using the <code>scale_color_manual</code> and <code>scale_fill_manual</code> functions</li>
</ul>
<h2>Don't Forget to Practice!</h2>
<p>Right now, you should have a pretty good understanding of how you can work with and modify the colors in your <code>ggplot</code> graphs. But if you don't practice, you're going to forget this stuff! </p>
<p>That's why I've created a free workbook that you can use to apply what you've learned in this guide. The workbook is an R file that includes additional questions and exercises to help you engage with this material. </p>
<p><a href="https://mailchi.mp/2c40b4d25a09/ggplot-colors">Get your free workbook to master working with colors in <code>ggplot</code></a></p>
<h1>10 Steps to Better Graphs in R</h1>
<p>Michael Toth &middot; 2019-05-07</p>
<p>Over the last 5 years, I have created <strong>a LOT</strong> of graphs. And let me tell you, they haven't all been pretty. But with each new graph that I've created, I've improved my knowledge of what works and what doesn't. </p>
<p>And I've used that knowledge to develop a set of best practices that I follow every time I'm working on a new project that involves communicating results or information with graphs. </p>
<p>You see, when I'm making a graph, I'm not doing it just to explore some data or show something "interesting". No. I want my graphs to speak to my audience and help them to <em>understand</em> and <em>take action</em> based on what they learn. </p>
<p>A good data scientist needs to be able to not only analyze data, but also to convey the insights hidden in that data in a way that convinces people to take appropriate action. </p>
<p>If you can use your data analysis skills to consistently drive change in your organization, you will quickly find yourself on a path toward promotion, increased responsibility, and greater control over your own work. </p>
<p>In my own experience, learning to graph effectively was the single biggest thing that helped me increase my impact and ability to drive change at work. That's why I think it's so important for you to learn to graph effectively, and that's why I'm sharing this checklist with you today.</p>
<p>This is the exact checklist I go through when I'm working on graphs for big consulting projects with my clients. I keep a printed copy on my desk and I refer to it every time I'm working on a new graph. It helps me, and I think it will help you too. Be sure to get a copy of the checklist for yourself so you'll always have it handy when you need it!</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step ggplot Graphing Checklist</a></p>
<h2>Why Do You Need a Checklist?</h2>
<p>Graphs are a versatile tool that can be used for a variety of different goals. That's one of the reasons why I think learning to graph effectively is one of the highest priorities for data scientists. </p>
<p>But I also think that leads to a lot of problems with graphs. You see, you can use graphs for exploratory data analysis, and you can also use graphs for presentation and sharing results. The problem that I see <strong>ALL THE TIME</strong> is that people try to use the same graphs for both of these things. Ahhh! </p>
<p>Look, I've been there. It's tempting and easy to throw a few quick graphs together and call it a day. </p>
<p>I used to work in finance, and I was once tasked with building a model to predict how likely borrowers were to default on loans they had taken out. </p>
<p>It was a challenging problem, and I spent around 6 weeks creating a sophisticated model that predicted defaults based on all kinds of data about the borrower's income level, where they lived, how big their loan was, and what their interest rate was.</p>
<p>I was pretty pleased that the model I created worked really well! I knew our clients would find a lot of value in this model once we got it implemented into our platform.</p>
<p>Before that could happen, though, I needed to summarize my results and share them with the rest of the company. At the time, I didn't see this as an opportunity to advocate for my work and its benefits to our clients. Instead, I saw it as an annoying obligation that I had to do on top of my already extensive analysis.</p>
<p>I knew my analysis was good and that this change was valuable. I was sure other people understood that as well. </p>
<p>So I threw some graphs together, gave a quick presentation on the topic to a room full of glazed-over eyes, and went back to my desk.</p>
<p>It took weeks for us to build this into the platform, when it could have been done in a matter of days if there had been sufficient motivation.</p>
<p>Whose fault was that? At the time, it was easy for me to blame the engineering team for the slow implementation. But the reality is, <strong>it was my fault!</strong></p>
<p>Everybody else is busy with their own work, and for the most part, they don't really know what it is you do all day. You're often so deep in the weeds of your own analysis that things you think are simple and obvious aren't even on the minds of everybody else. That's why it's important to treat every presentation and every graph you share as an opportunity to educate others and inspire them to move forward with an action.</p>
<p>I'd love to say that I immediately changed my presentation and graphing style after that experience, but I didn't. It took me years to develop the knowledge and skills to do this effectively.</p>
<p>But it doesn't have to take years! I'm here to help you learn <em>today</em> how to effectively use graphs to communicate ideas and drive change. </p>
<p>If you implement these ideas consistently, you <strong>will</strong> see improvements in the impact of your work and your influence in the organization.</p>
<p>Now, let's get into the checklist. And remember, if you want to keep a copy of this for yourself so you'll always have it to refer back to, you can get that here:</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step Checklist to Graphing for Impact</a></p>
<h2>Before you graph</h2>
<h4>1. Decide who this graph is for</h4>
<p>In order to create an effective graph, you need to know who will be using this information. Many of your design decisions stem from this key point. If you understand your audience’s background, goals, and challenges, you’ll be far more effective in creating a graph to help them make a decision, which is what this is all about! In particular, remember that this graph is not for you. You have all kinds of specialized knowledge that your audience likely does not have. You need your graph to appeal to them.</p>
<h4>2. Structure your graph to answer a question</h4>
<p>Your graphs should answer an important question that your audience has. “How have our revenue numbers changed over time?” or “Which of our services has the lowest level of customer satisfaction?” Design your graph to answer their question, not just to explore data. This serves two purposes.</p>
<ol>
<li>It gives people a reason to pay attention. If you're answering a big question they have, they're going to listen to you.</li>
<li>It provides a clear path from <strong>data</strong> to <strong>action</strong>, which is ultimately what you want. Remember: the entire point of this field is to extract insights from data to <strong>help businesses make better decisions!</strong></li>
</ol>
<h4>3. Decide which type of graph to use</h4>
<p>The graph you use will depend on the data and the question you are answering.</p>
<ul>
<li><strong><a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">Line Graph</a></strong>: Use line graphs to track trends over time or show the relationship between two variables.</li>
<li><strong><a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">Bar Charts</a></strong>: Use bar charts to compare quantitative data across multiple categories.</li>
<li><strong><a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">Scatter Plots</a></strong>: Use scatter plots to assess the relationship between two variables.</li>
<li><strong>Pie Charts</strong>: Use pie charts to show parts of a whole. I personally do not use pie charts, and I advise you to be very careful with them. If you must use them, limit the number of categories, as more than 3 or 4 makes them unreadable.</li>
</ul>
<h4>4. Decide how to handle outliers</h4>
<p>Outliers are an inevitability. You need to decide how to handle this. Sometimes, the outlier itself can be a point of focus in your graph that you want to highlight. Other times, it can be a distraction from your message that you would prefer to remove. Before removing an outlier, think critically about why the outlier exists and make a judgment call as to whether removing it helps to clarify your point without being misleading.</p>
<h2>Building Your Graph</h2>
<h4>5. Remove unnecessary data</h4>
<p>Your audience should be able to clearly understand the point of your graph. Excessive and unnecessary data can distract from this goal. Decide what is necessary to answer your question and cut the rest. </p>
<h4>6. Don't be misleading</h4>
<p>There are many ways that graphs can be misleading, either intentionally or unintentionally. These two seem to come up most frequently:</p>
<p>If you’re using a bar chart, the baseline for the y-axis must start at 0. Otherwise, your graph will be misleading, exaggerating the apparent differences across the categories.</p>
<p>Your titles and captions should accurately describe your data. Titles and captions are a great way to bring salience to your graph, but you need to ensure the text reinforces what the data says, rather than changing the message.</p>
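<p>The good news is that <code>ggplot</code> bar charts start at 0 by default, so the misleading version takes extra effort to produce. This sketch shows both the honest default and the truncated-axis version to avoid (the axis limits here are arbitrary, chosen only to demonstrate the effect):</p>

```r
library(tidyverse)

# Average horsepower by cylinder count from the built-in mtcars dataset
avg_hp <- mtcars %>% group_by(cyl) %>% summarise(hp = mean(hp))

p <- ggplot(avg_hp) +
  geom_bar(aes(x = factor(cyl), y = hp), stat = 'identity')

# Honest: bars start at 0, the geom_bar default
p

# Misleading: truncating the y-axis exaggerates the differences
p + coord_cartesian(ylim = c(75, 215))
```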
<h2>Styling Your Graph</h2>
<h4>7. Decide on an appropriate color palette</h4>
<p>Color is an important and often-neglected aspect of graphs. For single-color graphs, choose a color that’s related to your organization’s brand or thematically related to the graph (for example, green for forestry data). For multicolor graphs, use Color Picker for Data, an excellent tool for building visually pleasing color palettes.</p>
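<p>Once you've picked a palette, applying it in <code>ggplot</code> is straightforward with <code>scale_fill_manual</code>. The hex codes below are placeholders to illustrate the mechanics; swap in your organization's brand colors or a palette from a tool like the one above:</p>

```r
library(tidyverse)

# Hypothetical palette -- replace these hex codes with your own
my_palette <- c('#1b9e77', '#d95f02', '#7570b3')

p <- ggplot(mpg) +
  geom_bar(aes(x = class, fill = drv)) +
  scale_fill_manual(values = my_palette)
p
```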
<h4>8. Make all of your axis titles and labels horizontal</h4>
<p>All of the axis titles and labels in your graph should be horizontal. <a href="https://michaeltoth.me/one-step-to-quickly-improve-the-readability-and-visual-appeal-of-ggplot-graphs.html">Horizontal labels</a> greatly improve the readability and visual appeal of a graph. </p>
<h4>9. Adjust your titles, labels, and legend text</h4>
<p>Give your graph a compelling title, and add descriptive, well-formatted names to the axis titles and legend. A good choice for your graph title is to simply state the question you’re trying to answer. I also like to use a subtitle that drives home the message you want people to take away. You can use the <code>labs</code> function in ggplot to modify these labels.</p>
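<p>Here's a sketch of that advice in code, using the built-in <code>mpg</code> dataset. The title wording and caption are invented for illustration; the point is the question-style title plus a message-driven subtitle:</p>

```r
library(tidyverse)

# labs() sets the title, subtitle, axis names, and caption in one place
p <- ggplot(mpg) +
  geom_bar(aes(x = class)) +
  labs(title    = 'Which car classes are most common?',
       subtitle = 'SUVs and compacts dominate the mpg dataset',
       x        = 'Car class',
       y        = 'Count',
       caption  = 'Source: ggplot2 mpg dataset')
p
```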
<h4>BONUS: Add your company logo and branding</h4>
<p>If you’re sharing this graph with clients or the public, <a href="https://michaeltoth.me/you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html">adding your company logo and branding</a> elements can help your graph to stand out and to build credibility for your organization. This is great for you, because it will help you to grow your own influence and visibility within the company. Read my guide on this subject for more details on how to implement this tip.</p>
<h2>Exporting Your Graph</h2>
<h4>10. Save your graph in a readable high-resolution format</h4>
<p>Think about how your graph is going to be read. Will it be online, printed, or in a slide for a presentation? Each format may require different adjustments to text and graph sizing to be readable. Be sure to test for yourself to ensure you can read your graph in its final format. This will avoid frustrating reworks or--even worse--sharing an unreadable graph! Use the <code>ggsave</code> function to save your graph and modify the resolution. Then, adjust sizes until you’re satisfied with the final result.</p>
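<p>A minimal <code>ggsave</code> sketch follows. The file name, dimensions, and DPI below are starting points, not rules; adjust them until the text is readable in your final medium:</p>

```r
library(tidyverse)

p <- ggplot(mpg) + geom_bar(aes(x = class))

# Save at high resolution; width/height are in inches by default,
# and dpi = 300 is a common choice for print-quality output
ggsave('class_counts.png', plot = p, width = 8, height = 5, dpi = 300)
```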
<h2>Conclusion</h2>
<p>Some of these tips may have been obvious, and others may have seemed like revelations. The important thing is to think through these steps and apply them <em>consistently</em> with every graph you produce. I promise you that if you incorporate this checklist into your workflow, you're going to see a big change in how people respond to your analysis at work.</p>
<p>Remember to get a copy of this graphing checklist so you can be sure to go through it every time!</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step Checklist to Graphing for Impact</a></p>One Step to Quickly Improve the Readability and Visual Appeal of ggplot Graphs2019-05-03T06:00:00-04:00Michael Tothtag:michaeltoth.me,2019-05-03:one-step-to-quickly-improve-the-readability-and-visual-appeal-of-ggplot-graphs.html<p>There's something wonderful about a graph that communicates a point clearly. You know it when you see it. It's the kind of graph that makes you pause and say 'wow!'. </p>
<p>There are all kinds of different graphs that fit this description, but they usually have a few things in common:</p>
<ul>
<li>Clarity: The message of the graph is clear</li>
<li>Simplicity: Extraneous details are removed</li>
<li>Visual appeal: The graph should be pleasing to look at</li>
</ul>
<p>Of course, your graph also needs to be communicating something worthwhile. But I see so many graphs that ultimately fall short of their potential because they miss one or more of these three points! </p>
<p>I've been there myself. Some of my <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html">earliest</a> <a href="https://michaeltoth.me/plotting-the-evolution-of-the-us-treasuryyield-curve.html">graphs</a> in R fall short, in retrospect. But the key to improving is to keep learning new things and keep getting better over time.</p>
<p>It seems like many people learn how to create basic <a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">bar charts</a>, <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, and <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in R, and then stop developing their skills further. But you shouldn't stop there! </p>
<p>I don't think most people are doing this intentionally. In fact, I think most people simply don't know what's possible and what they should be aiming for. </p>
<p>If the only graphs you've ever seen are basic examples from statistics textbooks or code documentation, how would you possibly know that you can do better? How would you know that you can create graphs that capture attention, drive action, or inspire awe? <strong>You wouldn't</strong>.</p>
<p>I want to teach you how to make graphs that get your point across with clarity, simplicity, and visual appeal. There are quick fixes you can make to your graphs <strong>right now</strong> that will get you much closer to making that a reality.</p>
<p>Today, I'm going to show you how you can use axis text rotation to greatly improve both the readability and visual appeal of your graphs. </p>
<p><center>
<img alt="Drake Does Not Like Vertical Axis Labels in ggplot" src="../images/20190503_rotate_labels/drake.png" width="600px" />
</center></p>
<p>Are you ready? Let's go!</p>
<h2>Creating a Base Graph to Work From</h2>
<p>To start, let's load in the libraries we'll be using throughout this post: <code>tidyverse</code> (for graphing and data manipulation), and <code>hrbrthemes</code> (a useful package that I use to improve the styling of my graphs).</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
</pre></div>
<p>For this post, we'll be using the <code>mtcars</code> dataset to illustrate these graphing techniques. Here, I group the cars in that dataset by the number of cylinders in their engines (4, 6, or 8), and then calculate the average horsepower for each group.</p>
<div class="highlight"><pre><span></span><span class="c1"># Calculate average horsepower for cars with 4-, 6-, and 8-cylinder engines</span>
hp_by_cyl <span class="o"><-</span> mtcars <span class="o">%>%</span> group_by<span class="p">(</span>cyl<span class="p">)</span> <span class="o">%>%</span>
summarise<span class="p">(</span>avg_hp <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hp<span class="p">))</span>
</pre></div>
<p>Finally, I create a simple bar chart in <code>ggplot</code> to show this data. Let's review this code briefly, so we're all on the same page:</p>
<ul>
<li>The first two lines (<code>ggplot</code> and <code>geom_bar</code>) are what creates the base bar chart. </li>
<li>The next 5 lines use the <code>labs</code> function to assign labels to the graph. </li>
<li>The last line uses <code>theme_ipsum</code> from the <code>hrbrthemes</code> package to apply some nice styling to the graph.</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Creating a base graph without formatting the axis text</span>
g <span class="o"><-</span> ggplot<span class="p">(</span>hp_by_cyl<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">),</span> y <span class="o">=</span> avg_hp<span class="p">),</span> stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Average Horsepower for Cars with 4-, 6-, and 8-Cylinder Engines'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Based on Data for 32 Cars from the 1974 Motor Trend Magazine'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'Cylinders'</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Horsepower'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>axis_title_size <span class="o">=</span> <span class="m">12</span><span class="p">)</span>
g
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/first_graph-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph is fine, for the most part. But I don't aim for fine, and you shouldn't either! </p>
<p>I have a high attention to detail for graphing, and I want my graphs to be excellent. The easier I can make it for people to read and understand a graph, the better job I'll do of convincing them to move forward with a particular course of action. </p>
<h2>Rotate Your Y-Axis Title To Improve Readability</h2>
<p>There are several things we could do to improve this graph, but in this guide let's focus on rotating the y-axis label. This simple change will make your graph <strong>so much better</strong>. That way, people won't have to tilt their heads like me to understand what's going on in your graph:</p>
<p><center>
<img alt="Sideways head tilt to read ggplot axis title" src="../images/20190503_rotate_labels/sideways.jpg" width="600px" />
</center></p>
<p>That's not a look you want. Luckily, it's super easy to rotate your axis title in <code>ggplot</code>! To do this, we'll modify some parameters using <code>ggplot</code>'s <code>theme</code> function, which can also be used to adjust all kinds of things in your graph like axis labels, gridlines, and text sizing. </p>
<p>Here, we specifically want to adjust the y-axis, which we can do using the <code>axis.title.y</code> parameter. To adjust a text element, we use <code>element_text</code>. You can use <code>element_text</code> to adjust things like font, color, and size. But here, we're interested in rotation, so we're going to use <code>angle</code>. Setting the <code>angle</code> to 0 will make the y-axis text horizontal. Take a look: </p>
<div class="highlight"><pre><span></span><span class="c1"># Modifying the graph from before (stored as g), to make the text horizontal</span>
g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">0</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/rotate_labels-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>It's a minor change, but this graph is <strong>far more readable</strong> and <strong>more visually appealing</strong> than the graph we had before. That's what we're going for. Simplicity, clarity, and visual appeal.</p>
<h2>More on Text Rotation in <code>ggplot</code></h2>
<p>As we just saw, when you need to rotate text in <code>ggplot</code>, you can accomplish this by adjusting the <code>angle</code> within <code>element_text</code>. Here we did this to adjust the axis title, but this works the same way for any text that you want to rotate in ggplot.</p>
<p>The <code>angle</code> parameter can take any value between 0 and 360, corresponding to the angle of rotation from a horizontal baseline. To briefly illustrate how different angle values work in ggplot, take a look at the following graph, where I explore four different angle rotations:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>gridExtra<span class="p">)</span>
<span class="c1"># 0-Degree angle</span>
g1 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">0</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 0 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 0))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 90-Degree angle</span>
g2 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">90</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 90 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 90))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 180-Degree angle</span>
g3 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">180</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 180 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 180))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 270-Degree angle</span>
g4 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">270</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 270 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'theme(axis.title.y = element_text(angle = 270))'</span><span class="p">)</span>
<span class="c1"># Add all graphs to a grid</span>
grid.arrange<span class="p">(</span>g1<span class="p">,</span> g2<span class="p">,</span> g3<span class="p">,</span> g4<span class="p">,</span> nrow <span class="o">=</span> <span class="m">2</span><span class="p">,</span> ncol <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/axis_rotations-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You should now have a better understanding of how you can use axis title rotation to improve the readability and visual appeal of your graphs in <code>ggplot</code>! </p>
<p>You really need to think about minor details like this, especially when you're going to be using a graph in a presentation or a report to a broader audience. Small details can really improve your graph, which in turn will make it easier for you to educate your audience, convince people of your conclusions, and drive change in your organization. </p>
<hr />
<p>I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.</p>
<p><a href="http://eepurl.com/gmYioz">Sign up here to receive my best tips</a></p>Detailed Guide to the Bar Chart in R with ggplot2019-05-01T05:40:00-04:00Michael Tothtag:michaeltoth.me,2019-05-01:detailed-guide-to-the-bar-chart-in-r-with-ggplot.html<p>When it comes to data visualization, flashy graphs can be fun. Believe me, I'm as big a fan of <a href="https://michaeltoth.me/how-to-create-a-bar-chart-race-in-r-mapping-united-states-city-population-1790-2010.html">flashy</a> <a href="https://michaeltoth.me/mapping-legal-marijuana-states-and-medical-marijuana-states-1995-2019.html">graphs</a> as anybody. But if you're trying to convey information, especially to a broad audience, flashy isn't always the way to go. </p>
<p>Whether it's the <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graph</a>, <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plot</a>, or bar chart (the subject of this guide!), choosing a well-understood and common graph style is usually the way to go for most audiences, most of the time. And if you're just getting started with your R journey, it's important to master the basics before complicating things further.</p>
<p>So in this guide, I'm going to talk about creating a bar chart in R. Specifically, I'll show you exactly how you can use the <code>ggplot</code> <code>geom_bar</code> function to create a bar chart.</p>
<p>A bar chart is a graph that is used to show comparisons across discrete categories. One axis--the x-axis throughout this guide--shows the categories being compared, and the other axis--the y-axis in our case--represents a measured value. The heights of the bars are proportional to the measured values.</p>
<p><center>
<img alt="Bar Chart of Things That Put Your Life in Danger" src="../images/20190501_geom_bar/got.jpg" width="600px" />
</center></p>
<p>For example, in this extremely scientific bar chart, we see the level of life threatening danger for three different actions. All dangerous, to be sure, but I think we can all agree this graph gets things right in showing that Game of Thrones spoilers are most dangerous of all. </p>
<h2>Introduction to ggplot</h2>
<p>Before diving into the <code>ggplot</code> code to create a bar chart in R, I first want to briefly explain <code>ggplot</code> and why I think it's the best choice for graphing in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Follow Along With the Workbook</h2>
<p>To accompany this guide, I've created a <a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">free workbook</a> that you can work through to apply what you're learning as you read. </p>
<p>The workbook is an R file that contains all the code shown in this post as well as additional guided questions and exercises to help you understand the topic even deeper.</p>
<p>If you want to really learn how to create a bar chart in R so that you'll still remember weeks or even months from now, you need to practice. </p>
<p>So Download the workbook now and practice as you read this post!</p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>
<h2>Investigating our dataset</h2>
<p>Throughout this guide, we'll be using the <code>mpg</code> dataset that's built into ggplot. This dataset contains data on fuel economy for 38 popular car models. Let's take a look:</p>
<p><center>
<img alt="A snippet of the mpg dataset" src="../images/20190501_geom_bar/mpg.png" width="400px" />
</center></p>
<p>The mpg dataset contains 11 columns: </p>
<ul>
<li><code>manufacturer</code>: Car Manufacturer Name</li>
<li><code>model</code>: Car Model Name</li>
<li><code>displ</code>: Engine Displacement (liters)</li>
<li><code>year</code>: Year of Manufacture</li>
<li><code>cyl</code>: Number of Cylinders</li>
<li><code>trans</code>: Type of Transmission</li>
<li><code>drv</code>: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd</li>
<li><code>cty</code>: City Miles per Gallon</li>
<li><code>hwy</code>: Highway Miles per Gallon</li>
<li><code>fl</code>: Fuel Type</li>
<li><code>class</code>: Type of Car</li>
</ul>
<h2>How to create a simple bar chart in R using <code>geom_bar</code></h2>
<p><code>ggplot</code> uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I have talked about <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html"><code>geom_line</code> for line graphs</a> and <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html"><code>geom_point</code> for scatter plots</a>. Today I'll be focusing on <code>geom_bar</code>, which is used to create bar charts in R. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/simple_bar_chart-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible <code>ggplot</code> bar chart we can create using <code>geom_bar</code>. Let's review this in more detail:</p>
<p>First, we call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. Basically, this creates a blank canvas on which we'll add our data and graphics. Here we pass mpg to <code>ggplot</code> to indicate that we'll be using the mpg data for this particular <code>ggplot</code> bar chart.</p>
<p>Next, we add the <code>geom_bar</code> call to the base <code>ggplot</code> graph in order to create this bar chart. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use <code>class</code> as the x-axis variable for the bar chart.</p>
<p>You'll note that we don't specify a y-axis variable here. Later on, I'll tell you how we can modify the y-axis for a bar chart in R. But for now, just know that if you don't specify anything, <code>ggplot</code> will automatically count the occurrences of each x-axis category in the dataset, and will display the <code>count</code> on the y-axis.</p>
<p>And that's it, we have our bar chart! We see that SUVs are the most prevalent in our data, followed by compact and midsize cars. </p>
<h2>Changing bar color in a <code>ggplot</code> bar chart</h2>
<p>Expanding on this example, let's change the colors of our bar chart!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this <code>geom_bar</code> call is identical to the one before, except that we've added the modifier <code>fill = 'blue'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
<p>If you're familiar with line graphs and scatter plots in ggplot, you've seen that in those cases we changed the color by specifying <code>color = 'blue'</code>, while in this case we're using <code>fill = 'blue'</code>. </p>
<p>In ggplot, <code>color</code> is used to change the <em>outline</em> of an object, while <code>fill</code> is used to <em>fill the inside</em> of an object. For objects like points and lines, there is no inside to fill, so we use <code>color</code> to change the color of those objects. With bar charts, the bars <em>can be filled</em>, so we use <code>fill</code> to change the color with <code>geom_bar</code>. </p>
<p>This distinction between <code>color</code> and <code>fill</code> gets a bit more complex, so stick with me to hear more about how these work with bar charts in ggplot!</p>
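<p>To see the distinction side by side, here's a quick sketch that sets both aesthetics on the same bars; the specific colors are arbitrary:</p>

```r
library(tidyverse)

# fill colors the inside of each bar; color draws the outline around it
p <- ggplot(mpg) +
  geom_bar(aes(x = class), fill = 'lightblue', color = 'black')
p
```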
<h2>Mapping bar color to a variable in a <code>ggplot</code> bar chart</h2>
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 2 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you've read my previous <code>ggplot</code> guides, this bit should look familiar!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> drv<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph shows the same data as before, but now instead of showing solid-colored bars, we see that the bars are stacked with 3 different colors! The red portion corresponds to 4-wheel drive cars, the green to front-wheel drive cars, and the blue to rear-wheel drive cars. Did you catch the 2 changes we made to the code? They were:</p>
<ol>
<li>Instead of specifying <code>fill = 'blue'</code>, we specified <code>fill = drv</code></li>
<li>We moved the fill parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Before, we told <code>ggplot</code> to change the color of the bars to blue by adding <code>fill = 'blue'</code> to our <code>geom_bar()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our bars, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>drv</code> column to the <code>fill</code> aesthetic. </p>
<p>This means we are telling <code>ggplot</code> to use a different color for each value of <code>drv</code> in our data! This mapping also lets <code>ggplot</code> know that it needs to create a legend to identify the drive types, and it adds that legend to the graph automatically!</p>
<h3>More Details on Stacked Bar Charts in <code>ggplot</code></h3>
<p>As we saw above, when we map a variable to the <code>fill</code> aesthetic in <code>ggplot</code>, it creates what's called a stacked bar chart. A stacked bar chart is a variation on the typical bar chart where a bar is divided among a number of different segments. </p>
<p>In this case, we're dividing the bar chart into segments based on the levels of the <code>drv</code> variable, corresponding to the front-wheel, rear-wheel, and four-wheel drive cars.</p>
<p>For a given <code>class</code> of car, our stacked bar chart makes it easy to see how many of those cars fall into each of the 3 <code>drv</code> categories. </p>
<p>The main flaw of stacked bar charts is that they become harder to read the more segments each bar has, especially when trying to make comparisons across the x-axis (in our case, across car <code>class</code>). To illustrate, let's take a look at this next example: </p>
<div class="highlight"><pre><span></span><span class="c1"># Note we convert the cyl variable to a factor to fill properly</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/stacked_bar-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>As you can see, even with four segments it starts to become difficult to make comparisons between the different categories on the x-axis. For example, are there more 6-cylinder minivans or 6-cylinder pickups in our dataset? What about 5-cylinder compacts vs. 5-cylinder subcompacts? With stacked bars, these types of comparisons become challenging. <strong>My recommendation is to generally avoid stacked bar charts with more than 3 segments</strong>.</p>
<h3>Dodged Bars in ggplot</h3>
<p>Instead of stacked bars, we can use side-by-side (dodged) bar charts. In ggplot, this is accomplished by using the <code>position = position_dodge()</code> argument as follows:</p>
<div class="highlight"><pre><span></span><span class="c1"># Note we convert the cyl variable to a factor here in order to fill by cylinder</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)),</span> position <span class="o">=</span> position_dodge<span class="p">(</span>preserve <span class="o">=</span> <span class="s">'single'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/dodged_bar-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Now, the different segments for each class are placed side-by-side instead of stacked on top of each other. </p>
<p>Revisiting the comparisons from before, we can quickly see that there are an equal number of 6-cylinder minivans and 6-cylinder pickups. There are also an equal number of 5-cylinder compacts and subcompacts. </p>
<p>While these comparisons are easier with a dodged bar graph, comparing the total count of cars in each class is far more difficult. </p>
<p>Which brings us to a general point: different graphs serve different purposes! You shouldn't try to accomplish too much in a single graph. If you're trying to cram too much information into a single graph, you'll likely confuse your audience, and they'll take away exactly none of the information. </p>
<h2>Scaling bar size to a variable in your data</h2>
<p>Up to now, all of the bar charts we've reviewed have scaled the height of the bars based on the count of a variable in the dataset. First we counted the number of vehicles in each <code>class</code>, and then we counted the number of vehicles in each <code>class</code> with each <code>drv</code> type. </p>
<p>What if we don't want the height of our bars to be based on count? What if we already have a column in our dataset that we want to use as the y-axis height? Let's say we wanted to graph the average highway miles per gallon by <code>class</code> of car, for example. How can we do that in ggplot?</p>
<p>There are two ways we can do this, and I'll be reviewing them both. To start, I'll introduce <code>stat = 'identity'</code>:</p>
<div class="highlight"><pre><span></span><span class="c1"># Use dplyr to calculate the average hwy_mpg by class</span>
by_hwy_mpg <span class="o"><-</span> mpg <span class="o">%>%</span> group_by<span class="p">(</span><span class="kp">class</span><span class="p">)</span> <span class="o">%>%</span> summarise<span class="p">(</span>hwy_mpg <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hwy<span class="p">))</span>
ggplot<span class="p">(</span>by_hwy_mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> y <span class="o">=</span> hwy_mpg<span class="p">),</span> stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/stat_identity-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Now we see a graph by <code>class</code> of car where the y-axis represents the average highway miles per gallon of each <code>class</code>. How does this work, and how is it different from what we had before?</p>
<p>Before, we did not specify a y-axis variable and instead let <code>ggplot</code> automatically populate the y-axis with a count of our data. Now, we're explicitly telling <code>ggplot</code> to use <code>hwy_mpg</code> as our y-axis variable. And there's something else here also: <code>stat = 'identity'</code>. What does that mean?</p>
<p>We saw earlier that if we omit the y-variable, <code>ggplot</code> will automatically scale the heights of the bars to a count of cases in each group on the x-axis. If we instead want the values to come from a column in our data frame, we need to change two things in our <code>geom_bar</code> call:</p>
<ol>
<li>Add <code>stat = 'identity'</code> to <code>geom_bar()</code></li>
<li>Add a y-variable mapping</li>
</ol>
<p>Adding a y-variable mapping alone without adding <code>stat='identity'</code> leads to an error message:</p>
<p><center>
<img alt="Bar chart without stat identity error message" src="../images/20190501_geom_bar/error1.png" width="600px" />
</center></p>
<p>Why the error? If you don't specify <code>stat = 'identity'</code>, then under the hood, <code>ggplot</code> is automatically passing a default value of <code>stat = 'count'</code>, which graphs the counts by group. A y-variable is not compatible with this, so you get the error message. </p>
<p>If this is confusing, that's okay. For now, all you need to remember is that if you want to use <code>geom_bar</code> to map the heights of a column in your dataset, you need to add <strong>BOTH</strong> a y-variable mapping <strong>AND</strong> <code>stat = 'identity'</code>.</p>
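<p>To make the default behavior explicit, the two calls below produce identical graphs, because <code>'count'</code> is the default <code>stat</code> for <code>geom_bar</code>. This is a minimal sketch assuming <code>ggplot2</code> (which ships with the <code>mpg</code> dataset) is installed:</p>

```r
library(ggplot2)  # mpg ships with ggplot2

# These two calls are equivalent: 'count' is geom_bar's default stat
ggplot(mpg) + geom_bar(aes(x = class))
ggplot(mpg) + geom_bar(aes(x = class), stat = 'count')

# A y mapping only works once we switch to stat = 'identity'
# and supply pre-summarized data
```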
<p>I'll be honest, this was highly confusing for me for a long time. I hope this guidance helps to clear things up for you, so you don't have to suffer the same confusion that I did. But if you have a hard time remembering this distinction, <code>ggplot</code> also has a handy function that does this work for you. Instead of using <code>geom_bar</code> with <code>stat = 'identity'</code>, you can simply use the <code>geom_col</code> function to get the same result. Let's see:</p>
<div class="highlight"><pre><span></span><span class="c1"># Use dplyr to calculate the average hwy_mpg by class</span>
by_hwy_mpg <span class="o"><-</span> mpg <span class="o">%>%</span> group_by<span class="p">(</span><span class="kp">class</span><span class="p">)</span> <span class="o">%>%</span> summarise<span class="p">(</span>hwy_mpg <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hwy<span class="p">))</span>
ggplot<span class="p">(</span>by_hwy_mpg<span class="p">)</span> <span class="o">+</span>
geom_col<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> y <span class="o">=</span> hwy_mpg<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/geom_col-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll notice the result is the same as the graph we made above, but we've replaced <code>geom_bar</code> with <code>geom_col</code> and removed <code>stat = 'identity'</code>. <code>geom_col</code> is the same as <code>geom_bar</code> with <code>stat = 'identity'</code>, so you can use whichever you prefer or find easier to understand. For me, I've gotten used to <code>geom_bar</code>, so I prefer to use that, but you can do whichever you like! </p>
<h2>Revisiting <code>color</code> in <code>geom_bar</code></h2>
<p>Above, we showed how you could change the color of bars in <code>ggplot</code> using the <code>fill</code> option. I mentioned that <code>color</code> is used for line graphs and scatter plots, but that we use <code>fill</code> for bars because we are filling the inside of the bar with color. That said, <code>color</code> does still work here, though it affects only the outline of the graph in question. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This creates a graph with bars filled in the standard gray, but outlined in blue. That outline is what <code>color</code> affects for bar charts in ggplot!</p>
<p>I personally only use <code>color</code> for one specific thing: modifying the outline of a bar chart where I'm already using <code>fill</code> to create a better looking graph with a little extra pop. The standard <code>fill</code> is fine for most purposes, but you can step things up a bit with a carefully selected <code>color</code> outline:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'#003366'</span><span class="p">,</span> color <span class="o">=</span> <span class="s">'#add8e6'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/color_and_fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>It's subtle, but this graph uses a darker navy blue for the fill of the bars and a lighter blue for the outline that makes the bars pop a little bit. </p>
<p>This is the only time when I use <code>color</code> for bar charts in R. Do you have a use case for this? I'd love to hear it, so let me know in the comments!</p>
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We saw above how we can create graphs in <code>ggplot</code> that use the <code>fill</code> argument to map the <code>cyl</code> variable or the <code>drv</code> variable to the color of bars in a bar chart. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they include everything you see within the <code>aes()</code> in <code>ggplot</code>.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>I know this can sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_bar.</p>
<h3>Reviewing the list of geom_bar aesthetic mappings</h3>
<p>The main aesthetic mappings for a ggplot bar graph include:</p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>fill</code>: Map a variable to a bar color</li>
<li><code>color</code>: Map a variable to a bar outline color</li>
<li><code>linetype</code>: Map a variable to a bar outline linetype</li>
<li><code>alpha</code>: Map a variable to a bar transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code> and <code>fill</code> aesthetic mappings. We've also seen <code>color</code> applied as a parameter to change the outline of the bars in the prior example.</p>
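<p>As one quick taste, mapping a variable to one of the remaining aesthetics works the same way as the <code>fill</code> mappings above. Here's a minimal sketch mapping <code>drv</code> to <code>alpha</code> (note that <code>ggplot</code> will warn that alpha is not advised for discrete variables, but it still draws the graph):</p>

```r
library(ggplot2)

# Map drv to bar transparency; each drive type gets its own alpha level.
# fill stays outside aes() here because it's a fixed parameter, not a mapping.
ggplot(mpg) +
  geom_bar(aes(x = class, alpha = drv), fill = 'steelblue')
```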
<p>I'm not going to review the additional aesthetics in this post, but if you'd like more details, check out the free workbook which includes some examples of these aesthetics in more detail!</p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>I often hear from my R training clients that they are confused by the distinction between aesthetic mappings and parameters in ggplot. Personally, I was quite confused by this when I was first learning about graphing in ggplot as well. Let me try to clear up some of the confusion!</p>
<p>Above, we saw that we could use <code>fill</code> in two different ways with <code>geom_bar</code>. First, we were able to set the color of our bars to blue by specifying <code>fill = 'blue'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>drv</code> to the color of our bars by specifying <code>fill = drv</code> <em>inside</em> of our <code>aes()</code> mappings. </p>
<p>What is the difference between these two ways of working with <code>fill</code> and other aesthetic mappings?</p>
<p>When you include <code>fill</code>, <code>color</code>, or another aesthetic <em>inside the <code>aes()</code></em> of your <code>ggplot</code> code, you're telling <code>ggplot</code> to map a variable to that aesthetic in your graph. This is what we did when we said <code>fill = drv</code> above to fill different drive types with different colors.</p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with <code>fill</code> when we made the bar chart bars blue with <code>fill = 'blue'</code>. You also saw how we could outline the bars with a specific color when we used <code>color = '#add8e6'</code>. </p>
<p>Whenever you're trying to map a variable in your data to an aesthetic to your graph, you want to specify that <strong>inside the <code>aes()</code> function</strong>. And whenever you're trying to hardcode a specific parameter in your graph (making the bars blue, for example), you want to specify that <strong>outside the <code>aes()</code> function</strong>. I hope this helps to clear up any confusion you have on the distinction between aesthetic mappings and parameters! </p>
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was first learning R and ggplot, this difference between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>), and parameters (the ones <em>outside</em> your <code>aes()</code>) was constantly confusing me. Luckily, over time, you'll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>drv</code> variable to <code>fill</code>, you should include <code>fill = drv</code> within the <code>aes()</code> of your <code>geom_bar</code> call. What happens if you include it outside accidentally, and instead run <code>ggplot(mpg) + geom_bar(aes(x = class), fill = drv)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_bar error message" src="../images/20190501_geom_bar/error2.png" width="600px" />
</center></p>
<p>Whenever you see this error about object not found, be sure to check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>On the other hand, if we try including a specific parameter value (for example, <code>fill = 'blue'</code>) inside of the <code>aes()</code> mapping, the error is a bit less obvious. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="s">'blue'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill_blue_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a bar chart, but it's not what we intended. </p>
<p>For starters, the bars in our bar chart are all red instead of the blue we were hoping for! Also, there's a legend to the side of our bar graph that simply says 'blue'. </p>
<p>What's going on here? Under the hood, <code>ggplot</code> has taken the string 'blue' and created a new hidden column of data where every value simply says 'blue'. Then, it's <em>mapped</em> that column to the <code>fill</code> aesthetic, like we saw before when we specified <code>fill = drv</code>. That's why the legend label reads 'blue', while the bars themselves are colored, not blue, but <code>ggplot</code>'s default fill color.</p>
<p>If this is confusing, that's okay for now. Just remember: when you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
<p>You should now have a solid understanding of how to create a bar chart in R using the <code>ggplot</code> bar chart function, <code>geom_bar</code>! </p>
<h2>Solidify Your Understanding</h2>
<p>Experiment with the things you've learned to solidify your understanding. You can <a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">download my free workbook</a> with the code from this article to work through on your own.</p>
<p>I've found that working through code on my own is the best way for me to learn new topics so that I'll actually remember them when I need to do things on my own in the future. </p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>Getting a Data Science Job is not a Numbers Game!2019-04-29T06:23:00-04:00Michael Tothtag:michaeltoth.me,2019-04-29:getting-a-data-science-job-is-not-a-numbers-game.html<p><center>
<img alt="Getting a data science job by throwing darts at a board" src="../images/20190429_data_science_job_numbers_game/darts.png" width="600px" />
</center></p>
<h2>My First (Non Data Science) Job Search</h2>
<p>Let me tell you a story about my first job search. It was 2010, and data science jobs weren't really a thing yet. I'll get to that in a minute, but bear with me first because there's a point to all this.</p>
<p>At the time, I was a junior at the University of Pennsylvania, where I was studying finance and statistics. </p>
<p>Every year, there was a months-long on-campus recruiting season where all of the students frantically applied to secure prestigious jobs and internships from big banks and other financial companies.</p>
<p>And I knew I <strong>NEEDED</strong> to get one of those jobs. </p>
<p>But unlike a lot of the other students at Penn, I didn't grow up in New York City. I was from a working class family in the midwest. So when I started my job search, I had no connections in the industry. Zero.</p>
<p>Obviously, my applications weren't going to be fast tracked. But more importantly, I had nobody to ask questions about the different job opportunities available. </p>
<p>It felt like everybody else in school understood all of the different types of jobs in finance and which ones to apply to.</p>
<p>Not me. I had no clue. But, what the hell, I figured. I was smart. I was near the top of my class! I had a 3.8 GPA and an extremely difficult and technical course load. Somebody would hire me! And I didn't really care what <strong>type of job</strong> I got, I just needed a job. So I applied to everything. </p>
<p>I applied to trading jobs. I applied to investment risk jobs. I applied to investment banking jobs. Over 100 jobs in total. And then... crickets. Most of the companies didn't even respond to me! It was demoralizing.</p>
<p>I was freaking out. I needed to land a job. My first job was extremely important, and it would set me up for my entire career. I worried if I didn't get a job now, I would have to move back to Ohio after I graduated, probably closing the door on a prestigious finance job forever.</p>
<h2>My Big Break</h2>
<p>Finally, after weeks of anguish, I got a big break. I had been asked to interview for one of the jobs I had applied to, a technical trading job at Allstate. </p>
<p>I was so excited. When I went to the interview, I met a guy named Mark. Mark was the one who had decided to interview me, and the hiring decision was ultimately his to make.</p>
<p>Mark and I got along well, from what I remember. He had a relatively senior role at the company, but his schooling background had been similar to mine. He'd studied finance and engineering, and he was looking for somebody smart with a strong combination of finance and technical skills.</p>
<p>He must have seen some potential in me, because shortly after the interview he offered me the job. Of the hundred-or-so jobs I applied to, this was the only offer I received.</p>
<h2>What I Learned</h2>
<p>So what's my point? I tell you your data science job search is not a numbers game, but then I tell you how I applied to over a hundred jobs to get a single offer. What gives?</p>
<p>Here's the thing: I got this job because I had the exact combination of finance and technical skills that Mark was looking for. Most of the other jobs I applied for, I never had a shot of getting, because I didn't have the right background. All of those applications, and all of my effort in applying, were a waste.</p>
<p>If I had instead focused on only applying to the technical finance jobs where I had a unique advantage, I'd have had a much higher success rate. </p>
<h2>Getting My First Data Science Job with a Strategic R Blog Post</h2>
<p>My experience during that application process changed how I thought about job searches. Years later, when I was trying to break into data science from my finance job at BlackRock, I took a different approach.</p>
<p>I knew I wanted to get into data science, but with no formal training and potentially hundreds of applicants for each position, I also knew I needed to stand out. So I scoured job boards searching for <strong>jobs where I knew my skills would give me an advantage</strong>. </p>
<p>I was very good at financial data manipulation, something the majority of data scientists know absolutely nothing about. </p>
<p>So when I found a financial technology startup specializing in online lending analytics that was looking for a data scientist, I knew it was the perfect opportunity for me. </p>
<p>What did I do? Did I send in my application and then just hope they contacted me?</p>
<p>Nope.</p>
<p>I wrote a <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html">detailed blog post</a> analyzing historical default rates for Lending Club loans. This was exactly the type of work I'd be expected to do at the company. I probably spent 10 hours doing the research and analysis and then writing that blog post. </p>
<p>What do you imagine happened? They called me in for an on-site interview. And I pretty much breezed through it. The interview was primarily to assess my cultural fit, not my data science skills, because I'd already proven to them I was capable!</p>
<h2>Avoid the Spray and Pray Approach when Applying for Data Science Jobs</h2>
<p>When you carefully select the companies you're going to apply to based on an alignment of their needs and your skills, you can dedicate more time to each application. That extra time is how you stand out in a pool of hundreds of job applicants for a single position. You could:</p>
<ul>
<li>Write a blog post showing your ability to do the work</li>
<li>Send them a detailed list of metrics they should be tracking to improve their business</li>
<li>Analyze a relevant public dataset and tell the company how they could incorporate that data into their product</li>
</ul>
<p>The specific thing you do will vary across companies and industries. But if you can do <strong>something</strong> to add value and differentiate yourself from the hundreds of other applications, you will vastly improve your chances of getting a job.</p>
<p><strong>ALWAYS REMEMBER THIS</strong>: The job you're applying for exists because the company has a problem that they're trying to solve. They're not looking for a generic person with a generic set of data analysis skills. They're looking for a <strong>specific person</strong> that can help them solve their problems. </p>
<p>That doesn't mean you need to know everything, but it does mean that you should lean into your specific strengths when deciding where to apply. If you can show the company that you can solve their problems, you become a top-5 candidate immediately. </p>
<p>For your next data science job search, don't apply for a hundred jobs. Find 10 jobs where you bring unique skills to the table, and try to do something that demonstrates your unique skills to those employers. I promise you'll find far more success with this approach.</p>
<hr />
<p>I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.</p>
<p><a href="http://eepurl.com/gmYioz">Sign up here to receive my best tips</a></p>A Detailed Guide to the ggplot Scatter Plot in R2019-04-24T08:05:00-04:00Michael Tothtag:michaeltoth.me,2019-04-24:a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html<p>When it comes to data visualization, flashy graphs can be fun. But if you're trying to convey information, especially to a broad audience, flashy isn't always the way to go. </p>
<p>Last week I showed how to work with <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in R. </p>
<p>In this article, I'm going to talk about creating a scatter plot in R. Specifically, we'll be creating a <code>ggplot</code> scatter plot using <code>ggplot</code>'s <code>geom_point</code> function. </p>
<p>A scatter plot is a two-dimensional data visualization that uses points to graph the values of two different variables - one along the x-axis and the other along the y-axis. Scatter plots are often used when you want to assess the relationship (or lack of relationship) between the two variables being plotted.</p>
<h5>Scatter Plot of Adam Sandler Movies from FiveThirtyEight</h5>
<p><center>
<img alt="FiveThirtyEight Scatter Plot of Adam Sandler Movies" src="../images/20190422_geom_point/sandler.png" width="600px" />
</center></p>
<p>For example, in this graph, FiveThirtyEight plots Rotten Tomatoes ratings against box office gross for a series of Adam Sandler movies. They've additionally grouped the movies into 3 categories, highlighted in different colors.</p>
<h5>The Famous Gapminder Scatter Plot of Life Expectancy vs. Income by Country</h5>
<p><center>
<img alt="Gapminder Scatter Plot of Life Expectancy vs. Income by Country" src="../images/20190422_geom_point/gapminder.png" width="600px" />
</center></p>
<p>This scatter plot, initially created by Hans Rosling, is famous among data visualization practitioners. It graphs the life expectancy vs. income for countries around the world. It also uses the size of the points to map country population and the color of the points to map continents, adding 2 additional variables to the traditional scatter plot. </p>
<p>Hans Rosling used a famously provocative and animated presentation style to make this data come alive. He used his presentations to advocate for sustainable global development through the Gapminder Foundation. </p>
<p>Hans Rosling's example shows how simple graphic styles can be powerful tools for communication and change when used properly! Convinced? Let's dive into this guide to creating a ggplot scatter plot in R!</p>
<h2>Follow Along With the Workbook</h2>
<p>I've created a <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">free workbook</a> to help you apply what you're learning as you read. </p>
<p>The workbook is an R file that contains all the code shown in this post as well as additional questions and exercises to help you understand the topic even deeper.</p>
<p>If you want to really learn how to create a scatter plot in R so that you'll still remember weeks or even months from now, you need to practice. </p>
<p>So <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">Download the workbook now</a> and practice as you read this post!</p>
<h2>Introduction to ggplot</h2>
<p>Before we get into the ggplot code to create a scatter plot in R, I want to briefly touch on <code>ggplot</code> and why I think it's the best choice for plotting graphs in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Investigating our dataset</h2>
<p>Throughout this post, we'll be using the <code>mtcars</code> dataset that's built into R. This dataset contains details of design and performance for 32 cars. Let's take a look to see what it looks like:</p>
<p><center>
<img alt="A snippet of the mtcars dataset" src="../images/20190422_geom_point/mtcars.png" width="400px" />
</center></p>
<p>The mtcars dataset contains 11 columns: </p>
<ul>
<li><code>mpg</code>: Miles/(US) gallon</li>
<li><code>cyl</code>: Number of cylinders</li>
<li><code>disp</code>: Displacement (cu.in.)</li>
<li><code>hp</code>: Gross horsepower</li>
<li><code>drat</code>: Rear axle ratio</li>
<li><code>wt</code>: Weight (1000 lbs)</li>
<li><code>qsec</code>: 1/4 mile time</li>
<li><code>vs</code>: Engine (0 = V-shaped, 1 = straight)</li>
<li><code>am</code>: Transmission (0 = automatic, 1 = manual)</li>
<li><code>gear</code>: Number of forward gears</li>
<li><code>carb</code>: Number of carburetors</li>
</ul>
<h2>How to create a simple scatter plot in R using geom_point()</h2>
<p><code>ggplot</code> uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I talked about <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">geom_line</a>, which is used to produce line graphs. Today I'll be focusing on <code>geom_point</code>, which is used to create scatter plots in R. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/simple_scatter_plot-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible <code>ggplot</code> scatter plot we can create using <code>geom_point</code>. Let's review this in more detail:</p>
<p>First, I call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. It's essentially a blank canvas on which we'll add our data and graphics. In this case, I passed mtcars to <code>ggplot</code> to indicate that we'll be using the mtcars data for this particular <code>ggplot</code> scatter plot.</p>
<p>Next, I added my <code>geom_point</code> call to the base <code>ggplot</code> graph in order to create this scatter plot. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use <code>wt</code> as the x-axis variable and <code>mpg</code> as the y-axis variable. </p>
<p>And that's it, we have our scatter plot! It shows that, on average, as the weight of cars increases, miles-per-gallon tends to fall. </p>
<h2>Changing point color in a ggplot scatter plot</h2>
<p>Expanding on this example, we can now play with colors in our scatter plot.</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">),</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this <code>geom_point</code> call is identical to the one before, except that we've added the modifier <code>color = 'blue'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you've read my previous <code>ggplot</code> guides, this bit should look familiar!</p>
<div class="highlight"><pre><span></span>mtcars<span class="o">$</span>am <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>mtcars<span class="o">$</span>am<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> color <span class="o">=</span> am<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/color_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph shows the same data as before, but now there are two different colors! The red dots correspond to automatic transmission vehicles, while the blue dots represent manual transmission vehicles. Did you catch the 3 changes we used to change the graph? They were:</p>
<ol>
<li>First, we converted the <code>am</code> variable to a factor. What do you think happens if we don't do this? Give it a try!</li>
<li>Instead of specifying <code>color = 'blue'</code>, we specified <code>color = am</code></li>
<li>We moved the color parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Let's review each of these changes:</p>
<h5>Converting the <code>am</code> variable to a factor</h5>
<p>In the dataset, <code>am</code> was initially a numeric variable. You can check this by running <code>class(mtcars$am)</code>. When you pass a numeric variable to a color scale in <code>ggplot</code>, it creates a continuous color scale. </p>
<p>In this case, however, there are only 2 values for the <code>am</code> field, corresponding to automatic and manual transmission. So it makes our graph more clear to use a discrete color scale, with 2 color options for the two values of <code>am</code>. We can accomplish this by converting the <code>am</code> field from a numeric value to a factor, as we did above. </p>
<p>On your own, try graphing both with and without this conversion to factor. If you've already converted to factor, you can reload the dataset by running <code>data(mtcars)</code> to try graphing as numeric! </p>
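<p>As a quick sketch of that comparison, the two calls below differ only in whether <code>am</code> is treated as numeric or as a factor; the first produces a continuous color gradient, while the second produces a discrete two-color scale:</p>

```r
library(ggplot2)
data(mtcars)  # reload so am is back to its original numeric form

# Numeric am: ggplot creates a continuous color gradient
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = am))

# Factor am: ggplot creates a discrete scale with one color per level
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(am)))
```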
<p>This point is a bit tricky. Check out <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">my workbook for this post</a> for a guided exploration of this issue in more detail!</p>
<h5>Specifying <code>color = am</code> and moving it within the <code>aes()</code> parentheses</h5>
<p>I'm combining these because these two changes work together. </p>
<p>Before, we told <code>ggplot</code> to change the color of the points to blue by adding <code>color = 'blue'</code> to our <code>geom_point()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our points, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>am</code> column to the <code>color</code> aesthetic. </p>
<p>This means we are telling <code>ggplot</code> to use a different color for each value of <code>am</code> in our data! This mapping also lets <code>ggplot</code> know that it needs to create a legend to identify the transmission types, which it places on the graph automatically!</p>
<h2>Changing point shapes in a ggplot scatter plot</h2>
<p>Let's look at a related example. This time, instead of changing the color of the points in our scatter plot, we will change the shape of the points:</p>
<div class="highlight"><pre><span></span>mtcars<span class="o">$</span>am <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>mtcars<span class="o">$</span>am<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> shape <span class="o">=</span> am<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/shape_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>The code for this <code>ggplot</code> scatter plot is identical to the code we just reviewed, except we've substituted <code>shape</code> for <code>color</code>. The graph produced is quite similar, but it uses different shapes (triangles and circles) instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.</p>
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We just saw how we can create graphs in <code>ggplot</code> that map the <code>am</code> variable to color or shape in a scatter plot. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they include everything you see within the <code>aes()</code> in ggplot.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>I know this can sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_point.</p>
<h5>Reviewing the list of geom_point aesthetic mappings</h5>
<p>The main aesthetic mappings for a ggplot scatter plot include:</p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>color</code>: Map a variable to a point color</li>
<li><code>shape</code>: Map a variable to a point shape</li>
<li><code>size</code>: Map a variable to a point size</li>
<li><code>alpha</code>: Map a variable to a point transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code>, <code>y</code>, <code>color</code>, and <code>shape</code> aesthetic mappings. </p>
<p><code>x</code> and <code>y</code> are what we used in our first <code>ggplot</code> scatter plot example where we mapped the variables <code>wt</code> and <code>mpg</code> to x-axis and y-axis values. Then, we experimented with using <code>color</code> and <code>shape</code> to map the <code>am</code> variable to different colored points or shapes. </p>
<p>In addition to those, there are 2 other aesthetic mappings commonly used with <code>geom_point</code>. We can use the <code>alpha</code> aesthetic to change the transparency of the points in our graph. Finally, the <code>size</code> aesthetic can be used to change the size of the points in our scatter plot.</p>
<p>Note there are two additional aesthetic mappings for ggplot scatter plots, <code>stroke</code> and <code>fill</code>, but I'm not going to cover them here. They're only used with particular <code>shape</code> values, and they have very specific use cases beyond the scope of this guide. </p>
<h5>Changing the <code>size</code> aesthetic mapping in a ggplot scatter plot</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> size <span class="o">=</span> cyl<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/size_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In the code above, we map the number of cylinders (<code>cyl</code>), to the size aesthetic in ggplot. Cars with more cylinders display as larger points in this graph. </p>
<p>Note: A scatter plot where the size of the points varies based on a variable in the data is sometimes called a bubble chart. The scatter plot above could be considered a bubble chart!</p>
<p>In general, we see that cars with more cylinders tend to be clustered in the bottom right of the graph, with larger weights and lower miles per gallon, while those with fewer cylinders are on the top left. That said, it's a bit hard to make out all the points in the bottom right corner. How can we solve that issue? Let's learn more about the alpha aesthetic to find out!</p>
<h5>Changing transparency in a ggplot scatter plot with the <code>alpha</code> aesthetic</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> alpha <span class="o">=</span> cyl<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this code we've mapped the alpha aesthetic to the variable <code>cyl</code>. Cars with fewer cylinders appear more transparent, while those with more cylinders are more opaque. But in this case, I don't think this helps us to understand relationships in the data any better. Instead, it just seems to highlight the points on the bottom right. I think this is a bad graph!</p>
<p>How else can we use the alpha aesthetic to improve the readability of our graph? Let's turn back to our code from above where we mapped the cylinders to the size variable, creating what I called a bubble chart. Remember how it was difficult to make out all of the cars in the bottom right? What if we made all of the points in the graph semi-transparent so that we can see through the bubbles that are overlapping? Let's try!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> size <span class="o">=</span> cyl<span class="p">),</span> alpha <span class="o">=</span> <span class="m">0.3</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/size_and_alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This makes it much easier to see the clustering of larger cars in the bottom right while not reducing the importance of those points in the top left! This is my favorite use of the alpha aesthetic in ggplot: adding transparency to get more insight into dense regions of points. </p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>Above, we saw that we are able to use <code>color</code> in two different ways with geom_point. First, we were able to set the color of our points to blue by specifying <code>color = 'blue'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>am</code> to color by specifying <code>color = am</code> <em>inside</em> of our <code>aes()</code> mappings. </p>
<p>Similarly, we saw two different ways to use the <code>alpha</code> aesthetic as well. First, we <em>mapped</em> the variable <code>cyl</code> to alpha by specifying <code>alpha = cyl</code> <em>inside</em> of our <code>aes()</code> mappings. Then, we set the alpha of all points to 0.3 by specifying <code>alpha = 0.3</code> <em>outside</em> of our <code>aes()</code> mappings. </p>
<p>What is the difference between these two ways of dealing with the aesthetic mappings available to us?</p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with color when we made the scatter plot points blue with <code>color = 'blue'</code> above. Then, you saw how to do this with alpha when we set the transparency to 0.3 with <code>alpha = 0.3</code>. Now let's look at an example of how to do this with shape in the same manner:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">),</span> shape <span class="o">=</span> <span class="m">18</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/shape_parameter-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here, we specify to use shape 18, which corresponds to this diamond shape you see here. Because we specified this <em>outside</em> of the <code>aes()</code>, this applies to all of the points in this graph!</p>
<p>To review what values <code>shape</code>, <code>size</code>, and <code>alpha</code> accept, just run <code>?shape</code>, <code>?size</code>, or <code>?alpha</code> from your console window! For even more details, check out <code>vignette("ggplot2-specs")</code>.</p>
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was first learning R and ggplot, the difference between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>), and parameters (the ones <em>outside</em> your <code>aes()</code>) was constantly confusing me. Luckily, over time, you'll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>cyl</code> variable to <code>shape</code>, you should include <code>shape = cyl</code> within the <code>aes()</code> of your <code>geom_point</code> call. What happens if you include it outside accidentally, and instead run <code>ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), shape = cyl)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_line error message" src="../images/20190422_geom_point/error_1.png" width="600px" />
</center></p>
<p>Whenever you see this error about object not found, be sure to check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>On the other hand, if we try including a specific parameter value (for example, <code>color = 'blue'</code>) inside of the <code>aes()</code> mapping, the error is a bit less obvious. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a scatter plot, but it's not what we intended. </p>
<p>For starters, the points are all red instead of the blue we were hoping for! Also, there's a legend in the graph that simply says 'blue'. </p>
<p>What's going on here? Under the hood, <code>ggplot</code> has taken the string 'blue' and effectively created a new hidden column of data where every value simply says 'blue'. Then, it has <em>mapped</em> that column to the color aesthetic, like we saw before when we specified <code>color = am</code>. This results in the legend label and the color of all the points being set, not to blue, but to the default color in <code>ggplot</code>.</p>
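<p>As an aside: if you ever do want <code>ggplot</code> to treat values inside <code>aes()</code> as literal color names, ggplot2 provides <code>scale_color_identity()</code>, which uses the mapped values directly rather than assigning default colors (and suppresses the legend by default):</p>

```r
library(ggplot2)

# scale_color_identity() tells ggplot to use the mapped value 'blue' as-is
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = 'blue')) +
  scale_color_identity()
```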
<p>If this is confusing, that's okay for now. Just remember: when you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
<p>You should now have a solid understanding of how to create a scatter plot in R using the <code>ggplot</code> scatter plot function, <code>geom_point</code>! </p>
<p>Experiment with the things you've learned to solidify your understanding. You can <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">download my free workbook</a> with the code from this article to work through on your own.</p>
<p>I've found that working through code on my own is the best way for me to learn new topics so that I'll actually remember them when I need to do things on my own in the future. </p>
<p><a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">Download the workbook now</a> to practice what you learned!</p>If You Want to be Effective, You Need to Approach Data Science with a Business Mindset2019-04-23T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-23:if-you-want-to-be-effective-you-need-to-approach-data-science-with-a-business-mindset.html<p>There's a <a href="https://towardsdatascience.com/the-third-wave-data-scientist-1421df7433c9">great article</a> by Dominik Haitz making its way around the data science world this past week. </p>
<p>The entire article is worth reading, but I'll highlight some of my favorite points.</p>
<p>In the article, Dominik talks about the importance of developing a business mindset to succeed in data science.</p>
<p>A business mindset is critical. Remember, at the end of the day, the point of your work is to create some kind of concrete value for your organization. You can, and should, prioritize learning and working on fun and interesting projects, but above all else, you need to create value.</p>
<p>Dominik argues:</p>
<blockquote>
<p>Prioritizing your work and knowing when to stop is the key to efficiency. Think of diminishing returns: Is it worth spending weeks to tweak a model for another 0.2% of precision? Quite often, good enough is the real perfect.</p>
</blockquote>
<p>As I've discussed before, prioritization is extremely important in this fast-moving field, and it's impossible for you to know everything. </p>
<p>I believe in applying the Pareto principle to your learning: focus on mastering the 20% of concepts that will drive 80% of results, and only optimize further as necessary. </p>
<p>It's not that further optimization isn't valuable, but that your resources as a person are limited, and <strong>it's always better to produce something good but imperfect than it is to never produce something that would be perfect</strong>.</p>
<p>Dominik also talks about the importance of communicating your results, something I think many data scientists struggle with. </p>
<p>As difficult as it is to hear this, <strong>your analysis is meaningless if you can't convince key stakeholders in your company to take action based on what you find</strong>! </p>
<p>You need to be able to communicate your results effectively, both throughout your organization and externally. Communicating effectively means trying to see the world through the eyes of those you are communicating to. You can, and should, discuss your analysis differently when speaking with a teammate, an executive, and a client. </p>
<p>This was a great article, and I agree with a lot of what Dominik is saying. <a href="https://towardsdatascience.com/the-third-wave-data-scientist-1421df7433c9">Check out the article</a> for more!</p>Mapping Legal Marijuana States and Medical Marijuana States 1995 - 20192019-04-20T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-20:mapping-legal-marijuana-states-and-medical-marijuana-states-1995-2019.html<p>Today is April 20, 2019. As stoners everywhere celebrate the occasion, I thought I'd turn to creating maps.</p>
<p>As the number of legal marijuana states and medical marijuana states seems to have grown considerably in recent years, I got to thinking: what has the history of legalization been, and how has it changed over time?</p>
<p>In the graph below, I show legal marijuana states, medical marijuana states, and illegal marijuana states over time. Note that for medical marijuana states, there are two categories: broadly legal for medical purposes, and low-THC marijuana legal for medical purposes. Without further ado, I present the map:</p>
<p><img alt="center" src="/figures/20190419_Marijuana_Legalization/map.gif" /></p>
<p>Read on if you're interested in learning how to create this map yourself in R!</p>
<h2>Gathering the data</h2>
<p>I sourced the data for legal marijuana states and medical marijuana states from the <a href="https://en.wikipedia.org/wiki/Timeline_of_cannabis_laws_in_the_United_States">Timeline of cannabis laws in the United States</a> article on Wikipedia. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>albersusa<span class="p">)</span> <span class="c1"># devtools::install_github("hrbrmstr/albersusa")</span>
<span class="kn">library</span><span class="p">(</span>animation<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggalt<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>maps<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
<span class="c1"># Set up base map theme for ggplot</span>
map_theme <span class="o"><-</span> theme<span class="p">(</span>axis.line <span class="o">=</span> element_blank<span class="p">(),</span>
axis.text <span class="o">=</span> element_blank<span class="p">(),</span>
axis.ticks <span class="o">=</span> element_blank<span class="p">(),</span>
panel.background <span class="o">=</span> element_blank<span class="p">(),</span>
panel.border <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.minor <span class="o">=</span> element_blank<span class="p">(),</span>
plot.background <span class="o">=</span> element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="s">'top'</span><span class="p">)</span>
timeline <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'~/Desktop/Marijuana_Legalization.csv'</span><span class="p">)</span>
</pre></div>
<p>Above I load in the various packages we'll be using for this analysis. I also create a blank map theme that will remove a lot of things like axis gridlines and text that we don't want on our final map.</p>
<p>Then, I load in the legal marijuana states data I gathered from the Wikipedia article above. I had to do some manual processing to convert the text in that article to a workable file that had dates of different legalization statuses by state. The file I ended up putting together looked like this:</p>
<p><center>
<img alt="Image of legal marijuana status by state" src="../images/20190419_marijuana_legalization/spreadsheet.png" width="600px" />
</center></p>
<p>In order to create the animated graph, we need to reformat the input data a bit. Ultimately, we want a dataset organized with three columns:</p>
<ul>
<li><code>Year</code> </li>
<li><code>State</code></li>
<li><code>Status</code></li>
</ul>
<p>There should be one status entry for each combination of <code>Year</code> and <code>State</code>, giving the legal marijuana status for that state in that particular year.</p>
<p>While we don't need it for this map, I'm also going to include a <code>Criminalized</code> column to indicate whether marijuana had been decriminalized in a particular year in a state. The processing to get the data in this format all happens in the section below.</p>
<p>For each state, we read in the data to determine which year, if any, marijuana became medicinally legal, recreationally legal, or medicinally legal with low-THC content. This lets us categorize each state-year combination into one of four categories:</p>
<ul>
<li><code>Illegal</code>: All forms of marijuana consumption are illegal</li>
<li><code>Low-THC Medicinally Legal</code>: Low-THC varieties of marijuana are legal for medicinal use</li>
<li><code>Medicinally Legal</code>: Marijuana is legal for medicinal use</li>
<li><code>Legal</code>: Marijuana is legal for recreational use</li>
</ul>
<div class="highlight"><pre><span></span>legalized <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">()</span>
<span class="kr">for</span><span class="p">(</span>state <span class="kr">in</span> timeline<span class="o">$</span>State<span class="p">)</span> <span class="p">{</span>
current_state <span class="o"><-</span> filter<span class="p">(</span>timeline<span class="p">,</span> State <span class="o">==</span> state<span class="p">)</span>
decrim_year <span class="o"><-</span> current_state<span class="o">$</span>Decriminalized
crim_year <span class="o"><-</span> current_state<span class="o">$</span>Criminalized
med_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Medical
med_low_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Medical_Low_THC
legal_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Recreational
status <span class="o">=</span> <span class="s">'Illegal'</span>
criminalized <span class="o">=</span> <span class="kc">TRUE</span>
<span class="kr">for</span><span class="p">(</span>year <span class="kr">in</span> <span class="m">1960</span><span class="o">:</span><span class="m">2019</span><span class="p">)</span> <span class="p">{</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>decrim_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> decrim_year<span class="p">)</span> <span class="p">{</span>
criminalized <span class="o">=</span> <span class="kc">FALSE</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>crim_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> crim_year<span class="p">)</span> <span class="p">{</span>
criminalized <span class="o">=</span> <span class="kc">TRUE</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>med_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> med_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal for Medical Use'</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>med_low_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> med_low_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal for Medical Use, Low-THC Only'</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>legal_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> legal_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal'</span>
<span class="p">}</span>
current_status <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>State <span class="o">=</span> state<span class="p">,</span>
Year <span class="o">=</span> year<span class="p">,</span>
Status <span class="o">=</span> status<span class="p">,</span>
Criminalized <span class="o">=</span> criminalized<span class="p">)</span>
legalized <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>legalized<span class="p">,</span> current_status<span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>Great, so now we have the data that we'll need to create our ultimate graph of legal marijuana states. First, I'm going to create the graph for 2019 to make sure that I have the base formatting down. </p>
<p>I start by loading in a shapefile for the United States that we'll use for graphing. I remove Washington D.C., which I didn't initially pull legalization data for and which causes issues with mapping.</p>
<p>Then I reorganize the legal status into a factor column, which lets us create explicit orders and color mappings below, something we'll need to create the final graph. </p>
<p>I filter the legalization data to only include 2019, then merge the legalization data with the map shapefile to get the information we need for graphing. Finally, I produce the ultimate graph in ggplot! </p>
<div class="highlight"><pre><span></span>us <span class="o"><-</span> usa_composite<span class="p">()</span>
us_map <span class="o"><-</span> fortify<span class="p">(</span>us<span class="p">,</span> region<span class="o">=</span><span class="s">"name"</span><span class="p">)</span> <span class="o">%>%</span> filter<span class="p">(</span>id <span class="o">!=</span> <span class="s">'District of Columbia'</span><span class="p">)</span>
legalized<span class="o">$</span>Status <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">,</span>
levels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Illegal'</span><span class="p">,</span>
<span class="s">'Legal for Medical Use, Low-THC Only'</span><span class="p">,</span>
<span class="s">'Legal for Medical Use'</span><span class="p">,</span>
<span class="s">'Legal'</span><span class="p">))</span>
legalized_2019 <span class="o"><-</span> filter<span class="p">(</span>legalized<span class="p">,</span> Year <span class="o">==</span> <span class="m">2019</span><span class="p">)</span>
legalized_2019_map <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>us_map<span class="p">,</span> legalized_2019<span class="p">,</span> by.x <span class="o">=</span> <span class="s">"id"</span><span class="p">,</span> by.y <span class="o">=</span> <span class="s">"State"</span><span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="kp">order</span><span class="p">)</span>
ggplot<span class="p">(</span>legalized_2019_map<span class="p">)</span> <span class="o">+</span>
geom_polygon<span class="p">(</span>aes<span class="p">(</span>fill <span class="o">=</span> Status<span class="p">,</span> x <span class="o">=</span> long<span class="p">,</span> y <span class="o">=</span> lat<span class="p">,</span> group <span class="o">=</span> group<span class="p">),</span> color <span class="o">=</span> <span class="s">'white'</span><span class="p">)</span> <span class="o">+</span>
coord_proj<span class="p">(</span>us_laea_proj<span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#E9E9E9'</span><span class="p">,</span> <span class="s">'#105927'</span><span class="p">,</span> <span class="s">'#349941'</span><span class="p">,</span> <span class="s">'#61DE58'</span><span class="p">),</span>
name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> limits <span class="o">=</span> <span class="kp">levels</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">+</span>
map_theme <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'Legal Status of Marijuana in '</span><span class="p">,</span> year<span class="p">),</span>
subtitle <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">,</span> y <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190419_Marijuana_Legalization/map it-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>A quick comparison of the above map to other maps I found shows we're capturing the legal statuses correctly, so we're all set. Now, let's get to animating so we can create the complete graph from above!</p>
<div class="highlight"><pre><span></span>legalized_map <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>us_map<span class="p">,</span> legalized<span class="p">,</span> by.x <span class="o">=</span> <span class="s">"id"</span><span class="p">,</span> by.y <span class="o">=</span> <span class="s">"State"</span><span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="kp">order</span><span class="p">)</span>
saveGIF<span class="p">({</span>
<span class="c1"># Repeat 2019 6 times for a pause at the end of the animation</span>
<span class="kr">for</span> <span class="p">(</span>year <span class="kr">in</span> <span class="kt">c</span><span class="p">(</span><span class="m">1995</span><span class="o">:</span><span class="m">2019</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="m">2019</span><span class="p">,</span> <span class="m">5</span><span class="p">)))</span> <span class="p">{</span>
yearly_map <span class="o"><-</span> filter<span class="p">(</span>legalized_map<span class="p">,</span> Year <span class="o">==</span> year<span class="p">)</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>yearly_map<span class="p">)</span> <span class="o">+</span>
geom_polygon<span class="p">(</span>aes<span class="p">(</span>fill <span class="o">=</span> Status<span class="p">,</span> x <span class="o">=</span> long<span class="p">,</span> y <span class="o">=</span> lat<span class="p">,</span> group <span class="o">=</span> group<span class="p">),</span> color <span class="o">=</span> <span class="s">'white'</span><span class="p">)</span> <span class="o">+</span>
coord_proj<span class="p">(</span>us_laea_proj<span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#E9E9E9'</span><span class="p">,</span> <span class="s">'#105927'</span><span class="p">,</span> <span class="s">'#349941'</span><span class="p">,</span> <span class="s">'#61DE58'</span><span class="p">),</span>
name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> limits <span class="o">=</span> <span class="kp">levels</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">24</span><span class="p">,</span> plot_title_size <span class="o">=</span> <span class="m">36</span><span class="p">,</span>
axis_title_size <span class="o">=</span> <span class="m">24</span><span class="p">)</span> <span class="o">+</span>
map_theme <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'Legal Status of Marijuana in '</span><span class="p">,</span> year<span class="p">),</span>
x <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">,</span> y <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="kp">print</span><span class="p">(</span>p<span class="p">)</span>
<span class="p">}</span>
<span class="p">},</span> movie.name <span class="o">=</span> <span class="s">'~/dev/michaeltoth/content/figures/20190419_Marijuana_Legalization/map.gif'</span><span class="p">,</span> interval <span class="o">=</span> <span class="m">1</span><span class="p">,</span> ani.width <span class="o">=</span> <span class="m">1400</span><span class="p">,</span> ani.height <span class="o">=</span> <span class="m">1000</span><span class="p">)</span>
</pre></div>
<p>Above I create the same map, except that now I produce one map for each year from 1995 to 2019. I also repeat 2019 six times in total to give the effect of pausing on the final frame.</p>
<p>To create this animation, I wrap the creation of the yearly maps in the <code>saveGIF</code> command, which converts a series of images into a GIF. Then, I loop through the years 1995-2019 (repeating 2019 six times) to create the animation!</p>
<p><img alt="center" src="/figures/20190419_Marijuana_Legalization/map.gif" /></p>
<hr />
<p>Did you find this post interesting? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>Generating the Ultimate List of 41 Data Science Podcasts by Crowdsourcing Google Results2019-04-19T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-19:generating-the-ultimate-list-of-41-data-science-podcasts-by-crowdsourcing-google-results.html<p>Confession time: years ago, I was skeptical of podcasts. I was a music-only listener on commutes. Can you imagine? But around 2016, I gave in and finally took the plunge into podcasts. And I'm so glad that I did. </p>
<p>Since then, I've seen enormous benefits, all attributable at least in part to the podcasts I've listened to. I've improved my programming, learned new skills, and started multiple income-generating businesses.</p>
<p>Naturally, I've been interested in data science podcasts. Initially, I found data science podcasts through people I followed on Twitter who had shows of their own. But I'm always on the lookout for new podcasts, so I took to the internet to find recommendations.</p>
<p>If you're a podcast listener, you've probably learned that the current state of podcast discovery leaves much to be desired. It's very hard to find new podcasts, especially for niche fields like data science and analytics. </p>
<p>So of course, when I tried searching for interesting data science podcasts, I kept finding the same shows recommended everywhere. And usually they were the podcasts I was already listening to! </p>
<p>I decided that if I wanted a bigger list of data science podcasts, I needed to go deep. I searched nearly 100 lists for recommended podcasts, and used them to compile <strong>the most complete list of data science podcasts</strong> you will find. </p>
<p><a href="https://mailchi.mp/d18f2f50ca14/data-science-podcasts">Click here for the full list of 41 podcasts</a></p>
<p>In this post I'm going to talk about how I collected this list and analyze the results.</p>
<h2>Gathering a list of data science podcasts</h2>
<p>I knew I wanted to build a big list. But I also knew that not all podcasts are created equal, and that I needed some way to differentiate the truly great podcasts on the list. I decided I would use podcast recommendations as a form of social proof, so that the most recommended podcasts would bubble to the top of the list. Here's the method I decided to follow:</p>
<ol>
<li>I'd generate a list of search terms</li>
<li>I'd perform a Google search for each term on the list</li>
<li>I'd open each of the top 10 Google links for that search term and note all the results</li>
<li>I'd aggregate the complete results by podcast and use the total number of recommendations to create a "recommendation score". Higher scores should, in theory, represent better podcasts. Or at least more well known podcasts.</li>
</ol>
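<p>The aggregation in step 4 is easy to sketch in R. The code below is only an illustration of the idea (the actual tallying for this post was done by hand); the <code>results</code> data frame and its podcast/source values are hypothetical:</p>
<div class="highlight"><pre><span></span>library(tidyverse)

# One row for each time a podcast appeared on a recommendation page
results <- tribble(
  ~Podcast,                     ~Source,
  "Data Skeptic",               "list-a",
  "Data Skeptic",               "list-b",
  "Data Stories",               "list-a",
  "Data Stories",               "list-b",
  "Data Stories",               "list-c",
  "Not So Standard Deviations", "list-c"
)

# Count appearances to produce the "recommendation score"
scores <- results %>%
  count(Podcast, name = "Score") %>%
  arrange(desc(Score))
</pre></div>
<p>Podcasts that appear on more pages end up with higher scores, which is exactly the "social proof" ranking described above.</p>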
<h4>Generating a list of search phrases (Step 1)</h4>
<p>This step was relatively easy. I knew I wanted a list of data science podcasts, but I also wanted to include more specific subgenres like data visualization, as well as more broad but related topics like Python programming or SQL & Databases. I settled on this list of phrases:</p>
<ul>
<li>Data Science Podcasts</li>
<li>Data Engineering Podcasts</li>
<li>Data Visualization Podcasts</li>
<li>Analytics Podcasts</li>
<li>SQL Podcasts</li>
<li>R Podcasts -reddit</li>
<li>Python Podcasts</li>
</ul>
<p>Pretty straightforward. The only curious bit is the R Podcasts query. Reddit's URL structure is such that reddit.com/r/podcasts leads to the podcasts subreddit, so without the explicit <code>-reddit</code> exclusion, all of the top links pointed to that subreddit and were unrelated to R programming.</p>
<h4>Searching Google and aggregating (Steps 2-4)</h4>
<p>Next, I'd search each of these phrases in Google and open up the first 10 sites that popped up in my search results.</p>
<p>Often these were lists others had created, for example "top 5 data science podcasts". I copied down each list of podcasts and kept a running tally of how many times I'd seen a particular podcast represented. </p>
<p>Occasionally, the Google search would yield links to specific podcasts rather than to lists of podcasts. In this case, I would record this as well. My thinking is that ranking on Google for a particular search phrase is at least as reliable an indicator of podcast quality as inclusion on a "best data science podcasts" list. That said, this did particularly benefit those podcasts whose names closely matched my search query, as was the case with <a href="https://www.dataengineeringpodcast.com/">Data Engineering Podcast</a> and the <a href="https://r-podcast.org/">The R-Podcast</a>.</p>
<p>After going through all of the search queries, I had a list of podcast recommendations along with a tally of how often the podcast had been recommended in search results.</p>
<p>Finally, I removed any podcasts that had only 1 appearance in the results. There were a large number of these, and I felt it would dilute the value of the list to include them. This was somewhat arbitrary, but I think it makes for a stronger list overall.</p>
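<p>This final filtering step is a one-liner with <code>dplyr</code>. A minimal, self-contained sketch (the tibble values here are made up for illustration):</p>
<div class="highlight"><pre><span></span>library(dplyr)

podcasts <- tibble(Title = c("Data Stories", "One-Hit Wonder"),
                   Score = c(9, 1))

# Keep only podcasts that appeared more than once in the search results
podcasts <- filter(podcasts, Score > 1)
</pre></div>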
<h4>Analyzing the list of best data science podcasts</h4>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
extrafont<span class="o">::</span>loadfonts<span class="p">(</span>quiet <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="c1"># Needed for hrbrthemes / mac dependency issue</span>
podcasts <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'~/Desktop/best_podcasts.csv'</span><span class="p">)</span>
</pre></div>
<p>First I read in my compiled list of podcasts, then I used ggplot to graph the results.</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>podcasts<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> reorder<span class="p">(</span>Title<span class="p">,</span> Score<span class="p">),</span> y <span class="o">=</span> Score<span class="p">))</span> <span class="o">+</span>
geom_bar<span class="p">(</span>stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">()</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Top Data Science Podcasts'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Rankings crowdsourced from Google search results'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Recommendation Score'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">5</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">8</span><span class="p">,</span> axis_title_size <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.grid.minor.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.ontop <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190415_best_ds_podcasts/graph_podcasts-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This is awesome! This list of 41 podcasts is more than I found anywhere else online while going through this exercise, and I'm confident it's the most extensive list of data science podcasts you can find. </p>
<p>I had heard of many of the podcasts in the top 10--Data Stories, Data Skeptic, Partially Derivative, The O'Reilly Data Show, and Not So Standard Deviations--but others were new to me! I hadn't heard of Linear Digressions, or Talking Machines, or Learning Machines 101. While I haven't yet had a chance to listen to these, I'm excited to check them out!</p>
<p>So now we have a list of 41 data science podcasts to review. This is great, but I think we can do better! While these podcasts are all loosely related to data science, they're quite different from one another in focus. For example, FiveThirtyEight Politics is a relatively mainstream podcast, Data Stories deals with topics in data visualization, and Data Skeptic is an all-around instructional data science podcast. </p>
<p>I thought some form of categorization would help guide my listening, so I created a broad list of categories and grouped them as best as I could. I decided I would group the podcasts into 8 categories:</p>
<ul>
<li>General Data Science and Analytics</li>
<li>Relevant Mainstream Podcasts </li>
<li>Machine Learning & AI </li>
<li>Data Visualization </li>
<li>Data Engineering </li>
<li>SQL & Databases </li>
<li>R Programming </li>
<li>Python Programming </li>
</ul>
<p>I went through this list and categorized them as best as I could according to topic. Equipped with these new categories, let's take a look at the list of recommendations: </p>
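<p>One lightweight way to attach categories like these in R is <code>dplyr::case_when</code>. This is just a sketch (the real categorization for this post was done by hand), and the matching rules below are hypothetical:</p>
<div class="highlight"><pre><span></span>library(dplyr)
library(stringr)

podcasts <- tibble(Title = c("The R-Podcast", "Talk Python To Me", "Data Stories"))

# Hypothetical title-matching rules; hand-label anything the rules miss
podcasts <- mutate(podcasts, Category = case_when(
  str_detect(Title, "R-Podcast") ~ "R Programming",
  str_detect(Title, "Python")    ~ "Python Programming",
  str_detect(Title, "Stories")   ~ "Data Visualization",
  TRUE                           ~ "General Data Science and Analytics"
))
</pre></div>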
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>podcasts<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> reorder<span class="p">(</span>Title<span class="p">,</span> Score<span class="p">),</span> y <span class="o">=</span> Score<span class="p">,</span> fill <span class="o">=</span> Category<span class="p">))</span> <span class="o">+</span>
geom_bar<span class="p">(</span>stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">()</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"#a6cee3"</span><span class="p">,</span><span class="s">"#1f78b4"</span><span class="p">,</span><span class="s">"#b2df8a"</span><span class="p">,</span><span class="s">"#33a02c"</span><span class="p">,</span><span class="s">"#fb9a99"</span><span class="p">,</span><span class="s">"#e31a1c"</span><span class="p">,</span><span class="s">"#fdbf6f"</span><span class="p">,</span><span class="s">"#ff7f00"</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Top Data Science Podcasts'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Rankings crowdsourced from Google search results'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Recommendation Score'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">5</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">8</span><span class="p">,</span> axis_title_size <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.grid.minor.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.ontop <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span>
legend.position <span class="o">=</span> <span class="s">'bottom'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190415_best_ds_podcasts/graph_podcasts_by_topic-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>While most of the top 10 are made up of what I'm calling general data science and analytics podcasts, Data Stories stands alone as the number one podcast and the only data visualization podcast in the top 10. Most of the language-specific podcasts are lower on the list, but that's likely because they are more specific in nature, not necessarily because they are lower quality. The R-Podcast was the only language-specific podcast to crack the top 10, so I'm definitely going to check that one out!</p>
<p>Do you have favorite podcasts that aren't included here? Let me know in the comments. Also, let me know how you discover new podcasts; I'd love to improve my own discovery process! If you're interested, you can get the full list of data science podcasts below:</p>
<p><a href="https://mailchi.mp/d18f2f50ca14/data-science-podcasts">Get the spreadsheet of all 41 data science podcasts organized by topic</a></p>
<hr />
<p>Every week I publish concise tutorials 🎓 and career advice 💻 for data science and analytics workers. <a href="http://eepurl.com/gmYioz">I will help you learn R programming, build your data science career, and raise your salary.</a></p>A Detailed Guide to Plotting Line Graphs in R using ggplot geom_line2019-04-17T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-17:a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html<p>When it comes to data visualization, it can be fun to think of all the flashy and exciting ways to display a dataset. But if you're trying to convey information, flashy isn't always the way to go. </p>
<p>In fact, in most cases, <strong>simplicity is key to making your audience understand</strong> your data. Whether it's <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, bar graphs, or line graphs (the subject of this post!), common graph types make things easy for your audience, which means you can more easily share your message.</p>
<p>Right now, we're talking about line graphs. A line graph is a type of graph that displays information as a series of data points connected by straight line segments. </p>
<h5>The price of Netflix stock (NFLX) displayed as a line graph</h5>
<p><center>
<img alt="The price of Netflix stock (NFLX) displayed as a line graph" src="../images/20190417_ggplot_geom_line/NFLX.png" width="600px" />
</center></p>
<h5>Line graph of average monthly temperatures for four major cities</h5>
<p><center>
<img alt="Line graph of average monthly temperatures for four major cities" src="../images/20190417_ggplot_geom_line/climate.png" width="600px" />
</center></p>
<p>There are many different ways to use R to plot line graphs, but the one I prefer is the <code>ggplot geom_line</code> function.</p>
<h2>Introduction to ggplot</h2>
<p>Before we dig into creating line graphs with the <code>ggplot geom_line</code> function, I want to briefly touch on <code>ggplot</code> and why I think it's the best choice for plotting graphs in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Investigating our dataset</h2>
<p>Throughout this post, we'll be using the Orange dataset that's built into R. This dataset contains information on the age and circumference of 5 different orange trees, letting us see how these trees grow over time. Let's take a look at this dataset to see what it looks like:</p>
<p><center>
<img alt="A snippet of the Orange dataset" src="../images/20190417_ggplot_geom_line/Orange.png" width="400px" />
</center></p>
<p>The dataset contains 3 columns: Tree, age, and circumference. There are 7 observations for each Tree, and there are 5 Trees, for a total of 35 observations in all. </p>
<h2>Simple example of ggplot + geom_line()</h2>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
<span class="c1"># Filter the data we need</span>
tree_1 <span class="o"><-</span> filter<span class="p">(</span>Orange<span class="p">,</span> Tree <span class="o">==</span> <span class="m">1</span><span class="p">)</span>
<span class="c1"># Graph the data</span>
ggplot<span class="p">(</span>tree_1<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/simple_line-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible line graph using geom_line. For this simple graph, I chose to only graph the size of the first tree. I used <code>dplyr</code> to filter the dataset to only that first tree.
If you're not familiar with <code>dplyr</code>'s <code>filter</code> function, it's my preferred way of subsetting a dataset in R, and I recently wrote an in-depth guide to <a href="https://michaeltoth.me/how-to-filter-in-r-a-detailed-introduction-to-the-dplyr-filter-function.html">dplyr filter</a> if you'd like to learn more!</p>
<p>Once I had filtered out the dataset I was interested in, I then used <code>ggplot + geom_line()</code> to create the graph. Let's review this in more detail:</p>
<p>First, I call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. It's essentially a blank canvas on which we'll add our data and graphics. In this case, I passed tree_1 to <code>ggplot</code>, indicating that we'll be using the tree_1 data for this particular <code>ggplot</code> graph.</p>
<p>Next, I added my <code>geom_line</code> call to the base <code>ggplot</code> graph in order to create this line. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use age as the x-axis variable and circumference as the y-axis variable. </p>
<p>And that's it, we have our line graph!</p>
<h2>Changing line color in <code>ggplot + geom_line</code></h2>
<p>Expanding on this example, let's now experiment a bit with colors.</p>
<div class="highlight"><pre><span></span><span class="c1"># Filter the data we need</span>
tree_1 <span class="o"><-</span> filter<span class="p">(</span>Orange<span class="p">,</span> Tree <span class="o">==</span> <span class="m">1</span><span class="p">)</span>
<span class="c1"># Graph the data</span>
ggplot<span class="p">(</span>tree_1<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">),</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this geom_line call is identical to the one before, except that we've added the modifier <code>color = 'red'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
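<p>For example, both of the following set the same static color; <code>'#4682B4'</code> is the hex code for the named color steelblue, which I'm using here purely for illustration:</p>
<div class="highlight"><pre><span></span># Named color
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), color = 'steelblue')

# Equivalent hex color code
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), color = '#4682B4')
</pre></div>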
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result.</p>
<div class="highlight"><pre><span></span><span class="c1"># Graph different data</span>
ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> color <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/color_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This line graph is quite different from the one we produced above, but we only made a few minor modifications to the code! Did you catch the 3 changes? They were:</p>
<ol>
<li>The dataset changed from tree_1 (our filtered dataset) to the complete Orange dataset</li>
<li>Instead of specifying <code>color = 'red'</code>, we specified <code>color = Tree</code></li>
<li>We moved the color parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Let's review each of these changes:</p>
<h5>Moving from tree_1 to Orange</h5>
<p>This change is relatively straightforward. Instead of only graphing the data for a single tree, we wanted to graph the data for all 5 trees. We accomplish this by changing our input dataset in the <code>ggplot()</code> call. </p>
<h5>Specifying <code>color = Tree</code> and moving it within the <code>aes()</code> parentheses</h5>
<p>I'm combining these because these two changes work together. </p>
<p>Before, we told <code>ggplot</code> to change the color of the line to red by adding <code>color = 'red'</code> to our <code>geom_line()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our line, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>Tree</code> column to the <code>color</code> aesthetic. </p>
<p>Effectively, we're telling <code>ggplot</code> to use a different color for each tree in our data! This mapping also lets <code>ggplot</code> know that it also needs to create a legend to identify the trees, and it places it there automatically!</p>
<h2>Changing linetype in <code>ggplot + geom_line</code></h2>
<p>Let's look at a related example. This time, instead of changing the color of the line graph, we will change the linetype:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> linetype <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/linetype_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This <code>ggplot + geom_line()</code> call is identical to the one we just reviewed, except we've substituted <code>linetype</code> for <code>color</code>. The graph produced is quite similar, but it uses different linetypes instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.</p>
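<p>The same static-versus-mapped distinction we saw with color applies to linetype. To give every line a single fixed linetype, move the argument outside of <code>aes()</code>; this sketch reuses the <code>tree_1</code> data frame filtered earlier:</p>
<div class="highlight"><pre><span></span># A static linetype for the whole line, not mapped to any variable
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), linetype = 'dashed')
</pre></div>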
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We just saw how we can create graphs in <code>ggplot</code> that map the Tree variable to color or linetype in a line graph. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they encompass everything you see within the <code>aes()</code> in ggplot.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>This might all sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_line.</p>
<h5>Reviewing the list of geom_line aesthetic mappings</h5>
<p>The main aesthetic mappings for <code>ggplot + geom_line()</code> include: </p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>color</code>: Map a variable to a line color</li>
<li><code>linetype</code>: Map a variable to a linetype</li>
<li><code>group</code>: Map a variable to a group (each group on a separate line)</li>
<li><code>size</code>: Map a variable to a line size</li>
<li><code>alpha</code>: Map a variable to a line transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code>, <code>y</code>, <code>color</code>, and <code>linetype</code> aesthetic mappings. </p>
<p><code>x</code> and <code>y</code> are what we used in our first <code>ggplot + geom_line()</code> function call to map the variables age and circumference to x-axis and y-axis values. Then, we experimented with using <code>color</code> and <code>linetype</code> to map the Tree variable to different colored lines or linetypes. </p>
<p>In addition to those, there are 3 other main aesthetic mappings often used with <code>geom_line</code>. </p>
<p>The <code>group</code> mapping allows us to map a variable to different groups. Within <code>geom_line</code>, that means mapping a variable to different lines. Think of it as a pared down version of the <code>color</code> and <code>linetype</code> aesthetic mappings you already saw. While the <code>color</code> aesthetic mapped each Tree to a different line with a different color, the <code>group</code> aesthetic maps each Tree to a different line, but does not differentiate the lines by color or anything else. Let's take a look:</p>
<h5>Changing the <code>group</code> aesthetic mapping in <code>ggplot + geom_line</code></h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> group <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/group_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that the 5 lines are separated as before, but the lines are all black and there is no legend differentiating them. Depending on the data you're working with, this may or may not be appropriate. It's up to <em>you</em> as the person familiar with the data to determine how best to represent it in graph form!</p>
<p>In our Orange tree dataset, if you're interested in investigating how <em>specific</em> orange trees grew over time, you'd want to use the <code>color</code> or <code>linetype</code> aesthetics to make sure you can track the progress for specific trees. If, instead, you're interested in only how orange trees <em>in general</em> grow, then using the <code>group</code> aesthetic is appropriate, simplifying your graph and discarding unnecessary detail.</p>
<p><code>ggplot</code> is both flexible and powerful, but it's up to <em>you</em> to design a graph that communicates what you want to show. Just because you <em>can</em> do something doesn't mean you <em>should</em>. You should always think about what message you're trying to convey with a graph, then design from those principles. </p>
<p>Keep this in mind as we review the next two aesthetics. While these aesthetics absolutely have a place in data visualization, in the case of the particular dataset we're working with, they don't make very much sense. But this is a guide to using <code>geom_line</code> in <code>ggplot</code>, not graphing the growth of Orange trees, so I'm still going to cover them for the sake of completeness!</p>
<h5>Changing transparency in <code>ggplot + geom_line</code> with the <code>alpha</code> aesthetic</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> alpha <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we map the <code>Tree</code> variable to the <code>alpha</code> aesthetic, which controls the transparency of the line. As you can see, certain lines are more transparent than others. In this case, transparency does not add to our understanding of the graph, so I would not use this to illustrate this dataset.</p>
<h5>Changing the <code>size</code> aesthetic mapping in <code>ggplot + geom_line</code></h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> size <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/size_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Finally, we turn to the <code>size</code> aesthetic, which controls the thickness of lines. Again, I would say this does not add to our understanding of the data in this context. That said, it does slightly resemble <a href="https://en.wikipedia.org/wiki/Charles_Joseph_Minard">Charles Joseph Minard's</a> famous graph of the death tolls of Napoleon's disastrous 1812 Russia Campaign, so that's kind of cool:</p>
<p><center>
<img alt="Minard's graph of Napoleon's 1812 Russia Campaign" src="../images/20190417_ggplot_geom_line/Minard.png" width="600px" />
</center></p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>Before, we saw that we are able to use <code>color</code> in two different ways with <code>geom_line</code>. First, we were able to set the color of a line to red by specifying <code>color = 'red'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>Tree</code> to color by specifying <code>color = Tree</code> <em>inside</em> of our <code>aes()</code> mappings. How does this work with all of the other aesthetics you just learned about?</p>
<p>Essentially, they all work the same as color! That's the beautiful thing about graphing in <code>ggplot</code>--once you understand the syntax, it's very easy to expand your capabilities. </p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with color when we set the line to red with <code>color = 'red'</code> before. Now let's look at an example of how to do this with linetype in the same manner:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> group <span class="o">=</span> Tree<span class="p">),</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/linetype_parameter-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>To review what values <code>linetype</code>, <code>size</code>, and <code>alpha</code> accept, just run <code>?linetype</code>, <code>?size</code>, or <code>?alpha</code> from your console window!</p>
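<p>These parameters can also be combined in a single call. Here's a quick sketch of what that looks like (the specific values below are my own illustrative choices, not taken from the examples above):</p>

```r
library(ggplot2)

# All three parameters sit outside aes(), so they apply uniformly:
# every tree's line is dark green, slightly thicker, and semi-transparent
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, group = Tree),
            color = 'darkgreen', size = 1.2, alpha = 0.5)
```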
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was getting started with R and ggplot, the distinction between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>) and parameters (the ones <em>outside</em> your <code>aes()</code>) was the concept that tripped me up the most. You'll learn how to deal with these issues over time, but I can help you speed along that process by pointing out a few common errors to keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>Tree</code> variable to linetype, you should include <code>linetype = Tree</code> within the <code>aes()</code> of your <code>geom_line</code> call. What happens if you accidentally include it outside instead, and run <code>ggplot(Orange) + geom_line(aes(x = age, y = circumference), linetype = Tree)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_line error message" src="../images/20190417_ggplot_geom_line/error_1.png" width="600px" />
</center></p>
<p>Whenever you see this error about an object not being found, double-check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
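<p>For comparison, here is the working version of that call, with the mapping moved inside <code>aes()</code> (note the capital <code>T</code> in <code>Tree</code>, matching the column name in the dataset):</p>

```r
library(ggplot2)

# linetype = Tree inside aes() maps each tree to its own line pattern
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, linetype = Tree))
```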
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>Alternatively, if we try to specify a specific parameter value (for example, <code>color = 'red'</code>) inside of the <code>aes()</code> mapping, we get a less intuitive issue:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a line graph (success!), but it doesn't have the result we intended. The graph it produces looks odd, because it is putting the values for all 5 trees on a single line, rather than on 5 separate lines like we had before. It did change the color to red, but it also included a legend that simply says 'red'. When you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
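<p>To get the result we actually intended, five separate red lines with no spurious legend, keep the <code>group</code> mapping inside <code>aes()</code> and move the color outside:</p>

```r
library(ggplot2)

# group = Tree inside aes() keeps the five lines separate;
# color = 'red' outside aes() colors them all red, with no legend
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, group = Tree), color = 'red')
```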
<p>You should now have a solid understanding of how to use R to plot line graphs using <code>ggplot</code> and <code>geom_line</code>! Experiment with the things you've learned to solidify your understanding. As an exercise, try producing a line graph of your own using a different dataset and at least one of the aesthetic mappings you learned about. Leave your graph in the comments or email it to me at mt.toth@gmail.com -- I'd love to take a look at what you produce!</p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>R Programmers Earn More than Python Programmers2019-04-15T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-15:r-programmers-earn-more-than-python-programmers.html<p>At least globally, that is. According to the <a href="https://insights.stackoverflow.com/survey/2019">2019 Stack Overflow Developer Survey</a>, R users globally reported earning an average of $64k per year, $1k more than the $63k reported by Python developers. In the United States, that situation reverses, with Python programmers earning $116k and R programmers $108k. </p>
<h6>Global Average Salaries by Technology</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Global_Average_Salaries_By_Tech.png" width="300"></p></p>
<h6>United States Average Salaries by Technology</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/US_Average_Salaries_By_Tech.png" width="300"></p></p>
<h2>Highlights of The Stack Overflow Developer Survey</h2>
<p>Since 2011, Stack Overflow has been surveying their users each year to answer questions about the technologies they use, their work experience, their compensation, and their satisfaction at work. Given Stack Overflow's place in the broader programming world, they are able to draw quite the audience for their annual surveys. </p>
<p>This year, nearly 90,000 developers participated in the survey! There's a lot in this survey, and I recommend reviewing it yourself, but I wanted to surface some of the key findings that I thought were particularly relevant to data professionals here.</p>
<p>Stack Overflow says they will be releasing the underlying data for this survey in the coming weeks, so I hope to return to this for a deeper analysis once that's made available. For now, let's get into the results!</p>
<h6>Developer Roles</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Global_Developer_Roles.png" width="350"></p></p>
<p>People with all different types of coding backgrounds use Stack Overflow. While most of them identify as developers (just over half, 51.9%, globally identify as full-stack developers), there are also a significant number of data professionals on the list (a term I've just invented to include the categories database administrator, data scientist, data analyst, and data engineer).</p>
<p>Globally, 11.7% of Stack Overflow users surveyed identified as database administrators, 7.9% as data scientists or machine learning specialists, 7.7% as data analysts / business analysts, and 7.2% as data engineers. These figures were slightly higher for US-based respondents to the survey.</p>
<p>Note that these figures are not mutually exclusive, as people were able to select multiple options in the survey.</p>
<h6>United States Average Salaries by Job</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/US_Average_Salaries_By_Job.png" width="350"></p></p>
<p>In the United States, data scientists/machine learning engineers reported an average salary of $120k. Data engineers also reported an average salary of $120k. Database administrators reported $105k, while data analysts / business analysts reported an average salary of $100k.</p>
<h6>Years of Experience</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Years_Of_Experience.png" width="350"></p></p>
<p>In the survey, data and business analysts reported an average of 9.3 years of professional coding experience, while data scientists reported an average of 7.8 years of experience. I thought this was particularly interesting in light of the fact that average salaries for data scientists were reported to be $20k higher than those of data analysts. </p>
<h6>Percent Looking for Jobs</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Looking_For_Jobs.png" width="350"></p></p>
<p>Both data scientists and data analysts ranked near the top of all groups surveyed in the share actively looking for jobs, with 18.6% of data scientists and 17.9% of data analysts looking.</p>
<h6>Programming, Scripting, and Markup Languages Used</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Tools_Used.png" width="350"></p></p>
<p>SQL was by far the most popular data technology, at 54.4%. This is to be expected given its broad use in business applications and its long history. Python clocked in at 41.7%, though as a general-purpose programming language, many of its users are likely not using it for data work in particular. R came in at 5.8% usage in the survey.</p>
<h6>Undergraduate Major</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Training.png" width="400"></p></p>
<p>While not directly related to analytics, I thought the survey question on undergraduate major was particularly interesting. A massive 62.4% of those surveyed reported studying computer science, computer engineering, or software engineering in undergrad. I myself studied finance and statistics, which came in at 2.4% and 3.9% respectively, quite low on the scale here. I wonder how much these figures would differ for data scientists, where fields like economics, statistics, and business seem to be better represented. I hope that when Stack Overflow releases their data for this survey I'll be able to analyze this question more deeply! </p>
<p>I encourage you to <a href="https://insights.stackoverflow.com/survey/2019">check out the survey</a> for a bunch of other useful information. There's more information on database usage and machine learning frameworks, as well as general information on work environments, developer satisfaction, and much more!</p>
<hr />
<p>Did you find this post interesting? I frequently write about topics in the fields of analytics and data science to help keep you up to date on developments in the industry. If you want to be notified of new posts, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>Announcement: Register by Friday for Free R Training Sessions2019-04-10T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-10:announcement-register-by-friday-for-free-r-training-sessions.html<p>I run this blog because I want to help you learn to be a better R user! But in order to do that, I need to know more about where you are in your journey and the kinds of problems you're facing. </p>
<p>So, I'm excited to announce that over the next two weeks I am going to be offering <strong>free one-on-one 45-minute R training sessions to 10 people!</strong> You come to me with a problem, and we'll work together over a screen sharing session to solve it using R!</p>
<p>Would it be helpful to spend 45 minutes with me and receive personalized advice on solving your R problems? Then please <a href="https://michaeltoth.me/pages/contact-me.html">contact me here</a> and give me some details about the project you'd like me to help with! </p>
<p>Make sure to submit your request by noon EST on Friday, April 12. I'll get back to you for scheduling early next week if you're one of the 10 selected!</p>
<p>I want to leave this relatively open to different topics, but this should give you some ideas of things I will be able to help you with:</p>
<ul>
<li>Loading and Cleaning Data in R using readr, dplyr, and other tidyverse tools</li>
<li>Creating graphs in R with ggplot</li>
<li>Animating graphs in R using gganimate</li>
<li>Creating reports in R using RMarkdown and knitr</li>
<li>Creating your own R package</li>
<li>Creating maps in R</li>
</ul>
<p>I look forward to hearing from you!</p>The Ultimate Opinionated Guide to Base R Date Format Functions2019-04-09T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-09:the-ultimate-opinionated-guide-to-base-r-date-format-functions.html<p>When I was first learning R, working with dates was one of the hardest and most time consuming tasks I dealt with. There are so many things to learn! What do I do with <code>as.POSIXct()</code>, <code>as.POSIXlt()</code>, <code>strftime()</code>, <code>strptime()</code>, <code>format()</code>, and <code>as.Date()</code>? R date formats were confusing, and it seemed no matter what I did I was always running into issues. </p>
<p>And I'll be honest, working with dates in R still trips me up from time to time. It can be confusing. But I've learned to follow a procedure to guide me through any date manipulation task with ease.</p>
<p>Today I'm going to help you learn that same procedure so you'll never have to worry about R date format issues again!</p>
<h2>The goals of most R date format exercises</h2>
<p>When working with R date formats, you're generally going to be trying to accomplish one of two different but related goals:</p>
<ol>
<li>Converting a character string like "<code>Jan 30 1989</code>" to a Date type</li>
<li>Getting an R Date object to print in a specific format for a graph or other output</li>
</ol>
<p>You may need to handle both of these goals in the same analysis, but it's best to think of them as two separate exercises. Knowing which goal you are trying to accomplish is important because you will need to use different functions to accomplish each of these. Let's tackle them one at a time.</p>
<h2>Converting a character string to a date</h2>
<p>A common scenario is that you have read in a .csv file where one of the columns contains dates. Often you will find this column is read in as a <code>character</code> vector. </p>
<p>If you don't need to use this column for anything, this might not be a problem. </p>
<p>But, often, you will need to do things with it! You'll want to sort a graph by date, or calculate the number of days between two dates, or format your dates in a specific way. </p>
<p>To accomplish any of those things, you'll first need to convert your <code>character</code> vector to a <code>Date</code> vector.</p>
<p>R has 3 main object types for working with dates: <code>Date</code>, <code>POSIXct</code>, and <code>POSIXlt</code>. <code>Date</code> objects can only work with dates, while <code>POSIXct</code> and <code>POSIXlt</code> objects can work with both dates and times. </p>
<p>Before you do any conversion, you need to first decide whether you want to keep any time data (if available) or if you only are working with dates. </p>
<p>If you're only working with dates, you'll want the <code>as.Date()</code> function which produces objects of type <code>Date</code>. </p>
<p>If you want both dates and times, you'll want the <code>as.POSIXct()</code> function which produces objects of type <code>POSIXct</code>. </p>
<p>Luckily, you'll find that these functions operate very similarly to one another, so you won't need to worry about memorizing little idiosyncrasies between them!</p>
<p>Side note: In this post, I'm going to ignore the <code>POSIXlt</code> type which is very similar to <code>POSIXct</code> with some implementation differences beyond the scope of this post.</p>
<h4>Converting a character string to a date using the as.Date() R function</h4>
<p>The main function for converting from a <code>character</code> string to a <code>Date</code> (<em>without</em> time information) is the <code>as.Date()</code> function. <code>as.Date()</code> accepts a date vector and a format specification. The format specification identifies what date information is contained in the character string you are providing. Let's look at an example:</p>
<div class="highlight"><pre><span></span>my_date <span class="o"><-</span> <span class="s">"01/30/1989"</span> <span class="c1"># Input character string</span>
<span class="kp">as.Date</span><span class="p">(</span>my_date<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<p>Our input character string was in the format month/day/year, and we used the R format specification that corresponds to this, <code>%m/%d/%Y</code>, to convert this character string to a date.</p>
<h4>Converting a character string to a POSIXct datetime using the <code>as.POSIXct()</code> R function</h4>
<p>The main function for converting from a <code>character</code> string to a <code>POSIXct</code> datetime object (<em>with</em> time information) is the <code>as.POSIXct()</code> function. Just like <code>as.Date()</code>, <code>as.POSIXct()</code> accepts a date vector and a format specification, which identifies what date and time information is contained in the character string you are providing. Here's an example:</p>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"01/30/1989 23:40:00"</span> <span class="c1"># Input character string with time information</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y %H:%M:%S"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>Here we added the time string <code>23:40:00</code> on top of the same date we processed previously. As before, we used the R format specification for this string, <code>%m/%d/%Y %H:%M:%S</code>, to convert to a datetime <code>POSIXct</code> object.</p>
<h4>Discarding unnecessary time data</h4>
<p>We'll get more into how these format specifications work in a minute, but first I want to make a quick aside. Sometimes your dates will contain time information, but you won't actually need that for your analysis. This can sometimes be annoying to keep around, and it's often cleaner if you just get rid of it. Luckily, this is quite easy. Instead of using the <code>as.POSIXct()</code> function, we can simply use the <code>as.Date()</code> function and ignore the trailing timestamp information, as follows:</p>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"01/30/1989 23:40:00"</span>
<span class="kp">as.Date</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<p>This will convert the character strings to <code>Date</code> objects, dropping the extraneous time data from our dataset. Remember: if you need time data, use <code>as.POSIXct()</code>, but if you don't, just use <code>as.Date()</code>!</p>
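<p>The same trick works if your data is already stored as a <code>POSIXct</code> object rather than a character string: <code>as.Date()</code> accepts a <code>POSIXct</code> input directly and drops the time of day. One caution, which is my own note rather than something covered above: the timezone assumed for this conversion has varied across R versions, so passing <code>tz</code> explicitly is the safer habit.</p>

```r
my_date_time <- as.POSIXct("01/30/1989 23:40:00",
                           format = "%m/%d/%Y %H:%M:%S",
                           tz = "America/New_York")

# Drop the time component; tz controls which calendar day you land on
as.Date(my_date_time, tz = "America/New_York")
## [1] "1989-01-30"
```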
<h4>More on R date format specifications</h4>
<p>Above, we reviewed examples of how <code>as.Date()</code> and <code>as.POSIXct()</code> can convert character strings to dates, given the right <em>format specification</em>. Now, it's time to review what the different format specifications are, and how we can use them to convert character strings formatted in all different ways to dates.</p>
<p>We've already briefly seen the symbols <code>%m</code>, <code>%d</code>, <code>%Y</code>, <code>%H</code>, <code>%M</code>, and <code>%S</code>, and you probably have some idea that these correspond to month, day, year, hour, minute, and second. I'd now like to introduce the list of the most commonly used R date format specifications:</p>
<p><center>
<img alt="Note: This table includes most commonly used R date formats, but is not exhaustive. For a complete list, run ?strptime in the R console." src="../images/common_r_date_formats.png" width="400px" />
</center> </p>
<p>Look through this table to identify the different date formats we worked through previously. Pay special attention to the difference between <code>%b</code>, <code>%B</code>, and <code>%m</code>, as well as <code>%y</code> and <code>%Y</code>! </p>
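<p>To see why those distinctions matter, here are three character strings for the same date, each needing a different month code (a quick sketch, assuming an English locale for the month names):</p>

```r
# Three ways of writing the same month, each needing a different code
as.Date("Jan 30 1989", format = "%b %d %Y")      # %b: abbreviated month name
as.Date("January 30 1989", format = "%B %d %Y")  # %B: full month name
as.Date("01/30/89", format = "%m/%d/%y")         # %m: numeric month; %y: 2-digit year

## All three return the same Date: "1989-01-30"
```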
<h3>Standard procedure for converting a character string to date or datetime object</h3>
<ol>
<li>Identify the key variables we need to map<ul>
<li>Generally month, day, and year for dates</li>
<li>Add in hour, minute, and second for times</li>
</ul>
</li>
<li>For each key variable, identify the appropriate mapping</li>
<li>Construct the format specification string</li>
<li>Construct the <code>as.Date()</code> or <code>as.POSIXct()</code> function call</li>
</ol>
<p>Let's work through a few examples!</p>
<p>Say we have a string in the format <code>Jan 30th, 1989 23:40</code>. </p>
<ol>
<li>The variables we need to map are month, day, year, hour, and minute</li>
<li>Let's find the appropriate mappings:<ul>
<li>This string uses an abbreviated month, which maps to <code>%b</code></li>
<li>Day of the month maps to <code>%d</code></li>
<li>4-digit years map to <code>%Y</code></li>
<li>24-hour-clock hour maps to <code>%H</code></li>
<li>Minutes map to <code>%M</code></li>
</ul>
</li>
<li>The format specification string should exactly match the input string, simply substituting in our mappings from above. In this case: <code>Jan 30th, 1989 23:40</code> becomes <code>%b %dth, %Y %H:%M</code></li>
<li>Because we're dealing with both dates and times here, we know we're going to need <code>as.POSIXct()</code> if we want to maintain that time data. Let's give this a try:</li>
</ol>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"Jan 30th, 1989 23:40"</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%b %dth, %Y %H:%M"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>Awesome! It processed the data exactly as we needed it. Let's go through one more exercise to make sure we have this down before we move on to date formatting for output.</p>
<p>In this case, say we have the string <code>30 January 1989 11:40 PM</code>.</p>
<ol>
<li>The variables we need to map are day, month, year, hour, minute, and AM/PM</li>
<li>Let's again find the appropriate mappings:<ul>
<li>Day of the month maps to <code>%d</code></li>
<li>This string uses a full month, which maps to <code>%B</code></li>
<li>4-digit years map to <code>%Y</code></li>
<li>12-hour-clock hour maps to <code>%I</code></li>
<li>Minutes map to <code>%M</code></li>
<li>AM/PM indicator maps to <code>%p</code></li>
</ul>
</li>
<li>Substituting, <code>30 January 1989 11:40 PM</code> becomes <code>%d %B %Y %I:%M %p</code></li>
<li>Again, because we're dealing with both dates and times, we need <code>as.POSIXct()</code> to maintain the time data</li>
</ol>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">'30 January 1989 11:40 PM'</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%d %B %Y %I:%M %p"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>There we have it! You should now be equipped to take a given character string, determine the format of that string, and then use either the <code>as.Date()</code> R function (in the case of only dates) or the <code>as.POSIXct()</code> R function (in the case of dates and times) to convert that character string to a date or datetime representation.</p>
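<p>One practical note before moving on (my own addition to the procedure): if your format specification doesn't exactly match the input string, these functions silently return <code>NA</code> rather than raising an error, so it's worth checking your conversions:</p>

```r
# Wrong separators in the format string: the result is NA, with no warning
as.Date("01/30/1989", format = "%m-%d-%Y")
## [1] NA

# A quick sanity check after converting a whole vector
dates <- as.Date(c("01/30/1989", "02/28/1990"), format = "%m/%d/%Y")
any(is.na(dates))
## [1] FALSE
```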
<p>Now, let's turn to our second challenge: formatting an R date or datetime object for output, cleaning things up to be publication-ready. Luckily, you'll find there are many similarities in the approach to what you've just learned!</p>
<h2>Formatting a date for publication-ready output</h2>
<p>In this case, you have an R object that is already stored as one of several R date formats (<code>Date</code>, <code>POSIXct</code>, or <code>POSIXlt</code>), and now you'd like to clean up that date for graphing or publication. I find that I perform this type of transformation most often when I'm making graphs, but this is useful for creating RMarkdown reports and other output as well.</p>
<p>While earlier we used the <code>as.POSIXct()</code> and <code>as.Date()</code> R functions to convert from characters to dates, we'll now be using the <code>format()</code> R function to convert from dates to characters! </p>
<p>As before, we'll want to decide what information is important for mapping, select the appropriate format specification, and then build our function call. Luckily, this process looks very similar to what we did before, we're just working in reverse! Using the same table from above, we can find the variables we need to map our date to a specific format. Review the examples below to see how we convert a date or datetime variable to different character formats for output!</p>
<div class="highlight"><pre><span></span>my_date <span class="o"><-</span> <span class="kp">as.Date</span><span class="p">(</span><span class="s">"01/30/1989"</span><span class="p">,</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
my_date <span class="c1"># Unformatted date</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%B %d %Y'</span><span class="p">)</span> <span class="c1"># Date format 'January 30 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "January 30 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%B %dth, %Y'</span><span class="p">)</span> <span class="c1"># Date format 'January 30th, 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "January 30th, 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%d %b %Y'</span><span class="p">)</span> <span class="c1"># Date format '30 Jan 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "30 Jan 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%A %B %d %Y'</span><span class="p">)</span> <span class="c1"># Date format 'Monday January 30 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "Monday January 30 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%m/%d/%y'</span><span class="p">)</span> <span class="c1"># Date format '01/30/89'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "01/30/89"
</pre></div>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="kp">Sys.time</span><span class="p">()</span> <span class="c1"># Function that generates the current time</span>
my_date_time <span class="c1"># Unformatted datetime</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "2019-04-09 13:28:12 EDT"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%B %d %Y'</span><span class="p">)</span> <span class="c1"># Datetime format 'April 09 2019' (Discard time)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "April 09 2019"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%B %d %Y %H:%M'</span><span class="p">)</span> <span class="c1"># Datetime format 'April 09 2019 13:28'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "April 09 2019 13:28"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%H:%M on %B %d %Y'</span><span class="p">)</span> <span class="c1"># Datetime format '13:28 on April 09 2019'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "13:28 on April 09 2019"
</pre></div>
<p>There we have it! You should now be able to easily perform nearly any date operation you need in R. You can take character strings and convert them to dates using the <code>as.POSIXct()</code> and <code>as.Date()</code> R functions. You can also take date or datetime objects and use the <code>format()</code> function to clean them up for publication-ready graphs and papers! </p>
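<p>To recap with a minimal, self-contained example (base R only; note that month names produced by <code>%B</code> depend on your locale settings):</p>
<div class="highlight"><pre><span></span># Parse a character string into a Date, then format it for display
d <- as.Date('30/01/1989', format = '%d/%m/%Y')
format(d, '%B %d, %Y') # "January 30, 1989" in an English locale
</pre></div>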
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Filter in R: A Detailed Introduction to the dplyr Filter Function2019-04-08T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-08:how-to-filter-in-r-a-detailed-introduction-to-the-dplyr-filter-function.html<p>Data wrangling. It's the process of getting your raw data transformed into a format that's easier to work with for analysis. </p>
<p>It's not the sexiest or the most exciting work. </p>
<p>In our dreams, all datasets come to us perfectly formatted and ready for all kinds of sophisticated analysis! In real life, not so much. </p>
<p>It's estimated that as much as 75% of a data scientist's time is spent data wrangling. To be an effective data scientist, you need to be good at this, and you need to be FAST.</p>
<p>One of the most basic data wrangling tasks is filtering data: starting from a large dataset and reducing it to a smaller, more manageable dataset based on some criteria. </p>
<p>Think of filtering your sock drawer by color, and pulling out only the black socks. </p>
<p>Whenever I need to filter in R, I turn to the <code>dplyr filter</code> function.</p>
<p>As is often the case in programming, there are many ways to filter in R. But the <code>dplyr filter</code> function is by far my favorite, and it's the method I use the vast majority of the time.</p>
<p>Why do I like it so much? It has a user-friendly syntax, is easy to work with, and it plays very nicely with the other dplyr functions.</p>
<h2>A brief introduction to dplyr</h2>
<p>Before I go into detail on the <code>dplyr filter</code> function, I want to briefly introduce dplyr as a whole to give you some context. </p>
<p>dplyr is a cohesive set of data manipulation functions that will help make your data wrangling as painless as possible. </p>
<p>dplyr, at its core, consists of 5 functions, all serving a distinct data wrangling purpose:</p>
<ul>
<li><code>filter()</code> selects rows based on their values</li>
<li><code>mutate()</code> creates new variables</li>
<li><code>select()</code> picks columns by name</li>
<li><code>summarise()</code> calculates summary statistics</li>
<li><code>arrange()</code> sorts the rows</li>
</ul>
<p>The beauty of dplyr is that the syntax of all of these functions is very similar, and they all work together nicely. </p>
<p>If you master these 5 functions, you'll be able to handle nearly any data wrangling task that comes your way. But we need to tackle them one at a time, so now: let's learn to filter in R using dplyr!</p>
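<p>To give you a feel for how the five verbs fit together, here's an illustrative pipeline (a sketch only; this post covers just <code>filter()</code>, and the <code>price_per_carat</code> variable is invented for the example):</p>
<div class="highlight"><pre><span></span>library(dplyr)
library(ggplot2) # provides the diamonds dataset used below

diamonds %>%
  filter(carat > 1) %>%                        # keep rows by value
  mutate(price_per_carat = price / carat) %>%  # create a new variable
  select(cut, price_per_carat) %>%             # pick columns by name
  arrange(desc(price_per_carat)) %>%           # sort the rows
  summarise(avg_ppc = mean(price_per_carat))   # calculate a summary statistic
</pre></div>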
<h2>Loading Our Data</h2>
<p>In this post, I'll be using the <code>diamonds</code> dataset, a dataset built into the ggplot2 package, to illustrate the best use of the <code>dplyr filter</code> function. To start, let's take a look at the data:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
diamonds
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
</pre></div>
<p>We can see that the dataset gives characteristics of individual diamonds, including their carat, cut, color, clarity, and price. </p>
<h2>Our First dplyr Filter Operation</h2>
<p>I'm a big fan of learning by doing, so we're going to dive in right now with our first <code>dplyr filter</code> operation.</p>
<p>From our <code>diamonds</code> dataset, we're going to filter only those rows where the diamond cut is 'Ideal':</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 21,551 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
## # … with 21,541 more rows
</pre></div>
<p>As you can see, every diamond in the returned data frame is showing a cut of 'Ideal'. It worked! We'll cover exactly what's happening here in more detail, but first let's briefly review how R works with logical and relational operators, and how we can use those to efficiently filter in R.</p>
<h2>A brief aside on logical and relational operators in R and dplyr</h2>
<p>In dplyr, filter takes in 2 arguments: </p>
<ul>
<li>The dataframe you are operating on </li>
<li>A conditional expression that evaluates to <code>TRUE</code> or <code>FALSE</code> </li>
</ul>
<p>In the example above, we specified <code>diamonds</code> as the dataframe, and <code>cut == 'Ideal'</code> as the conditional expression.</p>
<p>Conditional expression? What am I talking about?</p>
<p>Under the hood, <code>dplyr filter</code> works by testing each row against your conditional expression and mapping the results to <code>TRUE</code> and <code>FALSE</code>. It then selects all rows that evaluate to <code>TRUE</code>. </p>
<p>In our first example above, we checked that the diamond cut was Ideal with the conditional expression <code>cut == 'Ideal'</code>. For each row in our data frame, dplyr checked whether the column <code>cut</code> was set to <code>'Ideal'</code>, and returned only those rows where <code>cut == 'Ideal'</code> evaluated to <code>TRUE</code>. </p>
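<p>You can inspect that logical vector yourself. Conceptually (though not literally how dplyr is implemented internally), the filter above is equivalent to base R row selection:</p>
<div class="highlight"><pre><span></span>matches <- diamonds$cut == 'Ideal' # logical vector: one TRUE/FALSE per row
sum(matches)                       # 21551 rows evaluate to TRUE
diamonds[matches, ]                # base R equivalent of filter(diamonds, cut == 'Ideal')
</pre></div>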
<p>In our first filter, we used the operator <code>==</code> to test for equality. That's not the only way we can use dplyr to filter our data frame, however. We can use a number of different <strong>relational operators</strong> to filter in R.</p>
<p><strong>Relational operators</strong> are used to compare values. In R generally (and in dplyr specifically), those are:</p>
<ul>
<li><code>==</code> (Equal to)</li>
<li><code>!=</code> (Not equal to)</li>
<li><code><</code> (Less than)</li>
<li><code><=</code> (Less than or equal to)</li>
<li><code>></code> (Greater than)</li>
<li><code>>=</code> (Greater than or equal to)</li>
</ul>
<p>These are standard mathematical operators you're used to, and they work as you'd expect. One quick note: make sure you use the double equals sign (<code>==</code>) for comparisons! By convention, a single equals sign (<code>=</code>) is used to assign a value to a variable, and a double equals sign (<code>==</code>) is used to check whether two values are equal. Using a single equals sign will often give an error message that is not intuitive, so make sure you check for this common error!</p>
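<p>Here's what that common mistake looks like in practice (the first line is commented out because it errors; the exact message varies by dplyr version, though recent versions helpfully suggest <code>==</code>):</p>
<div class="highlight"><pre><span></span># filter(diamonds, cut = 'Ideal')  # error! '=' is treated as a named argument
filter(diamonds, cut == 'Ideal')   # correct: '==' tests each row for equality
</pre></div>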
<p>dplyr can also make use of the following <strong>logical operators</strong> to string together multiple different conditions in a single <code>dplyr filter</code> call!</p>
<ul>
<li><code>!</code> (logical NOT)</li>
<li><code>&</code> (logical AND)</li>
<li><code>|</code> (logical OR)</li>
</ul>
<p>There are two additional operators that will often be useful when working with dplyr to filter:</p>
<ul>
<li><code>%in%</code> (Checks if a value is in an array of multiple values)</li>
<li><code>is.na()</code> (Checks whether a value is NA)</li>
</ul>
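<p>The <code>diamonds</code> dataset has no missing values, so here's a sketch of how <code>is.na()</code> is typically used in a filter, with a hypothetical data frame <code>df</code> and column <code>x</code>:</p>
<div class="highlight"><pre><span></span>filter(df, is.na(x))  # keep only rows where x is missing
filter(df, !is.na(x)) # more common: drop rows where x is missing
</pre></div>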
<p>In our first example above, we tested for equality when we said <code>cut == 'Ideal'</code>. Now, let's expand our capabilities with different relational operators in our filter:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> price <span class="o">></span> <span class="m">2000</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 29,733 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
## 7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4
## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
## # … with 29,723 more rows
</pre></div>
<p>Here, we select only the diamonds where the price is greater than 2000.</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">!=</span> <span class="s">'Ideal'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 32,389 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 32,379 more rows
</pre></div>
<p>And here, we select all the diamonds whose cut is NOT equal to 'Ideal'. Note that this is the exact opposite of what we filtered before. </p>
<p>You can use <code><</code>, <code>></code>, <code><=</code>, <code>>=</code>, <code>==</code>, and <code>!=</code> in similar ways to filter your data. Try a few examples on your own to get comfortable with the different filtering options!</p>
<h2>A note on storing your results</h2>
<p>By default, dplyr filter will perform the operation you ask and then print the result to the screen. If you prefer to store the result in a variable, you'll need to assign it as follows:</p>
<div class="highlight"><pre><span></span>e_diamonds <span class="o"><-</span> filter<span class="p">(</span>diamonds<span class="p">,</span> color <span class="o">==</span> <span class="s">'E'</span><span class="p">)</span>
</pre></div>
<p>Note that you can also overwrite the dataset (that is, assign the result back to the <code>diamonds</code> data frame) if you don't want to retain the unfiltered data. In this case I want to keep it, so I'll store this result in <code>e_diamonds</code>. In any case, it's always a good idea to preview your <code>dplyr filter</code> results before you overwrite any data!</p>
<h2>Filtering Numeric Variables</h2>
<p>Numeric variables are the quantitative variables in a dataset. In the diamonds dataset, this includes the variables carat and price, among others. When working with numeric variables, it is easy to filter based on ranges of values. For example, if we wanted to get any diamonds priced between 1000 and 1500, we could easily filter as follows:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> price <span class="o">>=</span> <span class="m">1000</span> <span class="o">&</span> price <span class="o"><=</span> <span class="m">1500</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 5,511 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.38 Very Good E VVS2 61.8 56 1000 4.66 4.68 2.88
## 2 0.39 Very Good F VS1 57.1 61 1000 4.86 4.91 2.79
## 3 0.38 Very Good E VS1 61.5 58 1000 4.64 4.69 2.87
## 4 0.38 Premium E VS1 60.7 59 1000 4.65 4.7 2.84
## 5 0.38 Ideal E VS1 61.6 56 1000 4.65 4.67 2.87
## 6 0.53 Very Good G SI2 62.5 55 1000 5.14 5.19 3.23
## 7 0.570 Very Good I SI2 62.1 57 1000 5.29 5.33 3.3
## 8 0.38 Ideal E VS1 61.9 56 1000 4.63 4.67 2.88
## 9 0.5 Good E SI2 63.2 61 1000 5.02 5.05 3.18
## 10 0.3 Ideal D VVS1 61.3 57 1000 4.29 4.32 2.64
## # … with 5,501 more rows
</pre></div>
<p>In general, when working with numeric variables, you'll most often make use of the inequality operators, <code>></code>, <code><</code>, <code>>=</code>, and <code><=</code>. While it is possible to use the <code>==</code> and <code>!=</code> operators with numeric variables, I generally recommend against it. </p>
<p>The issue with using <code>==</code> is that it will only return <code>TRUE</code> if the value is exactly equal to what you're testing for. If the dataset you're testing against consists of integers, that's fine, but if you're dealing with decimals, exact comparisons will often break down. For example, <code>1.0100000001 == 1.01</code> evaluates to <code>FALSE</code>. That's technically correct behavior, but floating-point precision makes it easy to get into trouble. I never use <code>==</code> when working with numerical variables unless the data I am working with consists of integers only!</p>
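<p>If you genuinely need to test a decimal column for equality, dplyr provides <code>near()</code>, which compares within a small tolerance instead of exactly:</p>
<div class="highlight"><pre><span></span>near(0.1 + 0.2, 0.3)               # TRUE, even though 0.1 + 0.2 == 0.3 is FALSE
filter(diamonds, near(carat, 0.3)) # safer than carat == 0.3
</pre></div>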
<h2>Filtering Categorical Variables</h2>
<p>Categorical variables are non-quantitative variables. In our example dataset, the columns cut, color, and clarity are categorical variables. In contrast to numerical variables, the inequalities <code>></code>, <code><</code>, <code>>=</code> and <code><=</code> have no meaning here. Instead, you'll make frequent use of the <code>==</code>, <code>!=</code>, and <code>%in%</code> operators when filtering categorical variables. </p>
<p>Above, we filtered the dataset to include only the diamonds whose cut was Ideal using the <code>==</code> operator. Let's say that we wanted to expand this filter to also include diamonds where the cut is Premium. To accomplish this, we would use the <code>%in%</code> operator:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">%in%</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Ideal'</span><span class="p">,</span> <span class="s">'Premium'</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 35,342 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 5 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
## 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 7 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 8 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 10 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
## # … with 35,332 more rows
</pre></div>
<p>How does this work? First, we create a vector of our desired cut options, <code>c('Ideal', 'Premium')</code>. Then, we use <code>%in%</code> to keep only those diamonds whose cut is in that vector. dplyr will keep BOTH the diamonds whose cut is Ideal AND the diamonds whose cut is Premium. The vector you check against with the <code>%in%</code> operator can be arbitrarily long, which can be very useful when working with categorical data. </p>
<p>It's also important to note that the vector can be defined before you perform the dplyr filter operation:</p>
<div class="highlight"><pre><span></span>cuts_to_include <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Good'</span><span class="p">,</span> <span class="s">'Very Good'</span><span class="p">,</span> <span class="s">'Ideal'</span><span class="p">,</span> <span class="s">'Premium'</span><span class="p">)</span>
filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">%in%</span> cuts_to_include<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 52,330 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 52,320 more rows
</pre></div>
<p>This helps to increase the readability of your code when you're filtering against a larger set of potential options. This also means that if you have an existing vector of options from another source, you can use this to filter your dataset. This can come in very useful as you start working with multiple datasets in a single analysis!</p>
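<p>For instance, the vector might come from a second dataset (the <code>inventory</code> data frame here is invented for illustration):</p>
<div class="highlight"><pre><span></span># Suppose another dataset tells us which diamond colors we have in stock
in_stock_colors <- unique(inventory$color) # hypothetical data frame
filter(diamonds, color %in% in_stock_colors)
</pre></div>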
<h2>Chaining together multiple filtering operations with logical operators</h2>
<p>The real power of the dplyr filter function is in its flexibility. Using the logical operators &, |, and !, we can group many filtering operations in a single command to get the exact dataset we want!</p>
<p>Let's say we want to select all diamonds where the cut is Ideal and the carat is greater than 1:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">&</span> carat <span class="o">></span> <span class="m">1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 5,662 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.01 Ideal I I1 61.5 57 2844 6.45 6.46 3.97
## 2 1.02 Ideal H SI2 61.6 55 2856 6.49 6.43 3.98
## 3 1.02 Ideal I I1 61.7 56 2872 6.44 6.49 3.99
## 4 1.02 Ideal J SI2 60.3 54 2879 6.53 6.5 3.93
## 5 1.01 Ideal I I1 61.5 57 2896 6.46 6.45 3.97
## 6 1.02 Ideal I I1 61.7 56 2925 6.49 6.44 3.99
## 7 1.14 Ideal J SI1 60.2 57 3045 6.81 6.71 4.07
## 8 1.02 Ideal H SI2 58.8 57 3142 6.61 6.55 3.87
## 9 1.06 Ideal I SI2 62.8 55 3146 6.51 6.46 4.07
## 10 1.02 Ideal I VS2 62.8 57 3148 6.45 6.39 4.03
## # … with 5,652 more rows
</pre></div>
<p>BOTH conditions must evaluate to <code>TRUE</code> for a row to be selected. That is, the cut must be Ideal, and the carat must be greater than 1. </p>
<p>You don't need to limit yourself to two conditions either. You can have as many as you want! Let's say we also wanted to make sure the color of the diamond was E. We can extend our example:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">&</span> carat <span class="o">></span> <span class="m">1</span> <span class="o">&</span> color <span class="o">==</span> <span class="s">'E'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 531 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.25 Ideal E I1 60.9 56 3276 6.95 6.91 4.22
## 2 1.01 Ideal E I1 62 57 3388 6.37 6.41 3.96
## 3 1.01 Ideal E I1 62 57 3450 6.41 6.37 3.96
## 4 1.02 Ideal E SI2 62.3 56 3455 6.42 6.37 3.98
## 5 1.04 Ideal E SI2 59 57 3588 6.65 6.6 3.91
## 6 1.13 Ideal E I1 62 55 3729 6.66 6.7 4.14
## 7 1.09 Ideal E SI2 59.4 57 3760 6.74 6.65 3.98
## 8 1.13 Ideal E I1 62 55 3797 6.7 6.66 4.14
## 9 1.12 Ideal E SI2 60.9 57 3864 6.66 6.6 4.04
## 10 1.1 Ideal E I1 61.9 56 3872 6.59 6.63 4.09
## # … with 521 more rows
</pre></div>
<p>What if we wanted to select rows where the cut is ideal OR the carat is greater than 1? Then we'd use the | operator!</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">|</span> carat <span class="o">></span> <span class="m">1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 33,391 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
## # … with 33,381 more rows
</pre></div>
<p>Any time you want to filter your dataset based on some combination of logical statements, this is possible using the <code>dplyr filter</code> function and R's built-in logical operators. You just need to figure out how to combine your logical expressions to get exactly what you want!</p>
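<p>One pattern worth calling out is using <code>!</code> with parentheses to negate an entire grouped condition:</p>
<div class="highlight"><pre><span></span># All diamonds EXCEPT Ideal-cut diamonds over 1 carat
filter(diamonds, !(cut == 'Ideal' & carat > 1))

# Equivalent, by De Morgan's laws:
filter(diamonds, cut != 'Ideal' | carat <= 1)
</pre></div>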
<h2>Conclusion</h2>
<p><code>dplyr filter</code> is one of my most-used functions in R in general, and especially when I am looking to filter in R. With this article you should have a solid overview of how to filter a dataset, whether your variables are numerical, categorical, or a mix of both. Practice what you learned right now to make sure you cement your understanding of how to effectively filter in R using dplyr!</p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Create a Bar Chart Race in R - Mapping United States City Population 1790-20102019-04-04T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-04:how-to-create-a-bar-chart-race-in-r-mapping-united-states-city-population-1790-2010.html<p>In my corner of the internet, there's been an explosion over the last several months of a style of graph called a bar chart race. Essentially, a bar chart race shows how a ranked list of something--largest cities, most valuable companies, most-followed Youtube channels--evolves over time. Maybe you've been following this trend with the same curiosity that I have, and that's how you made your way here. Or maybe you're a normal person who doesn't even know what I'm talking about! Who knows, anything is possible. By way of introduction, here is the bar chart race I've created on the largest cities in the United States over time:</p>
<p><img alt="center" src="/figures/city_populations/create_graph-1.gif" /></p>
<h2>Motivation for this post</h2>
<p>For me, it all started with <a href="https://www.youtube.com/watch?v=BQovQUga0VE">this brand value graphic</a> that went viral back in February. For a few days, it felt like this thing was everywhere. As it spread, data visualization practitioners all started to try their hand at creating new versions of it on their own. One of my favorites was <a href="https://observablehq.com/@johnburnmurdoch/bar-chart-race-the-most-populous-cities-in-the-world">this bar chart race of world cities</a> created by John Burn-Murdoch, who, as far as I can tell, was the person who coined the term bar chart race.</p>
<p>Of course, I knew that I wanted to try to create a version of this graph in R. And for as long as I can remember, I've been obsessed with looking at population statistics for cities. I remember finding an Almanac (!) on my dad's bookshelf as a kid and memorizing the list of the largest cities in the United States. </p>
<p>Later, I remember reading that Detroit had at one time been among the largest cities in the country. From 1916 to 1944, Detroit was the nation's fourth largest city. Its population peaked at 1.85 million in 1950. Today its population is estimated to be 673,000. The history of Detroit's population was particularly interesting to me. Having grown up in Toledo, Ohio, 60 miles south of Detroit, I'd seen the effects of Detroit's dramatic population decrease first-hand. I wanted to see how this story played out in the data, and what other interesting trends would be unearthed.</p>
<p>So when I decided I wanted to create a bar chart race, I knew the subject I was going to study. If you're here to learn how to create a bar chart race in R, you're in the right place! Now, let's get into it!</p>
<h2>Loading packages and data</h2>
<p>We start by loading the packages we'll use to create the graph. gganimate provides the toolkit for animation, tidyverse will help with data processing and graphing, and hrbrthemes provides a nice-looking base graphing theme.</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>gganimate<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
</pre></div>
<p>Now we load in some preprocessed census population data, based on decadal U.S. Census data from 1790 to 2010, and combine it into one large dataset for this analysis.</p>
<div class="highlight"><pre><span></span><span class="c1"># Read in Census datasets by year that I downloaded and stored locally</span>
all_data <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">()</span>
<span class="kr">for</span><span class="p">(</span>year <span class="kr">in</span> <span class="kp">seq</span><span class="p">(</span><span class="m">1790</span><span class="p">,</span> <span class="m">2010</span><span class="p">,</span> <span class="m">10</span><span class="p">))</span> <span class="p">{</span>
data_path <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'~/dev/michaeltoth/content/data/city_populations/'</span><span class="p">,</span> year<span class="p">,</span> <span class="s">'.csv'</span><span class="p">)</span>
data <span class="o"><-</span> read_csv<span class="p">(</span>data_path<span class="p">)</span>
data <span class="o"><-</span> data<span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]</span>
<span class="kp">colnames</span><span class="p">(</span>data<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Rank'</span><span class="p">,</span> <span class="s">'City'</span><span class="p">,</span> <span class="s">'State'</span><span class="p">,</span> <span class="s">'Population'</span><span class="p">,</span> <span class="s">'Region'</span><span class="p">)</span>
data<span class="o">$</span>year <span class="o"><-</span> year
all_data <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>all_data<span class="p">,</span> data<span class="p">)</span>
<span class="p">}</span>
<span class="c1"># The datasets were inconsistent with state naming, sometimes using full names</span>
<span class="c1"># And sometimes abbreviations. This code standardizes on state names:</span>
all_data<span class="o">$</span>State_From_Abbrev <span class="o"><-</span> state.name<span class="p">[</span><span class="kp">match</span><span class="p">(</span>all_data<span class="o">$</span>State<span class="p">,</span>state.abb<span class="p">)]</span>
all_data <span class="o"><-</span> all_data <span class="o">%>%</span> mutate<span class="p">(</span>State <span class="o">=</span> case_when<span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>State_From_Abbrev<span class="p">)</span> <span class="o">~</span> State<span class="p">,</span>
<span class="kc">TRUE</span> <span class="o">~</span> State_From_Abbrev<span class="p">))</span> <span class="o">%>%</span>
select<span class="p">(</span><span class="o">-</span>State_From_Abbrev<span class="p">)</span>
</pre></div>
<h2>Interpolating missing values between census readings</h2>
<p>Here I'm going to make some adjustments to the datasets I'm using to get them in a format I want to work with. There are 2 things in particular I want to accomplish: </p>
<ol>
<li>The datasets only contain information at decade intervals. I want yearly data, so I'm going to create blank entries for intermediate years that I'll later fill with linear interpolation. </li>
<li>The datasets generally contain the 100 most populous cities, but I only care about the top 10 at any given time, so I'm going to discard any cities that don't at some point crack the top 10.</li>
</ol>
<div class="highlight"><pre><span></span><span class="c1"># Get the list of cities that were at some point in the top 10 by population</span>
top_cities <span class="o"><-</span> all_data <span class="o">%>%</span> filter<span class="p">(</span>Rank <span class="o"><=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%>%</span>
select<span class="p">(</span>City<span class="p">,</span> State<span class="p">,</span> Region<span class="p">)</span> <span class="o">%>%</span> distinct<span class="p">()</span>
<span class="c1"># Generate a list of all years from 1790 - 2010</span>
all_years <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>year <span class="o">=</span> <span class="kp">seq</span><span class="p">(</span><span class="m">1790</span><span class="p">,</span> <span class="m">2010</span><span class="p">,</span> <span class="m">1</span><span class="p">))</span>
<span class="c1"># Create all combinations of city and year we'll need for our final dataset</span>
all_combos <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>top_cities<span class="p">,</span> all_years<span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span>
<span class="c1"># This accomplishes 2 things:</span>
<span class="c1"># 1. Filters out cities that are not ever in the top 10</span>
<span class="c1"># 2. Adds rows for all years (currently blank) to our existing dataset for each city</span>
all_data_interp <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>all_data<span class="p">,</span> all_combos<span class="p">,</span> all.y <span class="o">=</span> <span class="bp">T</span><span class="p">)</span>
</pre></div>
<p>Next, I'll use linear interpolation to estimate city populations between the census readings, which are taken every 10 years. This isn't strictly necessary, but I think it produces a more interesting final graphic than using only the official census statistics.</p>
<div class="highlight"><pre><span></span>all_data_interp <span class="o"><-</span> all_data_interp <span class="o">%>%</span>
group_by<span class="p">(</span>City<span class="p">)</span> <span class="o">%>%</span>
mutate<span class="p">(</span>Population<span class="o">=</span>approx<span class="p">(</span>year<span class="p">,</span>Population<span class="p">,</span>year<span class="p">)</span><span class="o">$</span>y<span class="p">)</span>
</pre></div>
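<p>To see what <code>approx</code> does here, consider a single city with readings only at the decade endpoints (the numbers below are made up for illustration):</p>

```r
# Two census readings ten years apart (hypothetical populations)
years <- c(1790, 1800)
pop   <- c(30000, 60000)

# approx() linearly interpolates a value for every year in between
approx(x = years, y = pop, xout = 1790:1800)$y
# returns 30000, 33000, 36000, ..., 60000 (steps of 3000 per year)
```

<p>The <code>group_by(City)</code> in the pipeline above simply applies this same interpolation to each city's series independently.</p>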
<p>Last step before we graph! Here we calculate the ranked list of the top 10 cities for each year,
then filter so only those cities remain in the data for that year. </p>
<div class="highlight"><pre><span></span>data <span class="o"><-</span> all_data_interp <span class="o">%>%</span>
group_by<span class="p">(</span>year<span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="o">-</span>Population<span class="p">)</span> <span class="o">%>%</span>
mutate<span class="p">(</span>rank<span class="o">=</span>row_number<span class="p">())</span> <span class="o">%>%</span>
filter<span class="p">(</span><span class="kp">rank</span><span class="o"><=</span><span class="m">10</span><span class="p">)</span>
</pre></div>
<h2>Animating the bar chart race in R</h2>
<p>Finally, we create the graph! This piece of code looks a bit intimidating, but mostly it's formatting for the graph. Much of the core code here comes from <a href="https://github.com/stevejburr/Bar-Chart-Race/blob/master/Final.R">this code by Steven Burr</a>, which was very helpful as I tried to figure out how best to use gganimate for this purpose. The key points to call out: </p>
<ol>
<li>I use <code>geom_tile</code>, not <code>geom_bar</code>, as this allows for better transitions within gganimate.</li>
<li>The gganimate functions <code>transition_time</code> and <code>ease_aes</code> handle the animation and the transitions between bars. The settings here worked well for my purposes, but dig into these functions to get an overview of the different options.</li>
<li>The <code>nframes</code> and <code>fps</code> parameters to the <code>animate</code> function control the speed of transitions. One mistake I made here initially was to set <code>nframes</code> equal to the number of years in the dataset. This works, but because there is only one frame per year, you don't get the smooth transitions that I wanted in this graph. Increasing the number of frames fixed that issue.</li>
</ol>
<div class="highlight"><pre><span></span>p <span class="o"><-</span> data <span class="o">%>%</span>
ggplot<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="o">-</span><span class="kp">rank</span><span class="p">,</span>y <span class="o">=</span> Population<span class="p">,</span> group <span class="o">=</span> City<span class="p">))</span> <span class="o">+</span>
geom_tile<span class="p">(</span>aes<span class="p">(</span>y <span class="o">=</span> Population <span class="o">/</span> <span class="m">2</span><span class="p">,</span> height <span class="o">=</span> Population<span class="p">,</span> fill <span class="o">=</span> Region<span class="p">),</span> width <span class="o">=</span> <span class="m">0.9</span><span class="p">)</span> <span class="o">+</span>
geom_text<span class="p">(</span>aes<span class="p">(</span>label <span class="o">=</span> City<span class="p">),</span> hjust <span class="o">=</span> <span class="s">"right"</span><span class="p">,</span> colour <span class="o">=</span> <span class="s">"black"</span><span class="p">,</span> fontface <span class="o">=</span> <span class="s">"bold"</span><span class="p">,</span> nudge_y <span class="o">=</span> <span class="m">-100000</span><span class="p">)</span> <span class="o">+</span>
geom_text<span class="p">(</span>aes<span class="p">(</span>label <span class="o">=</span> scales<span class="o">::</span>comma<span class="p">(</span>Population<span class="p">)),</span> hjust <span class="o">=</span> <span class="s">"left"</span><span class="p">,</span> nudge_y <span class="o">=</span> <span class="m">100000</span><span class="p">,</span> colour <span class="o">=</span> <span class="s">"grey30"</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">(</span>clip<span class="o">=</span><span class="s">"off"</span><span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">'Region'</span><span class="p">,</span> values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"#66c2a5"</span><span class="p">,</span> <span class="s">"#fc8d62"</span><span class="p">,</span> <span class="s">"#8da0cb"</span><span class="p">,</span> <span class="s">"#e78ac3"</span><span class="p">))</span> <span class="o">+</span>
scale_x_discrete<span class="p">(</span><span class="s">""</span><span class="p">)</span> <span class="o">+</span>
scale_y_continuous<span class="p">(</span><span class="s">""</span><span class="p">,</span>labels<span class="o">=</span>scales<span class="o">::</span>comma<span class="p">)</span> <span class="o">+</span>
hrbrthemes<span class="o">::</span>theme_ipsum<span class="p">(</span>plot_title_size <span class="o">=</span> <span class="m">32</span><span class="p">,</span> subtitle_size <span class="o">=</span> <span class="m">24</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">20</span><span class="p">,</span> base_size <span class="o">=</span> <span class="m">20</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y<span class="o">=</span>element_blank<span class="p">(),</span>
panel.grid.minor.x<span class="o">=</span>element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0.4</span><span class="p">,</span> <span class="m">0.2</span><span class="p">),</span>
plot.margin <span class="o">=</span> margin<span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="s">"cm"</span><span class="p">),</span>
axis.text.y<span class="o">=</span>element_blank<span class="p">())</span> <span class="o">+</span>
<span class="c1"># gganimate code to transition by year:</span>
transition_time<span class="p">(</span>year<span class="p">)</span> <span class="o">+</span>
ease_aes<span class="p">(</span><span class="s">'cubic-in-out'</span><span class="p">)</span> <span class="o">+</span>
labs<span class="p">(</span>title<span class="o">=</span><span class="s">'Largest Cities in the United States'</span><span class="p">,</span>
subtitle<span class="o">=</span><span class="s">'Population in {round(frame_time,0)}'</span><span class="p">,</span>
caption<span class="o">=</span><span class="s">'Source: United States Census</span>
<span class="s">michaeltoth.me / @michael_toth'</span><span class="p">)</span>
animate<span class="p">(</span>p<span class="p">,</span> nframes <span class="o">=</span> <span class="m">750</span><span class="p">,</span> fps <span class="o">=</span> <span class="m">25</span><span class="p">,</span> end_pause <span class="o">=</span> <span class="m">50</span><span class="p">,</span> width <span class="o">=</span> <span class="m">1200</span><span class="p">,</span> height <span class="o">=</span> <span class="m">900</span><span class="p">)</span>
</pre></div>
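<p>If you want to write the rendered animation to disk rather than just view it, gganimate's <code>anim_save</code> function saves the most recently rendered animation (the filename here is just an example):</p>

```r
library(gganimate)

# Saves the animation produced by the animate() call above;
# by default, anim_save() writes the last rendered animation
anim_save("city_populations.gif")
```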
<p><img alt="center" src="/figures/city_populations/create_graph-1.gif" /></p>
<p>And there we have it! If you end up creating a bar chart race of your own, please share it in the comments - I'd love to take a look! </p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>You Need to Start Branding Your Graphs. Here's How, with ggplot!2019-02-27T00:00:00-05:00Michael Tothtag:michaeltoth.me,2019-02-27:you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html<p>In today's post I want to help you incorporate your company's branding into your ggplot graphs. Why should you care about this? I'm glad you asked!</p>
<p>
<img src="/figures/add_branding_to_graphs/ggplot_base-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Have you ever seen a graph that looks like this? Of course you have! This is the default ggplot theme, and these graphs are everywhere. Now, look--I like this graph. The base ggplot theme is reasonable and the graph is clear. But nothing about it differentiates it from the thousands of other, similarly designed graphs on the internet. That's not good. You want your graphs to stand out! Take a look at this next graph:</p>
<center>
<p><img alt="FiveThirtyEight Graph" src="https://fivethirtyeight.com/wp-content/uploads/2014/05/morris-feature-qbweight-chart-3.png" width="500px" /></p>
</center>
<p>I'm guessing, even before you saw the caption at the bottom, that you knew this was a graph from FiveThirtyEight, Nate Silver's data-driven news service.</p>
<p>How about this one?</p>
<center>
<p><img alt="Economist Graph" src="https://www.economist.com/sites/default/files/imagecache/1280-width/images/2019/02/articles/body/20190216_woc346.png" width="300px" /></p>
</center>
<p>Again, this graph is immediately recognizable as coming from the Economist magazine. These two companies have done exceptional jobs of creating branded, differentiating styles to make their graphics immediately recognizable to anybody who sees them. In this post, I'm going to convince you why it's important that you develop a branded style for your graphs at your own company, and then I'll show you some quick steps to do it. </p>
<p>Now, I know, you might be thinking: branding and visual identity are something for the design and marketing teams to worry about. You're a data scientist! You don't have those skills, and, frankly, you have more important things to do. I sympathize, but I'm going to be honest with you: you need to get that idea out of your head. It's easier than you think, and it's part of your job! Or, at least, it should be. You see, when you start to make YOUR work fit in with YOUR COMPANY'S work, doors start to open for you. You create more cross-departmental and external-facing opportunities when your work clearly matches the company brand. Maybe a graph you created can help the sales team put together a presentation to win a big client. Maybe the marketing team can use your work to help put together a press kit. These probably aren't the core focus of your job, but the more you help people around your organization, the more respected you'll be and the more opportunities you'll have. </p>
<p>Over time, you will build a reputation and be recognized for your quality work because your work will be VISIBLE. Pretty soon, you will be the go-to person when an executive needs a graph for a board presentation or an investor pitch. As you build more relationships throughout your company, you'll be able to direct your focus to work that you enjoy, get involved in interesting projects, and advocate for projects of your own. And the best part? Most of this is completely passive! You're already doing the work, these tips will just help you to make it more visible, more shareable, and more impactful!</p>
<p>Convinced? Okay, let's get started. This is going to be one of the easiest high-impact changes you can make, so I hope you're excited!</p>
<p>I'm going to show you how you can easily change the color palette of a graph and add your company's logo to create a final, branded image that's ready for publication. </p>
<p>To start, create a standard ggplot graph like you otherwise would:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>magick<span class="p">)</span>
<span class="c1"># Create a base graph, similar to what we had above</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>iris<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> Petal.Width<span class="p">,</span> y <span class="o">=</span> Petal.Length<span class="p">,</span> color <span class="o">=</span> Species<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">()</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Branding your ggplot Graphs'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Simple tweaks you can use to boost the impact of your graphs today'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'This axis title intentionally left blank'</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'This axis title intentionally left blank'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span>
p
</pre></div>
<p><img src="/figures/add_branding_to_graphs/create_graph-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Once you have your base graph put together, the next step is for you to change the colors to match your company's color palette. I'm going to create a Coca-Cola branded graph for demonstration purposes, but you should substitute in your own company's details here. If you don't know what your company's color palette is, ask somebody from your design or marketing teams to send it to you! Here, I'm using red, black, and gray to match Coca-Cola's color palette. </p>
<p>Choosing individual colors like I'm doing here works for categorical graphs, but for continuous graphs you'll need to do a bit more work to get a branded color scale. That's a subject for another post, but check out the awesome <a href="https://projects.susielu.com/viz-palette?colors=%255B%2522#1DABE6%22,%22#1C366A%22,%22#C3CED0%22,%22#E43034%22,%22#FC4E51%22,%22#AF060F%22%5D&backgroundColor=%22white%22&fontColor=%22black%22" target="_blank">Viz Palette</a> tool by Elijah Meeks and Susie Lu for a sense of what's possible. As you become more familiar, you can create custom ggplot themes and color palettes to make this process seamless, but I don't want to get into all of that here, as it can be overwhelming to learn everything at once. </p>
<div class="highlight"><pre><span></span><span class="c1"># Customize the graphs with your company's color palette</span>
p <span class="o"><-</span> p <span class="o">+</span> scale_color_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
labels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Black'</span><span class="p">,</span> <span class="s">'Red'</span><span class="p">,</span> <span class="s">'Gray'</span><span class="p">),</span>
values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#000000'</span><span class="p">,</span> <span class="s">'#EC0108'</span><span class="p">,</span> <span class="s">'#ACAEAD'</span><span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">()</span> <span class="o">+</span>
theme<span class="p">(</span>plot.title <span class="o">=</span> element_text<span class="p">(</span>color <span class="o">=</span> <span class="s">"#EC0108"</span><span class="p">),</span>
plot.caption <span class="o">=</span> element_text<span class="p">(</span>color <span class="o">=</span> <span class="s">"#EC0108"</span><span class="p">,</span> face <span class="o">=</span> <span class="s">'bold'</span><span class="p">))</span>
p
</pre></div>
<p><img src="/figures/add_branding_to_graphs/change_colors-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Finally, let's add your company's logo to the graph for a complete, branded, and publication-ready graph. Download a moderately high-resolution logo and save it somewhere on your machine. The workhorse here is the <code>grid.raster</code> function, which can render an image on top of a pre-existing image (in this case, your graph). The trick is to get the positioning and sizing right. This can be a bit confusing when you're first starting out with image manipulation, so I'll walk through the parameters one by one:</p>
<ul>
<li><em>x</em>: this controls the x-position of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the left of the graph and 1 represents a position all the way on the right.</li>
<li><em>y</em>: this controls the y-position of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the bottom of the graph and 1 represents a position all the way on the top.</li>
<li><em>just</em>: this is a set of two values, the first corresponding to the horizontal justification and the second corresponding to the vertical justification. With <em>x</em> and <em>y</em> we chose a position on the grid to place the logo. <em>just</em> lets us choose how to justify the image at that location. Here, I've selected 'left' and 'bottom' justification, which means that the left bottom corner of the logo will be placed at the x-y coordinates specified.</li>
<li><em>width</em>: this scales the logo down to a smaller size so it can be placed on the image. Here, I'm scaling the logo down to a 1-inch size, but the size you'll want will depend on the size of the graph you've created. Play around with different sizes until you find one that feels right.</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Add your company's logo to the graph you created</span>
logo <span class="o"><-</span> image_read<span class="p">(</span><span class="s">"~/dev/michaeltoth/output/images/logo.jpg"</span><span class="p">)</span>
p
grid<span class="o">::</span>grid.raster<span class="p">(</span>logo<span class="p">,</span> x <span class="o">=</span> <span class="m">0.07</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.03</span><span class="p">,</span> just <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'left'</span><span class="p">,</span> <span class="s">'bottom'</span><span class="p">),</span> width <span class="o">=</span> unit<span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="s">'inches'</span><span class="p">))</span>
</pre></div>
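<p>Because <code>grid.raster</code> draws on whatever graphics device is currently open, you can save the combined plot and logo by opening a file device first. This sketch assumes the <code>p</code> and <code>logo</code> objects created above; the filename and dimensions are illustrative:</p>

```r
# Open a PNG device, draw the plot and the logo, then close the device
png("branded_graph.png", width = 800, height = 600)
print(p)
grid::grid.raster(logo, x = 0.07, y = 0.03,
                  just = c('left', 'bottom'),
                  width = grid::unit(1, 'inches'))
dev.off()
```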
<p><img src="/figures/add_branding_to_graphs/add_logo-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>And there we have it: a branded graph that would be suitable for a sales meeting, marketing presentation, or investor deck! </p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>I'm Blogging Again!2019-01-10T00:00:00-05:00Michael Tothtag:michaeltoth.me,2019-01-10:im-blogging-again.html<p>2018 was a big year for me, filled with life developments, new adventures, and lots of change. Having spent 3 years working at Orchard, a financial technology startup in the online lending space, I was ready for a change. I decided in early 2018 to leave and take some time off, not having a clear sense of what I wanted to do next. This was the first time since graduating from college that I had left a job without a plan, and I was both nervous and excited. Luckily, my fears that this would make me unemployable, cripple my ability to negotiate a salary, and generally ruin my life turned out to be unfounded! I spent four months traveling, working on side projects, exploring New York City, and trying to figure out what I wanted to do next. <a href="https://www.instagram.com/p/BgpNDeAB9be/" target="_blank">I learned to ski</a>, and I got really into it, spending a total of ten days on the slopes between January and March. <a href="https://www.instagram.com/p/Bm1x7Q1BRDl/" target="_blank">I got engaged</a> and will be getting married this year! I also spent more time visiting home, Ohio, than I think I have in total in all other years since I graduated from college. Finally, I was able to devote more time to creating the <a href="http://www.artfuldataprints.com" target="_blank">data-driven art</a> that I sell on Etsy, growing my sales by 3x and my revenues more than 5x from 2017!</p>
<p><img src="../images/maggie_mike_37.jpg"></p>
<p>As I considered what I wanted to do next, I thought back on my career and where I was uniquely qualified to help people. In my time at Orchard, I had grown from a data scientist into a senior data scientist, built out a team of my own, and ultimately became the head of research. I had a broad mandate to dig through Orchard's massive financial dataset to find meaningful insights that might help our clients and to write about them in blog posts, white papers, and newsletters. I enjoyed the work of finding new insights that others had not yet uncovered, and I especially enjoyed the task of visualizing and explaining those insights to others in an accessible way. I became quite good at it, and my writing brought tens of thousands of readers--and potential clients--to Orchard. My writing also gained significant press coverage, with my work covered by the <a href="https://www.ft.com/content/32b08804-e008-11e7-a8a4-0a1e63a52f9c" target="_blank">Financial Times</a>, <a href="https://www.cnbc.com/2018/02/23/why-warren-buffett-is-such-an-influential-leader-according-to-data.html" target="_blank">CNBC</a>,
<a href="https://www.inc.com/minda-zetlin/this-1-mental-habit-helped-make-warren-buffett-a-billionaire.html" target="_blank">Inc</a>, and
<a href="https://www.marketwatch.com/story/warren-buffetts-disarmingly-simple-investment-strategy-explained-by-big-data-2017-06-26" target="_blank">MarketWatch</a>, among others. I was excited to be able to contribute so much to Orchard's brand and market position with my data analysis and writing. Everybody in the industry was familiar with Orchard and the quality work we did, which enabled our sales team to get meetings with nearly any company they wanted.</p>
<p>What surprised me at the time--and still does today--was how few companies were leveraging their data to create content and expand their reach. In 2019, it's hard to find a company that isn't using their data to improve internal decision-making and forecasting. "Big Data", "Machine Learning", and "Artificial Intelligence" are buzzwords that can be heard across boardrooms from Silicon Valley to New York City. Still, the unique datasets possessed by most technology companies today are vastly underutilized when it comes to generating compelling content, building a brand, and establishing credibility. Almost nobody was doing what I had done at Orchard, and nobody was doing it to the extent that I had done it. Those companies that do share insights from their data--<a href="https://medium.com/airbnb-engineering" target="_blank">Airbnb</a>, <a href="https://research.fb.com/blog/" target="_blank">Facebook</a>, <a href="https://medium.com/netflix-techblog" target="_blank">Netflix</a>, <a href="https://multithreaded.stitchfix.com/blog/" target="_blank">Stitch Fix</a>, <a href="https://eng.uber.com/" target="_blank">Uber</a>--usually do so through an engineering blog, with highly technical content aimed more at finding new employees than finding new customers. There's nothing wrong with that, of course, but companies should also be using their data to create compelling content for their <strong>customers</strong> that helps build awareness, credibility, and trust. This idea is already well established at large financial companies, who often provide market research for free with the goal of building client trust, gaining credibility, establishing relationships, and making more sales. In technology, however, where companies can generate even more impact given the unique datasets at their disposal, almost nobody is doing this! 
I knew that this was too big an opportunity to let pass, and I knew that I was the perfect person to help companies better leverage their data in this way. I created a company, <a href="https://www.michaeltothanalytics.com" target="_blank">Michael Toth Analytics</a>, and started working on this idea full time in October. It's still early days, but I'm getting positive responses and am optimistic about the new year.</p>
<h3>The State of This Blog in 2019</h3>
<p>Working as a consultant means that I don't have many long-term projects like I would at a typical job. In some ways, this is nice, as it means I don't end up maintaining projects that have grown boring and uninteresting to me. But there are downsides as well. I like the feeling of accomplishment that comes with building something up over time and seeing it evolve. With that in mind, I've decided to start blogging more frequently again. In particular, one of my New Year's Resolutions for 2019 is to blog at least biweekly, for a total of 26 blog posts in 2019. I credit much of my professional success in recent years to work directly related to this blog. It helped me secure my first data science job at Orchard, moving from a financial analytics role at BlackRock where I had spent four years. I knew that data science was a competitive field to break into, and I was worried about my lack of formal training. I had self-studied data science through Coursera and similar sites, but that was it. So, before applying to Orchard, I took the time to research and write a <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html" target="_blank">blog post</a> using publicly available data from Lending Club, then the largest company in Orchard's industry. As a direct result of that blog post, my resume floated to the top of their pile, my interviews felt largely like a formality, as they already knew I was capable of the exact work they wanted me to do, and I ended up securing a position. This was a huge success for a relatively small time investment on my part.</p>
<p>This is a brief aside, but I <strong>highly</strong> recommend this approach to aspiring data scientists, or to anybody looking to make a move in their career. If you can demonstrate to your target company that you're skilled, competent, and hard-working ahead of time, you will be light-years ahead of your competition. Look--I know that applying for jobs is daunting. A quick LinkedIn search for data scientist positions showed me jobs that had been posted for two days that already had hundreds of applicants. Having been on the hiring side, I can also confirm that these figures are true. But they're also incredibly misleading. Jobs I've hired for have indeed received hundreds of applications, but the vast majority of those applications were immediately filtered out. Probably 5%-10% of the people who applied for my jobs demonstrated that they had relevant skills to do the work. Probably only 1%-2% had any specific domain expertise relevant to the position. For a job that received 500 applicants, I might have 5-10 people who seemed like qualified applicants. Imagine if you were a qualified applicant and had also written a blog post about the very challenges that company is facing, demonstrating your commitment, your work ethic, and your expertise! The worst thing that could happen is you spend a few hours working on a blog post, learn a few things, and don't ultimately get the job. The best thing that could happen is you get the exact job you're looking for, and you come into the company looking like the expert that you are, possibly commanding a significantly higher salary than you would otherwise! Almost nobody does this, but it's such a huge game changer that I cannot possibly recommend this approach more highly. Now, moving on!</p>
<p>This blog has also helped me to establish my presence online. In 2017, I wrote a <a href="https://michaeltoth.me/sentiment-analysis-of-warren-buffetts-letters-to-shareholders.html" target="_blank">sentiment analysis of Warren Buffett's Letters to Shareholders</a> that went viral and earned me significant press coverage, which was pretty exciting! I've also been able to grow my following on <a href="https://www.twitter.com/Michael_Toth" target="_blank">Twitter</a>, with over 4000 like-minded professional contacts following me there. Additionally, the blog has allowed me to explore new data science and data visualization techniques that have significantly improved my abilities over the years. All I need to do to confirm this is look at some of my earlier posts to see how far I've come.</p>
<p>Anyway, all this to say, I've seen enormous benefits over the years as a direct result of the work that I've done on this blog. Still, sitting down and writing a blog post that may or may not gain any traction has always been a challenge. I obsess over each post, wanting it to be perfect, which ends up taking a significant chunk of time. To address that, I want to force myself to blog more frequently to remove some of the pressure I feel for any particular post to be perfect. I'm reminded of the story of the ceramics teacher from the book Art & Fear by David Bayles & Ted Orland: </p>
<blockquote>
<p>The ceramics teacher announced on opening day that he was dividing the class into two groups. All those on the left side of the studio, he said, would be graded solely on the quantity of work they produced, all those on the right solely on its quality.</p>
<p>His procedure was simple: on the final day of class he would bring in his bathroom scales and weigh the work of the “quantity” group: fifty pounds of pots rated an “A”, forty pounds a “B”, and so on. Those being graded on “quality”, however, needed to produce only one pot – albeit a perfect one – to get an “A”.</p>
<p>Well, came grading time and a curious fact emerged: the works of highest quality were all produced by the group being graded for quantity. It seems that while the “quantity” group was busily churning out piles of work – and learning from their mistakes – the “quality” group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay.</p>
</blockquote>
<p>When you produce a large body of work, some of the work you create will be great, and some will be poor. But over time you develop your skills and your work improves. I think my goal of biweekly blog posts will help me to overcome the fear of putting work out there and hopefully result in a significant improvement to my abilities over the course of the year. We'll see!</p>
<p>While this blog has always been and will continue to be primarily a data science and data visualization blog, I also intend to write some different content this year, similar to today's post. There are two reasons for this. First, I feel that only posting data analysis is unnecessarily constraining. Second, I think that I have relied on data science blog posts as a bit of a crutch, allowing me to put myself out there without really exposing myself to any vulnerability or potential criticism. As long as I'm presenting facts based on data analysis, it feels that there is not much room for criticism. This type of post allows me to express my thoughts and opinions, be vulnerable, and become more comfortable putting myself and my ideas into the world. This is a skill I definitely want to develop, and I hope to continue this both here and on Twitter in 2019. </p>
<p>This blog post has gotten quite long, so I will end here. While 2018 was a big year for me, I'm looking forward to 2019, and I think that it might be even better! I'll be publishing here every two weeks, and you can follow me <a href="https://www.twitter.com/Michael_Toth" target="_blank">on Twitter</a> if you're interested in getting notified when a new one is available!</p>
<p>If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Write Pelican Blog Posts using RMarkdown & Knitr, 2.02017-06-14T00:00:00-04:00Michael Tothtag:michaeltoth.me,2017-06-14:how-to-write-pelican-blog-posts-using-rmarkdown-knitr-20.html<p><a href="https://michaeltoth.me/how-to-write-pelican-blog-posts-using-rmarkdown-knitr.html">Back in January</a> I wrote a post discussing how to get RMarkdown and Pelican to work together to make the R-analysis-to-blog-post workflow a bit easier. While I had high hopes, I was never really happy with the setup I put together then, so I set out to update it. </p>
<p>In this post I'm going to talk about my new, improved way of publishing Pelican blog posts using RMarkdown. I'm assuming you already have a Pelican blog set up, so I won't be covering that in today's post. If you're interested but haven't yet set up a blog for yourself, it's quite straightforward! I recommend checking out these links:</p>
<ul>
<li><a href="http://docs.getpelican.com/en/stable/quickstart.html">Official Pelican Guide</a></li>
<li><a href="http://duncanlock.net/blog/2013/05/17/how-i-built-this-website-using-pelican-part-1-setup/">Detailed Pelican Setup by Duncan Lock</a></li>
</ul>
<h3>Issue with Old Setup</h3>
<p>The setup I recommended in my prior post used a Pelican plugin called rmd_reader to convert .Rmd files to standard .md files that Pelican could read. For taking a static .Rmd post and creating a published post, this worked pretty well. One of my favorite things about my Pelican setup, though, was using the development server feature. Basically, this runs a web server locally, monitoring your content directory for any changes, and automatically regenerates your site whenever it finds a change. This feature did not play nice with the rmd_reader plugin. When you start the development server, rmd_reader starts converting any .rmd files to .md files. This action would trigger the development server to restart, as it identified changes in the content directory, and you'd find yourself stuck in an infinite loop of regeneration. Admittedly, it's a minor issue, and I probably could have hacked together a solution, but I didn't want to make changes to the base Pelican code or the rmd_reader code, because I wanted this to be portable to other systems. So in the end, I decided I needed another solution. </p>
<h3>New Solution & Setup</h3>
<p>The challenge is to find a way to run your .Rmd code, producing any desired figures and code chunks, then store the results in a .md file that is readable by Pelican. I remembered having read about how David Robinson built <a href="http://varianceexplained.org/">his site</a> using a <a href="https://github.com/dgrtwo/dgrtwo.github.com/blob/master/_scripts/knitpages.R">custom R script</a> to convert each .Rmd file to a .md file using knitr commands, and I set out to see if I could modify that for my purposes. </p>
<p>Below is the final knitpages.R file I'm using, having made a few minor changes to David's file, which was optimized for Jekyll blogs:</p>
<div class="highlight"><pre><span></span><span class="c1">#!/usr/local/bin/Rscript --vanilla</span>
<span class="c1"># compiles all .Rmd files in _R directory into .md files in blog directory,</span>
<span class="c1"># if the input file is older than the output file.</span>
<span class="c1"># run ./knitpages.R to update all knitr files that need to be updated.</span>
<span class="c1"># run this script from your base content directory</span>
<span class="kn">library</span><span class="p">(</span>knitr<span class="p">)</span>
KnitPost <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>input<span class="p">,</span> outfile<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">,</span> base.url<span class="o">=</span><span class="s">"/"</span><span class="p">)</span> <span class="p">{</span>
    opts_knit<span class="o">$</span>set<span class="p">(</span>base.url <span class="o">=</span> base.url<span class="p">)</span>
    fig.path <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span>figsfolder<span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>input<span class="p">)),</span> <span class="s">"/"</span><span class="p">)</span>
    cache.path <span class="o"><-</span> <span class="kp">file.path</span><span class="p">(</span>cachefolder<span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>input<span class="p">)),</span> <span class="s">"/"</span><span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>fig.path <span class="o">=</span> fig.path<span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>cache.path <span class="o">=</span> cache.path<span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>fig.cap <span class="o">=</span> <span class="s">"center"</span><span class="p">)</span>
    render_markdown<span class="p">()</span>
    knit<span class="p">(</span>input<span class="p">,</span> outfile<span class="p">,</span> envir <span class="o">=</span> <span class="kp">parent.frame</span><span class="p">())</span>
<span class="p">}</span>
knit_folder <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>infolder<span class="p">,</span> outfolder<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">,</span> force <span class="o">=</span> <span class="bp">F</span><span class="p">)</span> <span class="p">{</span>
    <span class="kr">for</span> <span class="p">(</span>infile <span class="kr">in</span> <span class="kp">list.files</span><span class="p">(</span>infolder<span class="p">,</span> pattern <span class="o">=</span> <span class="s">"*.Rmd"</span><span class="p">,</span> full.names <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span> <span class="p">{</span>
        <span class="kp">print</span><span class="p">(</span>infile<span class="p">)</span>
        outfile <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span>outfolder<span class="p">,</span> <span class="s">"/"</span><span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">".md"</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>infile<span class="p">)))</span>
        <span class="kp">print</span><span class="p">(</span>outfile<span class="p">)</span>
        <span class="c1"># knit only if the output is missing or the input has been modified more recently</span>
        <span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">file.exists</span><span class="p">(</span>outfile<span class="p">)</span> <span class="o">|</span> <span class="kp">file.info</span><span class="p">(</span>infile<span class="p">)</span><span class="o">$</span>mtime <span class="o">></span> <span class="kp">file.info</span><span class="p">(</span>outfile<span class="p">)</span><span class="o">$</span>mtime<span class="p">)</span> <span class="p">{</span>
            KnitPost<span class="p">(</span>infile<span class="p">,</span> outfile<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
knit_folder<span class="p">(</span><span class="s">"_R"</span><span class="p">,</span> <span class="s">"blog"</span><span class="p">,</span> <span class="s">"figures/"</span><span class="p">,</span> <span class="s">"_caches"</span><span class="p">)</span>
</pre></div>
<p>The only real change to David's script is updating <code>render_jekyll()</code> to <code>render_markdown()</code>. I also had to change the path to the Rscript executable (first line), which you may need to do based on your OS. Run <code>which Rscript</code> from your terminal to find the correct path.</p>
<p>You should modify the knit_folder command at the bottom to reflect your own blog's directory structure. Here's how this script works:</p>
<ol>
<li>The script finds all .Rmd files in your infolder, ignoring old and unchanged files.</li>
<li>New and updated files are passed to the KnitPost function, which runs the .Rmd file, saving any generated images to the figsfolder directory and storing any cached data to the cachefolder.</li>
<li>An output .md file is created in the outfolder directory with links to any figures generated by R.</li>
</ol>
<h3>New Process</h3>
<p>Here's my updated process for publishing today:</p>
<ol>
<li>From the content directory of my blog, I run knitpages.R, which converts any new or updated .Rmd files to Pelican-readable .md files.</li>
<li>Next I generate my Pelican site locally.</li>
<li>When I'm satisfied with the results, I can easily push to my web server and run <code>make publish</code> from there to generate my site.</li>
</ol>
<p>I like this solution because it's a bit cleaner and requires less overhead than the one I wrote about previously. I've noticed that, personally, one of the biggest issues preventing me from publishing more frequently has been friction in the publishing process. This solution goes a long way toward solving that, and I hope it helps me increase my frequency of publishing. I'd like to extend a big thanks to David for sharing his knitpages.R code and examples on his blog; it made the process of setting this up so much easier!</p>Sentiment Analysis of Warren Buffett's Letters to Shareholders2017-03-20T00:00:00-04:00Michael Tothtag:michaeltoth.me,2017-03-20:sentiment-analysis-of-warren-buffetts-letters-to-shareholders.html<p>Last week, I was reading through Warren Buffett's most recent letter to Berkshire Hathaway shareholders. Every year, he writes a letter that he makes <a href="http://www.berkshirehathaway.com/letters/letters.html">publicly available</a> on the Berkshire Hathaway website. In the letters he talks about the performance of Berkshire Hathaway and their portfolio of businesses and investments. But he also talks about his views on business, the market, and investing more generally, and it's after this wisdom that many investors, including me, read what he has to say. </p>
<p>In many ways Warren Buffett's letters are atypical. When most companies report their financial performance, they fill their reports with dense, technical language designed to obscure and confuse. Mr. Buffett does not follow this approach. His letters are written in easily understandable language, because he wants them to be accessible to everybody. Warren Buffett is not often swayed by what others are doing. He goes his own way, and that has been a source of incredible strength. In annually compounded returns, Berkshire stock has gained 20.8% since 1965, while the S&P 500 as a whole has gained only 9.7% over the same period. To highlight how truly astounding this performance is, one dollar invested in the S&P in 1965 would have grown to $112.34 by the end of 2016, while the same dollar invested in Berkshire stock would have grown to the massive sum of $15,325.46!</p>
<p>I've been reading the annual Berkshire letters when they come out for the last few years. One day I'll sit down and read through all of them, but I haven't gotten around to it yet. But while I was reading through his most recent letter last week, I got to thinking. I wondered whether there are any trends in his letters over time, or how strongly his writings are influenced by external market factors. I decided I could probably answer some of these questions through a high-level analysis of the text in his letters, which brings me to the subject of this blog post. </p>
<p>In this post I'm going to be performing a sentiment analysis of the text of Warren Buffett's letters to shareholders from 1977 - 2016. A <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> is a method of identifying and quantifying the overall sentiment of a particular set of text. Sentiment analysis has many use cases, but a common one is to determine how positive or negative a particular text document is, which is what I'll be doing here. For this, I'll be using <a href="https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">bing sentiment analysis</a>, developed by <a href="https://www.cs.uic.edu/~liub/">Bing Liu</a> of the University of Illinois at Chicago. For this type of sentiment analysis, you first split a text document into a set of distinct words, then determine for each word whether it is positive, negative, or neutral. </p>
<p>In the graph below, I show something called the 'Net Sentiment Ratio' for each of Warren Buffett's letters, beginning in 1977 and ending with 2016. The net sentiment ratio measures how positive or negative a particular text is. I'm defining the net sentiment ratio as: </p>
<p>(Number of Positive Words - Number of Negative Words) / (Number of Total Words)</p>
<p><img src="/figures/berkshire_hathaway_sentiment/plotting-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
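<p>As a rough illustration of how a ratio like this can be computed in R, here's a minimal sketch using the <code>tidytext</code> package and the bing lexicon. The variable names and the toy text are my own invention for the example, not taken from the actual analysis:</p>
<div class="highlight"><pre>
library(dplyr)
library(tidytext)

# Toy stand-in for the text of one letter
letter <- tibble(year = 2016,
                 text = "We had an excellent year, despite some difficult losses.")

letter %>%
  unnest_tokens(word, text) %>%                       # one row per word
  left_join(get_sentiments("bing"), by = "word") %>%  # tag positive/negative words
  summarize(net_sentiment =
    (sum(sentiment == "positive", na.rm = TRUE) -
     sum(sentiment == "negative", na.rm = TRUE)) / n())
</pre></div>
<p>Applied to a full letter, the same pipeline yields one net sentiment value per year, which is what the graph above plots.</p>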
<p>The results here show that overall, Warren Buffett's letters have been positive. Over the forty years of letters I'm analyzing here, only 5 show a negative net sentiment score. The five years that do show negative net sentiment scores are closely tied with major negative economic events:</p>
<ul>
<li><strong>1987</strong>: The market crash that happened on October 19th, 1987 (Black Monday) is widely known as the largest single-day percentage decline ever experienced for the Dow-Jones Industrial Average, 22.61% in one day.</li>
<li><strong>1990</strong>: The recession of 1990, triggered by an oil price shock following Iraq's invasion of Kuwait, resulted in a notable increase in unemployment.</li>
<li><strong>2001</strong>: Following the 1990s, which represented the longest period of growth in American history, 2001 saw the collapse of the dot-com bubble and associated declining market values, as well as the September 11th attacks.</li>
<li><strong>2002</strong>: The market, already falling in 2001, continued to see declines throughout much of 2002.</li>
<li><strong>2008</strong>: The Great Recession was a large worldwide economic recession, characterized by the International Monetary Fund as the worst global recession since before World War II. Other related events during this period included the financial crisis of 2007-2008 and the subprime mortgage crisis of 2007-2009.</li>
</ul>
<p>Another interesting topic to examine is which words were actually the strongest contributors to the positive and negative sentiment in the letters. For this exercise, I analyzed the letters as one single text, and present the most common positive and negative words in the graph below.</p>
<p><img src="/figures/berkshire_hathaway_sentiment/sentiment_list-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>The results here are interesting. Many of the most common words--'gain', 'gains', 'loss', 'losses', 'worth', 'liability', and 'debt'--are what we'd expect given the financial nature of these documents. I find the adjectives that make their way into this set particularly interesting, as they give insight into the way Warren Buffett thinks. On the positive side we have 'significant', 'outstanding', 'excellent', 'extraordinary', and 'competitive'. On the negative side there are 'negative', 'unusual', 'difficult', and 'bad'. One interesting inclusion that shows some of the limitations of sentiment analysis is 'casualty', where Mr. Buffett is not referring to death, but to the basket of property and casualty insurance companies that make up a significant portion of his business holdings. </p>
<p>While the above is interesting, and helps us to highlight the most frequent positive and negative words, it's a bit limited in the number of words we can present before the graph becomes too crowded. To see a larger number of words, we can use a word cloud. The word cloud below shows 400 of the most commonly used words, split by positive and negative sentiment. </p>
<p><img src="/figures/berkshire_hathaway_sentiment/wordcloud-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
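<p>If you'd like to produce a similar graphic yourself, a comparison cloud can be built with the <code>wordcloud</code> package. This is just a sketch: the word counts below are invented placeholders, and the column order (and therefore the color assignment) follows the order in which the sentiments first appear in the data:</p>
<div class="highlight"><pre>
library(dplyr)
library(tidyr)
library(wordcloud)

# Invented example counts; in the real analysis these come from the letters
word_counts <- tibble(
  word      = c("gain", "outstanding", "loss", "difficult"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(120, 45, 90, 30)
)

word_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  tibble::column_to_rownames("word") %>%   # comparison.cloud wants terms as rownames
  as.matrix() %>%
  comparison.cloud(max.words = 400,        # colors match column order: positive, negative
                   colors = c("darkgreen", "red"))
</pre></div>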
<p>If you're interested in reproducing this blog post or analysis, please check out the <a href="https://github.com/michaeltoth/michaeltoth/blob/master/content/_R/berkshire_hathaway_sentiment.Rmd">R code I used to produce this document</a>.</p>How to Write Pelican Blog Posts using RMarkdown & Knitr2017-01-05T00:00:00-05:00Michael Tothtag:michaeltoth.me,2017-01-05:how-to-write-pelican-blog-posts-using-rmarkdown-knitr.html<p>UPDATE: This is no longer my preferred method for syncing RMarkdown analyses with my blog. Please check out my <a href="https://michaeltoth.me/how-to-write-pelican-blog-posts-using-rmarkdown-knitr-20.html">new post</a>.</p>
<p>In this post I'm going to be talking about how to easily modify your Pelican blog configuration to let you directly publish blog posts using RMarkdown. I'm assuming you already have a Pelican blog set up, so I won't be covering that in today's post. If you're interested but haven't yet set up a blog for yourself, it's quite straightforward! I recommend checking out these links:</p>
<ul>
<li><a href="http://docs.getpelican.com/en/stable/quickstart.html">Official Pelican Guide</a></li>
<li><a href="http://duncanlock.net/blog/2013/05/17/how-i-built-this-website-using-pelican-part-1-setup/">Detailed Pelican Setup by Duncan Lock</a></li>
</ul>
<p>Until now, I've been writing posts on this blog using standard markdown. This means I'd do an analysis in R, produce a series of graphs and results that I would store locally in image files, and put it all together on my own in a markdown document. It's not that bad a process, but it is a bit inefficient, and I wanted to see if there was a better way. Luckily, there's a very easy-to-use Pelican plugin called rmd_reader that will automatically convert any RMarkdown posts you have into Pelican-compliant html documents. In figuring out how to set this up, I drew heavily on these resources:</p>
<ul>
<li><a href="https://github.com/getpelican/pelican-plugins/tree/master/rmd_reader">rmd_reader Plugin on Github</a></li>
<li><a href="https://rjweiss.github.io/articles/2014_08_25/testing-rmarkdown-integration/">rmd_reader Setup Tutorial by Rebecca Weiss</a>
<br><br></li>
</ul>
<h3>Setup Instructions</h3>
<p>First, let's install the RMD Reader extension so that Pelican knows what to do. We'll do this by cloning the pelican-plugins github repository and referencing this in our Pelican configuration file. This has the added benefit of allowing you to easily use other Pelican plugins, should you decide you want to do that.</p>
<p>Execute the following command from the directory where you want to store this repository.<br />
<em>(Run from terminal):</em></p>
<div class="highlight"><pre><span></span>git clone --recursive https://github.com/getpelican/pelican-plugins
</pre></div>
<p>Add the following to your Pelican config file. If you already have these variables defined, simply add the new path and plugin to the end of your existing list.<br />
<em>(Edit pelicanconf.py):</em></p>
<div class="highlight"><pre><span></span><span class="n">PLUGIN_PATHS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'your-path-to/pelican-plugins'</span><span class="p">]</span>
<span class="n">PLUGINS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'rmd_reader'</span><span class="p">]</span>
</pre></div>
<p>Make sure you have the rpy2 python package installed.<br />
<em>(Run from terminal):</em></p>
<div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">rpy2</span>
</pre></div>
<p>Also make sure you have the knitr R package installed.<br />
<em>(Run from R):</em></p>
<div class="highlight"><pre><span></span>install.packages<span class="p">(</span><span class="s">'knitr'</span><span class="p">)</span>
</pre></div>
<p><br></p>
<h3>Additional Setup</h3>
<p>The above is the core setup, but there are a few more tweaks that I recommend you do in order to make your life easier down the road.</p>
<p>Add the following to your Pelican config file. Essentially what we're doing here is giving knitr instructions on how to name & where to store image files to reduce the likelihood of you having conflicts and overwriting files from older blog posts. There are several ways to do this, but this seemed the best solution to me. For further details, check out the <a href="https://github.com/getpelican/pelican-plugins/tree/master/rmd_reader">official rmd_reader documentation</a>.<br />
<em>(Edit pelicanconf.py):</em></p>
<div class="highlight"><pre><span></span><span class="n">STATIC_PATHS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'figure'</span><span class="p">]</span>
<span class="n">RMD_READER_RENAME_PLOT</span> <span class="o">=</span> <span class="s1">'directory'</span>
<span class="n">RMD_READER_KNITR_OPTS_CHUNK</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fig.path'</span><span class="p">:</span> <span class="s1">'figure/'</span><span class="p">}</span>
</pre></div>
<p><br></p>
<h3>Testing & Examples</h3>
<p>Finally, we're ready to test out our new setup. Try this out with your own .Rmd document or use this one, <a href="https://www.github.com/michaeltoth/michaeltoth/blob/master/content/_R/pelican_rmarkdown_setup.Rmd">available on my Github</a>, if you're just looking for a quick test. The steps are relatively simple:</p>
<ol>
<li>Save your .Rmd file into the same content folder where you'd put any other .md file for your Pelican blog</li>
<li>Run your Pelican blog like you would normally. </li>
</ol>
<p>That's it. rmd_reader will automatically execute your .Rmd file, produce the relevant graphics, and set up the html for your blog just like base Pelican would.</p>
<p>Just to confirm everything is working correctly, let's do some basic operations on the iris dataset.</p>
<p>First let's see a simple summary of the data:</p>
<div class="highlight"><pre><span></span><span class="kp">summary</span><span class="p">(</span>iris<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
</pre></div>
<p>Let's finish with a simple k-means cluster analysis:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>broom<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
iris_sub <span class="o"><-</span> select<span class="p">(</span>iris<span class="p">,</span> x1 <span class="o">=</span> Petal.Length<span class="p">,</span> x2 <span class="o">=</span> Petal.Width<span class="p">)</span>
kclusts <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>k<span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span> <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>kclust<span class="o">=</span>kmeans<span class="p">(</span>iris_sub<span class="p">,</span> <span class="m">.</span><span class="o">$</span>k<span class="p">))</span>
clusters <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>tidy<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]]))</span>
assignments <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>augment<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]],</span> iris_sub<span class="p">))</span>
clusterings <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>glance<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]]))</span>
ggplot<span class="p">(</span>assignments<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> x1<span class="p">,</span> y <span class="o">=</span> x2<span class="p">))</span> <span class="o">+</span>
facet_wrap<span class="p">(</span><span class="o">~</span> k<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color<span class="o">=</span><span class="m">.</span>cluster<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>data<span class="o">=</span>clusters<span class="p">,</span> size<span class="o">=</span><span class="m">10</span><span class="p">,</span> shape<span class="o">=</span><span class="s">"x"</span><span class="p">)</span>
</pre></div>
<p><img alt="center" src="/figures/pelican_rmarkdown_setup/iris-plot-1.png" /></p>
<h3>Closing Remarks</h3>
<p>That's it! I've been meaning to get this set up for a while, and I'm pretty excited about it. Since most of my blog posts are R analyses, this is going to really simplify my workflow, which should make it much easier for me to actually finalize and post my results, something I've had issues with before. I'm also glad I'll be able to make greater use of R Markdown/Knitr, which will help me to organize my thoughts while analyzing as well as create reproducible research documents to share. I hope you find this useful as well!</p>Popularity of Baby Names Since 18802016-11-20T12:00:00-05:00Michael Tothtag:michaeltoth.me,2016-11-20:popularity-of-baby-names-since-1880.html<p><a href="https://michaeltoth.me/installing-and-running-shiny-server-from-source-on-32-bit-ubuntu.html">A while back</a> I spent some time figuring out how to serve interactive shiny apps through my website, but I haven't had a chance to build anything until recently. I set out to create a few simple shiny apps in R that I could use as a sort of test run, and I'm writing those up here.</p>
<p>In this post I'm going to be analyzing some open data provided by the Social Security Administration on the popularity of baby names over the years--specifically, since 1880. Data comes from the <a href="https://www.ssa.gov/oact/babynames/limits.html">Social Security Administration</a> </p>
<p>In the application below, you can view the 10 most popular baby names for any year since 1880, either for males or females. You can also click the play icon directly below the year slider to view an animated history of the most common names.
<br>
<br></p>
<iframe src="http://www.michaeltoth.me/shiny/census_names/top10/" style="border: none; width: 100%; height: 400px"></iframe>
<p>In the next application, you can enter any name, and the graph will display how the popularity of that name has changed over time. Be sure to also select whether the name is for males or females, or you'll likely see some unexpected results!
<br>
<br></p>
<iframe src="http://www.michaeltoth.me/shiny/census_names/tracer/" style="border: none; width: 100%; height: 400px"></iframe>
<p>After building the shiny applications above, I got interested in whether I could identify any meaningful trends over time in the data. I wanted to see whether the concentration (the proportion of babies born with a given name) of the most popular names was relatively static over time, or whether this fluctuated. I was also interested in finding trends in the number of babies born with each of the most popular names. To investigate these, I used a subset of the original data, grabbing the 10 most common male and female names for each year since 1880. I went through several iterations of how best to display the data, and ultimately arrived at the graph below, which I quite like. </p>
<p>I was excited that this project gave me an opportunity to make use of David Robinson's <a href="https://github.com/dgrtwo/gganimate">gganimate package</a>, which I must regrettably admit I hadn't had a chance to experiment with previously. For those unfamiliar, this package makes it incredibly easy to create animated ggplot graphs, and it's awesome!</p>
<p>I wanted to create some kind of trailing visualization to make it clear how patterns and trends were changing over time. The implementation here was adapted from Thomas Pedersen's <a href="https://gist.github.com/thomasp85/c8e22be4628e4420d4f66bcc6c88ac87">example</a> which he used to produce <a href="https://twitter.com/thomasp85/status/694905779539812352">this image</a>. </p>
<p>In the graph below we can use the trailing effect to easily identify trends that occur over a series of years. The grey background data also helps us to visualize how any given year compares with the overall history. I see 5 key periods present themselves in the data:</p>
<ul>
<li><strong>Early Years (1880 - 1910)</strong>: This period is characterized by a low number of babies born (due to the much lower population at this time) as well as a high concentration of the most popular names, in some cases reaching almost 10%. This means that the most common names were being used by a very high percentage of the population during this period. Toward the end of this time period, we begin to see declines in the concentration statistic.</li>
<li><strong>World War I Years (1910 - 1920)</strong>: In this period we see an explosion in the number of babies born with the most popular names. We don't see much change in the concentration of names over this period.</li>
<li><strong>Interwar Period (1921 - 1940)</strong>: In this period both the number and the concentration of births are remarkably consistent, with almost no changes on a year-over-year basis.</li>
<li><strong>WWII & Baby Boom (1941 - 1957)</strong>: In this period we again see a huge increase in the number of babies born, corresponding to the well-known baby boom that occurred in the post-WWII years. We actually see this increase beginning during the war, in 1941. We also see a slight decrease in the concentration of the most popular baby names during this period.</li>
<li><strong>"Modern" Era (1958 - 2015)</strong>: This period is characterized by a steady decrease in both the concentration of the most popular names and in the number of babies born with those names. Though overall birth rates did begin to decline in recent years, this change is not driven primarily by a decline in birth rates, but rather by an increasing evenness in the popularity of names, which means that the most popular names make up a much smaller percentage of overall births.
<img alt="Yearly Birth Names with Ten Year Trails" src="./images/yearly-birth-names-with-trails.gif" /></li>
</ul>
<p>The code for this image is available below:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>gganimate<span class="p">)</span> <span class="c1">#devtools::install_github("dgrtwo/gganimate")</span>
<span class="kn">library</span><span class="p">(</span>readr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>scales<span class="p">)</span>
<span class="c1"># Load pre-processed data. For additional details check Github below</span>
top_10_each_year <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'input/top_10_each_year.csv'</span><span class="p">)</span>
<span class="c1"># Create fading animation effect by replicating the data frame and adding an exponentially decaying fade parameter to previous years</span>
anim <span class="o"><-</span> <span class="kp">lapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span>i<span class="p">)</span> <span class="p">{</span>top_10_each_year<span class="o">$</span>year <span class="o"><-</span> top_10_each_year<span class="o">$</span>year <span class="o">+</span> i<span class="p">;</span> top_10_each_year<span class="o">$</span>fade <span class="o"><-</span> <span class="m">1</span> <span class="o">/</span> <span class="p">(</span>i <span class="o">+</span> <span class="m">2</span><span class="p">);</span> top_10_each_year<span class="p">})</span>
top_10_each_year<span class="o">$</span>fade <span class="o"><-</span> <span class="m">1</span>
top_10_with_fade <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>top_10_each_year<span class="p">,</span> <span class="kp">do.call</span><span class="p">(</span><span class="kp">rbind</span><span class="p">,</span> anim<span class="p">))</span>
top_10_with_fade <span class="o"><-</span> filter<span class="p">(</span>top_10_with_fade<span class="p">,</span> year <span class="o"><=</span> <span class="m">2015</span><span class="p">)</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>top_10_with_fade<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> proportion<span class="p">,</span> y <span class="o">=</span> count<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>color <span class="o">=</span> <span class="s">'#e6e6e6'</span><span class="p">,</span> size <span class="o">=</span> <span class="m">4</span><span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> sex<span class="p">,</span> frame <span class="o">=</span> year<span class="p">,</span> alpha <span class="o">=</span> fade<span class="p">),</span> size <span class="o">=</span> <span class="m">4</span><span class="p">)</span> <span class="o">+</span>
ggtitle<span class="p">(</span><span class="s">'Top 10 Male & Female Baby Names\nYear:'</span><span class="p">)</span> <span class="o">+</span>
xlab<span class="p">(</span><span class="s">'\nProportion (by sex) Born with Name'</span><span class="p">)</span> <span class="o">+</span>
ylab<span class="p">(</span><span class="s">'Number Born with Name'</span><span class="p">)</span> <span class="o">+</span>
scale_color_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#ff7f00'</span><span class="p">,</span> <span class="s">'#377eb8'</span><span class="p">),</span> labels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Female'</span><span class="p">,</span> <span class="s">'Male'</span><span class="p">))</span> <span class="o">+</span>
scale_x_continuous<span class="p">(</span>labels <span class="o">=</span> percent<span class="p">)</span> <span class="o">+</span>
scale_y_continuous<span class="p">(</span>labels <span class="o">=</span> comma<span class="p">)</span> <span class="o">+</span>
scale_alpha<span class="p">(</span>guide <span class="o">=</span> <span class="s">'none'</span><span class="p">)</span> <span class="o">+</span> <span class="c1"># Remove alpha legend from plot output</span>
theme_bw<span class="p">()</span> <span class="o">+</span>
theme<span class="p">(</span>panel.border <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid <span class="o">=</span> element_blank<span class="p">(),</span>
axis.ticks <span class="o">=</span> element_blank<span class="p">(),</span>
legend.key <span class="o">=</span> element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="s">'bottom'</span><span class="p">,</span>
axis.text <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">14</span><span class="p">),</span>
axis.title <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">16</span><span class="p">),</span>
legend.text <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">12</span><span class="p">))</span>
gg_animate<span class="p">(</span>p<span class="p">,</span> filename <span class="o">=</span> <span class="s">'yearly-birth-names-with-trails.gif'</span><span class="p">,</span> interval <span class="o">=</span> <span class="m">0.2</span><span class="p">,</span> ani.width <span class="o">=</span> <span class="m">800</span><span class="p">,</span> ani.height <span class="o">=</span> <span class="m">600</span><span class="p">)</span>
</pre></div>
<p>For the full code behind the shiny applications and the animation produced above, check out my <a href="https://github.com/michaeltoth/shiny-projects/tree/master/census_names">Github</a></p>Installing and Running Shiny Server from Source on 32-bit Ubuntu2016-03-20T20:00:00-04:00Michael Tothtag:michaeltoth.me,2016-03-20:installing-and-running-shiny-server-from-source-on-32-bit-ubuntu.html<p>I recently migrated this site from a shared web hosting service to <a href="https://m.do.co/c/e38e89eb35d9">DigitalOcean</a> because I was interested in learning about how to host my own site and how web servers work. I also wanted to play with and host shiny applications on my own site, rather than relying on a third-party service provider. </p>
<p>In this post I'll talk about the steps I followed to get my shiny server running on a 32-bit Ubuntu DigitalOcean cloud instance. This article assumes that you have:</p>
<ul>
<li>A 32-bit Ubuntu OS with space available. For this I was using a DigitalOcean instance, but this should work just as well on a local computer or another provider's service</li>
<li>A decent understanding of the Linux command line</li>
<li>(Optional) A running nginx server that is serving your current site (where you want to deploy your shiny apps) </li>
</ul>
<h2>Installing Required Dependencies</h2>
<p>Before proceeding with the installation, make sure that you have all of the required dependencies installed on your machine. The following software is required; install anything missing with sudo apt-get install. If you run into issues later, double-check that you've installed these (with the specified versions) properly:</p>
<ul>
<li>python 2.6 or 2.7</li>
<li>cmake >= 2.8.10</li>
<li>gcc</li>
<li>g++</li>
<li>git </li>
</ul>
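<p>On Ubuntu, installing the missing pieces might look something like the following. Treat these package names as a starting point; the python package name in particular varies by release:</p>
<div class="highlight"><pre>sudo apt-get update
sudo apt-get install python2.7 cmake gcc g++ git

# Confirm the installed cmake meets the 2.8.10 minimum
cmake --version
</pre></div>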
<h2>R Installations</h2>
<p>Assuming you don't have R installed, you'll need to do that. We'll need both r-base and r-base-dev installed for some shiny apps to run properly, so let's do that:</p>
<div class="highlight"><pre><span></span>sudo apt-get update
sudo apt-get install r-base r-base-dev
</pre></div>
<p>Next we'll install the shiny R package. While we could start up an R session and simply run install.packages('shiny') from that instance, that would install the shiny package only for one user. Instead, we'll run the command below which will install the shiny package for all users on the machine:</p>
<div class="highlight"><pre><span></span>sudo su - -c <span class="s2">"R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""</span>
</pre></div>
<p>When I ran the above command, the packages downloaded but did not install. After some investigation, I realized this was because my instance was running out of memory, so I added some swap space to the instance and ran the command again, which completed successfully. You may or may not have this issue, but if you do, the command below gives you 1GB of swap space, which should allow your installation to complete.</p>
<div class="highlight"><pre><span></span>sudo /bin/dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>/var/swap.1 <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>1024
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1
</pre></div>
<h2>Shiny Server 32-bit Installation from Source</h2>
<p>RStudio provides a convenient .deb install with pre-compiled binaries for 64-bit architecture, but if you're running 32-bit architecture you'll need to build shiny server from source as described below. This involves manually compiling the required binary files and making some changes to set up directories and config files. This can be a bit complicated (luckily not <em>too</em> complicated), so if you do have 64-bit architecture/OS you can follow the <a href="https://www.rstudio.com/products/shiny/download-server/">simpler instructions from RStudio</a> to install directly. Otherwise, follow along with my instructions below.</p>
<p>cd into the directory where you'd like the shiny-server repository. I'll be installing mine to ~/dev, but any location will work. Note that this location will be temporary, so your decision here is not too important. </p>
<div class="highlight"><pre><span></span><span class="nb">cd</span> ~/dev
<span class="c1"># Clone the repository from GitHub and cd into the new directory</span>
git clone https://github.com/rstudio/shiny-server.git
<span class="nb">cd</span> shiny-server
<span class="c1"># Add the bin directory to the path so we can reference</span>
<span class="nv">DIR</span><span class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>
<span class="nv">PATH</span><span class="o">=</span><span class="nv">$DIR</span>/bin:<span class="nv">$PATH</span>
<span class="nv">PYTHON</span><span class="o">=</span><span class="sb">`</span>which python<span class="sb">`</span>
<span class="c1"># If Python version is not 2.6.x or 2.7.x, you'll need to modify to </span>
<span class="c1"># reference one of these versions (e.g. which python26). This may</span>
<span class="c1"># or may not require you to install a new Python version. For more</span>
<span class="c1"># details, review the "Python" section of the RStudio documentation: </span>
<span class="c1"># https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source</span>
<span class="nv">$PYTHON</span> --version
<span class="c1"># Use cmake to prepare the make step. Modify "-DCMAKE_INSTALL_PREFIX"</span>
<span class="c1"># if you wish to install the software at a different location.</span>
mkdir tmp<span class="p">;</span> <span class="nb">cd</span> tmp
cmake -DCMAKE_INSTALL_PREFIX<span class="o">=</span>/usr/local -DPYTHON<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> ../
<span class="c1"># Recompile the npm modules included in the project</span>
make
mkdir ../build
<span class="o">(</span><span class="nb">cd</span> .. <span class="o">&&</span> ./bin/npm --python<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> rebuild<span class="o">)</span>
<span class="c1"># Need to rebuild our gyp bindings since 'npm rebuild' won't run gyp for us.</span>
<span class="o">(</span><span class="nb">cd</span> .. <span class="o">&&</span> ./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> rebuild<span class="o">)</span>
<span class="c1"># Install the software at the predefined location</span>
sudo make install
</pre></div>
<h2>Configuration</h2>
<p>Now we've successfully installed Shiny Server in the location we defined above. You can safely delete the shiny-server git repo if you would like. There are a few configuration steps we need to complete before we can use shiny-server, which we'll cover next.</p>
<div class="highlight"><pre><span></span><span class="c1"># Place a shortcut to the shiny-server executable in /usr/bin. </span>
<span class="c1"># As /usr/bin should already be in your PATH variable, you won't need</span>
<span class="c1"># to permanently modify your PATH to reflect the change we made above</span>
sudo ln -s /usr/local/shiny-server/bin/shiny-server /usr/bin/shiny-server
<span class="c1">#Create shiny user on your system. On some systems, you may need to specify the full path to 'useradd'</span>
sudo useradd -r -m shiny
<span class="c1"># Create log, config, and application directories for shiny</span>
sudo mkdir -p /var/log/shiny-server
sudo mkdir -p /srv/shiny-server
sudo mkdir -p /var/lib/shiny-server
sudo chown shiny /var/log/shiny-server
sudo mkdir -p /etc/shiny-server
</pre></div>
<p>Shiny server will look for resources in certain file paths. Certain log directories and application directories can be modified in a configuration file stored at /etc/shiny-server/shiny-server.conf. By default, there will be no file at that location. The RStudio instructions claim that the default configuration will be used if no file exists, but that was not the case for me and I received an error message when trying to run. If the same happens for you, simply copy the default configuration into /etc/shiny-server/shiny-server.conf and you should be all set.</p>
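<p>If you hit the same error, the copy might look like the following. The source path is my assumption based on the /usr/local install prefix used above; adjust it to wherever your default.config ended up:</p>
<div class="highlight"><pre>sudo cp /usr/local/shiny-server/config/default.config /etc/shiny-server/shiny-server.conf
</pre></div>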
<p>Get the RStudio Upstart script, which will allow you to run shiny in the background just as you would run nginx or another server. This will let you run shiny automatically when you boot your system, or run it continuously on Digital Ocean:</p>
<div class="highlight"><pre><span></span>sudo wget https://raw.github.com/rstudio/shiny-server/master/config/upstart/shiny-server.conf<span class="se">\</span>
-O /etc/init/shiny-server.conf
</pre></div>
<p>Place any shiny app directories in /srv/shiny-server/your_app, and Shiny Server will serve them at http://ip_addr:3838/your_app. You can start and stop Shiny Server with the Upstart commands below:</p>
<div class="highlight"><pre><span></span>sudo start shiny-server
sudo stop shiny-server
</pre></div>
<h2>(Optional) Serving to a custom domain with clean URLs (no :3838 links)</h2>
<p>Now we have shiny installed and configured properly. However, you'll still need to set it up to serve apps from your actual website address. I updated my nginx configuration (/etc/nginx/sites-enabled/default or /etc/nginx/sites-enabled/<yoursite>) to add a block for shiny. This sets up a reverse proxy, which allows you to host apps at yoursite/shiny and avoids exposing port 3838 directly.</p>
<div class="highlight"><pre><span></span>location /shiny/ <span class="o">{</span>
proxy_pass http://127.0.0.1:3838/<span class="p">;</span>
proxy_http_version 1.1<span class="p">;</span>
proxy_set_header Upgrade <span class="nv">$http_upgrade</span><span class="p">;</span>
proxy_set_header Connection <span class="s2">"upgrade"</span><span class="p">;</span>
<span class="o">}</span>
</pre></div>
<p>I drew heavily on the following sources in navigating this process: </p>
<ul>
<li><a href="https://www.rstudio.com/products/shiny/download-server/">Installation</a> </li>
<li><a href="http://docs.rstudio.com/shiny-server/">Shiny Administrator's Guide</a> </li>
<li><a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-shiny-server-on-ubuntu-14-04">Digital Ocean Tutorial</a> </li>
<li><a href="https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source">Building from Source</a> </li>
<li><a href="http://deanattali.com/2015/05/09/setup-rstudio-shiny-server-digital-ocean/">Blog for Digital Ocean Setup</a> </li>
</ul>How to write an R Data Frame to an SQL Table2015-07-08T20:45:00-04:00Michael Tothtag:michaeltoth.me,2015-07-08:how-to-write-an-r-data-frame-to-an-sql-table.html<p>Frequently I find I need to perform an analysis that requires querying some values in a database table based on a series of IDs that might vary depending on some input. As an example, assume we have the following:</p>
<ul>
<li>A table that contains historical stock prices for 2000 stocks for the last 30 years</li>
<li>Some input that contains a user's portfolio of stock tickers </li>
</ul>
<p>Often, we'll want to pull the price history over a certain date range for all stocks in the user's portfolio. We could of course query all values in the stock prices table and then subset, but this is incredibly inefficient and also means we can't make use of any SQL aggregation functions in our query. Something I've done before when working in an SQL IDE is to create a temp table where I insert a list of the IDs that I am trying to look up, and then join on that table for my query. This is an ideal solution when we're talking about looking up more than a few securities. It took me a while to find an easy way to do this directly in R, but it turns out to be quite simple. I'm hoping my solution helps anybody else who might have this same issue.<br />
<br></p>
<h4>Assumptions:</h4>
<ul>
<li>A table called <em>stock_prices</em> that contains stock price history</li>
<li>A data frame called <em>tickers</em> that contains a list of stock tickers (column name = ticker)</li>
<li>Here I am using PostgreSQL, but this should work essentially the same for any SQL variant<br />
<br></li>
</ul>
<h4>Code</h4>
<p>Start by setting up the connection:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>RPostgreSQL<span class="p">)</span>
drv <span class="o"><-</span> dbDriver<span class="p">(</span><span class="s">"PostgreSQL"</span><span class="p">)</span>
con <span class="o"><-</span> dbConnect<span class="p">(</span>drv<span class="p">,</span> user<span class="o">=</span><span class="s">'user'</span><span class="p">,</span> password<span class="o">=</span><span class="s">'password'</span><span class="p">,</span> dbname<span class="o">=</span><span class="s">'my_database'</span><span class="p">,</span> host<span class="o">=</span><span class="s">'host'</span><span class="p">)</span>
</pre></div>
<p><br></p>
<p>Next create the temp table and insert values from our data frame. The key here is the dbWriteTable function, which allows us to write an R data frame directly to a database table. The data frame's column names will be used as the database table's fields.</p>
<div class="highlight"><pre><span></span><span class="c1"># Drop table if it already exists</span>
<span class="kr">if</span> <span class="p">(</span>dbExistsTable<span class="p">(</span>con<span class="p">,</span> <span class="s">"temp_tickers"</span><span class="p">))</span>
dbRemoveTable<span class="p">(</span>con<span class="p">,</span> <span class="s">"temp_tickers"</span><span class="p">)</span>
<span class="c1"># Write the data frame to the database</span>
dbWriteTable<span class="p">(</span>con<span class="p">,</span> name <span class="o">=</span> <span class="s">"temp_tickers"</span><span class="p">,</span> value <span class="o">=</span> tickers<span class="p">,</span> row.names <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</pre></div>
<p><br></p>
<p>Finally, join the stock prices table to the table we just created and query the subsetted values:</p>
<div class="highlight"><pre><span></span>sql <span class="o"><-</span> <span class="s">" </span>
<span class="s"> select sp.ticker, sp.date, sp.price</span>
<span class="s"> from stock_prices sp</span>
<span class="s"> join temp_tickers tt on sp.ticker = tt.ticker</span>
<span class="s"> where date between '2000-01-01' and '2015-07-08'</span>
<span class="s">"</span>
results <span class="o"><-</span> dbGetQuery<span class="p">(</span>con<span class="p">,</span> sql<span class="p">)</span>
<span class="c1"># Free up resources</span>
dbDisconnect<span class="p">(</span>con<span class="p">)</span>
dbUnloadDriver<span class="p">(</span>drv<span class="p">)</span>
</pre></div>
<p><br></p>
<p>And that's it. It turned out to not be very complicated, and many may already know this, but it took me a while to figure out how this should be done. I spent a lot of time messing around with INSERT statements before scrapping that idea and coming up with this solution. Let me know if you find this helpful or if you have any ideas on how to do this better!</p>Analyzing Historical Default Rates of Lending Club Notes2015-03-09T21:28:00-04:00Michael Tothtag:michaeltoth.me,2015-03-09:analyzing-historical-default-rates-of-lending-club-notes.html<p>In case you're unfamiliar, Lending Club is the world's largest peer-to-peer lending company, offering a platform for borrowers and lenders to work directly with one another, eliminating the need for a financial intermediary like a bank. Removing the middle-man generally allows both borrowers and lenders to benefit from better interest rates than they otherwise would, which makes peer-to-peer lending an attractive proposition. This post will be the first in a series of posts analyzing the probability of default and expected return of Lending Club notes. In this first post, I'll cover some of the background on Lending Club, talk about getting and cleaning the loan data, and perform some exploratory analysis on the available variables and outcomes. In subsequent posts, I'll work on developing a predictive model for determining the loan default probabilities. <em>Before investing, it is always important to fully understand the risks, and this post does not constitute investment advice in either Lending Club or in Lending Club notes.</em> </p>
<h2>Background and Gathering Data</h2>
<p>Lending Club makes all past borrower data freely available <a href="https://www.lendingclub.com/info/download-data.action">on their website</a> for review, and I will be referencing the 2012-2013 data throughout this post. </p>
<p>To download the 2012-2013 data from Lending Club: </p>
<div class="highlight"><pre><span></span><span class="c1"># Download and extract data from Lending Club</span>
<span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">file.exists</span><span class="p">(</span><span class="s">"LoanStats3b.csv"</span><span class="p">))</span> <span class="p">{</span>
fileUrl <span class="o"><-</span> <span class="s">"https://resources.lendingclub.com/LoanStats3b.csv.zip"</span>
download.file<span class="p">(</span>fileUrl<span class="p">,</span> destfile <span class="o">=</span> <span class="s">"LoanStats3b.csv.zip"</span><span class="p">,</span> method<span class="o">=</span><span class="s">"curl"</span><span class="p">)</span>
dateDownloaded <span class="o"><-</span> <span class="kp">date</span><span class="p">()</span>
unzip<span class="p">(</span><span class="s">"LoanStats3b.csv.zip"</span><span class="p">)</span>
<span class="p">}</span>
<span class="c1"># Read in Lending Club Data</span>
<span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">exists</span><span class="p">(</span><span class="s">"full_dataset"</span><span class="p">))</span> <span class="p">{</span>
full_dataset <span class="o"><-</span> read.csv<span class="p">(</span>file<span class="o">=</span><span class="s">"LoanStats3b.csv"</span><span class="p">,</span> header<span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span> skip <span class="o">=</span> <span class="m">1</span><span class="p">)</span>
<span class="p">}</span>
</pre></div>
<p>For each loan in the file, Lending Club provides an indication of the current loan status. Because many of the loan statuses represent similar outcomes, I've mapped them from Lending Club's 7 down to only 2, simplifying the problem of classifying loan outcomes without much loss of information useful for investment decisions. My two outcomes "Performing" and "NonPerforming" seek to separate those loans likely to pay in full from those likely to default. Below I include a table summarizing the mappings: </p>
<p><br>
<img alt="Loan Statuses" src="https://michaeltoth.me/images/loan-statuses.jpg" />
<br> </p>
<p>Now that we've loaded the data, let's extract the fields we need and do some cleaning. We can eliminate any fields that would not have been known at the time of issuance, as we'll be trying to make decisions on loan investments using available pre-issuance data. We can also eliminate a few indicative data fields that are repetitive or too granular to be analyzed, and make some formatting changes to get the data ready for analysis. Finally, we'll map the loan statuses to the binary "Performing" and "NonPerforming" classifiers as discussed above. </p>
<div class="highlight"><pre><span></span><span class="c1"># Select variables to keep and subset the data</span>
variables <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span> <span class="s">"loan_amnt"</span><span class="p">,</span> <span class="s">"term"</span><span class="p">,</span> <span class="s">"int_rate"</span><span class="p">,</span> <span class="s">"installment"</span><span class="p">,</span> <span class="s">"grade"</span><span class="p">,</span>
<span class="s">"sub_grade"</span><span class="p">,</span> <span class="s">"emp_length"</span><span class="p">,</span> <span class="s">"home_ownership"</span><span class="p">,</span> <span class="s">"annual_inc"</span><span class="p">,</span>
<span class="s">"is_inc_v"</span><span class="p">,</span> <span class="s">"loan_status"</span><span class="p">,</span> <span class="s">"purpose"</span><span class="p">,</span> <span class="s">"addr_state"</span><span class="p">,</span> <span class="s">"dti"</span><span class="p">,</span>
<span class="s">"delinq_2yrs"</span><span class="p">,</span> <span class="s">"earliest_cr_line"</span><span class="p">,</span> <span class="s">"inq_last_6mths"</span><span class="p">,</span>
<span class="s">"mths_since_last_delinq"</span><span class="p">,</span> <span class="s">"mths_since_last_record"</span><span class="p">,</span> <span class="s">"open_acc"</span><span class="p">,</span>
<span class="s">"pub_rec"</span><span class="p">,</span> <span class="s">"revol_bal"</span><span class="p">,</span> <span class="s">"revol_util"</span><span class="p">,</span> <span class="s">"total_acc"</span><span class="p">,</span>
<span class="s">"initial_list_status"</span><span class="p">,</span> <span class="s">"collections_12_mths_ex_med"</span><span class="p">,</span>
<span class="s">"mths_since_last_major_derog"</span><span class="p">)</span>
train <span class="o"><-</span> full_dataset<span class="p">[</span>variables<span class="p">]</span>
<span class="c1"># Reduce loan status to binary "Performing" and "NonPerforming" Measures:</span>
train<span class="o">$</span>new_status <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span><span class="kp">ifelse</span><span class="p">(</span>train<span class="o">$</span>loan_status <span class="o">%in%</span> <span class="kt">c</span><span class="p">(</span><span class="s">"Current"</span><span class="p">,</span> <span class="s">"Fully Paid"</span><span class="p">),</span>
<span class="s">"Performing"</span><span class="p">,</span> <span class="s">"NonPerforming"</span><span class="p">))</span>
<span class="c1"># Convert a subset of the numeric variables to factors</span>
train<span class="o">$</span>delinq_2yrs <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>delinq_2yrs<span class="p">)</span>
train<span class="o">$</span>inq_last_6mths <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>inq_last_6mths<span class="p">)</span>
train<span class="o">$</span>open_acc <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>open_acc<span class="p">)</span>
train<span class="o">$</span>pub_rec <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>pub_rec<span class="p">)</span>
train<span class="o">$</span>total_acc <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>total_acc<span class="p">)</span>
<span class="c1"># Convert interest rate numbers to numeric (strip percent signs)</span>
train<span class="o">$</span>int_rate <span class="o"><-</span> <span class="kp">as.numeric</span><span class="p">(</span><span class="kp">sub</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> train<span class="o">$</span>int_rate<span class="p">))</span>
train<span class="o">$</span>revol_util <span class="o"><-</span> <span class="kp">as.numeric</span><span class="p">(</span><span class="kp">sub</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> train<span class="o">$</span>revol_util<span class="p">))</span>
</pre></div>
<h2>Analyzing Predictive Power of Variables</h2>
<p><br></p>
<h4>Lending Club Grades and Subgrades</h4>
<p>All types of borrowers are using peer-to-peer lending for a variety of purposes, which raises the question of how to determine appropriate interest rates given the varying levels of risk across borrowers. Luckily, Lending Club handles this for us. They use an algorithm to determine each borrower's level of risk, and then set interest rates according to that risk. Specifically, Lending Club maps borrowers to a series of grades [A-G] and subgrades [A-G][1-5] based on their risk profile. Loans in each subgrade are then given appropriate interest rates. The specific rates change over time according to market conditions, but generally they fall within a tight range for each subgrade. </p>
<p>Let's take a look at the proportions of performing and non-performing loans by Lending Club's provided grades: </p>
<div class="highlight"><pre><span></span>by_grade <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>grade<span class="p">,</span> exclude<span class="o">=</span><span class="s">""</span><span class="p">)</span>
prop_grade <span class="o"><-</span> <span class="kp">prop.table</span><span class="p">(</span>by_grade<span class="p">,</span><span class="m">2</span><span class="p">)</span>
barplot<span class="p">(</span>prop_grade<span class="p">,</span> main <span class="o">=</span> <span class="s">"Loan Performance by Grade"</span><span class="p">,</span> xlab <span class="o">=</span> <span class="s">"Grade"</span><span class="p">,</span>
col<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"darkblue"</span><span class="p">,</span><span class="s">"red"</span><span class="p">),</span> legend <span class="o">=</span> <span class="kp">rownames</span><span class="p">(</span>prop_grade<span class="p">))</span>
by_subgrade <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>sub_grade<span class="p">,</span> exclude<span class="o">=</span><span class="s">""</span><span class="p">)</span>
prop_subgrade <span class="o"><-</span> <span class="kp">prop.table</span><span class="p">(</span>by_subgrade<span class="p">,</span><span class="m">2</span><span class="p">)</span>
barplot<span class="p">(</span>prop_subgrade<span class="p">,</span> main <span class="o">=</span> <span class="s">"Loan Performance by Sub Grade"</span><span class="p">,</span> xlab <span class="o">=</span> <span class="s">"SubGrade"</span><span class="p">,</span>
col<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"darkblue"</span><span class="p">,</span><span class="s">"red"</span><span class="p">),</span>legend <span class="o">=</span> <span class="kp">rownames</span><span class="p">(</span>prop_subgrade<span class="p">))</span>
</pre></div>
<p>We can see from the chart below that rates of default steadily increase as the loan grades worsen from A to G, as expected.</p>
<p><br>
<img alt="Performance by Grade" src="https://michaeltoth.me/images/by-grade.png" />
<br> </p>
<p>We see a similar pattern for the subgrades, although the trend begins to weaken across the G1-G5 subgrades. On further investigation, I found that there are only a few hundred data points for each of these subgrades, in contrast to thousands of data points for the A-F subgrades, so the differences among the G subgrades are not large enough to be statistically significant.</p>
<p><br>
<img alt="Performance by Subgrade" src="https://michaeltoth.me/images/by-subgrade.png" />
<br> </p>
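<p>As a quick sanity check on that claim, a two-sample proportion test on two of the G subgrades illustrates the power problem. The counts below are hypothetical round numbers for illustration only, not values from the dataset:</p>
<div class="highlight"><pre># With only a few hundred loans per G subgrade, even a 5% gap in
# non-performance rates is hard to distinguish from noise.
# Hypothetical counts for illustration:
dflt_g1 &lt;- 90    # non-performing loans in G1
n_g1    &lt;- 300   # total loans in G1 (~30% non-performing)
dflt_g5 &lt;- 105   # non-performing loans in G5
n_g5    &lt;- 300   # total loans in G5 (~35% non-performing)
prop.test(c(dflt_g1, dflt_g5), c(n_g1, n_g5))
# The p-value is well above 0.05, so we cannot reject equal proportions
</pre></div>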
<p>In general, it looks like the Lending Club grading system does a pretty great job of predicting ultimate loan performance, but let's check out some of the other available data to see what other trends we might be able to find in the data.</p>
<p><br></p>
<h4>Home Ownership</h4>
<p>The Lending Club data has 3 main classifications for home ownership: mortgage (outstanding mortgage payments), own (home is owned outright), and rent. I would expect those with mortgages to default less frequently than those who rent, both because there are credit requirements to get a mortgage and because those with mortgages tend, in general, to be more established. Let's see whether this is actually the case: </p>
<div class="highlight"><pre><span></span>ownership_status <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>home_ownership<span class="p">,</span>
exclude<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"OTHER"</span><span class="p">,</span><span class="s">"NONE"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_ownership <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>ownership_status<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
</pre></div>
<p><br>
<img alt="Ownership Status" src="https://michaeltoth.me/images/ownership-status.jpg" />
<br> </p>
<p>So those with mortgages default the least, followed by those who own their homes outright, and finally those who rent. The differences here are much smaller than when comparing grades, but they are still notable. Let's verify whether they are statistically significant: </p>
<div class="highlight"><pre><span></span><span class="c1"># Calculate the counts of mortgage, owners, and renters:</span>
count_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span><span class="p">)</span>
count_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span><span class="p">)</span>
count_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span><span class="p">)</span>
<span class="c1"># Calculate the counts of default for mortgages, owners, and renters:</span>
dflt_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for mortgage vs owners</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_m<span class="p">,</span>dflt_o<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_m<span class="p">,</span>count_o<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for owners vs renters</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_o<span class="p">,</span>dflt_r<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_o<span class="p">,</span>count_r<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
</pre></div>
<p>The p-value of the first test was 6.377 × 10<sup>-12</sup> and the p-value of the second test was 3.787 × 10<sup>-8</sup>, indicating that the differences in both of these proportions are highly statistically significant. Although the differences in the default probabilities are somewhat small, on the order of 1.5%, the number of data points is in the high tens of thousands, which drives the significance. Given this result, we can generally conclude that similar differences in default probabilities for other factors will also be significant, so long as a similar quantity of data points is available. </p>
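<p>To see why the sample size matters so much here, base R's <code>power.prop.test</code> can estimate how many observations per group are needed to reliably detect a difference of this magnitude. The base rates below are illustrative assumptions, not figures from the dataset:</p>
<div class="highlight"><pre># How many observations per group are needed to detect a 1.5% difference
# in default rates (e.g. 10% vs 11.5%) with 80% power at the 5% level?
power.prop.test(p1 = 0.10, p2 = 0.115, sig.level = 0.05, power = 0.80)
# n comes out to several thousand per group, comfortably below the
# tens of thousands of loans available in this dataset
</pre></div>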
<p><strong>Note:</strong> for the remaining analysis, the code for each variable becomes a bit repetitive, so in the interest of minimizing the length of this post I will present only the results. If you are interested to see the actual code, you will find it in the appendix at the bottom of this post. You can also read the <a href="https://github.com/michaeltoth/lending_club/blob/master/LendingClub.R">complete code on Github</a>. </p>
<p><br></p>
<h4>Debt to Income Ratio</h4>
<p>Debt to income ratio is the ratio of a borrower's monthly debt payments to monthly income. This was originally formatted as a continuous numerical variable, but I bucketed it into 5% increments to better visualize the effect on loan performance. As we might expect, there is a steady increase in the percentage of non-performing loans as DTI increases, reflecting the constraints that increased debt puts on a borrower's ability to repay: </p>
<p><br>
<img alt="Debt to Income Ratio" src="https://michaeltoth.me/images/dti.jpg" />
<br> </p>
<p><br></p>
<h4>Revolving Utilization Percent</h4>
<p>Revolving utilization percent is the portion of a borrower's revolving credit limit (i.e. credit card limit) that they are actually using at any given point. For example, if a borrower's total credit limit is $15,000 and their outstanding balance is $1,500, their utilization rate is 10%. We can see below that the percentage of non-performing loans steadily increases with utilization rate. Borrowers with high utilization rates are more likely to have high fixed credit card payments, which might affect their ability to repay their loans. A high utilization rate also often reflects a lack of other financing options, with borrowers turning to peer-to-peer lending as a last resort. This is in contrast to borrowers with low utilization rates, who may be using peer-to-peer lending opportunistically to pursue lower interest payments. </p>
<p><br>
<img alt="Revolving Utilization" src="https://michaeltoth.me/images/utilization.jpg" />
<br> </p>
<p><br></p>
<h4>Loan Purpose</h4>
<p>Loan purpose refers to the borrower's stated reason for taking out the loan. We see below that credit card and debt consolidation loans tend to perform better, along with loans for home improvement, cars, and other major purchases. Loans for luxury spending on vacations and weddings, and for unexpected medical and moving expenses, generally perform worse. Small business loans perform very poorly, perhaps reflecting the fact that borrowers unable to get bank financing for their small businesses may have poor credit or business plans that aren't fully developed. </p>
<p><br>
<img alt="Loan Purpose" src="https://michaeltoth.me/images/loan-purpose.jpg" />
<br> </p>
<p><br></p>
<h4>Inquiries in the Past 6 Months</h4>
<p>Number of inquiries refers to the number of times a borrower's credit report is accessed by financial institutions, which generally happens when the borrower is seeking a loan or credit line. More inquiries lead to higher rates of nonperformance, perhaps because a borrower's increasing desperation to access credit reflects poor financial health. Interestingly, we see an increase in loan performance in the 4+ inquiries bucket. These high levels of inquiries may reflect financially savvy borrowers shopping around for mortgage loans or credit cards. </p>
<p><br>
<img alt="Inquiries" src="https://michaeltoth.me/images/inquiries.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Total Accounts</h4>
<p>A larger number of total accounts indicates a longer credit history and a high level of trust between the borrower and financial institutions, both of which point to financial health and lower rates of default. We see steady increases in the rates of performing loans as the number of accounts increases from 7 to around 20, but diminishing effects after that. </p>
<p><br>
<img alt="Total Accounts" src="https://michaeltoth.me/images/total-accounts.jpg" />
<br> </p>
<p><br></p>
<h4>Annual Income</h4>
<p>As we might expect, the higher a borrower's annual income, the more likely they are to be able to repay their loans. Below I've broken the income data into quintiles, and we can see that those in the top 20% of annual incomes ($95,000+) are approximately 6% more likely to be performing borrowers than those in the bottom 20% (less than $42,000). </p>
<p><br>
<img alt="Annual Income" src="https://michaeltoth.me/images/annual-income.jpg" />
<br> </p>
<p><br></p>
<h4>Loan Amount</h4>
<p>As the amount borrowed increases, we see increasing rates of nonperforming loans. The difference between the first two buckets is only around 1% (and the intra-bucket differences are very small), but we see a larger decrease in loan quality in the $30,000 - $35,000 bucket. Noting that the Lending Club maximum loan is $35,000, this may indicate particularly desperate borrowers who are maximizing their possible borrowing. </p>
<p><br>
<img alt="Loan Amount" src="https://michaeltoth.me/images/loan-amount.jpg" />
<br> </p>
<p><br></p>
<h4>Employment Length</h4>
<p>We'd expect those who have been employed longer to be more stable, and thus less likely to default. Looking into the data, 3 key groups emerged: the unemployed, those employed less than 10 years, and those employed for 10+ years: </p>
<p><br>
<img alt="Employment Length" src="https://michaeltoth.me/images/employment-length.jpg" />
<br> </p>
<p><br></p>
<h4>Delinquencies in the Past 2 Years</h4>
<p>The number of delinquencies in the past 2 years indicates the number of times a borrower has been behind on payments. I combined all values of 3 or larger into a single bucket for analysis, as this was a distribution with a long right tail. Interestingly, those with a single delinquency seem to perform slightly better than those with none. In general, however, the differences among 0, 1, and 2 delinquencies are relatively small, while those with 3 or more show a significant decrease in performance. </p>
<p><br>
<img alt="Delinquencies" src="https://michaeltoth.me/images/delinquencies.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Open Accounts</h4>
<p>Unlike the number of total accounts above, which we saw to be quite significant, the number of open accounts variable was not a particularly strong indicator: </p>
<p><br>
<img alt="Open Accounts" src="https://michaeltoth.me/images/open-accounts.jpg" />
<br> </p>
<p><br></p>
<h4>Verified Income Status</h4>
<p>Lending Club categorizes income verification into three statuses: not verified, source verified, and verified. Verified income means that Lending Club independently verified both the source and size of reported income, source verified means that they verified only the source of the income, and not verified means there was no independent verification of the reported values. Interestingly, we see that as income verification increases, the loan performance actually worsens. During the mortgage crisis, non-verified "no-doc" loans were among the worst performing, so the reversal here is interesting. This likely reflects the fact that Lending Club only verifies those borrowers who seem to be of worse credit quality, so there may be <a href="http://en.wikipedia.org/wiki/Confounding">confounding variables</a> present here.<br />
<br>
<img alt="Verified Income" src="https://michaeltoth.me/images/verified-income.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Public Records</h4>
<p>Public records generally refer to bankruptcies, so we would expect those with more public records to show worse performance. In fact, performance improves as we move from 0 to 1 to 2 public records, possibly indicating stricter lending standards from Lending Club for borrowers with public records: </p>
<p><br>
<img alt="Public Records" src="https://michaeltoth.me/images/public-records.jpg" />
<br> </p>
<p><br></p>
<h4>Variables that were not significant:</h4>
<ul>
<li>Months since last delinquency</li>
<li>Months since last major derogatory note</li>
<li>Collections previous 12 months (too few data points on which to make any conclusions or form predictions)</li>
</ul>
<h2>Summary</h2>
<ul>
<li>Lending Club grade and subgrade variables provide the most predictive power for determining expected loan performance.</li>
<li>A large number of the other variables also provide strong indications of expected performance. Among the most telling are debt-to-income ratio, credit utilization rate, home ownership status, loan purpose, annual income, inquiries in the past 6 months, and number of total accounts.</li>
<li>Verified income status and number of public records show results opposite from what we would expect. This is likely due to increased standards on borrowers with poorer credit history, so all else equal we see outperformance in these loans.</li>
</ul>
<p>We've gotten a good understanding of the available borrower data, and we've seen which variables give the best indications of future loan performance. In the next post, we'll work on developing a predictive model for projecting the probability of default for newly issued loans. </p>
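<p>As a preview, here is a minimal sketch of the kind of model the next post will develop, assuming the <code>train</code> data frame prepared above. The choice of predictors here is illustrative, not final:</p>
<div class="highlight"><pre># Baseline logistic regression of loan performance on a few of the
# stronger predictors identified above. With the factor levels created
# earlier, glm models the probability of the second level, "Performing".
fit &lt;- glm(new_status ~ grade + dti + revol_util + annual_inc,
           data = train, family = binomial)
summary(fit)
# Fitted probabilities of performing for the training loans:
head(predict(fit, type = "response"))
</pre></div>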
<h2>Appendix</h2>
<p>Below I've included the code used to generate the numbers in the tables above. You can also find the <a href="https://github.com/michaeltoth/projects/tree/master/lending-club-analysis">complete code available on GitHub</a>.</p>
<div class="highlight"><pre><span></span><span class="c1">### Explore the relationships between default rates and factor levels</span>
<span class="c1">### I take a few different approaches, but the key idea is the same</span>
<span class="c1"># Home Ownership (exclude status "OTHER" and "NONE" because of few data points)</span>
home_ownership <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>home_ownership<span class="p">,</span>
exclude<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"OTHER"</span><span class="p">,</span><span class="s">"NONE"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_home_ownership <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>home_ownership<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Test for significance of the difference in proportions for home ownership factors</span>
<span class="c1"># Calculate the counts of mortgage, owners, and renters:</span>
count_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span><span class="p">)</span>
count_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span><span class="p">)</span>
count_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span><span class="p">)</span>
<span class="c1"># Calculate the counts of default for mortgages, owners, and renters:</span>
dflt_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for mortgage vs owners</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_m<span class="p">,</span>dflt_o<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_m<span class="p">,</span>count_o<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for owners vs renters</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_o<span class="p">,</span>dflt_r<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_o<span class="p">,</span>count_r<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># Debt to Income Ratio (break into factors at 5% levels)</span>
train<span class="o">$</span>new_dti <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>dti<span class="p">,</span> breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">15</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">25</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">35</span><span class="p">))</span>
dti <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_dti<span class="p">)</span>
prop_dti <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>dti<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Revolving Utilization (break into 0 - 20, then factors of 10, then 80+)</span>
train<span class="o">$</span>new_revol_util <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>revol_util<span class="p">,</span> breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">40</span><span class="p">,</span> <span class="m">50</span><span class="p">,</span> <span class="m">60</span><span class="p">,</span> <span class="m">70</span><span class="p">,</span> <span class="m">80</span><span class="p">,</span> <span class="m">141</span><span class="p">))</span>
revol_util <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_revol_util<span class="p">)</span>
prop_revol_util <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>revol_util<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Loan Purpose (exclude renewable energy because so few data points)</span>
purpose <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>purpose<span class="p">,</span> exclude <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"renewable_energy"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_purpose <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>purpose<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Inquiries in the last 6 months (combine factor levels for any > 4)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>inq_last_6mths<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="s">"2"</span><span class="p">,</span> <span class="s">"3"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"4+"</span><span class="p">,</span> <span class="m">5</span><span class="p">))</span>
inq_last_6mths <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>inq_last_6mths<span class="p">)</span>
prop_inq_last_6mths <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>inq_last_6mths<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of total accounts (combine factor levels into groups of 5, then 23+)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>total_acc<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="s">"<= 7"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"8 - 12"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"13 - 17"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"18 - 22"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"23+"</span><span class="p">,</span> <span class="m">68</span><span class="p">))</span>
total_acc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>total_acc<span class="p">)</span>
prop_total_acc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>total_acc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Annual Income (factor into quantiles of 20%)</span>
train<span class="o">$</span>new_annual_inc <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>annual_inc<span class="p">,</span>
quantile<span class="p">(</span>train<span class="o">$</span>annual_inc<span class="p">,</span> na.rm <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span>
probs <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0.2</span><span class="p">,</span> <span class="m">0.4</span><span class="p">,</span> <span class="m">0.6</span><span class="p">,</span> <span class="m">0.8</span><span class="p">,</span> <span class="m">1</span><span class="p">)))</span>
annual_inc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_annual_inc<span class="p">)</span>
prop_annual_inc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>annual_inc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Loan Amount (break into < 15k, 15k - 30k, 30k - 35k)</span>
train<span class="o">$</span>new_loan_amnt <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>loan_amnt<span class="p">,</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">15000</span><span class="p">,</span> <span class="m">30000</span><span class="p">,</span> <span class="m">35000</span><span class="p">))</span>
loan_amnt <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_loan_amnt<span class="p">)</span>
prop_loan_amnt <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>loan_amnt<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Employment Length (combine factor levels for better comparison)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>emp_length<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"None"</span><span class="p">,</span> <span class="s">"< 10 years"</span><span class="p">,</span> <span class="s">"< 10 years"</span><span class="p">,</span> <span class="s">"10+ years"</span><span class="p">,</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"< 10 years"</span><span class="p">,</span> <span class="m">8</span><span class="p">),</span> <span class="s">"None"</span><span class="p">)</span>
emp_length <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>emp_length<span class="p">)</span>
prop_emp_length <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>emp_length<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Delinquencies in the past 2 Years (combine factors levels for any > 3)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>delinq_2yrs<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="s">"2"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"3+"</span><span class="p">,</span> <span class="m">17</span><span class="p">))</span>
delinq_2yrs <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>delinq_2yrs<span class="p">)</span>
prop_delinq_2yrs <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>delinq_2yrs<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of Open Accounts (combine factor levels into groups of 5)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>open_acc<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="s">"<= 5"</span><span class="p">,</span> <span class="m">6</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"6 - 10"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"11 - 15"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"16+"</span><span class="p">,</span> <span class="m">38</span><span class="p">))</span>
open_acc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>open_acc<span class="p">)</span>
prop_open_acc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>open_acc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Verified income status</span>
is_inc_v <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>is_inc_v<span class="p">,</span> exclude <span class="o">=</span> <span class="s">""</span><span class="p">)</span>
prop_is_inc_v <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>is_inc_v<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of Public Records (break factor levels into 0, 1, 2+)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>pub_rec<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"2+"</span><span class="p">,</span> <span class="m">12</span><span class="p">))</span>
pub_rec <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>pub_rec<span class="p">)</span>
prop_pub_rec <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>pub_rec<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Months Since Last Record (compare blank vs. non-blank)</span>
na_last_record <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">))</span>
not_na_last_record <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">))</span>
na_last_rec_dflt <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">)</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
not_na_last_rec_dflt <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">)</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
not_na_last_rec_pct_dflt <span class="o"><-</span> not_na_last_rec_dflt <span class="o">/</span> not_na_last_record
na_last_rec_pct_dflt <span class="o"><-</span> na_last_rec_dflt<span class="o">/</span>na_last_record
<span class="c1"># Months since last delinquency (break factor levels in increments of 10)</span>
train<span class="o">$</span>mths_since_last_delinq <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_delinq<span class="p">,</span>
breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">40</span><span class="p">,</span> <span class="m">50</span><span class="p">,</span> <span class="m">60</span><span class="p">,</span> <span class="m">156</span><span class="p">))</span>
mths_since_last_delinq <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>mths_since_last_delinq<span class="p">)</span>
prop_mths_since_last_delinq <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>mths_since_last_delinq<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Collections last 12 months</span>
collections <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>collections_12_mths_ex_med<span class="p">)</span>
prop_collections <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>collections<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
</pre></div>Plotting the Evolution of the U.S. Treasury Yield Curve2014-11-12T20:01:00-05:00Michael Tothtag:michaeltoth.me,2014-11-12:plotting-the-evolution-of-the-us-treasuryyield-curve.html<p>Last week I came across a <a href="http://isomorphism.es/post/101890975168/treasury-yield-curve-from-the-volcker-era-through">graphic</a> that plots changes in the treasury yield curve from 1982 through 2012. For those unfamiliar, the yield curve shows the level of interest rates available to investors at a series of times to maturity or <em>terms</em>. I won't go into too much detail here, but for more information you may find the <a href="http://en.wikipedia.org/wiki/Yield_curve">Yield Curve</a> page on Wikipedia helpful. Note that while 20-year and 30-year treasuries are currently available, they were not available for this entire period and are therefore excluded from this analysis. </p>
<p>Building on the original plot mentioned above, I pulled post-2012 data directly from the Treasury website and added it to the original data to produce an extended graphic of yield curve changes. I also updated the plot formatting and highlighted periods with inverted yield curves using a bright red line.<br />
<br>
<img alt="Yield Curve" src="https://michaeltoth.me/images/yield-output.gif" />
<br> </p>
<p>I've always liked graphics like this that show some changing feature over time, and I think the yield curve illustration is particularly informative. You can clearly see the extremely high interest rates that prevailed throughout the 1980s, and later the characteristically flat yield curve associated with the <a href="http://en.wikipedia.org/wiki/Zero_interest-rate_policy">Zero Interest Rate Policy</a> regime post-2008. The few yield curve inversions (periods where short-term rates are higher than long-term rates) are visible as well, highlighted in red. Yield curve inversions are generally thought to signal an impending economic decline, so highlighting these is informative. The plot also shows clearly that long-term rates have historically been much less volatile than short-term rates, but that the opposite has held true since 2008, with Federal Reserve action keeping short-term rates near zero. </p>
<p>I've been experimenting with parameter and plotting settings in R, as part of the <a href="https://www.coursera.org/course/exdata">Exploratory Data Analysis</a> Coursera class, and I thought this was a good opportunity to experiment with different options. I derived many of the ideas and aesthetics in these plots from examples in Flowing Data's <a href="http://flowingdata.com/2014/10/23/moving-past-default-charts/">Moving Past Default Charts</a> post. </p>
<h3>Technical Details and R Code</h3>
<p>The FedYieldCurve data from the YieldCurve package only contains data through 2012. I wanted to extend the time history from the end of 2012 to the present, so I pulled the additional yield curve data from the <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield">Treasury website</a>. I downloaded the raw text data and subset and formatted it on the command line before uploading the results to Google Drive. I then pulled this information in using the fetchGoogle function from the mosaic package. </p>
<p>After converting the 2012-2014 data to .xts format and combining with the FedYieldCurve data, I modified the graphical parameters to make for a more interesting plot. Then, using the saveGIF function and a for loop, I was able to create the above GIF with a single frame for each month in the data series. Within the loop, I included an if statement to determine whether the 3M rate exceeded the 10Y point (inverted yield curve), and plotted these periods using a red line to highlight this. </p>
<p>I've included the complete R code below. You can also access the code on <a href="https://github.com/michaeltoth/projects/tree/master/yield-curve-analysis">Github</a></p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>YieldCurve<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>animation<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>lubridate<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>XML<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>mosaic<span class="p">)</span>
<span class="c1"># Getting yield curve data through 2012</span>
data<span class="p">(</span>FedYieldCurve<span class="p">)</span>
<span class="c1"># Pull 2013 and 2014 data separately from Google Docs (Source U.S. Treasury)</span>
end_curve <span class="o"><-</span> fetchGoogle<span class="p">(</span><span class="s">"https://docs.google.com/spreadsheets/d/1Yc3Og9g0Ko_SMh6l0EEZcqIQ85godDxgpnkbfK_N-Gk/export?format=csv&id"</span><span class="p">)</span>
<span class="c1"># Change formatting to xts and combine with FedYieldCurve data</span>
end_curve<span class="o">$</span>Date <span class="o"><-</span> <span class="kp">as.POSIXct</span><span class="p">(</span><span class="kp">as.character</span><span class="p">(</span>end_curve<span class="o">$</span>Date<span class="p">),</span> format<span class="o">=</span><span class="s">"%m/%d/%Y"</span><span class="p">)</span>
end_curve_xts <span class="o"><-</span> xts<span class="p">(</span>end_curve<span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">9</span><span class="p">],</span> order.by <span class="o">=</span> end_curve<span class="o">$</span>Date<span class="p">)</span>
final_curves <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>FedYieldCurve<span class="p">,</span> end_curve_xts<span class="p">)</span>
maturities <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="m">3</span><span class="o">/</span><span class="m">12</span><span class="p">,</span><span class="m">6</span><span class="o">/</span><span class="m">12</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">7</span><span class="p">,</span><span class="m">10</span><span class="p">)</span>
numloops <span class="o"><-</span> <span class="kp">nrow</span><span class="p">(</span>final_curves<span class="p">)</span>
<span class="c1"># Set graphical parameters</span>
par<span class="p">(</span>bg<span class="o">=</span><span class="s">"#DCE6EC"</span><span class="p">,</span> mar<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">2</span><span class="p">),</span> xpd<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> mgp<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">2.8</span><span class="p">,</span><span class="m">0.3</span><span class="p">,</span><span class="m">0.5</span><span class="p">),</span> font.main<span class="o">=</span><span class="m">2</span><span class="p">,</span>
col.lab<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> col.main<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> cex.axis<span class="o">=</span><span class="m">0.8</span><span class="p">,</span>
cex.lab<span class="o">=</span><span class="m">0.8</span><span class="p">,</span> cex.main<span class="o">=</span><span class="m">0.9</span><span class="p">,</span> family<span class="o">=</span><span class="s">"Helvetica"</span><span class="p">,</span> lend<span class="o">=</span><span class="m">1</span><span class="p">,</span>
tck<span class="o">=</span><span class="m">0</span><span class="p">,</span> las<span class="o">=</span><span class="m">1</span><span class="p">,</span> bty<span class="o">=</span><span class="s">"n"</span><span class="p">)</span>
opar <span class="o"><-</span> par<span class="p">()</span>
<span class="c1"># Note: must install ImageMagick program for saveGIF function to work</span>
saveGIF<span class="p">({</span>
<span class="c1"># Create one panel for each date</span>
<span class="kr">for</span> <span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span>numloops<span class="p">)</span> <span class="p">{</span>
par<span class="p">(</span>opar<span class="p">)</span>
plot<span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> type<span class="o">=</span><span class="s">"n"</span><span class="p">,</span> xlab<span class="o">=</span><span class="kp">expression</span><span class="p">(</span>italic<span class="p">(</span><span class="s">"Maturity"</span><span class="p">)),</span>
ylab<span class="o">=</span><span class="kp">expression</span><span class="p">(</span>italic<span class="p">(</span><span class="s">"Rates"</span><span class="p">)),</span> ylim<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">15</span><span class="p">),</span> xlim<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">10</span><span class="p">),</span>
xaxt<span class="o">=</span><span class="s">"n"</span><span class="p">,</span> yaxt<span class="o">=</span><span class="s">"n"</span><span class="p">)</span>
title<span class="p">(</span>main<span class="o">=</span><span class="kp">paste</span><span class="p">(</span><span class="s">"Yield Curve: "</span><span class="p">,</span> year<span class="p">(</span>time<span class="p">(</span>final_curves<span class="p">[</span>i<span class="p">]))))</span>
grid<span class="p">(</span><span class="kc">NA</span><span class="p">,</span> <span class="kc">NULL</span><span class="p">,</span> col<span class="o">=</span><span class="s">"white"</span><span class="p">,</span> lty<span class="o">=</span><span class="s">"solid"</span><span class="p">,</span> lwd<span class="o">=</span><span class="m">1.5</span><span class="p">)</span>
axis<span class="p">(</span><span class="m">1</span><span class="p">,</span> tick<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
axis<span class="p">(</span><span class="m">2</span><span class="p">,</span> tick<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
<span class="c1"># If yield curve is inverted, plot in red, else dark blue</span>
<span class="kr">if</span> <span class="p">(</span>final_curves<span class="o">$</span>R_3M<span class="p">[</span>i<span class="p">]</span> <span class="o">></span> final_curves<span class="o">$</span>R_10Y<span class="p">[</span>i<span class="p">])</span> <span class="p">{</span>
lines<span class="p">(</span>maturities<span class="p">,</span> final_curves<span class="p">[</span>i<span class="p">,],</span> lwd<span class="o">=</span><span class="m">3</span><span class="p">,</span> col<span class="o">=</span><span class="s">"red"</span><span class="p">)</span>
<span class="p">}</span>
<span class="kr">else</span> <span class="p">{</span>
lines<span class="p">(</span>maturities<span class="p">,</span> final_curves<span class="p">[</span>i<span class="p">,],</span> lwd<span class="o">=</span><span class="m">3</span><span class="p">,</span> col<span class="o">=</span><span class="s">"#244A5D"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">},</span>
interval<span class="o">=</span><span class="m">.1</span><span class="p">,</span>
movie.name<span class="o">=</span><span class="s">"yieldOutput.gif"</span><span class="p">,</span>
ani.width<span class="o">=</span><span class="m">400</span><span class="p">,</span>
ani.height<span class="o">=</span><span class="m">400</span><span class="p">)</span>
</pre></div>Using Javascript to Visualize a Percolation System2014-10-16T19:27:00-04:00Michael Tothtag:michaeltoth.me,2014-10-16:using-javascript-to-visualize-a-percolation-system.html<p>In this post I will discuss the background for my <a href="../pages/percolation.html" title="Michael Toth - Percolation Visualization">percolation visualization page</a> and the details of my implementation. I hope to provide a good introduction to percolation theory and the union find algorithm in particular. This is the first non-trivial Javascript application I've created, and later in the post I will discuss some of the biggest challenges I faced and things I learned along the way. </p>
<h2>Background</h2>
<p>The inspiration and idea for this project came directly from the similar programming assignment in Robert Sedgewick's and Kevin Wayne's <a href="https://www.coursera.org/course/algs4partI" title="Coursera - Algorithms">Algorithms class</a> on Coursera. In the class I implemented a percolation system in Java, and afterward I thought it would be an interesting challenge to port that to Javascript and create a visualization on my website. For the web version I use Javascript for all of the calculations, and I use the HTML canvas element to draw the visualization to the screen. </p>
<p><br></p>
<h4>Connectivity and the Union Find Algorithm</h4>
<p>A connectivity problem seeks to determine, given a graph of sites such as the one below, whether two sites are connected via any path. </p>
<p><br>
<img alt="Connected Sites" src="https://michaeltoth.me/images/connected.jpg" />
<br></p>
<p>For a small set of sites like this one, a brute force approach would solve the problem effectively. If we wanted to see whether site 2 was connected to site 8, we could recursively check each site's neighbors to ultimately determine that they are connected. As the number of sites grows large, however, this method does not scale, and we need a better solution. Instead of thinking of the grid of sites as a graph, we can convey the same information as a grouping of components, as seen in the image below. Now, determining whether any two sites are connected is as simple as checking whether they are both members of the same component. Connecting two sites under this new representation involves merging their components rather than drawing paths. </p>
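<p>As a concrete sketch, the brute-force neighbor search described above might look like the following. The adjacency list here is a small made-up example graph, not the one pictured:</p>

```javascript
// Brute-force connectivity check: recursively explore neighbors
// until we either reach the target site or run out of unvisited paths.
function isConnected(neighbors, start, target) {
  var visited = {};
  function explore(site) {
    if (site === target) { return true; }
    visited[site] = true;
    var adjacent = neighbors[site] || [];
    for (var i = 0; i < adjacent.length; i++) {
      if (!visited[adjacent[i]] && explore(adjacent[i])) { return true; }
    }
    return false;
  }
  return explore(start);
}

// Hypothetical graph: sites 2-3-8 form one component, 5-6 another
var neighbors = { 2: [3], 3: [2, 8], 8: [3], 5: [6], 6: [5] };
console.log(isConnected(neighbors, 2, 8)); // true
console.log(isConnected(neighbors, 2, 5)); // false
```

<p>This works, but each query may re-explore the whole component, which is exactly the scaling problem the component representation avoids.</p>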
<p><br>
<img alt="Connected Components" src="https://michaeltoth.me/images/connected-components.png" />
<br></p>
<p>The Union Find data structure, sometimes called a disjoint set data structure or
a merge-find set, allows for high performance operations on a component grouping
as described above. The Union Find data structure keeps track of a set of
elements partitioned into a number of disjoint subsets (components). The data
structure supports two main operations: </p>
<p><em>Find</em>: Return the id of the component to which the given site is a member<br />
<em>Union</em>: Connect two sites by combining their two components into a single
component with the same id </p>
<p>A new Union Find data structure of size N is initialized with N distinct
components. In a numerical representation, the id of each component when
initialized is simply the value of the site. Calls to the Union operation
create a tree of components such that when two sets are combined, the
members of one set will point to those of the other set. The id of the
merged component is the id of the root node of this tree. The find
operation returns the id of a site by traversing to the top of the tree
to find the root member component id.</p>
<p>In this implementation, the find operation takes time proportional to the
depth of the tree. A naive implementation of the union operation could allow
trees to become very deep, which would slow the performance of this
algorithm. Instead, if we modify the union function such that we always append
the smaller component to the larger component, we bound the depth of any tree
by lg(N), so a single find traversal takes O(lg(N)) time and a sequence of n
operations takes O(nlg(n)) time in the worst case.<br />
<br>
<em>Abbreviated Proof</em>:<br />
- Note: lg = log base 2<br />
- For a given node x, its depth in the tree increases by 1 only when its tree T1
is merged into another tree T2.<br />
- When this happens, the size of x's tree at least doubles, because our
union operation requires size(T2) &gt;= size(T1) for T1 to point to T2.<br />
- Starting from size 1, the size of x's tree can double at most lg(N) times
before reaching N [ <em>2^lg(N) = N</em> ], where N is the total number of nodes.<br />
- <em>Therefore</em>: the depth of any node is at most lg(N), and each find
traversal takes O(lg(N)) in the worst case </p>
<p>Below I have included my complete Javascript code for the Union Find data
structure. This code is essentially the same as the Java code presented in
the Coursera class mentioned above. We include two operations in addition to
the find and union operations: a connected operation that returns true if
two sites are connected and a count variable that returns the number of
unique components. </p>
<div class="highlight"><pre><span></span><span class="kd">function</span> <span class="nx">WeightedQuickUnionUF</span><span class="p">(</span><span class="nx">N</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Constructor</span>
<span class="kd">var</span> <span class="nx">id</span> <span class="o">=</span> <span class="p">[];</span>
<span class="kd">var</span> <span class="nx">sz</span> <span class="o">=</span> <span class="p">[];</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o"><</span> <span class="nx">N</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="nx">i</span><span class="p">;</span> <span class="c1">// id[i] = parent of i</span>
<span class="nx">sz</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// sz[i] = number of objects in subtree with root i</span>
<span class="p">}</span>
<span class="c1">// Returns the number of components, which initializes at N</span>
<span class="k">this</span><span class="p">.</span><span class="nx">count</span> <span class="o">=</span> <span class="nx">N</span><span class="p">;</span>
<span class="c1">// Returns the component id for the containing site</span>
<span class="k">this</span><span class="p">.</span><span class="nx">find</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="nx">p</span> <span class="o">!=</span> <span class="nx">id</span><span class="p">[</span><span class="nx">p</span><span class="p">])</span> <span class="p">{</span>
<span class="nx">p</span> <span class="o">=</span> <span class="nx">id</span><span class="p">[</span><span class="nx">p</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">p</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Returns true if two elements are part of the same component</span>
<span class="k">this</span><span class="p">.</span><span class="nx">connected</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">,</span> <span class="nx">q</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">p</span><span class="p">)</span> <span class="o">===</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">q</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Connects the components of two elements</span>
<span class="k">this</span><span class="p">.</span><span class="nx">union</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">,</span> <span class="nx">q</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">rootP</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">p</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">rootQ</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">q</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">rootP</span> <span class="o">===</span> <span class="nx">rootQ</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// make smaller tree point to larger one</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o"><</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">])</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o">=</span> <span class="nx">rootQ</span><span class="p">;</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">]</span> <span class="o">+=</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">];</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">]</span> <span class="o">=</span> <span class="nx">rootP</span><span class="p">;</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o">+=</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">count</span><span class="o">--</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
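<p>The snippet above shows only the tail end of the union method. For context, a minimal stand-alone version of the weighted quick-union structure might look like the sketch below (illustrative only; the names are hypothetical and the full source is linked at the end of this post):</p>

```javascript
// Minimal weighted quick-union sketch (illustrative; not the project's exact code).
function UnionFind(n) {
  this.count = n;                      // number of disjoint components
  this.id = [];                        // id[i] = parent of site i
  this.sz = [];                        // sz[i] = size of the tree rooted at i
  for (var i = 0; i < n; i++) {
    this.id[i] = i;
    this.sz[i] = 1;
  }
}

// Follow parent links until reaching a root (a site that is its own parent)
UnionFind.prototype.find = function (p) {
  while (p !== this.id[p]) {
    p = this.id[p];
  }
  return p;
};

UnionFind.prototype.connected = function (p, q) {
  return this.find(p) === this.find(q);
};

UnionFind.prototype.union = function (p, q) {
  var rootP = this.find(p);
  var rootQ = this.find(q);
  if (rootP === rootQ) { return; }
  // make smaller tree point to larger one to keep trees shallow
  if (this.sz[rootP] < this.sz[rootQ]) {
    this.id[rootP] = rootQ; this.sz[rootQ] += this.sz[rootP];
  } else {
    this.id[rootQ] = rootP; this.sz[rootP] += this.sz[rootQ];
  }
  this.count--;
};
```

<p>Weighting by tree size keeps every tree shallow, so find (and therefore union and connected) runs in near-logarithmic time.</p>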
<p><br> </p>
<h4>Percolation</h4>
<p>The percolation problem assumes a grid of sites, each of which can be either
open or closed. If we imagine water flowing across the top of the grid, an
open site becomes full of water when it connects to the top of the grid
through an unbroken path of open sites. The system percolates when an open
site on the bottom row of the grid connects to the top row through such a
path, so that water flows freely through the system, as seen in the image
below.</p>
<p><br>
<img alt="Percolation" src="https://michaeltoth.me/images/percolation.png" />
<br></p>
<p><em>Modeling percolation with the Union-Find algorithm</em><br />
The union-find structure is efficient for determining whether any two
sites are connected, but on its own it would require N^2 connected queries to
determine whether any of the N sites in the top row is connected to any of the
N sites in the bottom row of the percolation grid. Similarly, it would require
time proportional to N to determine whether a given site is full (i.e. whether
it is connected to any of the N top-row sites). </p>
<p>To address these issues, we create two "virtual sites" on the top and bottom
of the grid. We automatically connect these virtual sites to sites on the
top and bottom rows as they are opened. To determine whether a site is full,
we check whether it is connected to the top virtual site. To determine
whether the system percolates we check whether the top and bottom virtual
sites are connected. </p>
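<p>As a sketch, the virtual-site wiring might look like the following. This is illustrative only: index 0 is the virtual top site, n*n + 1 the virtual bottom site, and a tiny quick-find structure stands in for the weighted union-find purely to keep the example self-contained.</p>

```javascript
// Tiny quick-find union-find, just so this sketch runs on its own.
function QuickFind(n) {
  this.id = [];
  for (var i = 0; i < n; i++) { this.id[i] = i; }
}
QuickFind.prototype.connected = function (p, q) { return this.id[p] === this.id[q]; };
QuickFind.prototype.union = function (p, q) {
  var pid = this.id[p], qid = this.id[q];
  for (var i = 0; i < this.id.length; i++) {
    if (this.id[i] === pid) { this.id[i] = qid; }
  }
};

// Sketch of percolation with virtual top and bottom sites (names hypothetical).
function Percolation(n) {
  this.n = n;
  this.top = 0;                        // virtual top site
  this.bottom = n * n + 1;             // virtual bottom site
  this.uf = new QuickFind(n * n + 2);  // n*n grid sites plus two virtual sites
  this.isOpen = [];
  for (var i = 0; i < n * n + 2; i++) { this.isOpen[i] = false; }
}

// Map (row, col), both 1-based, to a site index in 1..n*n
Percolation.prototype.site = function (row, col) {
  return (row - 1) * this.n + col;
};

Percolation.prototype.open = function (row, col) {
  var s = this.site(row, col);
  this.isOpen[s] = true;
  if (row === 1) { this.uf.union(s, this.top); }           // top row -> virtual top
  if (row === this.n) { this.uf.union(s, this.bottom); }   // bottom row -> virtual bottom
  // connect to any open orthogonal neighbors
  var self = this;
  [[row - 1, col], [row + 1, col], [row, col - 1], [row, col + 1]].forEach(function (rc) {
    var r = rc[0], c = rc[1];
    if (r >= 1 && r <= self.n && c >= 1 && c <= self.n && self.isOpen[self.site(r, c)]) {
      self.uf.union(s, self.site(r, c));
    }
  });
};

// A site is full if it connects to the virtual top site
Percolation.prototype.isFull = function (row, col) {
  return this.isOpen[this.site(row, col)] && this.uf.connected(this.site(row, col), this.top);
};

// The system percolates when the two virtual sites are connected
Percolation.prototype.percolates = function () {
  return this.uf.connected(this.top, this.bottom);
};
```

<p>With the virtual sites in place, both isFull and percolates become single connected queries instead of loops over a whole row.</p>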
<p><em>Interesting aside</em><br />
For a large square grid of sites, there exists a percolation threshold
probability p such that if the fraction of open sites is less than p the
system will almost certainly not percolate, and if it is greater than p the
system will almost certainly percolate. No exact expression for the percolation
threshold of a square grid is known, but through simulation the value has been
estimated at approximately 0.592746. That is, for a large square grid, the
system should first percolate once roughly 59.2746% of the sites have been opened.
<br></p>
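<p>That estimate can be reproduced, at least roughly, with a Monte Carlo experiment: open sites in random order, record the fraction open when the system first percolates, and average over many trials. The sketch below uses a simple flood fill rather than union-find, purely to stay self-contained; on small grids, finite-size effects shift the estimate slightly.</p>

```javascript
// Monte Carlo estimate of the percolation threshold on an n-by-n grid.
// Illustrative sketch: flood fill stands in for union-find.

// Does water reach the bottom row? Flood fill from every open top-row site.
function percolates(open, n) {
  var seen = [];
  var queue = [];
  for (var c = 0; c < n; c++) {
    if (open[c]) { seen[c] = true; queue.push(c); }
  }
  while (queue.length > 0) {
    var s = queue.pop();
    if (s >= n * (n - 1)) { return true; }       // reached the bottom row
    var r = Math.floor(s / n), col = s % n;
    var neighbors = [];
    if (r > 0) { neighbors.push(s - n); }
    if (r < n - 1) { neighbors.push(s + n); }
    if (col > 0) { neighbors.push(s - 1); }
    if (col < n - 1) { neighbors.push(s + 1); }
    neighbors.forEach(function (t) {
      if (open[t] && !seen[t]) { seen[t] = true; queue.push(t); }
    });
  }
  return false;
}

// Open sites in a random order until the system percolates;
// return the fraction of sites open at that moment.
function runTrial(n) {
  var order = [];
  for (var i = 0; i < n * n; i++) { order[i] = i; }
  for (var i = order.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = order[i]; order[i] = order[j]; order[j] = tmp;
  }
  var open = [];
  var opened = 0;
  while (!percolates(open, n)) {
    open[order[opened]] = true;
    opened++;
  }
  return opened / (n * n);
}

function estimateThreshold(n, trials) {
  var total = 0;
  for (var t = 0; t < trials; t++) { total += runTrial(n); }
  return total / trials;
}
```

<p>Calling estimateThreshold(20, 100) should print a value in the neighborhood of 0.59, drifting closer to 0.592746 as the grid grows.</p>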
<h2>Challenges</h2>
<h4>Creating the Visualization</h4>
<p>Initially I thought to create the grid of sites using a grid of divs that
I could color according to their status, similar to my
previous <a href="../pages/mondrian.html" title="Michael Toth - Piet Mondrian Painting">Mondrian Painting Project</a>. However, I wanted to support changing the
size of the grid, and managing a large number of divs seemed
unnecessarily cumbersome. I did some searching on <a href="http://codepen.io" title="Codepen">Codepen.io</a> for ideas and found some examples using the HTML5 canvas element,
which was exactly what I needed. In particular, I liked that the canvas renders to an
image that can be saved. I learned a lot about the HTML canvas and how it
works, and this was a fun part of the project. </p>
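<p>A minimal version of the canvas drawing might look like the sketch below (illustrative; the grid representation and colors here are invented). Each site becomes one fillRect call, colored by its status:</p>

```javascript
// Sketch: draw an n-by-n percolation grid to a canvas 2d context.
// grid[row][col] is 'closed', 'open', or 'full' (hypothetical representation).
function drawGrid(ctx, grid, cellSize) {
  var colors = { closed: '#000000', open: '#ffffff', full: '#6699ff' };
  for (var row = 0; row < grid.length; row++) {
    for (var col = 0; col < grid[row].length; col++) {
      ctx.fillStyle = colors[grid[row][col]];
      ctx.fillRect(col * cellSize, row * cellSize, cellSize, cellSize);
    }
  }
}
```

<p>In the browser this would be driven by something like drawGrid(canvas.getContext('2d'), grid, 10); because the canvas is a bitmap, the finished grid can be saved as an image.</p>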
<p><br></p>
<h4>Converting Implementation from Java</h4>
<p>I initially had trouble porting the Java code to JavaScript. In Java,
the well-defined class relationships were clear to me, but at first I did not
understand how to implement similar structures in JavaScript. After doing some
research, I found that JavaScript offers many ways to accomplish the same
thing, and I ultimately used functions to implement this, as they were
the most syntactically similar approach and the easiest for me to understand. I
am still working to understand many aspects of JavaScript, but this project
helped me greatly in learning how to modularize my code. </p>
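<p>For readers making the same transition, the pattern I mean is roughly this: a constructor function plays the role of a Java class, with per-instance state kept on this (or hidden in a closure) and behavior attached as methods. The names below are invented purely for illustration:</p>

```javascript
// A Java-style "class" expressed as a JavaScript constructor function.
function Counter(name) {
  var ticks = 0;                 // private state, hidden in the closure

  this.name = name;              // public field
  this.increment = function () { ticks++; };
  this.tally = function () { return ticks; };
}

var heads = new Counter('heads');
heads.increment();
heads.increment();
```

<p>Here heads.tally() returns 2, while ticks is unreachable from outside the constructor, which mirrors a private field in Java.</p>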
<p><br> </p>
<h4>Iteratively Opening Sites and Drawing</h4>
<p>I initially implemented the process of opening sites and drawing to the canvas
using a while loop, but this was not ideal. I could run the entire
percolation simulation and then draw to the canvas once, but if I tried drawing to
the canvas after each site was opened, the browser would freeze. I wanted to
show each site being opened in succession, so I needed a way
to delay the opening of new sites until the screen could be redrawn. I
accomplished this with JavaScript's setInterval function. I
ultimately kept both implementations, allowing the user to choose whether
to output the results instantly (using the while loop) or to output the results
iteratively at two different speeds (using setInterval). The timing is controlled
by the code shown below: </p>
<div class="highlight"><pre><span></span><span class="kd">function</span> <span class="nx">outputInstantly</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// While loop runs until the system percolates</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="nx">perc</span><span class="p">.</span><span class="nx">percolates</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Calls the percolates method of the perc object</span>
<span class="nx">openRandom</span><span class="p">();</span>
<span class="nx">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Once the system percolates, draw to the screen a single time</span>
<span class="nx">drawPerc</span><span class="p">.</span><span class="nx">drawGrid</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// The user controls delay variable by selecting radio buttons on the page</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">delay</span> <span class="o">===</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">outputInstantly</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// Use the setInterval function to repeatedly open sites and draw to screen (checkPerc function)</span>
<span class="nx">interval</span> <span class="o">=</span> <span class="nx">setInterval</span><span class="p">(</span><span class="nx">checkPerc</span><span class="p">,</span> <span class="nx">delay</span><span class="p">);</span>
<span class="nx">interval</span><span class="p">();</span>
<span class="p">}</span>
</pre></div>
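<p>The checkPerc callback itself is not shown above. Roughly, on each tick it opens one more site, redraws, and clears the interval once the system percolates. Below is a hypothetical sketch, with stand-in perc and drawPerc objects replacing the real ones defined elsewhere in the project:</p>

```javascript
// Hypothetical sketch of the checkPerc callback (not the project's exact code).
// Stand-in objects replace the project's perc and drawPerc.
var perc = {
  openCount: 0,
  percolates: function () { return this.openCount >= 5; }  // toy condition
};
var drawPerc = {
  draws: 0,
  drawGrid: function () { this.draws++; }  // pretend to draw to the canvas
};
function openRandom() { perc.openCount++; }

var interval = null;  // id returned by setInterval

function checkPerc() {
  if (perc.percolates()) {
    clearInterval(interval);  // stop the timer once the system percolates
    return;
  }
  openRandom();               // open one more site...
  drawPerc.drawGrid();        // ...and redraw so the user sees it appear
}
```

<p>Because setInterval keeps firing until its id is cleared, checkPerc is responsible for stopping the timer itself when the simulation finishes.</p>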
<p><br> </p>
<h4>Bug when running multiple calculations simultaneously</h4>
<p>My initial implementation suffered from a bug: if I reran the simulation,
either by refreshing the page or by clicking the button to run again,
the first percolation run would continue in the background. This caused
issues with the display and the text output to the screen. I knew the
solution would be to clear the interval on rerun, but I did not know how to
access the id returned by setInterval, an instance variable of the previous
simulatePercolation instance, when creating a new instance.
After some experimentation, I found that if I declared the interval variable
in the head of my HTML, rather than in a separate JavaScript file, I could
assign it to setInterval when running simulatePercolation. This eliminated
the possibility of duplicate intervals running simultaneously and corrected the
issues I had been facing.</p>
<p>See the full code for my percolation visualization on my GitHub:
<a href="https://raw.githubusercontent.com/michaeltoth/michaeltoth/master/content/pages/percolation.md" title="Percolation">Percolation</a><br />
<br>
<a href="../pages/percolation.html" title="Michael Toth - Percolation Visualization">Run my percolation visualization</a></p>