<h1>A Detailed Guide to ggplot colors</h1>
<p>Michael Toth &middot; 2019-05-14</p>
<p>Once you've figured out how to create the standard <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, <a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">bar charts</a>, and <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in <code>ggplot</code>, the next step to really elevate your graphs is to master working with color.</p>
<p>Strategic use of color can really help your graphs to stand out and <a href="https://michaeltoth.me/10-steps-to-better-graphs-in-r.html">make an impact</a>.</p>
<p>In this guide, you'll learn how to incorporate your own custom color palettes into your graphs by modifying the base <code>ggplot</code> colors.</p>
<h2>By the end of this tutorial, you’ll know how to:</h2>
<ul>
<li>Change all items in a graph to a static color of your choice</li>
<li>Differentiate between <strong>setting a static color</strong> and <strong>mapping a variable in your data to a color palette</strong> so that each color represents a different level of the variable</li>
<li>Customize your own continuous color palettes using the <code>scale_color_gradient</code>, <code>scale_fill_gradient</code>, <code>scale_color_gradient2</code>, and <code>scale_fill_gradient2</code> functions</li>
<li>Customize your own color palettes for categorical data using the <code>scale_color_manual</code> and <code>scale_fill_manual</code> functions</li>
</ul>
<h2>Introducing video tutorials!</h2>
<p>I'm also excited to try something new in this guide! I'll be adding video tutorials to accompany the content, so please let me know what you think about these and if you find them helpful. I'd love to do more of this in the future if you find them valuable!</p>
<h2>Get my free workbook to build a deeper understanding of <code>ggplot</code> colors!</h2>
<p>Have you ever read a tutorial or guide and learned a bunch of interesting things, only to forget them shortly after you finished reading? </p>
<p>Me too. And it's really annoying!</p>
<p>Unfortunately, our brains aren't good at remembering what we read. We need to think critically and be engaged in solving problems to learn information so it sticks.</p>
<p>That's why I've created a free workbook to accompany this post. The workbook is an R file that includes additional questions and exercises to help you engage with this material. </p>
<p><a href="https://mailchi.mp/2c40b4d25a09/ggplot-colors">Get your free workbook to master working with colors in <code>ggplot</code></a></p>
<h2>A high-level overview of <code>ggplot</code> colors</h2>
<p>By default, <code>ggplot</code> graphs use a black color for lines and points and a gray color for shapes like the rectangles in bar graphs.</p>
<p>Sometimes this is fine for your purposes, but often you'll want to modify these colors to something different. </p>
<p>Depending on the type of graph you're working with, there are two primary attributes that affect the colors in a graph. </p>
<p>You use the <code>color</code> attribute to change the <em>outline</em> of a shape and you use the <code>fill</code> attribute to fill the <em>inside</em> of a shape. </p>
<p>Specifically, we use the <code>color</code> attribute to change the color of any points and lines in your graph. This is because points and lines are 0- and 1-dimensional objects, so they don't have any inside to fill! </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>iris<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> Sepal.Width<span class="p">,</span> y <span class="o">=</span> Sepal.Length<span class="p">),</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In contrast, bars and other 2-dimensional shapes <em>do</em> have an inside to fill, so you will be using the <code>fill</code> attribute to change the color of these items in your graph:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Side note: technically you can also use the <code>color</code> attribute to change the outline of shapes like bars in a bar graph. I use this functionality very rarely, and for the sake of simplicity I will not go into this in further detail in this guide.</p>
<p>Except for the difference in naming, <code>color</code> and <code>fill</code> operate very similarly in <code>ggplot</code>. As you'll see, the functions that exist for modifying your <code>ggplot</code> colors all come in both <code>color</code> and <code>fill</code> varieties. But before we get to modifying the colors in your graphs, there's one other thing we need to touch on first.</p>
<h2>Modifying <code>ggplot</code> colors: static color vs. color mapping</h2>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/c7Smep_qXfA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>We need to distinguish between two different ways of modifying colors in a <code>ggplot</code> graph. The two things we can do are:</p>
<ol>
<li><em>setting a static color</em> for our entire graph</li>
<li><em>mapping a variable to a color</em> so each level of the variable is a different color in our graph</li>
</ol>
<p>In the earlier examples, we used a static color (red) to modify all of the points and bars in the two graphs that we created. </p>
<p>It's often the case, however, that we want to use color to convey additional information in our graph. Usually, we do this by mapping a variable in our dataset to the <code>color</code> or <code>fill</code> aesthetic, which tells <code>ggplot</code> to use a different color for each level of that variable in the data. </p>
<p>Setting a static color is pretty straightforward, and you can use the two examples above as references for how to accomplish that. </p>
<p>In the rest of this guide, I'm going to show you how you can map variables in your data to colors in your graph. You'll learn about the different functions in <code>ggplot</code> to set your own color palettes and how they differ for continuous and categorical variables.</p>
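<p>To make that distinction concrete before we move on, here's a minimal sketch using the built-in <code>iris</code> data. A static color goes <em>outside</em> <code>aes</code>, while a color mapping goes <em>inside</em> <code>aes</code>:</p>

```r
library(ggplot2)

# Static color: specified outside aes(), so every point is red
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(color = 'red')

# Color mapping: specified inside aes(), so each species gets its own color
p <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = Species))
p
```

<p>One common pitfall: writing <code>aes(color = 'red')</code> does not give you red points. It maps the constant string <code>"red"</code> as a one-level variable, producing a legend and a default color instead.</p>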
<h2>Working with Color Palettes for Continuous Data</h2>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/7cQqA5ibXj4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>Let's start with a simple example of the default continuous color palette in <code>ggplot</code>. First, we'll generate some random data that we'll use for our graph.</p>
<div class="highlight"><pre><span></span>df <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>
x <span class="o">=</span> runif<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 uniformly distributed random values</span>
y <span class="o">=</span> runif<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 uniformly distributed random values</span>
z1 <span class="o">=</span> rnorm<span class="p">(</span><span class="m">100</span><span class="p">),</span> <span class="c1"># 100 normally distributed random values</span>
z2 <span class="o">=</span> <span class="kp">abs</span><span class="p">(</span>rnorm<span class="p">(</span><span class="m">100</span><span class="p">))</span> <span class="c1"># 100 normally distributed random values mapped to positive</span>
<span class="p">)</span>
</pre></div>
<h4>On sequential color scales and <code>ggplot</code> colors</h4>
<p>When we map a continuous variable to a color scale, we map the values for that variable to a color gradient. You can see the default <code>ggplot</code> color gradient below. </p>
<p><center>
<img alt="Sequential color gradient from dark to light blue" src="../images/20190416_ggplot_colors/gradient_1.png" width="600px" />
</center></p>
<p>This is called a sequential color scale, because it maps data sequentially from one color to another. The minimum value in your dataset will be mapped to the left side (dark blue) of this sequential color gradient, while the maximum value will be mapped to the right side (light blue).</p>
<p>You can imagine stretching a number line across this gradient of colors. Then, for every value in your data, you find it on the number line, take the color at that location, and graph using that resulting color.</p>
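<p>You can mimic this number-line lookup yourself in base R with <code>colorRamp</code>. As a sketch (assuming <code>ggplot</code>'s default dark and light blue endpoints, <code>#132B43</code> and <code>#56B1F7</code>), rescale your values to [0, 1] and read off the color at each position:</p>

```r
# Build a color ramp between ggplot's (assumed) default gradient endpoints
ramp <- colorRamp(c("#132B43", "#56B1F7"))

vals <- c(0.2, 1.4, 2.6)                                # example data values
scaled <- (vals - min(vals)) / (max(vals) - min(vals))  # rescale to [0, 1]

# Look up each position on the gradient and convert back to hex colors
cols <- rgb(ramp(scaled), maxColorValue = 255)
cols  # the minimum maps to dark blue, the maximum to light blue
```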
<p>Let's see how this works in practice. Using the random data we generated above, we'll graph a <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plot</a> of the x and y variables. To illustrate the color gradient, we'll map the z2 variable to the <code>color</code> aesthetic: </p>
<div class="highlight"><pre><span></span><span class="c1"># Default colour scale colours from dark blue to light blue</span>
g1 <span class="o"><-</span> ggplot<span class="p">(</span>df<span class="p">,</span> aes<span class="p">(</span>x<span class="p">,</span> y<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> z2<span class="p">))</span>
g1
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/default_continuous-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Note the contrast between this syntax and the syntax before where we set a static color for our graph. Here, we aren't specifying the color to use, we're simply telling <code>ggplot</code> to map the <code>z2</code> variable to the color aesthetic by including the mapping <code>color = z2</code> within the <code>aes</code> function. </p>
<p>In the dataset that I created, the minimum value for the <code>z2</code> variable is 0.0024422, while the maximum value is 2.6241346. All values--and therefore all colors--fall between these minimum and maximum levels. </p>
<h4>Modifying our <code>ggplot</code> colors for continuous data using scale_color_gradient</h4>
<p>Now that you understand how <code>ggplot</code> can map a continuous variable to a sequential color gradient, let's go into more detail on how you can modify the specific colors used within that gradient. </p>
<p>Instead of the default blue gradient that <code>ggplot</code> uses, we can use any color gradient we want! To modify the colors used in this scale, we'll be using the <code>scale_color_gradient</code> function to modify our <code>ggplot</code> colors.</p>
<p>Side note: if we were instead graphing bars or other fillable shapes, we would use the <code>scale_fill_gradient</code> function. For brevity, I won't be including an example of this function. It operates in exactly the same way as the <code>scale_color_gradient</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>Using the same graph from before, we simply add a call to the <code>scale_color_gradient</code> function to modify our color palette. Here, we can specify our own values for <code>low</code> and <code>high</code> to customize the gradient of colors in our graph. In this case, we'll be mapping low values to greenyellow and high values to forestgreen. </p>
<div class="highlight"><pre><span></span>g1 <span class="o">+</span>
scale_color_gradient<span class="p">(</span>low <span class="o">=</span> <span class="s">'greenyellow'</span><span class="p">,</span> high <span class="o">=</span> <span class="s">'forestgreen'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/scale_color_gradient-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>By modifying the values you're passing to the <code>scale_color_gradient</code> function, you can create a sequential color scale between any two colors! </p>
<p>Under the hood, <code>ggplot</code> was already using this color scale with the dark blue and light blue colors that show up by default. By adding this color scale to the graph and specifying your own colors, you're simply overriding the default values that <code>ggplot</code> was already using.</p>
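<p>If you do need the <code>fill</code> variant, here's a quick sketch using <code>faithfuld</code>, a dataset bundled with <code>ggplot2</code> whose <code>density</code> column is continuous:</p>

```r
library(ggplot2)

# scale_fill_gradient works just like scale_color_gradient, but for fills
p <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_tile() +
  scale_fill_gradient(low = 'greenyellow', high = 'forestgreen')
p
```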
<h4>On diverging gradient scales and <code>ggplot</code> colors</h4>
<p>Sequential color scales are great when you want to easily differentiate between low and high values in a dataset.</p>
<p>Sometimes, however, that's not what you want. Sometimes you want to look at deviations from a certain baseline value, and you care about distinguishing both positive and negative deviations. For this type of data, we use what's called a diverging color scale. </p>
<p>A diverging color scale creates a gradient between three different colors, allowing you to easily identify low, middle, and high values within your data. You can see an example of a diverging color scale below.</p>
<p><center>
<img alt="Continuous color gradient from blue to red" src="../images/20190416_ggplot_colors/gradient_2.png" width="600px" />
</center></p>
<p>In this color scale, we see that blue is associated with values on the low end, white with values in the middle, and red with values on the high end. Among other things, this type of scale is often used when presenting United States presidential election results.</p>
<p>Instead of the <code>scale_color_gradient</code> function that we used for a sequential color palette, we're now going to use the <code>scale_color_gradient2</code> to produce a diverging palette.</p>
<p>Side note: Again, there is a similar function called <code>scale_fill_gradient2</code> that we would use if we were instead graphing bars or other fillable shapes. I won't be including an example of this function, but it operates in exactly the same way as the <code>scale_color_gradient2</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>As before, we tell the <code>scale_color_gradient2</code> function which colors to map to low and high values of our variable. In addition, we also specify a color to map the mid values to. As in the color scale we just reviewed, we'll use the blue-white-red color palette for this example:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>df<span class="p">,</span> aes<span class="p">(</span>x<span class="p">,</span> y<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>colour <span class="o">=</span> z1<span class="p">))</span> <span class="o">+</span>
scale_color_gradient2<span class="p">(</span>low <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> mid <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> high <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>While you can technically specify any 3 colors for a diverging color scale, the convention is to use a light color like white or light yellow in the middle and darker colors of different hues for both low and high values, like we've done here. </p>
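<p>One extra knob worth knowing: <code>scale_color_gradient2</code> also takes a <code>midpoint</code> argument (default 0) that controls which data value maps to the middle color. A sketch with simulated data:</p>

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = runif(100), y = runif(100), z1 = rnorm(100))

# midpoint = 0 means values near 0 plot as white,
# negative values shade toward blue, positive values toward red
p <- ggplot(df, aes(x, y)) +
  geom_point(aes(color = z1)) +
  scale_color_gradient2(low = 'blue', mid = 'white', high = 'red',
                        midpoint = 0)
p
```

<p>Shifting <code>midpoint</code> to, say, the mean of your variable recenters the whole diverging scale around that baseline.</p>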
<h2>Working with Color Palettes for Categorical Data</h2>
<p>When working with continuous data, each value in your dataset was automatically mapped to a value on a 2-color sequential gradient or 3-color diverging gradient, as we just saw. The goal was to show a smooth transition between colors, highlighting low and high values or low, middle, and high values in the data.</p>
<p>When working with categorical data, each distinct level in your dataset will be mapped to a distinct color in your graph. With categorical data, the goal is to have highly differentiated colors so that you can easily identify data points from each category.</p>
<p>There are built-in functions within <code>ggplot</code> to generate categorical color palettes. That said, I've always preferred the control I get from generating my own, and that's what I'm going to show you how to do here.</p>
<h4>My favorite tool for building categorical color palettes: Color Picker for Data</h4>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/QBROdDKzQoY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>My favorite way of generating beautiful color palettes is to use Tristen Brown's tool <a href="http://tristen.ca/hcl-picker/#/hlc/6/1.05/603548/D4E966">Color Picker for Data</a>. It offers an intuitive visual interface to build and export a color palette that you can use directly within <code>ggplot</code>.</p>
<h4>Mapping Categorical Data to Color in <code>ggplot</code></h4>
<iframe align="middle" width="560" height="315" src="https://www.youtube.com/embed/h8dn6nbCznQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>For this example, we'll be working with the mtcars dataset. We're going to create a scatter plot of weight and miles per gallon. Then, we'll use the color aesthetic to map 4-, 6-, and 8-cylinder engines each to a different color using the default <code>ggplot</code> colors:</p>
<div class="highlight"><pre><span></span>g2 <span class="o"><-</span> ggplot<span class="p">(</span>mtcars<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)))</span>
g2
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/categorical_colors-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Similar to how we worked with continuous data, we simply map a variable to the color aesthetic by including the code <code>color = your_variable</code> within the <code>aes</code> function of our <code>geom_point</code> call. </p>
<p>The one caveat is that here we're converting the <code>cyl</code> variable to a factor before we do this mapping. Because <code>cyl</code> is recorded as a numerical variable, it will by default map to the color gradients we saw before, which isn't what we want in this case, as we're treating <code>cyl</code> as a categorical variable with 3 levels. Remember, just because a variable happens to have numeric values does not necessarily mean it should be mapped as a continuous scale! </p>
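<p>You can see the difference by building the graph both ways. Here's a sketch contrasting the numeric and factor versions of <code>cyl</code>:</p>

```r
library(ggplot2)

# Numeric cyl: ggplot assumes continuous data and uses a blue gradient
p_cont <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = cyl))

# factor(cyl): ggplot treats it as categorical, one distinct hue per level
p_disc <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# The discrete version draws exactly three distinct colors
length(unique(ggplot_build(p_disc)$data[[1]]$colour))
```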
<p>We know how we can map a categorical variable to the color aesthetic to produce different colors in our graph for each level in our dataset. How can we modify those colors to a color palette of our choice? </p>
<h4>Modifying our <code>ggplot</code> colors for categorical data using scale_color_manual</h4>
<p>Once you have your color palette, you can use the <code>scale_color_manual</code> function to map the levels in your dataset to different colors in your generated color palette.</p>
<p>Side note: Can you guess? Yes, again, there is a similar function called <code>scale_fill_manual</code> that we would use if we were instead graphing bars or other fillable shapes. I won't be including an example of this function, but it operates in exactly the same way as the <code>scale_color_manual</code> function, so you can easily modify this code to work for filling graphs with color as well.</p>
<p>Here, we start by creating a vector that maps the different levels in our data, in this case "4", "6", and "8", to different colors. </p>
<p>We then use the <code>scale_color_manual</code> function and specify the mapping by passing our <code>colors</code> vector to the <code>values</code> argument of <code>scale_color_manual</code>. It will then go through each entry in the <code>cyl</code> column, mapping it to the relevant color in our <code>colors</code> vector.</p>
<div class="highlight"><pre><span></span>colors <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"4"</span> <span class="o">=</span> <span class="s">"#D9717D"</span><span class="p">,</span> <span class="s">"6"</span> <span class="o">=</span> <span class="s">"#4DB6D0"</span><span class="p">,</span> <span class="s">"8"</span> <span class="o">=</span> <span class="s">"#BECA55"</span><span class="p">)</span>
g2 <span class="o">+</span>
scale_color_manual<span class="p">(</span>values <span class="o">=</span> colors<span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_colors/scale_color_manual-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
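<p>And if you're filling shapes rather than coloring points, the same pattern works with <code>scale_fill_manual</code>. A sketch mapping the three drive types in <code>mpg</code> ("4", "f", and "r") to the same palette on a bar chart:</p>

```r
library(ggplot2)

# Named vector mapping each level of drv to a color from our palette
fills <- c("4" = "#D9717D", "f" = "#4DB6D0", "r" = "#BECA55")

p <- ggplot(mpg, aes(x = drv, fill = drv)) +
  geom_bar() +
  scale_fill_manual(values = fills)
p
```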
<h2>A Summary of Working with <code>ggplot</code> Colors</h2>
<p>Congratulations! You now know how to work with colors in your <code>ggplot</code> graphs.</p>
<p>In this guide, you learned:</p>
<ul>
<li>How to change all items in a graph to a static color of your choice</li>
<li>To distinguish between two ways of modifying color in your <code>ggplot</code> graph:<ol>
<li>Setting a static color for all elements in your graph </li>
<li>Mapping a variable in your data to a color palette so that each color represents a different level of the variable</li>
</ol>
</li>
<li>How you can customize your sequential color scales using the <code>scale_color_gradient</code> and <code>scale_fill_gradient</code> functions for continuous data</li>
<li>How to customize your diverging color scales using the <code>scale_color_gradient2</code> and <code>scale_fill_gradient2</code> functions for continuous data</li>
<li>How to customize your color palettes for categorical data using the <code>scale_color_manual</code> and <code>scale_fill_manual</code> functions</li>
</ul>
<h2>Don't Forget to Practice!</h2>
<p>Right now, you should have a pretty good understanding of how you can work with and modify the colors in your <code>ggplot</code> graphs. But if you don't practice, you're going to forget this stuff! </p>
<p>That's why I've created a free workbook that you can use to apply what you've learned in this guide. The workbook is an R file that includes additional questions and exercises to help you engage with this material. </p>
<p><a href="https://mailchi.mp/2c40b4d25a09/ggplot-colors">Get your free workbook to master working with colors in <code>ggplot</code></a></p>
<h1>10 Steps to Better Graphs in R</h1>
<p>Michael Toth &middot; 2019-05-07</p>
<p>Over the last 5 years, I have created <strong>a LOT</strong> of graphs. And let me tell you, they haven't all been pretty. But with each new graph that I've created, I've improved my knowledge of what works and what doesn't. </p>
<p>And I've used that knowledge to develop a set of best practices that I follow every time I'm working on a new project that involves communicating results or information with graphs. </p>
<p>You see, when I'm making a graph, I'm not doing it just to explore some data or show something "interesting". No. I want my graphs to speak to my audience and help them to <em>understand</em> and <em>take action</em> based on what they learn. </p>
<p>A good data scientist needs to be able to not only analyze data, but also to convey the insights hidden in that data in a way that convinces people to take appropriate action. </p>
<p>If you can use your data analysis skills to consistently drive change in your organization, you will quickly find yourself on a path toward promotion, increased responsibility, and greater control over your own work. </p>
<p>In my own experience, learning to graph effectively was the single biggest thing that helped me increase my impact and ability to drive change at work. That's why I think it's so important for you to learn to graph effectively, and that's why I'm sharing this checklist with you today.</p>
<p>This is the exact checklist I go through when I'm working on graphs for big consulting projects with my clients. I keep a printed copy on my desk and I refer to it every time I'm working on a new graph. It helps me, and I think it will help you too. Be sure to get a copy of the checklist for yourself so you'll always have it handy when you need it!</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step ggplot Graphing Checklist</a></p>
<h2>Why Do You Need a Checklist?</h2>
<p>Graphs are a versatile tool that can be used for a variety of different goals. That's one of the reasons why I think learning to graph effectively is one of the highest priorities for data scientists. </p>
<p>But I also think that leads to a lot of problems with graphs. You see, you can use graphs for exploratory data analysis, and you can also use graphs for presentation and sharing results. The problem that I see <strong>ALL THE TIME</strong> is that people try to use the same graphs for both of these things. Ahhh! </p>
<p>Look, I've been there. It's tempting and easy to throw a few quick graphs together and call it a day. </p>
<p>I used to work in finance, and I was once tasked with building a model to predict how likely borrowers were to default on loans they had taken out. </p>
<p>It was a challenging problem, and I spent around 6 weeks creating a sophisticated model that predicted defaults based on all kinds of data about the borrower's income level, where they lived, how big their loan was, and what their interest rate was.</p>
<p>I was pretty pleased that the model I created worked really well! I knew our clients would find a lot of value in this model once we got it implemented into our platform.</p>
<p>Before that could happen, though, I needed to summarize my results and share them with the rest of the company. At the time, I didn't see this as an opportunity to advocate for my work and its benefits to our clients. Instead, I saw it as an annoying obligation that I had to do on top of my already extensive analysis.</p>
<p>I knew my analysis was good and that this change was valuable. I was sure other people understood that as well. </p>
<p>So I threw some graphs together, gave a quick presentation on the topic to a room full of glazed-over eyes, and went back to my desk.</p>
<p>It took weeks for us to build this into the platform, when it could have been done in a matter of days if there had been sufficient motivation.</p>
<p>Whose fault was that? At the time, it was easy for me to blame the engineering team for the slow implementation. But the reality is, <strong>it was my fault!</strong></p>
<p>Everybody else is busy with their own work, and for the most part, they don't really know what it is you do all day. You're often so deep in the weeds of your own analysis that things you think are simple and obvious aren't even on the minds of everybody else. That's why it's important to treat every presentation and every graph you share as an opportunity to educate others and inspire them to move forward with an action.</p>
<p>I'd love to say that I immediately changed my presentation and graphing style after that experience, but I didn't. It took me years to develop the knowledge and skills to do this effectively.</p>
<p>But it doesn't have to take years! I'm here to help you learn <em>today</em> how to effectively use graphs to communicate ideas and drive change. </p>
<p>If you implement these ideas consistently, you <strong>will</strong> see improvements in the impact of your work and your influence in the organization.</p>
<p>Now, let's get into the checklist. And remember, if you want to keep a copy of this for yourself so you'll always have it to refer back to, you can get that here:</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step Checklist to Graphing for Impact</a></p>
<h2>Before you graph</h2>
<h4>1. Decide who this graph is for</h4>
<p>In order to create an effective graph, you need to know who will be using this information. Many of your design decisions stem from this key point. If you understand your audience’s background, goals, and challenges, you’ll be far more effective in creating a graph to help them make a decision, which is what this is all about! In particular, remember that this graph is not for you. You have all kinds of specialized knowledge that your audience likely does not have. You need your graph to appeal to them.</p>
<h4>2. Structure your graph to answer a question</h4>
<p>Your graphs should answer an important question that your audience has. “How have our revenue numbers changed over time?” or “Which of our services has the lowest level of customer satisfaction?” Design your graph to answer their question, not just to explore data. This serves two purposes.</p>
<ol>
<li>It gives people a reason to pay attention. If you're answering a big question they have, they're going to listen to you.</li>
<li>It provides a clear path from <strong>data</strong> to <strong>action</strong>, which is ultimately what you want. Remember: the entire point of this field is to extract insights from data to <strong>help businesses make better decisions!</strong></li>
</ol>
<h4>3. Decide which type of graph to use</h4>
<p>The graph you use will depend on the data and the question you are answering.</p>
<ul>
<li><strong><a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">Line Graph</a></strong>: Use line graphs to track trends over time or show the relationship between two variables.</li>
<li><strong><a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">Bar Charts</a></strong>: Use bar charts to compare quantitative data across multiple categories.</li>
<li><strong><a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">Scatter Plots</a></strong>: Use scatter plots to assess the relationship between two variables.</li>
<li><strong>Pie Charts</strong>: Use pie charts to show parts of a whole. I personally do not use pie charts, and I advise you to be very careful with them. If you must use them, limit the number of categories, as more than 3 or 4 makes them unreadable.</li>
</ul>
<h4>4. Decide how to handle outliers</h4>
<p>Outliers are an inevitability. You need to decide how to handle this. Sometimes, the outlier itself can be a point of focus in your graph that you want to highlight. Other times, it can be a distraction from your message that you would prefer to remove. Before removing an outlier, think critically about why the outlier exists and make a judgment call as to whether removing it helps to clarify your point without being misleading.</p>
<h2>Building Your Graph</h2>
<h4>5. Remove unnecessary data</h4>
<p>Your audience should be able to clearly understand the point of your graph. Excessive and unnecessary data can distract from this goal. Decide what is necessary to answer your question and cut the rest. </p>
<h4>6. Don't be misleading</h4>
<p>There are many ways that graphs can be misleading, either intentionally or unintentionally. These two seem to come up most frequently:</p>
<p>If you’re using a bar chart, the baseline for the y-axis must start at 0. Otherwise, your graph will be misleading, exaggerating the apparent differences across the categories.</p>
<p>Your titles and captions should accurately describe your data. Titles and captions are a great way to bring salience to your graph, but you need to ensure the text reinforces what the data says, rather than changing the message.</p>
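<p>The good news is that <code>ggplot</code> bar charts start at 0 by default, so the misleading version takes extra effort to produce. This sketch shows both the honest default and the truncated-axis version to avoid (the axis limits here are arbitrary, chosen only to demonstrate the effect):</p>

```r
library(tidyverse)

# Average horsepower by cylinder count from the built-in mtcars dataset
avg_hp <- mtcars %>% group_by(cyl) %>% summarise(hp = mean(hp))

p <- ggplot(avg_hp) +
  geom_bar(aes(x = factor(cyl), y = hp), stat = 'identity')

# Honest: bars start at 0, the geom_bar default
p

# Misleading: truncating the y-axis exaggerates the differences
p + coord_cartesian(ylim = c(75, 215))
```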
<h2>Styling Your Graph</h2>
<h4>7. Decide on an appropriate color palette</h4>
<p>Color is an important and often-neglected aspect of graphs. For single-color graphs, choose a color that’s related to your organization’s brand or thematically related to the graph (for example, green for forestry data). For multicolor graphs, use Color Picker for Data, an excellent tool for building visually pleasing color palettes.</p>
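<p>Once you've picked a palette, applying it in <code>ggplot</code> is straightforward with <code>scale_fill_manual</code>. The hex codes below are placeholders to illustrate the mechanics; swap in your organization's brand colors or a palette from a tool like the one above:</p>

```r
library(tidyverse)

# Hypothetical palette -- replace these hex codes with your own
my_palette <- c('#1b9e77', '#d95f02', '#7570b3')

p <- ggplot(mpg) +
  geom_bar(aes(x = class, fill = drv)) +
  scale_fill_manual(values = my_palette)
p
```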
<h4>8. Make all of your axis titles and labels horizontal</h4>
<p>All of the axis titles and labels in your graph should be horizontal. <a href="https://michaeltoth.me/one-step-to-quickly-improve-the-readability-and-visual-appeal-of-ggplot-graphs.html">Horizontal labels</a> greatly improve the readability and visual appeal of a graph. </p>
<h4>9. Adjust your titles, labels, and legend text</h4>
<p>Give your graph a compelling title, and add descriptive, well-formatted names to the axis titles and legend. A good choice for your graph title is to simply state the question you’re trying to answer. I also like to use a subtitle that drives home the message you want people to take away. You can use the <code>labs</code> function in ggplot to modify these labels.</p>
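<p>Here's a sketch of that advice in code, using the built-in <code>mpg</code> dataset. The title wording and caption are invented for illustration; the point is the question-style title plus a message-driven subtitle:</p>

```r
library(tidyverse)

# labs() sets the title, subtitle, axis names, and caption in one place
p <- ggplot(mpg) +
  geom_bar(aes(x = class)) +
  labs(title    = 'Which car classes are most common?',
       subtitle = 'SUVs and compacts dominate the mpg dataset',
       x        = 'Car class',
       y        = 'Count',
       caption  = 'Source: ggplot2 mpg dataset')
p
```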
<h4>BONUS: Add your company logo and branding</h4>
<p>If you’re sharing this graph with clients or the public, <a href="https://michaeltoth.me/you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html">adding your company logo and branding</a> elements can help your graph to stand out and to build credibility for your organization. This is great for you, because it will help you to grow your own influence and visibility within the company. Read my guide on this subject for more details on how to implement this tip.</p>
<h2>Exporting Your Graph</h2>
<h4>10. Save your graph in a readable high-resolution format</h4>
<p>Think about how your graph is going to be read. Will it be online, printed, or in a slide for a presentation? Each format may require different adjustments to text and graph sizing to be readable. Be sure to test for yourself to ensure you can read your graph in its final format. This will avoid frustrating reworks or--even worse--sharing an unreadable graph! Use the <code>ggsave</code> function to save your graph and modify the resolution. Then, adjust sizes until you’re satisfied with the final result.</p>
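<p>A minimal <code>ggsave</code> sketch follows. The file name, dimensions, and DPI below are starting points, not rules; adjust them until the text is readable in your final medium:</p>

```r
library(tidyverse)

p <- ggplot(mpg) + geom_bar(aes(x = class))

# Save at high resolution; width/height are in inches by default,
# and dpi = 300 is a common choice for print-quality output
ggsave('class_counts.png', plot = p, width = 8, height = 5, dpi = 300)
```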
<h2>Conclusion</h2>
<p>Some of these tips may have been obvious, and others may have seemed like revelations. The important thing is to think through these steps and apply them <em>consistently</em> with every graph you produce. I promise you that if you incorporate this checklist into your workflow, you're going to see a big change in how people respond to your analysis at work.</p>
<p>Remember to get a copy of this graphing checklist so you can be sure to go through it every time!</p>
<p><a href="https://mailchi.mp/114dc86f2f2b/graphing-checklist">Get Your Free 10-Step Checklist to Graphing for Impact</a></p>One Step to Quickly Improve the Readability and Visual Appeal of ggplot Graphs2019-05-03T06:00:00-04:00Michael Tothtag:michaeltoth.me,2019-05-03:one-step-to-quickly-improve-the-readability-and-visual-appeal-of-ggplot-graphs.html<p>There's something wonderful about a graph that communicates a point clearly. You know it when you see it. It's the kind of graph that makes you pause and say 'wow!'. </p>
<p>There are all kinds of different graphs that fit this description, but they usually have a few things in common:</p>
<ul>
<li>Clarity: The message of the graph is clear</li>
<li>Simplicity: Extraneous details are removed</li>
<li>Visual appeal: The graph should be pleasing to look at</li>
</ul>
<p>Of course, your graph also needs to be communicating something worthwhile. But I see so many graphs that ultimately fall short of their potential because they miss one or more of these three points! </p>
<p>I've been there myself. Some of my <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html">earliest</a> <a href="https://michaeltoth.me/plotting-the-evolution-of-the-us-treasuryyield-curve.html">graphs</a> in R fall short, in retrospect. But the key to improving is to keep learning new things and keep getting better over time.</p>
<p>It seems like many people learn how to create basic <a href="https://michaeltoth.me/detailed-guide-to-the-bar-chart-in-r-with-ggplot.html">bar charts</a>, <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, and <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in R, and then stop developing their skills further. But you shouldn't stop there! </p>
<p>I don't think most people are doing this intentionally. In fact, I think most people simply don't know what's possible and what they should be aiming for. </p>
<p>If the only graphs you've ever seen are basic examples from statistics textbooks or code documentation, how would you possibly know that you can do better? How would you know that you can create graphs that capture attention, drive action, or inspire awe? <strong>You wouldn't</strong>.</p>
<p>I want to teach you how to make graphs that get your point across with clarity, simplicity, and visual appeal. There are quick fixes you can make to your graphs <strong>right now</strong> that will get you much closer to making that a reality.</p>
<p>Today, I'm going to show you how you can use axis text rotation to greatly improve both the readability and visual appeal of your graphs. </p>
<p><center>
<img alt="Drake Does Not Like Vertical Axis Labels in ggplot" src="../images/20190503_rotate_labels/drake.png" width="600px" />
</center></p>
<p>Are you ready? Let's go!</p>
<h2>Creating a Base Graph to Work From</h2>
<p>To start, let's load in the libraries we'll be using throughout this post: <code>tidyverse</code> (for graphing and data manipulation), and <code>hrbrthemes</code> (a useful package that I use to improve the styling of my graphs).</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
</pre></div>
<p>For this post, we'll be using the <code>mtcars</code> dataset to illustrate these graphing techniques. Here, I group the cars in that dataset by the number of cylinders in their engines (4, 6, or 8), and then calculate the average horsepower for each group.</p>
<div class="highlight"><pre><span></span><span class="c1"># Calculate average horsepower for cars with 4-, 6-, and 8-cylinder engines</span>
hp_by_cyl <span class="o"><-</span> mtcars <span class="o">%>%</span> group_by<span class="p">(</span>cyl<span class="p">)</span> <span class="o">%>%</span>
summarise<span class="p">(</span>avg_hp <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hp<span class="p">))</span>
</pre></div>
<p>Finally, I create a simple bar chart in <code>ggplot</code> to show this data. Let's review this code briefly, so we're all on the same page:</p>
<ul>
<li>The first two lines (<code>ggplot</code> and <code>geom_bar</code>) are what creates the base bar chart. </li>
<li>The next 5 lines use the <code>labs</code> function to assign labels to the graph. </li>
<li>The last line uses <code>theme_ipsum</code> from the <code>hrbrthemes</code> package to apply some nice styling to the graph.</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Creating a base graph without formatting the axis text</span>
g <span class="o"><-</span> ggplot<span class="p">(</span>hp_by_cyl<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">),</span> y <span class="o">=</span> avg_hp<span class="p">),</span> stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Average Horsepower for Cars with 4-, 6-, and 8-Cylinder Engines'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Based on Data for 32 Cars from the 1974 Motor Trend Magazine'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'Cylinders'</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Horsepower'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>axis_title_size <span class="o">=</span> <span class="m">12</span><span class="p">)</span>
g
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/first_graph-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph is fine, for the most part. But I don't aim for fine, and you shouldn't either! </p>
<p>I have a high attention to detail for graphing, and I want my graphs to be excellent. The easier I can make it for people to read and understand a graph, the better job I'll do of convincing them to move forward with a particular course of action. </p>
<h2>Rotate Your Y-Axis Title To Improve Readability</h2>
<p>There are several things we could do to improve this graph, but in this guide let's focus on rotating the y-axis label. This simple change will make your graph <strong>so much better</strong>. That way, people won't have to tilt their heads like me to understand what's going on in your graph:</p>
<p><center>
<img alt="Sideways head tilt to read ggplot axis title" src="../images/20190503_rotate_labels/sideways.jpg" width="600px" />
</center></p>
<p>That's not a look you want. Luckily, it's super easy to rotate your axis title in <code>ggplot</code>! To do this, we'll modify some parameters using <code>ggplot</code>'s <code>theme</code> function, which can also be used to adjust all kinds of things in your graph like axis labels, gridlines, and text sizing. </p>
<p>Here, we specifically want to adjust the y-axis, which we can do using the <code>axis.title.y</code> parameter. To adjust a text element, we use <code>element_text</code>. You can use <code>element_text</code> to adjust things like font, color, and size. But here, we're interested in rotation, so we're going to use <code>angle</code>. Setting the <code>angle</code> to 0 will make the y-axis text horizontal. Take a look: </p>
<div class="highlight"><pre><span></span><span class="c1"># Modifying the graph from before (stored as g), to make the text horizontal</span>
g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">0</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/rotate_labels-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>It's a minor change, but this graph is <strong>far more readable</strong> and <strong>more visually appealing</strong> than the graph we had before. That's what we're going for. Simplicity, clarity, and visual appeal.</p>
<h2>More on Text Rotation in <code>ggplot</code></h2>
<p>As we just saw, when you need to rotate text in <code>ggplot</code>, you can accomplish this by adjusting the <code>angle</code> within <code>element_text</code>. Here we did this to adjust the axis title, but this works the same way for any text that you want to rotate in ggplot.</p>
<p>The <code>angle</code> parameter can take any value between 0 and 360, corresponding to the angle of rotation from a horizontal baseline. To briefly illustrate how different angle values work in ggplot, take a look at the following graph, where I explore four different angle rotations:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>gridExtra<span class="p">)</span>
<span class="c1"># 0-Degree angle</span>
g1 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">0</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 0 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 0))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 90-Degree angle</span>
g2 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">90</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 90 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 90))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 180-Degree angle</span>
g3 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">180</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 180 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Using theme(axis.title.y = element_text(angle = 180))'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="c1"># 270-Degree angle</span>
g4 <span class="o"><-</span> g <span class="o">+</span> theme<span class="p">(</span>axis.title.y <span class="o">=</span> element_text<span class="p">(</span>angle <span class="o">=</span> <span class="m">270</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Y-Axis Title at 270 degrees'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'theme(axis.title.y = element_text(angle = 270))'</span><span class="p">)</span>
<span class="c1"># Add all graphs to a grid</span>
grid.arrange<span class="p">(</span>g1<span class="p">,</span> g2<span class="p">,</span> g3<span class="p">,</span> g4<span class="p">,</span> nrow <span class="o">=</span> <span class="m">2</span><span class="p">,</span> ncol <span class="o">=</span> <span class="m">2</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190503_Rotate_Labels/axis_rotations-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You should now have a better understanding of how you can use axis title rotation to improve the readability and visual appeal of your graphs in <code>ggplot</code>! </p>
<p>You really need to think about minor details like this, especially when you're going to be using a graph in a presentation or a report to a broader audience. Small details can really improve your graph, which in turn will make it easier for you to educate your audience, convince people of your conclusions, and drive change in your organization. </p>
<hr />
<p>I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.</p>
<p><a href="http://eepurl.com/gmYioz">Sign up here to receive my best tips</a></p>Detailed Guide to the Bar Chart in R with ggplot2019-05-01T05:40:00-04:00Michael Tothtag:michaeltoth.me,2019-05-01:detailed-guide-to-the-bar-chart-in-r-with-ggplot.html<p>When it comes to data visualization, flashy graphs can be fun. Believe me, I'm as big a fan of <a href="https://michaeltoth.me/how-to-create-a-bar-chart-race-in-r-mapping-united-states-city-population-1790-2010.html">flashy</a> <a href="https://michaeltoth.me/mapping-legal-marijuana-states-and-medical-marijuana-states-1995-2019.html">graphs</a> as anybody. But if you're trying to convey information, especially to a broad audience, flashy isn't always the way to go. </p>
<p>Whether it's the <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graph</a>, <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plot</a>, or bar chart (the subject of this guide!), choosing a well-understood and common graph style is usually the way to go for most audiences, most of the time. And if you're just getting started with your R journey, it's important to master the basics before complicating things further.</p>
<p>So in this guide, I'm going to talk about creating a bar chart in R. Specifically, I'll show you exactly how you can use the <code>ggplot</code> <code>geom_bar</code> function to create a bar chart.</p>
<p>A bar chart is a graph that is used to show comparisons across discrete categories. One axis--the x-axis throughout this guide--shows the categories being compared, and the other axis--the y-axis in our case--represents a measured value. The heights of the bars are proportional to the measured values.</p>
<p><center>
<img alt="Bar Chart of Things That Put Your Life in Danger" src="../images/20190501_geom_bar/got.jpg" width="600px" />
</center></p>
<p>For example, in this extremely scientific bar chart, we see the level of life threatening danger for three different actions. All dangerous, to be sure, but I think we can all agree this graph gets things right in showing that Game of Thrones spoilers are most dangerous of all. </p>
<h2>Introduction to ggplot</h2>
<p>Before diving into the <code>ggplot</code> code to create a bar chart in R, I first want to briefly explain <code>ggplot</code> and why I think it's the best choice for graphing in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Follow Along With the Workbook</h2>
<p>To accompany this guide, I've created a <a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">free workbook</a> that you can work through to apply what you're learning as you read. </p>
<p>The workbook is an R file that contains all the code shown in this post as well as additional guided questions and exercises to help you understand the topic even deeper.</p>
<p>If you want to really learn how to create a bar chart in R so that you'll still remember weeks or even months from now, you need to practice. </p>
<p>So Download the workbook now and practice as you read this post!</p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>
<h2>Investigating our dataset</h2>
<p>Throughout this guide, we'll be using the <code>mpg</code> dataset that's built into ggplot. This dataset contains data on fuel economy for 38 popular car models. Let's take a look:</p>
<p><center>
<img alt="A snippet of the mpg dataset" src="../images/20190501_geom_bar/mpg.png" width="400px" />
</center></p>
<p>The mpg dataset contains 11 columns: </p>
<ul>
<li><code>manufacturer</code>: Car Manufacturer Name</li>
<li><code>model</code>: Car Model Name</li>
<li><code>displ</code>: Engine Displacement (liters)</li>
<li><code>year</code>: Year of Manufacture</li>
<li><code>cyl</code>: Number of Cylinders</li>
<li><code>trans</code>: Type of Transmission</li>
<li><code>drv</code>: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd</li>
<li><code>cty</code>: City Miles per Gallon</li>
<li><code>hwy</code>: Highway Miles per Gallon</li>
<li><code>fl</code>: Fuel Type</li>
<li><code>class</code>: Type of Car</li>
</ul>
<h2>How to create a simple bar chart in R using <code>geom_bar</code></h2>
<p><code>ggplot</code> uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I have talked about <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html"><code>geom_line</code> for line graphs</a> and <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html"><code>geom_point</code> for scatter plots</a>. Today I'll be focusing on <code>geom_bar</code>, which is used to create bar charts in R. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/simple_bar_chart-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible <code>ggplot</code> bar chart we can create using <code>geom_bar</code>. Let's review this in more detail:</p>
<p>First, we call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. Basically, this creates a blank canvas on which we'll add our data and graphics. Here we pass mpg to <code>ggplot</code> to indicate that we'll be using the mpg data for this particular <code>ggplot</code> bar chart.</p>
<p>Next, we add the <code>geom_bar</code> call to the base <code>ggplot</code> graph in order to create this bar chart. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use <code>class</code> as the x-axis variable for the bar chart.</p>
<p>You'll note that we don't specify a y-axis variable here. Later on, I'll tell you how we can modify the y-axis for a bar chart in R. But for now, just know that if you don't specify anything, <code>ggplot</code> will automatically count the occurrences of each x-axis category in the dataset, and will display the <code>count</code> on the y-axis.</p>
<p>And that's it, we have our bar chart! We see that SUVs are the most prevalent in our data, followed by compact and midsize cars. </p>
<h2>Changing bar color in a <code>ggplot</code> bar chart</h2>
<p>Expanding on this example, let's change the colors of our bar chart!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this <code>geom_bar</code> call is identical to the one before, except that we've added the modifier <code>fill = 'blue'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
<p>If you're familiar with line graphs and scatter plots in ggplot, you've seen that in those cases we changed the color by specifying <code>color = 'blue'</code>, while in this case we're using <code>fill = 'blue'</code>. </p>
<p>In ggplot, <code>color</code> is used to change the <em>outline</em> of an object, while <code>fill</code> is used to <em>fill the inside</em> of an object. For objects like points and lines, there is no inside to fill, so we use <code>color</code> to change the color of those objects. With bar charts, the bars <em>can be filled</em>, so we use <code>fill</code> to change the color with <code>geom_bar</code>. </p>
<p>This distinction between <code>color</code> and <code>fill</code> gets a bit more complex, so stick with me to hear more about how these work with bar charts in ggplot!</p>
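<p>To see the distinction side by side, here's a quick sketch that sets both aesthetics on the same bars; the specific colors are arbitrary:</p>

```r
library(tidyverse)

# fill colors the inside of each bar; color draws the outline around it
p <- ggplot(mpg) +
  geom_bar(aes(x = class), fill = 'lightblue', color = 'black')
p
```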
<h2>Mapping bar color to a variable in a <code>ggplot</code> bar chart</h2>
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 2 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you've read my previous <code>ggplot</code> guides, this bit should look familiar!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> drv<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph shows the same data as before, but now instead of showing solid-colored bars, we see that the bars are stacked with 3 different colors! The red portion corresponds to 4-wheel drive cars, the green to front-wheel drive cars, and the blue to rear-wheel drive cars. Did you catch the 2 changes we made to the code? They were:</p>
<ol>
<li>Instead of specifying <code>fill = 'blue'</code>, we specified <code>fill = drv</code></li>
<li>We moved the fill parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Before, we told <code>ggplot</code> to change the color of the bars to blue by adding <code>fill = 'blue'</code> to our <code>geom_bar()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our bars, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>drv</code> column to the <code>fill</code> aesthetic. </p>
<p>This means we are telling <code>ggplot</code> to use a different color for each value of <code>drv</code> in our data! This mapping also lets <code>ggplot</code> know that it needs to create a legend to identify the drive types, and it adds that legend to the graph automatically!</p>
<h3>More Details on Stacked Bar Charts in <code>ggplot</code></h3>
<p>As we saw above, when we map a variable to the <code>fill</code> aesthetic in <code>ggplot</code>, it creates what's called a stacked bar chart. A stacked bar chart is a variation on the typical bar chart where a bar is divided among a number of different segments. </p>
<p>In this case, we're dividing the bar chart into segments based on the levels of the <code>drv</code> variable, corresponding to the front-wheel, rear-wheel, and four-wheel drive cars.</p>
<p>For a given <code>class</code> of car, our stacked bar chart makes it easy to see how many of those cars fall into each of the 3 <code>drv</code> categories. </p>
<p>The main flaw of stacked bar charts is that they become harder to read the more segments each bar has, especially when trying to make comparisons across the x-axis (in our case, across car <code>class</code>). To illustrate, let's take a look at this next example: </p>
<div class="highlight"><pre><span></span><span class="c1"># Note we convert the cyl variable to a factor to fill properly</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/stacked_bar-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>As you can see, even with four segments it starts to become difficult to make comparisons between the different categories on the x-axis. For example, are there more 6-cylinder minivans or 6-cylinder pickups in our dataset? What about 5-cylinder compacts vs. 5-cylinder subcompacts? With stacked bars, these types of comparisons become challenging. <strong>My recommendation is to generally avoid stacked bar charts with more than 3 segments</strong>.</p>
<h3>Dodged Bars in ggplot</h3>
<p>Instead of stacked bars, we can use side-by-side (dodged) bar charts. In ggplot, this is accomplished by using the <code>position = position_dodge()</code> argument as follows:</p>
<div class="highlight"><pre><span></span><span class="c1"># Note we convert the cyl variable to a factor here in order to fill by cylinder</span>
ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="kp">factor</span><span class="p">(</span>cyl<span class="p">)),</span> position <span class="o">=</span> position_dodge<span class="p">(</span>preserve <span class="o">=</span> <span class="s">'single'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/dodged_bar-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Now, the different segments for each class are placed side-by-side instead of stacked on top of each other. </p>
<p>Revisiting the comparisons from before, we can quickly see that there are an equal number of 6-cylinder minivans and 6-cylinder pickups. There are also an equal number of 5-cylinder compacts and subcompacts. </p>
<p>While these comparisons are easier with a dodged bar graph, comparing the total count of cars in each class is far more difficult. </p>
<p>Which brings us to a general point: different graphs serve different purposes! You shouldn't try to accomplish too much in a single graph. If you're trying to cram too much information into a single graph, you'll likely confuse your audience, and they'll take away exactly none of the information. </p>
<h2>Scaling bar size to a variable in your data</h2>
<p>Up to now, all of the bar charts we've reviewed have scaled the height of the bars based on the count of a variable in the dataset. First we counted the number of vehicles in each <code>class</code>, and then we counted the number of vehicles in each <code>class</code> with each <code>drv</code> type. </p>
<p>What if we don't want the height of our bars to be based on count? What if we already have a column in our dataset that we want to use as the y-axis height? Let's say we wanted to graph the average highway miles per gallon by <code>class</code> of car, for example. How can we do that in ggplot?</p>
<p>There are two ways we can do this, and I'll be reviewing them both. To start, I'll introduce <code>stat = 'identity'</code>:</p>
<div class="highlight"><pre><span></span><span class="c1"># Use dplyr to calculate the average hwy_mpg by class</span>
by_hwy_mpg <span class="o"><-</span> mpg <span class="o">%>%</span> group_by<span class="p">(</span><span class="kp">class</span><span class="p">)</span> <span class="o">%>%</span> summarise<span class="p">(</span>hwy_mpg <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hwy<span class="p">))</span>
ggplot<span class="p">(</span>by_hwy_mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> y <span class="o">=</span> hwy_mpg<span class="p">),</span> stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/stat_identity-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Now we see a graph by <code>class</code> of car where the y-axis represents the average highway miles per gallon of each <code>class</code>. How does this work, and how is it different from what we had before?</p>
<p>Before, we did not specify a y-axis variable and instead let <code>ggplot</code> automatically populate the y-axis with a count of our data. Now, we're explicitly telling <code>ggplot</code> to use <code>hwy_mpg</code> as our y-axis variable. And there's something else here also: <code>stat = 'identity'</code>. What does that mean?</p>
<p>We saw earlier that if we omit the y-variable, <code>ggplot</code> will automatically scale the heights of the bars to a count of cases in each group on the x-axis. If we instead want the values to come from a column in our data frame, we need to change two things in our <code>geom_bar</code> call:</p>
<ol>
<li>Add <code>stat = 'identity'</code> to <code>geom_bar()</code></li>
<li>Add a y-variable mapping</li>
</ol>
<p>Adding a y-variable mapping alone without adding <code>stat='identity'</code> leads to an error message:</p>
<p><center>
<img alt="Bar chart without stat identity error message" src="../images/20190501_geom_bar/error1.png" width="600px" />
</center></p>
<p>Why the error? If you don't specify <code>stat = 'identity'</code>, then under the hood, <code>ggplot</code> is automatically passing a default value of <code>stat = 'count'</code>, which graphs the counts by group. A y-variable is not compatible with this, so you get the error message. </p>
<p>If this is confusing, that's okay. For now, all you need to remember is that if you want to use <code>geom_bar</code> to map the heights of a column in your dataset, you need to add <strong>BOTH</strong> a y-variable mapping <strong>AND</strong> <code>stat = 'identity'</code>.</p>
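<p>To make the default behavior explicit, the two calls below produce identical graphs, because <code>'count'</code> is the default <code>stat</code> for <code>geom_bar</code>. This is a minimal sketch assuming <code>ggplot2</code> (which ships with the <code>mpg</code> dataset) is installed:</p>

```r
library(ggplot2)  # mpg ships with ggplot2

# These two calls are equivalent: 'count' is geom_bar's default stat
ggplot(mpg) + geom_bar(aes(x = class))
ggplot(mpg) + geom_bar(aes(x = class), stat = 'count')

# A y mapping only works once we switch to stat = 'identity'
# and supply pre-summarized data
```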
<p>I'll be honest, this was highly confusing for me for a long time. I hope this guidance helps to clear things up for you, so you don't have to suffer the same confusion that I did. But if you have a hard time remembering this distinction, <code>ggplot</code> also has a handy function that does this work for you. Instead of using <code>geom_bar</code> with <code>stat = 'identity'</code>, you can simply use the <code>geom_col</code> function to get the same result. Let's see:</p>
<div class="highlight"><pre><span></span><span class="c1"># Use dplyr to calculate the average hwy_mpg by class</span>
by_hwy_mpg <span class="o"><-</span> mpg <span class="o">%>%</span> group_by<span class="p">(</span><span class="kp">class</span><span class="p">)</span> <span class="o">%>%</span> summarise<span class="p">(</span>hwy_mpg <span class="o">=</span> <span class="kp">mean</span><span class="p">(</span>hwy<span class="p">))</span>
ggplot<span class="p">(</span>by_hwy_mpg<span class="p">)</span> <span class="o">+</span>
geom_col<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> y <span class="o">=</span> hwy_mpg<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/geom_col-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll notice the result is the same as the graph we made above, but we've replaced <code>geom_bar</code> with <code>geom_col</code> and removed <code>stat = 'identity'</code>. <code>geom_col</code> is the same as <code>geom_bar</code> with <code>stat = 'identity'</code>, so you can use whichever you prefer or find easier to understand. For me, I've gotten used to <code>geom_bar</code>, so I prefer to use that, but you can do whichever you like! </p>
<h2>Revisiting <code>color</code> in <code>geom_bar</code></h2>
<p>Above, we showed how you could change the color of bars in <code>ggplot</code> using the <code>fill</code> option. I mentioned that <code>color</code> is used for line graphs and scatter plots, but that we use <code>fill</code> for bars because we are filling the inside of the bar with color. That said, <code>color</code> does still work here, though it affects only the outline of the graph in question. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This creates a graph with bars filled in the standard gray, but outlined in blue. That outline is what <code>color</code> affects for bar charts in ggplot!</p>
<p>I personally only use <code>color</code> for one specific thing: modifying the outline of a bar chart where I'm already using <code>fill</code> to create a better looking graph with a little extra pop. The standard <code>fill</code> is fine for most purposes, but you can step things up a bit with a carefully selected <code>color</code> outline:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">),</span> fill <span class="o">=</span> <span class="s">'#003366'</span><span class="p">,</span> color <span class="o">=</span> <span class="s">'#add8e6'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/color_and_fill-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>It's subtle, but this graph uses a darker navy blue for the fill of the bars and a lighter blue for the outline that makes the bars pop a little bit. </p>
<p>This is the only time when I use <code>color</code> for bar charts in R. Do you have a use case for this? I'd love to hear it, so let me know in the comments!</p>
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We saw above how we can create graphs in <code>ggplot</code> that use the <code>fill</code> argument to map the <code>cyl</code> variable or the <code>drv</code> variable to the color of bars in a bar chart. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they include everything you see within the <code>aes()</code> in <code>ggplot</code>.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>I know this can sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_bar.</p>
<h3>Reviewing the list of geom_bar aesthetic mappings</h3>
<p>The main aesthetic mappings for a ggplot bar graph include:</p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>fill</code>: Map a variable to a bar color</li>
<li><code>color</code>: Map a variable to a bar outline color</li>
<li><code>linetype</code>: Map a variable to a bar outline linetype</li>
<li><code>alpha</code>: Map a variable to a bar transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code> and <code>fill</code> aesthetic mappings. We've also seen <code>color</code> applied as a parameter to change the outline of the bars in the prior example.</p>
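<p>As one quick taste, mapping a variable to one of the remaining aesthetics works the same way as the <code>fill</code> mappings above. Here's a minimal sketch mapping <code>drv</code> to <code>alpha</code> (note that <code>ggplot</code> will warn that alpha is not advised for discrete variables, but it still draws the graph):</p>

```r
library(ggplot2)

# Map drv to bar transparency; each drive type gets its own alpha level.
# fill stays outside aes() here because it's a fixed parameter, not a mapping.
ggplot(mpg) +
  geom_bar(aes(x = class, alpha = drv), fill = 'steelblue')
```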
<p>I'm not going to review the additional aesthetics in this post, but if you'd like more details, check out the free workbook which includes some examples of these aesthetics in more detail!</p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>I often hear from my R training clients that they are confused by the distinction between aesthetic mappings and parameters in ggplot. Personally, I was quite confused by this when I was first learning about graphing in ggplot as well. Let me try to clear up some of the confusion!</p>
<p>Above, we saw that we could use <code>fill</code> in two different ways with <code>geom_bar</code>. First, we were able to set the color of our bars to blue by specifying <code>fill = 'blue'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>drv</code> to the color of our bars by specifying <code>fill = drv</code> <em>inside</em> of our <code>aes()</code> mappings. </p>
<p>What is the difference between these two ways of working with <code>fill</code> and other aesthetic mappings?</p>
<p>When you include <code>fill</code>, <code>color</code>, or another aesthetic <em>inside the <code>aes()</code></em> of your <code>ggplot</code> code, you're telling <code>ggplot</code> to map a variable to that aesthetic in your graph. This is what we did when we said <code>fill = drv</code> above to fill different drive types with different colors.</p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with <code>fill</code> when we made the bar chart bars blue with <code>fill = 'blue'</code>. You also saw how we could outline the bars with a specific color when we used <code>color = '#add8e6'</code>. </p>
<p>Whenever you're trying to map a variable in your data to an aesthetic to your graph, you want to specify that <strong>inside the <code>aes()</code> function</strong>. And whenever you're trying to hardcode a specific parameter in your graph (making the bars blue, for example), you want to specify that <strong>outside the <code>aes()</code> function</strong>. I hope this helps to clear up any confusion you have on the distinction between aesthetic mappings and parameters! </p>
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was first learning R and ggplot, this difference between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>), and parameters (the ones <em>outside</em> your <code>aes()</code>) was constantly confusing me. Luckily, over time, you'll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>drv</code> variable to <code>fill</code>, you should include <code>fill = drv</code> within the <code>aes()</code> of your <code>geom_bar</code> call. What happens if you include it outside accidentally, and instead run <code>ggplot(mpg) + geom_bar(aes(x = class), fill = drv)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_bar error message" src="../images/20190501_geom_bar/error2.png" width="600px" />
</center></p>
<p>Whenever you see this error about object not found, be sure to check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>On the other hand, if we try including a specific parameter value (for example, <code>fill = 'blue'</code>) inside of the <code>aes()</code> mapping, the error is a bit less obvious. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mpg<span class="p">)</span> <span class="o">+</span>
geom_bar<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="kp">class</span><span class="p">,</span> fill <span class="o">=</span> <span class="s">'blue'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190426_ggplot_geom_bar/fill_blue_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a bar chart, but it's not what we intended. </p>
<p>For starters, the bars in our bar chart are all red instead of the blue we were hoping for! Also, there's a legend to the side of our bar graph that simply says 'blue'. </p>
<p>What's going on here? Under the hood, <code>ggplot</code> has taken the string 'blue' and created a new hidden column of data where every value simply says 'blue'. Then, it's <em>mapped</em> that column to the <code>fill</code> aesthetic, like we saw before when we specified <code>fill = drv</code>. That's why the legend label reads 'blue', while the bars themselves are colored, not blue, but <code>ggplot</code>'s default fill color.</p>
<p>If this is confusing, that's okay for now. Just remember: when you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
<p>You should now have a solid understanding of how to create a bar chart in R using the <code>ggplot</code> bar chart function, <code>geom_bar</code>! </p>
<h2>Solidify Your Understanding</h2>
<p>Experiment with the things you've learned to solidify your understanding. You can <a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">download my free workbook</a> with the code from this article to work through on your own.</p>
<p>I've found that working through code on my own is the best way for me to learn new topics so that I'll actually remember them when I need to do things on my own in the future. </p>
<p><a href="https://mailchi.mp/7502a8913249/workbook-ggplot-bar-chart">Download your free ggplot bar chart workbook!</a></p>Getting a Data Science Job is not a Numbers Game!2019-04-29T06:23:00-04:00Michael Tothtag:michaeltoth.me,2019-04-29:getting-a-data-science-job-is-not-a-numbers-game.html<p><center>
<img alt="Getting a data science job by throwing darts at a board" src="../images/20190429_data_science_job_numbers_game/darts.png" width="600px" />
</center></p>
<h2>My First (Non Data Science) Job Search</h2>
<p>Let me tell you a story about my first job search. It was 2010, and data science jobs weren't really a thing yet. I'll get to that in a minute, but bear with me first because there's a point to all this.</p>
<p>At the time, I was a junior at the University of Pennsylvania, where I was studying finance and statistics. </p>
<p>Every year, there was a months-long on-campus recruiting season where all of the students frantically applied to secure prestigious jobs and internships from big banks and other financial companies.</p>
<p>And I knew I <strong>NEEDED</strong> to get one of those jobs. </p>
<p>But unlike a lot of the other students at Penn, I didn't grow up in New York City. I was from a working class family in the midwest. So when I started my job search, I had no connections in the industry. Zero.</p>
<p>Obviously, my applications weren't going to be fast tracked. But more importantly, I had nobody to ask questions about the different job opportunities available. </p>
<p>It felt like everybody else in school understood all of the different types of jobs in finance and which ones to apply to.</p>
<p>Not me. I had no clue. But, what the hell, I figured. I was smart. I was near the top of my class! I had a 3.8 GPA and an extremely difficult and technical course load. Somebody would hire me! And I didn't really care what <strong>type of job</strong> I got, I just needed a job. So I applied to everything. </p>
<p>I applied to trading jobs. I applied to investment risk jobs. I applied to investment banking jobs. Over 100 jobs in total. And then... crickets. Most of the companies didn't even respond to me! It was demoralizing.</p>
<p>I was freaking out. I needed to land a job. My first job was extremely important, and it would set me up for my entire career. I worried if I didn't get a job now, I would have to move back to Ohio after I graduated, probably closing the door on a prestigious finance job forever.</p>
<h2>My Big Break</h2>
<p>Finally, after weeks of anguish, I got a big break. I had been asked to interview for one of the jobs I had applied to, a technical trading job at Allstate. </p>
<p>I was so excited. When I went to the interview, I met a guy named Mark. Mark was the one who had decided to interview me, and the hiring decision was ultimately his to make.</p>
<p>Mark and I got along well, from what I remember. He had a relatively senior role at the company, but his schooling background had been similar to mine. He'd studied finance and engineering, and he was looking for somebody smart with a strong combination of finance and technical skills.</p>
<p>He must have seen some potential in me, because shortly after the interview he offered me the job. Of the hundred-or-so jobs I applied to, this was the only offer I received.</p>
<h2>What I Learned</h2>
<p>So what's my point? I tell you your data science job search is not a numbers game, but then I tell you how I applied to over a hundred jobs to get a single offer. What gives?</p>
<p>Here's the thing: I got this job because I had the exact combination of finance and technical skills that Mark was looking for. Most of the other jobs I applied for, I never had a shot of getting, because I didn't have the right background. All of those applications, and all of my effort in applying, were a waste.</p>
<p>If I had instead focused on only applying to the technical finance jobs where I had a unique advantage, I'd have had a much higher success rate. </p>
<h2>Getting My First Data Science Job with a Strategic R Blog Post</h2>
<p>My experience during that application process changed how I thought about job searches. Years later, when I was trying to break into data science from my finance job at BlackRock, I took a different approach.</p>
<p>I knew I wanted to get into data science, but with no formal training and potentially hundreds of applicants for each position, I also knew I needed to stand out. So I scoured job boards searching for <strong>jobs where I knew my skills would give me an advantage</strong>. </p>
<p>I was very good at financial data manipulation, something the majority of data scientists know absolutely nothing about. </p>
<p>So when I found a financial technology startup specializing in online lending analytics that was looking for a data scientist, I knew it was the perfect opportunity for me. </p>
<p>What did I do? Did I send in my application and then just hope they contacted me?</p>
<p>Nope.</p>
<p>I wrote a <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html">detailed blog post</a> analyzing historical default rates for Lending Club loans. This was exactly the type of work I'd be expected to do at the company. I probably spent 10 hours doing the research and analysis and then writing that blog post. </p>
<p>What do you imagine happened? They called me in for an on-site interview. And I pretty much breezed through it. The interview was primarily to assess my cultural fit, not my data science skills, because I'd already proven to them I was capable!</p>
<h2>Avoid the Spray and Pray Approach when Applying for Data Science Jobs</h2>
<p>When you carefully select the companies you're going to apply to based on an alignment of their needs and your skills, you can dedicate more time to each application. That extra time is how you stand out in a pool of hundreds of job applicants for a single position. You could:</p>
<ul>
<li>Write a blog post showing your ability to do the work</li>
<li>Send them a detailed list of metrics they should be tracking to improve their business</li>
<li>Analyze a relevant public dataset and tell the company how they could incorporate that data into their product</li>
</ul>
<p>The specific thing you do will vary across companies and industries. But if you can do <strong>something</strong> to add value and differentiate yourself from the hundreds of other applications, you will vastly improve your chances of getting a job.</p>
<p><strong>ALWAYS REMEMBER THIS</strong>: The job you're applying for exists because the company has a problem that they're trying to solve. They're not looking for a generic person with a generic set of data analysis skills. They're looking for a <strong>specific person</strong> that can help them solve their problems. </p>
<p>That doesn't mean you need to know everything, but it does mean that you should lean into your specific strengths when deciding where to apply. If you can show the company that you can solve their problems, you become a top-5 candidate immediately. </p>
<p>For your next data science job search, don't apply for a hundred jobs. Find 10 jobs where you bring unique skills to the table, and try to do something that demonstrates your unique skills to those employers. I promise you'll find far more success with this approach.</p>
<hr />
<p>I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.</p>
<p><a href="http://eepurl.com/gmYioz">Sign up here to receive my best tips</a></p>A Detailed Guide to the ggplot Scatter Plot in R2019-04-24T08:05:00-04:00Michael Tothtag:michaeltoth.me,2019-04-24:a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html<p>When it comes to data visualization, flashy graphs can be fun. But if you're trying to convey information, especially to a broad audience, flashy isn't always the way to go. </p>
<p>Last week I showed how to work with <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">line graphs</a> in R. </p>
<p>In this article, I'm going to talk about creating a scatter plot in R. Specifically, we'll be creating a <code>ggplot</code> scatter plot using <code>ggplot</code>'s <code>geom_point</code> function. </p>
<p>A scatter plot is a two-dimensional data visualization that uses points to graph the values of two different variables - one along the x-axis and the other along the y-axis. Scatter plots are often used when you want to assess the relationship (or lack of relationship) between the two variables being plotted.</p>
<h5>Scatter Plot of Adam Sandler Movies from FiveThirtyEight</h5>
<p><center>
<img alt="FiveThirtyEight Scatter Plot of Adam Sandler Movies" src="../images/20190422_geom_point/sandler.png" width="600px" />
</center></p>
<p>For example, in this graph, FiveThirtyEight plots Rotten Tomatoes ratings against box office gross for a series of Adam Sandler movies. They've additionally grouped the movies into 3 categories, highlighted in different colors.</p>
<h5>The Famous Gapminder Scatter Plot of Life Expectancy vs. Income by Country</h5>
<p><center>
<img alt="Gapminder Scatter Plot of Life Expectancy vs. Income by Country" src="../images/20190422_geom_point/gapminder.png" width="600px" />
</center></p>
<p>This scatter plot, initially created by Hans Rosling, is famous among data visualization practitioners. It graphs the life expectancy vs. income for countries around the world. It also uses the size of the points to map country population and the color of the points to map continents, adding 2 additional variables to the traditional scatter plot. </p>
<p>Hans Rosling used a famously provocative and animated presentation style to make this data come alive. He used his presentations to advocate for sustainable global development through the Gapminder Foundation. </p>
<p>Hans Rosling's example shows how simple graphic styles can be powerful tools for communication and change when used properly! Convinced? Let's dive into this guide to creating a ggplot scatter plot in R!</p>
<h2>Follow Along With the Workbook</h2>
<p>I've created a <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">free workbook</a> to help you apply what you're learning as you read. </p>
<p>The workbook is an R file that contains all the code shown in this post as well as additional questions and exercises to help you understand the topic even deeper.</p>
<p>If you want to really learn how to create a scatter plot in R so that you'll still remember weeks or even months from now, you need to practice. </p>
<p>So <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">Download the workbook now</a> and practice as you read this post!</p>
<h2>Introduction to ggplot</h2>
<p>Before we get into the ggplot code to create a scatter plot in R, I want to briefly touch on <code>ggplot</code> and why I think it's the best choice for plotting graphs in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Investigating our dataset</h2>
<p>Throughout this post, we'll be using the <code>mtcars</code> dataset that's built into R. This dataset contains details of design and performance for 32 cars. Let's take a look to see what it looks like:</p>
<p><center>
<img alt="A snippet of the mtcars dataset" src="../images/20190422_geom_point/mtcars.png" width="400px" />
</center></p>
<p>The mtcars dataset contains 11 columns: </p>
<ul>
<li><code>mpg</code>: Miles/(US) gallon</li>
<li><code>cyl</code>: Number of cylinders</li>
<li><code>disp</code>: Displacement (cu.in.)</li>
<li><code>hp</code>: Gross horsepower</li>
<li><code>drat</code>: Rear axle ratio</li>
<li><code>wt</code>: Weight (1000 lbs)</li>
<li><code>qsec</code>: 1/4 mile time</li>
<li><code>vs</code>: Engine (0 = V-shaped, 1 = straight)</li>
<li><code>am</code>: Transmission (0 = automatic, 1 = manual)</li>
<li><code>gear</code>: Number of forward gears</li>
<li><code>carb</code>: Number of carburetors</li>
</ul>
<h2>How to create a simple scatter plot in R using geom_point()</h2>
<p><code>ggplot</code> uses geoms, or geometric objects, to form the basis of different types of graphs. Previously I talked about <a href="https://michaeltoth.me/a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html">geom_line</a>, which is used to produce line graphs. Today I'll be focusing on <code>geom_point</code>, which is used to create scatter plots in R. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/simple_scatter_plot-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible <code>ggplot</code> scatter plot we can create using <code>geom_point</code>. Let's review this in more detail:</p>
<p>First, I call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. It's essentially a blank canvas on which we'll add our data and graphics. In this case, I passed mtcars to <code>ggplot</code> to indicate that we'll be using the mtcars data for this particular <code>ggplot</code> scatter plot.</p>
<p>Next, I added my <code>geom_point</code> call to the base <code>ggplot</code> graph in order to create this scatter plot. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use <code>wt</code> as the x-axis variable and <code>mpg</code> as the y-axis variable. </p>
<p>And that's it, we have our scatter plot! It shows that, on average, as the weight of cars increases, miles-per-gallon tends to fall. </p>
<h2>Changing point color in a ggplot scatter plot</h2>
<p>Expanding on this example, we can now play with colors in our scatter plot.</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">),</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this <code>geom_point</code> call is identical to the one before, except that we've added the modifier <code>color = 'blue'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result. If you've read my previous <code>ggplot</code> guides, this bit should look familiar!</p>
<div class="highlight"><pre><span></span>mtcars<span class="o">$</span>am <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>mtcars<span class="o">$</span>am<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> color <span class="o">=</span> am<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/color_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This graph shows the same data as before, but now there are two different colors! The red dots correspond to automatic transmission vehicles, while the blue dots represent manual transmission vehicles. Did you catch the 3 changes we used to change the graph? They were:</p>
<ol>
<li>First, we converted the <code>am</code> variable to a factor. What do you think happens if we don't do this? Give it a try!</li>
<li>Instead of specifying <code>color = 'blue'</code>, we specified <code>color = am</code></li>
<li>We moved the color parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Let's review each of these changes:</p>
<h5>Converting the <code>am</code> variable to a factor</h5>
<p>In the dataset, <code>am</code> was initially a numeric variable. You can check this by running <code>class(mtcars$am)</code>. When you pass a numeric variable to a color scale in <code>ggplot</code>, it creates a continuous color scale. </p>
<p>In this case, however, there are only 2 values for the <code>am</code> field, corresponding to automatic and manual transmission. So it makes our graph more clear to use a discrete color scale, with 2 color options for the two values of <code>am</code>. We can accomplish this by converting the <code>am</code> field from a numeric value to a factor, as we did above. </p>
<p>On your own, try graphing both with and without this conversion to factor. If you've already converted to factor, you can reload the dataset by running <code>data(mtcars)</code> to try graphing as numeric! </p>
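<p>As a quick sketch of that comparison, the two calls below differ only in whether <code>am</code> is treated as numeric or as a factor; the first produces a continuous color gradient, while the second produces a discrete two-color scale:</p>

```r
library(ggplot2)
data(mtcars)  # reload so am is back to its original numeric form

# Numeric am: ggplot creates a continuous color gradient
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = am))

# Factor am: ggplot creates a discrete scale with one color per level
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(am)))
```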
<p>This point is a bit tricky. Check out <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">my workbook for this post</a> for a guided exploration of this issue in more detail!</p>
<h5>Specifying <code>color = am</code> and moving it within the <code>aes()</code> parentheses</h5>
<p>I'm combining these because these two changes work together. </p>
<p>Before, we told <code>ggplot</code> to change the color of the points to blue by adding <code>color = 'blue'</code> to our <code>geom_point()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our points, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>am</code> column to the <code>color</code> aesthetic. </p>
<p>This means we are telling <code>ggplot</code> to use a different color for each value of <code>am</code> in our data! This mapping also lets <code>ggplot</code> know that it needs to create a legend to identify the transmission types, which it places on the graph automatically!</p>
<h2>Changing point shapes in a ggplot scatter plot</h2>
<p>Let's look at a related example. This time, instead of changing the color of the points in our scatter plot, we will change the shape of the points:</p>
<div class="highlight"><pre><span></span>mtcars<span class="o">$</span>am <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>mtcars<span class="o">$</span>am<span class="p">)</span>
ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> shape <span class="o">=</span> am<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/shape_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>The code for this <code>ggplot</code> scatter plot is identical to the code we just reviewed, except we've substituted <code>shape</code> for <code>color</code>. The graph produced is quite similar, but it uses different shapes (triangles and circles) instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.</p>
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We just saw how we can create graphs in <code>ggplot</code> that map the <code>am</code> variable to color or shape in a scatter plot. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they include everything you see within the <code>aes()</code> in ggplot.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>I know this can sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_point.</p>
<h5>Reviewing the list of geom_point aesthetic mappings</h5>
<p>The main aesthetic mappings for a ggplot scatter plot include:</p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>color</code>: Map a variable to a point color</li>
<li><code>shape</code>: Map a variable to a point shape</li>
<li><code>size</code>: Map a variable to a point size</li>
<li><code>alpha</code>: Map a variable to a point transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code>, <code>y</code>, <code>color</code>, and <code>shape</code> aesthetic mappings. </p>
<p><code>x</code> and <code>y</code> are what we used in our first <code>ggplot</code> scatter plot example where we mapped the variables <code>wt</code> and <code>mpg</code> to x-axis and y-axis values. Then, we experimented with using <code>color</code> and <code>shape</code> to map the <code>am</code> variable to different colored points or shapes. </p>
<p>In addition to those, there are 2 other aesthetic mappings commonly used with <code>geom_point</code>. We can use the <code>alpha</code> aesthetic to change the transparency of the points in our graph. Finally, the <code>size</code> aesthetic can be used to change the size of the points in our scatter plot.</p>
<p>Note there are two additional aesthetic mappings for ggplot scatter plots, <code>stroke</code> and <code>fill</code>, but I'm not going to cover them here. They're only used with particular <code>shape</code> values, and they have very specific use cases beyond the scope of this guide. </p>
<h5>Changing the <code>size</code> aesthetic mapping in a ggplot scatter plot</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> size <span class="o">=</span> cyl<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/size_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In the code above, we map the number of cylinders (<code>cyl</code>), to the size aesthetic in ggplot. Cars with more cylinders display as larger points in this graph. </p>
<p>Note: A scatter plot where the size of the points varies based on a variable in the data is sometimes called a bubble chart. The scatter plot above could be considered a bubble chart!</p>
<p>In general, we see that cars with more cylinders tend to be clustered in the bottom right of the graph, with larger weights and lower miles per gallon, while those with fewer cylinders are on the top left. That said, it's a bit hard to make out all the points in the bottom right corner. How can we solve that issue? Let's learn more about the alpha aesthetic to find out!</p>
<h5>Changing transparency in a ggplot scatter plot with the <code>alpha</code> aesthetic</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> alpha <span class="o">=</span> cyl<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this code we've mapped the alpha aesthetic to the variable <code>cyl</code>. Cars with fewer cylinders appear more transparent, while those with more cylinders are more opaque. But in this case, I don't think this helps us to understand relationships in the data any better. Instead, it just seems to highlight the points on the bottom right. I think this is a bad graph!</p>
<p>How else can we use the alpha aesthetic to improve the readability of our graph? Let's turn back to our code from above where we mapped the cylinders to the size variable, creating what I called a bubble chart. Remember how it was difficult to make out all of the cars in the bottom right? What if we made all of the points in the graph semi-transparent so that we can see through the bubbles that are overlapping? Let's try!</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> size <span class="o">=</span> cyl<span class="p">),</span> alpha <span class="o">=</span> <span class="m">0.3</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/size_and_alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This makes it much easier to see the clustering of larger cars in the bottom right while not reducing the importance of those points in the top left! This is my favorite use of the alpha aesthetic in ggplot: adding transparency to get more insight into dense regions of points. </p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>Above, we saw that we are able to use <code>color</code> in two different ways with geom_point. First, we were able to set the color of our points to blue by specifying <code>color = 'blue'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>am</code> to color by specifying <code>color = am</code> <em>inside</em> of our <code>aes()</code> mappings. </p>
<p>Similarly, we saw two different ways to use the <code>alpha</code> aesthetic as well. First, we <em>mapped</em> the variable <code>cyl</code> to alpha by specifying <code>alpha = cyl</code> <em>inside</em> of our <code>aes()</code> mappings. Then, we set the alpha of all points to 0.3 by specifying <code>alpha = 0.3</code> <em>outside</em> of our <code>aes()</code> mappings. </p>
<p>What is the difference between these two ways of dealing with the aesthetic mappings available to us?</p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with color when we made the scatter plot points blue with <code>color = 'blue'</code> above. Then, you saw how to do this with alpha when we set the transparency to 0.3 with <code>alpha = 0.3</code>. Now let's look at an example of how to do this with shape in the same manner:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">),</span> shape <span class="o">=</span> <span class="m">18</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/shape_parameter-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here, we specify to use shape 18, which corresponds to this diamond shape you see here. Because we specified this <em>outside</em> of the <code>aes()</code>, this applies to all of the points in this graph!</p>
<p>To review what values <code>shape</code>, <code>size</code>, and <code>alpha</code> accept, just run <code>?shape</code>, <code>?size</code>, or <code>?alpha</code> from your console window! For even more details, check out <code>vignette("ggplot2-specs")</code>.</p>
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was first learning R and ggplot, the difference between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>), and parameters (the ones <em>outside</em> your <code>aes()</code>) was constantly confusing me. Luckily, over time, you'll find that this becomes second nature. But in the meantime, I can help you speed along this process with a few common errors that you can keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>cyl</code> variable to <code>shape</code>, you should include <code>shape = cyl</code> within the <code>aes()</code> of your <code>geom_point</code> call. What happens if you include it outside accidentally, and instead run <code>ggplot(mtcars) + geom_point(aes(x = wt, y = mpg), shape = cyl)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_line error message" src="../images/20190422_geom_point/error_1.png" width="600px" />
</center></p>
<p>Whenever you see this error about object not found, be sure to check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>On the other hand, if we try including a specific parameter value (for example, <code>color = 'blue'</code>) inside of the <code>aes()</code> mapping, the error is a bit less obvious. Take a look:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>mtcars<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> wt<span class="p">,</span> y <span class="o">=</span> mpg<span class="p">,</span> color <span class="o">=</span> <span class="s">'blue'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190422_ggplot_geom_point/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a scatter plot, but it's not what we intended. </p>
<p>For starters, the points are all red instead of the blue we were hoping for! Also, there's a legend in the graph that simply says 'blue'. </p>
<p>What's going on here? Under the hood, <code>ggplot</code> has taken the string 'blue' and effectively created a new hidden column of data where every value simply says 'blue'. Then, it has <em>mapped</em> that column to the color aesthetic, like we saw before when we specified <code>color = am</code>. This results in the legend label and the color of all the points being set, not to blue, but to the default color in <code>ggplot</code>.</p>
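<p>As an aside: if you ever do want <code>ggplot</code> to treat values inside <code>aes()</code> as literal color names, ggplot2 provides <code>scale_color_identity()</code>, which uses the mapped values directly rather than assigning default colors (and suppresses the legend by default):</p>

```r
library(ggplot2)

# scale_color_identity() tells ggplot to use the mapped value 'blue' as-is
ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = 'blue')) +
  scale_color_identity()
```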
<p>If this is confusing, that's okay for now. Just remember: when you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
<p>You should now have a solid understanding of how to create a scatter plot in R using the <code>ggplot</code> scatter plot function, <code>geom_point</code>! </p>
<p>Experiment with the things you've learned to solidify your understanding. You can <a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">download my free workbook</a> with the code from this article to work through on your own.</p>
<p>I've found that working through code on my own is the best way for me to learn new topics so that I'll actually remember them when I need to do things on my own in the future. </p>
<p><a href="https://mailchi.mp/213333232fb2/workbook-ggplot-scatter-plot">Download the workbook now</a> to practice what you learned!</p>If You Want to be Effective, You Need to Approach Data Science with a Business Mindset2019-04-23T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-23:if-you-want-to-be-effective-you-need-to-approach-data-science-with-a-business-mindset.html<p>There's a <a href="https://towardsdatascience.com/the-third-wave-data-scientist-1421df7433c9">great article</a> by Dominik Haitz making its way around the data science world this past week. </p>
<p>The entire article is worth reading, but I'll highlight some of my favorite points.</p>
<p>In the article, Dominik talks about the importance of developing a business mindset to succeed in data science.</p>
<p>A business mindset is critical. Remember, at the end of the day, the point of your work is to create some kind of concrete value for your organization. You can, and should, prioritize learning and working on fun and interesting projects, but above all else, you need to create value.</p>
<p>Dominik argues:</p>
<blockquote>
<p>Prioritizing your work and knowing when to stop is the key to efficiency. Think of diminishing returns: Is it worth spending weeks to tweak a model for another 0.2% of precision? Quite often, good enough is the real perfect.</p>
</blockquote>
<p>As I've discussed before, prioritization is extremely important in this fast-moving field, and it's impossible for you to know everything. </p>
<p>I believe in applying the Pareto principle to your learning: focus on mastering the 20% of concepts that will drive 80% of results, and only optimize further as necessary. </p>
<p>It's not that further optimization isn't valuable, but that your resources as a person are limited, and <strong>it's always better to produce something good but imperfect than it is to never produce something that would be perfect</strong>.</p>
<p>Dominik also talks about the importance of communicating your results, something I think many data scientists struggle with. </p>
<p>As difficult as it is to hear this, <strong>your analysis is meaningless if you can't convince key stakeholders in your company to take action based on what you find</strong>! </p>
<p>You need to be able to communicate your results effectively, both throughout your organization and externally. Communicating effectively means trying to see the world through the eyes of those you are communicating to. You can, and should, discuss your analysis differently when speaking with a teammate, an executive, and a client. </p>
<p>This was a great article, and I agree with a lot of what Dominik is saying. <a href="https://towardsdatascience.com/the-third-wave-data-scientist-1421df7433c9">Check out the article</a> for more!</p>Mapping Legal Marijuana States and Medical Marijuana States 1995 - 20192019-04-20T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-20:mapping-legal-marijuana-states-and-medical-marijuana-states-1995-2019.html<p>Today is April 20, 2019. As stoners everywhere celebrate the occasion, I thought I'd turn to creating maps.</p>
<p>As the number of legal marijuana states and medical marijuana states seems to have grown considerably in recent years, I got to thinking: what has the history of legalization been, and how has it changed over time?</p>
<p>In the graph below, I show legal marijuana states, medical marijuana states, and illegal marijuana states over time. Note that for medical marijuana states, there are two categories: broadly legal for medical purposes, and low-THC marijuana legal for medical purposes. Without further ado, I present the map:</p>
<p><img alt="center" src="/figures/20190419_Marijuana_Legalization/map.gif" /></p>
<p>Read on if you're interested in learning how to create this map yourself in R!</p>
<h2>Gathering the data</h2>
<p>I sourced the data for legal marijuana states and medical marijuana states from the <a href="https://en.wikipedia.org/wiki/Timeline_of_cannabis_laws_in_the_United_States">Timeline of cannabis laws in the United States</a> article on Wikipedia. </p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>albersusa<span class="p">)</span> <span class="c1"># devtools::install_github("hrbrmstr/albersusa")</span>
<span class="kn">library</span><span class="p">(</span>animation<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggalt<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>maps<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
<span class="c1"># Set up base map theme for ggplot</span>
map_theme <span class="o"><-</span> theme<span class="p">(</span>axis.line <span class="o">=</span> element_blank<span class="p">(),</span>
axis.text <span class="o">=</span> element_blank<span class="p">(),</span>
axis.ticks <span class="o">=</span> element_blank<span class="p">(),</span>
panel.background <span class="o">=</span> element_blank<span class="p">(),</span>
panel.border <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.minor <span class="o">=</span> element_blank<span class="p">(),</span>
plot.background <span class="o">=</span> element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="s">'top'</span><span class="p">)</span>
timeline <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'~/Desktop/Marijuana_Legalization.csv'</span><span class="p">)</span>
</pre></div>
<p>Above I load in the various packages we'll be using for this analysis. I also create a blank map theme that will remove a lot of things like axis gridlines and text that we don't want on our final map.</p>
<p>Then, I load in the legal marijuana states data I gathered from the Wikipedia article above. I had to do some manual processing to convert the text in that article to a workable file that had dates of different legalization statuses by state. The file I ended up putting together looked like this:</p>
<p><center>
<img alt="Image of legal marijuana status by state" src="../images/20190419_marijuana_legalization/spreadsheet.png" width="600px" />
</center></p>
<p>In order to create the animated graph, we need to reformat the input data a bit. Ultimately, we want a dataset organized with three columns:</p>
<ul>
<li><code>Year</code> </li>
<li><code>State</code></li>
<li><code>Status</code></li>
</ul>
<p>There should be one status entry for each combination of <code>Year</code> and <code>State</code>, giving the legal marijuana status for that state in that particular year.</p>
<p>While we don't need it for this map, I'm also going to include a <code>Criminalized</code> column to indicate whether marijuana had been decriminalized in a particular year in a state. The processing to get the data in this format all happens in the section below.</p>
<p>For each state, we read in the data to determine which year, if any, marijuana became medicinally legal, recreationally legal, or medicinally legal with low-THC content. This lets us categorize each state-year combination into one of four categories:</p>
<ul>
<li><code>Illegal</code>: All forms of marijuana consumption are illegal</li>
<li><code>Low-THC Medicinally Legal</code>: Low-THC varieties of marijuana are legal for medicinal use</li>
<li><code>Medicinally Legal</code>: Marijuana is legal for medicinal use</li>
<li><code>Legal</code>: Marijuana is legal for recreational use</li>
</ul>
<div class="highlight"><pre><span></span>legalized <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">()</span>
<span class="kr">for</span><span class="p">(</span>state <span class="kr">in</span> timeline<span class="o">$</span>State<span class="p">)</span> <span class="p">{</span>
current_state <span class="o"><-</span> filter<span class="p">(</span>timeline<span class="p">,</span> State <span class="o">==</span> state<span class="p">)</span>
decrim_year <span class="o"><-</span> current_state<span class="o">$</span>Decriminalized
crim_year <span class="o"><-</span> current_state<span class="o">$</span>Criminalized
med_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Medical
med_low_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Medical_Low_THC
legal_year <span class="o"><-</span> current_state<span class="o">$</span>Legalized_Recreational
status <span class="o">=</span> <span class="s">'Illegal'</span>
criminalized <span class="o">=</span> <span class="kc">TRUE</span>
<span class="kr">for</span><span class="p">(</span>year <span class="kr">in</span> <span class="m">1960</span><span class="o">:</span><span class="m">2019</span><span class="p">)</span> <span class="p">{</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>decrim_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> decrim_year<span class="p">)</span> <span class="p">{</span>
criminalized <span class="o">=</span> <span class="kc">FALSE</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>crim_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> crim_year<span class="p">)</span> <span class="p">{</span>
criminalized <span class="o">=</span> <span class="kc">TRUE</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>med_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> med_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal for Medical Use'</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>med_low_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> med_low_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal for Medical Use, Low-THC Only'</span>
<span class="p">}</span>
<span class="kr">if</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>legal_year<span class="p">)</span> <span class="o">&</span> year <span class="o">==</span> legal_year<span class="p">)</span> <span class="p">{</span>
status <span class="o">=</span> <span class="s">'Legal'</span>
<span class="p">}</span>
current_status <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>State <span class="o">=</span> state<span class="p">,</span>
Year <span class="o">=</span> year<span class="p">,</span>
Status <span class="o">=</span> status<span class="p">,</span>
Criminalized <span class="o">=</span> criminalized<span class="p">)</span>
legalized <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>legalized<span class="p">,</span> current_status<span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>Great, so now we have the data that we'll need to create our ultimate graph of legal marijuana states. First, I'm going to create the graph for 2019 to make sure that I have the base formatting down. </p>
<p>I start by loading in a shapefile for the United States that we'll use for graphing. I remove Washington D.C., which I didn't initially pull legalization data for and which causes issues with mapping.</p>
<p>Then I reorganize the legal status into a factor column, which lets us create explicit orders and color mappings below, something we'll need to create the final graph. </p>
<p>I filter the legalization data to only include 2019, then merge the legalization data with the map shapefile to get the information we need for graphing. Finally, I produce the ultimate graph in ggplot! </p>
<div class="highlight"><pre><span></span>us <span class="o"><-</span> usa_composite<span class="p">()</span>
us_map <span class="o"><-</span> fortify<span class="p">(</span>us<span class="p">,</span> region<span class="o">=</span><span class="s">"name"</span><span class="p">)</span> <span class="o">%>%</span> filter<span class="p">(</span>id <span class="o">!=</span> <span class="s">'District of Columbia'</span><span class="p">)</span>
legalized<span class="o">$</span>Status <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">,</span>
levels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Illegal'</span><span class="p">,</span>
<span class="s">'Legal for Medical Use, Low-THC Only'</span><span class="p">,</span>
<span class="s">'Legal for Medical Use'</span><span class="p">,</span>
<span class="s">'Legal'</span><span class="p">))</span>
legalized_2019 <span class="o"><-</span> filter<span class="p">(</span>legalized<span class="p">,</span> Year <span class="o">==</span> <span class="m">2019</span><span class="p">)</span>
legalized_2019_map <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>us_map<span class="p">,</span> legalized_2019<span class="p">,</span> by.x <span class="o">=</span> <span class="s">"id"</span><span class="p">,</span> by.y <span class="o">=</span> <span class="s">"State"</span><span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="kp">order</span><span class="p">)</span>
ggplot<span class="p">(</span>legalized_2019_map<span class="p">)</span> <span class="o">+</span>
geom_polygon<span class="p">(</span>aes<span class="p">(</span>fill <span class="o">=</span> Status<span class="p">,</span> x <span class="o">=</span> long<span class="p">,</span> y <span class="o">=</span> lat<span class="p">,</span> group <span class="o">=</span> group<span class="p">),</span> color <span class="o">=</span> <span class="s">'white'</span><span class="p">)</span> <span class="o">+</span>
coord_proj<span class="p">(</span>us_laea_proj<span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#E9E9E9'</span><span class="p">,</span> <span class="s">'#105927'</span><span class="p">,</span> <span class="s">'#349941'</span><span class="p">,</span> <span class="s">'#61DE58'</span><span class="p">),</span>
name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> limits <span class="o">=</span> <span class="kp">levels</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">10</span><span class="p">)</span> <span class="o">+</span>
map_theme <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'Legal Status of Marijuana in '</span><span class="p">,</span> year<span class="p">),</span>
subtitle <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">,</span> y <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190419_Marijuana_Legalization/map it-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>A quick comparison of the above map to other maps I found shows we're capturing the legal statuses correctly, so we're all set. Now, let's get to animating so we can create the complete graph from above!</p>
<div class="highlight"><pre><span></span>legalized_map <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>us_map<span class="p">,</span> legalized<span class="p">,</span> by.x <span class="o">=</span> <span class="s">"id"</span><span class="p">,</span> by.y <span class="o">=</span> <span class="s">"State"</span><span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="kp">order</span><span class="p">)</span>
saveGIF<span class="p">({</span>
<span class="c1"># Repeat 2019 6 times for a pause at the end of the animation</span>
<span class="kr">for</span> <span class="p">(</span>year <span class="kr">in</span> <span class="kt">c</span><span class="p">(</span><span class="m">1995</span><span class="o">:</span><span class="m">2019</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="m">2019</span><span class="p">,</span> <span class="m">5</span><span class="p">)))</span> <span class="p">{</span>
yearly_map <span class="o"><-</span> filter<span class="p">(</span>legalized_map<span class="p">,</span> Year <span class="o">==</span> year<span class="p">)</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>yearly_map<span class="p">)</span> <span class="o">+</span>
geom_polygon<span class="p">(</span>aes<span class="p">(</span>fill <span class="o">=</span> Status<span class="p">,</span> x <span class="o">=</span> long<span class="p">,</span> y <span class="o">=</span> lat<span class="p">,</span> group <span class="o">=</span> group<span class="p">),</span> color <span class="o">=</span> <span class="s">'white'</span><span class="p">)</span> <span class="o">+</span>
coord_proj<span class="p">(</span>us_laea_proj<span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#E9E9E9'</span><span class="p">,</span> <span class="s">'#105927'</span><span class="p">,</span> <span class="s">'#349941'</span><span class="p">,</span> <span class="s">'#61DE58'</span><span class="p">),</span>
name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> limits <span class="o">=</span> <span class="kp">levels</span><span class="p">(</span>legalized<span class="o">$</span>Status<span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">24</span><span class="p">,</span> plot_title_size <span class="o">=</span> <span class="m">36</span><span class="p">,</span>
axis_title_size <span class="o">=</span> <span class="m">24</span><span class="p">)</span> <span class="o">+</span>
map_theme <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'Legal Status of Marijuana in '</span><span class="p">,</span> year<span class="p">),</span>
x <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">,</span> y <span class="o">=</span> <span class="s">''</span><span class="p">)</span>
<span class="kp">print</span><span class="p">(</span>p<span class="p">)</span>
<span class="p">}</span>
<span class="p">},</span> movie.name <span class="o">=</span> <span class="s">'~/dev/michaeltoth/content/figures/20190419_Marijuana_Legalization/map.gif'</span><span class="p">,</span> interval <span class="o">=</span> <span class="m">1</span><span class="p">,</span> ani.width <span class="o">=</span> <span class="m">1400</span><span class="p">,</span> ani.height <span class="o">=</span> <span class="m">1000</span><span class="p">)</span>
</pre></div>
<p>Above I create the same map, except that now I produce one map for each year from 1995 to 2019. I also repeat 2019 six times in total to give the effect of pausing on the final frame.</p>
<p>To create this animation, I wrap the creation of the yearly maps in the <code>saveGIF</code> command, which converts a series of images into a GIF. Then, I loop through the years 1995-2019 (repeating 2019 six times) to create the animation!</p>
<p><img alt="center" src="/figures/20190419_Marijuana_Legalization/map.gif" /></p>
<hr />
<p>Did you find this post interesting? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>Generating the Ultimate List of 41 Data Science Podcasts by Crowdsourcing Google Results2019-04-19T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-19:generating-the-ultimate-list-of-41-data-science-podcasts-by-crowdsourcing-google-results.html<p>Confession time: years ago, I was skeptical of podcasts. I was a music-only listener on commutes. Can you imagine? But around 2016, I gave in and finally took the plunge into podcasts. And I'm so glad that I did. </p>
<p>Since then, I've seen enormous benefits, all attributable at least in part to the podcasts I've listened to. I've improved my programming, learned new skills, and started multiple income-generating businesses.</p>
<p>Naturally, I've been interested in data science podcasts. Initially, I found data science podcasts through people I followed on Twitter who had shows of their own. But I'm always on the lookout for new podcasts, so I took to the internet to find recommendations.</p>
<p>If you're a podcast listener, you've probably learned that the current state of podcast discovery leaves much to be desired. It's very hard to find new podcasts, especially for niche fields like data science and analytics. </p>
<p>So of course, when I tried searching for interesting data science podcasts, I kept finding the same shows recommended everywhere. And usually they were the podcasts I was already listening to! </p>
<p>I decided that if I wanted a bigger list of data science podcasts, I needed to go deep. I searched nearly 100 lists for recommended podcasts, and used them to compile <strong>the most complete list of data science podcasts</strong> you will find. </p>
<p><a href="https://mailchi.mp/d18f2f50ca14/data-science-podcasts">Click here for the full list of 41 podcasts</a></p>
<p>In this post I'm going to talk about how I collected this list and analyze the results.</p>
<h2>Gathering a list of data science podcasts</h2>
<p>I knew I wanted to build a big list. But I also knew that not all podcasts are created equal, and that I needed some way to differentiate the truly great podcasts on the list. I decided I would use podcast recommendations as a form of social proof, so that the most recommended podcasts would bubble to the top of the list. Here's the method I decided to follow:</p>
<ol>
<li>I'd generate a list of search terms</li>
<li>I'd perform a Google search for each term on the list</li>
<li>I'd open each of the top 10 Google links for that search term and note all the results</li>
<li>I'd aggregate the complete results by podcast and use the total number of recommendations to create a "recommendation score". Higher scores should, in theory, represent better podcasts. Or at least more well known podcasts.</li>
</ol>
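<p>The aggregation in step 4 is easy to sketch in R. The code below is only an illustration of the idea (the actual tallying for this post was done by hand); the <code>results</code> data frame and its podcast/source values are hypothetical:</p>
<div class="highlight"><pre><span></span>library(tidyverse)

# One row for each time a podcast appeared on a recommendation page
results <- tribble(
  ~Podcast,                     ~Source,
  "Data Skeptic",               "list-a",
  "Data Skeptic",               "list-b",
  "Data Stories",               "list-a",
  "Data Stories",               "list-b",
  "Data Stories",               "list-c",
  "Not So Standard Deviations", "list-c"
)

# Count appearances to produce the "recommendation score"
scores <- results %>%
  count(Podcast, name = "Score") %>%
  arrange(desc(Score))
</pre></div>
<p>Podcasts that appear on more pages end up with higher scores, which is exactly the "social proof" ranking described above.</p>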
<h4>Generating a list of search phrases (Step 1)</h4>
<p>This step was relatively easy. I knew I wanted a list of data science podcasts, but I also wanted to include more specific subgenres like data visualization, as well as more broad but related topics like Python programming or SQL & Databases. I settled on this list of phrases:</p>
<ul>
<li>Data Science Podcasts</li>
<li>Data Engineering Podcasts</li>
<li>Data Visualization Podcasts</li>
<li>Analytics Podcasts</li>
<li>SQL Podcasts</li>
<li>R Podcasts -reddit</li>
<li>Python Podcasts</li>
</ul>
<p>Pretty straightforward. The only curious bit is the R Podcasts query. Reddit's URL structure is such that reddit.com/r/podcasts leads to the podcasts subreddit, so without the explicit <code>-reddit</code> exclusion, all of the top links pointed to that subreddit and were unrelated to R programming.</p>
<h4>Searching Google and aggregating (Steps 2-4)</h4>
<p>Next, I'd search each of these phrases in Google and open up the first 10 sites that popped up in my search results.</p>
<p>Often these were lists others had created, for example "top 5 data science podcasts". I copied down each list of podcasts and kept a running tally of how many times I'd seen a particular podcast represented. </p>
<p>Occasionally, the Google search would yield links to specific podcasts rather than to lists of podcasts. In this case, I would record this as well. My thinking is that ranking on Google for a particular search phrase is at least as reliable an indicator of podcast quality as inclusion on a "best data science podcasts" list. That said, this did particularly benefit those podcasts whose names closely matched my search query, as was the case with <a href="https://www.dataengineeringpodcast.com/">Data Engineering Podcast</a> and the <a href="https://r-podcast.org/">The R-Podcast</a>.</p>
<p>After going through all of the search queries, I had a list of podcast recommendations along with a tally of how often the podcast had been recommended in search results.</p>
<p>Finally, I removed any podcasts that had only 1 appearance in the results. There were a large number of these, and I felt it would dilute the value of the list to include them. This was somewhat arbitrary, but I think it makes for a stronger list overall.</p>
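<p>This final filtering step is a one-liner with <code>dplyr</code>. A minimal, self-contained sketch (the tibble values here are made up for illustration):</p>
<div class="highlight"><pre><span></span>library(dplyr)

podcasts <- tibble(Title = c("Data Stories", "One-Hit Wonder"),
                   Score = c(9, 1))

# Keep only podcasts that appeared more than once in the search results
podcasts <- filter(podcasts, Score > 1)
</pre></div>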
<h4>Analyzing the list of best data science podcasts</h4>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
extrafont<span class="o">::</span>loadfonts<span class="p">(</span>quiet <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span> <span class="c1"># Needed for hrbrthemes / mac dependency issue</span>
podcasts <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'~/Desktop/best_podcasts.csv'</span><span class="p">)</span>
</pre></div>
<p>First I read in my compiled list of podcasts, then I used ggplot to graph the results.</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>podcasts<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> reorder<span class="p">(</span>Title<span class="p">,</span> Score<span class="p">),</span> y <span class="o">=</span> Score<span class="p">))</span> <span class="o">+</span>
geom_bar<span class="p">(</span>stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">()</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Top Data Science Podcasts'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Rankings crowdsourced from Google search results'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Recommendation Score'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">5</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">8</span><span class="p">,</span> axis_title_size <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.grid.minor.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.ontop <span class="o">=</span> <span class="kc">TRUE</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190415_best_ds_podcasts/graph_podcasts-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This is awesome! This list of 41 podcasts is more than I found anywhere else online while going through this exercise, and I'm confident it's the most extensive list of data science podcasts you can find. </p>
<p>I had heard of many of the podcasts in the top 10--Data Stories, Data Skeptic, Partially Derivative, The O'Reilly Data Show, and Not So Standard Deviations--but others were new to me! I hadn't heard of Linear Digressions, or Talking Machines, or Learning Machines 101. While I haven't yet had a chance to listen to these, I'm excited to check them out!</p>
<p>So now we have a list of 41 data science podcasts to review. This is great, but I think we can do better! While these podcasts are all loosely related to data science, they're quite different from one another in focus. For example, FiveThirtyEight Politics is a relatively mainstream podcast, Data Stories deals with topics in data visualization, and Data Skeptic is an all-around instructional data science podcast. </p>
<p>I thought some form of categorization would help guide my listening, so I created a broad list of categories and grouped them as best as I could. I decided I would group the podcasts into 8 categories:</p>
<ul>
<li>General Data Science and Analytics</li>
<li>Relevant Mainstream Podcasts </li>
<li>Machine Learning & AI </li>
<li>Data Visualization </li>
<li>Data Engineering </li>
<li>SQL & Databases </li>
<li>R Programming </li>
<li>Python Programming </li>
</ul>
<p>I went through this list and categorized them as best as I could according to topic. Equipped with these new categories, let's take a look at the list of recommendations: </p>
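<p>One lightweight way to attach categories like these in R is <code>dplyr::case_when</code>. This is just a sketch (the real categorization for this post was done by hand), and the matching rules below are hypothetical:</p>
<div class="highlight"><pre><span></span>library(dplyr)
library(stringr)

podcasts <- tibble(Title = c("The R-Podcast", "Talk Python To Me", "Data Stories"))

# Hypothetical title-matching rules; hand-label anything the rules miss
podcasts <- mutate(podcasts, Category = case_when(
  str_detect(Title, "R-Podcast") ~ "R Programming",
  str_detect(Title, "Python")    ~ "Python Programming",
  str_detect(Title, "Stories")   ~ "Data Visualization",
  TRUE                           ~ "General Data Science and Analytics"
))
</pre></div>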
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>podcasts<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> reorder<span class="p">(</span>Title<span class="p">,</span> Score<span class="p">),</span> y <span class="o">=</span> Score<span class="p">,</span> fill <span class="o">=</span> Category<span class="p">))</span> <span class="o">+</span>
geom_bar<span class="p">(</span>stat <span class="o">=</span> <span class="s">'identity'</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">()</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"#a6cee3"</span><span class="p">,</span><span class="s">"#1f78b4"</span><span class="p">,</span><span class="s">"#b2df8a"</span><span class="p">,</span><span class="s">"#33a02c"</span><span class="p">,</span><span class="s">"#fb9a99"</span><span class="p">,</span><span class="s">"#e31a1c"</span><span class="p">,</span><span class="s">"#fdbf6f"</span><span class="p">,</span><span class="s">"#ff7f00"</span><span class="p">))</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Top Data Science Podcasts'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Rankings crowdsourced from Google search results'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'Recommendation Score'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span> <span class="o">+</span>
theme_ipsum<span class="p">(</span>base_size <span class="o">=</span> <span class="m">5</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">8</span><span class="p">,</span> axis_title_size <span class="o">=</span> <span class="m">8</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid.major.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.grid.minor.x <span class="o">=</span> element_line<span class="p">(</span>colour <span class="o">=</span> <span class="s">'white'</span><span class="p">,</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">),</span>
panel.ontop <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span>
legend.position <span class="o">=</span> <span class="s">'bottom'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190415_best_ds_podcasts/graph_podcasts_by_topic-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>While most of the top 10 are made up of what I'm calling general data science and analytics podcasts, Data Stories stands alone as the number one podcast and the only data visualization podcast in the top 10. Most of the language-specific podcasts are lower on the list, but that's likely because they are more specific in nature, not necessarily because they are lower quality. The R-Podcast was the only language-specific podcast to crack the top 10, so I'm definitely going to check that one out!</p>
<p>Do you have favorite podcasts that aren't included here? Let me know in the comments. Also, let me know how you discover new podcasts; I'd love to improve my own discovery process! If you're interested, you can get the full list of data science podcasts below:</p>
<p><a href="https://mailchi.mp/d18f2f50ca14/data-science-podcasts">Get the spreadsheet of all 41 data science podcasts organized by topic</a></p>
<hr />
<p>Every week I publish concise tutorials 🎓 and career advice 💻 for data science and analytics workers. <a href="http://eepurl.com/gmYioz">I will help you learn R programming, build your data science career, and raise your salary.</a></p>A Detailed Guide to Plotting Line Graphs in R using ggplot geom_line2019-04-17T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-17:a-detailed-guide-to-plotting-line-graphs-in-r-using-ggplot-geom_line.html<p>When it comes to data visualization, it can be fun to think of all the flashy and exciting ways to display a dataset. But if you're trying to convey information, flashy isn't always the way to go. </p>
<p>In fact, in most cases, <strong>simplicity is key to making your audience understand</strong> your data. Whether it's <a href="https://michaeltoth.me/a-detailed-guide-to-the-ggplot-scatter-plot-in-r.html">scatter plots</a>, bar graphs, or line graphs (the subject of this post!), common graph types make things easy for your audience, which means you can more easily share your message.</p>
<p>Right now, we're talking about line graphs. A line graph is a type of graph that displays information as a series of data points connected by straight line segments. </p>
<h5>The price of Netflix stock (NFLX) displayed as a line graph</h5>
<p><center>
<img alt="The price of Netflix stock (NFLX) displayed as a line graph" src="../images/20190417_ggplot_geom_line/NFLX.png" width="600px" />
</center></p>
<h5>Line graph of average monthly temperatures for four major cities</h5>
<p><center>
<img alt="Line graph of average monthly temperatures for four major cities" src="../images/20190417_ggplot_geom_line/climate.png" width="600px" />
</center></p>
<p>There are many different ways to use R to plot line graphs, but the one I prefer is the <code>ggplot geom_line</code> function.</p>
<h2>Introduction to ggplot</h2>
<p>Before we dig into creating line graphs with the <code>ggplot geom_line</code> function, I want to briefly touch on <code>ggplot</code> and why I think it's the best choice for plotting graphs in R. </p>
<p><code>ggplot</code> is a package for creating graphs in R, but it's also a method of thinking about and decomposing complex graphs into logical subunits. </p>
<p><code>ggplot</code> takes each component of a graph--axes, scales, colors, objects, etc--and allows you to build graphs up sequentially one component at a time. You can then modify each of those components in a way that's both flexible and user-friendly. When components are unspecified, <code>ggplot</code> uses sensible defaults. This makes <code>ggplot</code> a powerful and flexible tool for creating all kinds of graphs in R. It's the tool I use to create nearly every graph I make these days, and I think you should use it too!</p>
<h2>Investigating our dataset</h2>
<p>Throughout this post, we'll be using the Orange dataset that's built into R. This dataset contains information on the age and circumference of 5 different orange trees, letting us see how these trees grow over time. Let's take a look at this dataset to see what it looks like:</p>
<p><center>
<img alt="A snippet of the Orange dataset" src="../images/20190417_ggplot_geom_line/Orange.png" width="400px" />
</center></p>
<p>The dataset contains 3 columns: Tree, age, and circumference. There are 7 observations for each Tree, and there are 5 Trees, for a total of 35 observations in all. </p>
<h2>Simple example of ggplot + geom_line()</h2>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
<span class="c1"># Filter the data we need</span>
tree_1 <span class="o"><-</span> filter<span class="p">(</span>Orange<span class="p">,</span> Tree <span class="o">==</span> <span class="m">1</span><span class="p">)</span>
<span class="c1"># Graph the data</span>
ggplot<span class="p">(</span>tree_1<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/simple_line-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we are starting with the simplest possible line graph using geom_line. For this simple graph, I chose to only graph the size of the first tree. I used <code>dplyr</code> to filter the dataset to only that first tree.
If you're not familiar with <code>dplyr</code>'s <code>filter</code> function, it's my preferred way of subsetting a dataset in R, and I recently wrote an in-depth guide to <a href="https://michaeltoth.me/how-to-filter-in-r-a-detailed-introduction-to-the-dplyr-filter-function.html">dplyr filter</a> if you'd like to learn more!</p>
<p>Once I had filtered out the dataset I was interested in, I then used <code>ggplot + geom_line()</code> to create the graph. Let's review this in more detail:</p>
<p>First, I call <code>ggplot</code>, which creates a new <code>ggplot</code> graph. It's essentially a blank canvas on which we'll add our data and graphics. In this case, I passed tree_1 to <code>ggplot</code>, indicating that we'll be using the tree_1 data for this particular <code>ggplot</code> graph.</p>
<p>Next, I added my <code>geom_line</code> call to the base <code>ggplot</code> graph in order to create this line. In <code>ggplot</code>, you use the <code>+</code> symbol to add new layers to an existing graph. In this second layer, I told <code>ggplot</code> to use age as the x-axis variable and circumference as the y-axis variable. </p>
<p>And that's it, we have our line graph!</p>
<h2>Changing line color in <code>ggplot + geom_line</code></h2>
<p>Expanding on this example, let's now experiment a bit with colors.</p>
<div class="highlight"><pre><span></span><span class="c1"># Filter the data we need</span>
tree_1 <span class="o"><-</span> filter<span class="p">(</span>Orange<span class="p">,</span> Tree <span class="o">==</span> <span class="m">1</span><span class="p">)</span>
<span class="c1"># Graph the data</span>
ggplot<span class="p">(</span>tree_1<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">),</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/color-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that this geom_line call is identical to the one before, except that we've added the modifier <code>color = 'red'</code> to the end of the line. Experiment a bit with different colors to see how this works on your machine. You can use most color names you can think of, or you can use specific hex color codes to get more granular.</p>
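<p>For example, both of the following set the same static color; <code>'#4682B4'</code> is the hex code for the named color steelblue, which I'm using here purely for illustration:</p>
<div class="highlight"><pre><span></span># Named color
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), color = 'steelblue')

# Equivalent hex color code
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), color = '#4682B4')
</pre></div>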
<p>Now, let's try something a little different. Compare the <code>ggplot</code> code below to the code we just executed above. There are 3 differences. See if you can find them and guess what will happen, then scroll down to take a look at the result.</p>
<div class="highlight"><pre><span></span><span class="c1"># Graph different data</span>
ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> color <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/color_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This line graph is quite different from the one we produced above, but we only made a few minor modifications to the code! Did you catch the 3 changes? They were:</p>
<ol>
<li>The dataset changed from tree_1 (our filtered dataset) to the complete Orange dataset</li>
<li>Instead of specifying <code>color = 'red'</code>, we specified <code>color = Tree</code></li>
<li>We moved the color parameter inside of the <code>aes()</code> parentheses</li>
</ol>
<p>Let's review each of these changes:</p>
<h5>Moving from tree_1 to Orange</h5>
<p>This change is relatively straightforward. Instead of only graphing the data for a single tree, we wanted to graph the data for all 5 trees. We accomplish this by changing our input dataset in the <code>ggplot()</code> call. </p>
<h5>Specifying <code>color = Tree</code> and moving it within the <code>aes()</code> parentheses</h5>
<p>I'm combining these because these two changes work together. </p>
<p>Before, we told <code>ggplot</code> to change the color of the line to red by adding <code>color = 'red'</code> to our <code>geom_line()</code> call. </p>
<p>What we're doing here is a bit more complex. Instead of specifying a single color for our line, we're telling <code>ggplot</code> to <em>map</em> the data in the <code>Tree</code> column to the <code>color</code> aesthetic. </p>
<p>Effectively, we're telling <code>ggplot</code> to use a different color for each tree in our data! This mapping also lets <code>ggplot</code> know that it also needs to create a legend to identify the trees, and it places it there automatically!</p>
<h2>Changing linetype in <code>ggplot + geom_line</code></h2>
<p>Let's look at a related example. This time, instead of changing the color of the line graph, we will change the linetype:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> linetype <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/linetype_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>This <code>ggplot + geom_line()</code> call is identical to the one we just reviewed, except we've substituted <code>linetype</code> for <code>color</code>. The graph produced is quite similar, but it uses different linetypes instead of different colors in the graph. You might consider using something like this when printing in black and white, for example.</p>
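<p>The same static-versus-mapped distinction we saw with color applies to linetype. To give every line a single fixed linetype, move the argument outside of <code>aes()</code>; this sketch reuses the <code>tree_1</code> data frame filtered earlier:</p>
<div class="highlight"><pre><span></span># A static linetype for the whole line, not mapped to any variable
ggplot(tree_1) +
  geom_line(aes(x = age, y = circumference), linetype = 'dashed')
</pre></div>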
<h2>A deeper review of <code>aes()</code> (aesthetic) mappings in ggplot</h2>
<p>We just saw how we can create graphs in <code>ggplot</code> that map the Tree variable to color or linetype in a line graph. <code>ggplot</code> refers to these mappings as <em>aesthetic</em> mappings, and they encompass everything you see within the <code>aes()</code> in ggplot.</p>
<p>Aesthetic mappings are a way of mapping <em>variables in your data</em> to particular <em>visual properties</em> (aesthetics) of a graph. </p>
<p>This might all sound a bit theoretical, so let's review the specific aesthetic mappings you've already seen as well as the other mappings available within geom_line.</p>
<h5>Reviewing the list of geom_line aesthetic mappings</h5>
<p>The main aesthetic mappings for <code>ggplot + geom_line()</code> include: </p>
<ul>
<li><code>x</code>: Map a variable to a position on the x-axis</li>
<li><code>y</code>: Map a variable to a position on the y-axis</li>
<li><code>color</code>: Map a variable to a line color</li>
<li><code>linetype</code>: Map a variable to a linetype</li>
<li><code>group</code>: Map a variable to a group (each group on a separate line)</li>
<li><code>size</code>: Map a variable to a line size</li>
<li><code>alpha</code>: Map a variable to a line transparency</li>
</ul>
<p>From the list above, we've already seen the <code>x</code>, <code>y</code>, <code>color</code>, and <code>linetype</code> aesthetic mappings. </p>
<p><code>x</code> and <code>y</code> are what we used in our first <code>ggplot + geom_line()</code> function call to map the variables age and circumference to x-axis and y-axis values. Then, we experimented with using <code>color</code> and <code>linetype</code> to map the Tree variable to different colored lines or linetypes. </p>
<p>In addition to those, there are 3 other main aesthetic mappings often used with <code>geom_line</code>. </p>
<p>The <code>group</code> mapping allows us to map a variable to different groups. Within <code>geom_line</code>, that means mapping a variable to different lines. Think of it as a pared down version of the <code>color</code> and <code>linetype</code> aesthetic mappings you already saw. While the <code>color</code> aesthetic mapped each Tree to a different line with a different color, the <code>group</code> aesthetic maps each Tree to a different line, but does not differentiate the lines by color or anything else. Let's take a look:</p>
<h5>Changing the <code>group</code> aesthetic mapping in <code>ggplot + geom_line</code></h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> group <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/group_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>You'll note that the 5 lines are separated as before, but the lines are all black and there is no legend differentiating them. Depending on the data you're working with, this may or may not be appropriate. It's up to <em>you</em> as the person familiar with the data to determine how best to represent it in graph form!</p>
<p>In our Orange tree dataset, if you're interested in investigating how <em>specific</em> orange trees grew over time, you'd want to use the <code>color</code> or <code>linetype</code> aesthetics to make sure you can track the progress for specific trees. If, instead, you're interested in only how orange trees <em>in general</em> grow, then using the <code>group</code> aesthetic is appropriate, simplifying your graph and discarding unnecessary detail.</p>
<p><code>ggplot</code> is both flexible and powerful, but it's up to <em>you</em> to design a graph that communicates what you want to show. Just because you <em>can</em> do something doesn't mean you <em>should</em>. You should always think about what message you're trying to convey with a graph, then design from those principles. </p>
<p>Keep this in mind as we review the next two aesthetics. While these aesthetics absolutely have a place in data visualization, in the case of the particular dataset we're working with, they don't make very much sense. But this is a guide to using <code>geom_line</code> in <code>ggplot</code>, not graphing the growth of Orange trees, so I'm still going to cover them for the sake of completeness!</p>
<h5>Changing transparency in <code>ggplot + geom_line</code> with the <code>alpha</code> aesthetic</h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> alpha <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/alpha_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Here we map the <code>Tree</code> variable to the <code>alpha</code> aesthetic, which controls the transparency of the line. As you can see, certain lines are more transparent than others. In this case, transparency does not add to our understanding of the graph, so I would not use this to illustrate this dataset.</p>
<h5>Changing the <code>size</code> aesthetic mapping in <code>ggplot + geom_line</code></h5>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> size <span class="o">=</span> Tree<span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/size_aes-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>Finally, we turn to the <code>size</code> aesthetic, which controls the thickness of lines. Again, I would say this does not add to our understanding of the data in this context. That said, it does slightly resemble <a href="https://en.wikipedia.org/wiki/Charles_Joseph_Minard">Charles Joseph Minard's</a> famous graph of the death tolls of Napoleon's disastrous 1812 Russia Campaign, so that's kind of cool:</p>
<p><center>
<img alt="Minard's graph of Napoleon's 1812 Russia Campaign" src="../images/20190417_ggplot_geom_line/Minard.png" width="600px" />
</center></p>
<h2>Aesthetic mappings vs. parameters in ggplot</h2>
<p>Before, we saw that we are able to use <code>color</code> in two different ways with <code>geom_line</code>. First, we were able to set the color of a line to red by specifying <code>color = 'red'</code> <em>outside</em> of our <code>aes()</code> mappings. Then, we were able to <em>map</em> the variable <code>Tree</code> to color by specifying <code>color = Tree</code> <em>inside</em> of our <code>aes()</code> mappings. How does this work with all of the other aesthetics you just learned about?</p>
<p>Essentially, they all work the same as color! That's the beautiful thing about graphing in <code>ggplot</code>--once you understand the syntax, it's very easy to expand your capabilities. </p>
<p>Each of the aesthetic mappings you've seen can also be used as a <em>parameter</em>, that is, a fixed value defined outside of the <code>aes()</code> aesthetic mappings. You saw how to do this with color when we set the line to red with <code>color = 'red'</code> before. Now let's look at an example of how to do this with linetype in the same manner:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> group <span class="o">=</span> Tree<span class="p">),</span> linetype <span class="o">=</span> <span class="s">'dotted'</span><span class="p">)</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/linetype_parameter-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>To review what values <code>linetype</code>, <code>size</code>, and <code>alpha</code> accept, just run <code>?linetype</code>, <code>?size</code>, or <code>?alpha</code> from your console window!</p>
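<p>These parameters can also be combined in a single call. Here's a quick sketch of what that looks like (the specific values below are my own illustrative choices, not taken from the examples above):</p>

```r
library(ggplot2)

# All three parameters sit outside aes(), so they apply uniformly:
# every tree's line is dark green, slightly thicker, and semi-transparent
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, group = Tree),
            color = 'darkgreen', size = 1.2, alpha = 0.5)
```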
<h2>Common errors with aesthetic mappings and parameters in ggplot</h2>
<p>When I was getting started with R and ggplot, the distinction between aesthetic mappings (the values included <em>inside</em> your <code>aes()</code>) and parameters (the ones <em>outside</em> your <code>aes()</code>) was the concept that tripped me up the most. You'll learn how to deal with these issues over time, but I can help you speed along that process by pointing out a few common errors to keep an eye out for.</p>
<h5>Trying to include aesthetic mappings <em>outside</em> your <code>aes()</code> call</h5>
<p>If you're trying to map the <code>Tree</code> variable to linetype, you should include <code>linetype = Tree</code> within the <code>aes()</code> of your <code>geom_line</code> call. What happens if you accidentally include it outside instead, and run <code>ggplot(Orange) + geom_line(aes(x = age, y = circumference), linetype = Tree)</code>? You'll get an error message that looks like this:</p>
<p><center>
<img alt="ggplot geom_line error message" src="../images/20190417_ggplot_geom_line/error_1.png" width="600px" />
</center></p>
<p>Whenever you see this error about an object not being found, double-check that you're including your aesthetic mappings <em>inside</em> the <code>aes()</code> call!</p>
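<p>For comparison, here is the working version of that call, with the mapping moved inside <code>aes()</code> (note the capital <code>T</code> in <code>Tree</code>, matching the column name in the dataset):</p>

```r
library(ggplot2)

# linetype = Tree inside aes() maps each tree to its own line pattern
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, linetype = Tree))
```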
<h5>Trying to specify parameters <em>inside</em> your <code>aes()</code> call</h5>
<p>Alternatively, if we try to specify a specific parameter value (for example, <code>color = 'red'</code>) inside of the <code>aes()</code> mapping, we get a less intuitive issue:</p>
<div class="highlight"><pre><span></span>ggplot<span class="p">(</span>Orange<span class="p">)</span> <span class="o">+</span>
geom_line<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> age<span class="p">,</span> y <span class="o">=</span> circumference<span class="p">,</span> color <span class="o">=</span> <span class="s">'red'</span><span class="p">))</span>
</pre></div>
<p><img src="/figures/20190416_ggplot_geom_line/unnamed-chunk-1-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>In this case, <code>ggplot</code> actually does produce a line graph (success!), but it doesn't have the result we intended. The graph it produces looks odd, because it is putting the values for all 5 trees on a single line, rather than on 5 separate lines like we had before. It did change the color to red, but it also included a legend that simply says 'red'. When you run into issues like this, double check to make sure you're including the parameters of your graph <em>outside</em> your <code>aes()</code> call!</p>
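<p>To get the result we actually intended, five separate red lines with no spurious legend, keep the <code>group</code> mapping inside <code>aes()</code> and move the color outside:</p>

```r
library(ggplot2)

# group = Tree inside aes() keeps the five lines separate;
# color = 'red' outside aes() colors them all red, with no legend
ggplot(Orange) +
  geom_line(aes(x = age, y = circumference, group = Tree), color = 'red')
```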
<p>You should now have a solid understanding of how to use R to plot line graphs using <code>ggplot</code> and <code>geom_line</code>! Experiment with the things you've learned to solidify your understanding. As an exercise, try producing a line graph of your own using a different dataset and at least one of the aesthetic mappings you learned about. Leave your graph in the comments or email it to me at mt.toth@gmail.com -- I'd love to take a look at what you produce!</p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>R Programmers Earn More than Python Programmers2019-04-15T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-15:r-programmers-earn-more-than-python-programmers.html<p>At least globally, that is. According to the <a href="https://insights.stackoverflow.com/survey/2019">2019 Stack Overflow Developer Survey</a>, R users globally reported earning an average of $64k per year, $1k more than the $63k reported by Python developers. In the United States, that situation reverses, with Python programmers earning $116k and R programmers $108k. </p>
<h6>Global Average Salaries by Technology</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Global_Average_Salaries_By_Tech.png" width="300"></p></p>
<h6>United States Average Salaries by Technology</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/US_Average_Salaries_By_Tech.png" width="300"></p></p>
<h2>Highlights of The Stack Overflow Developer Survey</h2>
<p>Since 2011, Stack Overflow has been surveying their users each year to answer questions about the technologies they use, their work experience, their compensation, and their satisfaction at work. Given Stack Overflow's place in the broader programming world, they are able to draw quite the audience for their annual surveys. </p>
<p>This year, nearly 90,000 developers participated in the survey! There's a lot in this survey, and I recommend reviewing it yourself, but I wanted to surface some of the key findings that I thought were particularly relevant to data professionals here.</p>
<p>Stack Overflow says they will be releasing the underlying data for this survey in the coming weeks, so I hope to return to this for a deeper analysis once that's made available. For now, let's get into the results!</p>
<h6>Developer Roles</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Global_Developer_Roles.png" width="350"></p></p>
<p>People with all different types of coding backgrounds use Stack Overflow. While most of them identify as developers (just over half, 51.9%, globally identify as full-stack developers), there are also a significant number of data professionals on the list (a term I've just invented to include the categories database administrator, data scientist, data analyst, and data engineer).</p>
<p>Globally, 11.7% of Stack Overflow users surveyed identified as database administrators, 7.9% as data scientists or machine learning specialists, 7.7% as data analysts / business analysts, and 7.2% as data engineers. These figures were slightly higher for US-based respondents to the survey.</p>
<p>Note that these figures are not mutually exclusive, as people were able to select multiple options in the survey.</p>
<h6>United States Average Salaries by Job</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/US_Average_Salaries_By_Job.png" width="350"></p></p>
<p>In the United States, data scientists/machine learning engineers reported an average salary of $120k. Data engineers also reported an average salary of $120k. Database administrators reported $105k, while data analysts / business analysts reported an average salary of $100k.</p>
<h6>Years of Experience</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Years_Of_Experience.png" width="350"></p></p>
<p>In the survey, data and business analysts reported an average of 9.3 years of professional coding experience, while data scientists reported an average of 7.8 years of experience. I thought this was particularly interesting in light of the fact that average salaries for data scientists were reported to be $20k higher than those of data analysts. </p>
<h6>Percent Looking for Jobs</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Looking_For_Jobs.png" width="350"></p></p>
<p>Both data scientists and data analysts ranked near the top of all groups surveyed in the share actively looking for jobs, with 18.6% of data scientists and 17.9% of data analysts looking.</p>
<h6>Programming, Scripting, and Markup Languages Used</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Tools_Used.png" width="350"></p></p>
<p>SQL was by far the most popular data technology, at 54.4%. This is to be expected given its broad use in business applications and its long history. Python clocked in at 41.7%, though as a general-purpose programming language, many of its users are likely not using it for data work in particular. R came in at 5.8% usage in the survey.</p>
<h6>Undergraduate Major</h6>
<p><p style="text-align:center;"><img src="https://michaeltoth.me/images/20190412_Stack_Overflow/Training.png" width="400"></p></p>
<p>While not directly related to analytics, I thought the survey question on undergraduate major was particularly interesting. A massive 62.4% of those surveyed reported studying computer science, computer engineering, or software engineering in undergrad. I myself studied finance and statistics, which came in at 2.4% and 3.9% respectively, quite low on the scale here. I wonder how much these figures would differ for data scientists, where fields like economics, statistics, and business seem to be better represented. I hope that when Stack Overflow releases their data for this survey I'll be able to analyze this question more deeply! </p>
<p>I encourage you to <a href="https://insights.stackoverflow.com/survey/2019">check out the survey</a> for a bunch of other useful information. There's more information on database usage and machine learning frameworks, as well as general information on work environments, developer satisfaction, and much more!</p>
<hr />
<p>Did you find this post interesting? I frequently write about topics in the fields of analytics and data science to help keep you up to date on developments in the industry. If you want to be notified of new posts, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>Announcement: Register by Friday for Free R Training Sessions2019-04-10T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-10:announcement-register-by-friday-for-free-r-training-sessions.html<p>I run this blog because I want to help you learn to be a better R user! But in order to do that, I need to know more about where you are in your journey and the kinds of problems you're facing. </p>
<p>So, I'm excited to announce that over the next two weeks I am going to be offering <strong>free one-on-one 45-minute R training sessions to 10 people!</strong> You come to me with a problem, and we'll work together over a screen sharing session to solve it using R!</p>
<p>Would it be helpful to spend 45 minutes with me and receive personalized advice on solving your R problems? Then please <a href="https://michaeltoth.me/pages/contact-me.html">contact me here</a> and give me some details about the project you'd like me to help with! </p>
<p>Make sure to submit your request by noon EST on Friday, April 12. I'll get back to you for scheduling early next week if you're one of the 10 selected!</p>
<p>I want to leave this relatively open to different topics, but this should give you some ideas of things I will be able to help you with:</p>
<ul>
<li>Loading and Cleaning Data in R using readr, dplyr, and other tidyverse tools</li>
<li>Creating graphs in R with ggplot</li>
<li>Animating graphs in R using gganimate</li>
<li>Creating reports in R using RMarkdown and knitr</li>
<li>Creating your own R package</li>
<li>Creating maps in R</li>
</ul>
<p>I look forward to hearing from you!</p>The Ultimate Opinionated Guide to Base R Date Format Functions2019-04-09T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-09:the-ultimate-opinionated-guide-to-base-r-date-format-functions.html<p>When I was first learning R, working with dates was one of the hardest and most time consuming tasks I dealt with. There are so many things to learn! What do I do with <code>as.POSIXct()</code>, <code>as.POSIXlt()</code>, <code>strftime()</code>, <code>strptime()</code>, <code>format()</code>, and <code>as.Date()</code>? R date formats were confusing, and it seemed no matter what I did I was always running into issues. </p>
<p>And I'll be honest, working with dates in R still trips me up from time to time. It can be confusing. But I've learned to follow a procedure to guide me through any date manipulation task with ease.</p>
<p>Today I'm going to help you learn that same procedure so you'll never have to worry about R date format issues again!</p>
<h2>The goals of most R date format exercises</h2>
<p>When working with R date formats, you're generally going to be trying to accomplish one of two different but related goals:</p>
<ol>
<li>Converting a character string like "<code>Jan 30 1989</code>" to a Date type</li>
<li>Getting an R Date object to print in a specific format for a graph or other output</li>
</ol>
<p>You may need to handle both of these goals in the same analysis, but it's best to think of them as two separate exercises. Knowing which goal you are trying to accomplish is important because you will need to use different functions to accomplish each of these. Let's tackle them one at a time.</p>
<h2>Converting a character string to a date</h2>
<p>A common scenario is that you have read in a .csv file where one of the columns contains dates. Often you will find this column is read in as a <code>character</code> vector. </p>
<p>If you don't need to use this column for anything, this might not be a problem. </p>
<p>But, often, you will need to do things with it! You'll want to sort a graph by date, or calculate the number of days between two dates, or format your dates in a specific way. </p>
<p>To accomplish any of those things, you'll first need to convert your <code>character</code> vector to a <code>Date</code> vector.</p>
<p>R has 3 main object types for working with dates: <code>Date</code>, <code>POSIXct</code>, and <code>POSIXlt</code>. <code>Date</code> objects can only work with dates, while <code>POSIXct</code> and <code>POSIXlt</code> objects can work with both dates and times. </p>
<p>Before you do any conversion, you need to first decide whether you want to keep any time data (if available) or if you only are working with dates. </p>
<p>If you're only working with dates, you'll want the <code>as.Date()</code> function which produces objects of type <code>Date</code>. </p>
<p>If you want both dates and times, you'll want the <code>as.POSIXct()</code> function which produces objects of type <code>POSIXct</code>. </p>
<p>Luckily, you'll find that these functions operate very similarly to one another, so you won't need to worry about memorizing little idiosyncrasies between them!</p>
<p>Side note: In this post, I'm going to ignore the <code>POSIXlt</code> type which is very similar to <code>POSIXct</code> with some implementation differences beyond the scope of this post.</p>
<h4>Converting a character string to a date using the as.Date() R function</h4>
<p>The main function for converting from a <code>character</code> string to a <code>Date</code> (<em>without</em> time information) is the <code>as.Date()</code> function. <code>as.Date()</code> accepts a date vector and a format specification. The format specification identifies what date information is contained in the character string you are providing. Let's look at an example:</p>
<div class="highlight"><pre><span></span>my_date <span class="o"><-</span> <span class="s">"01/30/1989"</span> <span class="c1"># Input character string</span>
<span class="kp">as.Date</span><span class="p">(</span>my_date<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<p>Our input character string was in the format month/day/year, and we used the R format specification that corresponds to this, <code>%m/%d/%Y</code>, to convert this character string to a date.</p>
<h4>Converting a character string to a POSIXct datetime using the <code>as.POSIXct()</code> R function</h4>
<p>The main function for converting from a <code>character</code> string to a <code>POSIXct</code> datetime object (<em>with</em> time information) is the <code>as.POSIXct()</code> function. Just like <code>as.Date()</code>, <code>as.POSIXct()</code> accepts a date vector and a format specification, which identifies what date and time information is contained in the character string you are providing. Here's an example:</p>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"01/30/1989 23:40:00"</span> <span class="c1"># Input character string with time information</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y %H:%M:%S"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>Here we added the time string <code>23:40:00</code> on top of the same date we processed previously. As before, we used the R format specification for this string, <code>%m/%d/%Y %H:%M:%S</code>, to convert to a datetime <code>POSIXct</code> object.</p>
<h4>Discarding unnecessary time data</h4>
<p>We'll get more into how these format specifications work in a minute, but first I want to make a quick aside. Sometimes your dates will contain time information, but you won't actually need that for your analysis. This can sometimes be annoying to keep around, and it's often cleaner if you just get rid of it. Luckily, this is quite easy. Instead of using the <code>as.POSIXct()</code> function, we can simply use the <code>as.Date()</code> function and ignore the trailing timestamp information, as follows:</p>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"01/30/1989 23:40:00"</span>
<span class="kp">as.Date</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<p>This will convert the character strings to <code>Date</code> objects, dropping the extraneous time data from our dataset. Remember: if you need time data, use <code>as.POSIXct()</code>, but if you don't, just use <code>as.Date()</code>!</p>
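<p>The same trick works if your data is already stored as a <code>POSIXct</code> object rather than a character string: <code>as.Date()</code> accepts a <code>POSIXct</code> input directly and drops the time of day. One caution, which is my own note rather than something covered above: the timezone assumed for this conversion has varied across R versions, so passing <code>tz</code> explicitly is the safer habit.</p>

```r
my_date_time <- as.POSIXct("01/30/1989 23:40:00",
                           format = "%m/%d/%Y %H:%M:%S",
                           tz = "America/New_York")

# Drop the time component; tz controls which calendar day you land on
as.Date(my_date_time, tz = "America/New_York")
## [1] "1989-01-30"
```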
<h4>More on R date format specifications</h4>
<p>Above, we reviewed examples of how <code>as.Date()</code> and <code>as.POSIXct()</code> can convert character strings to dates, given the right <em>format specification</em>. Now, it's time to review what the different format specifications are, and how we can use them to convert character strings formatted in all different ways to dates.</p>
<p>We've already briefly seen the symbols <code>%m</code>, <code>%d</code>, <code>%Y</code>, <code>%H</code>, <code>%M</code>, and <code>%S</code>, and you probably have some idea that these correspond to month, day, year, hour, minute, and second. I'd now like to introduce the list of the most commonly used R date format specifications:</p>
<p><center>
<img alt="Note: This table includes most commonly used R date formats, but is not exhaustive. For a complete list, run ?strptime in the R console." src="../images/common_r_date_formats.png" width="400px" />
</center> </p>
<p>Look through this table to identify the different date formats we worked through previously. Pay special attention to the difference between <code>%b</code>, <code>%B</code>, and <code>%m</code>, as well as <code>%y</code> and <code>%Y</code>! </p>
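<p>To see why those distinctions matter, here are three character strings for the same date, each needing a different month code (a quick sketch, assuming an English locale for the month names):</p>

```r
# Three ways of writing the same month, each needing a different code
as.Date("Jan 30 1989", format = "%b %d %Y")      # %b: abbreviated month name
as.Date("January 30 1989", format = "%B %d %Y")  # %B: full month name
as.Date("01/30/89", format = "%m/%d/%y")         # %m: numeric month; %y: 2-digit year

## All three return the same Date: "1989-01-30"
```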
<h3>Standard procedure for converting a character string to date or datetime object</h3>
<ol>
<li>Identify the key variables we need to map<ul>
<li>Generally month, day, and year for dates</li>
<li>Add in hour, minute, and second for times</li>
</ul>
</li>
<li>For each key variable, identify the appropriate mapping</li>
<li>Construct the format specification string</li>
<li>Construct the <code>as.Date()</code> or <code>as.POSIXct()</code> function call</li>
</ol>
<p>Let's work through a few examples!</p>
<p>Say we have a string in the format <code>Jan 30th, 1989 23:40</code>. </p>
<ol>
<li>The variables we need to map are month, day, year, hour, and minute</li>
<li>Let's find the appropriate mappings:<ul>
<li>This string uses an abbreviated month, which maps to <code>%b</code></li>
<li>Day of the month maps to <code>%d</code></li>
<li>4-digit years map to <code>%Y</code></li>
<li>24-hour-clock hour maps to <code>%H</code></li>
<li>Minutes map to <code>%M</code></li>
</ul>
</li>
<li>The format specification string should exactly match the input string, simply substituting in our mappings from above. In this case: <code>Jan 30th, 1989 23:40</code> becomes <code>%b %dth, %Y %H:%M</code></li>
<li>Because we're dealing with both dates and times here, we know we're going to need <code>as.POSIXct()</code> if we want to maintain that time data. Let's give this a try:</li>
</ol>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">"Jan 30th, 1989 23:40"</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%b %dth, %Y %H:%M"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>Awesome! It processed the data exactly as we needed it. Let's go through one more exercise to make sure we have this down before we move on to date formatting for output.</p>
<p>In this case, say we have the string <code>30 January 1989 11:40 PM</code>.</p>
<ol>
<li>The variables we need to map are day, month, year, hour, minute, and AM/PM</li>
<li>Let's again find the appropriate mappings:<ul>
<li>Day of the month maps to <code>%d</code></li>
<li>This string uses a full month, which maps to <code>%B</code></li>
<li>4-digit years map to <code>%Y</code></li>
<li>12-hour-clock hour maps to <code>%I</code></li>
<li>Minutes map to <code>%M</code></li>
<li>AM/PM indicator maps to <code>%p</code></li>
</ul>
</li>
<li>Substituting, <code>30 January 1989 11:40 PM</code> becomes <code>%d %B %Y %I:%M %p</code></li>
<li>Again, because we're dealing with both dates and times, we need <code>as.POSIXct()</code> to maintain the time data</li>
</ol>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="s">'30 January 1989 11:40 PM'</span>
<span class="kp">as.POSIXct</span><span class="p">(</span>my_date_time<span class="p">,</span> format <span class="o">=</span> <span class="s">"%d %B %Y %I:%M %p"</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30 23:40:00 EST"
</pre></div>
<p>There we have it! You should now be equipped to take a given character string, determine the format of that string, and then use either the <code>as.Date()</code> R function (in the case of only dates) or the <code>as.POSIXct()</code> R function (in the case of dates and times) to convert that character string to a date or datetime representation.</p>
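<p>One practical note before moving on (my own addition to the procedure): if your format specification doesn't exactly match the input string, these functions silently return <code>NA</code> rather than raising an error, so it's worth checking your conversions:</p>

```r
# Wrong separators in the format string: the result is NA, with no warning
as.Date("01/30/1989", format = "%m-%d-%Y")
## [1] NA

# A quick sanity check after converting a whole vector
dates <- as.Date(c("01/30/1989", "02/28/1990"), format = "%m/%d/%Y")
any(is.na(dates))
## [1] FALSE
```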
<p>Now, let's turn to our second challenge: formatting an R date or datetime object for output, cleaning things up to be publication-ready. Luckily, you'll find there are many similarities in the approach to what you've just learned!</p>
<h2>Formatting a date for publication-ready output</h2>
<p>In this case, you have an R object that is already stored as one of several R date formats (<code>Date</code>, <code>POSIXct</code>, or <code>POSIXlt</code>), and now you'd like to clean up that date for graphing or publication. I find that I perform this type of transformation most often when I'm making graphs, but this is useful for creating RMarkdown reports and other output as well.</p>
<p>While earlier we used the <code>as.POSIXct()</code> and <code>as.Date()</code> R functions to convert from characters to dates, we'll now be using the <code>format()</code> R function to convert from dates to characters! </p>
<p>As before, we'll want to decide what information is important for mapping, select the appropriate format specification, and then build our function call. Luckily, this process looks very similar to what we did before, we're just working in reverse! Using the same table from above, we can find the variables we need to map our date to a specific format. Review the examples below to see how we convert a date or datetime variable to different character formats for output!</p>
<div class="highlight"><pre><span></span>my_date <span class="o"><-</span> <span class="kp">as.Date</span><span class="p">(</span><span class="s">"01/30/1989"</span><span class="p">,</span> <span class="s">"%m/%d/%Y"</span><span class="p">)</span>
my_date <span class="c1"># Unformatted date</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "1989-01-30"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%B %d %Y'</span><span class="p">)</span> <span class="c1"># Date format 'January 30 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "January 30 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%B %dth, %Y'</span><span class="p">)</span> <span class="c1"># Date format 'January 30th, 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "January 30th, 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%d %b %Y'</span><span class="p">)</span> <span class="c1"># Date format '30 Jan 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "30 Jan 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%A %B %d %Y'</span><span class="p">)</span> <span class="c1"># Date format 'Monday January 30 1989'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "Monday January 30 1989"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date<span class="p">,</span> <span class="s">'%m/%d/%y'</span><span class="p">)</span> <span class="c1"># Date format '01/30/89'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "01/30/89"
</pre></div>
<div class="highlight"><pre><span></span>my_date_time <span class="o"><-</span> <span class="kp">Sys.time</span><span class="p">()</span> <span class="c1"># Function that generates the current time</span>
my_date_time <span class="c1"># Unformatted datetime</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "2019-04-09 13:28:12 EDT"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%B %d %Y'</span><span class="p">)</span> <span class="c1"># Datetime format 'April 09 2019' (Discard time)</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "April 09 2019"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%B %d %Y %H:%M'</span><span class="p">)</span> <span class="c1"># Datetime format 'April 09 2019 13:28'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "April 09 2019 13:28"
</pre></div>
<div class="highlight"><pre><span></span><span class="kp">format</span><span class="p">(</span>my_date_time<span class="p">,</span> <span class="s">'%H:%M on %B %d %Y'</span><span class="p">)</span> <span class="c1"># Datetime format '13:28 on April 09 2019'</span>
</pre></div>
<div class="highlight"><pre><span></span>## [1] "13:28 on April 09 2019"
</pre></div>
<p>There we have it! You should now be able to easily perform nearly any date operation you need in R. You can take character strings and convert them to dates using the <code>as.POSIXct()</code> and <code>as.Date()</code> R functions. You can also take date or datetime objects and use the <code>format()</code> function to clean them up for publication-ready graphs and papers! </p>
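<p>To recap with a minimal, self-contained example (base R only; note that month names produced by <code>%B</code> depend on your locale settings):</p>
<div class="highlight"><pre><span></span># Parse a character string into a Date, then format it for display
d <- as.Date('30/01/1989', format = '%d/%m/%Y')
format(d, '%B %d, %Y') # "January 30, 1989" in an English locale
</pre></div>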
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Filter in R: A Detailed Introduction to the dplyr Filter Function2019-04-08T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-08:how-to-filter-in-r-a-detailed-introduction-to-the-dplyr-filter-function.html<p>Data wrangling. It's the process of getting your raw data transformed into a format that's easier to work with for analysis. </p>
<p>It's not the sexiest or the most exciting work. </p>
<p>In our dreams, all datasets come to us perfectly formatted and ready for all kinds of sophisticated analysis! In real life, not so much. </p>
<p>It's estimated that as much as 75% of a data scientist's time is spent data wrangling. To be an effective data scientist, you need to be good at this, and you need to be FAST.</p>
<p>One of the most basic data wrangling tasks is filtering data: starting from a large dataset and reducing it to a smaller, more manageable dataset based on some criteria. </p>
<p>Think of filtering your sock drawer by color, and pulling out only the black socks. </p>
<p>Whenever I need to filter in R, I turn to the <code>dplyr filter</code> function.</p>
<p>As is often the case in programming, there are many ways to filter in R. But the <code>dplyr filter</code> function is by far my favorite, and it's the method I use the vast majority of the time.</p>
<p>Why do I like it so much? It has a user-friendly syntax, is easy to work with, and it plays very nicely with the other dplyr functions.</p>
<h2>A brief introduction to dplyr</h2>
<p>Before I go into detail on the <code>dplyr filter</code> function, I want to briefly introduce dplyr as a whole to give you some context. </p>
<p>dplyr is a cohesive set of data manipulation functions that will help make your data wrangling as painless as possible. </p>
<p>dplyr, at its core, consists of 5 functions, all serving a distinct data wrangling purpose:</p>
<ul>
<li><code>filter()</code> selects rows based on their values</li>
<li><code>mutate()</code> creates new variables</li>
<li><code>select()</code> picks columns by name</li>
<li><code>summarise()</code> calculates summary statistics</li>
<li><code>arrange()</code> sorts the rows</li>
</ul>
<p>The beauty of dplyr is that the syntax of all of these functions is very similar, and they all work together nicely. </p>
<p>If you master these 5 functions, you'll be able to handle nearly any data wrangling task that comes your way. But we need to tackle them one at a time, so now: let's learn to filter in R using dplyr!</p>
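<p>To give you a feel for how the five verbs fit together, here's an illustrative pipeline (a sketch only; this post covers just <code>filter()</code>, and the <code>price_per_carat</code> variable is invented for the example):</p>
<div class="highlight"><pre><span></span>library(dplyr)
library(ggplot2) # provides the diamonds dataset used below

diamonds %>%
  filter(carat > 1) %>%                        # keep rows by value
  mutate(price_per_carat = price / carat) %>%  # create a new variable
  select(cut, price_per_carat) %>%             # pick columns by name
  arrange(desc(price_per_carat)) %>%           # sort the rows
  summarise(avg_ppc = mean(price_per_carat))   # calculate a summary statistic
</pre></div>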
<h2>Loading Our Data</h2>
<p>In this post, I'll be using the <code>diamonds</code> dataset, a dataset built into the ggplot2 package, to illustrate the best use of the <code>dplyr filter</code> function. To start, let's take a look at the data:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
diamonds
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
</pre></div>
<p>We can see that the dataset gives characteristics of individual diamonds, including their carat, cut, color, clarity, and price. </p>
<h2>Our First dplyr Filter Operation</h2>
<p>I'm a big fan of learning by doing, so we're going to dive in right now with our first <code>dplyr filter</code> operation.</p>
<p>From our <code>diamonds</code> dataset, we're going to filter only those rows where the diamond cut is 'Ideal':</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 21,551 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
## # … with 21,541 more rows
</pre></div>
<p>As you can see, every diamond in the returned data frame is showing a cut of 'Ideal'. It worked! We'll cover exactly what's happening here in more detail, but first let's briefly review how R works with logical and relational operators, and how we can use those to efficiently filter in R.</p>
<h2>A brief aside on logical and relational operators in R and dplyr</h2>
<p>In dplyr, filter takes in 2 arguments: </p>
<ul>
<li>The dataframe you are operating on </li>
<li>A conditional expression that evaluates to <code>TRUE</code> or <code>FALSE</code> </li>
</ul>
<p>In the example above, we specified <code>diamonds</code> as the dataframe, and <code>cut == 'Ideal'</code> as the conditional expression.</p>
<p>Conditional expression? What am I talking about?</p>
<p>Under the hood, <code>dplyr filter</code> works by testing each row against your conditional expression and mapping the results to <code>TRUE</code> and <code>FALSE</code>. It then selects all rows that evaluate to <code>TRUE</code>. </p>
<p>In our first example above, we checked that the diamond cut was Ideal with the conditional expression <code>cut == 'Ideal'</code>. For each row in our data frame, dplyr checked whether the column <code>cut</code> was set to <code>'Ideal'</code>, and returned only those rows where <code>cut == 'Ideal'</code> evaluated to <code>TRUE</code>. </p>
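<p>You can inspect that logical vector yourself. Conceptually (though not literally how dplyr is implemented internally), the filter above is equivalent to base R row selection:</p>
<div class="highlight"><pre><span></span>matches <- diamonds$cut == 'Ideal' # logical vector: one TRUE/FALSE per row
sum(matches)                       # 21551 rows evaluate to TRUE
diamonds[matches, ]                # base R equivalent of filter(diamonds, cut == 'Ideal')
</pre></div>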
<p>In our first filter, we used the operator <code>==</code> to test for equality. That's not the only way we can use dplyr to filter our data frame, however. We can use a number of different <strong>relational operators</strong> to filter in R.</p>
<p><strong>Relational operators</strong> are used to compare values. In R generally (and in dplyr specifically), those are:</p>
<ul>
<li><code>==</code> (Equal to)</li>
<li><code>!=</code> (Not equal to)</li>
<li><code><</code> (Less than)</li>
<li><code><=</code> (Less than or equal to)</li>
<li><code>></code> (Greater than)</li>
<li><code>>=</code> (Greater than or equal to)</li>
</ul>
<p>These are standard mathematical operators you're used to, and they work as you'd expect. One quick note: make sure you use the double equals sign (<code>==</code>) for comparisons! By convention, a single equals sign (<code>=</code>) is used to assign a value to a variable, and a double equals sign (<code>==</code>) is used to check whether two values are equal. Using a single equals sign will often give an error message that is not intuitive, so make sure you check for this common error!</p>
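<p>Here's what that common mistake looks like in practice (the first line is commented out because it errors; the exact message varies by dplyr version, though recent versions helpfully suggest <code>==</code>):</p>
<div class="highlight"><pre><span></span># filter(diamonds, cut = 'Ideal')  # error! '=' is treated as a named argument
filter(diamonds, cut == 'Ideal')   # correct: '==' tests each row for equality
</pre></div>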
<p>dplyr can also make use of the following <strong>logical operators</strong> to string together multiple different conditions in a single <code>dplyr filter</code> call!</p>
<ul>
<li><code>!</code> (logical NOT)</li>
<li><code>&</code> (logical AND)</li>
<li><code>|</code> (logical OR)</li>
</ul>
<p>There are two additional operators that will often be useful when working with dplyr to filter:</p>
<ul>
<li><code>%in%</code> (Checks if a value is in an array of multiple values)</li>
<li><code>is.na()</code> (Checks whether a value is NA)</li>
</ul>
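<p>The <code>diamonds</code> dataset has no missing values, so here's a sketch of how <code>is.na()</code> is typically used in a filter, with a hypothetical data frame <code>df</code> and column <code>x</code>:</p>
<div class="highlight"><pre><span></span>filter(df, is.na(x))  # keep only rows where x is missing
filter(df, !is.na(x)) # more common: drop rows where x is missing
</pre></div>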
<p>In our first example above, we tested for equality when we said <code>cut == 'Ideal'</code>. Now, let's expand our capabilities with different relational operators in our filter:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> price <span class="o">></span> <span class="m">2000</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 29,733 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5
## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56
## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72
## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38
## 7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4
## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56
## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66
## # … with 29,723 more rows
</pre></div>
<p>Here, we select only the diamonds where the price is greater than 2000.</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">!=</span> <span class="s">'Ideal'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 32,389 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 32,379 more rows
</pre></div>
<p>And here, we select all the diamonds whose cut is NOT equal to 'Ideal'. Note that this is the exact opposite of what we filtered before. </p>
<p>You can use <code><</code>, <code>></code>, <code><=</code>, <code>>=</code>, <code>==</code>, and <code>!=</code> in similar ways to filter your data. Try a few examples on your own to get comfortable with the different filtering options!</p>
<h2>A note on storing your results</h2>
<p>By default, dplyr filter will perform the operation you ask and then print the result to the screen. If you prefer to store the result in a variable, you'll need to assign it as follows:</p>
<div class="highlight"><pre><span></span>e_diamonds <span class="o"><-</span> filter<span class="p">(</span>diamonds<span class="p">,</span> color <span class="o">==</span> <span class="s">'E'</span><span class="p">)</span>
</pre></div>
<p>Note that you can also overwrite the dataset (that is, assign the result back to the <code>diamonds</code> data frame) if you don't want to retain the unfiltered data. In this case I want to keep it, so I'll store this result in <code>e_diamonds</code>. In any case, it's always a good idea to preview your <code>dplyr filter</code> results before you overwrite any data!</p>
<h2>Filtering Numeric Variables</h2>
<p>Numeric variables are the quantitative variables in a dataset. In the diamonds dataset, this includes the variables carat and price, among others. When working with numeric variables, it is easy to filter based on ranges of values. For example, if we wanted to get any diamonds priced between 1000 and 1500, we could easily filter as follows:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> price <span class="o">>=</span> <span class="m">1000</span> <span class="o">&</span> price <span class="o"><=</span> <span class="m">1500</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 5,511 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.38 Very Good E VVS2 61.8 56 1000 4.66 4.68 2.88
## 2 0.39 Very Good F VS1 57.1 61 1000 4.86 4.91 2.79
## 3 0.38 Very Good E VS1 61.5 58 1000 4.64 4.69 2.87
## 4 0.38 Premium E VS1 60.7 59 1000 4.65 4.7 2.84
## 5 0.38 Ideal E VS1 61.6 56 1000 4.65 4.67 2.87
## 6 0.53 Very Good G SI2 62.5 55 1000 5.14 5.19 3.23
## 7 0.570 Very Good I SI2 62.1 57 1000 5.29 5.33 3.3
## 8 0.38 Ideal E VS1 61.9 56 1000 4.63 4.67 2.88
## 9 0.5 Good E SI2 63.2 61 1000 5.02 5.05 3.18
## 10 0.3 Ideal D VVS1 61.3 57 1000 4.29 4.32 2.64
## # … with 5,501 more rows
</pre></div>
<p>In general, when working with numeric variables, you'll most often make use of the inequality operators, <code>></code>, <code><</code>, <code>>=</code>, and <code><=</code>. While it is possible to use the <code>==</code> and <code>!=</code> operators with numeric variables, I generally recommend against it. </p>
<p>The issue with using <code>==</code> is that it will only return <code>TRUE</code> if the value is exactly equal to what you're testing for. If the dataset you're testing against consists of integers, that's fine, but if you're dealing with decimals, exact comparisons will often break down. For example, <code>1.0100000001 == 1.01</code> evaluates to <code>FALSE</code>. That's technically correct behavior, but floating-point precision makes it easy to get into trouble. I never use <code>==</code> when working with numerical variables unless the data I am working with consists of integers only!</p>
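<p>If you genuinely need to test a decimal column for equality, dplyr provides <code>near()</code>, which compares within a small tolerance instead of exactly:</p>
<div class="highlight"><pre><span></span>near(0.1 + 0.2, 0.3)               # TRUE, even though 0.1 + 0.2 == 0.3 is FALSE
filter(diamonds, near(carat, 0.3)) # safer than carat == 0.3
</pre></div>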
<h2>Filtering Categorical Variables</h2>
<p>Categorical variables are non-quantitative variables. In our example dataset, the columns cut, color, and clarity are categorical variables. In contrast to numerical variables, the inequalities <code>></code>, <code><</code>, <code>>=</code> and <code><=</code> have no meaning here. Instead, you'll make frequent use of the <code>==</code>, <code>!=</code>, and <code>%in%</code> operators when filtering categorical variables. </p>
<p>Above, we filtered the dataset to include only the diamonds whose cut was Ideal using the <code>==</code> operator. Let's say that we wanted to expand this filter to also include diamonds where the cut is Premium. To accomplish this, we would use the <code>%in%</code> operator:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">%in%</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Ideal'</span><span class="p">,</span> <span class="s">'Premium'</span><span class="p">))</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 35,342 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 4 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 5 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33
## 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 7 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 8 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
## 9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 10 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
## # … with 35,332 more rows
</pre></div>
<p>How does this work? First, we create a vector of our desired cut options, <code>c('Ideal', 'Premium')</code>. Then, we use <code>%in%</code> to keep only those diamonds whose cut is in that vector. dplyr will keep BOTH the diamonds whose cut is Ideal AND the diamonds whose cut is Premium. The vector you check against with the <code>%in%</code> operator can be arbitrarily long, which can be very useful when working with categorical data. </p>
<p>It's also important to note that the vector can be defined before you perform the dplyr filter operation:</p>
<div class="highlight"><pre><span></span>cuts_to_include <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Good'</span><span class="p">,</span> <span class="s">'Very Good'</span><span class="p">,</span> <span class="s">'Ideal'</span><span class="p">,</span> <span class="s">'Premium'</span><span class="p">)</span>
filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">%in%</span> cuts_to_include<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 52,330 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## # … with 52,320 more rows
</pre></div>
<p>This helps to increase the readability of your code when you're filtering against a larger set of potential options. This also means that if you have an existing vector of options from another source, you can use this to filter your dataset. This can come in very useful as you start working with multiple datasets in a single analysis!</p>
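<p>For instance, the vector might come from a second dataset (the <code>inventory</code> data frame here is invented for illustration):</p>
<div class="highlight"><pre><span></span># Suppose another dataset tells us which diamond colors we have in stock
in_stock_colors <- unique(inventory$color) # hypothetical data frame
filter(diamonds, color %in% in_stock_colors)
</pre></div>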
<h2>Chaining together multiple filtering operations with logical operators</h2>
<p>The real power of the dplyr filter function is in its flexibility. Using the logical operators &, |, and !, we can group many filtering operations in a single command to get the exact dataset we want!</p>
<p>Let's say we want to select all diamonds where the cut is Ideal and the carat is greater than 1:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">&</span> carat <span class="o">></span> <span class="m">1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 5,662 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.01 Ideal I I1 61.5 57 2844 6.45 6.46 3.97
## 2 1.02 Ideal H SI2 61.6 55 2856 6.49 6.43 3.98
## 3 1.02 Ideal I I1 61.7 56 2872 6.44 6.49 3.99
## 4 1.02 Ideal J SI2 60.3 54 2879 6.53 6.5 3.93
## 5 1.01 Ideal I I1 61.5 57 2896 6.46 6.45 3.97
## 6 1.02 Ideal I I1 61.7 56 2925 6.49 6.44 3.99
## 7 1.14 Ideal J SI1 60.2 57 3045 6.81 6.71 4.07
## 8 1.02 Ideal H SI2 58.8 57 3142 6.61 6.55 3.87
## 9 1.06 Ideal I SI2 62.8 55 3146 6.51 6.46 4.07
## 10 1.02 Ideal I VS2 62.8 57 3148 6.45 6.39 4.03
## # … with 5,652 more rows
</pre></div>
<p>BOTH conditions must evaluate to <code>TRUE</code> for a row to be selected. That is, the cut must be Ideal, and the carat must be greater than 1. </p>
<p>You don't need to limit yourself to two conditions either. You can have as many as you want! Let's say we also wanted to make sure the color of the diamond was E. We can extend our example:</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">&</span> carat <span class="o">></span> <span class="m">1</span> <span class="o">&</span> color <span class="o">==</span> <span class="s">'E'</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 531 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1.25 Ideal E I1 60.9 56 3276 6.95 6.91 4.22
## 2 1.01 Ideal E I1 62 57 3388 6.37 6.41 3.96
## 3 1.01 Ideal E I1 62 57 3450 6.41 6.37 3.96
## 4 1.02 Ideal E SI2 62.3 56 3455 6.42 6.37 3.98
## 5 1.04 Ideal E SI2 59 57 3588 6.65 6.6 3.91
## 6 1.13 Ideal E I1 62 55 3729 6.66 6.7 4.14
## 7 1.09 Ideal E SI2 59.4 57 3760 6.74 6.65 3.98
## 8 1.13 Ideal E I1 62 55 3797 6.7 6.66 4.14
## 9 1.12 Ideal E SI2 60.9 57 3864 6.66 6.6 4.04
## 10 1.1 Ideal E I1 61.9 56 3872 6.59 6.63 4.09
## # … with 521 more rows
</pre></div>
<p>What if we wanted to select rows where the cut is ideal OR the carat is greater than 1? Then we'd use the | operator!</p>
<div class="highlight"><pre><span></span>filter<span class="p">(</span>diamonds<span class="p">,</span> cut <span class="o">==</span> <span class="s">'Ideal'</span> <span class="o">|</span> carat <span class="o">></span> <span class="m">1</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## # A tibble: 33,391 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75
## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44
## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72
## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63
## # … with 33,381 more rows
</pre></div>
<p>Any time you want to filter your dataset based on some combination of logical statements, this is possible using the <code>dplyr filter</code> function and R's built-in logical operators. You just need to figure out how to combine your logical expressions to get exactly what you want!</p>
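<p>One pattern worth calling out is using <code>!</code> with parentheses to negate an entire grouped condition:</p>
<div class="highlight"><pre><span></span># All diamonds EXCEPT Ideal-cut diamonds over 1 carat
filter(diamonds, !(cut == 'Ideal' & carat > 1))

# Equivalent, by De Morgan's laws:
filter(diamonds, cut != 'Ideal' | carat <= 1)
</pre></div>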
<h2>Conclusion</h2>
<p><code>dplyr filter</code> is one of my most-used functions in R in general, and especially when I am looking to filter in R. With this article you should have a solid overview of how to filter a dataset, whether your variables are numerical, categorical, or a mix of both. Practice what you learned right now to make sure you cement your understanding of how to effectively filter in R using dplyr!</p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Create a Bar Chart Race in R - Mapping United States City Population 1790-20102019-04-04T00:00:00-04:00Michael Tothtag:michaeltoth.me,2019-04-04:how-to-create-a-bar-chart-race-in-r-mapping-united-states-city-population-1790-2010.html<p>In my corner of the internet, there's been an explosion over the last several months of a style of graph called a bar chart race. Essentially, a bar chart race shows how a ranked list of something--largest cities, most valuable companies, most-followed Youtube channels--evolves over time. Maybe you've been following this trend with the same curiosity that I have, and that's how you made your way here. Or maybe you're a normal person who doesn't even know what I'm talking about! Who knows, anything is possible. By way of introduction, here is the bar chart race I've created on the largest cities in the United States over time:</p>
<p><img alt="center" src="/figures/city_populations/create_graph-1.gif" /></p>
<h2>Motivation for this post</h2>
<p>For me, it all started with <a href="https://www.youtube.com/watch?v=BQovQUga0VE">this brand value graphic</a> that went viral back in February. For a few days, it felt like this thing was everywhere. As it spread, data visualization practitioners all started to try their hand at creating new versions of it on their own. One of my favorites was <a href="https://observablehq.com/@johnburnmurdoch/bar-chart-race-the-most-populous-cities-in-the-world">this bar chart race of world cities</a> created by John Burn-Murdoch, who, as far as I can tell, was the person who coined the term bar chart race.</p>
<p>Of course, I knew that I wanted to try to create a version of this graph in R. And for as long as I can remember, I've been obsessed with looking at population statistics for cities. I remember finding an Almanac (!) on my dad's bookshelf as a kid and memorizing the list of the largest cities in the United States. </p>
<p>Later, I remember reading that Detroit had at one time been among the largest cities in the country. From 1916 to 1944, Detroit was the nation's fourth largest city. Its population peaked at 1.85 million in 1950. Today its population is estimated to be 673,000. The history of Detroit's population was particularly interesting to me. Having grown up in Toledo, Ohio, 60 miles south of Detroit, I'd seen the effects of Detroit's dramatic population decrease first-hand. I wanted to see how this story played out in the data, and what other interesting trends would be unearthed.</p>
<p>So when I decided I wanted to create a bar chart race, I knew the subject I was going to study. If you're here to learn how to create a bar chart race in R, you're in the right place! Now, let's get into it!</p>
<h2>Loading packages and data</h2>
<p>We start by loading the packages we'll use to create the graph. gganimate provides the toolkit for animation, tidyverse will help with data processing and graphing, and hrbrthemes provides a nice-looking base graphing theme.</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>gganimate<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>tidyverse<span class="p">)</span>
</pre></div>
<p>Now we load in some preprocessed census population data, based on decadal U.S. Census data from 1790 to 2010, and combine it into one large dataset for this analysis.</p>
<div class="highlight"><pre><span></span><span class="c1"># Read in Census datasets by year that I downloaded and stored locally</span>
all_data <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">()</span>
<span class="kr">for</span><span class="p">(</span>year <span class="kr">in</span> <span class="kp">seq</span><span class="p">(</span><span class="m">1790</span><span class="p">,</span> <span class="m">2010</span><span class="p">,</span> <span class="m">10</span><span class="p">))</span> <span class="p">{</span>
data_path <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span><span class="s">'~/dev/michaeltoth/content/data/city_populations/'</span><span class="p">,</span> year<span class="p">,</span> <span class="s">'.csv'</span><span class="p">)</span>
data <span class="o"><-</span> read_csv<span class="p">(</span>data_path<span class="p">)</span>
data <span class="o"><-</span> data<span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">5</span><span class="p">]</span>
<span class="kp">colnames</span><span class="p">(</span>data<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Rank'</span><span class="p">,</span> <span class="s">'City'</span><span class="p">,</span> <span class="s">'State'</span><span class="p">,</span> <span class="s">'Population'</span><span class="p">,</span> <span class="s">'Region'</span><span class="p">)</span>
data<span class="o">$</span>year <span class="o"><-</span> year
all_data <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>all_data<span class="p">,</span> data<span class="p">)</span>
<span class="p">}</span>
<span class="c1"># The datasets were inconsistent with state naming, sometimes using full names</span>
<span class="c1"># And sometimes abbreviations. This code standardizes on state names:</span>
all_data<span class="o">$</span>State_From_Abbrev <span class="o"><-</span> state.name<span class="p">[</span><span class="kp">match</span><span class="p">(</span>all_data<span class="o">$</span>State<span class="p">,</span>state.abb<span class="p">)]</span>
all_data <span class="o"><-</span> all_data <span class="o">%>%</span> mutate<span class="p">(</span>State <span class="o">=</span> case_when<span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>State_From_Abbrev<span class="p">)</span> <span class="o">~</span> State<span class="p">,</span>
<span class="kc">TRUE</span> <span class="o">~</span> State_From_Abbrev<span class="p">))</span> <span class="o">%>%</span>
select<span class="p">(</span><span class="o">-</span>State_From_Abbrev<span class="p">)</span>
</pre></div>
<h2>Interpolating missing values between census readings</h2>
<p>Here I'm going to make some adjustments to the datasets I'm using to get them in a format I want to work with. There are 2 things in particular I want to accomplish: </p>
<ol>
<li>The datasets only contain information at decade intervals. I want yearly data, so I'm going to create blank entries for intermediate years that I'll later fill with linear interpolation. </li>
<li>The datasets generally contain the 100 most populous cities, but I only care about the top 10 at any given time, so I'm going to discard any cities that don't at some point crack the top 10.</li>
</ol>
<div class="highlight"><pre><span></span><span class="c1"># Get the list of cities that were at some point in the top 10 by population</span>
top_cities <span class="o"><-</span> all_data <span class="o">%>%</span> filter<span class="p">(</span>Rank <span class="o"><=</span> <span class="m">10</span><span class="p">)</span> <span class="o">%>%</span>
select<span class="p">(</span>City<span class="p">,</span> State<span class="p">,</span> Region<span class="p">)</span> <span class="o">%>%</span> distinct<span class="p">()</span>
<span class="c1"># Generate a list of all years from 1790 - 2010</span>
all_years <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>year <span class="o">=</span> <span class="kp">seq</span><span class="p">(</span><span class="m">1790</span><span class="p">,</span> <span class="m">2010</span><span class="p">,</span> <span class="m">1</span><span class="p">))</span>
<span class="c1"># Create all combinations of city and year we'll need for our final dataset</span>
all_combos <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>top_cities<span class="p">,</span> all_years<span class="p">,</span> all <span class="o">=</span> <span class="bp">T</span><span class="p">)</span>
<span class="c1"># This accomplishes 2 things:</span>
<span class="c1"># 1. Filters out cities that are not ever in the top 10</span>
<span class="c1"># 2. Adds rows for all years (currently blank) to our existing dataset for each city</span>
all_data_interp <span class="o"><-</span> <span class="kp">merge</span><span class="p">(</span>all_data<span class="p">,</span> all_combos<span class="p">,</span> all.y <span class="o">=</span> <span class="bp">T</span><span class="p">)</span>
</pre></div>
<p>Next, I'll use linear interpolation to estimate city populations between the census readings, which are taken every 10 years. This isn't strictly necessary, but I think it produces a more interesting final graphic than using only the official census statistics.</p>
<div class="highlight"><pre><span></span>all_data_interp <span class="o"><-</span> all_data_interp <span class="o">%>%</span>
group_by<span class="p">(</span>City<span class="p">)</span> <span class="o">%>%</span>
mutate<span class="p">(</span>Population<span class="o">=</span>approx<span class="p">(</span>year<span class="p">,</span>Population<span class="p">,</span>year<span class="p">)</span><span class="o">$</span>y<span class="p">)</span>
</pre></div>
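<p>To see what <code>approx</code> does here, consider a single city with readings only at the decade endpoints (the numbers below are made up for illustration):</p>

```r
# Two census readings ten years apart (hypothetical populations)
years <- c(1790, 1800)
pop   <- c(30000, 60000)

# approx() linearly interpolates a value for every year in between
approx(x = years, y = pop, xout = 1790:1800)$y
# returns 30000, 33000, 36000, ..., 60000 (steps of 3000 per year)
```

<p>The <code>group_by(City)</code> in the pipeline above simply applies this same interpolation to each city's series independently.</p>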
<p>Last step before we graph! Here we calculate the ranked list of the top 10 cities for each year,
then filter so only those cities remain in the data for that year. </p>
<div class="highlight"><pre><span></span>data <span class="o"><-</span> all_data_interp <span class="o">%>%</span>
group_by<span class="p">(</span>year<span class="p">)</span> <span class="o">%>%</span>
arrange<span class="p">(</span><span class="o">-</span>Population<span class="p">)</span> <span class="o">%>%</span>
mutate<span class="p">(</span>rank<span class="o">=</span>row_number<span class="p">())</span> <span class="o">%>%</span>
filter<span class="p">(</span><span class="kp">rank</span><span class="o"><=</span><span class="m">10</span><span class="p">)</span>
</pre></div>
<h2>Animating the bar chart race in R</h2>
<p>Finally, we create the graph! This piece of code looks a bit intimidating, but mostly it's formatting for the graph. Much of the core code here comes from <a href="https://github.com/stevejburr/Bar-Chart-Race/blob/master/Final.R">this code by Steven Burr</a>, which was very helpful as I tried to figure out how best to use gganimate for this purpose. The key points to call out: </p>
<ol>
<li>I use <code>geom_tile</code>, not <code>geom_bar</code>, as this allows for better transitions within gganimate.</li>
<li>The gganimate functions <code>transition_time</code> and <code>ease_aes</code> handle the animation and the transitions between bars. The settings here worked well for my purposes, but dig into these functions to get an overview of the different options.</li>
<li>The <code>nframes</code> and <code>fps</code> parameters to the <code>animate</code> function control the speed of transitions. One mistake I made here initially was to set <code>nframes</code> equal to the number of years in the dataset. This works, but because there is only one frame per year, you don't get the smooth transitions that I wanted in this graph. Increasing the number of frames fixed that issue.</li>
</ol>
<div class="highlight"><pre><span></span>p <span class="o"><-</span> data <span class="o">%>%</span>
ggplot<span class="p">(</span>aes<span class="p">(</span>x <span class="o">=</span> <span class="o">-</span><span class="kp">rank</span><span class="p">,</span>y <span class="o">=</span> Population<span class="p">,</span> group <span class="o">=</span> City<span class="p">))</span> <span class="o">+</span>
geom_tile<span class="p">(</span>aes<span class="p">(</span>y <span class="o">=</span> Population <span class="o">/</span> <span class="m">2</span><span class="p">,</span> height <span class="o">=</span> Population<span class="p">,</span> fill <span class="o">=</span> Region<span class="p">),</span> width <span class="o">=</span> <span class="m">0.9</span><span class="p">)</span> <span class="o">+</span>
geom_text<span class="p">(</span>aes<span class="p">(</span>label <span class="o">=</span> City<span class="p">),</span> hjust <span class="o">=</span> <span class="s">"right"</span><span class="p">,</span> colour <span class="o">=</span> <span class="s">"black"</span><span class="p">,</span> fontface <span class="o">=</span> <span class="s">"bold"</span><span class="p">,</span> nudge_y <span class="o">=</span> <span class="m">-100000</span><span class="p">)</span> <span class="o">+</span>
geom_text<span class="p">(</span>aes<span class="p">(</span>label <span class="o">=</span> scales<span class="o">::</span>comma<span class="p">(</span>Population<span class="p">)),</span> hjust <span class="o">=</span> <span class="s">"left"</span><span class="p">,</span> nudge_y <span class="o">=</span> <span class="m">100000</span><span class="p">,</span> colour <span class="o">=</span> <span class="s">"grey30"</span><span class="p">)</span> <span class="o">+</span>
coord_flip<span class="p">(</span>clip<span class="o">=</span><span class="s">"off"</span><span class="p">)</span> <span class="o">+</span>
scale_fill_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">'Region'</span><span class="p">,</span> values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"#66c2a5"</span><span class="p">,</span> <span class="s">"#fc8d62"</span><span class="p">,</span> <span class="s">"#8da0cb"</span><span class="p">,</span> <span class="s">"#e78ac3"</span><span class="p">))</span> <span class="o">+</span>
scale_x_discrete<span class="p">(</span><span class="s">""</span><span class="p">)</span> <span class="o">+</span>
scale_y_continuous<span class="p">(</span><span class="s">""</span><span class="p">,</span>labels<span class="o">=</span>scales<span class="o">::</span>comma<span class="p">)</span> <span class="o">+</span>
hrbrthemes<span class="o">::</span>theme_ipsum<span class="p">(</span>plot_title_size <span class="o">=</span> <span class="m">32</span><span class="p">,</span> subtitle_size <span class="o">=</span> <span class="m">24</span><span class="p">,</span> caption_size <span class="o">=</span> <span class="m">20</span><span class="p">,</span> base_size <span class="o">=</span> <span class="m">20</span><span class="p">)</span> <span class="o">+</span>
theme<span class="p">(</span>panel.grid.major.y<span class="o">=</span>element_blank<span class="p">(),</span>
panel.grid.minor.x<span class="o">=</span>element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0.4</span><span class="p">,</span> <span class="m">0.2</span><span class="p">),</span>
plot.margin <span class="o">=</span> margin<span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="s">"cm"</span><span class="p">),</span>
axis.text.y<span class="o">=</span>element_blank<span class="p">())</span> <span class="o">+</span>
<span class="c1"># gganimate code to transition by year:</span>
transition_time<span class="p">(</span>year<span class="p">)</span> <span class="o">+</span>
ease_aes<span class="p">(</span><span class="s">'cubic-in-out'</span><span class="p">)</span> <span class="o">+</span>
labs<span class="p">(</span>title<span class="o">=</span><span class="s">'Largest Cities in the United States'</span><span class="p">,</span>
subtitle<span class="o">=</span><span class="s">'Population in {round(frame_time,0)}'</span><span class="p">,</span>
caption<span class="o">=</span><span class="s">'Source: United States Census</span>
<span class="s">michaeltoth.me / @michael_toth'</span><span class="p">)</span>
animate<span class="p">(</span>p<span class="p">,</span> nframes <span class="o">=</span> <span class="m">750</span><span class="p">,</span> fps <span class="o">=</span> <span class="m">25</span><span class="p">,</span> end_pause <span class="o">=</span> <span class="m">50</span><span class="p">,</span> width <span class="o">=</span> <span class="m">1200</span><span class="p">,</span> height <span class="o">=</span> <span class="m">900</span><span class="p">)</span>
</pre></div>
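<p>If you want to write the rendered animation to disk rather than just view it, gganimate's <code>anim_save</code> function saves the most recently rendered animation (the filename here is just an example):</p>

```r
library(gganimate)

# Saves the animation produced by the animate() call above;
# by default, anim_save() writes the last rendered animation
anim_save("city_populations.gif")
```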
<p><img alt="center" src="/figures/city_populations/create_graph-1.gif" /></p>
<p>And there we have it! If you end up creating a bar chart race of your own, please share it in the comments - I'd love to take a look! </p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>You Need to Start Branding Your Graphs. Here's How, with ggplot!2019-02-27T00:00:00-05:00Michael Tothtag:michaeltoth.me,2019-02-27:you-need-to-start-branding-your-graphs-heres-how-with-ggplot.html<p>In today's post I want to help you incorporate your company's branding into your ggplot graphs. Why should you care about this? I'm glad you asked!</p>
<p>
<img src="/figures/add_branding_to_graphs/ggplot_base-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Have you ever seen a graph that looks like this? Of course you have! This is the default ggplot theme, and these graphs are everywhere. Now, look--I like this graph. The base ggplot theme is reasonable and the graph is clear. But nothing about it differentiates it from the thousands of other, similarly designed graphs on the internet. That's not good. You want your graphs to stand out! Take a look at this next graph:</p>
<center>
<p><img alt="FiveThirtyEight Graph" src="https://fivethirtyeight.com/wp-content/uploads/2014/05/morris-feature-qbweight-chart-3.png" width="500px" /></p>
</center>
<p>I'm guessing, even before you saw the caption at the bottom, that you knew this was a graph from FiveThirtyEight, Nate Silver's data-driven news service.</p>
<p>How about this one?</p>
<center>
<p><img alt="Economist Graph" src="https://www.economist.com/sites/default/files/imagecache/1280-width/images/2019/02/articles/body/20190216_woc346.png" width="300px" /></p>
</center>
<p>Again, this graph is immediately recognizable as coming from the Economist magazine. These two companies have done exceptional jobs of creating branded, differentiating styles to make their graphics immediately recognizable to anybody who sees them. In this post, I'm going to convince you why it's important that you develop a branded style for your graphs at your own company, and then I'll show you some quick steps to do it. </p>
<p>Now, I know, you might be thinking: branding and visual identity are something for the design and marketing teams to worry about. You're a data scientist! You don't have those skills, and, frankly, you have more important things to do. I sympathize, but I'm going to be honest with you: you need to get that idea out of your head. It's easier than you think, and it's part of your job! Or, at least, it should be. You see, when you start to make YOUR work fit in with YOUR COMPANY'S work, doors start to open for you. You create more cross-departmental and external-facing opportunities when your work clearly matches the company brand. Maybe a graph you created can help the sales team put together a presentation to win a big client. Maybe the marketing team can use your work to help put together a press kit. These probably aren't the core focus of your job, but the more you help people around your organization, the more respected you'll be and the more opportunities you'll have. </p>
<p>Over time, you will build a reputation and be recognized for your quality work because your work will be VISIBLE. Pretty soon, you will be the go-to person when an executive needs a graph for a board presentation or an investor pitch. As you build more relationships throughout your company, you'll be able to direct your focus to work that you enjoy, get involved in interesting projects, and advocate for projects of your own. And the best part? Most of this is completely passive! You're already doing the work, these tips will just help you to make it more visible, more shareable, and more impactful!</p>
<p>Convinced? Okay, let's get started. This is going to be one of the easiest high-impact changes you can make, so I hope you're excited!</p>
<p>I'm going to show you how you can easily change the color palette of a graph and add your company's logo to create a final, branded image that's ready for publication. </p>
<p>To start, create a standard ggplot graph like you otherwise would:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>hrbrthemes<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>magick<span class="p">)</span>
<span class="c1"># Create a base graph, similar to what we had above</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>iris<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> Petal.Width<span class="p">,</span> y <span class="o">=</span> Petal.Length<span class="p">,</span> color <span class="o">=</span> Species<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">()</span> <span class="o">+</span>
labs<span class="p">(</span>title <span class="o">=</span> <span class="s">'Branding your ggplot Graphs'</span><span class="p">,</span>
subtitle <span class="o">=</span> <span class="s">'Simple tweaks you can use to boost the impact of your graphs today'</span><span class="p">,</span>
x <span class="o">=</span> <span class="s">'This axis title intentionally left blank'</span><span class="p">,</span>
y <span class="o">=</span> <span class="s">'This axis title intentionally left blank'</span><span class="p">,</span>
caption <span class="o">=</span> <span class="s">'michaeltoth.me / @michael_toth'</span><span class="p">)</span>
p
</pre></div>
<p><img src="/figures/add_branding_to_graphs/create_graph-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Once you have your base graph put together, the next step is for you to change the colors to match your company's color palette. I'm going to create a Coca-Cola branded graph for demonstration purposes, but you should substitute in your own company's details here. If you don't know what your company's color palette is, ask somebody from your design or marketing teams to send it to you! Here, I'm using red, black, and gray to match Coca-Cola's color palette. </p>
<p>Choosing individual colors like I'm doing here works for categorical graphs, but for continuous graphs you'll need to do a bit more work to get a branded color scale. That's a subject for another post, but check out the awesome <a href="https://projects.susielu.com/viz-palette?colors=%255B%2522#1DABE6%22,%22#1C366A%22,%22#C3CED0%22,%22#E43034%22,%22#FC4E51%22,%22#AF060F%22%5D&backgroundColor=%22white%22&fontColor=%22black%22" target="_blank">Viz Palette</a> tool by Elijah Meeks and Susie Lu for a sense of what's possible. As you become more familiar, you can create custom ggplot themes and color palettes to make this process seamless, but I don't want to get into all of that here, as it can be overwhelming to learn everything at once. </p>
<div class="highlight"><pre><span></span><span class="c1"># Customize the graphs with your company's color palette</span>
p <span class="o"><-</span> p <span class="o">+</span> scale_color_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
labels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Black'</span><span class="p">,</span> <span class="s">'Red'</span><span class="p">,</span> <span class="s">'Gray'</span><span class="p">),</span>
values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#000000'</span><span class="p">,</span> <span class="s">'#EC0108'</span><span class="p">,</span> <span class="s">'#ACAEAD'</span><span class="p">))</span> <span class="o">+</span>
theme_ipsum<span class="p">()</span> <span class="o">+</span>
theme<span class="p">(</span>plot.title <span class="o">=</span> element_text<span class="p">(</span>color <span class="o">=</span> <span class="s">"#EC0108"</span><span class="p">),</span>
plot.caption <span class="o">=</span> element_text<span class="p">(</span>color <span class="o">=</span> <span class="s">"#EC0108"</span><span class="p">,</span> face <span class="o">=</span> <span class="s">'bold'</span><span class="p">))</span>
p
</pre></div>
<p><img src="/figures/add_branding_to_graphs/change_colors-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>Finally, let's add your company's logo to the graph for a complete, branded, and publication-ready graph. Download a moderately high-resolution logo and save it somewhere on your machine. The workhorse here is the <code>grid.raster</code> function, which can render an image on top of a pre-existing image (in this case, your graph). The trick is to get the positioning and sizing right. This can be a bit confusing when you're first starting out with image manipulation, so I'll walk through the parameters one by one:</p>
<ul>
<li><em>x</em>: this controls the x-position of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the left of the graph and 1 represents a position all the way on the right.</li>
<li><em>y</em>: this controls the y-position of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the bottom of the graph and 1 represents a position all the way on the top.</li>
<li><em>just</em>: this is a set of two values, the first corresponding to the horizontal justification and the second corresponding to the vertical justification. With <em>x</em> and <em>y</em> we chose a position on the grid to place the logo. <em>just</em> lets us choose how to justify the image at that location. Here, I've selected 'left' and 'bottom' justification, which means that the left bottom corner of the logo will be placed at the x-y coordinates specified.</li>
<li><em>width</em>: this scales the logo down to a smaller size so it can be placed on the image. Here, I'm scaling the logo down to a 1-inch size, but the size you'll want will depend on the size of the graph you've created. Play around with different sizes until you find one that feels right.</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Add your company's logo to the graph you created</span>
logo <span class="o"><-</span> image_read<span class="p">(</span><span class="s">"~/dev/michaeltoth/output/images/logo.jpg"</span><span class="p">)</span>
p
grid<span class="o">::</span>grid.raster<span class="p">(</span>logo<span class="p">,</span> x <span class="o">=</span> <span class="m">0.07</span><span class="p">,</span> y <span class="o">=</span> <span class="m">0.03</span><span class="p">,</span> just <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'left'</span><span class="p">,</span> <span class="s">'bottom'</span><span class="p">),</span> width <span class="o">=</span> unit<span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="s">'inches'</span><span class="p">))</span>
</pre></div>
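<p>Because <code>grid.raster</code> draws on whatever graphics device is currently open, you can save the combined plot and logo by opening a file device first. This sketch assumes the <code>p</code> and <code>logo</code> objects created above; the filename and dimensions are illustrative:</p>

```r
# Open a PNG device, draw the plot and the logo, then close the device
png("branded_graph.png", width = 800, height = 600)
print(p)
grid::grid.raster(logo, x = 0.07, y = 0.03,
                  just = c('left', 'bottom'),
                  width = grid::unit(1, 'inches'))
dev.off()
```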
<p><img src="/figures/add_branding_to_graphs/add_logo-1.png" title="center" alt="center" style="display: block; margin: auto;" />
</p>
<p>And there we have it: a branded graph that would be suitable for a sales meeting, marketing presentation, or investor deck! </p>
<hr />
<p>Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, <a href="http://eepurl.com/gmYioz">sign up here!</a></p>
<hr />
<p>I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>I'm Blogging Again!2019-01-10T00:00:00-05:00Michael Tothtag:michaeltoth.me,2019-01-10:im-blogging-again.html<p>2018 was a big year for me, filled with life developments, new adventures, and lots of change. Having spent 3 years working at Orchard, a financial technology startup in the online lending space, I was ready for a change. I decided in early 2018 to leave and take some time off, not having a clear sense of what I wanted to do next. This was the first time since graduating from college that I had left a job without a plan, and I was both nervous and excited. Luckily, my fears that this would make me unemployable, cripple my ability to negotiate a salary, and generally ruin my life turned out to be unfounded! I spent four months traveling, working on side projects, exploring New York City, and trying to figure out what I wanted to do next. <a href="https://www.instagram.com/p/BgpNDeAB9be/" target="_blank">I learned to ski</a>, and I got really into it, spending a total of ten days on the slopes between January and March. <a href="https://www.instagram.com/p/Bm1x7Q1BRDl/" target="_blank">I got engaged</a> and will be getting married this year! I also spent more time visiting home, Ohio, than I think I have in total in all other years since I graduated from college. Finally, I was able to devote more time to creating the <a href="http://www.artfuldataprints.com" target="_blank">data-driven art</a> that I sell on Etsy, growing my sales by 3x and my revenues more than 5x from 2017!</p>
<p><img src="../images/maggie_mike_37.jpg"></p>
<p>As I considered what I wanted to do next, I thought back on my career and where I was uniquely qualified to help people. In my time at Orchard, I had grown from a data scientist into a senior data scientist, built out a team of my own, and ultimately became the head of research. I had a broad mandate to dig through Orchard's massive financial dataset to find meaningful insights that might help our clients and to write about them in blog posts, white papers, and newsletters. I enjoyed the work of finding new insights that others had not yet uncovered, and I especially enjoyed the task of visualizing and explaining those insights to others in an accessible way. I became quite good at it, and my writing brought tens of thousands of readers--and potential clients--to Orchard. My writing also gained significant press coverage, with my work covered by the <a href="https://www.ft.com/content/32b08804-e008-11e7-a8a4-0a1e63a52f9c" target="_blank">Financial Times</a>, <a href="https://www.cnbc.com/2018/02/23/why-warren-buffett-is-such-an-influential-leader-according-to-data.html" target="_blank">CNBC</a>,
<a href="https://www.inc.com/minda-zetlin/this-1-mental-habit-helped-make-warren-buffett-a-billionaire.html" target="_blank">Inc</a>, and
<a href="https://www.marketwatch.com/story/warren-buffetts-disarmingly-simple-investment-strategy-explained-by-big-data-2017-06-26" target="_blank">MarketWatch</a>, among others. I was excited to be able to contribute so much to Orchard's brand and market position with my data analysis and writing. Everybody in the industry was familiar with Orchard and the quality work we did, which enabled our sales team to get meetings with nearly any company they wanted.</p>
<p>What surprised me at the time--and still does today--was how few companies were leveraging their data to create content and expand their reach. In 2019, it's hard to find a company that isn't using their data to improve internal decision-making and forecasting. "Big Data", "Machine Learning", and "Artificial Intelligence" are buzzwords that can be heard across boardrooms from Silicon Valley to New York City. Still, the unique datasets possessed by most technology companies today are vastly underutilized when it comes to generating compelling content, building a brand, and establishing credibility. Almost nobody was doing what I had done at Orchard, and nobody was doing it to the extent that I had done it. Those companies that do share insights from their data--<a href="https://medium.com/airbnb-engineering" target="_blank">Airbnb</a>, <a href="https://research.fb.com/blog/" target="_blank">Facebook</a>, <a href="https://medium.com/netflix-techblog" target="_blank">Netflix</a>, <a href="https://multithreaded.stitchfix.com/blog/" target="_blank">Stitch Fix</a>, <a href="https://eng.uber.com/" target="_blank">Uber</a>--usually do so through an engineering blog, with highly technical content aimed more at finding new employees than finding new customers. There's nothing wrong with that, of course, but companies should also be using their data to create compelling content for their <strong>customers</strong> that helps build awareness, credibility, and trust. This idea is already well established at large financial companies, who often provide market research for free with the goal of building client trust, gaining credibility, establishing relationships, and making more sales. In technology, however, where companies can generate even more impact given the unique datasets at their disposal, almost nobody is doing this! 
I knew that this was too big an opportunity to let pass, and I knew that I was the perfect person to help companies better leverage their data in this way. I created a company, <a href="https://www.michaeltothanalytics.com" target="_blank">Michael Toth Analytics</a>, and started working on this idea full time in October. It's still early days, but I'm getting positive responses and am optimistic about the new year.</p>
<h3>The State of This Blog in 2019</h3>
<p>Working as a consultant means that I don't have many long-term projects like I would at a typical job. In some ways, this is nice, as it means I don't end up maintaining projects that have grown boring and uninteresting to me. But there are downsides as well. I like the feeling of accomplishment that comes with building something up over time and seeing it evolve. With that in mind, I've decided to start blogging more frequently again. In particular, one of my New Year's Resolutions for 2019 is to blog at least biweekly, for a total of 26 blog posts in 2019. I credit much of my professional success in recent years to work directly related to this blog. It helped me secure my first data science job at Orchard, moving from a financial analytics role at BlackRock where I had spent four years. I knew that data science was a competitive field to break into, and I was worried about my lack of formal training. I had self-studied data science through Coursera and similar sites, but that was it. So, before applying to Orchard, I took the time to research and write a <a href="https://michaeltoth.me/analyzing-historical-default-rates-of-lending-club-notes.html" target="_blank">blog post</a> using publicly available data from Lending Club, then the largest company in Orchard's industry. As a direct result of that blog post, my resume floated to the top of their pile, my interviews felt largely like a formality, as they already knew I was capable of the exact work they wanted me to do, and I ended up securing a position. This was a huge success for a relatively small time investment on my part.</p>
<p>This is a brief aside, but I <strong>highly</strong> recommend this approach to aspiring data scientists, or to anybody looking to make a move in their career. If you can demonstrate to your target company that you're skilled, competent, and hard-working ahead of time, you will be light-years ahead of your competition. Look--I know that applying for jobs is daunting. A quick LinkedIn search for data scientist positions showed me jobs that had been posted for two days that already had hundreds of applicants. Having been on the hiring side, I can also confirm that these figures are true. But they're also incredibly misleading. Jobs I've hired for have indeed received hundreds of applications, but the vast majority of those applications were immediately filtered out. Probably 5%-10% of the people who applied for my jobs demonstrated that they had relevant skills to do the work. Probably only 1%-2% had any specific domain expertise relevant to the position. For a job that received 500 applicants, I might have 5-10 people who seemed like qualified applicants. Imagine if you were a qualified applicant and had also written a blog post about the very challenges that company is facing, demonstrating your commitment, your work ethic, and your expertise! The worst thing that could happen is you spend a few hours working on a blog post, learn a few things, and don't ultimately get the job. The best thing that could happen is you get the exact job you're looking for, and you come into the company looking like the expert that you are, possibly commanding a significantly higher salary than you would otherwise! Almost nobody does this, but it's such a huge game changer that I cannot possibly recommend this approach more highly. Now, moving on!</p>
<p>This blog has also helped me to establish my presence online. In 2017, I wrote a <a href="https://michaeltoth.me/sentiment-analysis-of-warren-buffetts-letters-to-shareholders.html" target="_blank">sentiment analysis of Warren Buffett's Letters to Shareholders</a> that went viral and earned me significant press coverage, which was pretty exciting! I've also been able to grow my following on <a href="https://www.twitter.com/Michael_Toth" target="_blank">Twitter</a>, with over 4000 like-minded professional contacts following me there. Additionally, the blog has allowed me to explore new data science and data visualization techniques that have significantly improved my abilities over the years. All I need to do to confirm this is look at some of my earlier posts to see how far I've come.</p>
<p>Anyway, all this to say, I've seen enormous benefits over the years as a direct result of the work that I've done on this blog. Still, sitting down and writing a blog post that may or may not gain any traction has always been a challenge. I obsess over each post, wanting it to be perfect, which ends up taking a significant chunk of time. To address that, I want to force myself to blog more frequently to remove some of the pressure I feel for any particular post to be perfect. I'm reminded of the story of the ceramics teacher from the book Art & Fear by David Bayles & Ted Orland: </p>
<blockquote>
<p>The ceramics teacher announced on opening day that he was dividing the class into two groups. All those on the left side of the studio, he said, would be graded solely on the quantity of work they produced, all those on the right solely on its quality.</p>
<p>His procedure was simple: on the final day of class he would bring in his bathroom scales and weigh the work of the “quantity” group: fifty pounds of pots rated an “A”, forty pounds a “B”, and so on. Those being graded on “quality”, however, needed to produce only one pot – albeit a perfect one – to get an “A”.</p>
<p>Well, came grading time and a curious fact emerged: the works of highest quality were all produced by the group being graded for quantity. It seems that while the “quantity” group was busily churning out piles of work – and learning from their mistakes – the “quality” group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay.</p>
</blockquote>
<p>When you produce a large body of work, some of the work you create will be great, and some will be poor. But over time you develop your skills and your work improves. I think my goal of biweekly blog posts will help me to overcome the fear of putting work out there and hopefully result in a significant improvement to my abilities over the course of the year. We'll see!</p>
<p>While this blog has always been and will continue to be primarily a data science and data visualization blog, I also intend to write some different content this year, similar to today's post. There are two reasons for this. First, I feel that only posting data analysis is unnecessarily constraining. Second, I think that I have relied on data science blog posts as a bit of a crutch, allowing me to put myself out there without really exposing myself to any vulnerability or potential criticism. As long as I'm presenting facts based on data analysis, it feels that there is not much room for criticism. This type of post allows me to express my thoughts and opinions, be vulnerable, and become more comfortable putting myself and my ideas into the world. This is a skill I definitely want to develop, and I hope to continue this both here and on Twitter in 2019. </p>
<p>This blog post has gotten quite long, so I will end here. While 2018 was a big year for me, I'm looking forward to 2019, and I think that it might be even better! I'll be publishing here every two weeks, and you can follow me <a href="https://www.twitter.com/Michael_Toth" target="_blank">on Twitter</a> if you're interested in getting notified when a new one is available!</p>
<p>If you enjoy the work I do and are interested in working together, you can visit my <a href="https://www.michaeltothanalytics.com" target="_blank">consulting website</a> or contact me at <a href="mailto:michael@michaeltothanalytics.com">michael@michaeltothanalytics.com</a>!</p>How to Write Pelican Blog Posts using RMarkdown & Knitr, 2.02017-06-14T00:00:00-04:00Michael Tothtag:michaeltoth.me,2017-06-14:how-to-write-pelican-blog-posts-using-rmarkdown-knitr-20.html<p><a href="https://michaeltoth.me/how-to-write-pelican-blog-posts-using-rmarkdown-knitr.html">Back in January</a> I wrote a post discussing how to get RMarkdown and Pelican to work together to make the R-analysis-to-blog-post workflow a bit easier. While I had high hopes, I was never really happy with the setup I put together then, so I set out to update it. </p>
<p>In this post I'm going to talk about my new, improved way of publishing Pelican blog posts using RMarkdown. I'm assuming you already have a Pelican blog set up, so I won't be covering that in today's post. If you're interested but haven't yet set up a blog for yourself, it's quite straightforward! I recommend checking out these links:</p>
<ul>
<li><a href="http://docs.getpelican.com/en/stable/quickstart.html">Official Pelican Guide</a></li>
<li><a href="http://duncanlock.net/blog/2013/05/17/how-i-built-this-website-using-pelican-part-1-setup/">Detailed Pelican Setup by Duncan Lock</a></li>
</ul>
<h3>Issue with Old Setup</h3>
<p>The setup I recommended in my prior post used a Pelican plugin called rmd_reader to convert .Rmd files to standard .md files that Pelican could read. For taking a static .Rmd post and creating a published post, this worked pretty well. One of my favorite things about my Pelican setup, though, was using the development server feature. Basically, this runs a web server locally, monitoring your content directory for any changes, and automatically regenerates your site whenever it finds a change. This feature did not play nice with the rmd_reader plugin. When you start the development server, rmd_reader starts converting any .rmd files to .md files. This action would trigger the development server to restart, as it identified changes in the content directory, and you'd find yourself stuck in an infinite loop of regeneration. Admittedly, it's a minor issue, and I probably could have hacked together a solution, but I didn't want to make changes to the base Pelican code or the rmd_reader code, because I wanted this to be portable to other systems. So in the end, I decided I needed another solution. </p>
<h3>New Solution & Setup</h3>
<p>The challenge is to find a way to run your .Rmd code, producing any desired figures and code chunks, then store the results in a .md file that is readable by Pelican. I remembered having read about how David Robinson built <a href="http://varianceexplained.org/">his site</a> using a <a href="https://github.com/dgrtwo/dgrtwo.github.com/blob/master/_scripts/knitpages.R">custom R script</a> to convert each .Rmd file to a .md file using knitr commands, and I set out to see if I could modify that for my purposes. </p>
<p>Below is the final knitpages.R file I'm using, having made a few minor changes to David's file, which was optimized for Jekyll blogs:</p>
<div class="highlight"><pre><span></span><span class="c1">#!/usr/local/bin/Rscript --vanilla</span>
<span class="c1"># compiles all .Rmd files in _R directory into .md files in blog directory,</span>
<span class="c1"># if the input file is older than the output file.</span>
<span class="c1"># run ./knitpages.R to update all knitr files that need to be updated.</span>
<span class="c1"># run this script from your base content directory</span>
<span class="kn">library</span><span class="p">(</span>knitr<span class="p">)</span>
KnitPost <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>input<span class="p">,</span> outfile<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">,</span> base.url<span class="o">=</span><span class="s">"/"</span><span class="p">)</span> <span class="p">{</span>
    opts_knit<span class="o">$</span>set<span class="p">(</span>base.url <span class="o">=</span> base.url<span class="p">)</span>
    fig.path <span class="o"><-</span> <span class="kp">paste0</span><span class="p">(</span>figsfolder<span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>input<span class="p">)),</span> <span class="s">"/"</span><span class="p">)</span>
    cache.path <span class="o"><-</span> <span class="kp">file.path</span><span class="p">(</span>cachefolder<span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>input<span class="p">)),</span> <span class="s">"/"</span><span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>fig.path <span class="o">=</span> fig.path<span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>cache.path <span class="o">=</span> cache.path<span class="p">)</span>
    opts_chunk<span class="o">$</span>set<span class="p">(</span>fig.cap <span class="o">=</span> <span class="s">"center"</span><span class="p">)</span>
    render_markdown<span class="p">()</span>
    knit<span class="p">(</span>input<span class="p">,</span> outfile<span class="p">,</span> envir <span class="o">=</span> <span class="kp">parent.frame</span><span class="p">())</span>
<span class="p">}</span>
knit_folder <span class="o"><-</span> <span class="kr">function</span><span class="p">(</span>infolder<span class="p">,</span> outfolder<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">,</span> force <span class="o">=</span> <span class="bp">F</span><span class="p">)</span> <span class="p">{</span>
    <span class="kr">for</span> <span class="p">(</span>infile <span class="kr">in</span> <span class="kp">list.files</span><span class="p">(</span>infolder<span class="p">,</span> pattern <span class="o">=</span> <span class="s">"*.Rmd"</span><span class="p">,</span> full.names <span class="o">=</span> <span class="kc">TRUE</span><span class="p">))</span> <span class="p">{</span>
        <span class="kp">print</span><span class="p">(</span>infile<span class="p">)</span>
        outfile <span class="o">=</span> <span class="kp">paste0</span><span class="p">(</span>outfolder<span class="p">,</span> <span class="s">"/"</span><span class="p">,</span> <span class="kp">sub</span><span class="p">(</span><span class="s">".Rmd$"</span><span class="p">,</span> <span class="s">".md"</span><span class="p">,</span> <span class="kp">basename</span><span class="p">(</span>infile<span class="p">)))</span>
        <span class="kp">print</span><span class="p">(</span>outfile<span class="p">)</span>
        <span class="c1"># knit only if the output is missing or the input has been modified more recently</span>
        <span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">file.exists</span><span class="p">(</span>outfile<span class="p">)</span> <span class="o">|</span> <span class="kp">file.info</span><span class="p">(</span>infile<span class="p">)</span><span class="o">$</span>mtime <span class="o">></span> <span class="kp">file.info</span><span class="p">(</span>outfile<span class="p">)</span><span class="o">$</span>mtime<span class="p">)</span> <span class="p">{</span>
            KnitPost<span class="p">(</span>infile<span class="p">,</span> outfile<span class="p">,</span> figsfolder<span class="p">,</span> cachefolder<span class="p">)</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
knit_folder<span class="p">(</span><span class="s">"_R"</span><span class="p">,</span> <span class="s">"blog"</span><span class="p">,</span> <span class="s">"figures/"</span><span class="p">,</span> <span class="s">"_caches"</span><span class="p">)</span>
</pre></div>
<p>The only real change to David's script is updating <code>render_jekyll()</code> to <code>render_markdown()</code>. I also had to change the path to the Rscript executable (first line), which you may need to do based on your OS. Run <code>which Rscript</code> from your terminal to find the correct path.</p>
<p>You should modify the knit_folder command at the bottom to reflect your own blog's directory structure. Here's how this script works:</p>
<ol>
<li>The script finds all .Rmd files in your infolder, ignoring old and unchanged files.</li>
<li>New and updated files are passed to the KnitPost function, which runs the .Rmd file, saving any generated images to the figsfolder directory and storing any cached data to the cachefolder.</li>
<li>An output .md file is created in the outfolder directory with links to any figures generated by R.</li>
</ol>
<h3>New Process</h3>
<p>Here's my updated process for publishing today:</p>
<ol>
<li>From the content directory of my blog, I run knitpages.R, which converts any new or updated .Rmd files to Pelican-readable .md files.</li>
<li>Next I generate my Pelican site locally.</li>
<li>When I'm satisfied with the results, I can easily push to my web server and run <code>make publish</code> from there to generate my site.</li>
</ol>
<p>I like this solution because it's a bit cleaner and requires less overhead than the one I wrote about previously. I've noticed that, personally, one of the biggest issues preventing me from publishing more frequently has been friction in the publishing process. This solution goes a long way toward solving that, and I hope it helps me increase my frequency of publishing. I'd like to extend a big thanks to David for sharing his knitpages.R code and examples on his blog; it made the process of setting this up so much easier!</p>Sentiment Analysis of Warren Buffett's Letters to Shareholders2017-03-20T00:00:00-04:00Michael Tothtag:michaeltoth.me,2017-03-20:sentiment-analysis-of-warren-buffetts-letters-to-shareholders.html<p>Last week, I was reading through Warren Buffett's most recent letter to Berkshire Hathaway shareholders. Every year, he writes a letter that he makes <a href="http://www.berkshirehathaway.com/letters/letters.html">publicly available</a> on the Berkshire Hathaway website. In the letters he talks about the performance of Berkshire Hathaway and their portfolio of businesses and investments. But he also talks about his views on business, the market, and investing more generally, and it's after this wisdom that many investors, including me, read what he has to say. </p>
<p>In many ways Warren Buffett's letters are atypical. When most companies report their financial performance, they fill their reports with dense, technical language designed to obscure and confuse. Mr. Buffett does not follow this approach. His letters are written in easily understandable language, because he wants them to be accessible to everybody. Warren Buffett is not often swayed by what others are doing. He goes his own way, and that has been a source of incredible strength. In annually compounded returns, Berkshire stock has gained 20.8% since 1965, while the S&P 500 as a whole has gained only 9.7% over the same period. To highlight how truly astounding this performance is, one dollar invested in the S&P in 1965 would have grown to $112.34 by the end of 2016, while the same dollar invested in Berkshire stock would have grown to the massive sum of $15,325.46!</p>
<p>I've been reading the annual Berkshire letters when they come out for the last few years. One day I'll sit down and read through all of them, but I haven't gotten around to it yet. But while I was reading through his most recent letter last week, I got to thinking. I wondered whether there are any trends in his letters over time, or how strongly his writings are influenced by external market factors. I decided I could probably answer some of these questions through a high-level analysis of the text in his letters, which brings me to the subject of this blog post. </p>
<p>In this post I'm going to be performing a sentiment analysis of the text of Warren Buffett's letters to shareholders from 1977 - 2016. A <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a> is a method of identifying and quantifying the overall sentiment of a particular set of text. Sentiment analysis has many use cases, but a common one is to determine how positive or negative a particular text document is, which is what I'll be doing here. For this, I'll be using <a href="https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">bing sentiment analysis</a>, developed by <a href="https://www.cs.uic.edu/~liub/">Bing Liu</a> of the University of Illinois at Chicago. For this type of sentiment analysis, you first split a text document into a set of distinct words, then determine for each word whether it is positive, negative, or neutral. </p>
<p>In the graph below, I show something called the 'Net Sentiment Ratio' for each of Warren Buffett's letters, beginning in 1977 and ending with 2016. The net sentiment ratio measures how positive or negative a particular text is. I'm defining the net sentiment ratio as: </p>
<p>(Number of Positive Words - Number of Negative Words) / (Number of Total Words)</p>
<p><img src="/figures/berkshire_hathaway_sentiment/plotting-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
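<p>As a rough illustration of how a ratio like this can be computed in R, here's a minimal sketch using the <code>tidytext</code> package and the bing lexicon. The variable names and the toy text are my own invention for the example, not taken from the actual analysis:</p>
<div class="highlight"><pre>
library(dplyr)
library(tidytext)

# Toy stand-in for the text of one letter
letter <- tibble(year = 2016,
                 text = "We had an excellent year, despite some difficult losses.")

letter %>%
  unnest_tokens(word, text) %>%                       # one row per word
  left_join(get_sentiments("bing"), by = "word") %>%  # tag positive/negative words
  summarize(net_sentiment =
    (sum(sentiment == "positive", na.rm = TRUE) -
     sum(sentiment == "negative", na.rm = TRUE)) / n())
</pre></div>
<p>Applied to a full letter, the same pipeline yields one net sentiment value per year, which is what the graph above plots.</p>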
<p>The results here show that overall, Warren Buffett's letters have been positive. Over the forty years of letters I'm analyzing here, only 5 show a negative net sentiment score. The five years that do show negative net sentiment scores are closely tied with major negative economic events:</p>
<ul>
<li><strong>1987</strong>: The market crash that happened on October 19th, 1987 (Black Monday) is widely known as the largest single-day percentage decline ever experienced for the Dow-Jones Industrial Average, 22.61% in one day.</li>
<li><strong>1990</strong>: The recession of 1990, triggered by an oil price shock following Iraq's invasion of Kuwait, resulted in a notable increase in unemployment.</li>
<li><strong>2001</strong>: Following the 1990s, which represented the longest period of growth in American history, 2001 saw the collapse of the dot-com bubble and associated declining market values, as well as the September 11th attacks.</li>
<li><strong>2002</strong>: The market, already falling in 2001, continued to see declines throughout much of 2002.</li>
<li><strong>2008</strong>: The Great Recession was a large worldwide economic recession, characterized by the International Monetary Fund as the worst global recession since before World War II. Other related events during this period included the financial crisis of 2007-2008 and the subprime mortgage crisis of 2007-2009.</li>
</ul>
<p>Another interesting topic to examine is which words were actually the strongest contributors to the positive and negative sentiment in the letters. For this exercise, I analyzed the letters as one single text, and present the most common positive and negative words in the graph below.</p>
<p><img src="/figures/berkshire_hathaway_sentiment/sentiment_list-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
<p>The results here are interesting. Many of the most common words--'gain', 'gains', 'loss', 'losses', 'worth', 'liability', and 'debt'--are what we'd expect given the financial nature of these documents. I find the adjectives that make their way into this set particularly interesting, as they give insight into the way Warren Buffett thinks. On the positive side we have 'significant', 'outstanding', 'excellent', 'extraordinary', and 'competitive'. On the negative side there are 'negative', 'unusual', 'difficult', and 'bad'. One interesting inclusion that shows some of the limitations of sentiment analysis is 'casualty', where Mr. Buffett is not referring to death, but to the basket of property and casualty insurance companies that make up a significant portion of his business holdings. </p>
<p>While the above is interesting, and helps us to highlight the most frequent positive and negative words, it's a bit limited in the number of words we can present before the graph becomes too crowded. To see a larger number of words, we can use a word cloud. The word cloud below shows 400 of the most commonly used words, split by positive and negative sentiment. </p>
<p><img src="/figures/berkshire_hathaway_sentiment/wordcloud-1.png" title="center" alt="center" style="display: block; margin: auto;" /></p>
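<p>If you'd like to produce a similar graphic yourself, a comparison cloud can be built with the <code>wordcloud</code> package. This is just a sketch: the word counts below are invented placeholders, and the column order (and therefore the color assignment) follows the order in which the sentiments first appear in the data:</p>
<div class="highlight"><pre>
library(dplyr)
library(tidyr)
library(wordcloud)

# Invented example counts; in the real analysis these come from the letters
word_counts <- tibble(
  word      = c("gain", "outstanding", "loss", "difficult"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(120, 45, 90, 30)
)

word_counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  tibble::column_to_rownames("word") %>%   # comparison.cloud wants terms as rownames
  as.matrix() %>%
  comparison.cloud(max.words = 400,        # colors match column order: positive, negative
                   colors = c("darkgreen", "red"))
</pre></div>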
<p>If you're interested in reproducing this blog post or analysis, please check out the <a href="https://github.com/michaeltoth/michaeltoth/blob/master/content/_R/berkshire_hathaway_sentiment.Rmd">R code I used to produce this document</a>.</p>How to Write Pelican Blog Posts using RMarkdown & Knitr2017-01-05T00:00:00-05:00Michael Tothtag:michaeltoth.me,2017-01-05:how-to-write-pelican-blog-posts-using-rmarkdown-knitr.html<p>UPDATE: This is no longer my preferred method for syncing RMarkdown analyses with my blog. Please check out my <a href="https://michaeltoth.me/how-to-write-pelican-blog-posts-using-rmarkdown-knitr-20.html">new post</a>.</p>
<p>In this post I'm going to be talking about how to easily modify your Pelican blog configuration to let you directly publish blog posts using RMarkdown. I'm assuming you already have a Pelican blog set up, so I won't be covering that in today's post. If you're interested but haven't yet set up a blog for yourself, it's quite straightforward! I recommend checking out these links:</p>
<ul>
<li><a href="http://docs.getpelican.com/en/stable/quickstart.html">Official Pelican Guide</a></li>
<li><a href="http://duncanlock.net/blog/2013/05/17/how-i-built-this-website-using-pelican-part-1-setup/">Detailed Pelican Setup by Duncan Lock</a></li>
</ul>
<p>Until now, I've been writing posts on this blog using standard markdown. This means I'd do an analysis in R, produce a series of graphs and results that I would store locally in image files, and put it all together on my own in a markdown document. It's not that bad a process, but it is a bit inefficient, and I wanted to see if there was a better way. Luckily, there's a very easy-to-use Pelican plugin called rmd_reader that will automatically convert any RMarkdown posts you have into Pelican-compliant html documents. In figuring out how to set this up, I drew heavily on these resources:</p>
<ul>
<li><a href="https://github.com/getpelican/pelican-plugins/tree/master/rmd_reader">rmd_reader Plugin on Github</a></li>
<li><a href="https://rjweiss.github.io/articles/2014_08_25/testing-rmarkdown-integration/">rmd_reader Setup Tutorial by Rebecca Weiss</a>
<br><br></li>
</ul>
<h3>Setup Instructions</h3>
<p>First, let's install the RMD Reader extension so that Pelican knows what to do. We'll do this by cloning the pelican-plugins github repository and referencing this in our Pelican configuration file. This has the added benefit of allowing you to easily use other Pelican plugins, should you decide you want to do that.</p>
<p>Execute the following command from the directory where you want to store this repository.<br />
<em>(Run from terminal):</em></p>
<div class="highlight"><pre><span></span>git clone --recursive https://github.com/getpelican/pelican-plugins
</pre></div>
<p>Add the following to your Pelican config file. If you already have these variables defined, simply add the new path and plugin to the end of your existing list.<br />
<em>(Edit pelicanconf.py):</em></p>
<div class="highlight"><pre><span></span><span class="n">PLUGIN_PATHS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'your-path-to/pelican-plugins'</span><span class="p">]</span>
<span class="n">PLUGINS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'rmd_reader'</span><span class="p">]</span>
</pre></div>
<p>Make sure you have the rpy2 python package installed.<br />
<em>(Run from terminal):</em></p>
<div class="highlight"><pre><span></span><span class="n">pip</span> <span class="n">install</span> <span class="n">rpy2</span>
</pre></div>
<p>Also make sure you have the knitr R package installed.<br />
<em>(Run from R):</em></p>
<div class="highlight"><pre><span></span>install.packages<span class="p">(</span><span class="s">'knitr'</span><span class="p">)</span>
</pre></div>
<p><br></p>
<h3>Additional Setup</h3>
<p>The above is the core setup, but there are a few more tweaks that I recommend you do in order to make your life easier down the road.</p>
<p>Add the following to your Pelican config file. Essentially what we're doing here is giving knitr instructions on how to name & where to store image files to reduce the likelihood of you having conflicts and overwriting files from older blog posts. There are several ways to do this, but this seemed the best solution to me. For further details, check out the <a href="https://github.com/getpelican/pelican-plugins/tree/master/rmd_reader">official rmd_reader documentation</a>.<br />
<em>(Edit pelicanconf.py):</em></p>
<div class="highlight"><pre><span></span><span class="n">STATIC_PATHS</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'figure'</span><span class="p">]</span>
<span class="n">RMD_READER_RENAME_PLOT</span> <span class="o">=</span> <span class="s1">'directory'</span>
<span class="n">RMD_READER_KNITR_OPTS_CHUNK</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fig.path'</span><span class="p">:</span> <span class="s1">'figure/'</span><span class="p">}</span>
</pre></div>
<p><br></p>
<h3>Testing & Examples</h3>
<p>Finally, we're ready to test out our new setup. Try this out with your own .Rmd document or use this one, <a href="https://www.github.com/michaeltoth/michaeltoth/blob/master/content/_R/pelican_rmarkdown_setup.Rmd">available on my Github</a>, if you're just looking for a quick test. The steps are relatively simple:</p>
<ol>
<li>Save your .Rmd file into the same content folder where you'd put any other .md file for your Pelican blog</li>
<li>Run your Pelican blog like you would normally. </li>
</ol>
<p>That's it. rmd_reader will automatically execute your .Rmd file, produce the relevant graphics, and set up the html for your blog just like base Pelican would.</p>
<p>Just to confirm everything is working correctly, let's do some basic operations on the iris dataset.</p>
<p>First let's see a simple summary of the data:</p>
<div class="highlight"><pre><span></span><span class="kp">summary</span><span class="p">(</span>iris<span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span>## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
</pre></div>
<p>Let's finish with a simple k-means cluster analysis:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>broom<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
iris_sub <span class="o"><-</span> select<span class="p">(</span>iris<span class="p">,</span> x1 <span class="o">=</span> Petal.Length<span class="p">,</span> x2 <span class="o">=</span> Petal.Width<span class="p">)</span>
kclusts <span class="o"><-</span> <span class="kt">data.frame</span><span class="p">(</span>k<span class="o">=</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">)</span> <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>kclust<span class="o">=</span>kmeans<span class="p">(</span>iris_sub<span class="p">,</span> <span class="m">.</span><span class="o">$</span>k<span class="p">))</span>
clusters <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>tidy<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]]))</span>
assignments <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>augment<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]],</span> iris_sub<span class="p">))</span>
clusterings <span class="o"><-</span> kclusts <span class="o">%>%</span> group_by<span class="p">(</span>k<span class="p">)</span> <span class="o">%>%</span> do<span class="p">(</span>glance<span class="p">(</span><span class="m">.</span><span class="o">$</span>kclust<span class="p">[[</span><span class="m">1</span><span class="p">]]))</span>
ggplot<span class="p">(</span>assignments<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> x1<span class="p">,</span> y <span class="o">=</span> x2<span class="p">))</span> <span class="o">+</span>
facet_wrap<span class="p">(</span><span class="o">~</span> k<span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color<span class="o">=</span><span class="m">.</span>cluster<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>data<span class="o">=</span>clusters<span class="p">,</span> size<span class="o">=</span><span class="m">10</span><span class="p">,</span> shape<span class="o">=</span><span class="s">"x"</span><span class="p">)</span>
</pre></div>
<p><img alt="center" src="/figures/pelican_rmarkdown_setup/iris-plot-1.png" /></p>
<h3>Closing Remarks</h3>
<p>That's it! I've been meaning to get this set up for a while, and I'm pretty excited about it. Since most of my blog posts are R analyses, this is going to really simplify my workflow, which should make it much easier for me to actually finalize and post my results, something I've had issues with before. I'm also glad I'll be able to make greater use of R Markdown/Knitr, which will help me to organize my thoughts while analyzing as well as create reproducible research documents to share. I hope you find this useful as well!</p>Popularity of Baby Names Since 18802016-11-20T12:00:00-05:00Michael Tothtag:michaeltoth.me,2016-11-20:popularity-of-baby-names-since-1880.html<p><a href="https://michaeltoth.me/installing-and-running-shiny-server-from-source-on-32-bit-ubuntu.html">A while back</a> I spent some time figuring out how to serve interactive shiny apps through my website, but I haven't had a chance to build anything until recently. I set out to create a few simple shiny apps in R that I could use as a sort of test run, and I'm writing those up here.</p>
<p>In this post I'm going to be analyzing some open data provided by the Social Security Administration on the popularity of baby names over the years--specifically, since 1880. Data comes from the <a href="https://www.ssa.gov/oact/babynames/limits.html">Social Security Administration</a> </p>
<p>In the application below, you can view the 10 most popular baby names for any year since 1880, either for males or females. You can also click the play icon directly below the year slider to view an animated history of the most common names.
<br>
<br></p>
<iframe src="http://www.michaeltoth.me/shiny/census_names/top10/" style="border: none; width: 100%; height: 400px"></iframe>
<p>In the next application, you can enter any name, and the graph will display how the popularity of that name has changed over time. Be sure to also select whether the name is for males or females, or you'll likely see some unexpected results!
<br>
<br></p>
<iframe src="http://www.michaeltoth.me/shiny/census_names/tracer/" style="border: none; width: 100%; height: 400px"></iframe>
<p>After building the shiny applications above, I got interested in whether I could identify any meaningful trends over time in the data. I wanted to see whether the concentration (the proportion of babies born with a given name) of the most popular names was relatively static over time, or whether this fluctuated. I was also interested in finding trends in the number of babies born with each of the most popular names. To investigate these, I used a subset of the original data, grabbing the 10 most common male and female names for each year since 1880. I went through several iterations of how best to display the data, and ultimately arrived at the graph below, which I quite like. </p>
<p>I was excited that this project gave me an opportunity to make use of David Robinson's <a href="https://github.com/dgrtwo/gganimate">gganimate package</a>, which I must regrettably admit I hadn't had a chance to experiment with previously. For those unfamiliar, this package makes it incredibly easy to create animated ggplot graphs, and it's awesome!</p>
<p>I wanted to create some kind of trailing visualization to make it clear how patterns and trends were changing over time. The implementation here was adapted from Thomas Pedersen's <a href="https://gist.github.com/thomasp85/c8e22be4628e4420d4f66bcc6c88ac87">example</a> which he used to produce <a href="https://twitter.com/thomasp85/status/694905779539812352">this image</a>. </p>
<p>In the graph below we can use the trailing effect to easily identify trends that occur over a series of years. The grey background data also helps us to visualize how any given year compares with the overall history. I see 5 key periods present themselves in the data:</p>
<ul>
<li><strong>Early Years (1880 - 1910)</strong>: This period is characterized by a low number of babies born (due to the much lower population at this time) as well as a high concentration of the most popular names, in some cases reaching almost 10%. This means that the most common names were being used by a very high percentage of the population during this period. Toward the end of this time period, we begin to see declines in the concentration statistic.</li>
<li><strong>World War I Years (1910 - 1920)</strong>: In this period we see an explosion in the number of babies born with the most popular names. We don't see much change in the concentration of names over this period.</li>
<li><strong>Interwar Period (1921 - 1940)</strong>: In this period both the number and the concentration of births are remarkably consistent, with almost no changes on a year-over-year basis.</li>
<li><strong>WWII & Baby Boom (1941 - 1957)</strong>: In this period we again see a huge increase in the number of babies born, corresponding to the well-known baby boom that occurred in the post-WWII years. We actually see this increase beginning during the war, in 1941. We also see a slight decrease in the concentration of the most popular baby names during this period.</li>
<li><strong>"Modern" Era (1958 - 2015)</strong>: This period is characterized by a steady decrease in both the concentration of the most popular names and in the number of babies born with those names. Though overall birth rates did begin to decline in recent years, this change is not driven primarily by a decline in birth rates, but rather by an increasing evenness in the popularity of names, which means that the most popular names make up a much smaller percentage of overall births.
<img alt="Yearly Birth Names with Ten Year Trails" src="./images/yearly-birth-names-with-trails.gif" /></li>
</ul>
<p>The code for this image is available below:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>dplyr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>ggplot2<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>gganimate<span class="p">)</span> <span class="c1">#devtools::install_github("dgrtwo/gganimate")</span>
<span class="kn">library</span><span class="p">(</span>readr<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>scales<span class="p">)</span>
<span class="c1"># Load pre-processed data. For additional details check Github below</span>
top_10_each_year <span class="o"><-</span> read_csv<span class="p">(</span><span class="s">'input/top_10_each_year.csv'</span><span class="p">)</span>
<span class="c1"># Create fading animation effect by replicating the data frame and adding an exponentially decaying fade parameter to previous years</span>
anim <span class="o"><-</span> <span class="kp">lapply</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span> <span class="kr">function</span><span class="p">(</span>i<span class="p">)</span> <span class="p">{</span>top_10_each_year<span class="o">$</span>year <span class="o"><-</span> top_10_each_year<span class="o">$</span>year <span class="o">+</span> i<span class="p">;</span> top_10_each_year<span class="o">$</span>fade <span class="o"><-</span> <span class="m">1</span> <span class="o">/</span> <span class="p">(</span>i <span class="o">+</span> <span class="m">2</span><span class="p">);</span> top_10_each_year<span class="p">})</span>
top_10_each_year<span class="o">$</span>fade <span class="o"><-</span> <span class="m">1</span>
top_10_with_fade <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>top_10_each_year<span class="p">,</span> <span class="kp">do.call</span><span class="p">(</span><span class="kp">rbind</span><span class="p">,</span> anim<span class="p">))</span>
top_10_with_fade <span class="o"><-</span> filter<span class="p">(</span>top_10_with_fade<span class="p">,</span> year <span class="o"><=</span> <span class="m">2015</span><span class="p">)</span>
p <span class="o"><-</span> ggplot<span class="p">(</span>top_10_with_fade<span class="p">,</span> aes<span class="p">(</span>x <span class="o">=</span> proportion<span class="p">,</span> y <span class="o">=</span> count<span class="p">))</span> <span class="o">+</span>
geom_point<span class="p">(</span>color <span class="o">=</span> <span class="s">'#e6e6e6'</span><span class="p">,</span> size <span class="o">=</span> <span class="m">4</span><span class="p">)</span> <span class="o">+</span>
geom_point<span class="p">(</span>aes<span class="p">(</span>color <span class="o">=</span> sex<span class="p">,</span> frame <span class="o">=</span> year<span class="p">,</span> alpha <span class="o">=</span> fade<span class="p">),</span> size <span class="o">=</span> <span class="m">4</span><span class="p">)</span> <span class="o">+</span>
ggtitle<span class="p">(</span><span class="s">'Top 10 Male & Female Baby Names\nYear:'</span><span class="p">)</span> <span class="o">+</span>
xlab<span class="p">(</span><span class="s">'\nProportion (by sex) Born with Name'</span><span class="p">)</span> <span class="o">+</span>
ylab<span class="p">(</span><span class="s">'Number Born with Name'</span><span class="p">)</span> <span class="o">+</span>
scale_color_manual<span class="p">(</span>name <span class="o">=</span> <span class="s">''</span><span class="p">,</span> values <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'#ff7f00'</span><span class="p">,</span> <span class="s">'#377eb8'</span><span class="p">),</span> labels <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">'Female'</span><span class="p">,</span> <span class="s">'Male'</span><span class="p">))</span> <span class="o">+</span>
scale_x_continuous<span class="p">(</span>labels <span class="o">=</span> percent<span class="p">)</span> <span class="o">+</span>
scale_y_continuous<span class="p">(</span>labels <span class="o">=</span> comma<span class="p">)</span> <span class="o">+</span>
scale_alpha<span class="p">(</span>guide <span class="o">=</span> <span class="s">'none'</span><span class="p">)</span> <span class="o">+</span> <span class="c1"># Remove alpha legend from plot output</span>
theme_bw<span class="p">()</span> <span class="o">+</span>
theme<span class="p">(</span>panel.border <span class="o">=</span> element_blank<span class="p">(),</span>
panel.grid <span class="o">=</span> element_blank<span class="p">(),</span>
axis.ticks <span class="o">=</span> element_blank<span class="p">(),</span>
legend.key <span class="o">=</span> element_blank<span class="p">(),</span>
legend.position <span class="o">=</span> <span class="s">'bottom'</span><span class="p">,</span>
axis.text <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">14</span><span class="p">),</span>
axis.title <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">16</span><span class="p">),</span>
legend.text <span class="o">=</span> element_text<span class="p">(</span>size <span class="o">=</span> <span class="m">12</span><span class="p">))</span>
gg_animate<span class="p">(</span>p<span class="p">,</span> filename <span class="o">=</span> <span class="s">'yearly-birth-names-with-trails.gif'</span><span class="p">,</span> interval <span class="o">=</span> <span class="m">0.2</span><span class="p">,</span> ani.width <span class="o">=</span> <span class="m">800</span><span class="p">,</span> ani.height <span class="o">=</span> <span class="m">600</span><span class="p">)</span>
</pre></div>
<p>For the full code behind the shiny applications and the animation produced above, check out my <a href="https://github.com/michaeltoth/shiny-projects/tree/master/census_names">Github</a></p>Installing and Running Shiny Server from Source on 32-bit Ubuntu2016-03-20T20:00:00-04:00Michael Tothtag:michaeltoth.me,2016-03-20:installing-and-running-shiny-server-from-source-on-32-bit-ubuntu.html<p>I recently migrated this site from a shared web hosting service to <a href="https://m.do.co/c/e38e89eb35d9">DigitalOcean</a> because I was interested in learning about how to host my own site and how web servers work. I also wanted to play with and host shiny applications on my own site, rather than relying on a third-party service provider. </p>
<p>In this post I'll talk about the steps I followed to get my shiny server running on a 32-bit Ubuntu DigitalOcean cloud instance. This article assumes that you have:</p>
<ul>
<li>A 32-bit Ubuntu OS with space available. For this I was using a DigitalOcean instance, but this should work just as well on a local computer or another provider's service</li>
<li>A decent understanding of the Linux command line</li>
<li>(Optional) A running nginx server that is serving your current site (where you want to deploy your shiny apps) </li>
</ul>
<h2>Installing Required Dependencies</h2>
<p>Before proceeding with the installation, make sure that you have all of the required dependencies installed on your machine. The following software is required; install anything missing with sudo apt-get install. If you run into issues later, double-check that you've installed these (with the specified versions) properly:</p>
<ul>
<li>python 2.6 or 2.7</li>
<li>cmake >= 2.8.10</li>
<li>gcc</li>
<li>g++</li>
<li>git </li>
</ul>
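<p>On Ubuntu, installing the missing pieces might look something like the following. Treat these package names as a starting point; the python package name in particular varies by release:</p>
<div class="highlight"><pre>sudo apt-get update
sudo apt-get install python2.7 cmake gcc g++ git

# Confirm the installed cmake meets the 2.8.10 minimum
cmake --version
</pre></div>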
<h2>R Installations</h2>
<p>Assuming you don't have R installed, you'll need to do that. We'll need both r-base and r-base-dev installed for some shiny apps to run properly, so let's do that:</p>
<div class="highlight"><pre><span></span>sudo apt-get update
sudo apt-get install r-base r-base-dev
</pre></div>
<p>Next we'll install the shiny R package. While we could start up an R session and simply run install.packages('shiny') from that instance, that would install the shiny package only for one user. Instead, we'll run the command below which will install the shiny package for all users on the machine:</p>
<div class="highlight"><pre><span></span>sudo su - -c <span class="s2">"R -e \"install.packages('shiny', repos='http://cran.rstudio.com/')\""</span>
</pre></div>
<p>When I ran the above command, the packages downloaded but did not install. After some investigation, I realized this was because my instance was running out of memory, so I added some swap space to the instance and ran the command again, which completed successfully. You may or may not have this issue, but if you do, the command below gives you 1GB of swap space, which should allow your installation to complete.</p>
<div class="highlight"><pre><span></span>sudo /bin/dd <span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>/var/swap.1 <span class="nv">bs</span><span class="o">=</span>1M <span class="nv">count</span><span class="o">=</span>1024
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1
</pre></div>
<h2>Shiny Server 32-bit Installation from Source</h2>
<p>RStudio provides a convenient .deb install with pre-compiled binaries for 64-bit architecture, but if you're running 32-bit architecture you'll need to build shiny server from source as described below. This involves manually compiling the required binary files and making some changes to set up directories and config files. This can be a bit complicated (luckily not <em>too</em> complicated), so if you do have 64-bit architecture/OS you can follow the <a href="https://www.rstudio.com/products/shiny/download-server/">simpler instructions from RStudio</a> to install directly. Otherwise, follow along with my instructions below.</p>
<p>cd into the directory where you'd like the shiny-server repository. I'll be installing mine to ~/dev, but any location will work. Note that this location will be temporary, so your decision here is not too important. </p>
<div class="highlight"><pre><span></span><span class="nb">cd</span> ~/dev
<span class="c1"># Clone the repository from GitHub and cd into the new directory</span>
git clone https://github.com/rstudio/shiny-server.git
<span class="nb">cd</span> shiny-server
<span class="c1"># Add the bin directory to the path so we can reference</span>
<span class="nv">DIR</span><span class="o">=</span><span class="sb">`</span><span class="nb">pwd</span><span class="sb">`</span>
<span class="nv">PATH</span><span class="o">=</span><span class="nv">$DIR</span>/bin:<span class="nv">$PATH</span>
<span class="nv">PYTHON</span><span class="o">=</span><span class="sb">`</span>which python<span class="sb">`</span>
<span class="c1"># If Python version is not 2.6.x or 2.7.x, you'll need to modify to </span>
<span class="c1"># reference one of these versions (e.g. which python26). This may</span>
<span class="c1"># or may not require you to install a new Python version. For more</span>
<span class="c1"># details, review the "Python" section of the RStudio documentation: </span>
<span class="c1"># https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source</span>
<span class="nv">$PYTHON</span> --version
<span class="c1"># Use cmake to prepare the make step. Modify "-DCMAKE_INSTALL_PREFIX"</span>
<span class="c1"># if you wish to install the software at a different location.</span>
mkdir tmp<span class="p">;</span> <span class="nb">cd</span> tmp
cmake -DCMAKE_INSTALL_PREFIX<span class="o">=</span>/usr/local -DPYTHON<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> ../
<span class="c1"># Recompile the npm modules included in the project</span>
make
mkdir ../build
<span class="o">(</span><span class="nb">cd</span> .. <span class="o">&&</span> ./bin/npm --python<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> rebuild<span class="o">)</span>
<span class="c1"># Need to rebuild our gyp bindings since 'npm rebuild' won't run gyp for us.</span>
<span class="o">(</span><span class="nb">cd</span> .. <span class="o">&&</span> ./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python<span class="o">=</span><span class="s2">"</span><span class="nv">$PYTHON</span><span class="s2">"</span> rebuild<span class="o">)</span>
<span class="c1"># Install the software at the predefined location</span>
sudo make install
</pre></div>
<h2>Configuration</h2>
<p>Now we've successfully installed Shiny Server in the location we defined above. You can safely delete the shiny-server git repo if you would like. There are a few configuration steps we need to complete before we can use shiny-server, which we'll cover next.</p>
<div class="highlight"><pre><span></span><span class="c1"># Place a shortcut to the shiny-server executable in /usr/bin. </span>
<span class="c1"># As /usr/bin should already be in your PATH variable, you won't need</span>
<span class="c1"># to permanently modify your PATH to reflect the change we made above</span>
sudo ln -s /usr/local/shiny-server/bin/shiny-server /usr/bin/shiny-server
<span class="c1">#Create shiny user on your system. On some systems, you may need to specify the full path to 'useradd'</span>
sudo useradd -r -m shiny
<span class="c1"># Create log, config, and application directories for shiny</span>
sudo mkdir -p /var/log/shiny-server
sudo mkdir -p /srv/shiny-server
sudo mkdir -p /var/lib/shiny-server
sudo chown shiny /var/log/shiny-server
sudo mkdir -p /etc/shiny-server
</pre></div>
<p>Shiny server will look for resources in certain file paths. Certain log directories and application directories can be modified in a configuration file stored at /etc/shiny-server/shiny-server.conf. By default, there will be no file at that location. The RStudio instructions claim that the default configuration will be used if no file exists, but that was not the case for me and I received an error message when trying to run. If the same happens for you, simply copy the default configuration into /etc/shiny-server/shiny-server.conf and you should be all set.</p>
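<p>If you hit the same error, the copy might look like the following. The source path is my assumption based on the /usr/local install prefix used above; adjust it to wherever your default.config ended up:</p>
<div class="highlight"><pre>sudo cp /usr/local/shiny-server/config/default.config /etc/shiny-server/shiny-server.conf
</pre></div>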
<p>Get the RStudio Upstart script, which will allow you to run shiny in the background just as you would run nginx or another server. This will let you run shiny automatically when you boot your system, or run it continuously on Digital Ocean:</p>
<div class="highlight"><pre><span></span>sudo wget https://raw.github.com/rstudio/shiny-server/master/config/upstart/shiny-server.conf<span class="se">\</span>
-O /etc/init/shiny-server.conf
</pre></div>
<p>Place any shiny app directories in /srv/shiny-server/your_app, and Shiny Server will serve them at http://ip_addr:3838/your_app. You can start and stop Shiny Server with the Upstart commands below:</p>
<div class="highlight"><pre><span></span>sudo start shiny-server
sudo stop shiny-server
</pre></div>
<h2>(Optional) Serving to a custom domain with clean URLs (no :3838 links)</h2>
<p>Now we have shiny installed and configured properly. However, you'll still need to set it up to serve apps from your actual website address. I updated my nginx configuration (/etc/nginx/sites-enabled/default or /etc/nginx/sites-enabled/<yoursite>) to add a block for shiny. This sets up a reverse proxy, which allows you to host apps at yoursite/shiny and avoids exposing port 3838 directly.</p>
<div class="highlight"><pre><span></span>location /shiny/ <span class="o">{</span>
proxy_pass http://127.0.0.1:3838/<span class="p">;</span>
proxy_http_version 1.1<span class="p">;</span>
proxy_set_header Upgrade <span class="nv">$http_upgrade</span><span class="p">;</span>
proxy_set_header Connection <span class="s2">"upgrade"</span><span class="p">;</span>
<span class="o">}</span>
</pre></div>
<p>I drew heavily on the following sources in navigating this process: </p>
<ul>
<li><a href="https://www.rstudio.com/products/shiny/download-server/">Installation</a> </li>
<li><a href="http://docs.rstudio.com/shiny-server/">Shiny Administrator's Guide</a> </li>
<li><a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-shiny-server-on-ubuntu-14-04">Digital Ocean Tutorial</a> </li>
<li><a href="https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source">Building from Source</a> </li>
<li><a href="http://deanattali.com/2015/05/09/setup-rstudio-shiny-server-digital-ocean/">Blog for Digital Ocean Setup</a> </li>
</ul>How to write an R Data Frame to an SQL Table2015-07-08T20:45:00-04:00Michael Tothtag:michaeltoth.me,2015-07-08:how-to-write-an-r-data-frame-to-an-sql-table.html<p>Frequently I find I need to perform an analysis that requires querying some values in a database table based on a series of IDs that might vary depending on some input. As an example, assume we have the following:</p>
<ul>
<li>A table that contains historical stock prices for 2000 stocks for the last 30 years</li>
<li>Some input that contains a user's portfolio of stock tickers </li>
</ul>
<p>Often, we'll want to pull the price history over a certain date range for all stocks in the user's portfolio. We could of course query all values in the stock prices table and then subset, but this is incredibly inefficient and also means we can't make use of any SQL aggregation functions in our query. Something I've done before when working in an SQL IDE is to create a temp table where I insert a list of the IDs that I am trying to look up, and then join on that table for my query. This is an ideal solution when we're talking about looking up more than a few securities. It took me a while to find an easy way to do this directly in R, but it turns out to be quite simple. I'm hoping my solution helps anybody else who might have this same issue.<br />
<br></p>
<h4>Assumptions:</h4>
<ul>
<li>A table called <em>stock_prices</em> that contains stock price history</li>
<li>A data frame called <em>tickers</em> that contains a list of stock tickers (column name = ticker)</li>
<li>Here I am using PostgreSQL, but this should work essentially the same for any SQL variant<br />
<br></li>
</ul>
<h4>Code</h4>
<p>Start by setting up the connection:</p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>RPostgreSQL<span class="p">)</span>
drv <span class="o"><-</span> dbDriver<span class="p">(</span><span class="s">"PostgreSQL"</span><span class="p">)</span>
con <span class="o"><-</span> dbConnect<span class="p">(</span>drv<span class="p">,</span> user<span class="o">=</span><span class="s">'user'</span><span class="p">,</span> password<span class="o">=</span><span class="s">'password'</span><span class="p">,</span> dbname<span class="o">=</span><span class="s">'my_database'</span><span class="p">,</span> host<span class="o">=</span><span class="s">'host'</span><span class="p">)</span>
</pre></div>
<p><br></p>
<p>Next create the temp table and insert values from our data frame. The key here is the dbWriteTable function, which allows us to write an R data frame directly to a database table. The data frame's column names will be used as the database table's fields.</p>
<div class="highlight"><pre><span></span><span class="c1"># Drop table if it already exists</span>
<span class="kr">if</span> <span class="p">(</span>dbExistsTable<span class="p">(</span>con<span class="p">,</span> <span class="s">"temp_tickers"</span><span class="p">))</span>
dbRemoveTable<span class="p">(</span>con<span class="p">,</span> <span class="s">"temp_tickers"</span><span class="p">)</span>
<span class="c1"># Write the data frame to the database</span>
dbWriteTable<span class="p">(</span>con<span class="p">,</span> name <span class="o">=</span> <span class="s">"temp_tickers"</span><span class="p">,</span> value <span class="o">=</span> tickers<span class="p">,</span> row.names <span class="o">=</span> <span class="kc">FALSE</span><span class="p">)</span>
</pre></div>
<p><br></p>
<p>Finally, join the stock prices table to the table we just created and query the subsetted values:</p>
<div class="highlight"><pre><span></span>sql <span class="o"><-</span> <span class="s">" </span>
<span class="s"> select sp.ticker, sp.date, sp.price</span>
<span class="s"> from stock_prices sp</span>
<span class="s"> join temp_tickers tt on sp.ticker = tt.ticker</span>
<span class="s"> where date between '2000-01-01' and '2015-07-08'</span>
<span class="s">"</span>
results <span class="o"><-</span> dbGetQuery<span class="p">(</span>con<span class="p">,</span> sql<span class="p">)</span>
<span class="c1"># Free up resources</span>
dbDisconnect<span class="p">(</span>con<span class="p">)</span>
dbUnloadDriver<span class="p">(</span>drv<span class="p">)</span>
</pre></div>
<p><br></p>
<p>And that's it. It turned out to not be very complicated, and many may already know this, but it took me a while to figure out how this should be done. I spent a lot of time messing around with INSERT statements before scrapping that idea and coming up with this solution. Let me know if you find this helpful or if you have any ideas on how to do this better!</p>Analyzing Historical Default Rates of Lending Club Notes2015-03-09T21:28:00-04:00Michael Tothtag:michaeltoth.me,2015-03-09:analyzing-historical-default-rates-of-lending-club-notes.html<p>In case you're unfamiliar, Lending Club is the world's largest peer-to-peer lending company, offering a platform for borrowers and lenders to work directly with one another, eliminating the need for a financial intermediary like a bank. Removing the middle-man generally allows both borrowers and lenders to benefit from better interest rates than they otherwise would, which makes peer-to-peer lending an attractive proposition. This post will be the first in a series of posts analyzing the probability of default and expected return of Lending Club notes. In this first post, I'll cover some of the background on Lending Club, talk about getting and cleaning the loan data, and perform some exploratory analysis on the available variables and outcomes. In subsequent posts, I'll work on developing a predictive model for determining the loan default probabilities. <em>Before investing, it is always important to fully understand the risks, and this post does not constitute investment advice in either Lending Club or in Lending Club notes.</em> </p>
<h2>Background and Gathering Data</h2>
<p>Lending Club makes all past borrower data freely available <a href="https://www.lendingclub.com/info/download-data.action">on their website</a> for review, and I will be referencing the 2012-2013 data throughout this post. </p>
<p>To download the 2012-2013 data from Lending Club: </p>
<div class="highlight"><pre><span></span><span class="c1"># Download and extract data from Lending Club</span>
<span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">file.exists</span><span class="p">(</span><span class="s">"LoanStats3b.csv"</span><span class="p">))</span> <span class="p">{</span>
fileUrl <span class="o"><-</span> <span class="s">"https://resources.lendingclub.com/LoanStats3b.csv.zip"</span>
download.file<span class="p">(</span>fileUrl<span class="p">,</span> destfile <span class="o">=</span> <span class="s">"LoanStats3b.csv.zip"</span><span class="p">,</span> method<span class="o">=</span><span class="s">"curl"</span><span class="p">)</span>
dateDownloaded <span class="o"><-</span> <span class="kp">date</span><span class="p">()</span>
unzip<span class="p">(</span><span class="s">"LoanStats3b.csv.zip"</span><span class="p">)</span>
<span class="p">}</span>
<span class="c1"># Read in Lending Club Data</span>
<span class="kr">if</span> <span class="p">(</span><span class="o">!</span><span class="kp">exists</span><span class="p">(</span><span class="s">"full_dataset"</span><span class="p">))</span> <span class="p">{</span>
full_dataset <span class="o"><-</span> read.csv<span class="p">(</span>file<span class="o">=</span><span class="s">"LoanStats3b.csv"</span><span class="p">,</span> header<span class="o">=</span><span class="kc">TRUE</span><span class="p">,</span> skip <span class="o">=</span> <span class="m">1</span><span class="p">)</span>
<span class="p">}</span>
</pre></div>
<p>For each loan in the file, Lending Club provides an indication of the current loan status. Because many of the loan statuses represent similar outcomes, I've mapped them from Lending Club's 7 down to only 2, simplifying the problem of classifying loan outcomes without much loss of information useful for investment decisions. My two outcomes "Performing" and "NonPerforming" seek to separate those loans likely to pay in full from those likely to default. Below I include a table summarizing the mappings: </p>
<p><br>
<img alt="Loan Statuses" src="https://michaeltoth.me/images/loan-statuses.jpg" />
<br> </p>
<p>Now that we've loaded the data, let's extract the fields we need and do some cleaning. We can eliminate any fields that would not have been known at the time of issuance, as we'll be trying to make decisions on loan investments using available pre-issuance data. We can also eliminate a few indicative data fields that are repetitive or too granular to be analyzed, and make some formatting changes to get the data ready for analysis. Finally, we'll map the loan statuses to the binary "Performing" and "NonPerforming" classifiers as discussed above. </p>
<div class="highlight"><pre><span></span><span class="c1"># Select variables to keep and subset the data</span>
variables <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"id"</span><span class="p">,</span> <span class="s">"loan_amnt"</span><span class="p">,</span> <span class="s">"term"</span><span class="p">,</span> <span class="s">"int_rate"</span><span class="p">,</span> <span class="s">"installment"</span><span class="p">,</span> <span class="s">"grade"</span><span class="p">,</span>
<span class="s">"sub_grade"</span><span class="p">,</span> <span class="s">"emp_length"</span><span class="p">,</span> <span class="s">"home_ownership"</span><span class="p">,</span> <span class="s">"annual_inc"</span><span class="p">,</span>
<span class="s">"is_inc_v"</span><span class="p">,</span> <span class="s">"loan_status"</span><span class="p">,</span> <span class="s">"purpose"</span><span class="p">,</span> <span class="s">"addr_state"</span><span class="p">,</span> <span class="s">"dti"</span><span class="p">,</span>
<span class="s">"delinq_2yrs"</span><span class="p">,</span> <span class="s">"earliest_cr_line"</span><span class="p">,</span> <span class="s">"inq_last_6mths"</span><span class="p">,</span>
<span class="s">"mths_since_last_delinq"</span><span class="p">,</span> <span class="s">"mths_since_last_record"</span><span class="p">,</span> <span class="s">"open_acc"</span><span class="p">,</span>
<span class="s">"pub_rec"</span><span class="p">,</span> <span class="s">"revol_bal"</span><span class="p">,</span> <span class="s">"revol_util"</span><span class="p">,</span> <span class="s">"total_acc"</span><span class="p">,</span>
<span class="s">"initial_list_status"</span><span class="p">,</span> <span class="s">"collections_12_mths_ex_med"</span><span class="p">,</span>
<span class="s">"mths_since_last_major_derog"</span><span class="p">)</span>
train <span class="o"><-</span> full_dataset<span class="p">[</span>variables<span class="p">]</span>
<span class="c1"># Reduce loan status to binary "Performing" and "NonPerforming" Measures:</span>
train<span class="o">$</span>new_status <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span><span class="kp">ifelse</span><span class="p">(</span>train<span class="o">$</span>loan_status <span class="o">%in%</span> <span class="kt">c</span><span class="p">(</span><span class="s">"Current"</span><span class="p">,</span> <span class="s">"Fully Paid"</span><span class="p">),</span>
<span class="s">"Performing"</span><span class="p">,</span> <span class="s">"NonPerforming"</span><span class="p">))</span>
<span class="c1"># Convert a subset of the numeric variables to factors</span>
train<span class="o">$</span>delinq_2yrs <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>delinq_2yrs<span class="p">)</span>
train<span class="o">$</span>inq_last_6mths <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>inq_last_6mths<span class="p">)</span>
train<span class="o">$</span>open_acc <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>open_acc<span class="p">)</span>
train<span class="o">$</span>pub_rec <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>pub_rec<span class="p">)</span>
train<span class="o">$</span>total_acc <span class="o"><-</span> <span class="kp">factor</span><span class="p">(</span>train<span class="o">$</span>total_acc<span class="p">)</span>
<span class="c1"># Convert interest rate numbers to numeric (strip percent signs)</span>
train<span class="o">$</span>int_rate <span class="o"><-</span> <span class="kp">as.numeric</span><span class="p">(</span><span class="kp">sub</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> train<span class="o">$</span>int_rate<span class="p">))</span>
train<span class="o">$</span>revol_util <span class="o"><-</span> <span class="kp">as.numeric</span><span class="p">(</span><span class="kp">sub</span><span class="p">(</span><span class="s">"%"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> train<span class="o">$</span>revol_util<span class="p">))</span>
</pre></div>
<h2>Analyzing Predictive Power of Variables</h2>
<p><br></p>
<h4>Lending Club Grades and Subgrades</h4>
<p>All types of borrowers are using peer-to-peer lending for a variety of purposes, which raises the question of how to determine appropriate interest rates given the varying levels of risk across borrowers. Luckily, Lending Club handles this for us. They use an algorithm to determine each borrower's level of risk, and then set interest rates according to that risk. Specifically, Lending Club maps borrowers to a series of grades [A-G] and subgrades [A-G][1-5] based on their risk profile. Loans in each subgrade are then given appropriate interest rates. The specific rates change over time according to market conditions, but generally they fall within a tight range for each subgrade. </p>
<p>Let's take a look at the proportions of performing and non-performing loans by Lending Club's provided grades: </p>
<div class="highlight"><pre><span></span>by_grade <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>grade<span class="p">,</span> exclude<span class="o">=</span><span class="s">""</span><span class="p">)</span>
prop_grade <span class="o"><-</span> <span class="kp">prop.table</span><span class="p">(</span>by_grade<span class="p">,</span><span class="m">2</span><span class="p">)</span>
barplot<span class="p">(</span>prop_grade<span class="p">,</span> main <span class="o">=</span> <span class="s">"Loan Performance by Grade"</span><span class="p">,</span> xlab <span class="o">=</span> <span class="s">"Grade"</span><span class="p">,</span>
col<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"darkblue"</span><span class="p">,</span><span class="s">"red"</span><span class="p">),</span> legend <span class="o">=</span> <span class="kp">rownames</span><span class="p">(</span>prop_grade<span class="p">))</span>
by_subgrade <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>sub_grade<span class="p">,</span> exclude<span class="o">=</span><span class="s">""</span><span class="p">)</span>
prop_subgrade <span class="o"><-</span> <span class="kp">prop.table</span><span class="p">(</span>by_subgrade<span class="p">,</span><span class="m">2</span><span class="p">)</span>
barplot<span class="p">(</span>prop_subgrade<span class="p">,</span> main <span class="o">=</span> <span class="s">"Loan Performance by Sub Grade"</span><span class="p">,</span> xlab <span class="o">=</span> <span class="s">"SubGrade"</span><span class="p">,</span>
col<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"darkblue"</span><span class="p">,</span><span class="s">"red"</span><span class="p">),</span>legend <span class="o">=</span> <span class="kp">rownames</span><span class="p">(</span>prop_subgrade<span class="p">))</span>
</pre></div>
<p>We can see from the chart below that rates of default steadily increase as the loan grades worsen from A to G, as expected.</p>
<p><br>
<img alt="Performance by Grade" src="https://michaeltoth.me/images/by-grade.png" />
<br> </p>
<p>We see a similar pattern for the subgrades, although the trend begins to weaken across the G1-G5 subgrades. On further investigation, I found that there are only a few hundred data points for each of these subgrades, in contrast to thousands of data points for the A-F subgrades, so the differences among the G subgrades are not large enough to be statistically significant.</p>
<p><br>
<img alt="Performance by Subgrade" src="https://michaeltoth.me/images/by-subgrade.png" />
<br> </p>
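<p>As a quick sanity check on that claim, a two-sample proportion test on two of the G subgrades illustrates the power problem. The counts below are hypothetical round numbers for illustration only, not values from the dataset:</p>
<div class="highlight"><pre># With only a few hundred loans per G subgrade, even a 5% gap in
# non-performance rates is hard to distinguish from noise.
# Hypothetical counts for illustration:
dflt_g1 &lt;- 90    # non-performing loans in G1
n_g1    &lt;- 300   # total loans in G1 (~30% non-performing)
dflt_g5 &lt;- 105   # non-performing loans in G5
n_g5    &lt;- 300   # total loans in G5 (~35% non-performing)
prop.test(c(dflt_g1, dflt_g5), c(n_g1, n_g5))
# The p-value is well above 0.05, so we cannot reject equal proportions
</pre></div>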
<p>In general, it looks like the Lending Club grading system does a pretty great job of predicting ultimate loan performance, but let's check out some of the other available data to see what other trends we might be able to find in the data.</p>
<p><br></p>
<h4>Home Ownership</h4>
<p>The Lending Club data has 3 main classifications for home ownership: mortgage (outstanding mortgage payments), own (home is owned outright), and rent. I would expect those with mortgages to default less frequently than those who rent, both because there are credit requirements to get a mortgage and because those with mortgages tend, in general, to be more established. Let's see whether this is actually the case: </p>
<div class="highlight"><pre><span></span>ownership_status <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>home_ownership<span class="p">,</span>
exclude<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"OTHER"</span><span class="p">,</span><span class="s">"NONE"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_ownership <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>ownership_status<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
</pre></div>
<p><br>
<img alt="Ownership Status" src="https://michaeltoth.me/images/ownership-status.jpg" />
<br> </p>
<p>So those with mortgages default the least, followed by those who own their homes outright, and finally those who rent. The differences here are much smaller than when comparing grades, but they are still notable. Let's verify whether they are statistically significant: </p>
<div class="highlight"><pre><span></span><span class="c1"># Calculate the counts of mortgage, owners, and renters:</span>
count_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span><span class="p">)</span>
count_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span><span class="p">)</span>
count_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span><span class="p">)</span>
<span class="c1"># Calculate the counts of default for mortgages, owners, and renters:</span>
dflt_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for mortgage vs owners</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_m<span class="p">,</span>dflt_o<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_m<span class="p">,</span>count_o<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for owners vs renters</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_o<span class="p">,</span>dflt_r<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_o<span class="p">,</span>count_r<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
</pre></div>
<p>The p-value of the first test was 6.377 × 10<sup>-12</sup> and the p-value of the second test was 3.787 × 10<sup>-8</sup>, indicating that the differences in both of these proportions are highly statistically significant. Although the differences in the default probabilities are somewhat small, on the order of 1.5%, the number of data points is in the high tens of thousands, which drives the significance. Given this result, we can generally conclude that similar differences in default probabilities for other factors will also be significant, so long as a similar quantity of data points is available. </p>
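<p>To see why the sample size matters so much here, base R's <code>power.prop.test</code> can estimate how many observations per group are needed to reliably detect a difference of this magnitude. The base rates below are illustrative assumptions, not figures from the dataset:</p>
<div class="highlight"><pre># How many observations per group are needed to detect a 1.5% difference
# in default rates (e.g. 10% vs 11.5%) with 80% power at the 5% level?
power.prop.test(p1 = 0.10, p2 = 0.115, sig.level = 0.05, power = 0.80)
# n comes out to several thousand per group, comfortably below the
# tens of thousands of loans available in this dataset
</pre></div>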
<p><strong>Note:</strong> for the remaining analysis, the code for each variable becomes a bit repetitive, so in the interest of minimizing the length of this post I will present only the results. If you are interested to see the actual code, you will find it in the appendix at the bottom of this post. You can also read the <a href="https://github.com/michaeltoth/lending_club/blob/master/LendingClub.R">complete code on Github</a>. </p>
<p><br></p>
<h4>Debt to Income Ratio</h4>
<p>Debt to income ratio is the ratio of a borrower's monthly debt payments to monthly income. This was originally formatted as a continuous numerical variable, but I bucketed it into 5% increments to better visualize the effect on loan performance. As we might expect, there is a steady increase in the percentage of non-performing loans as DTI increases, reflecting the constraints that increased debt puts on a borrower's ability to repay: </p>
<p><br>
<img alt="Debt to Income Ratio" src="https://michaeltoth.me/images/dti.jpg" />
<br> </p>
<p><br></p>
<h4>Revolving Utilization Percent</h4>
<p>Revolving utilization percent is the portion of a borrower's revolving credit limit (i.e. credit card limit) that they are actually using at any given point. For example, if a borrower's total credit limit is $15,000 and their outstanding balance is $1,500, their utilization rate is 10%. We can see below that the percentage of non-performing loans steadily increases with utilization rate. Borrowers with high utilization rates are more likely to have high fixed credit card payments, which might affect their ability to repay their loans. A high utilization rate also often reflects a lack of other financing options, with borrowers turning to peer-to-peer lending as a last resort. This is in contrast to borrowers with low utilization rates, who may be using peer-to-peer lending opportunistically to pursue lower interest payments. </p>
<p><br>
<img alt="Revolving Utilization" src="https://michaeltoth.me/images/utilization.jpg" />
<br> </p>
<p><br></p>
<h4>Loan Purpose</h4>
<p>Loan purpose refers to the borrower's stated reason for taking out the loan. We see below that credit card and debt consolidation loans tend to perform better, along with loans for home improvement, cars, and other major purchases. Loans for luxury spending on vacations and weddings, and for unexpected medical and moving expenses, generally perform worse. Small business loans perform very poorly, perhaps reflecting the fact that borrowers unable to get bank financing for their small businesses may have poor credit or business plans that aren't fully developed. </p>
<p><br>
<img alt="Loan Purpose" src="https://michaeltoth.me/images/loan-purpose.jpg" />
<br> </p>
<p><br></p>
<h4>Inquiries in the Past 6 Months</h4>
<p>Number of inquiries refers to the number of times a borrower's credit report is accessed by financial institutions, which generally happens when the borrower is seeking a loan or credit line. More inquiries lead to higher rates of nonperformance, perhaps because a borrower's increasing desperation to access credit reflects poor financial health. Interestingly, we see an increase in loan performance in the 4+ inquiries bucket. These high levels of inquiries may reflect financially savvy borrowers shopping around for mortgage loans or credit cards. </p>
<p><br>
<img alt="Inquiries" src="https://michaeltoth.me/images/inquiries.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Total Accounts</h4>
<p>A larger number of total accounts indicates a longer credit history and a high level of trust between the borrower and financial institutions, both of which point to financial health and lower rates of default. We see steady increases in the rates of performing loans as the number of accounts increases from 7 to around 20, but diminishing effects after that. </p>
<p><br>
<img alt="Total Accounts" src="https://michaeltoth.me/images/total-accounts.jpg" />
<br> </p>
<p><br></p>
<h4>Annual Income</h4>
<p>As we might expect, the higher a borrower's annual income, the more likely they are to be able to repay their loans. Below I've broken the income data into quintiles, and we can see that those in the top 20% of annual incomes ($95,000+) are approximately 6% more likely to be performing borrowers than those in the bottom 20% (less than $42,000). </p>
<p><br>
<img alt="Annual Income" src="https://michaeltoth.me/images/annual-income.jpg" />
<br> </p>
<p><br></p>
<h4>Loan Amount</h4>
<p>As the amount borrowed increases, we see increasing rates of nonperforming loans. The difference between the first two buckets is only around 1% (and the intra-bucket differences are very small), but we see a larger decrease in loan quality in the $30,000 - $35,000 bucket. Noting that the Lending Club maximum loan is $35,000, this may indicate particularly desperate borrowers who are maximizing their possible borrowing. </p>
<p><br>
<img alt="Loan Amount" src="https://michaeltoth.me/images/loan-amount.jpg" />
<br> </p>
<p><br></p>
<h4>Employment Length</h4>
<p>We'd expect those who have been employed longer to be more stable, and thus less likely to default. Looking into the data, 3 key groups emerged: the unemployed, those employed less than 10 years, and those employed for 10+ years: </p>
<p><br>
<img alt="Employment Length" src="https://michaeltoth.me/images/employment-length.jpg" />
<br> </p>
<p><br></p>
<h4>Delinquencies in the Past 2 Years</h4>
<p>The number of delinquencies in the past 2 years indicates the number of times a borrower has been behind on payments. I combined all values of 3 or larger into a single bucket for analysis, as this was a distribution with a long right tail. Interestingly, those with a single delinquency seem to perform slightly better than those with none. In general, however, the differences among 0, 1, and 2 delinquencies are relatively small, while those with 3 or more show a significant decrease in performance. </p>
<p><br>
<img alt="Delinquencies" src="https://michaeltoth.me/images/delinquencies.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Open Accounts</h4>
<p>Unlike the number of total accounts above, which we saw to be quite significant, the number of open accounts variable was not a particularly strong indicator: </p>
<p><br>
<img alt="Open Accounts" src="https://michaeltoth.me/images/open-accounts.jpg" />
<br> </p>
<p><br></p>
<h4>Verified Income Status</h4>
<p>Lending Club categorizes income verification into three statuses: not verified, source verified, and verified. Verified income means that Lending Club independently verified both the source and size of reported income, source verified means that they verified only the source of the income, and not verified means there was no independent verification of the reported values. Interestingly, we see that as income verification increases, the loan performance actually worsens. During the mortgage crisis, non-verified "no-doc" loans were among the worst performing, so the reversal here is interesting. This likely reflects the fact that Lending Club only verifies those borrowers who seem to be of worse credit quality, so there may be <a href="http://en.wikipedia.org/wiki/Confounding">confounding variables</a> present here.<br />
<br>
<img alt="Verified Income" src="https://michaeltoth.me/images/verified-income.jpg" />
<br> </p>
<p><br></p>
<h4>Number of Public Records</h4>
<p>Public records generally refer to bankruptcies, so we would expect those with more public records to show worse performance. In fact, performance improves as we move from 0 to 1 to 2 public records, possibly indicating stricter lending standards from Lending Club for borrowers with public records: </p>
<p><br>
<img alt="Public Records" src="https://michaeltoth.me/images/public-records.jpg" />
<br> </p>
<p><br></p>
<h4>Variables that were not significant:</h4>
<ul>
<li>Months since last delinquency</li>
<li>Months since last major derogatory note</li>
<li>Collections previous 12 months (too few data points on which to make any conclusions or form predictions)</li>
</ul>
<h2>Summary</h2>
<ul>
<li>Lending Club grade and subgrade variables provide the most predictive power for determining expected loan performance.</li>
<li>A large number of the other variables also provide strong indications of expected performance. Among the most telling are debt-to-income ratio, credit utilization rate, home ownership status, loan purpose, annual income, inquiries in the past 6 months, and number of total accounts.</li>
<li>Verified income status and number of public records show results opposite from what we would expect. This is likely due to increased standards on borrowers with poorer credit history, so all else equal we see outperformance in these loans.</li>
</ul>
<p>We've gotten a good understanding of the available borrower data, and we've seen which variables give the best indications of future loan performance. In the next post, we'll work on developing a predictive model for projecting the probability of default for newly issued loans. </p>
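<p>As a preview, here is a minimal sketch of the kind of model the next post will develop, assuming the <code>train</code> data frame prepared above. The choice of predictors here is illustrative, not final:</p>
<div class="highlight"><pre># Baseline logistic regression of loan performance on a few of the
# stronger predictors identified above. With the factor levels created
# earlier, glm models the probability of the second level, "Performing".
fit &lt;- glm(new_status ~ grade + dti + revol_util + annual_inc,
           data = train, family = binomial)
summary(fit)
# Fitted probabilities of performing for the training loans:
head(predict(fit, type = "response"))
</pre></div>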
<h2>Appendix</h2>
<p>Below I've included the code used to generate the numbers in the tables above. You can also find the <a href="https://github.com/michaeltoth/projects/tree/master/lending-club-analysis">complete code available on GitHub</a>.</p>
<div class="highlight"><pre><span></span><span class="c1">### Explore the relationships between default rates and factor levels</span>
<span class="c1">### I take a few different approaches, but the key idea is the same</span>
<span class="c1"># Home Ownership (exclude status "OTHER" and "NONE" because of few data points)</span>
home_ownership <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>home_ownership<span class="p">,</span>
exclude<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="s">"OTHER"</span><span class="p">,</span><span class="s">"NONE"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_home_ownership <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>home_ownership<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Test for significance of the difference in proportions for home ownership factors</span>
<span class="c1"># Calculate the counts of mortgage, owners, and renters:</span>
count_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span><span class="p">)</span>
count_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span><span class="p">)</span>
count_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span><span class="p">)</span>
<span class="c1"># Calculate the counts of default for mortgages, owners, and renters:</span>
dflt_m <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"MORTGAGE"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_o <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"OWN"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
dflt_r <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span>train<span class="o">$</span>home_ownership <span class="o">==</span> <span class="s">"RENT"</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for mortgage vs owners</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_m<span class="p">,</span>dflt_o<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_m<span class="p">,</span>count_o<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># 1-sided proportion test for owners vs renters</span>
prop.test<span class="p">(</span><span class="kt">c</span><span class="p">(</span>dflt_o<span class="p">,</span>dflt_r<span class="p">),</span> <span class="kt">c</span><span class="p">(</span>count_o<span class="p">,</span>count_r<span class="p">),</span> alternative <span class="o">=</span> <span class="s">"less"</span><span class="p">)</span>
<span class="c1"># Debt to Income Ratio (break into factors at 5% levels)</span>
train<span class="o">$</span>new_dti <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>dti<span class="p">,</span> breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">15</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">25</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">35</span><span class="p">))</span>
dti <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_dti<span class="p">)</span>
prop_dti <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>dti<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Revolving Utilization (break into 0 - 20, then factors of 10, then 80+)</span>
train<span class="o">$</span>new_revol_util <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>revol_util<span class="p">,</span> breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">40</span><span class="p">,</span> <span class="m">50</span><span class="p">,</span> <span class="m">60</span><span class="p">,</span> <span class="m">70</span><span class="p">,</span> <span class="m">80</span><span class="p">,</span> <span class="m">141</span><span class="p">))</span>
revol_util <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_revol_util<span class="p">)</span>
prop_revol_util <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>revol_util<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Loan Purpose (exclude renewable energy because so few data points)</span>
purpose <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span>train<span class="o">$</span>purpose<span class="p">,</span> exclude <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="s">"renewable_energy"</span><span class="p">,</span><span class="s">""</span><span class="p">))</span>
prop_purpose <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>purpose<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Inquiries in the last 6 months (combine factor levels for any > 4)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>inq_last_6mths<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="s">"2"</span><span class="p">,</span> <span class="s">"3"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"4+"</span><span class="p">,</span> <span class="m">5</span><span class="p">))</span>
inq_last_6mths <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>inq_last_6mths<span class="p">)</span>
prop_inq_last_6mths <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>inq_last_6mths<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of total accounts (combine factor levels into groups of 5, then 23+)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>total_acc<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="s">"<= 7"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"8 - 12"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"13 - 17"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"18 - 22"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"23+"</span><span class="p">,</span> <span class="m">68</span><span class="p">))</span>
total_acc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>total_acc<span class="p">)</span>
prop_total_acc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>total_acc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Annual Income (factor into quantiles of 20%)</span>
train<span class="o">$</span>new_annual_inc <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>annual_inc<span class="p">,</span>
quantile<span class="p">(</span>train<span class="o">$</span>annual_inc<span class="p">,</span> na.rm <span class="o">=</span> <span class="kc">TRUE</span><span class="p">,</span>
probs <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0.2</span><span class="p">,</span> <span class="m">0.4</span><span class="p">,</span> <span class="m">0.6</span><span class="p">,</span> <span class="m">0.8</span><span class="p">,</span> <span class="m">1</span><span class="p">)))</span>
annual_inc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_annual_inc<span class="p">)</span>
prop_annual_inc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>annual_inc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Loan Amount (break into < 15k, 15k - 30k, 30k - 35k)</span>
train<span class="o">$</span>new_loan_amnt <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>loan_amnt<span class="p">,</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">15000</span><span class="p">,</span> <span class="m">30000</span><span class="p">,</span> <span class="m">35000</span><span class="p">))</span>
loan_amnt <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>new_loan_amnt<span class="p">)</span>
prop_loan_amnt <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>loan_amnt<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Employment Length (combine factor levels for better comparison)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>emp_length<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"None"</span><span class="p">,</span> <span class="s">"< 10 years"</span><span class="p">,</span> <span class="s">"< 10 years"</span><span class="p">,</span> <span class="s">"10+ years"</span><span class="p">,</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"< 10 years"</span><span class="p">,</span> <span class="m">8</span><span class="p">),</span> <span class="s">"None"</span><span class="p">)</span>
emp_length <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>emp_length<span class="p">)</span>
prop_emp_length <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>emp_length<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Delinquencies in the past 2 Years (combine factors levels for any > 3)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>delinq_2yrs<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="s">"2"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"3+"</span><span class="p">,</span> <span class="m">17</span><span class="p">))</span>
delinq_2yrs <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>delinq_2yrs<span class="p">)</span>
prop_delinq_2yrs <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>delinq_2yrs<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of Open Accounts (combine factor levels into groups of 5)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>open_acc<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="kp">rep</span><span class="p">(</span><span class="s">"<= 5"</span><span class="p">,</span> <span class="m">6</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"6 - 10"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span>
<span class="kp">rep</span><span class="p">(</span><span class="s">"11 - 15"</span><span class="p">,</span> <span class="m">5</span><span class="p">),</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"16+"</span><span class="p">,</span> <span class="m">38</span><span class="p">))</span>
open_acc <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>open_acc<span class="p">)</span>
prop_open_acc <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>open_acc<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Verified income status</span>
is_inc_v <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>is_inc_v<span class="p">,</span> exclude <span class="o">=</span> <span class="s">""</span><span class="p">)</span>
prop_is_inc_v <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>is_inc_v<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Number of Public Records (break factor levels into 0, 1, 2+)</span>
<span class="kp">levels</span><span class="p">(</span>train<span class="o">$</span>pub_rec<span class="p">)</span> <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="s">"0"</span><span class="p">,</span> <span class="s">"1"</span><span class="p">,</span> <span class="kp">rep</span><span class="p">(</span><span class="s">"2+"</span><span class="p">,</span> <span class="m">12</span><span class="p">))</span>
pub_rec <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>pub_rec<span class="p">)</span>
prop_pub_rec <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>pub_rec<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Months Since Last Record (compare blank vs. non-blank)</span>
na_last_record <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">))</span>
not_na_last_record <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">))</span>
na_last_rec_dflt <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">)</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
not_na_last_rec_dflt <span class="o"><-</span> <span class="kp">sum</span><span class="p">(</span><span class="o">!</span><span class="kp">is.na</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_record<span class="p">)</span> <span class="o">&</span> train<span class="o">$</span>new_status <span class="o">==</span> <span class="s">"NonPerforming"</span><span class="p">)</span>
not_na_last_rec_pct_dflt <span class="o"><-</span> not_na_last_rec_dflt <span class="o">/</span> not_na_last_record
na_last_rec_pct_dflt <span class="o"><-</span> na_last_rec_dflt<span class="o">/</span>na_last_record
<span class="c1"># Months since last delinquency (break factor levels in increments of 10)</span>
train<span class="o">$</span>mths_since_last_delinq <span class="o"><-</span> <span class="kp">cut</span><span class="p">(</span>train<span class="o">$</span>mths_since_last_delinq<span class="p">,</span>
breaks <span class="o">=</span> <span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">30</span><span class="p">,</span> <span class="m">40</span><span class="p">,</span> <span class="m">50</span><span class="p">,</span> <span class="m">60</span><span class="p">,</span> <span class="m">156</span><span class="p">))</span>
mths_since_last_delinq <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>mths_since_last_delinq<span class="p">)</span>
prop_mths_since_last_delinq <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>mths_since_last_delinq<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
<span class="c1"># Collections last 12 months</span>
collections <span class="o"><-</span> <span class="kp">table</span><span class="p">(</span>train<span class="o">$</span>new_status<span class="p">,</span> train<span class="o">$</span>collections_12_mths_ex_med<span class="p">)</span>
prop_collections <span class="o"><-</span> <span class="kp">round</span><span class="p">(</span><span class="kp">prop.table</span><span class="p">(</span>collections<span class="p">,</span> <span class="m">2</span><span class="p">)</span> <span class="o">*</span> <span class="m">100</span><span class="p">,</span> <span class="m">2</span><span class="p">)</span>
</pre></div>Plotting the Evolution of the U.S. Treasury Yield Curve2014-11-12T20:01:00-05:00Michael Tothtag:michaeltoth.me,2014-11-12:plotting-the-evolution-of-the-us-treasuryyield-curve.html<p>Last week I came across a <a href="http://isomorphism.es/post/101890975168/treasury-yield-curve-from-the-volcker-era-through">graphic</a> that plots changes in the treasury yield curve from 1982 through 2012. For those unfamiliar, the yield curve shows the level of interest rates available to investors at a series of times to maturity or <em>terms</em>. I won't go into too much detail here, but for more information you may find the <a href="http://en.wikipedia.org/wiki/Yield_curve">Yield Curve</a> page on Wikipedia helpful. Note that while 20-year and 30-year treasuries are currently available, they were not available for this entire period and are therefore excluded from this analysis. </p>
<p>Building on the original plot mentioned above, I pulled post-2012 data directly from the Treasury website and added it to the original data to produce an extended graphic of yield curve changes. I also updated the plot formatting and highlighted periods with inverted yield curves using a bright red line.<br />
<br>
<img alt="Yield Curve" src="https://michaeltoth.me/images/yield-output.gif" />
<br> </p>
<p>I've always liked graphics like this that show some changing feature over time, and I think the yield curve illustration is particularly informative. You can clearly see the extremely high interest rates that prevailed throughout the 1980s, and later the characteristically flat yield curve associated with the <a href="http://en.wikipedia.org/wiki/Zero_interest-rate_policy">Zero Interest Rate Policy</a> regime post-2008. The few yield curve inversions (periods where short-term rates are higher than long-term rates) are visible as well, highlighted in red. Yield curve inversions are generally thought to signal an impending economic decline, so highlighting these is informative. The plot also shows clearly that long-term rates have historically been much less volatile than short-term rates, but that the opposite has held true since 2008, with Federal Reserve action keeping short-term rates near zero. </p>
<p>I've been experimenting with parameter and plotting settings in R, as part of the <a href="https://www.coursera.org/course/exdata">Exploratory Data Analysis</a> Coursera class, and I thought this was a good opportunity to experiment with different options. I derived many of the ideas and aesthetics in these plots from examples in Flowing Data's <a href="http://flowingdata.com/2014/10/23/moving-past-default-charts/">Moving Past Default Charts</a> post. </p>
<h3>Technical Details and R Code</h3>
<p>The FedYieldCurve data from the YieldCurve package only contains data through 2012. I wanted to extend the time history from the end of 2012 to the present, so I pulled the additional yield curve data from the <a href="http://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield">Treasury website</a>. I downloaded the raw text data and subset and formatted it on the command line before uploading the results to Google Drive. I then pulled this information in using the fetchGoogle function from the mosaic package. </p>
<p>After converting the 2012-2014 data to .xts format and combining with the FedYieldCurve data, I modified the graphical parameters to make for a more interesting plot. Then, using the saveGIF function and a for loop, I was able to create the above GIF with a single frame for each month in the data series. Within the loop, I included an if statement to determine whether the 3M rate exceeded the 10Y point (inverted yield curve), and plotted these periods using a red line to highlight this. </p>
<p>I've included the complete R code below. You can also access the code on <a href="https://github.com/michaeltoth/projects/tree/master/yield-curve-analysis">Github</a></p>
<div class="highlight"><pre><span></span><span class="kn">library</span><span class="p">(</span>YieldCurve<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>animation<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>lubridate<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>XML<span class="p">)</span>
<span class="kn">library</span><span class="p">(</span>mosaic<span class="p">)</span>
<span class="c1"># Getting yield curve data through 2012</span>
data<span class="p">(</span>FedYieldCurve<span class="p">)</span>
<span class="c1"># Pull 2013 and 2014 data separately from Google Docs (Source U.S. Treasury)</span>
end_curve <span class="o"><-</span> fetchGoogle<span class="p">(</span><span class="s">"https://docs.google.com/spreadsheets/d/1Yc3Og9g0Ko_SMh6l0EEZcqIQ85godDxgpnkbfK_N-Gk/export?format=csv&id"</span><span class="p">)</span>
<span class="c1"># Change formatting to xts and combine with FedYieldCurve data</span>
end_curve<span class="o">$</span>Date <span class="o"><-</span> <span class="kp">as.POSIXct</span><span class="p">(</span><span class="kp">as.character</span><span class="p">(</span>end_curve<span class="o">$</span>Date<span class="p">),</span> format<span class="o">=</span><span class="s">"%m/%d/%Y"</span><span class="p">)</span>
end_curve_xts <span class="o"><-</span> xts<span class="p">(</span>end_curve<span class="p">[,</span><span class="m">2</span><span class="o">:</span><span class="m">9</span><span class="p">],</span> order.by <span class="o">=</span> end_curve<span class="o">$</span>Date<span class="p">)</span>
final_curves <span class="o"><-</span> <span class="kp">rbind</span><span class="p">(</span>FedYieldCurve<span class="p">,</span> end_curve_xts<span class="p">)</span>
maturities <span class="o"><-</span> <span class="kt">c</span><span class="p">(</span><span class="m">3</span><span class="o">/</span><span class="m">12</span><span class="p">,</span><span class="m">6</span><span class="o">/</span><span class="m">12</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">7</span><span class="p">,</span><span class="m">10</span><span class="p">)</span>
numloops <span class="o"><-</span> <span class="kp">nrow</span><span class="p">(</span>final_curves<span class="p">)</span>
<span class="c1"># Set graphical parameters</span>
par<span class="p">(</span>bg<span class="o">=</span><span class="s">"#DCE6EC"</span><span class="p">,</span> mar<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">2</span><span class="p">),</span> xpd<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> mgp<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">2.8</span><span class="p">,</span><span class="m">0.3</span><span class="p">,</span><span class="m">0.5</span><span class="p">),</span> font.main<span class="o">=</span><span class="m">2</span><span class="p">,</span>
col.lab<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> col.main<span class="o">=</span><span class="s">"black"</span><span class="p">,</span> cex.axis<span class="o">=</span><span class="m">0.8</span><span class="p">,</span>
cex.lab<span class="o">=</span><span class="m">0.8</span><span class="p">,</span> cex.main<span class="o">=</span><span class="m">0.9</span><span class="p">,</span> family<span class="o">=</span><span class="s">"Helvetica"</span><span class="p">,</span> lend<span class="o">=</span><span class="m">1</span><span class="p">,</span>
tck<span class="o">=</span><span class="m">0</span><span class="p">,</span> las<span class="o">=</span><span class="m">1</span><span class="p">,</span> bty<span class="o">=</span><span class="s">"n"</span><span class="p">)</span>
opar <span class="o"><-</span> par<span class="p">()</span>
<span class="c1"># Note: must install ImageMagick program for saveGIF function to work</span>
saveGIF<span class="p">({</span>
<span class="c1"># Create one panel for each date</span>
<span class="kr">for</span> <span class="p">(</span>i <span class="kr">in</span> <span class="m">1</span><span class="o">:</span>numloops<span class="p">)</span> <span class="p">{</span>
par<span class="p">(</span>opar<span class="p">)</span>
plot<span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> type<span class="o">=</span><span class="s">"n"</span><span class="p">,</span> xlab<span class="o">=</span><span class="kp">expression</span><span class="p">(</span>italic<span class="p">(</span><span class="s">"Maturity"</span><span class="p">)),</span>
ylab<span class="o">=</span><span class="kp">expression</span><span class="p">(</span>italic<span class="p">(</span><span class="s">"Rates"</span><span class="p">)),</span> ylim<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">15</span><span class="p">),</span> xlim<span class="o">=</span><span class="kt">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">10</span><span class="p">),</span>
xaxt<span class="o">=</span><span class="s">"n"</span><span class="p">,</span> yaxt<span class="o">=</span><span class="s">"n"</span><span class="p">)</span>
title<span class="p">(</span>main<span class="o">=</span><span class="kp">paste</span><span class="p">(</span><span class="s">"Yield Curve: "</span><span class="p">,</span> year<span class="p">(</span>time<span class="p">(</span>final_curves<span class="p">[</span>i<span class="p">]))))</span>
grid<span class="p">(</span><span class="kc">NA</span><span class="p">,</span> <span class="kc">NULL</span><span class="p">,</span> col<span class="o">=</span><span class="s">"white"</span><span class="p">,</span> lty<span class="o">=</span><span class="s">"solid"</span><span class="p">,</span> lwd<span class="o">=</span><span class="m">1.5</span><span class="p">)</span>
axis<span class="p">(</span><span class="m">1</span><span class="p">,</span> tick<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
axis<span class="p">(</span><span class="m">2</span><span class="p">,</span> tick<span class="o">=</span><span class="kc">FALSE</span><span class="p">,</span> col.axis<span class="o">=</span><span class="s">"black"</span><span class="p">)</span>
<span class="c1"># If yield curve is inverted, plot in red, else dark blue</span>
<span class="kr">if</span> <span class="p">(</span>final_curves<span class="o">$</span>R_3M<span class="p">[</span>i<span class="p">]</span> <span class="o">></span> final_curves<span class="o">$</span>R_10Y<span class="p">[</span>i<span class="p">])</span> <span class="p">{</span>
lines<span class="p">(</span>maturities<span class="p">,</span> final_curves<span class="p">[</span>i<span class="p">,],</span> lwd<span class="o">=</span><span class="m">3</span><span class="p">,</span> col<span class="o">=</span><span class="s">"red"</span><span class="p">)</span>
<span class="p">}</span>
<span class="kr">else</span> <span class="p">{</span>
lines<span class="p">(</span>maturities<span class="p">,</span> final_curves<span class="p">[</span>i<span class="p">,],</span> lwd<span class="o">=</span><span class="m">3</span><span class="p">,</span> col<span class="o">=</span><span class="s">"#244A5D"</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">},</span>
interval<span class="o">=</span><span class="m">.1</span><span class="p">,</span>
movie.name<span class="o">=</span><span class="s">"yieldOutput.gif"</span><span class="p">,</span>
ani.width<span class="o">=</span><span class="m">400</span><span class="p">,</span>
ani.height<span class="o">=</span><span class="m">400</span><span class="p">)</span>
</pre></div>Using Javascript to Visualize a Percolation System2014-10-16T19:27:00-04:00Michael Tothtag:michaeltoth.me,2014-10-16:using-javascript-to-visualize-a-percolation-system.html<p>In this post I will discuss the background for my <a href="../pages/percolation.html" title="Michael Toth - Percolation Visualization">percolation visualization page</a> and the details of my implementation. I hope to provide a good introduction to percolation theory and the union find algorithm in particular. This is the first non-trivial Javascript application I've created, and later in the post I will discuss some of the biggest challenges I faced and things I learned along the way. </p>
<h2>Background</h2>
<p>The inspiration and idea for this project came directly from the similar programming assignment in Robert Sedgewick's and Kevin Wayne's <a href="https://www.coursera.org/course/algs4partI" title="Coursera - Algorithms">Algorithms class</a> on Coursera. In the class I implemented a percolation system in Java, and afterward I thought it would be an interesting challenge to port that to Javascript and create a visualization on my website. For the web version I use Javascript for all of the calculations, and I use the HTML canvas element to draw the visualization to the screen. </p>
<p><br></p>
<h4>Connectivity and the Union Find Algorithm</h4>
<p>A connectivity problem seeks to determine, given a graph of sites such as the one below, whether two sites are connected via any path. </p>
<p><br>
<img alt="Connected Sites" src="https://michaeltoth.me/images/connected.jpg" />
<br></p>
<p>For a small set of sites like this one, a brute force approach would solve the problem effectively. If we wanted to see whether site 2 was connected to site 8, we could recursively check each site's neighbors to ultimately determine that they are connected. As the number of sites grows large, however, this method does not scale, and we need a better solution. Instead of thinking of the grid of sites as a graph, we can convey the same information as a grouping of components, as seen in the image below. Now, determining whether any two sites are connected is as simple as checking whether they are both members of the same component. Connecting two sites under this new representation involves merging their components rather than drawing paths. </p>
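<p>As a concrete sketch, the brute-force neighbor search described above might look like the following. The adjacency list here is a small made-up example graph, not the one pictured:</p>

```javascript
// Brute-force connectivity check: recursively explore neighbors
// until we either reach the target site or run out of unvisited paths.
function isConnected(neighbors, start, target) {
  var visited = {};
  function explore(site) {
    if (site === target) { return true; }
    visited[site] = true;
    var adjacent = neighbors[site] || [];
    for (var i = 0; i < adjacent.length; i++) {
      if (!visited[adjacent[i]] && explore(adjacent[i])) { return true; }
    }
    return false;
  }
  return explore(start);
}

// Hypothetical graph: sites 2-3-8 form one component, 5-6 another
var neighbors = { 2: [3], 3: [2, 8], 8: [3], 5: [6], 6: [5] };
console.log(isConnected(neighbors, 2, 8)); // true
console.log(isConnected(neighbors, 2, 5)); // false
```

<p>This works, but each query may re-explore the whole component, which is exactly the scaling problem the component representation avoids.</p>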
<p><br>
<img alt="Connected Components" src="https://michaeltoth.me/images/connected-components.png" />
<br></p>
<p>The Union Find data structure, sometimes called a disjoint set data structure or
a merge-find set, allows for high performance operations on a component grouping
as described above. The Union Find data structure keeps track of a set of
elements partitioned into a number of disjoint subsets (components). The data
structure supports two main operations: </p>
<p><em>Find</em>: Return the id of the component to which the given site is a member<br />
<em>Union</em>: Connect two sites by combining their two components into a single
component with the same id </p>
<p>A new Union Find data structure of size N is initialized with N distinct
components. In a numerical representation, the id of each component when
initialized is simply the value of the site. Calls to the Union operation
create a tree of components such that when two sets are combined, the
members of one set will point to those of the other set. The id of the
merged component is the id of the root node of this tree. The find
operation returns the id of a site by traversing to the top of the tree
to find the root member component id.</p>
<p>In this implementation, the find operation takes time proportional to the
depth of the tree. A naive implementation of the union operation could allow
trees to become very deep, which would slow the performance of this
algorithm. Instead, if we modify the union function such that we always append
the smaller component to the larger component, we bound the depth of any tree
by lg(N), so a single find traversal takes O(lg(N)) time and a sequence of n
operations takes O(nlg(n)) time in the worst case.<br />
<br>
<em>Abbreviated Proof</em>:<br />
- Note: lg = log base 2<br />
- For a given node x, its depth in the tree increases by 1 only when its tree T1
is merged into another tree T2.<br />
- When this happens, the size of x's tree at least doubles, because our
union operation requires size(T2) &gt;= size(T1) for T1 to point to T2.<br />
- Starting from size 1, the size of x's tree can double at most lg(N) times
before reaching N [ <em>2^lg(N) = N</em> ], where N is the total number of nodes.<br />
- <em>Therefore</em>: the depth of any node is at most lg(N), and each find
traversal takes O(lg(N)) in the worst case </p>
<p>Below I have included my complete Javascript code for the Union Find data
structure. This code is essentially the same as the Java code presented in
the Coursera class mentioned above. We include two operations in addition to
the find and union operations: a connected operation that returns true if
two sites are connected and a count variable that returns the number of
unique components. </p>
<div class="highlight"><pre><span></span><span class="kd">function</span> <span class="nx">WeightedQuickUnionUF</span><span class="p">(</span><span class="nx">N</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Constructor</span>
<span class="kd">var</span> <span class="nx">id</span> <span class="o">=</span> <span class="p">[];</span>
<span class="kd">var</span> <span class="nx">sz</span> <span class="o">=</span> <span class="p">[];</span>
<span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o"><</span> <span class="nx">N</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="nx">i</span><span class="p">;</span> <span class="c1">// id[i] = parent of i</span>
<span class="nx">sz</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// sz[i] = number of objects in subtree with root i</span>
<span class="p">}</span>
<span class="c1">// Returns the number of components, which initializes at N</span>
<span class="k">this</span><span class="p">.</span><span class="nx">count</span> <span class="o">=</span> <span class="nx">N</span><span class="p">;</span>
<span class="c1">// Returns the component id for the containing site</span>
<span class="k">this</span><span class="p">.</span><span class="nx">find</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="nx">p</span> <span class="o">!=</span> <span class="nx">id</span><span class="p">[</span><span class="nx">p</span><span class="p">])</span> <span class="p">{</span>
<span class="nx">p</span> <span class="o">=</span> <span class="nx">id</span><span class="p">[</span><span class="nx">p</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">return</span> <span class="nx">p</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Returns true if two elements are part of the same component</span>
<span class="k">this</span><span class="p">.</span><span class="nx">connected</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">,</span> <span class="nx">q</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">p</span><span class="p">)</span> <span class="o">===</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">q</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Connects the components of two elements</span>
<span class="k">this</span><span class="p">.</span><span class="nx">union</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">p</span><span class="p">,</span> <span class="nx">q</span><span class="p">)</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">rootP</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">p</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">rootQ</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">find</span><span class="p">(</span><span class="nx">q</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">rootP</span> <span class="o">===</span> <span class="nx">rootQ</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span><span class="p">;</span> <span class="p">}</span>
<span class="c1">// make smaller tree point to larger one</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o"><</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">])</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o">=</span> <span class="nx">rootQ</span><span class="p">;</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">]</span> <span class="o">+=</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">];</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nx">id</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">]</span> <span class="o">=</span> <span class="nx">rootP</span><span class="p">;</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootP</span><span class="p">]</span> <span class="o">+=</span> <span class="nx">sz</span><span class="p">[</span><span class="nx">rootQ</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">this</span><span class="p">.</span><span class="nx">count</span><span class="o">--</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</pre></div>
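<p>The snippet above shows only the tail end of the union method. For context, a minimal stand-alone version of the weighted quick-union structure might look like the sketch below (illustrative only; the names are hypothetical and the full source is linked at the end of this post):</p>

```javascript
// Minimal weighted quick-union sketch (illustrative; not the project's exact code).
function UnionFind(n) {
  this.count = n;                      // number of disjoint components
  this.id = [];                        // id[i] = parent of site i
  this.sz = [];                        // sz[i] = size of the tree rooted at i
  for (var i = 0; i < n; i++) {
    this.id[i] = i;
    this.sz[i] = 1;
  }
}

// Follow parent links until reaching a root (a site that is its own parent)
UnionFind.prototype.find = function (p) {
  while (p !== this.id[p]) {
    p = this.id[p];
  }
  return p;
};

UnionFind.prototype.connected = function (p, q) {
  return this.find(p) === this.find(q);
};

UnionFind.prototype.union = function (p, q) {
  var rootP = this.find(p);
  var rootQ = this.find(q);
  if (rootP === rootQ) { return; }
  // make smaller tree point to larger one to keep trees shallow
  if (this.sz[rootP] < this.sz[rootQ]) {
    this.id[rootP] = rootQ; this.sz[rootQ] += this.sz[rootP];
  } else {
    this.id[rootQ] = rootP; this.sz[rootP] += this.sz[rootQ];
  }
  this.count--;
};
```

<p>Weighting by tree size keeps every tree shallow, so find (and therefore union and connected) runs in near-logarithmic time.</p>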
<p><br> </p>
<h4>Percolation</h4>
<p>The percolation problem assumes a grid of sites, each of which can be either
open or closed. If we imagine water flowing across the top of the grid, an
open site becomes full of water when it connects to the top of the grid
through an unbroken path of open sites. The system percolates when an open
site on the bottom row of the grid connects to the top row through such a
path, so that water flows freely through the system, as seen in the image
below.</p>
<p><br>
<img alt="Percolation" src="https://michaeltoth.me/images/percolation.png" />
<br></p>
<p><em>Modeling percolation with the Union-Find algorithm</em><br />
The union-find structure is efficient for determining whether any two
sites are connected, but on its own it would require N^2 connected queries to
determine whether any of the N sites in the top row is connected to any of the
N sites in the bottom row of the percolation grid. Similarly, it would require
time proportional to N to determine whether a given site is full (i.e. whether
it is connected to any of the N top-row sites). </p>
<p>To address these issues, we create two "virtual sites" on the top and bottom
of the grid. We automatically connect these virtual sites to sites on the
top and bottom rows as they are opened. To determine whether a site is full,
we check whether it is connected to the top virtual site. To determine
whether the system percolates we check whether the top and bottom virtual
sites are connected. </p>
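<p>As a sketch, the virtual-site wiring might look like the following. This is illustrative only: index 0 is the virtual top site, n*n + 1 the virtual bottom site, and a tiny quick-find structure stands in for the weighted union-find purely to keep the example self-contained.</p>

```javascript
// Tiny quick-find union-find, just so this sketch runs on its own.
function QuickFind(n) {
  this.id = [];
  for (var i = 0; i < n; i++) { this.id[i] = i; }
}
QuickFind.prototype.connected = function (p, q) { return this.id[p] === this.id[q]; };
QuickFind.prototype.union = function (p, q) {
  var pid = this.id[p], qid = this.id[q];
  for (var i = 0; i < this.id.length; i++) {
    if (this.id[i] === pid) { this.id[i] = qid; }
  }
};

// Sketch of percolation with virtual top and bottom sites (names hypothetical).
function Percolation(n) {
  this.n = n;
  this.top = 0;                        // virtual top site
  this.bottom = n * n + 1;             // virtual bottom site
  this.uf = new QuickFind(n * n + 2);  // n*n grid sites plus two virtual sites
  this.isOpen = [];
  for (var i = 0; i < n * n + 2; i++) { this.isOpen[i] = false; }
}

// Map (row, col), both 1-based, to a site index in 1..n*n
Percolation.prototype.site = function (row, col) {
  return (row - 1) * this.n + col;
};

Percolation.prototype.open = function (row, col) {
  var s = this.site(row, col);
  this.isOpen[s] = true;
  if (row === 1) { this.uf.union(s, this.top); }           // top row -> virtual top
  if (row === this.n) { this.uf.union(s, this.bottom); }   // bottom row -> virtual bottom
  // connect to any open orthogonal neighbors
  var self = this;
  [[row - 1, col], [row + 1, col], [row, col - 1], [row, col + 1]].forEach(function (rc) {
    var r = rc[0], c = rc[1];
    if (r >= 1 && r <= self.n && c >= 1 && c <= self.n && self.isOpen[self.site(r, c)]) {
      self.uf.union(s, self.site(r, c));
    }
  });
};

// A site is full if it connects to the virtual top site
Percolation.prototype.isFull = function (row, col) {
  return this.isOpen[this.site(row, col)] && this.uf.connected(this.site(row, col), this.top);
};

// The system percolates when the two virtual sites are connected
Percolation.prototype.percolates = function () {
  return this.uf.connected(this.top, this.bottom);
};
```

<p>With the virtual sites in place, both isFull and percolates become single connected queries instead of loops over a whole row.</p>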
<p><em>Interesting aside</em><br />
For a large square grid of sites, there exists a percolation threshold
probability p such that if the fraction of open sites is less than p the
system will almost certainly not percolate, and if it is greater than p the
system will almost certainly percolate. No exact expression for the percolation
threshold of a square grid is known, but through simulation the value has been
estimated at approximately 0.592746. That is, for a large square grid, the
system should first percolate once roughly 59.2746% of the sites have been opened.
<br></p>
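<p>That estimate can be reproduced, at least roughly, with a Monte Carlo experiment: open sites in random order, record the fraction open when the system first percolates, and average over many trials. The sketch below uses a simple flood fill rather than union-find, purely to stay self-contained; on small grids, finite-size effects shift the estimate slightly.</p>

```javascript
// Monte Carlo estimate of the percolation threshold on an n-by-n grid.
// Illustrative sketch: flood fill stands in for union-find.

// Does water reach the bottom row? Flood fill from every open top-row site.
function percolates(open, n) {
  var seen = [];
  var queue = [];
  for (var c = 0; c < n; c++) {
    if (open[c]) { seen[c] = true; queue.push(c); }
  }
  while (queue.length > 0) {
    var s = queue.pop();
    if (s >= n * (n - 1)) { return true; }       // reached the bottom row
    var r = Math.floor(s / n), col = s % n;
    var neighbors = [];
    if (r > 0) { neighbors.push(s - n); }
    if (r < n - 1) { neighbors.push(s + n); }
    if (col > 0) { neighbors.push(s - 1); }
    if (col < n - 1) { neighbors.push(s + 1); }
    neighbors.forEach(function (t) {
      if (open[t] && !seen[t]) { seen[t] = true; queue.push(t); }
    });
  }
  return false;
}

// Open sites in a random order until the system percolates;
// return the fraction of sites open at that moment.
function runTrial(n) {
  var order = [];
  for (var i = 0; i < n * n; i++) { order[i] = i; }
  for (var i = order.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = order[i]; order[i] = order[j]; order[j] = tmp;
  }
  var open = [];
  var opened = 0;
  while (!percolates(open, n)) {
    open[order[opened]] = true;
    opened++;
  }
  return opened / (n * n);
}

function estimateThreshold(n, trials) {
  var total = 0;
  for (var t = 0; t < trials; t++) { total += runTrial(n); }
  return total / trials;
}
```

<p>Calling estimateThreshold(20, 100) should print a value in the neighborhood of 0.59, drifting closer to 0.592746 as the grid grows.</p>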
<h2>Challenges</h2>
<h4>Creating the Visualization</h4>
<p>Initially I thought to create the grid of sites using a grid of divs that
I could color according to their status, similar to my
previous <a href="../pages/mondrian.html" title="Michael Toth - Piet Mondrian Painting">Mondrian Painting Project</a>. However, I wanted to support changing the
size of the grid, and managing a large number of divs seemed
unnecessarily cumbersome. I did some searching on <a href="http://codepen.io" title="Codepen">Codepen.io</a> for ideas and found some examples using the HTML5 canvas element,
which was exactly what I needed. In particular, I liked that the canvas renders to an
image that can be saved. I learned a lot about the HTML canvas and how it
works, and this was a fun part of the project. </p>
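<p>A minimal version of the canvas drawing might look like the sketch below (illustrative; the grid representation and colors here are invented). Each site becomes one fillRect call, colored by its status:</p>

```javascript
// Sketch: draw an n-by-n percolation grid to a canvas 2d context.
// grid[row][col] is 'closed', 'open', or 'full' (hypothetical representation).
function drawGrid(ctx, grid, cellSize) {
  var colors = { closed: '#000000', open: '#ffffff', full: '#6699ff' };
  for (var row = 0; row < grid.length; row++) {
    for (var col = 0; col < grid[row].length; col++) {
      ctx.fillStyle = colors[grid[row][col]];
      ctx.fillRect(col * cellSize, row * cellSize, cellSize, cellSize);
    }
  }
}
```

<p>In the browser this would be driven by something like drawGrid(canvas.getContext('2d'), grid, 10); because the canvas is a bitmap, the finished grid can be saved as an image.</p>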
<p><br></p>
<h4>Converting Implementation from Java</h4>
<p>I initially had trouble porting the Java code to JavaScript. In Java,
the well-defined class relationships were clear to me, but at first I did not
understand how to implement similar structures in JavaScript. After doing some
research, I found that JavaScript offers many ways to accomplish the same
thing, and I ultimately used functions to implement this, as they were
the most syntactically similar approach and the easiest for me to understand. I
am still working to understand many aspects of JavaScript, but this project
helped me greatly in learning how to modularize my code. </p>
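<p>For readers making the same transition, the pattern I mean is roughly this: a constructor function plays the role of a Java class, with per-instance state kept on this (or hidden in a closure) and behavior attached as methods. The names below are invented purely for illustration:</p>

```javascript
// A Java-style "class" expressed as a JavaScript constructor function.
function Counter(name) {
  var ticks = 0;                 // private state, hidden in the closure

  this.name = name;              // public field
  this.increment = function () { ticks++; };
  this.tally = function () { return ticks; };
}

var heads = new Counter('heads');
heads.increment();
heads.increment();
```

<p>Here heads.tally() returns 2, while ticks is unreachable from outside the constructor, which mirrors a private field in Java.</p>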
<p><br> </p>
<h4>Iteratively Opening Sites and Drawing</h4>
<p>I initially implemented the process of opening sites and drawing to the canvas
using a while loop, but this was not ideal. I could run the entire
percolation simulation and then draw to the canvas once, but if I tried drawing to
the canvas after each site was opened, the browser would freeze. I wanted to
show each site being opened in succession, so I needed a way
to delay the opening of new sites until the screen could be redrawn. I
accomplished this with JavaScript's setInterval function. I
ultimately kept both implementations, allowing the user to choose whether
to output the results instantly (using the while loop) or to output the results
iteratively at two different speeds (using setInterval). The timing is controlled
by the code shown below: </p>
<div class="highlight"><pre><span></span><span class="kd">function</span> <span class="nx">outputInstantly</span><span class="p">()</span> <span class="p">{</span>
<span class="c1">// While loop runs until the system percolates</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="nx">perc</span><span class="p">.</span><span class="nx">percolates</span><span class="p">())</span> <span class="p">{</span> <span class="c1">// Calls the percolates method of the perc object</span>
<span class="nx">openRandom</span><span class="p">();</span>
<span class="nx">count</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Once the system percolates, draw to the screen a single time</span>
<span class="nx">drawPerc</span><span class="p">.</span><span class="nx">drawGrid</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// The user controls delay variable by selecting radio buttons on the page</span>
<span class="k">if</span> <span class="p">(</span><span class="nx">delay</span> <span class="o">===</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">outputInstantly</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// Use the setInterval function to repeatedly open sites and draw to screen (checkPerc function)</span>
<span class="nx">interval</span> <span class="o">=</span> <span class="nx">setInterval</span><span class="p">(</span><span class="nx">checkPerc</span><span class="p">,</span> <span class="nx">delay</span><span class="p">);</span>
<span class="nx">interval</span><span class="p">();</span>
<span class="p">}</span>
</pre></div>
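<p>The checkPerc callback itself is not shown above. Roughly, on each tick it opens one more site, redraws, and clears the interval once the system percolates. Below is a hypothetical sketch, with stand-in perc and drawPerc objects replacing the real ones defined elsewhere in the project:</p>

```javascript
// Hypothetical sketch of the checkPerc callback (not the project's exact code).
// Stand-in objects replace the project's perc and drawPerc.
var perc = {
  openCount: 0,
  percolates: function () { return this.openCount >= 5; }  // toy condition
};
var drawPerc = {
  draws: 0,
  drawGrid: function () { this.draws++; }  // pretend to draw to the canvas
};
function openRandom() { perc.openCount++; }

var interval = null;  // id returned by setInterval

function checkPerc() {
  if (perc.percolates()) {
    clearInterval(interval);  // stop the timer once the system percolates
    return;
  }
  openRandom();               // open one more site...
  drawPerc.drawGrid();        // ...and redraw so the user sees it appear
}
```

<p>Because setInterval keeps firing until its id is cleared, checkPerc is responsible for stopping the timer itself when the simulation finishes.</p>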
<p><br> </p>
<h4>Bug when running multiple calculations simultaneously</h4>
<p>My initial implementation suffered from a bug: if I reran the simulation,
either by refreshing the page or by clicking the button to run again,
the first percolation run would continue in the background. This caused
issues with the display and the text output to the screen. I knew the
solution would be to clear the interval on rerun, but I did not know how to
access the id returned by setInterval, an instance variable of the previous
simulatePercolation instance, when creating a new instance.
After some experimentation, I found that if I declared the interval variable
in the head of my HTML, rather than in a separate JavaScript file, I could
assign it to setInterval when running simulatePercolation. This eliminated
the possibility of duplicate intervals running simultaneously and corrected the
issues I had been facing.</p>
<p>See the full code for my percolation visualization on my GitHub:
<a href="https://raw.githubusercontent.com/michaeltoth/michaeltoth/master/content/pages/percolation.md" title="Percolation">Percolation</a><br />
<br>
<a href="../pages/percolation.html" title="Michael Toth - Percolation Visualization">Run my percolation visualization</a></p>