Looking At A Baby Names Dataset In R

dkmathstats (61)in #programming • 8 years ago

Hi. This post features a babynames dataset in R. We will look at the most popular baby names according to this dataset with the use of the R programming language and its associated R packages.

In the past, I have used the data.table package in R with this babynames dataset. The thing is thatI have forgotten some things from that package. This post focused more on the dplyr package for data wrangling and manipulation. (dplyr is comparable to pandas in Python).

This post is a shorter version of my website version post.

Image Source

Sections

Getting Started & The babynames Dataset
The Top 20 Baby Names
The Top 20 Female Baby Names
The Top 20 Male Baby Names (Graph Only)
Popular Baby Names By Letter (Graph Only)
Notes & References

Getting Started And The babynames Dataset

The R packages used here are dplyr, babynames, ggplot2 and stringr.

To install a package into R, use the code install.packages("package_name").

Loading in an installed R package requires the use of library(package_name) or require(package_name).

# Analyzing Baby Names Dataset In R With dplyr
# With Sideways Bar Graphs
# Ref: https://stackoverflow.com/questions/27141565/how-to-sum-up-the-duplicated-value-while-keep-the-other-columns


library(babynames) # Baby Names dataset:
library(ggplot2) # For Data visualization & Graphs
library(dplyr) # For data wrangling and manipulation
library(stringr) # For strings and regex

The (screenshot) image below contains information about the babynames dataset. It says that the data is provided by the SSA. I think SSA refers to the The United States Society Security Administration. Do remember that the baby names from this data do not represent global baby names. (We still deal with sample data as large as this dataset is.)

This babynames dataset into stored into a variable called baby_data. To quickly examine the data, I use the head(), tail() and str() functions to take a look of the data.


# Save the babynames data into baby_data:

baby_data <- data.frame(babynames)

# Preview the data:

head(baby_data); tail(baby_data)

This large babynames dataset is from the years 1880 to 2015. There are 1,858,689 rows with 5 variables in this dataset!

The colnames() function is used to fix the column names.

Some Naming Considerations

There are cases where certain names can be either Male or Female. Here area few examples:

Ashley is typically a female but there are male Ashleys out there.
Jordan is usually a guy's name but it can be female. (I have met a female one.)

I am not sure if names such as Ashley & Ashleigh are recorded together as one or are recorded separately. Other examples include:

Dave & David
Sara & Sarah
Mary & Marie
Greg & Gregory,
William & Will & Bill, John
Johnny, John & Jon
Caitlin, Kaitlin, Katelynn & Caitlyn
Tom and Thomas
Ana versus Anna

My guess is that there are not many of these cases in the data.

The Top 20 Baby Names

R's dplyr function in R is really useful when it comes to data wrangling and data manipulation. The data needs to be in a format that would be ready for data visualization purposes.

The variable sorted_names is created which groups the names (from different years) and Sex together. Total counts for the names are provided as well. The arrange(desc(Total)) part rearranges the rows in the data from the highest to lowest for the counts.

> sorted_names <- baby_data %>% group_by(Name, Sex) %>% summarise(Total = sum(Count)) %>%
+                       arrange(desc(Total))
> sorted_names <- data.frame(sorted_names)
> 
> head(sorted_names, n = 20)
          Name Sex   Total
1        James   M 5120990
2         John   M 5095674
3       Robert   M 4803068
4      Michael   M 4323928
5         Mary   F 4118058
6      William   M 4071645
7        David   M 3589754
8       Joseph   M 2581785
9      Richard   M 2558165
10     Charles   M 2371621
11      Thomas   M 2290364
12 Christopher   M 2004667
13      Daniel   M 1882435
14   Elizabeth   F 1610948
15    Patricia   F 1570954
16     Matthew   M 1566027
17    Jennifer   F 1464067
18      George   M 1457646
19       Linda   F 1451331
20     Barbara   F 1433339

Since I want the top twenty baby names, I subset the data and select the top twenty rows.

> top_twenty_baby <- sorted_names[1:20, ]
> 
> top_twenty_baby
          Name Sex   Total
1        James   M 5120990
2         John   M 5095674
3       Robert   M 4803068
4      Michael   M 4323928
5         Mary   F 4118058
6      William   M 4071645
7        David   M 3589754
8       Joseph   M 2581785
9      Richard   M 2558165
10     Charles   M 2371621
11      Thomas   M 2290364
12 Christopher   M 2004667
13      Daniel   M 1882435
14   Elizabeth   F 1610948
15    Patricia   F 1570954
16     Matthew   M 1566027
17    Jennifer   F 1464067
18      George   M 1457646
19       Linda   F 1451331
20     Barbara   F 1433339

This next section of code is for ensuring that the highest to lowest order is preserved in the upcoming bar graph.

> top_twenty_baby$Name <- factor(top_twenty_baby$Name, 
+                        levels = top_twenty_baby$Name[order(top_twenty_baby$Total)]) 
> 
> top_twenty_baby$Sex <- as.factor(top_twenty_baby$Sex)

Now that the data is nicely formatted for plotting, a plot can be generated with the use of the ggplot2 data visualization package in R.

Some Notes

The counts are in millions and are displayed with the geom_text() add on function.
You need geom_bar() for generating bar graphs.
coord_flip() converts the bar graph from vertical bars to horizontal bars.
Setting fill = Sex gives the different colours for the graphs depending on gender. This is good for the viewer.
In the theme() add on function, plot.title = element_text(hjust = 0.5) centers the plot title.
There are more male names than female which are in the top 20 most popular baby names from 1880 to 2015 (in the USA).
Further investigation can include why some on these names are so popular over time (historically).

These names are from 1880 to 2015 but these names do not necessarily match the popular baby names from the year of 2016. The image is from : https://www.ssa.gov/oact/babynames/

The Top 20 Female Baby Names

Finding the female baby names is not that difficult. We add in a filter() part into the dplyr code. The code is similar as the code in previous section.


> ## 2) Finding The Top 20 Female Baby Names:
> 
> female_names <- baby_data %>% filter(Sex == "F") %>% group_by(Name) %>% 
+                  summarise(Total = sum(Count)) %>% arrange(desc(Total))
> 
> female_names <- data.frame(female_names)
> 
> head(female_names, n = 20)
        Name   Total
1       Mary 4118058
2  Elizabeth 1610948
3   Patricia 1570954
4   Jennifer 1464067
5      Linda 1451331
6    Barbara 1433339
7   Margaret 1242141
8      Susan 1120810
9    Dorothy 1106106
10     Sarah 1065265
11   Jessica 1042201
12     Helen 1016686
13     Nancy 1001342
14     Betty  999066
15     Karen  984468
16      Lisa  964330
17      Anna  878988
18    Sandra  872827
19    Ashley  838186
20  Kimberly  830967
> 
> top_twenty_female <- female_names[1:20, ]
> 
> top_twenty_female$Name <- factor(top_twenty_female$Name, 
+                                levels = top_twenty_female$Name[order(top_twenty_female$Total)])

From the data, it appears that Mary is the most popular baby name. Elizabeth takes second place and the name Patricia takes third place. As mentioned before, it is uncertain if Mary is grouped along with Marie, Maria, etc.

The Top 20 Male Baby Names (Graph Only)

Here are the top twenty male baby names. I have used blue bars here.

Popular Baby Names By Letter (Graph Only)

The code for the popular baby names by letter is kind of long and technical. It is best to look at my code from my website.

You may or may not find the popular baby names by letter useful to you. I did this anyways as an exercise and to practice with regular expression in R.

Due to popular baby names such as John, James, Joseph, Jennifer and Jessica from the dataset, the letter J takes first place. The second place first letter goes to M with names such as Mary, Margaret and Matthew.

Notes & References

I had fun working with this dataset and I hope you enjoyed the findings from this data too. You could use these popular baby names to help you name your baby.

I am not sure if this stuff is good for predicting baby names. There is one school of thought where the past does not impact the future and the other side believes in the past affecting the future (correlation).

References