Prior Work
Because my dataset has changed since the last draft, so has my research. To find prior work on my topic I simply googled “predicting hit songs”. To my surprise there were quite a few results, and most, if not all, were computer science related. The approaches ranged from simple variable analysis, similar to what I am doing, to professional startups using machine learning to predict whether or not a song would be a hit “to an astonishing accuracy”. It was also surprising that many of the studies used a similar dataset from Spotify, with the more extensive ones using more than 100 songs. While most of the information I found was either too shallow or too deep for my purposes, I understand that predictive technology like this is a frontier for businesses, mathematicians, and computer scientists.
https://www.digitalmusicnews.com/2018/01/31/hyperlive-hit-song-prediction-algorithm/
Title: Startup Says It Can Predict Whether a Song Will Become a Hit — With 84% Accuracy
Author: Daniel Sanchez
Year: Jan 31, 2018
Description: A company known as Hyperlive has made an algorithm based on a range of neurobiobehavioral responses to music as well as underpinning psychological processes in order to predict the success of a song.
https://towardsdatascience.com/song-popularity-predictor-1ef69735e380
Title: Song Popularity Predictor
Author: Mohamed Nasreldin
Year: May 4, 2018
Description: Using machine learning, a group of students tries to gauge whether the success of a song can be predicted from an analysis of past hit songs. The students concluded that they could predict whether or not a song would be a hit at a rate higher than expected at random, but they also note that the model is not perfect and that it is much easier to predict that a song will not be a hit. They list possible limitations due to the dataset, the cleaning of the data, and the lack of more variables.
https://pdfs.semanticscholar.org/e6cc/edb50d2c2b01bca108cb090943e86fb58135.pdf
Title: Predicting Hit Songs with Machine Learning
Author: Minna Reiman, Philippa Örnell
Year: May 30, 2018
Description: Two students test whether machine learning on the audio features of hit songs is good enough to predict the success of a new song from the same audio features. The students concluded that “results do not indicate that it is possible to predict hit songs on our particular dataset.” They also detail the limitations of their dataset, most of which stem from a lack of metadata, time, or the skill needed for further optimization.
Data
Quoted Directly from Site
“Original Data Source:
The audio features for each song were extracted using the Spotify Web API and the spotipy Python library. Credit goes to Spotify for calculating the audio feature values.
Data Description: There is one .csv file in the dataset. (top2018.csv) This file includes:
Spotify URI for the song
Name of the song
Artist(s) of the song
Audio features for the song (such as danceability, tempo, key etc.)
A more detailed explanation of the audio features can be found in the Metadata tab.”
Possible limitations of the dataset are the quality of the subjective variables, the small sample size, and the fact that these songs are the “top tracks of Spotify” and may not be representative of other platforms and media such as YouTube, CDs, and Apple Music.
URL : https://www.kaggle.com/nadintamer/top-spotify-tracks-of-2018
Results
library(ggplot2)
library(dplyr)
library(gridExtra)
library(RColorBrewer)
library(corrplot)
library(ggthemes)
library(fmsb)
library(tidyverse)
library(corrr)
library(devtools)
library(ggbiplot)
library(factoextra)
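base <- read.csv("top2018.csv")
base <- base[,-1]  # drop the first column (the track identifier); all later column numbers refer to this trimmed table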
Differentiating Groups
Rap_Songs <- base[c(1:7, 12:13, 16, 19:20, 22, 29, 31, 33, 39, 41, 43, 49, 50, 51, 54, 56, 59, 62:63, 73:77, 79, 80, 82:84, 88, 92, 95, 98),]
Edm_Songs <- base[c(8,21,32, 38, 40, 52, 57, 58, 65, 91, 97 ),]
Pop_Songs <- base[c(9,10,11,14, 15, 17, 23,24, 25, 26,27, 28, 30,34,36,37, 44, 46, 47, 48, 53, 55, 61,64, 66, 68, 69, 71, 72,78, 81,85, 87, 93, 94,96, 99,100 ),]
Latin_Songs <- base[c(18,35,42,45, 60, 67,70, 86,89,90 ),]
Rap_Songs$type <- "Rap"
Edm_Songs$type <- "Edm"
Pop_Songs$type <- "Pop"
Latin_Songs$type <- "Latin"
base_new <- rbind(Rap_Songs, Edm_Songs, Pop_Songs, Latin_Songs)
table(base_new$type)
  Edm Latin   Pop   Rap
   11    10    38    41
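The same grouping can also be written as a single label vector, which keeps the songs in chart order and makes it possible to check that every row is assigned to exactly one genre. This is only a sketch: the row indices are copied from the subsetting calls above, and `genre_rows` and `label_genres` are hypothetical names.

```r
# Row indices per genre, copied from the subsetting calls above.
genre_rows <- list(
  Rap   = c(1:7, 12:13, 16, 19:20, 22, 29, 31, 33, 39, 41, 43, 49, 50, 51,
            54, 56, 59, 62:63, 73:77, 79, 80, 82:84, 88, 92, 95, 98),
  Edm   = c(8, 21, 32, 38, 40, 52, 57, 58, 65, 91, 97),
  Pop   = c(9, 10, 11, 14, 15, 17, 23, 24, 25, 26, 27, 28, 30, 34, 36, 37,
            44, 46, 47, 48, 53, 55, 61, 64, 66, 68, 69, 71, 72, 78, 81, 85,
            87, 93, 94, 96, 99, 100),
  Latin = c(18, 35, 42, 45, 60, 67, 70, 86, 89, 90)
)

# Build a character vector of length n holding one genre label per row.
label_genres <- function(rows, n) {
  all_rows <- unlist(rows, use.names = FALSE)
  # Every row from 1 to n must appear exactly once across the genres.
  stopifnot(anyDuplicated(all_rows) == 0, setequal(all_rows, seq_len(n)))
  labels <- character(n)
  for (genre in names(rows)) labels[rows[[genre]]] <- genre
  labels
}

type <- label_genres(genre_rows, n = 100)
table(type)
```

With the real data the result could be attached directly, for example `base$type <- label_genres(genre_rows, nrow(base))`.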
Correlation Matrix
corrplot(cor(base[c(3,4,6,8,9,10,11,12,13,14)]), method = "color", type = "upper", col = brewer.pal(n = 10, name = "RdBu"),
         tl.col = "black", tl.srt = 90, addCoef.col = "gray8", diag = TRUE, number.cex = 0.65)
The matrix reveals a strong positive correlation between the loudness and energy of a hit song, and a strong negative correlation between energy and acousticness. These findings make sense: energy is directly related to loudness, and acoustic-heavy songs tend to have less energy because acoustic instruments are generally quieter than electric ones (e.g. acoustic guitar vs. electric guitar). The other correlations are fairly weak, with most of them well below .4.
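The strongest pairs can also be listed directly instead of being read off the plot. The sketch below ranks variable pairs by absolute Pearson correlation. `strongest_pairs` is a hypothetical helper, and the toy data frame holds made-up values for illustration only.

```r
# Return the n variable pairs with the largest absolute Pearson correlation.
strongest_pairs <- function(df, n = 3) {
  cm <- cor(df)
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # each pair once, diagonal excluded
  pairs <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  head(pairs[order(-abs(pairs$r)), ], n)
}

# Toy values for illustration only (not taken from top2018.csv).
toy <- data.frame(
  energy       = c(0.3, 0.5, 0.6, 0.8, 0.9),
  loudness     = c(-9.0, -7.5, -6.0, -4.5, -3.0),
  acousticness = c(0.8, 0.5, 0.4, 0.2, 0.1),
  tempo        = c(120, 95, 140, 100, 130)
)
top_pairs <- strongest_pairs(toy, n = 2)
top_pairs
```

On the song data the call would be `strongest_pairs(base[c(3,4,6,8,9,10,11,12,13,14)])`, using the same columns as the correlation matrix.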
Correlation Network
res.cor <- as_tibble(base)  # as.tibble() is deprecated in favour of as_tibble()
res.cor2 <- res.cor %>% select(3:15)
res.cor2 %>% fashion()
res.cor2 %>% correlate() %>%
network_plot(min_cor = 0.1, color = c("firebrick", "darkturquoise"))
Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'
The correlation network highlights the same structure as the correlation matrix: the two relationships mentioned in the previous section are the most saturated (most strongly correlated) connections.
Principal Component Analysis
songs.pca <- prcomp(base_new[,c(3:14)], center = TRUE, scale. = TRUE)
fviz_eig(songs.pca, addlabels = TRUE)
fviz_pca_ind(songs.pca, col.ind = base_new$type)
ggbiplot(songs.pca, groups = base_new$type, ellipse = TRUE) + theme_solarized() +
  scale_color_manual(name = "Genre", values = c("black", "darkseagreen", "sienna2", "blue"))
The variance in this PCA is spread across many more components than in other datasets I have seen, with the first and second components together accounting for only about 35% of the variance. The EDM and Latin genres form reasonably well-defined clusters, while Pop and Rap are spread too widely to be considered clusters. This suggests that the features of EDM and Latin hit songs are similar within each genre, while Pop and Rap hits are dissimilar and have greater variance than the other groups.
- I am not sure this interpretation is correct because my understanding of PCA is weak.
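One way to make the "about 35%" figure concrete is to compute the explained variance directly from the `prcomp` result: each component's share is its squared standard deviation divided by the sum of all squared standard deviations. The sketch below uses the built-in `iris` measurements as a stand-in for the song features, and `explained_variance` is a hypothetical helper name.

```r
# Proportion of total variance explained by each principal component.
explained_variance <- function(pca) {
  var_each <- pca$sdev^2              # variance along each component
  prop <- var_each / sum(var_each)    # share of the total variance
  data.frame(component  = seq_along(prop),
             proportion = prop,
             cumulative = cumsum(prop))
}

# Built-in iris measurements stand in for the song features here.
demo.pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
ev <- explained_variance(demo.pca)
ev
```

Passing the song PCA object to the same function gives the numbers behind the scree plot. If the first two components cover only about 35%, the two-dimensional biplot leaves roughly 65% of the variation out of view, so clusters seen there are best read with some caution.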
Histograms, Normal Curves, and Median of Variables
Energy
p7 <- ggplot(base, aes(x = energy)) +
scale_x_continuous(name = "Energy ",
breaks = seq(.2, 1, .2),
limits=c(.2, 1)) +
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"energy"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$energy),
sd=sd(base$energy)))
p7
summary(base[,c(4)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2960 0.5620 0.6780 0.6591 0.7722 0.9090
Energy shows a negatively skewed distribution with a median of about .68, meaning most hit songs are energetic. For comparison, a death metal track would score close to 1, while a Bach piece would score around .2.
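In the plotting code above, the histogram is drawn in counts while the kernel density and the normal curve are on the density scale, so the curves and bars do not share a y-axis. A minimal sketch of one way to put all three on the density scale follows, assuming ggplot2 3.3 or later for `after_stat()`. `hist_with_normal` is a hypothetical helper, and the toy values are made up for illustration.

```r
library(ggplot2)

# Histogram on the density scale, with a kernel density and a fitted normal curve.
hist_with_normal <- function(df, var, binwidth) {
  x <- df[[var]]
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(aes(y = after_stat(density)), binwidth = binwidth,
                   fill = "black") +
    geom_density(fill = "orange", alpha = 0.4) +
    stat_function(fun = dnorm, colour = "red",
                  args = list(mean = mean(x), sd = sd(x))) +
    geom_vline(xintercept = median(x), colour = "gold1", linetype = "dashed") +
    labs(x = var, y = "Density")
}

# Toy values for illustration only (not taken from top2018.csv).
toy <- data.frame(energy = c(0.30, 0.45, 0.56, 0.62, 0.68,
                             0.70, 0.75, 0.77, 0.85, 0.91))
p <- hist_with_normal(toy, "energy", binwidth = 0.1)
p
```

With the song data, a call such as `hist_with_normal(base, "energy", binwidth = 0.05)` would replace the repeated plotting blocks. The bin width here is an arbitrary choice.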
Danceability
p7 <- ggplot(base, aes(x = danceability)) +
scale_x_continuous(name = "Danceability",
breaks = seq(.2, 1, .2),
limits=c(.2, 1)) +
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"danceability"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$danceability),
sd=sd(base$danceability)))
p7
summary(base[,c(3)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2580 0.6355 0.7330 0.7165 0.7983 0.9640
Danceability shows a negatively skewed distribution with a median of about .73, meaning most hit songs are easy to dance to.
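The direction of skew can also be checked numerically rather than by eye. A mean below the median, as in the danceability summary above (0.7165 vs. 0.7330), is a common hint of negative skew, though not a guarantee. The sketch below computes a moment-based sample skewness. `sample_skewness` is a hypothetical helper, and the two vectors are made-up values.

```r
# Moment-based sample skewness: negative means a longer left tail,
# positive means a longer right tail.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy vectors for illustration only (not taken from top2018.csv).
left_tailed  <- c(0.2, 0.6, 0.7, 0.75, 0.8, 0.8, 0.85)
right_tailed <- c(0.05, 0.05, 0.1, 0.1, 0.15, 0.3, 0.9)

sample_skewness(left_tailed)   # negative
sample_skewness(right_tailed)  # positive
```

Applied to the data, `sapply(base[c("energy", "danceability", "liveness")], sample_skewness)` would give one value per feature.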
Speechiness
p7 <- ggplot(base, aes(x = speechiness)) +
scale_x_continuous(name = "speechiness ",
breaks = seq(0, 1, .1),
limits=c(0, 1)) +
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"speechiness"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$speechiness),
sd=sd(base$speechiness)))
p7
summary(base$speechiness)  # select by name: after dropping the id column, column 9 is acousticness, not speechiness
Speechiness shows a positively skewed distribution with a median of about .1. Most songs contain some speech, but the time spent talking is far less than the instrumental time. According to Spotify's description of the feature, values below 0.33 most likely represent music and other non-speech-like tracks.
Acousticness
p7 <- ggplot(base, aes(x = acousticness)) +
scale_x_continuous(name = "acousticness ",
breaks = seq(0, 1, .1),
limits=c(0, 1)) +
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"acousticness"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$acousticness),
sd=sd(base$acousticness)))
p7
summary(base$acousticness)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000282 0.040225 0.109000 0.195701 0.247750 0.934000
Acousticness shows a positively skewed distribution with a median of about .11. This makes sense, as most hit songs use electric or electronic instruments rather than acoustic ones, which are probably limited to alternative, some pop, and country genres.
Key
p7 <- ggplot(base, aes(x = key)) +
scale_x_continuous(name = "Key",
breaks = seq(0, 11, 1),
limits=c(-0.5, 11.5)) +  # keys run from 0 to 11, so the axis must include 11
scale_y_continuous(name = "Count",
breaks = seq(0, 20, 2),
limits=c(0, 20)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"key"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7
summary(base[,c(5)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.75 5.00 5.33 8.25 11.00
Key shows a rugged, plateau-like distribution, with the most frequent key being C (key 0 in pitch-class notation).
Loudness
p7 <- ggplot(base, aes(x = loudness)) +
scale_x_continuous(name = "Loudness (dB)",
breaks = seq(-11, 0, 1),
limits=c(-11, 0)) +  # the minimum is -10.109, so a lower limit of -10 would drop a song
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"loudness"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$loudness),
sd=sd(base$loudness)))
p7
summary(base[,c(6)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.109 -6.651 -5.566 -5.678 -4.364 -2.384
Loudness is roughly normally distributed, with a median of about -5.6 dB.
Liveness
p7 <- ggplot(base, aes(x = liveness)) +
scale_x_continuous(name = "liveness ",
breaks = seq(0, 1, .1),
limits=c(0, 1)) +
scale_y_continuous(name = "Count",
breaks = seq(0, 11, 1),
limits=c(0, 11)) +
geom_histogram(fill= "black") +
theme_solarized() +
geom_vline(xintercept = median(base[,"liveness"]), size = .5, colour = "gold1",
linetype = "dashed") +
geom_density( fill = "orange")
p7 <- p7 + stat_function(fun=dnorm,
color="red",
args=list(mean=mean(base$liveness),
sd=sd(base$liveness)))
p7
summary(base[,c(11)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.02150 0.09467 0.11850 0.15830 0.17075 0.63600
Liveness shows a positively skewed distribution with a median of about .12. Liveness measures how likely it is that a track was recorded live, by detecting the presence of an audience in the recording. Most songs are not recorded live and generally do not include an audience in the background, so both the mean and the median liveness are low.
Tempo
summary(base[,c(13)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
64.93 95.73 120.12 119.90 140.02 198.07
Most songs have a moderate to fast tempo: over 75% of songs are above roughly 96 BPM, which puts them in the andante, allegro, or presto ranges. Allegro and presto in particular are fast tempos, and half of the songs are at or above about 120 BPM.
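The tempo categories can be made explicit by binning BPM values with `cut()`. The boundaries below are an assumption, since tempo-marking ranges vary between sources, and `classify_tempo` is a hypothetical helper. The example classifies the minimum, quartiles, and maximum printed by `summary()` above.

```r
# Assumed BPM boundaries; tempo-marking ranges vary between sources.
tempo_breaks <- c(0, 76, 108, 120, 168, Inf)
tempo_labels <- c("slower than andante", "andante", "moderato",
                  "allegro", "presto or faster")

# Assign each BPM value to a tempo marking (intervals are closed on the left).
classify_tempo <- function(bpm) {
  cut(bpm, breaks = tempo_breaks, labels = tempo_labels, right = FALSE)
}

# Minimum, quartiles, and maximum from the summary above.
tempo_classes <- classify_tempo(c(64.93, 95.73, 120.12, 140.02, 198.07))
tempo_classes
```

On the full column, `prop.table(table(classify_tempo(base$tempo)))` would give the share of songs in each category under these assumed boundaries.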
Group Boxplots Based off Genre
ch1 <- ggplot(base_new, aes(type, danceability, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Danceability")+
theme_solarized()+
theme(legend.position = "none")
ch2 <- ggplot(base_new, aes(type, energy, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Energy")+
theme_solarized()+
theme(legend.position = "none")
ch3 <- ggplot(base_new, aes(type, loudness, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Loudness")+
theme_solarized()+
theme(legend.position = "none")
ch4 <- ggplot(base_new, aes(type, speechiness, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Speechiness")+
theme_solarized()+
theme(legend.position = "none")
grid.arrange(ch1, ch2, ch3, ch4, ncol=2)
ch5 <- ggplot(base_new, aes(type, acousticness, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Acousticness")+
theme_solarized()+
theme(legend.position = "none")
ch6 <- ggplot(base_new, aes(type, liveness, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Liveness")+
theme_solarized()+
theme(legend.position = "none")
ch7 <- ggplot(base_new, aes(type, valence, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Valence")+
theme_solarized()+
theme(legend.position = "none")
ch8 <- ggplot(base_new, aes(type, tempo, fill = type))+
geom_boxplot(outlier.size = 1.7, outlier.shape = 20, lwd = 0.8, fatten = 1.2)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "", y = "Tempo")+
theme_solarized()+
theme(legend.position = "none")
grid.arrange(ch5, ch6, ch7, ch8, ncol=2)
Medians for most variables tend to be similar across groups, apart from large differences between Latin and the other groups in areas such as valence and tempo. While most medians are very similar, the interquartile range for each genre can vary considerably from variable to variable. This is evident in tempo, liveness, and energy, among others.
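The comparison of medians and quartile ranges can be backed by numbers as well as by the boxplots. The sketch below computes the median and interquartile range of one feature within each group. `spread_by_group` is a hypothetical helper, and the toy data frame holds made-up values.

```r
# Median and interquartile range of one feature within each group.
spread_by_group <- function(df, feature, group) {
  groups <- split(df[[feature]], df[[group]])
  t(sapply(groups, function(x) c(median = median(x), iqr = IQR(x))))
}

# Toy values for illustration only (not taken from top2018.csv).
toy <- data.frame(
  type  = c("Latin", "Latin", "Latin", "Pop", "Pop", "Pop"),
  tempo = c(94, 96, 98, 90, 120, 150)
)
spread <- spread_by_group(toy, "tempo", "type")
spread
```

For the song data, `spread_by_group(base_new, "tempo", "type")` returns one row per genre, and the same call works for any other feature column.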
ggplot(base_new, aes((duration_ms/1000)/60, fill = type))+
geom_density(alpha = 0.85)+
scale_fill_brewer(palette = "YlOrRd")+
labs(x = "Duration (in minutes)", y = "Density")+
scale_x_continuous(limits = c(0,8), breaks = seq(0,8,1))+
theme_solarized()+
theme(legend.position = c(0.8, 0.85))+
guides(fill = guide_legend(title = "Type of song"))
Most songs tend to be about 3.4 minutes long, except rap songs, which are about 3.6 minutes long. This is probably because people tend to lose focus on, or think of, songs longer than 3.5 minutes as a more arduous task than listening to a quick song. I can personally attest to this: when I see a video longer than 4 minutes I generally consider it a long video and too long to view in a single sitting, and the same idea applies to songs.
Discussion
During this project I attempted to discern notable characteristics of a top hit song and to determine whether or not I could predict if a song would be a hit based on those characteristics. What I learned is that most hit songs are relatively short, have a fast tempo, have low acousticness, have high energy, are easy to dance to, and cluster around a loudness of about -5.6 dB. What I also learned, or did not learn, is that using this analysis to predict other hit songs is currently beyond my skill level and would require machine learning and further statistical tools. I also learned that there is a positive correlation between loudness and energy, probably because loudness is one component of energy and, in my own judgement, louder songs just tend to feel more energetic.
Limitations of the data include the small sample size (100 songs), the limited number of audio variables (more variables would allow a more precise analysis), and the lack of accounting for extraneous variables such as advertising and artist popularity, which can influence the success of a song.
Given these limitations, the ideal dataset would have a large sample size, include more audio variables (genre, feeling, etc.), and take into account variables such as release date, marketing, and artist popularity in order to predict more accurately whether a song will be a top hit. Realistically, though, that data would be very expensive to assemble in full, so having a few more audio characteristics and a greater sample size would probably be enough to attempt predicting whether or not a song will be a hit based on auditory characteristics alone. This idea would probably be best pursued through a study similar to my third citation, “Predicting Hit Songs with Machine Learning”.
Finally, the project was fun and introduced me to a few new packages, a few new data repositories, and a few new visualizations, but most importantly it gave me a greater appreciation for people like Peter and the TAs, who deal with this at a higher level and get to learn something crazy every day.
- I have included my previous discussion, which was not in the draft you saw, to show why I moved from the previous dataset to this one.
For better or worse, this discussion reveals the weakness of my data and of my foresight. First, I will detail the weakness of the dataset. The dataset was too simple and did not give enough information about attributes such as roles, genders, and participation, which would have allowed me to examine the network more critically. Because those attributes were missing, the only comparisons I could make were between the graphs themselves, against random models, and against my own expectation of what each network would look like based on its description (e.g. a family network would be highly connected). These comparisons would only yield “results” had there been a stark difference. Sadly, my visualizations reveal no such difference and simply show that the network has greater connectivity than expected at random when compared to G(n, m) and G(n, p) Erdős-Rényi models.
Second, my failure to recognize that this dataset was a poor choice was caused by hopefulness, distraction by the idea of the dataset, and procrastination. Had I been more critical and more pragmatic about the possibilities of a simple undirected network with so few nodes and no attributes, I would have realized beforehand how hopeless this data was. It also would have helped if the reference had not been in a book I could not buy.
On a high note, I learned some interesting facts about real networks, such as that online networks tend to have many nodes with few edges and a few nodes with many edges. I have also come to a better understanding of what good data looks like. Finally, the killer study would still focus on what I had in mind at the beginning but would overcome the shortcomings of the dataset that I listed and include more groups, so that the results could be extrapolated to a wider audience.
