Data Critique

The data set we have chosen to focus on is titled "Evolution of Popular Music: USA 1960-2010." It was created by Matthias Mauch, Robert M. MacCallum, Mark Levy, and Armand M. Leroi. Its original source is the Billboard Hot 100 lists from 1960 to 2010, which limits the data set to songs popular in the United States, beginning in 1960, shortly after the Hot 100 chart was launched in 1958.
This data exists thanks to the emergence of digitization technologies, which allow music to be processed and studied on a large scale. The authors aimed to investigate the evolution of popular taste, so they did not attempt to obtain a representative sample of all the songs released in that time period; they included only the songs that were the most commercially successful.

The source of the song data is not the musical score but the audio recordings. This differentiates the data set from other pop-music studies: the research group attempted to identify musically meaningful features rather than aspects like loudness, vocabulary statistics, and sequential complexity. As a result, the extracted data is more complex and detailed than that of other studies of music.

Songs were measured for a series of quantitative audio features such as tonal content and timbre. These measurements were then discretized into "words," resulting in a harmonic lexicon of chord changes and a timbral lexicon of timbre clusters. Essentially, each song is represented as a distribution over eight harmonic topics and eight timbral topics.
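
To make the idea of a song as "a distribution over topics" concrete, here is a minimal Python sketch. The raw weights are invented for illustration and are not taken from the data set; the helper names `to_distribution` and `dominant_topic` are our own assumptions, not part of the authors' method.

```python
# Hypothetical raw topic weights for one song (values invented for illustration).
harmonic_raw = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.3, 0.1]  # eight harmonic topics
timbral_raw = [0.2, 0.2, 1.0, 0.1, 0.1, 0.2, 0.1, 0.1]   # eight timbral topics


def to_distribution(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def dominant_topic(distribution):
    """Return the 1-based index of the topic with the largest share."""
    return max(range(len(distribution)), key=lambda i: distribution[i]) + 1


harmonic = to_distribution(harmonic_raw)
timbral = to_distribution(timbral_raw)

# One song becomes a vector of 16 numbers: 8 harmonic shares + 8 timbral shares.
song_vector = harmonic + timbral
```

Under these made-up numbers, the song leans mostly on the first harmonic topic and the third timbral topic, which is the kind of summary the topic weights make possible.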

The data set mostly focuses on the results of statistical techniques like k-means clustering and principal component analysis, based on harmonic and timbral scores. While this helps quantify aspects of the music, it removes the emotional aspect that would have motivated the harmonic and timbral choices in these top songs in the first place. The quantitative nature of the data also allows the researcher to compare songs easily across time, which gives insight into the name of the data set: Evolution of Popular Music: USA 1960–2010.

The k-means clusters were obtained through an unsupervised method, so the grouping of songs into eras might mean less than the historical context (e.g., the Vietnam War or the Cold War), which inspired much of the music of the time. However, the associations between the PC scores and the clusters are worth exploring further.
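
As a rough illustration of how one might explore the association between PC scores, clusters, and eras, the sketch below runs a plain k-means on a handful of invented two-dimensional PC scores and then cross-tabulates the resulting clusters against era labels. This is not the authors' pipeline: the data, the era labels, and the choice to seed the centroids with the first k points are all simplifying assumptions made for the example.

```python
from collections import Counter

# Hypothetical songs: (era label, (PC1 score, PC2 score)). Values are invented.
songs = [
    ("1960s", (-2.0, 0.1)), ("1960s", (-1.8, -0.2)), ("1970s", (-1.9, 0.3)),
    ("1990s", (2.1, 0.0)), ("2000s", (1.9, 0.2)), ("2000s", (2.2, -0.1)),
]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, k, max_iterations=50):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels


eras = [era for era, _ in songs]
scores = [score for _, score in songs]
labels = k_means(scores, k=2)

# Cross-tabulate: how many songs of each era fall into each cluster?
crosstab = Counter(zip(eras, labels))
for (era, cluster), count in sorted(crosstab.items()):
    print(f"{era}: cluster {cluster} -> {count} song(s)")
```

Because the clustering never sees the era labels, any alignment between clusters and eras in a table like this is something to interpret against the historical context rather than a result in itself.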

The data set also includes much simpler variables than the harmonic topic weights, timbral topic weights, number of chord changes, and timbre class counts. Two of them identify the song: track name and artist name. The data is also qualified by time; it records the date of entry to the Billboard Hot 100, the release date by quarter, and the era.

Although the data set includes a wide variety of variables, there are gaps in the information that may have been useful to us. In terms of popularity, there is no information on how many awards the songs and artists won, or how many nominations they received. We also have no information on the rankings the songs had on other music charts. As far as information on bands goes, we do not have details on who the lead singer was or who provided the backing instrumentals for a given song. A song’s duration may have been a factor in how many chord changes there were or how many different harmonies were used, which are major points in this data set, yet duration is not recorded. The record label is also not included, although it would have been interesting to see which labels have the most charting songs and how much influence certain labels have on the music industry as a whole. There is also no record of the lyrics for each song, which could have been used for a text analysis to see what themes and ideas were common and recurring throughout the time period.

There is also little textual indication of the themes of the songs; an important historical event could have caused a song’s popularity, but the data set leaves this out. Furthermore, lyrics could have been important context for the harmonic and timbre scores. Cultural preferences or the zeitgeist of an era can easily be tracked by certain harmonic and timbre scores.

Overall, this data set can potentially reveal the evolution of certain genres of music. Using the measure of chord changes, for example, one could analyze the level of sophistication or simplicity of certain kinds of music over time. In addition, the release dates by quarter could show whether songs possess different qualities depending on the season. After looking through the data set and the research, we decided to analyze the ways in which the emergence of the internet has affected genre diversity. Also in the realm of the internet, it would be interesting to analyze changes in music distribution over time, including the emergence of records, cassettes, CDs, and eventually online streaming and torrenting.
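
The chord-change and seasonal comparisons described above could be sketched as follows. The rows are invented, and the column layout (year, quarter, number of chord changes) is an assumption about how one might arrange the variables, not the data set's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (year, quarter, number of chord changes). Values invented.
rows = [
    (1964, 1, 40), (1967, 3, 44), (1975, 2, 38),
    (1978, 4, 36), (1986, 1, 30), (1989, 3, 34),
    (1995, 2, 22), (2003, 4, 26), (2008, 3, 24),
]


def mean_by(rows, key):
    """Average the chord-change count within each group defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {group: mean(values) for group, values in sorted(groups.items())}


# Trend over time: group songs by decade (1964 -> 1960, 1975 -> 1970, ...).
by_decade = mean_by(rows, key=lambda r: r[0] // 10 * 10)

# Seasonality: group songs by release quarter (1 through 4).
by_quarter = mean_by(rows, key=lambda r: r[1])

for decade, average in by_decade.items():
    print(f"{decade}s: {average} chord changes on average")
for quarter, average in by_quarter.items():
    print(f"Q{quarter}: {average} chord changes on average")
```

Averages like these only describe the charting songs in each group, so a falling or rising number would need to be read alongside the gaps in the data set noted earlier, such as missing song durations.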

