This paper compares Standard American English, also called General American English, with Minnesota English. It begins with an introduction containing our research question, our initial hypothesis, and a brief literature review. A methods section then describes how we conducted an interview to collect data and analyzed that data with several software tools, followed by a results section, a discussion section, and our conclusion. In class this term, we periodically mentioned Standard American English and how it does not truly exist, although we do have perceptions of what Standard American English "should" sound like. Most notably, Standard American English is thought to be more "proper" than other dialects of English, which is why Midwestern English is often referenced in conjunction with it. We wondered, however, whether Midwestern English, and particularly its vowel production, truly aligns with Standard American English. This idea, combined with the knowledge that our interviewee is a Minnesotan, led to our research question: is Midwestern English, and specifically Minnesota English, actually representative of the mythical Standard American English? More precisely, we ask whether there are statistically significant differences between the F1 and F2 values of vowels produced by Minnesota English speakers and the corresponding F1 and F2 values recorded for Standard American English vowels. Given our experience hearing the two dialects described as linked, or at least as highly similar, our hypothesis is that Minnesota English is representative of Standard American English: Minnesota English speakers will produce vowels whose F1 and F2 values do not differ to a statistically significant degree from those of Peterson and Barney's "General American" vowels.
One piece of literature that particularly highlights the premise of our research question is Thomas Bonfiglio's Race and the Rise of Standard American. As Bonfiglio notes, "in the first half of the twentieth century, Americans began to view the accent of the midwest and west as a 'general American accent,'" meaning they perceived it as a "standard for pronunciation" and a supposedly standard version of English (Bonfiglio, 2002, p. 1). Network broadcasters also spoke according to (mid)western norms, furthering the notion that Midwestern English was ideal compared to other dialects of English. This backstory explains why we have heard Midwestern English described as a more "standard" dialect, the motivating connection behind our research question. It similarly motivates our hypothesis, which rests on how Midwestern English has been "perceived as the standard" (Bonfiglio, 2002, p. 1). The rest of Bonfiglio's work discusses why that phenomenon occurred, which informs the broader sociophonetic context of our question. One of Bonfiglio's (2002) ideas is that because "early radio announcers were from the midwest," Americans imitated the way they spoke, causing their speech, and therefore Midwestern English, to become standardized (p. 2). That idea raises the further question of why early radio announcers were from the Midwest; Bonfiglio suggests it may have been because there were fewer immigrants in the Midwest, and because Americans wanted to adopt "western speech patterns as the preferred norm" due to "xenophobic and antisemitic movements" (Bonfiglio, 2002, p. 4). Another piece of literature that informed our research question and methodology is Ettien Koffi's "The Acoustic Vowel Space of Central Minnesota English: Focus on Female Vowels," which discusses three ways Central Minnesota English differs from Standard American English, partially contradicting Bonfiglio's observations about Midwestern English. Koffi (2013) reports that in the Central Minnesota dialect, [æ] is pronounced [ɛ] before [g], [ɑ] and [ɔ] are completely merged, and /ʊ/ is produced with a wider mouth and less rounding than is typical (p. 2). These differences indicate that Central Minnesota English is not consistent with Standard American English, which theoretically contradicts our hypothesis. However, we do not treat them as an explicit contradiction of our hypothesis because we would classify Central Minnesota English as Upper Midwestern English rather than Midwestern English generally. The rest of Koffi's article explains how his study was conducted, including the calculation of Euclidean distances between corresponding vowels in Central Minnesota English and Standard American English, which inspired us to include Euclidean distance calculations in our own analysis. In contrasting our vowel formant values with those of Standard American English, we and Koffi both referenced Peterson and Barney's (1952) General American English vowels, which they established in an experiment involving 76 speakers, including men, women, and children (p. 176). Their experiment involved two parts, a listening portion and a speaking portion, and the speaking portion was the more relevant to us. Peterson and Barney (1952) measured formant values for vowels produced by their speakers in h_d contexts, which informed the tokens we pulled from our interview, as we explain in the methodology section below (p. 175).
In order to collect our data, we used two key software tools: Elan, a free, open-source transcription program created by the Max Planck Institute for Psycholinguistics, and Praat, an open-source speech analysis program created by Paul Boersma and David Weenink of the University of Amsterdam. The process began without software. We drafted an interview schedule to keep the time we had with our speaker structured and efficient. Several fundamental decisions at this stage affected the ease of data collection: the specific questions we asked, as well as the reading passage we chose, Arthur the Rat, likely shaped the cadence of the speech we collected. Although it would have been difficult to predict, we found that the more morally or intellectually involved a question was, the more naturally our speaker developed their answer, since less input from the interviewers was required to produce an in-depth response. Our speaker was eager to cooperate with our requests for detailed responses, which greatly improved the quality of our recording and the processing of our data. The interview itself proceeded with little disruption. Our speaker spoke clearly and intentionally, and over the course of the interview became markedly more comfortable and natural. This prompted us to exempt the initial section, Demographics, from analysis; the other question topics (Interests, Life, and Food) produced more fruitful discussion. Technologically, the interview presented slightly more obstacles. We obtained a Logitech AK5370 from Carleton's Presentation, Events, and Production Support office, which provided high-fidelity unidirectional recording; we recorded in mono at a sampling rate of 44,000 Hz and a bit depth of 16 bits. On the software side, we used Praat's built-in recording function to capture the entire interview. This proved both a benefit and a drawback: Praat's interface is well designed, with adequate control over the device and recording quality, but Praat can only record in ten-minute segments. Although those segments usefully broke up the process of transcription, they meant we had to stitch the recording back into its full duration after our work in Elan was complete.
The transcription work itself was straightforward but tedious. After agreeing on stylistic elements of transcription, such as the handling of "uhms" and "uhs," we were ready to import our audio into Elan and begin. Elan is a feature-rich, purpose-built program that accomplishes the task, but much of its design is frustrating. It is poorly optimized for modern hardware, and even scrubbing the recording timeline can cause extreme slowdowns and frequent crashes. From speaking with others in the class, we suspect Elan's instability lowered the quality of transcriptions generally, as many groups reported having to redo entire sections of transcription after a crash. Fortunately, Elan has a backup system which, though inconsistent, certainly saved hours of work from loss. Not all of the issues we encountered can be blamed on Elan, however. The division of our recording into ten-minute chunks forced us to use Elan's merge feature after our transcription was complete. The merge feature allows transcriptions to be saved as tab-delimited files, which can then be merged into larger transcription files. Merging was unintuitive but precise, and after a steep learning curve we were able to combine our transcriptions into the full length of our interview. The audio itself was still separated, but after a brief foray into Audacity to rejoin it, our completed Elan file could be exported as a Praat TextGrid with accompanying audio.
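For reference, the merge step itself can also be scripted. Below is a purely hypothetical R sketch of the same idea, assuming each ten-minute chunk was exported as a tab-delimited file with start, end, and annotation columns; the file and column names are illustrative, not Elan's actual export headers.

# Hypothetical sketch: merging two ten-minute transcription chunks in R.
# Column and file names are illustrative, not Elan's exact export schema.
part1 <- read.delim("interview_part1.txt")
part2 <- read.delim("interview_part2.txt")

# Offset the second chunk's timestamps by the first chunk's duration
# (600 seconds for a ten-minute segment).
part2$start <- part2$start + 600
part2$end   <- part2$end + 600

merged <- rbind(part1, part2)
write.table(merged, "interview_full.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)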
With completed TextGrids, we proceeded to use Praat to obtain separate groups of tokens for each part of the interview, following the vowel spacing protocol provided. Praat let us layer text over our audio so we could specify the word used to fulfill a token, identify the exact vowel in the word, and mark a nucleus point, usually near the center of the vowel but sometimes simply where the formants were most stable. Additionally, for diphthongs, it was necessary to mark a secondary glide point to capture how the phoneme progressed from one vowel to the next. This was a smooth process for the main interview audio, as all of the tokens could be found somewhere in our speaker's responses; in the reading passage, however, certain tokens were less well represented. This contributed to our eventual choice to use the tokens found in the main interview to produce our results, rather than any from the word list or passage. Once these tokens were identified, the provided Praat script allowed us to easily sample our vowels and obtain the F1, F2, and F3 values of each vowel and, for diphthongs, of each glide as well. This process produces comma-separated value (CSV) files, which can be imported into Excel or another spreadsheet program and exported as tab-delimited files for NORM, the vowel normalization suite hosted by the University of Oregon.
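To illustrate that final conversion step, here is a minimal R sketch. It assumes the Praat script's CSV output has columns named speaker, vowel, word, f1, f2, and f3 (illustrative names, not necessarily the script's actual headers) and writes the tab-delimited column layout described in NORM's documentation.

# Read the Praat script's CSV output (column names here are assumptions
# about our script's output, to be checked against the real file).
tokens <- read.csv("male_interview_tokens.csv")

# Rearrange into NORM's documented tab-delimited layout:
# speaker, vowel, context, F1, F2, F3 (glide columns may follow).
norm_input <- data.frame(speaker = tokens$speaker,
                         vowel   = tokens$vowel,
                         context = tokens$word,
                         F1 = tokens$f1,
                         F2 = tokens$f2,
                         F3 = tokens$f3)

write.table(norm_input, "male_norm_input.txt",
            sep = "\t", row.names = FALSE, quote = FALSE)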
Although we normalized all of our data, the abundance of well-defined tokens in our main interview (the question-and-answer portion) prompted us to use that data for further processing. Our decision to include a female Minnesotan speaker did not affect this methodological choice, as Lydia and Margo had excellent tokens from their interview. We chose to include a female speaker so that we could compare Midwestern English to Standard American English for both males and females in case there were differences. After normalization, we had to remove certain tokens that were absent from Peterson and Barney's 1952 study. We chose "b–t" tokens for all of our vowels except "put," since no English word with a "b–t" structure contains the [ʊ] phoneme. Any nasalized tokens or tokens ending in laterals were removed to reduce the influence of coarticulatory effects. Peterson and Barney did not include any diphthongs, so those tokens were also removed from the analysis. They did, however, include [ɚ], the rhotacized schwa, which we did not collect as a token. An important methodological decision was which normalization method to use. NORM includes nine methods, among them the Lobanov method we used to plot vowels in class and the Labov ANAE method with Telsur G that we used to compare caught/cot vowels. Many factors contributed to our eventual decision to use Labov ANAE with Telsur G normalization, but a fundamental issue was the difference in output units. Lobanov normalization outputs z-score values which, although they can be rescaled to hertz, are not natively frequency values; Labov ANAE outputs values in hertz, so we believed it would be more statistically sound to use those. Additionally, we consulted Professor Rood, who explained that "Lobanov's method has a tough time with broader dialectal differences (but it's great for the same dialect)." As Peterson and Barney's speakers came from all over the United States, Lobanov normalization might be less accurate, especially given our sample size, since Lobanov's "original algorithm required 300+ speakers to get accurate results."
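To make the unit difference concrete, here is a minimal R sketch of the two methods, assuming a data frame called vowels with columns speaker, f1, and f2 (illustrative names). Lobanov returns z-scores measured in standard deviations, while the ANAE method rescales each speaker's values but keeps them in hertz; the grand mean G below is the Telsur constant given in NORM's documentation.

library(dplyr)

# Lobanov: z-score each formant within a speaker. The result is in
# standard deviations, not hertz -- the unit mismatch noted above.
lobanov <- vowels %>%
  group_by(speaker) %>%
  mutate(f1_norm = (f1 - mean(f1)) / sd(f1),
         f2_norm = (f2 - mean(f2)) / sd(f2)) %>%
  ungroup()

# Labov ANAE: multiply each speaker's formants by exp(G - S), where S
# is the speaker's mean natural-log formant value and G is the Telsur
# grand mean (6.896874 per NORM's documentation); output stays in Hz.
G <- 6.896874
anae <- vowels %>%
  group_by(speaker) %>%
  mutate(scale_f = exp(G - mean(log(c(f1, f2)))),
         f1_norm = f1 * scale_f,
         f2_norm = f2 * scale_f) %>%
  ungroup()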
In order to form valid conclusions about our data, we turned to the widely used statistical programming language R. The phonTools package provides tools designed expressly for phonetic analysis and even includes the Peterson and Barney study data as a dataset called pb52. Another indispensable package was the Tidyverse, which we used for data wrangling and graphing. After discussion with other groups, we decided on the following statistical steps to obtain results:

1. Separate Peterson and Barney's data into men and women, exempting children and the rhotacized schwa.
2. Normalize both datasets and take standard deviations of the F1, F2, and Euclidean (√(F1² + F2²)) values for each vowel.
3. Use t-tests to compare each vowel value between our male speaker and Peterson and Barney's male speakers, and between our female speaker and Peterson and Barney's female speakers.
4. Create vowel plots to visualize the differences between vowels in Euclidean space.

Steps 1 and 2 were straightforward using the Tidyverse's data wrangling package, dplyr. Step 3 forced us to run the t-test manually rather than use R's built-in t.test function, due to the abnormalities and complexity of our data. We manually took the difference between each sample mean (the individual speaker) and the population mean (Peterson and Barney's General American English values) and divided by the population standard deviation. Using the normal density function, dnorm, we could then estimate how probable each speaker's t-score would be if the speaker belonged to the Peterson and Barney population.
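A condensed sketch of these steps in R follows. It assumes pb52 provides type, vowel, f1, and f2 columns as in phonTools' documentation, along with a hypothetical speaker_means data frame holding our male speaker's normalized per-vowel means; the probability step is shown with the cumulative pnorm, which yields a conventional two-tailed p-value (dnorm, by contrast, returns the height of the normal curve at a given score).

library(phonTools)  # provides the pb52 dataset
library(dplyr)
library(ggplot2)

data(pb52)

# Step 1: adult male speakers only; drop the rhotacized vowel
# (the label used here is an assumption -- check unique(pb52$vowel)).
pb_m <- subset(pb52, type == "m" & vowel != "ɚ")

# Step 2: population mean and SD per vowel, including the Euclidean
# value sqrt(F1^2 + F2^2).
pop <- pb_m %>%
  mutate(euc = sqrt(f1^2 + f2^2)) %>%
  group_by(vowel) %>%
  summarise(across(c(f1, f2, euc), list(mean = mean, sd = sd)))

# Step 3: score the speaker's per-vowel means (hypothetical
# speaker_means data frame with vowel, f1, f2 columns) against the
# population, shown for F1; F2 and the Euclidean value work the same.
scores <- speaker_means %>%
  inner_join(pop, by = "vowel") %>%
  mutate(z_f1 = (f1 - f1_mean) / f1_sd,
         p_f1 = 2 * pnorm(-abs(z_f1)))

# Step 4: population vowel plot, axes reversed so high front vowels
# sit at the top left, as in a conventional vowel chart.
ggplot(pop, aes(f2_mean, f1_mean, label = vowel)) +
  geom_text() +
  scale_x_reverse() +
  scale_y_reverse()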
This plot (in Hz) shows the male speaker in aquamarine and the male population in blue. The male speaker's vowels (statistics in Figure 1) are extremely close together, with a shape almost completely enclosed by the population vowels. His [u] is very distinct from the population [u], separated by a large distance along the F2-F1 dimension. Its high F2-F1 value indicates significant u-fronting, which is confirmed by a p-value close to 0, indicating that it is unrepresentative of the population. Other vowels with statistically significant differences in formant values compared to the population include [ɛ], [ʌ], [æ], and [ɪ]. [i] appears in the plot as though it should be highly significant, but the population [i] likely has a high standard deviation, which raises its p-value. The other vowels are mostly unremarkable, though they contribute to an overall shape that is drastically different from Peterson and Barney's population. In total, six of the nine vowels produced by the male speaker have mean formant values that differ to a statistically significant degree from the population's.
This plot (in Hz) shows the female speaker in fuchsia and the female population in red. The shapes of the female speaker's and female population's vowel spaces appear much more similar upon first examination, but statistically there is more complexity (vowels in Figure 2). Although the female speaker is generally much more similar to her population than the male speaker is to his, she too has vowels that differ to a statistically significant degree. The standout vowel for the female speaker is certainly [ɛ], which has the highest t-scores for the F1, F2, and Euclidean statistics. It is extremely backed and lowered, and although it appears only moderately significant in the plot, the data show it to be extremely statistically significant, at almost five standard deviations from the population mean. Overall, however, her vowels are much less statistically significant than the male speaker's, and she has one fewer statistically significant vowel. The two speakers differ in significance relative to their respective populations in only three vowels: [ɔ], for which the female speaker was significant and the male was not; [ʌ], for which the male speaker was significant and the female was not; and [u], for which the male speaker's extreme fronting likely produced his significance, while her [u] was not significant.
As we saw in the results section, the formant values for six of the nine vowels produced by the male speaker and five of the nine vowels produced by the female speaker differ to a statistically significant degree from the population values, so Minnesota English is likely unrepresentative of Standard American English. We believe our hypothesis was contradicted because, as Bonfiglio (2002) states, "American English has no real standard pronunciation" because "there are many speech areas and differing pronunciations within any given speech area"; real speakers, rather than Peterson and Barney's idealized speaker, speak a dialect of English influenced by a variety of cultural factors surrounding their strong and weak ties (p. 2). For example, our male speaker fronts [u] markedly, likely due to a combination of factors. First, his mother is a Californian, and the California Vowel Shift involves [u] fronting. Second, his age may play a role: based on the in-class presentations, young speakers may tend to front their [u]s, although that was not the case for our female speaker. On the whole, the male speaker appears not to use the entire vowel space in his speech, given how fronted his [u] and [ʌ] are and how far back his [ɪ] is, but we would need a greater understanding of his linguistic background to determine what might have influenced those differences. Surprisingly, despite appearing merged, his [ɔ] and [ɑ] are not statistically significant; still, the merger itself indicates a difference from Standard American English. The bot-bought merger is actually quite common, based on our understanding from class, so that merger and other vowel shifts may be part of why "in the second half of the twentieth century, American linguists began to reject the rubrics of midwestern and general American" (Bonfiglio, 2002, p. 1). Our female speaker is similarly merged, and her [ɔ] did have statistically significant formant values, likely because Peterson and Barney's idealized female speaker has a lower F1 value for [ɔ] than their male speaker; even if our female and male speakers had similar F1 values, the female speaker's would be the one registering as statistically significant. However, the vowel with the most statistically significant differences for the female speaker is [ɛ]. Again, we would need more information on our speaker's linguistic background to determine why, but in Koffi's plot of a female speaker of Central Minnesota English, [ɛ] was similarly backed, so perhaps our female speaker's dialect is closer to Upper Midwestern English than to Midwestern English generally. Koffi (2013) also asserts that although "phoneticians and speech scientists continue to rely on Peterson and Barney (1952) for all kinds of comparisons," "SAE is an idealization" that "nobody speaks," so the vast majority of speakers in the Midwest, or anywhere, are likely unrepresentative of a Standard American English speaker (p. 10). In an ideal study, we would have many more Midwestern speakers; the significance of our results was certainly diluted by individual differences, such as our male speaker's u-fronting, that stem from the uniqueness of each speaker's voice. A further complication may be the age of the study used for our comparison population: Peterson and Barney's study was conducted in 1952, more than 70 years before ours.
Speech has evolved a great deal since then, so time itself could be a confounding variable behind our findings.
This paper discussed Minnesota English in relation to Standard American English through an analysis of two speakers' vowel production compared with General American English vowel formant values. We determined that Minnesota English, and by extension Midwestern English, is not representative of the mythical Standard American English, most likely because Standard American English does not exist, as Bonfiglio and Koffi assert. In fact, the idea that Standard American English exists is harmful because it perpetuates the notion that some dialects of English are more proper or favorable than others, which can lead to discrimination against speakers of certain dialects. Studies that demonstrate that most speakers produce vowels with formant values differing from those of the imaginary Standard American English are therefore critical, because they challenge the notion that some dialects of English are more acceptable than others.
References
Bonfiglio, T. P. (2002). Race and the rise of Standard American (Vol. 7). Mouton de Gruyter. https://scholarship.richmond.edu/cgi/viewcontent.cgi?article=1107&context=bookshelf
Koffi, E. (2013). The acoustic vowel space of Central Minnesota English: Focus on female vowels. Linguistic Portfolios, 2(1), 2. https://repository.stcloudstate.edu/cgi/viewcontent.cgi?article=1020&context=stcloud_ling
Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. The Journal of the Acoustical Society of America, 24(2), 175–184. https://pure.mpg.de/rest/items/item_2375480_4/component/file_2375479/content