The New York Times dialect quiz, “How Y’all, Youse and You Guys Talk,” was created by Josh Katz and published in 2013. You’re likely to recall taking it, or at the very least hearing about it. It was the one that asked you questions like, “What do you call something that is across both streets from you at an intersection?” and let you choose answers such as “kitty-corner” and “catty-corner” (the latter being the obvious right choice). Its accuracy astounded everyone I knew.
After answering 25 questions designed to elicit your linguistic quirks, you were told which specific area of the United States you grew up in (technically, the quiz shows you the region where people are most likely to speak like you, so it could conceivably show you where your parents grew up rather than where you grew up, as Ryan Graff points out).
To my astonishment, every time I took the quiz it placed me no more than 15 miles from where I actually grew up. And my experience wasn’t unusual: despite its December 21 publication date, the quiz was the most popular thing the Times published that year. It was such a hit that Katz went on to write a book about it three years later.
So why should we care about Katz’s dialect quiz now, years after it was a nationwide phenomenon in 2013? I care because I’m a language and information science nerd. You should care because it was a successful attempt to bring data science into the homes of millions of Americans, regardless of their technical background.
A brief overview of the quiz’s history
(Much of what follows is based on Katz’s presentation at the NYC Data Science Academy.)
• The questions in Katz’s quiz were based on a larger research project, the Harvard Dialect Survey, released in 2003 by Bert Vaux and Scott Golder of Harvard’s Linguistics Department (you can find a good interview with Vaux on NPR here).
• Vaux and Golder’s 122-question quiz was disseminated online and focused on three areas: pronunciation, vocabulary, and syntax.
• The original survey collected about 50,000 responses, all of which were coded by zip code.
• Katz, then a statistics graduate student at North Carolina State University, built the Times’ version of the quiz in 2013 as an intern. (He was invited to intern at the New York Times after the paper saw his visualisations of Vaux and Golder’s original data.)
• The Times quiz was built with R and D3, the latter of which is a JavaScript library (often compared to jQuery) for binding data to a page’s DOM so it can be manipulated and visualised.
Now, let’s talk about data science
So how did the quiz actually work? It was built on K-Nearest Neighbors (K-NN), a supervised machine learning algorithm used to “predict the class of a new data point based on the value of the points around it in parameter space,” as my graduate-school TA explained. In a later post, we’ll delve deeper into the concept of machine learning and the specifics of the K-NN algorithm. For now, let’s unpack some of the jargon in my TA’s definition.
What does “parameter space” mean?
Wikipedia defines parameter space as “the set of all conceivable combinations of values for all the individual parameters contained in a certain mathematical model.” Accurate, perhaps, but not very useful for the average reader. Because I am a visual learner, a doodle might be more instructive:
If you have parameters (i.e. variables or features) to plot, the space in which you plot them is the parameter space. For K-NN, the parameter space is everything between the two axes, and the star is the point we’re trying to classify. (For the time being, ignore the k-values.)
There are two kinds of circles in the diagram above: yellow circles and purple circles. The goal of running K-NN on this dataset is to predict whether our new input, the star, belongs to the yellow-circle or purple-circle category, based on its proximity to the circles around it.
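To make that concrete, here is a minimal Python sketch of the same idea. The coordinates and colours below are invented for illustration (they are not the values from the doodle, and this is not Katz’s code); the point is simply that the new point is classified by letting its k closest neighbours vote:

```python
from collections import Counter
import math

# Hypothetical points in a two-dimensional parameter space,
# standing in for the yellow and purple circles in the doodle.
training_points = [
    ((1.0, 2.0), "yellow"),
    ((1.5, 1.8), "yellow"),
    ((2.0, 3.0), "yellow"),
    ((5.0, 5.5), "purple"),
    ((6.0, 5.0), "purple"),
    ((5.5, 6.5), "purple"),
]

star = (2.2, 2.5)  # the new point we want to classify
k = 3

def distance(a, b):
    """Plain Euclidean distance between two points in parameter space."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Find the k training points closest to the star...
neighbors = sorted(training_points, key=lambda p: distance(p[0], star))[:k]

# ...and let them vote on the star's class.
votes = Counter(label for _, label in neighbors)
print(votes.most_common(1)[0][0])  # -> "yellow"
```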
Parameter space? Check.
Before we get into the concepts and math behind K-NN, there’s one more thing to cover: algorithmic laziness. It wasn’t part of my TA’s definition above, but understanding it will help us grasp what exactly happens when we run a K-NN analysis.
The K-NN algorithm is a “lazy” algorithm
But how can an algorithm be lazy? Can algorithms get tired? Can they have bad days? Unfortunately, no. According to Adi Bronshtein, “laziness” means that an algorithm does not use its training data points to do any generalisation.
That still doesn’t quite explain what Bronshtein means, because we haven’t yet touched on the concept of training an algorithm. In essence, all supervised machine learning algorithms require some data on which to base their predictions.
In the case of K-NN, that data is something like the yellow and purple circles in our diagram, which it needs in order to figure out how to classify the star. Unlike eager algorithms (such as decision trees), lazy algorithms simply store all of the training data they’ll need and don’t use it until they’re actually given something to classify.
“Instance-based learning” is another name for lazy algorithms, and it may convey their purpose better. As the name implies, these algorithms (usually) take one instance of data and compare it to all the other instances held in memory.
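To see how little “training” a lazy learner actually does, here is a bare-bones sketch (an illustration only, not Katz’s code or any library’s implementation): the fit step just memorises the data, and all the real work happens when a prediction is requested.

```python
class LazyKNN:
    """Toy instance-based learner: 'training' just memorises the data."""

    def __init__(self, k=3):
        self.k = k
        self.points = []   # stored training instances
        self.labels = []   # their known classes

    def fit(self, points, labels):
        # No generalisation happens here -- the lazy learner
        # simply stores every instance for later comparison.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, new_point):
        # All the real work happens now: compare the new instance
        # to every stored instance and let the k closest ones vote.
        by_distance = sorted(
            range(len(self.points)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(self.points[i], new_point)),
        )
        nearest_labels = [self.labels[i] for i in by_distance[: self.k]]
        return max(set(nearest_labels), key=nearest_labels.count)
```

An eager learner, by contrast, does its heavy lifting inside the training step (building a tree, estimating coefficients) and can discard the training data afterwards.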
Cathy O’Neil, a.k.a. “mathbabe,” provides a wonderful illustration of instance-based learning with a grocery-store scenario:
Of course, what you actually want is a way to predict a new user’s category before they buy anything, based on what you know about them when they arrive, namely their attributes. So, given a set of attributes for a user, what is your best guess for that user’s category?
Let’s use k-Nearest Neighbors as an example. Say k is 5, and Monica is a new customer. The algorithm then looks for the 5 existing customers who are most similar to Monica in terms of attributes, and sees which categories those customers fall into. If four of them are “medium spenders” and one is a “small spender,” then the best prediction for Monica is “medium spender.”
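The voting step in that scenario is about as small as machine learning gets. Here is a tiny sketch of it in Python (the category labels and counts come from the quote above; how “most similar” is measured is glossed over here, just as it is in the scenario):

```python
from collections import Counter

# Spending categories of the 5 existing customers most similar to
# Monica, as in O'Neil's scenario (k = 5).
nearest_neighbor_categories = [
    "medium spender", "medium spender", "medium spender",
    "medium spender", "small spender",
]

# The prediction for Monica is simply the majority vote.
prediction = Counter(nearest_neighbor_categories).most_common(1)[0][0]
print(prediction)  # -> "medium spender"
```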
That was ridiculously easy
Of course, things are never that simple, but we’ll save the discussion of K-NN’s intricacies for another time. For now: K-NN is a lazy algorithm, which means it stores the data it needs to make a classification and doesn’t use it until it’s asked to classify something.
That’s all there is to it! Now that we’ve established the foundation, we can talk about training, how K-NN works in practice, and, most importantly, how Katz used it for his dialect quiz. All of this will be revealed in Part 2!
Meanwhile, if you haven’t already, I encourage you to take the dialect quiz (and take it again even if you have). Your responses will come in handy later!