The ghost of teenagers past or present? The voice of the customer holds the answer
Data from online natural language interaction is a fantastic opportunity to learn more about your customers, but it is not without its pitfalls. If a customer walks into your high-street store, a glance provides an immediate impression of their age. In an online world, it is not so simple to tell the teens from their grandparents. If prompted to provide a date of birth when signing up for an account, users may choose to protect their privacy by giving inaccurate information. In this blog post, we will see how data science combined with real customer inputs provides powerful insights about how different age groups interact with Artificial Solutions’ personal assistant, Indigo.
Indigo users must sign up for an account, where they also provide a date of birth. This is useful information, since we can act on it to improve and customize Indigo based on how different age groups interact with the assistant. But can we trust the age that users provide?
There are at least three reasons why we might be suspicious of the date of birth provided by the users online, unless there is some additional verification. First, some users might provide an alternative date of birth out of privacy concerns. Second, others will undoubtedly want to inflate their age somewhat, if the service has a minimum age for joining. Finally, if the user interface for entering the date of birth requires a lot of clicking and scrolling to set the correct day, month, and year, then why not save some time and pick the first available date? For business analytics, we need to maintain a healthy scepticism towards user-supplied age metadata. At the same time, the age of the user promises important insights. To act on those insights, we must first decide how trustworthy they are.
The key lies in data science. By combining data retrieval, data processing, and statistical techniques, we can draw an accurate picture of how the users, sorted into age groups, differ. The basis is what the users actually say to Indigo. This way, we can assess the validity of the age groups themselves by comparing them to each other and looking for conspicuous outliers. In theory, this sounds simple enough, but as any data scientist can attest, obtaining the right data in the right format is often a practical obstacle.
Retrieving the Indigo data was straightforward in this case, thanks to Teneo Analytics API, a retrieval and processing API for conversational log data. The API does far more than retrieve the raw logs, however. It provides the number of sessions that users have had with Indigo, tabulated by age group, alongside keywords from the user inputs. These two-word keywords, or bigrams, are identified in the user inputs by the API itself. Teneo Analytics API also interfaces easily with other software systems, and for this analysis data from September and October 2016 were piped into the statistics software R, one of the most popular data science platforms.
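To make the idea of bigrams tabulated by age group concrete, here is a minimal sketch in Python. It does not show how Teneo Analytics API works internally. The sample inputs, the whitespace tokenization, and the function names `bigrams` and `bigram_table` are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy inputs as (age band, user input) pairs.
# Invented for illustration only; these are not real Indigo logs.
SESSIONS = [
    ("18-29", "play music please"),
    ("18-29", "change your name"),
    ("30-39", "set alarm for seven"),
    ("30-39", "wake up at six"),
    ("30-39", "set alarm now"),
]


def bigrams(text):
    """Return adjacent word pairs from a lowercased, whitespace-split input."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def bigram_table(sessions):
    """Count bigrams per age band: {age_band: Counter({bigram: count})}."""
    table = defaultdict(Counter)
    for age_band, text in sessions:
        table[age_band].update(bigrams(text))
    return table


table = bigram_table(SESSIONS)
for age_band, counts in sorted(table.items()):
    print(age_band, counts.most_common(2))
```

A table like this, with age bands as rows and frequent bigrams as columns, is the kind of input the analysis in this post relies on. A real pipeline would likely also filter out very common function words, which this sketch does not attempt.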
By applying an unsupervised exploratory technique known as correspondence analysis to the most frequent bigrams, we can see the phrases that are most typical for the different age groups. The figure below visualizes the interaction of phrases and age groups in a revealing manner. The figure is a coordinate system, where age groups (blue dots) are plotted close to their most typical bigram phrases (red triangles). The analysis has arranged the figure so that the strongest explanatory effects are found on the horizontal (left-right) axis.
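For readers curious about what correspondence analysis does under the hood, the sketch below computes only the first (horizontal) axis for a small table of counts. It is not the R code used for this post. The counts are invented, the function name `correspondence_axis` is hypothetical, and the simple power iteration stands in for the full singular value decomposition that a statistics package would normally use.

```python
import math

# Hypothetical counts, invented for illustration only.
AGE_BANDS = ["0-17", "18-29", "30-39", "40-49"]
BIGRAMS = ["play music", "change name", "set alarm", "wake up"]
COUNTS = [
    [30, 20, 5, 5],
    [25, 15, 10, 10],
    [5, 5, 30, 20],
    [5, 5, 25, 25],
]


def correspondence_axis(counts, iterations=200):
    """Return first-axis principal coordinates for the rows and columns."""
    n_rows, n_cols = len(counts), len(counts[0])
    total = sum(sum(row) for row in counts)
    row_mass = [sum(row) / total for row in counts]
    col_mass = [sum(counts[i][j] for i in range(n_rows)) / total
                for j in range(n_cols)]

    # Standardized residuals: (observed - expected) / sqrt(expected),
    # with observed and expected expressed as proportions of the total.
    S = [[(counts[i][j] / total - row_mass[i] * col_mass[j])
          / math.sqrt(row_mass[i] * col_mass[j])
          for j in range(n_cols)] for i in range(n_rows)]

    # Power iteration on S^T S to approximate the leading right singular vector.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(S[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("rows and columns are independent; no axis to find")
        v = [x / norm for x in w]

    # S v equals sigma * u1, the scaled leading left singular vector.
    u = [sum(S[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    row_coords = [u[i] / math.sqrt(row_mass[i]) for i in range(n_rows)]
    col_coords = [sigma * v[j] / math.sqrt(col_mass[j]) for j in range(n_cols)]
    return row_coords, col_coords


row_coords, col_coords = correspondence_axis(COUNTS)
for label, x in zip(AGE_BANDS + BIGRAMS, row_coords + col_coords):
    print(f"{label:12s} {x:+.3f}")
```

Age bands and bigrams that end up with the same sign on this axis sit on the same side of the plot. The overall sign of the axis is arbitrary, so only the relative positions carry meaning.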
On the left-hand side of the figure, we can see the age groups from 30-39 up to 80-89, and their most characteristic phrases are all related to Indigo’s productivity features: weather, news, and different ways of setting an alarm (“wake up”, “set alarm”, “alarm” followed by a number). Phrases such as “how many” and “tell joke” in the center of the figure are common to all age groups, so they do not tell us much, except perhaps that users of all ages enjoy Indigo’s jokes. But on the right-hand side of the figure, phrases such as “play music” and “change your name” (directed at Indigo) express the polar opposite of the productivity features. On this right-hand side, we tellingly find the youngest users, age bands 0-17 and 18-29, and more surprisingly, the 90-plus age band.
What this tells us is that user age bands are for the most part reliable. If the dates of birth supplied by the users had all been fake, we would not expect such a clear, visible divide between younger users on the one side and older users on the other. The bigram keyphrases show us that this is not accidental: the separation into older and younger users correlates with the distinction between productivity and entertainment.
The odd outlier in this picture is the 90-plus age group. Although it is possible that these nonagenarians are enthusiastically asking Indigo to play music, the fact that they behave so differently from users in their 70s and 80s gives rise to suspicion. A much more plausible explanation is that the majority of users in the 90-plus group are in fact teenagers or twenty-somethings who have provided a fictional date of birth.
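One way to put a number on "behaves so differently" is the chi-square distance between the phrase profiles of two age bands, which is the distance that correspondence analysis is built on. The sketch below is an illustration under stated assumptions, not the analysis behind this post: the counts are invented and the helper name `chi_square_distance` is hypothetical.

```python
import math

# Hypothetical bigram counts per age band, invented for illustration only.
# Columns: "play music", "change name", "set alarm", "wake up".
PROFILE_COUNTS = {
    "18-29": [25, 15, 10, 10],
    "80-89": [5, 5, 25, 25],
    "90+": [20, 15, 8, 7],
}


def chi_square_distance(counts, a, b):
    """Chi-square distance between the row profiles of age bands a and b."""
    total = sum(sum(row) for row in counts.values())
    n_cols = len(counts[a])
    # Column masses: the share of all counts that falls in each bigram column.
    col_mass = [sum(row[j] for row in counts.values()) / total
                for j in range(n_cols)]
    sum_a, sum_b = sum(counts[a]), sum(counts[b])
    return math.sqrt(sum(
        (counts[a][j] / sum_a - counts[b][j] / sum_b) ** 2 / col_mass[j]
        for j in range(n_cols)
    ))


print("90+ vs 18-29:", round(chi_square_distance(PROFILE_COUNTS, "90+", "18-29"), 3))
print("90+ vs 80-89:", round(chi_square_distance(PROFILE_COUNTS, "90+", "80-89"), 3))
```

If a supposedly elderly age band turns out to be much closer to the youngest bands than to its neighbours, as in this made-up table, that is a hint that its stated dates of birth deserve a second look.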
The combination of data science and the voice of the customer allows us not just to judge how accurate the users’ stated ages are, but also to see how the age groups differ from each other. The problem of inaccurate dates of birth is largely confined to one particular age group, which takes much of the uncertainty out of the important user age data. The moral here is that the voice of the customer is not the full story. The voice of the customer leaves an echo that reaches back to other data such as user age. Mature, data-driven organizations ought to heed that echo.