EVENT DETAILS
ADM+S Tech Talk: RADio
15 November 2021
Speakers:
Sanne Vrijenhoek, University of Amsterdam’s Institute for Information Law
Gabriel Bénédict, University of Amsterdam and RTL Netherlands
Duration: 0:19:06
TRANSCRIPT
Kacper Sokol
Let us start then with an acknowledgment of country. So, in this period of reconciliation we acknowledge the traditional custodians of country throughout Australia and their connections to land and community. We pay our respect to their Elders, past and present, and extend our respect to all Aboriginal and Torres Strait Islander peoples today.
Let me then introduce our two lovely speakers. So, we have with us Sanne and Gabriel. They are both PhD students at the University of Amsterdam. Sanne is affiliated with the Institute for Information Law and has a background in artificial intelligence. Gabriel, on the other hand, is also affiliated with RTL Netherlands. So, Sanne, maybe you could briefly tell us what your research focus is, rather than trying to rephrase your bio on the spot.
Sanne Vrijenhoek
Yeah, actually my research focus is what we’re going to talk about today. So, we really try to look at normative diversity in news recommendations, and mostly my job is to talk to different types of people, social scientists, journalism studies scholars, communication scientists, and ask, okay, what do you think diversity needs to be, and then try to translate that into something computable.
Kacper Sokol
Great, thank you so much. And Gabriel.
Gabriel Bénédict
For me, it’s a mix of more applied and theoretical AI. One of my topics is recommendation, another is using metrics as losses for neural networks, and another is video-to-music AI.
Kacper Sokol
Amazing, thank you so much. Feel free to take it away then.
Sanne Vrijenhoek
Alright. So, then I think I will start.
Thank you very much for having us today. I said before what my name is, and today we will present our interdisciplinary work, RADio, which is a collaboration between the University of Amsterdam’s faculty of law, its computer science researchers, and RTL, which is a Dutch media company. Next slide please.
And first off I would like to talk a little bit about news in general. News recommender systems are currently studied extensively for their quite unique characteristics, such as data sparsity and the short shelf life of the articles that need to be recommended. However, what news we show a user may in a way shape their world view, and as such also affect things like deciding who to vote for or choosing a good cause to support. Gatekeeping what news is shown to users has traditionally been done by newsrooms and their news editors, who would decide which news items have priority over others. And news recommender systems would in part be taking over this important role.
But without the means to evaluate and control them in the same way that we would news editors. And because of this, many news organizations are fairly hesitant to adopt this relatively new technology, which is a shame, because personalized news recommendations could also do a lot of good. Next slide please.
So, what we need is a way to evaluate our news recommendations on their normative diversity. And we define normative diversity as enabling citizens to fulfil their role in a democratic society. A normatively diverse news recommender could, for example, increase engagement of citizens with the news, and not just with the entertainment or sports articles that people like to read, but also by presenting them with topics that are in some way specifically relevant to them, maybe with opinions they have not encountered before, with local news, or maybe even by catering to people’s specific stylistic and complexity needs. Next slide please.
And in practice, this whole procedure could work like this. We think that news organizations themselves need to agree on what the goal of a recommendation should be. This does not necessarily have to be one singular thing; you can have multiple recommender systems running in parallel. But after this has been decided, editorial teams and data engineers need to work together in determining how this goal should be expressed in concrete metrics. And our goal with this contribution is to start building a bridge between data and editorial teams: on the one hand, by allowing data teams to evaluate the recommendations based on editorial principles, and on the other hand, by providing editorial teams with more understanding of what happens within a data team. Next slide.
So, the first two questions – oh no, wait sorry. This is kind of for our purposes.
The current standard of diversity falls short, and we will refer to this current standard as descriptive diversity, which is commonly implemented as an intra-list distance. But this notion raises a lot of questions, because what does it mean for two articles to be different or similar? Is more distance always better? And maybe it’s a bit of a philosophical question, because diversity is not a concept that stands on its own. It’s always a question of being diverse in relation to something else, and this means that in order to have some kind of notion of diversity you need to look beyond what is just in the recommendation itself. Next slide.
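(For illustration only: a minimal sketch of descriptive diversity as intra-list distance. The cosine-distance choice and the article vectors are placeholders, not anything specified in the talk.)

    import itertools
    import numpy as np

    def intra_list_distance(article_vectors):
        """Average pairwise cosine distance over the items in one recommendation list."""
        pairs = list(itertools.combinations(article_vectors, 2))
        if not pairs:
            return 0.0
        distances = []
        for a, b in pairs:
            cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            distances.append(1.0 - cosine_similarity)
        return float(np.mean(distances))

    # Hypothetical article feature vectors (e.g. topic or embedding vectors).
    recommendation = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
    print(intra_list_distance(recommendation))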
Yeah, never mind. Let’s leave it here. These first two questions have been the topic of some earlier work, where we proposed a set of diversity metrics grounded in democratic theory, and it will sometimes be necessary to refer back to the concepts developed there. But with regard to time, we cannot really go into that more; I think on the previous slide there was a QR code for anybody who is interested. So, instead, what we propose is a base formulation of diversity based on f-divergence, shown here, expressed as the divergence between P and Q. With this we look at a recommendation as a distribution of values, as opposed to just comparing means, for example. And this means that there are a couple of decisions we need to make. First, we need to choose the relevant features, what you put into your distribution, and this could for example be the political actors mentioned in an article. We need to identify those in our recommendations, and this would then be the Q distribution. Next, we need to choose the relevant context distribution, or what we compare our recommendation to. This could, for example, be all the items that were available at a point in time, the user’s reading history, or what was recommended to other users. And last, we need to determine whether we want the divergence between these two distributions to be high, low, or something in the middle. These choices can be inspired by the metrics and the theory proposed in the earlier mentioned work, but they are also flexible for different interpretations or goals. In summary, we do not propose a universal norm for diversity, but rather a procedure for expressing a type of diversity that matches your specific use case and goals.
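(As a rough illustration of these choices, feature, recommendation distribution Q, and context distribution P, here is a minimal sketch. The topic labels and the use of a plain KL divergence are placeholders, not the RADio implementation.)

    from collections import Counter
    import numpy as np

    def to_distribution(labels, support):
        """Turn a list of categorical feature values into a probability distribution."""
        counts = Counter(labels)
        total = len(labels)
        return np.array([counts.get(category, 0) / total for category in support])

    def kl_divergence(p, q, eps=1e-12):
        """Plain, rank-unaware KL divergence D_KL(P || Q)."""
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log2(p / q)))

    # Hypothetical categorical feature: the main topic of each article.
    recommended_topics = ["politics", "politics", "sports"]                      # the recommendation (Q)
    available_topics = ["politics", "sports", "economy", "economy", "culture"]   # the context (P)
    support = sorted(set(recommended_topics) | set(available_topics))

    Q = to_distribution(recommended_topics, support)
    P = to_distribution(available_topics, support)
    print(kl_divergence(P, Q))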
And now Gabriel will talk a little bit more about the specifics of the divergence calculation and our example implementation on the Microsoft News Dataset (MIND).
Gabriel Bénédict
Thanks, Sanne. So, remember, we have a general, generic divergence measure with Q, the recommendations, and P, the reference. Now we set up some mathematical requirements. We said, okay, why not make this a traditional distance measure with the nice properties of identity, symmetry, and the triangle inequality. Second, we also said it would be nice if our divergence metric was bounded between zero and one, so we can compare different recommendation algorithms. And then finally, let’s make this rank aware, because we are in a setting where people end up on a news website, they scroll, and there’s a certain propensity to keep scrolling; at some point people get tired of scrolling, or maybe someone clicks on the first item that they see and then never sees the rest. So rank is very important in that context.
So, we looked a little bit at the literature on divergence, and we found our VAE friends, who use KL divergence a lot, even at training time, and they have some nice illustrations of how KL divergence is not symmetric. So, we borrowed that image, and the QR code here points to the original blog post. What’s important in our case is that we didn’t want to make a statement about what should be P and what should be Q: what should be the reference set and what should be the recommendation set, and which one should be compared to which. And because we didn’t want to make such a statement, we wanted a symmetric measure. That’s why we went instead for the Jensen-Shannon divergence. And then the second point is the rank awareness that I mentioned before. You can make something rank aware by simply decaying the importance of elements as you go down the list; you can decay in a traditional learning-to-rank fashion, with an MRR-style or NDCG-style discount. You can see on the image what that means: the MRR-style discount decreases a lot more at the beginning. Importantly, as the dot at the bottom right shows, if you want to decay and you have log data, so you know what the users do, you can base this on a click model. So, you actually check what is happening on your website, what the propensity to click at different positions is, and then use that actual propensity as the discount rather than a generic discount.
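(A minimal sketch of the two ideas mentioned here, a symmetric Jensen-Shannon divergence and rank-based discounting. The 1/rank and 1/log2(rank+1) decays stand in for the MRR-style and NDCG-style discounts and are not the exact weighting used in the paper.)

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        """Symmetric Jensen-Shannon divergence; with log base 2 it stays within [0, 1]."""
        p, q = p + eps, q + eps
        m = 0.5 * (p + q)
        return float(0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m)))

    def rank_discounts(n, scheme="ndcg"):
        """Weight per rank position: steep 1/rank decay or gentler logarithmic decay."""
        ranks = np.arange(1, n + 1)
        if scheme == "mrr":
            return 1.0 / ranks
        return 1.0 / np.log2(ranks + 1)

    def rank_aware_distribution(labels_per_rank, support, scheme="ndcg"):
        """Build a distribution in which higher-ranked articles count more."""
        weights = rank_discounts(len(labels_per_rank), scheme)
        dist = np.zeros(len(support))
        for weight, label in zip(weights, labels_per_rank):
            dist[support.index(label)] += weight
        return dist / dist.sum()

    support = ["economy", "politics", "sports"]
    Q = rank_aware_distribution(["politics", "politics", "sports"], support)
    P = np.array([0.4, 0.3, 0.3])  # hypothetical reference distribution
    print(js_divergence(P, Q))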
So, to sum it up, we have a rank-aware f-divergence. What do we mean by f-divergence? The concept was defined in 2006, and it’s quite suitable because it’s a mathematical concept that allows you to summarize different kinds of divergences, including the KL divergence that I mentioned before from our VAE friends.
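(For illustration, a small sketch of the f-divergence idea, D_f(P || Q) = sum over x of Q(x) * f(P(x)/Q(x)), where swapping the generator f recovers different divergences. The generators below are standard textbook ones, not code from the paper.)

    import numpy as np

    def f_divergence(p, q, f, eps=1e-12):
        """D_f(P || Q) = sum_x Q(x) * f(P(x) / Q(x)) for a convex generator f."""
        p, q = np.asarray(p) + eps, np.asarray(q) + eps
        return float(np.sum(q * f(p / q)))

    f_kl = lambda t: t * np.log2(t)             # recovers KL divergence
    f_tv = lambda t: 0.5 * np.abs(t - 1.0)      # recovers total variation distance
    f_js = lambda t: 0.5 * t * np.log2(t) - 0.5 * (t + 1.0) * np.log2((t + 1.0) / 2.0)  # Jensen-Shannon

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.3, 0.3, 0.4])
    print(f_divergence(p, q, f_kl), f_divergence(p, q, f_js), f_divergence(p, q, f_tv))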
This is what we call our RADio framework, or rather, this is part of our RADio framework: the rank-aware f-divergence. What f allows you to do is basically replace that f with certain functions, which give you the intended distance, something that is a lot harder to write and to express if you don’t have that f. There are two limitations. First, in that discrete formulation, we have to bin values. And the second limitation is that we cannot induce a relation between categories. The star stands for rank awareness. We looked a little bit, and this is just one of the plots, at a lot of sensitivity plots. So, what happens if we use the Jensen-Shannon or KL divergences, and what happens if we are rank aware or not rank aware. And this is an example where, by using rank awareness, we get more realistic results, especially for the random case.
It’s also the case if we look at different recommender strategies. So, we benchmarked a number of neural recommenders, plus a most-popular recommender and a random recommender, and LSTUR, which was the best performing neural recommender, which is why it ends up on this plot. And this is just to show what happens when you measure the metrics at different ranking cut-offs.
So again, this is our final table of results. We just look at two columns of the final results table to simplify the argument. On the top we have the metrics, and on the side the algorithms. It’s nice that I can point with the mouse. Here we have four neural algorithms, here we just have the most popular pieces of news being recommended, and here we just have random recommendations. And then we look at what happens on the metrics that we measure with RADio. So, for calibration on topic, which measures the divergence between the topics in the recommendation and the topics in the user’s reading history, we can see that the non-neural methods are actually performing better. The reason for that is that the neural methods learn from the history of the user, and it’s very likely that the neural recommender will recommend something that’s similar to the history of the user; the history of the user covers certain topics, so you would expect those topics to appear again.
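(A minimal, self-contained sketch of topic calibration as described here, comparing the topic distribution of a recommendation with that of the user’s reading history. The topics and the unranked Jensen-Shannon divergence are illustrative, not the paper’s code.)

    from collections import Counter
    import numpy as np

    def to_dist(labels, support):
        c = Counter(labels)
        return np.array([c.get(s, 0) / len(labels) for s in support])

    def js_divergence(p, q, eps=1e-12):
        p, q = p + eps, q + eps
        m = 0.5 * (p + q)
        return float(0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m)))

    history_topics = ["sports", "sports", "politics", "sports"]
    recommendation_topics = ["sports", "sports", "sports"]
    support = sorted(set(history_topics) | set(recommendation_topics))

    P = to_dist(history_topics, support)          # context: the user's reading history
    Q = to_dist(recommendation_topics, support)   # the recommendation
    print(js_divergence(P, Q))  # low value: the recommendation mirrors the history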
So, the idea is then to look below, at the norms. If you think in terms of the liberal model, then recommendations should be tailored to the user’s preferences, so you would expect low divergence in topic calibration, and thus you would favour the neural recommenders. If we take another example and you say you follow the critical model, then we go here on the right and look at where the divergence is highest for Alternative Voices, because in that case, high divergence on Alternative Voices means more alternative voices are present. And then it means that maybe we should favour random recommender systems, to make sure that alternative voices effectively appear in people’s recommendations. So again, here on the left we have the norms, and here on the top we have the metrics, depending on the norms.
The take-home message is not that neural recommenders are bad; it’s that neural recommenders can be suitable for certain aspects of diversity, but they need more optimization for others, and I’ll come back to that in a moment.
So, first, what are the next steps? First, there are many relevant features that are hard to extract in our data preparation and NLP pipeline. How do you approximate the degree of activation? How do you assess the political standpoint of an article? How do you identify minority voices? That’s a very hard problem. For example, if you see someone that you can’t find on Wikipedia, how do you decide what that person’s political standpoint is? It’s also tempting to then assign them to a minority, because you might think that if that person is not there, maybe they belong to a minority. So there are all these kinds of troubles. And then second, we do not define a universal vision of normative diversity (what does it even mean to be normatively diverse?), and we don’t have a reference point outside of our dataset when we measure the diversity metrics. And now I come back to what I said earlier.
So, the metrics. Maybe we could use these metrics as a loss for recommender optimization, rather than just for post-hoc evaluation. But it was important here to lay a first step towards diversity at evaluation time. And then finally, we’re really keen to collaborate with news organizations, and we’re going to mention that later I think, but there are really, I think, concrete things coming up. And finally, my last slide. Thank you for listening. Happy to hear thoughts, ideas, comments. Here are our Twitter handles, although Twitter is over, I heard; I should put my Mastodon handle. Have a nice year, and I hope to hear interesting things. On the left is the link to the code and the paper.