Automating Wikimedia: Open Knowledge, Linked Data and Search
13 May 2021
Dr Amanda Lawrence: Research fellow & Wikimedian in Residence, ADM+S (chair)
Assoc Prof Heather Ford: Head of Discipline for Digital and Social Media, UTS
Liam Wyatt: Senior Program Manager, Wikimedia Foundation
Prof Mark Sanderson: Chief Investigator, ADM+S
Prof Julian Thomas: Director, ADM+S
Dr Amanda Lawrence:
Hello and welcome everyone. We’re just getting going. People are joining us, which is fantastic. Just give it one more minute. Alright, so my name’s Amanda Lawrence, and thank you all for coming along to the ADM+S seminar on automating Wikimedia. I’m a research fellow at RMIT University and at the ARC Centre of Excellence for Automated Decision-Making and Society, and I am also a Wikimedian in residence at ADM+S. This seminar is part of making connections between researchers and the Wikimedia ecosystem, including Wikimedia Australia, which is the local chapter of the Wikimedia Foundation.
So, I’d like to start by acknowledging the Woiwurrung people of the Kulin nations as the traditional owners of the land on which RMIT University stands, and where I’m speaking from today, and to respectfully recognise elders, both past and present, and their collective wisdom handed down through the generations. I’d also like to remind you that this seminar is being recorded and will be made publicly available, hopefully in about a week or so, on the ADM+S YouTube channel. So, presenting today we have Professor Julian Thomas, Director of the ADM+S and Distinguished Professor in the School of Media and Communication at RMIT University, who will give a brief introduction. He’ll be followed by Professor Mark Sanderson, Dean for Research and Innovation at RMIT University for the Schools of Engineering and of Computing Technologies in the STEM College; Mr Liam Wyatt, Senior Program Manager at the Wikimedia Foundation, who’s been closely involved with the development of Wikimedia Enterprise; and Associate Professor Heather Ford, Head of Discipline for Digital and Social Media in the School of Communication at UTS. And it was actually Heather being awarded an Australian Research Council Discovery grant that was the inspiration for bringing this group and this topic together. The three speakers will present for around 10 or so minutes each, and hopefully there will be time for some questions at the end. If you have questions, please put them in the chat and we’ll try to get to them at the end, where either I can read them out or you can ask them yourself. So, without further ado, I will hand over to Professor Julian Thomas.
Prof Julian Thomas:
Thanks, Amanda. I won’t take 10 minutes; just a brief introduction and some context, if I can. We thought it’d be good to tell you a little bit about our ADM+S centre and how it’s organised – sorry for that interruption – and also why this topic is so important for us there. So, the ARC Centre of Excellence for Automated Decision-Making and Society is funded by the Australian Research Council, an Australian Commonwealth Government research funding body, and our centre is designed to develop the knowledge base which is required for responsible, ethical and inclusive automated decision-making. It’s a multi-disciplinary centre; I think you can get a flavour of the multi-disciplinarity that’s necessary to engage with the sorts of issues we’re interested in today just from the composition of the speakers that we have. It brings together researchers from the humanities, the social sciences, and the data sciences. It’s also a distributed centre: it comprises nine Australian universities and many industry partner organisations and overseas research organisations. We’re working not just on how automated decision-making systems work, but also on how people use decision-making technologies across the whole spectrum, from various kinds of automation and artificial intelligence to the blockchain. We’re interested in understanding the institutional contexts of automation, and, especially relevant today, we’re very interested in how the flows of data collections, data markets and data-managing organisations shape automation in a range of contexts.
We do that in order to better understand how we can manage the risks and benefits of accelerating digital transformation in four key domains of social life: in news and media, in health, in transport and mobility, and in social services. So, all areas where the kind of data we’re talking about today is particularly important. I wanted to thank Amanda and all our participants for coming together to talk about this today. Wikimedia is critically important for our understanding of the broader information ecology of search, which we see as a fundamental technology here; I think we’ll be talking more about that soon. We also see it, of course, as a key location for the operation of pioneering automated systems in areas like content moderation, in editing, and in various kinds of automated content, so in areas like summarising and so forth. And it’s also a critical area where we can better understand the engagement of people with automated systems. It’s an extraordinary source, of course, of text and images for the training of machine learning systems. So, really significant for us, very important for us to know more about it. And I’ll hand back to Amanda so we can get into the content. Thanks very much.
Dr Amanda Lawrence:
Thanks, Julian. That’s fantastic. So, we’ll be starting with Mark Sanderson. So, I’ll hand over to Mark.
Prof Mark Sanderson:
Thanks very much. Let me get my – okay. So, Amanda asked if I could speak a little bit about search and also information boxes. And so, I thought the simplest thing that we could do, assuming you can see my screen, is just type in some searches and do a bit of a tour of how search and information boxes sort of work. So if, for example, I type in a search – let’s go for the name of the centre – and search. Search is something that we’ve become extremely familiar with. In fact, Google is coming up for its 25th anniversary; it was, I think, 1998 when the first beta versions of Google were starting to operate. So, while it’s something that we’re very familiar with, it’s always struck me that I don’t know how familiar people are with the way that search actually operates. And here what I’m talking about is this part of the search page, the so-called organic part of the search result page. Many people talk about something called PageRank, which Google described in one of its very early papers, but actually that plays a very minor role. When you type in a query, you’ll notice that Google claims there are about 80 million documents that contain the words that I typed in, those five words. So, how does Google decide which of those 80 million to put into the top 10 of that organic search result page?
It’s largely a process of using multiple clues. You might be surprised: if the words match in the title of the web page, that’s a pretty strong clue that it uses. They also use a little-known but, I think, very powerful technique: they look at the links that point to people’s pages. So, take the ADM+S centre. When we first launched the ADM+S centre web page, it wasn’t featuring very high in Google’s ranking. And so, possibly one of my more significant contributions to the centre so far was to advise Julian on how to get the web page to the number one position in Google. And the answer was to get as many of the people who are partners in the centre to link to this website, and to make sure that the blue anchor text in that link had the name of the centre in it, because Google actually doesn’t worry too much about what words are on your page, but it worries a lot about the words that other people use to describe your page. And in order to get a description of what those words are, it looks at the text in the links that point to your pages. And that’s a very strong clue that Google uses when it’s doing search. So, the organic part of search that we’ve been seeing for so long is essentially a word-overlap process: it looks for matches between queries and documents and sorts the list of documents by the degree to which each document matches your query. But we also see these boxes over here, these info boxes, and Amanda has given me 10 minutes so I won’t really have the time to talk about how Google goes about doing this, but what you’re seeing in these info boxes on the right-hand side is Google’s knowledge graph. So, about a decade ago they started building a knowledge graph, which is a structure of information that represents knowledge gained from trusted sources.
So, a big trusted source that they use is Wikipedia. But they use a number of different trusted sources, and they’ve essentially come up with methods to try and automatically extract information from these trusted sources, which then gets placed into the knowledge graph. And then when we do searches on Google we get matches on the documents that Google has, but we also get matches in their knowledge graph, which they then structure into this information box. The knowledge graph is prone to making mistakes. People quite often say, have you seen the mistake that Google has made in trying to pull this information out of the knowledge graph?
My favourite one is this one: how many legs does a chicken have? I keep checking this every time because I keep wondering whether they’ve fixed it, but if you ask Google how many legs a chicken has, it tells you that it has four. Apparently this is because it translates the word legs into limbs, and birds have four limbs. But you get this kind of answer. So, the knowledge graph does have mistakes in it, or can end up making these kinds of mistakes. One of the things that is also perhaps worth noting is that we’re a lot less tolerant of mistakes with the information box. I mean, I could type in ADM+S, which is the name of our centre, and discover that ADM+S does not bring up the Automated Decision-Making and Society centre, but I don’t blame Google for that, I blame myself: oh, I typed in the wrong query, I should try a different query. But when you ask Google to find information about a particular person, you set this higher standard about what you want to see, and so the information box is something that you want to work well. It doesn’t only pull information from Wikipedia. So, I don’t have a Wikipedia page, but what it’s pulling here is, it seems to have discovered that there is this entity called Mark Sanderson who seems to be a researcher, and it seems to be getting that probably from my Google Scholar page and then noticing that there are connections with other social media pages as well. It struggles with these information boxes; it struggles with ambiguity. So, there are actually three Mark Sandersons according to Google. There’s a screenwriter who’s written stuff for Hollywood and has his own website, Twitter and so on. He’s got an entry on IMDb; I’m sure IMDb is another trusted source that they use.
There is a separate author called Mark Sanderson, and the system is a bit confused about this, so this isn’t me: there’s somebody who writes literature in the UK, and the knowledge graph is getting confused about who’s who. So, these info boxes, as I said, are pulling information out of the knowledge graph, but the knowledge graph is not perfect by any means. Another thing that you can notice when you’re looking at these info boxes – just a little tip for you – you’ll notice this is my info box here. If I type in the info box of somebody else in the Centre of Excellence – let me make sure I’ve typed her name correctly – Deborah Lupton, for example. So, here’s an info box of Deborah’s collected information. One of the things you’ll notice is that a lot of these info boxes have this little button down here that says claim this knowledge panel. So, one of the other sources for info boxes is that people can actually register themselves with Google and control the info box themselves. Deborah hasn’t done that; not really any need to, Deborah’s info box is in excellent shape. You’ll notice that the button isn’t there for me, because I did actually claim access to this particular page. You have to go through an authentication process with Google and then they give you access to that page, and so I can send them edits and suggestions for that page. So, these info boxes are actually a mixture of information extracted from things like Wikipedia, pulled into the knowledge graph, but they also have trusted people, who’ve gone through an authentication process, who can make suggestions about how the box works. I could probably stop there, Amanda, but that was just a quick overview of search and information boxes.
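The word-overlap ranking Mark describes, with anchor text as a stronger clue than a page’s own words, can be sketched in toy form. Everything here (the weights, the field names, the example pages) is invented purely for illustration; real search engines combine hundreds of signals, not two.

```python
# Toy ranking sketch: score a page for a query by combining two clues Mark
# mentions, words matching the page's own title and, more heavily weighted,
# the anchor text that other sites use when linking to the page.

def score(query, page):
    q = set(query.lower().split())
    title_hits = len(q & set(page["title"].lower().split()))
    anchor_hits = sum(
        len(q & set(anchor.lower().split()))
        for anchor in page["inbound_anchors"]
    )
    # Weights are invented; anchor text is treated as the stronger clue.
    return 1.0 * title_hits + 2.5 * anchor_hits

pages = [
    {"title": "Home", "inbound_anchors": ["click here", "our partner"]},
    {"title": "ADM+S Centre",
     "inbound_anchors":
         ["ARC Centre of Excellence for Automated Decision-Making and Society"] * 5},
]

query = "automated decision making and society"
ranked = sorted(pages, key=lambda p: score(query, p), reverse=True)
print(ranked[0]["title"])  # the well-anchored page ranks first
```

This mirrors the anecdote about the ADM+S site: the winning page scores almost nothing on its own title words, but the descriptive anchor text pointing at it lifts it to the top.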
Dr Amanda Lawrence:
Great. Thanks, Mark. I’m not sure if there are any questions on that straight up, or we can move on to Liam. If anybody wants to raise their hand we could have time for a quick question; otherwise we will move on to Liam. Liam Wyatt, from the Wikimedia Foundation.
Liam Wyatt:
Hello everyone. Good morning from Italy. As you probably can tell, I am Australian, but I am coming to you from the other side of the world, where I live these days. I’d like to speak to you, if I’m able to share my screen, about – let’s try this – is that working for shared-screen purposes?
Yes, excellent. So this is Wikimedia Enterprise, the new, shall we say, search engine optimisation system from the Wikimedia Foundation. It’s the result of many years of discussing what is the best way to have an actual relationship of equals with search engines and other major commercial re-users of Wikimedia knowledge – the commercialisation of information, which is a really interesting challenge from the perspective of knowledge equity, a concept that is key to the strategy of the Wikimedia movement these days, and of knowledge as a service, a play on the concept of software as a service that you will have heard a lot in the computing world, in the IT sector. These are the two key principles of the strategic plan that’s currently in the process of being enacted in the Wikimedia movement: knowledge equity and knowledge as a service. We all know that there is a kind of symbiotic relationship between Wikimedia and search engines – useful content in one direction, but also visibility in the reverse – but that relationship is not one of equals. It’s not necessarily a friendly one; it’s a kind of useful rivalry and mutual dependence at the same time. So, how do we simultaneously provide greater access to high-quality information, to improve the kinds of things that were just presented in search boxes, both the organic results and the knowledge panels on the side? Not just for Google, but for any search engine. And equally importantly these days, audio: things like Alexa and Siri, those kinds of virtual digital assistants that are audio-operated. And also, here’s the rub: it’s not just a question of the data, it’s a question of the finances. We have this really interesting challenge of sustainability within the movement. The data is provided under a free licence. Anyone can use it. Anyone can use it for commercial purposes; that is entirely within the point and purpose of Wikimedia knowledge.
It is freely licensed, and so, being a non-commercial organisation and a non-commercial movement, we don’t have ads. Simultaneously, it is core to the mission that we don’t restrict, or have any problem with, someone else downstream using that information for commercial purposes; that’s fine. But that results in a bit of a question of the tragedy of the commons, if you’re familiar with that metaphor: by virtue of having a libertarian approach where anyone can do whatever they want, the majority of people are crowded out by the large players in the field. The metaphor of the tragedy of the commons refers to the idea of a commons in a town being overgrazed by a couple of people, to the benefit of their self-interest, which means that most people no longer have access to good-quality grass. So, this is a question of reversing that financial dependence, where large organisations like search engines pay for high-speed, high-stability, commercially reliable, contractually provided access to the same data as everyone else, thereby reversing the financial trend: they subsidise us rather than we subsidise them.
And the second part is ensuring that the quality of the information is the same, if not better – because it is the same data – but that those users have more signals. We’re calling them credibility signals: as a concept, a way to know what to update, and when, on their own basis; they can do what they want to. They can hold back from updating because maybe there’s a lot of vandalism, maybe there’s a lot of things changing. So it’s important to acknowledge, or to re-emphasise, that the data is the same, but we’re providing more signals to re-users, particularly those who are using it at high speed, like search engines. Some may wish to take everything as fast as they can – all updates, all changes – and refresh their internal knowledge graph as frequently as possible, and that’s fine. Others may wish to hold back for precisely the same reason, because there are lots of changes. That’s a different business decision, a different policy about how they want to take information from the internet and refresh it within their own data set.
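The credibility-signals idea can be sketched as two different consumer policies over the same update. The field names here are invented for illustration and are not the actual Wikimedia Enterprise schema; the point is only that the data is identical and the policy belongs to the re-user.

```python
# Sketch: the same article update carries metadata signals; each re-user
# applies its own policy for whether to ingest it into its knowledge graph.

def cautious_policy(update):
    """Hold back while a page looks unstable (possible vandalism)."""
    s = update["signals"]
    unstable = s["edits_last_hour"] > 20 or s["anonymous_editor_share"] > 0.5
    return not unstable  # ingest only once things have settled

def eager_policy(update):
    """Take every change as fast as possible."""
    return True

update = {
    "title": "Example article",
    "signals": {"edits_last_hour": 35,
                "anonymous_editor_share": 0.6,
                "readership_spike": True},
}

# Same data, two legitimate business decisions.
print(eager_policy(update), cautious_policy(update))
```

An edit spike could mean vandalism or newsworthiness; the sketch deliberately does not decide which, matching the service’s agnosticism about how signals are read.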
And through this service, Wikimedia Enterprise is agnostic as to how you use the information. That’s quite important as a principle: it’s not saying this is good or this is bad, this is reliable and this is not; that’s a question for the community to define. But it allows downstream users to have information, within the metadata of what is being sent to them, to make up their own minds based on publicly available information: things like, this page currently has a spike of readership, this page currently has a spike of anonymous editors, lots of different editors are suddenly changing this page. That could be an indication of vandalism; that could be an indication of newsworthiness. They’re just different interpretations of the same data, and it’s up to the re-user to define how they choose to read it. That’s a kind of introduction to the principle, or the purpose, of this new concept, both from a financial and an ideological perspective. But rather than continue talking about the business model and so forth, I want to throw the microphone back to Amanda and Mark and potentially Heather, to make it more conversational. I can of course continue talking about this, but I wanted to show that we’re responding to the interests of the audience, not just the interests of the speaker.
Dr Amanda Lawrence:
Thanks, Liam. We actually had a little bit of trouble getting Heather connected here, so we’ve just got a few technical difficulties which we’re trying to work through. Oh, I think that might be Heather. Hello, Heather. Thank you.
Assoc Prof Heather Ford:
I’m so sorry about the problems with the Zoom login. All good, I’m just on a completely different machine; my colleague brought me his machine, which was super helpful. And I’ve just managed to get my presentation up, so I am ready to start. I haven’t heard the other presentations, but if you let me share my screen it should work.
Dr Amanda Lawrence:
Okay let’s go on with that, and then we can bring things together and have a chat afterwards.
Assoc Prof Heather Ford:
Okay, awesome. So, 10 minutes, right, Amanda? 10-12 minutes. Okay, great. So I just wanted to share a little bit of the research that I’ve been doing over the last few years, leading into the new research that I’m really interested in. Some of it’s going to come out in a book, which I’ll talk about, and I’ve entitled the presentation today The Automation of All Human Knowledge. So, if this helps, if this works. No, this is a completely different presentation, apologies.
Oh, I’m playing it from Dropbox, that’s why. Okay, so the thing that I want to talk about today is two events, which you’ve probably now heard about from Liam, if not the other speakers, that happened in 2012. One was the launch of Wikidata, which described itself as a kind of knowledge base that houses structured data. So the idea of Wikidata is that rather than housing information in articles, like Wikipedia does, or in documents, information would be organised according to relationships. And the idea of structured data is that it enables information to be automatically updated across a range of different language versions of Wikipedia. So, for example, Wikidata now houses the interwiki links between different language versions of the same article, and also the infobox data. So, if we had a change of president or prime minister in Australia, for example, you’d very easily be able to change that in one language and have it automatically replicated across over 300 languages of Wikipedia. So the idea here is that it really saves time and makes Wikipedia more current.
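The propagation Heather describes can be sketched as a single stored statement that every language edition’s infobox renders from, so one edit appears everywhere. The item, office holders, and label strings are invented placeholders, and the structure is far simpler than real Wikidata, which uses item and property identifiers such as Q-numbers and P-numbers.

```python
# Sketch: store a fact once; infoboxes in every language render from it.

statement = {"item": "Exampleland",
             "property": "head of government",
             "value": "A. Founder"}

# Per-language labels for the property (the value itself is shared).
labels = {"en": "Prime Minister", "de": "Premierminister", "fr": "Premier ministre"}

def infobox_row(lang):
    return f"{labels[lang]}: {statement['value']}"

statement["value"] = "B. Successor"  # one edit...
print([infobox_row(lang) for lang in labels])  # ...replicated in every language
```

This is exactly the time-saving property claimed for Wikidata: a change of prime minister is made once, not once per language edition.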
Now, another thing that happened in 2012, which you may have also heard about, is the knowledge graph, and this is also really significant. Google launched the knowledge graph in 2012, and what it meant, Google was saying, was that instead of receiving a list of possible websites where you could resolve your query or your question, now you could actually have facts presented to you directly within Google. And those facts would be presented in what’s called a knowledge panel. You can see an example there from the blog post that launched this, by Amit Singhal in 2012. And so, both of these implementations were built around the same idea, the development of structured data, and this idea of the semantic web, which had been brewing for many years but hadn’t seen any major projects. These were two really significant projects that gave the semantic web a real boost. Now, the argument that I am going to advance today is the following.
So, the first idea is that the rhetoric around these new projects is that they’re purely technical. We’ve thought about them as really technical, just technical means of solving problems, and that their data is objective. As social scientists we always balk at this idea that data can be objective, but it’s really been spoken about a lot by multiple projects, that somehow semantic data or knowledge graph data rises above other kinds of data in its objectivity. And then finally, that their outcome is to enhance efficiency. So, that is what the outcome of this data is said to be, and there are many people in the Wikimedia movement, for example, who believe that this is a really effective way of sharing open factual data. This was always the point of Wikipedia information, that it would be shared as widely as possible. The argument that I’m advancing, in the very short time that I have today, is that actually this development of structured and semantic data, in this kind of alliance that I’m calling the new move towards structured data by many of these large organisations, requires significant social and political resources that aren’t often considered, and actually has really important implications for human agency and algorithmic decision-making, and for the diversity of knowledge production. So, they really have important social and political consequences, and importantly, these impact how we come to know ourselves and each other. So, I think this is a really significant area that is not really being addressed so much at the moment.
So, many of these ideas are – apologies, I keep having to move this thing because it’s in the Dropbox – but many of these ideas are in my book, which is going to be released in November, with a foreword by Ethan Zuckerman. But today I’m just going to talk about three examples.
So, my argument – really what I’m trying to show you today – is just some examples of how this move to structured data is actually very political, and how the decisions that are being made have significant political, representational consequences. So, what I did in this book is I looked at a single article on Wikipedia, the 2011 Egyptian revolution, and I basically traced, over a period of 10 years, how facts that were created and curated in Wikipedia then moved through the infrastructure of the internet, and eventually moved into things like the knowledge graph, which was developed and launched a year later. So, I’ll show you three examples in the time that I have.
So, the first example. Nowadays, using semantic search, you can actually ask Google what is the Egyptian revolution, and what you’ll get is an info box. Today, this is what it looks like. I’ve been following and tracing it over ten years, which has seen some very interesting changes, and one of the things that you’ll notice is that Wikipedia is usually cited on the first kind of definitional statement, but then usually it’s not cited in any of the statements afterwards. There is computational research that suggests that Wikipedia is a common source even for those facts that aren’t cited. Now, some of the debates about the info box in particular, in relation to this, are around the politics of extraction, citation, and verification. These are questions like: what selections are made in the extraction? Because as you’ll see in that example, it’s not a perfect extraction of the Wikipedia definition, the first line of the Wikipedia article. So, there are selections being made here. The second really important question, which has been subject to lots of debate in the Wikimedia movement, is whether the hyperlinks just bring users back to the internal platform pages. In Google’s case there has been some research saying that, with the development of the knowledge panels, increasingly users are being kept within Google’s pages rather than sent out to the source. And then finally, if and how the source is cited. So, questions like: can users easily go to the target article to do things that are related to user agency? Things that in Wikipedia are very normal, like checking the source, changing and correcting information, or at least engaging in a debate with other editors. And then finally, something that’s really important to the Wikimedia Foundation: being able to donate. Because you won’t see the buttons to donate if you’re not on that page.
So, those are some of the debates that have been happening.
The second question, or second aspect of how political this is, is just the example of this date. Now, I followed this article, as I say, for many years, and in particular the 11th of February, because the date when a revolution ends ends up being incredibly politically rife, as you can imagine. So, here I call this event data politics. What we know about data is that, by definition, what data is doing is summarising, simplifying, abstracting; it’s removing nuance and context, and it’s placing this data into new contexts where it gathers new meanings. So, it’s not just a matter of simplifying; it really is a matter of new meanings being gathered in different contexts. And what ends up happening when these kinds of choices are made is that they reflect certain groups’ points of view, often at the expense of others. In this case, the date when the protests ended ended up being really important, because it enabled some to claim revolution – because there was a debate about whether this could be called a revolution and a success – while others were more reticent to do so until, for example, they thought that real change had occurred. And the final example is around the politics of feedback.
Now, the politics of feedback is really around two key questions. One is the ability of users – and perhaps even more importantly, those whom the data and facts represent – to have any agency over the representation of that data if they believe it’s incorrect. But probably even more important are the rules that are used to determine whether a change should be made. So, in the case of the Google knowledge panel, we don’t know how many times it takes for users to click on feedback and say this is incorrect before the information is changed. We don’t know what it means; we don’t know what the rules are. So, these are questions that are very contentious: whether people receive meaningful responses to their feedback, and whether those rules are actually available. So, finally, the thing I’d say is just to reiterate that the development of structured and semantic data – and here there’s work being done by Andrew Iliadis, for example, on ontologies, databases, and specific implementations – is contentious at multiple levels of the semantic or structured data question. These questions are really politically contentious because they have meaningful impact on communities around the world. And the second thing to say is that certain actors will necessarily prevail over others, and we’re starting to see how that’s playing out. One of the reasons is the way in which the data is framed as technical, which will necessarily involve only certain types of people in its production. Another is how the data is resourced: if it’s resourced primarily by advertising, then when we look at building ontologies about events, for example, we’re going to get ontologies built about ticketed events rather than political events.
So, it makes a difference that these semantic web platforms are really arriving at us as monopolies, where billions of people are using them to get information. And finally, just to reiterate, this has really important consequences or implications for those whose statements end up being reflected as consensus reality, and for those who are relegated to the category of opinion, random error, or even misinformation, beyond questions of their truth value. So, all of these kinds of social factors play a role in determining what we end up seeing in the knowledge panels. Thank you.
Dr Amanda Lawrence:
Thank you, Heather. That was really fascinating, rounding out the more straightforward information we’ve looked at so far: looking at what actually comes up in those search results and information panels, and how this has been set up to flow from Wikipedia into search, via something like Wikimedia Enterprise. So, I’m not sure if anybody has questions they would like to put, but I was wondering if Liam would be able to talk a little bit about some of the flows from Wikimedia into something like Google, and whether there’s an understanding of how that is going to be updated, and the dynamics of that. Because actually, a lot of these things – as Liam has said in the chat, things on Wikipedia – are often debated extensively and continually updated, but it starts to look very concrete once you see it in a knowledge graph and in a database, as Heather’s pointing out. So can I put that over to you, Liam?
Liam Wyatt:
Certainly. And exactly to both Heather and Mark’s point: once something appears in the right-hand side of a Google search result, it seems more fixed, more finite, more official than on the left-hand side in the list of results, which is interesting. That’s on us; it’s psychological. It’s still just a web page, the Google result page, but it gives you the impression, as the reader, that this is a fixed fact in time. What makes Wikidata fascinating for me – I’ll get to your question, Amanda, but I wanted to emphasise this – what makes it unique, I believe, is that it’s the only structured data source or database out there that is designed with a metadata schema that acknowledges the messiness of knowledge. You can have contradictory statements for the same fact inside a Wikidata item, each with their own references to different sources, and that’s okay, because it is not a breaking of the data schema. It is designed that way, to acknowledge that knowledge is messy and that information is not finite or merely objective truth in all circumstances. And I don’t think any other kind of catalogue or categorisation system builds this flexibility and messiness into the design.
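The design Liam highlights can be sketched as an item holding two contradictory claims for the same property, each with its own references and a rank, without breaking the schema. This loosely mirrors Wikidata’s preferred/normal/deprecated statement ranks in simplified form; the property name, values, and reference strings are illustrative placeholders, not actual Wikidata content.

```python
# Sketch: a Wikidata-style item where contradictory claims coexist by design.

item = {
    "label": "2011 Egyptian revolution",
    "claims": {
        "end date": [
            {"value": "11 February 2011", "rank": "preferred",
             "references": ["news report (placeholder)"]},
            {"value": "ongoing", "rank": "normal",
             "references": ["opinion piece (placeholder)"]},
        ]
    },
}

def best_value(item, prop):
    """Prefer a 'preferred'-ranked claim, but keep all claims in the data."""
    claims = item["claims"][prop]
    preferred = [c for c in claims if c["rank"] == "preferred"]
    return (preferred or claims)[0]["value"]

print(best_value(item, "end date"))  # one claim wins, but both survive
```

Note how this connects to Heather’s end-date example: a flat infobox must pick one date, whereas this structure records the disagreement itself.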
I should also say that search engines do not currently use the Wikimedia Enterprise API service. It is new, and it will take them a long time, on the order of a year I would say, to switch over their systems for something so deeply embedded as Wikipedia, let alone Wikidata, from the normal systems they have built up over the years to a new API service. API is "application programming interface", for the benefit of the tape. It's the method by which one computer talks to another computer, and the one sending the information structures that information to say: this is the title, this is the first paragraph, these are the categories. It tells you what it is and how to use it; it describes itself. That is in contrast to what is primarily used by search engines and re-users of Wikimedia information to this day, which is scraping: taking the website as it is presented to a human, looking at the code of it, which is available in anyone's browser (you can click on view source and it's all just HTML), having their robots go through the entire website and copy and paste it into an internal database, where they re-munch it, strip it apart and try to put it back together in a format that's useful for that company. Scraping services are designed for mass ingestion from various websites; for Wikipedia they probably have a special one. Of course they don't tell us, it's the secret sauce inside these companies, highly proprietary information, but all of these different search engines are doing it in their own different ways. This makes it highly fragile. What Wikimedia Enterprise is trying to put together, as I was saying in my original presentation, is something that is consistent, stable and high velocity, both in their ability to extract information and our ability to send new information.
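The contrast Liam draws between a self-describing API and scraping can be illustrated with a small sketch. The JSON payload and field names below are hypothetical stand-ins, not the actual Wikimedia Enterprise schema; the point is that the API consumer reads named fields, while the scraper pattern-matches the human-readable page layout and silently breaks when that layout changes.

```python
import json
import re

# Hypothetical structured payload: the response names its own fields,
# so the consumer doesn't depend on how the page happens to look.
api_payload = json.dumps({
    "title": "Example article",
    "first_paragraph": "Example articles illustrate a format.",
    "categories": ["Examples"],
})

def from_api(payload):
    """Read a named field from a self-describing response."""
    return json.loads(payload)["first_paragraph"]

# Scraping instead pattern-matches the HTML built for human readers,
# so a cosmetic redesign of the page breaks the extractor.
html_v1 = '<div class="lead"><p>Example articles illustrate a format.</p></div>'
html_v2 = '<section id="intro"><p>Example articles illustrate a format.</p></section>'

def from_scraper(html):
    """Extract the lead paragraph by matching one specific page layout."""
    m = re.search(r'<div class="lead"><p>(.*?)</p></div>', html)
    return m.group(1) if m else None

print(from_api(api_payload))   # works regardless of page layout
print(from_scraper(html_v1))   # works for the layout it was written for
print(from_scraper(html_v2))   # None: a layout change broke the scraper
```

This is the fragility Liam describes: the same sentence is present in both HTML versions, but moving the markup around defeats the scraper, whereas the structured field survives any redesign of the page.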
That makes it easier for these organisations to obtain reliable results even if something on the human-readable version of Wikipedia changes. If we change something in the interface, if we move the search box from the left-hand column to the top right-hand corner, that would break a scraper, and it has done in the past. It shouldn't, because the interface is not designed for the interests of a search engine: how and where things are laid out on the page, or what the font is, shouldn't matter, but it does break scraping. So simply providing a more structured and stable environment for various organisations to take the information for their needs, with information about what that information actually represents, is a much more stable technological and informational environment to work from than simply saying, as we have always done until this day: here's a website, everyone have at it.
That approach has been ideologically consistent with the idea of free and open source: anyone can use it for any purposes they want, they can extract the information, and that's fine. But what we've discovered is that it promotes, or is to the advantage of, the largest players, because they have the infrastructure and resources to throw at the problem. It benefits monopolistic organisations, some of the largest companies that have ever existed, while smaller organisations, or competitors to those large organisations, that do not have the technological, human or financial resources to reverse engineer large data sets such as Wikidata or Wikipedia are at a disadvantage and can't compete, because they can't take that information and make it useful. By building this new API service that is standardised, which includes the same information you can get normally but in a more structured format, it allows a levelling of the playing field, where different organisations can compete and can provide access, to their customers and their re-users downstream, to high-quality information that has previously only been available if you could throw large amounts of technological and human capital at the problem. That is the importance of "knowledge as a service", the phrase coming from the Wikimedia movement strategic plan: it is no longer sufficient to merely say "here's knowledge, have at it"; we actually have to provide a helping hand, through the language that these re-users understand, which is commercial contracts and uptime standards. Just saying, in a libertarian format, "here's knowledge, it's open, it's freely licensed" is no longer sufficient to meet the mission of the Wikimedia movement. I'll stop there and pass the microphone back.
Dr Amanda Lawrence:
Yeah, thanks, Liam. I don't think we've got any other questions, so I'll keep asking them unless somebody else wants to jump in. I guess, to Heather's point, and perhaps Mark might have some views on this: whose responsibility is it, or how do we think about the verifiability or the contestability of that knowledge? We're assuming that it's high-quality knowledge coming through, but what if it's not? Is that on Wikipedia and Wikimedia, or is it on Google and the other services that are using it, to be able to alert us that not everything is actually totally agreed on, or even a clear fact?
Prof Mark Sanderson:
I mean, I think the search engines, Google or Bing or your favourite search engine, will certainly feel responsibility for the results they're providing. They're not only using the Wikipedia sources, they're using many other ones, and so I'm certain they have some sort of checking processes going on internally to make sure they're reaching some sort of standard, presumably a fairly high standard, for those info boxes. Just for the reasons I showed earlier with the 'how many legs does a chicken have' example: if you make a mistake, people share that thing around and it harms Google. No one's blaming Wikipedia for 'how many legs does the chicken have'; they're blaming Google. So I'm pretty sure the organisation, the search engine, cares about those kinds of things. Certainly I find, as somebody who's gone and claimed his knowledge panel, that they're very responsive to any changes I ask to be made. They tell me both when they're not going to do it and when they have done it, and the response times are usually 24-48 hours. I've no idea about the feedback thing that Heather was talking about.
Just an anecdote, but the other thing to say is about different knowledge in different places. I was over in the UK, and if I'm seeming a bit jet lagged it's because I am jet lagged. I was in the UK last week and I was talking to my niece, and she'd been told by my sister that I was some professor or something. And she said, hey, have you got one of those box things, you know, when you type in your name? And I said, oh, I do actually. And then I typed my name into Google in Britain, and it was nowhere to be found. The reason is that my info box doesn't get shown there. If you push, you can squeeze it out of Google, but it really is quite reluctant to show it in the UK. It's obviously decided, for reasons I don't fully understand, that I'm an Australian-based academic and an Australian citizen, so the box pops up here; in the UK the screenwriter I was showing earlier comes up as the principal one, and it's actually quite hard to find me. I think that's really down to Google deciding that different countries have different priorities. It doesn't know my accent, doesn't know that I was born there, but there is that sort of concept within the search engines that different countries have different views.
Assoc Prof Heather Ford:
We actually just finished a study that should be published soon, where we asked Google Assistant, Apple Siri and Amazon Alexa, in both their mobile and smart speaker versions, about 34 Order of Australia winners. And actually they were pretty bad at disambiguating the Australian people from their equivalents, mostly in the US. So it was really interesting to see: they may be doing it at some level, but not that well. Google is by far the best of the three, but still, I mean…
Prof Mark Sanderson:
Actually, my PhD was in disambiguation, many, many years ago. It is a difficult topic, and there really are some perplexing cases. My old colleague in Sheffield, Peter Willett, who's now retired: there's another Peter Willett, and both of them worked in information studies. There's actually another Mark Sanderson who worked at the University of Sheffield, who was also born in December in the same year as me. Apparently there was a giant red notice on our HR records saying do not confuse these two people. So disambiguation is not a trivial process. The only thing you can really do is give your kids a unique name. In fact, one of the things you'll notice is that quite often those info boxes work better for people who are either super famous or have very unusual names, because then the ambiguity problem becomes less prominent. But dealing with ambiguity is much harder than one might imagine.
Assoc Prof Heather Ford:
Yeah, it's just curious, because in this case Google will definitely bring up these people, the Australians, immediately in its search results, but not in the structured data. So that's what's intriguing about it, I guess.
Prof Mark Sanderson:
Somebody else complained on Twitter to one of the Google people about these info boxes, and I chimed in because I was a bit grumpy that my info box was getting confused with those other two people. One of their public-facing people, his name slips my mind, said look, it's really tricky. It's just hard: basically these info boxes about people are being generated completely automatically, and those algorithms just aren't perfect. It's a real limit of the technology.
Dr Amanda Lawrence:
Yeah. So Heather, I guess I might give you the last word on where you see the next research questions for this space, and the key issues you think we should be looking at: either to draw attention to these issues, or to make sure that people are aware of them in various ways, or even so that Google, Wikipedia and the Wikimedia community take these issues more seriously.
Assoc Prof Heather Ford:
Thanks, Amanda. I'll have to watch the video to hear what the others have said; I'm very curious. But I think some really interesting new work is around structured data in question-and-answer systems, because at the moment we have more smart speakers than people on the planet, and increasingly people are using smart speakers and digital assistants to ask and answer questions about the world. Imagining how these global corporations like Google, which has such a monopoly over the technology for question answering, are able to respond, and how they do respond to this local knowledge question, is going to be really interesting and have significant political and social effects.
So I think question answering is a really interesting area, especially in terms of machine learning and figuring out where that takes us. And the local question hasn't really been asked to such an extent before: for people far from the headquarters of Google, to what extent are we creating unequal systems where some people can get verified and some people can't, and where some people have control over the knowledge that's represented about them? Because that's essentially what it's about. Those are just some of the things I think are important, but there's a lot of related work happening at the ADM+S centre in terms of search. I just think structured data is something that doesn't often get examined much, and it's really powerful. So yeah, it's a really important area.
Dr Amanda Lawrence:
Fantastic. We've just got one minute to go, so I was wondering if you would briefly mention your ARC Discovery grant and what you'll be doing with that, Heather.
Assoc Prof Heather Ford:
Thanks, Amanda. How about putting me on the spot? No, no, no. It's actually about looking at how history, Australian history, is written on Wikipedia, and there will be quite a big data component to that, actually. So we're looking at the concept of bias and how people think about bias in theory, and then also looking at a bunch of different types of articles related to Australian history. We're doing a kind of multifaceted study with the historian Tamson Pietsch, who I'm working with on it, hopefully with input from the Wikimedia Foundation, and with Nathaniel Tkacz, who is Australian but is at Warwick, and has written a beautiful book about Wikipedia. So it's a really exciting project over three years. We haven't started it yet, it should start mid-year, and I think there will be some really interesting findings for Australia, about the challenges that people face and also the successes they are seeing in reporting a kind of people's history.
Dr Amanda Lawrence:
Fantastic. So exciting to get a research project like that funded; really wonderful. I would just like to thank Liam for getting up early in the morning, Mark for staying awake with jet lag, and Heather for being very patient when she couldn't get on to this event at first; I'm so sorry about that. Thank you everybody for coming along, thanks very much to Wikimedia Australia for helping to promote this event, and to the ARC Centre of Excellence for Automated Decision-Making and Society for supporting ongoing interest in Wikimedia and its role in the knowledge ecosystem. With that, we'll turn off the recording. Thank you everybody, so much.