Jenny: Welcome to the Automated Decision-Making and Society podcast. My name is Jenny Kennedy and today we’re talking with Doctor Aaron Snoswell from the ARC Centre of Excellence for Automated Decision-Making and Society about machine learning and recent technical advances in the ability of AI to produce creative content like images and text.
Aaron Snoswell is a computer scientist and a research fellow at ADM+S working in computational law. He has previously worked across varied industries, including biomedical device development, health service policy research, and pilot and astronaut training.
Thanks for joining us, Aaron.
Aaron: My pleasure, thanks for having me.
Jenny: Can you tell us a little bit about your research interests?
Aaron: Sure, so I was originally a robotics engineer but recently went back to do a PhD in Computer Science, and my research was looking at reinforcement learning, which is an area of machine learning that you might have heard of. It’s behind recent successes in things like playing games, for example Go or StarCraft, in folding proteins, in steering autonomous vehicles, and also, believe it or not, in controlling nuclear fusion reactors.
In my research at ADM+S, I’m interested in how we should hold AI systems and their developers accountable when and if things go wrong.
Be that through regulation or political mechanisms or technical mechanisms or that sort of thing.
Jenny: So today we’d like to chat to you about the rise in the ability of AI to create text, images and videos and even original works of art. Some of the images in particular are quite incredible.
How has AI developed the ability to create these works?
Aaron: So, these impressive results that you’re talking about come from this new family of AI models, which have funny names like DALL-E or PaLM or GPT.
What’s really new here is the approach to how machine learning is being done. So previously AI developers tended to start from scratch for each new problem that they were interested in. So, if you wanted to detect pictures of cats in photos, you would start from scratch and build a new AI model.
If you wanted to summarize long documents automatically, you’d start from scratch and build a new AI model, etcetera, etcetera.
But more recently, AI developers have started building what’s sometimes called foundation models, so these are very large general-purpose models that learn patterns from very general data.
So, for example, a copy of a large chunk of the entire Internet. And it turns out, if you learn very generally about the kind of content that humans produce, these huge models can actually serve as a foundation or like a springboard for numerous other and different tasks, such as generating images or writing stories.
If you like, it’s analogous to imagining an artisan handcrafting wooden toys, compared with having lots of little Lego bricks that can be pieced together into toys much more quickly and in a variety of different ways, perhaps, as an analogy.
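Aaron’s Lego-brick analogy can be sketched in code. The following is a deliberately toy illustration in Python, with entirely hypothetical class and method names, of the idea that one expensively trained base model can be reused cheaply for many downstream tasks:

```python
# A toy illustration of the "foundation model" idea: one large, general
# model is trained once, then reused as the base for many different tasks.
# All names here are hypothetical; real systems have billions of parameters.

class FoundationModel:
    """Stands in for a big pretrained model mapping inputs to features."""
    def encode(self, text):
        # Pretend the learned representation is just a set of lowercase words.
        return set(text.lower().split())

class TaskHead:
    """A small task-specific layer bolted onto the shared foundation."""
    def __init__(self, base, keywords):
        self.base = base
        self.keywords = keywords  # the cheap, task-specific part

    def predict(self, text):
        # The task head reuses the foundation's general representation.
        return bool(self.base.encode(text) & self.keywords)

base = FoundationModel()                               # trained once, at great cost
cat_detector = TaskHead(base, {"cat", "tabby"})        # quick new task 1
summariser_flag = TaskHead(base, {"summary", "tldr"})  # quick new task 2

print(cat_detector.predict("This tabby cat is relaxing in the sun"))  # True
```

The point of the sketch is only the shape of the workflow: the expensive object is built once, and each new task adds a small, cheap piece on top.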
Jenny: So this example you’ve given us of the DALL-E model. Can you tell us a bit more about what that is and how it can generate images from text captions?
Aaron: Yeah, so DALL-E, and in particular DALL-E 2, which is the recent version of that, has been in the news a little bit.
It was originally trained on a humongous data set to match pictures, like a photo of a cat, with the corresponding caption, so for example, text that says “this tabby cat is relaxing in the sun”. It did this by looking at hundreds of millions of examples.
So once trained, the model knows what cats and other objects look like in pictures; it has learned some very general patterns about how strings of text can relate to blobs of pixels in an image.
But this model can then be used as a foundation for many other interesting tasks, like generating new images from a caption alone.
So, one famous example is “show me a koala dunking a basketball”. This is not a picture that existed anywhere in the world prior to this model, but the model can actually generate this sort of imaginative image.
It can also edit images based on written instructions. You could type “make it look like this monkey is paying taxes” and it can edit images based on text prompts like that. And so, by building on this general foundation as a starting point, you can quickly make new AI systems that do different things, and this process is faster and cheaper than the old way of doing things, relatively speaking.
Jenny: I love this so I can ask for an image of a panda doing a backflip on a bike. Why were these models created? Was it to help us search for images on Google?
Aaron: Because we can, or because someone at Google had a lot of research money to spend. More seriously though, we can think about this sort of why question at two different levels.
So, on one level this trend fits into the bigger picture of AI research, where we’re trying to build machines that are intelligent, however you choose to define intelligence. With foundation models, researchers and companies are betting that memorising and combining patterns learned from data capturing human behaviour, such as content from the Internet, is actually a viable approach to producing intelligent behaviour in these machines.
At a more practical level though, companies have been realising that whether these models are “intelligent”, in bunny ears, or not, they’re actually going to be useful for numerous tasks that can probably be used to turn a profit.
For example, if Amazon has a better language model than Apple or Google, then the Alexa Home Automation assistant might be purchased more often than the competing products or services like Siri or the Google Assistant, and so in this sense, these research innovations can be linked to really tangible financial and economic benefits for the companies that own that tech.
Jenny: Right, so it’s more about using it as a product to bring people into a wider ecosystem than anything else.
Aaron: Yeah, from one perspective it is, yeah.
Jenny: Let’s talk more about how the DALL-E model works. It looks like the images it produces have been professionally photoshopped together. How does the model do this?
Aaron: Yeah, so when you type in a text prompt to this DALL-E system it doesn’t find an image, in the sense that it’s not looking at a database or doing a Google image search or something like that. It’s sort of imagining, if you don’t mind me using that anthropomorphic term, or generating new images from scratch.
And so, the technology behind how this works is a type of machine learning called deep neural networks and so these are loosely inspired by how the brain works.
They involve a lot of sophisticated mathematics and a huge amount of computing power, but they really boil down to doing a sophisticated type of pattern matching. So one influential paper in this area even describes these models as stochastic parrots in the sense that they just randomly spit out things that they’ve heard before with no understanding.
So, continuing our example of the image models, if a deep neural network looks at millions of example images, it would start associating the word cat, for instance, with patterns of pixels that often appear in pictures of cats: soft, fuzzy, hairy blobs of texture.
The more examples the model sees, so the more data you’ve got, and the bigger the model is, the more layers or depth it has, as we sometimes refer to it, the more complex and advanced these correlations and patterns can be.
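The pattern-matching idea Aaron describes can be made concrete with a toy sketch. This is not a neural network, just word–feature co-occurrence counting over a tiny made-up dataset; the "features" and examples are hypothetical stand-ins for what real models learn from pixels:

```python
# A toy sketch of statistical pattern matching, not a real neural network:
# count how often caption words co-occur with simple, hypothetical image
# "features". Real models learn far subtler patterns from raw pixels.
from collections import Counter, defaultdict

# (image features, caption) pairs standing in for training examples
training_data = [
    ({"fuzzy", "whiskers"}, "a tabby cat"),
    ({"fuzzy", "whiskers"}, "my cat sleeping"),
    ({"metal", "wheels"}, "a red car"),
]

cooccurrence = defaultdict(Counter)
for features, caption in training_data:
    for word in caption.split():
        cooccurrence[word].update(features)

# The word "cat" has now become associated with fuzzy, whiskery textures.
assert cooccurrence["cat"]["fuzzy"] == 2
```

With more data and a richer model, these raw correlations become the complex, layered patterns Aaron refers to.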
Jenny: So is this part of the deep learning approach to AI?
Aaron: Yeah, in one sense it’s just an extension of the existing deep learning paradigm that’s really dominated AI research for the past decade.
It’s bigger, it’s better, it’s shinier if you like, but there are also hints that maybe something new is going on here. These foundation models seem to have the potential for what some people refer to as emergent behaviours, which can be surprising and novel.
So, Google’s PaLM model, which is a language model, seems to be able to produce explanations for complicated metaphors and jokes, which perhaps goes a little bit beyond just imitating the types of data it was originally trained on, which is a little bit surprising.
Jenny: How do these models go beyond simple imitation?
Aaron: I’ll tell you that just as soon as I figure it out and collect my Turing award.
The truth is we don’t really have a good understanding of the theory behind why deep neural networks are so effective. In this sense, the application of these systems in practice has gone way beyond the math and the theory.
There’s ongoing research trying to understand how and why these systems work.
As I said, we know that deep neural networks memorise patterns from data, so it’s possible that these foundation models are just combining patterns in more and more complicated ways, and maybe at some point if you combine patterns in really complex ways that are beyond how a human can understand the combinations, then maybe that starts to look like creativity or intelligence to humans, perhaps.
Maybe I can give an example to help illustrate this. One of the examples from this PaLM language model is a question that the model successfully answers.
So the question is, I’ll read the text here.
“Michael is at that really famous museum in France looking at its most famous painting. However, the artist who made this painting just makes Michael think of his favorite cartoon character from childhood. What was the country of origin of the thing that the cartoon character usually holds in his hands?”
So when I read this I was completely baffled. I understood the first part was maybe talking about the Louvre and the Mona Lisa, but then I was just completely lost.
The response that PaLM generates is as follows.
“The most famous painting in the Louvre is the Mona Lisa. The artist who made the Mona Lisa is Leonardo da Vinci. Leonardo da Vinci is also the name of the main character in the cartoon Teenage Mutant Ninja Turtles. Leonardo da Vinci is from Italy.
The thing that Leonardo da Vinci usually holds in his hands is a katana. The country of origin of the katana is Japan. The answer to the question is Japan.”
And so I read this and my mind was blown by these leaps of reasoning, if you like. In one sense, I wasn’t able to get there because I didn’t know much about Teenage Mutant Ninja Turtles, but the model is able to follow this chain of reasoning and extend it to actually find the answer after several jumps that require different lines of thought.
In general, these sorts of questions inevitably butt up against philosophical questions about what intelligence or creativity is, or what features constitute humanity and consciousness, which I think are very interesting questions, but perhaps a topic for another podcast.
Jenny: Absolutely, I mean they are questions that drive a lot of our research in the area, but what do you think about how these models will affect human creativity? Will they?
Aaron: Yeah, great question. So one thing that’s important to mention here is that access to these models is quite limited at the moment, for a few reasons.
So in one sense the sheer scale of these systems is difficult to think about. PaLM, which I just mentioned, has 540 billion parameters, so for example, if everyone on the planet memorised 50 numbers, we still wouldn’t have enough storage to actually reproduce that model. It’s just mind-bogglingly big.
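Aaron’s back-of-the-envelope claim checks out, assuming a world population of roughly 8 billion people:

```python
# Quick arithmetic behind the "everyone memorises 50 numbers" claim,
# assuming a world population of roughly 8 billion people.
parameters = 540_000_000_000       # PaLM's stated parameter count
people = 8_000_000_000             # approximate world population
numbers_per_person = 50

total_memorised = people * numbers_per_person  # 400 billion numbers
print(total_memorised < parameters)            # True: still not enough
```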
And these models are so enormous that training them requires massive amounts of computational and other resources. One estimate for OpenAI’s GPT-3 model guessed that it took around 5 million US dollars to train. What this means is that the huge tech companies that build these systems are limiting access to them for economic reasons, tightly controlling who can actually play with this technology, and they’re also doing this for ethical reasons, to try and prevent malicious uses, for example, such as generating deepfakes or fake news.
And so, these kinds of restrictions give us a little bit of comfort that they won’t be used for nefariously creating content like fake news anytime soon.
But this also means independent researchers aren’t able to interrogate these systems and share the results in an open and accountable way, so we don’t yet know what the full implications of their use are going to be.
Jenny: That’s a great example of I guess the tension between the potentiality of these systems and the practicality of them. Are developers looking to develop smaller, more accessible models?
Aaron: Absolutely, AI researchers are working hard making this kind of technology more efficient and accessible, and smaller models are already being published in open source forms.
Actually, Facebook, or Meta, just a few days ago released a new model called OPT, which is designed to be open source; academics can request access to it and use it for their research. Tech companies are also experimenting with licensing and commercialising these tools, such as pay-per-use models for accessing some of these models.
And so this emerging economy is going to have very interesting dynamics, and it’s going to be fascinating to watch how this market evolves in the coming years.
But the bigger picture here is that this trend of the models becoming more accessible, combined with this creativity, if you like, of these models, suggests that creative and professional roles could be impacted sooner than perhaps we initially expected.
So, what I mean here is that traditionally, AI predictions always said that robots would displace blue-collar jobs first, if you like. There was this mantra in robotics that it was the dirty, dull and dangerous jobs where robots were going to be useful first.
White-collar jobs, or work which requires creativity or training, was supposed to be maybe safer for longer. And let me say, I don’t like this framing that the robots are coming for your jobs; I think it’s perhaps a little bit harmful when it comes to the discussion. But the point here is that these deep learning models already exhibit superhuman accuracy in some tasks, like reviewing X-rays or looking at certain types of medical images, and so we might expect that foundation models could soon provide cheap or “good enough” creativity, if I can use that term, in some fields like advertising, stock photography, graphic design or editing.
The huge caveat here is that where these models are going to fall completely on their face is at the interface between these creative roles and other humans, for example clients.
The two-way discussion and communication needed to elicit and develop a creative vision, or accounting for and applying human feedback on creative content, for example, is completely outside the remit of these deep learning systems.
All that to say, the future of professional creative work could look a little bit different than we traditionally expected as AI pontificators.
Jenny: So, what does this mean for other sectors for say legal evidence or news and media?
Aaron: Yeah, well, it’s pretty clear to see that foundation models will affect, for example, law in areas like intellectual property and evidence because we won’t necessarily be able to assume creative content is the result of human activity.
And we’re also going to have to confront this ongoing challenge of disinformation and misinformation generated by these systems.
This is already an enormous issue, for example with the ongoing invasion of Ukraine and the problem of deepfakes on social media.
But in one sense, foundation models could supercharge these challenges. This is also especially important to think about during periods like right now, when we’re in an election period in Australia, and so it’s important to be especially aware that content on social media might be generated by these types of models, and to keep that in mind.
Jenny: I think there’s something in that. There’s a point in creative thinking that a machine can’t yet replicate but a human can. Like when you’re talking about using creative AI for copywriting for a client, there’s a certain interpretation of a brief that, if taken literally, is OK, but actually there’s another step there that we can’t quite code.
And that’s, I think, what Janelle Shane’s neural nets are trying to demonstrate. My favourite example of hers is when she uses AI to come up with a list of potential cat names by feeding in all the names of cats in a cat shelter and then asking it to come up with other examples.
And you know, some of them are kind of cute and wholesome, like Taffeta or Tom Glitter, but then others are things like Miss Vulgar.
Aaron: So there’s a level of interpretation required here that in this situation will probably require more information from the person you’re interacting with. But in the case of DALL-E, it doesn’t know which way it’s supposed to go, so it just does both.
It draws a picture of a robot hand drawing a picture of a robot hand, which is kind of interesting; it covers all its bases.
So it’s really important not to just focus on the exciting potential of these systems, but also to think about their limitations.
So one example limitation: because these systems are just doing correlation, pattern matching and pattern combination, if you like, they’re not good at logical reasoning. What I’m referring to here is this problem called the binding problem in AI, which is about how you bind concepts to objects. There are a bunch of examples now on the Internet where people have tested DALL-E and said, can you show me an image of a red cube on top of a yellow cube next to a green sphere, or something to that effect, and it’ll spit out images with some cubes and some spheres, and some of them will be red, yellow and green, but they’re never in the right order, or if they are, it’s just a fluke.
So the system knows the individual concepts, but it’s not yet able to associate them correctly, so there’s a lack of understanding there in terms of actually binding concepts to objects. This is a key composition problem that these systems have not yet solved, and I don’t think we’ll solve it in the near future, personally.
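The binding problem Aaron mentions can be made concrete with a toy example. Under the deliberately simplistic assumption that a model only matches unordered patterns of words, two prompts that use the same concepts bound differently become indistinguishable:

```python
# A toy illustration of the binding problem: a bag-of-words view of a
# prompt keeps all the concepts but discards which colour binds to
# which shape. (A simplification; real models are not pure bags of words.)
prompt_a = "a red cube on top of a yellow cube next to a green sphere"
prompt_b = "a yellow cube on top of a green cube next to a red sphere"

# Both prompts contain exactly the same unordered set of words...
assert set(prompt_a.split()) == set(prompt_b.split())
# ...so a purely unordered pattern-matcher can't tell these scenes apart,
# even though they describe very different images.
print("same concepts, different bindings")
```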
Jenny: So as a researcher studying the effects of AI on society, can you tell us a little bit about what you’re going to be working on to mitigate these risks that you’ve outlined?
Aaron: Certainly, so one key question that I’m interested in is called the value alignment problem, which you can state one way as: how can we build AI systems that are aligned with human values?
So that sounds really general, I know, but in the context of foundation models, this requires reflection on a number of things.
So, what do we mean when we say human values? Whose values are we talking about? What values are we talking about?
So, for instance, we know that these foundation models are trained on data and that they reproduce the biases and stereotypes present in that data, and in particular these models are trained on data scraped from the Internet, which some people have described as the cesspit of humanity.
So right from the get-go we know that these models are going to have baked-in biases, norms, stereotypes and so on, potentially very harmful ones. And if this is the starting point, how do you even set about amending this?
These kinds of questions are really curly because they require close interaction between computer scientists and humanities scholars, which I think is a really interesting area of research and what I’m hoping to focus on, and that’s why institutes like ADM+S are so critical at the moment.
Jenny: Absolutely thank you so much for chatting with us today, Aaron.
Aaron: My pleasure.
Jenny: You’ve been listening to a podcast from the ARC Centre of Excellence for Automated Decision-Making and Society. For more information on the centre, go to admscentre.org.au