Musicians and Machines: How Machine Learning Can Create New Ways of Making New Sounds

Interview with Jesse Engel, Research Scientist at Google Brain’s Magenta Project


2022 has seen tremendous leaps forward in artificial intelligence, outpacing many experts’ predictions about how soon we might expect to see something like artificial general intelligence. The large language model GPT-3 has been demonstrating the power of transformers to generate remarkably human-like text since 2020, and DALL-E 2, released this past April by OpenAI, shows an astonishing capacity for generating convincing images from human-written text prompts. Google’s Imagen and Parti, and DeepMind’s Flamingo, have likewise demonstrated the potential of large models to handle multi-modality, or different kinds of input and output.

These new A.I. models, and the interfaces that allow users to interact with them, carry profound implications for creative work. The images DALL-E produces are impressive, but designers and artists won’t be disappearing any time soon; rather, the ways we create will likely shift dramatically. The exciting work ahead for technologists and cultural producers alike, then, lies in devising ways for humans and artificial intelligence to collaborate on new aesthetic forms unique to this new kind of creative work.

Google’s Magenta project, housed within the Google Brain team, adds another dimension to the nascent terrain of A.I.-assisted creativity. Magenta conducts research on music and machine learning, and also releases many tools and plug-ins that allow artists to experiment with integrating ML-generated sounds into their work. 

Jesse Engel, Senior Research Scientist at Magenta, is helping lead these efforts to create generative music tools that amplify the human experience. We talked with Jesse about how Magenta keeps artists and musicians at the center of the tools it builds, and how the recent leaps forward in artificial intelligence might continue to reshape how we think about creativity.


C/Change: Tell us about your journey to machine learning research and Google’s Magenta Project.

Jesse: My background is actually in the physical sciences. I did physics as an undergrad at Berkeley, and after that I wanted to work in renewable energy. I was told I had to study materials science to do that – you know, solar panels and these types of things. So I did a PhD in materials science working on solar panels in the nanotechnology department at Berkeley. It was great, but it was a lot of working with chemicals and mixing things. Meanwhile, I was hanging out with neuroscientists, and they were always doing the coolest stuff. As a result, I became interested in the computation of complex systems and did a postdoc split between Berkeley’s neuroscience department and Stanford’s electrical engineering department. At the same time, one of my housemates got a job at a new startup lab for Baidu, the Chinese search company. Around then, Doug Eck was starting up this research lab at Google focused on applying machine learning to creativity and music. I’ve always been a musician – I play jazz guitar and improvisational stuff. I’d been blending technology and music in my free time; when I was at Stanford, I made a synthesizer based on the vibration of molecules. This opportunity to actually do the two things as one job and blend my passions was perfect for me, so I joined in 2016. It’s been a great opportunity to follow my whimsy and be supported in pursuing ideas that sound fun.

“How can we rephrase the machine learning problem to move away from just outputting new music or art and toward collaborating in a way where you’re not just achieving the best goals by yourself, but you’re actually bringing the best out of the other?”

C/C: How do you design for collaboration in the music tools you build? How might such technologies catalyze new, different, and surprising collaborations that weren’t possible before?

J: Collaborations between people and technology have been happening for as long as there has been technology. The ways we express ourselves have always been very closely tied to the tools we use to do so. An example I like to use is the fact that the first bone flutes are actually older than the earliest clay pottery. As far as we can tell, people have been using tools to make music for longer than they have been using tools to make soup. In other words, the flute is older than soup. From there, you can go up through the electric guitar, the drum machine, and digital audio workstations and see how new tools enable new forms of expression. Machine learning is really just another layer of technology. The question is, how do we design ways of interacting with these systems so that we still feel inspired and in control? What’s the right metaphor? Is this tool a synthesizer? Is it a composition? Is it a random number generator? Is it a compass? Is it an assistant? Fundamentally, it’s just matrix multiplication, but depending on how you build interfaces and present the technology to people, the way people interact with it will change drastically.

What’s really interesting to us lately is what happens when you bring other people into the situation, when you have human-to-human collaboration intermediated by a machine learning tool. Anna Huang is a great researcher on our team who developed CocoNet, which harmonizes melodies in the style of Bach. Then we developed an extension called CoCoCo, with a lot more user control, that focused on collaboration between users. We did a study to find out whether it really facilitated collaboration, and we got some pretty surprising results. It was great at helping generate ideas, but it also served as a helpful social intermediary. Some people felt less inhibited, because if anything went wrong or didn’t sound great, they could blame the model. Users didn’t have to feel self-conscious about their decisions, so they could create more freely. On the flip side, people also felt less ownership of the output.

Music is a really great microcosm for the incoming wave of technology. Take robotics, for example – at the end of the day, there’s you and an embodied algorithm, and you’re trying to accomplish a goal together. With our musical tools, you’re trying to jam with an algorithm, and you want to bring out the best in each other. So how can we rephrase the machine learning problem to move away from just outputting new music or art and toward collaborating in a way where you’re not just achieving the best goals by yourself, but you’re actually bringing the best out of the other? How can you make that a part of the algorithm’s learning process? 

Magenta’s NSynth Super, a hardware version of the team’s Neural Audio Synthesis model.

C/C: What might that look like technically? How would we have to change how we build machine learning models in order to create more collaborative A.I.?

J: Fundamentally, it’s about moving from generative models to incorporating what’s called reinforcement learning. We’re trying to model the generative process, instead of just the generative outcome. Thinking about art as a verb instead of a noun. The things that feel most aligned for me are technologies that help people improve: how can interacting with a model help someone learn to play better? Or even just make people more aware of how they interact with a model, so that they’re more introspective about how they interact with others? So it’s explicitly about multiple people, multiple agents. This idea of modeling others is actually a big area of research. Researchers will look at multiplayer games like Overcooked, a video game where multiple players have to cook food together. It’s been turned into a machine learning benchmark for training a model to collaborate with a human to achieve goals. You can train models to collaborate with each other, but then you take that agent and pair it with an actual human, and sometimes it doesn’t work out. So an active research question is: how can we change the training process so that agents adapt better to human partners?
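To make that mismatch concrete, here is a deliberately minimal sketch, not anything from Magenta or the Overcooked benchmark itself, of why an agent trained purely against copies of itself can stumble when paired with a person. The toy coordination game, the payoff matrix, and the "human" policy below are all invented for illustration.

```python
import numpy as np

# Toy coordination game: each player picks a "convention" (action 0 or 1).
# The pair earns reward 1 if they match, 0 otherwise.
REWARD = np.eye(2)

def best_response(partner_policy: np.ndarray) -> int:
    """Pick the action with the highest expected reward against a fixed partner."""
    expected = REWARD @ partner_policy
    return int(np.argmax(expected))

def expected_reward(agent_action: int, partner_policy: np.ndarray) -> float:
    return float(REWARD[agent_action] @ partner_policy)

# Self-play: two copies of the agent settle on an arbitrary convention (action 0).
self_play_partner = np.array([1.0, 0.0])
self_play_agent = best_response(self_play_partner)

# A hypothetical human partner who happens to favor the other convention.
human_partner = np.array([0.2, 0.8])

# An agent trained against a model of that human behavior instead.
human_aware_agent = best_response(human_partner)

print("self-play agent vs. human:  ", expected_reward(self_play_agent, human_partner))   # 0.2
print("human-aware agent vs. human:", expected_reward(human_aware_agent, human_partner)) # 0.8
```

The direction the answer above points at is roughly this: rather than training only against copies of itself, an agent can be trained against models of human behavior or against a diverse population of partners, so that the conventions it learns still work when a person sits down at the other end.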

C/C: How would you characterize the relationship between a musician and the ML tool that he or she uses to make music? 

J: I think it spans the whole gamut, from ML models acting as accelerators for human creativity to more genuinely collaborative practices. That’s why it’s really important to talk about that explicitly and not treat music as a problem to be solved. Instead, it’s important to recognize that different people have different needs and wants when interacting with an algorithm. It all comes down to how these technologies are helping people have richer human experiences, either just with themselves and a piece of technology, or with other people.

C/C: How do you think the proliferation of artificial intelligence will affect our creative communities, or how we create together, in the decades to come? What predictions can you make about the future of artificial intelligence and society?

J: Things move so fast that it’s hard to predict too far into the future. One decade ago, AlexNet came out, and now we have models like Imagen, Parti, and DALL-E. The world is going to be radically different in the sense that the ability to generate things within virtual worlds will no longer be limited, much like how computers eliminated the scarcity of copies. Think about people writing books and then making copies of those books: computers produced enormous value by removing that scarcity of physical reproduction. Similar to the way replication scarcity was removed by digital storage, creative machine learning is going to remove the scarcity of creation. If you can think of a nonexistent movie you’d like to see, you can just generate it. But just because anything can be created doesn’t mean the value of artists’ work is negated. The value is not just in the asset itself; the human experience creates its own value. So what types of human experiences are valued with these technologies? The really interesting part will be lowering the cost of creating aesthetic experiences. So many people don’t think of themselves as musicians because they don’t play music for a living. I talk, but I don’t consider myself a talker, because it’s just a natural part of being a human being. When you eliminate those barriers to expressing yourself through music, things radically change, like when all of a sudden anyone can take a photograph with their cell phone. It changes what it means to be a photographer. The economics and labor implications of that shift are one thing. Fundamentally, though, it also means that people can experience photographs, share memories with each other, and create personalized narratives around the technology without having to dedicate their professional lives to it. I think the real question is: how do these technologies change your human experience? Your social experience?

It all comes down to how these technologies are helping people have richer human experiences, either just with themselves and a piece of technology, or with other people.

C/C: What kinds of sounds might Magenta’s models deem “beautiful”? How do humans and machines achieve alignment in the creative process?

J: From a pure entropy perspective, white noise would be the most beautiful music: you are least able to predict what’s going to happen next. But the human reference is not just about unpredictability; it’s about predictability too, and the dynamic between the two. I’ve tried to formulate multi-agent partnerships where human priors are used to generate something that’s bound to our reference in certain ways but unbound in other ways. I think that’s where things get really interesting. When you just learn from data, you’re just learning that specific space, interpolating within it, and maybe doing some combinatorial operations here and there. But it never really goes off the human guardrails entirely, because if it does, it doesn’t make any sense to our ears.
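As a rough illustration of that entropy framing, and not any measure Magenta actually uses, you can compare the per-note entropy of uniformly random pitches ("white noise") with that of a short repeating figure. The pitch set, the repeating pattern, and the use of marginal entropy as a stand-in for unpredictability are all our own simplifications.

```python
import numpy as np
from collections import Counter

def empirical_entropy(sequence) -> float:
    """Shannon entropy (bits per symbol) of a discrete sequence."""
    counts = Counter(sequence)
    probs = np.array([c / len(sequence) for c in counts.values()])
    return float(-(probs * np.log2(probs)).sum())

rng = np.random.default_rng(0)
pitches = list(range(12))  # twelve pitch classes

# "White noise": every pitch class equally likely, maximally unpredictable.
white_noise = rng.choice(pitches, size=4096)

# A repeating four-note figure: highly predictable.
ostinato = np.tile([0, 4, 7, 4], 1024)

# Note: marginal entropy is a crude proxy; a sequence model would find the
# ostinato completely predictable (conditional entropy near zero).
print("white noise entropy:", round(empirical_entropy(white_noise), 2))  # ~3.58 bits (log2 of 12)
print("ostinato entropy:   ", round(empirical_entropy(ostinato), 2))     # 1.5 bits
```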

And if you can understand the priors we use to make sense of the world well enough, you can say: let’s hold all of those fixed, but loosen this one. There are plenty of examples of that. A lot of algorithmic composition to date is built like this – the notes are played by a synthesizer and rest on a lot of constructs we already know, but the constraint on which notes are played has been loosened. At Magenta, we’re really interested in exploring this next generation of learning with people in the loop and getting people and machines to collaborate.
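To make that "hold most priors fixed, loosen one" idea concrete, here is a deliberately minimal sketch of classic constrained algorithmic composition, assuming nothing beyond the Python standard library: the rhythm, the scale, and the instrument stay fixed, and only the choice of which scale degree falls on each beat is left to chance. The C major scale, the tempo, and the print-only "playback" are illustrative stand-ins, not anything from Magenta.

```python
import random

# Hold most musical priors fixed: a steady quarter-note pulse, a familiar
# scale, a single instrument. Loosen exactly one constraint: which scale
# degree sounds on each beat.
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, C4 to C5
BEAT_SECONDS = 0.5                          # fixed pulse: quarter notes at 120 BPM

random.seed(7)  # reproducible "composition"
melody = [(random.choice(C_MAJOR), BEAT_SECONDS) for _ in range(16)]

# Stand-in for playback: a real system would send these events to a synthesizer.
for pitch, duration in melody:
    print(f"play MIDI note {pitch} for {duration}s")
```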