Covariant CEO on building an AI that helps robots learn
Covariant was founded in 2017 with a simple goal: to help robots learn to better grasp objects. It’s a big need among those looking to automate warehouses, and it’s a lot more complex than it looks. Most of the goods we come across have passed through a warehouse at some point, and they span an incredibly wide range of sizes, shapes, textures and colors.
The Bay Area company has built an AI-based system that trains networked robots to improve picks as they go. A demonstration on the floor at this year’s ProMat shows how quickly a connected arm is able to identify, grasp and place a wide range of different objects.
Co-founder and CEO Peter Chen sat down with TechCrunch at the show last week to discuss robotic learning, foundation model building and, of course, ChatGPT.
TechCrunch: When you’re a startup, it makes sense to use as much off-the-shelf hardware as possible.
PC: Yeah. Covariant started from a very different place. We started with pure software and pure AI. The company’s first employees were all artificial intelligence researchers. We had no mechanical engineers, no roboticists. It allowed us to go much further in AI than anyone else. If you look at other robotics companies [at ProMat], they’re probably using an off-the-shelf model or an open-source model — things that have been used in academia.
Yeah. ROS or open-source computer vision libraries, which are excellent. But what we do is fundamentally different. We look at what academic AI models provide, and it is not enough. Academic models are built in a lab environment. They’re not designed to withstand real-world testing — especially testing by many customers, with millions of SKUs, millions of different item types that need to be processed by the same AI.
Many researchers take many different approaches to learning. How is yours different?
Much of the founding team came from OpenAI — three of the four co-founders. If you look at what OpenAI has done in the last three or four years in the language space, it’s basically taking a foundation model approach to language. Before ChatGPT, there were many natural language processing AIs. Search, translation, sentiment detection, spam detection — there was a lot of natural language AI. The pre-GPT approach was, for each use case, to train a specific AI on a smaller subset of data. Look at the results now, and GPT basically blows away the specialized translation models, and it wasn’t even trained for translation. The foundation model approach is basically: instead of training a task-specific model on a small amount of task-specific data, train a large, generalized foundation model on a lot more data, so the AI is more generalized.
You focus on picking and placing, but are you also laying the foundations for future applications?
Certainly. The ability to grasp, or pick and place, is the first general capability we give to robots. But if you look behind the scenes, there’s a lot of 3D understanding and object understanding. There are many cognitive primitives that are generalizable to future robotic applications. That being said, grasping and picking is such a big space that we can work there for quite some time.
You went after picking and placing first because there is a clear need for it.
There is a clear need, and there is also a clear lack of technology for it. What’s interesting is that if you had come to this show 10 years ago, you might have found picking robots. They just wouldn’t work. The industry has struggled with this for a very long time. People knew it couldn’t work without AI, so they tried niche AI and conventional approaches, and it didn’t work.
Your systems feed a central database, and each pick informs how the machines pick in the future.
Yeah. The funny thing is that almost every item we touch goes through a warehouse at some point. It’s almost a central clearing place for everything in the physical world. When you start by building an AI for warehouses, that’s a great foundation for AI that extends beyond warehouses. Say you want to pick an apple from a field and take it to an agricultural plant — the AI has seen an apple before. It’s seen strawberries before.
That’s a one-to-one mapping: I’ve picked an apple in a fulfillment center, so I can pick an apple in a field. More abstractly, how can those lessons be applied to other facets of life?
If we take a step back from Covariant specifically and think about where the technology is heading, we see an interesting convergence of AI, software and mechatronics. Traditionally, these three fields have been somewhat separate from each other. Mechatronics is what you find when you come to this show — repeatable motion. If you talk to the salespeople, they tell you about reliability, how a machine can do the same thing over and over again.
The truly amazing revolution we’ve seen from Silicon Valley over the past 15 to 20 years is in software. People have cracked the code on how to build really complex, really smart software. All of the applications we use are really people leveraging the capabilities of software. We are now at the forefront of AI, with all of its incredible advancements. When you ask me what’s beyond warehouses, where I see this really going is the convergence of these three trends to build highly autonomous physical machines in the world. We need the convergence of all three technologies.
You mentioned ChatGPT’s arrival blindsiding the people who make translation software. It’s something that happens in technology. Are you worried about a GPT coming along and effectively blindsiding Covariant’s work?
This is a good question for a lot of people, but I think we have an unfair advantage in that we started out with pretty much the same belief OpenAI had about foundation model building: general AI is a better approach than niche AI. That’s what we’ve been doing for five years. I would say we are in a very good position, and we are very happy that OpenAI has demonstrated that this philosophy works really well. We are very excited to do the same in the world of robotics.