Zhixindi reported on August 12 that at last week's 2024 Technology Innovator Conference, top scientists, well-known entrepreneurs, and leaders of major companies in the embodied intelligence industry gathered to analyze in depth the industry's new technologies, trends, and application scenarios.
Three speakers shared the stage to dissect current breakthroughs in humanoid robot technology: Huang Tiejun, Chairman of the Beijing Academy of Artificial Intelligence and Director of the National Key Laboratory of Multimedia Information Processing at Peking University; Peng Fangyu, Vice Dean of the School of Mechanical Science and Engineering at Huazhong University of Science and Technology and Deputy Director of the National Engineering Technology Research Center for Numerical Control Systems; and Wang Tianmiao, Honorary Director of the Robotics Institute at Beihang University and Dean of the Zhongguancun Zhiyou Research Institute. The brain, cerebellum, spatial intelligence, limbs, and upstream core components are the current research hotspots in embodied intelligence, and exploring specialized, multimodal lightweight models at the brain level has become a major trend. At the same time, true intelligence requires interacting with the environment and learning and evolving from it, which is why embodied intelligence is currently considered an important path to AGI (Artificial General Intelligence).
Tencent, Yushu Technology, Lingxin QiaoShou, Mech-Mind, and other players across the upstream, midstream, and downstream of the embodied intelligence industry chain explained, from the perspectives of robot bodies, algorithms, and dexterous hands, why embodied intelligence is one of the best paths to AGI. Robots that operate in industrial scenarios through pre-programming have zero intelligence; only through autonomous learning and handling complex tasks can true intelligence be achieved.
Representatives from academia, industry, and research, including iFLYTEK, Xiaomi, Foxconn, Galaxy General, Tsinghua University, and Peking University, also held a summit dialogue on the current status and bottlenecks of embodied intelligence technology and its potential future application scenarios. The scenarios they are most optimistic about include high-value scenarios such as military and national defense, as well as logistics; if the market reaches hundreds of millions of units, it will be a To C market.
I. Huang Tiejun of the Beijing Academy of Artificial Intelligence: Today's large models are truly intelligent, and the Peking University team has open-sourced the SpikeCV pulse vision algorithm
Huang Tiejun, Chairman of the Beijing Academy of Artificial Intelligence and Director of the National Key Laboratory of Multimedia Information Processing at Peking University, said that historically, three schools of thought (symbolism, connectionism, and behaviorism) have been in constant debate.
Symbolism describes intelligence, turns it into an algorithm, and has robots execute it; a machine built this way has no intelligence of its own. What is proving effective now is the second school, connectionism, which lets neural networks train and learn by themselves. The third school, behaviorism, is also important but currently less studied; it holds that true intelligence comes from the interaction between an entity and its environment.
He believes that today's large models are truly intelligent: the neural network architecture tries to compute the possible relationships between each token and all other tokens, and given enough input corpus it can iteratively refine the connections between these nodes.
Embodied intelligence is a complete intelligent system with independent perception and motion capabilities. Training a motion control model to control the body, using the technical approach of large models, is currently a hot topic.
Huang Tiejun said that there is not much research on the eyes of embodied intelligence at present, and cameras are not eyes; they are just a traditional way of image collection. The human brain is a spiking neural network, and behind the eyeballs, there are one million nerve fibers, each of which transmits a sequence of neural pulse signals, turning the sequence of light into a physiological pulse sequence.
Peking University invented the principle of pulse continuous photography, which enables each pixel of the pulse camera to work independently, achieving ultra-high-speed continuous imaging. It can take photos even facing the sun, and can still produce clear images under shaking conditions, without being affected by the movement of the robot. The SpikeCV pulse vision algorithm has been open-sourced for free.
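To make the idea of independently working pixels more concrete, here is a minimal Python sketch. It assumes a simple integrate-and-fire model, in which each pixel accumulates incoming light and emits a pulse whenever a threshold is crossed. That model is a simplification introduced here for illustration; it is not taken from the talk and is not the SpikeCV API. The names `simulate_pixel` and `estimate_intensity` and the threshold value are hypothetical.

```python
def simulate_pixel(light, threshold=256.0):
    """Integrate-and-fire model of ONE pixel (an illustrative assumption).

    `light` is a sequence of per-tick light intensities. The pixel
    accumulates them and emits a pulse (1) whenever the accumulator
    reaches the threshold, otherwise 0. No other pixel is consulted.
    """
    accumulator = 0.0
    pulses = []
    for intensity in light:
        accumulator += intensity
        if accumulator >= threshold:
            pulses.append(1)
            accumulator -= threshold  # keep the remainder for the next pulse
        else:
            pulses.append(0)
    return pulses


def estimate_intensity(pulses, threshold=256.0):
    """Recover the average light intensity from the pixel's firing rate."""
    return threshold * sum(pulses) / len(pulses)


# A bright pixel fires often, a dim pixel fires rarely.
bright = simulate_pixel([128.0] * 100)
dim = simulate_pixel([16.0] * 100)

print(sum(bright), estimate_intensity(bright))
print(sum(dim), estimate_intensity(dim))
```

In this toy model the brightness is carried by how often a pixel fires rather than by a fixed exposure window, which is one way to picture why such a sensor would not need a global shutter.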
At present, the intelligence emerging from large models is static; it is not the kind of representation and processing of dynamically generated information that robots need for real-time perception. Huang Tiejun therefore believes that this field will keep developing over the next 20 years.
II. Peng Fangyu of Huazhong University of Science and Technology: Deep integration of mechatronic systems and life systems, breaking down 6 key technology trends
Peng Fangyu, Vice Dean of the School of Mechanical Science and Engineering at Huazhong University of Science and Technology and Deputy Director of the National Engineering Technology Research Center for Numerical Control Systems, said that the current development of large models has brought robots a deep integration of mechatronic systems and life systems.
Future application scenarios for humanoid robots may include special environments and dangerous operations, which call for highly reliable robots built for those scenarios. In the medical field, they include gait training, semantic and emotional recognition during companionship, and minimally invasive surgical robots that guide the surgeon's operations based on preoperative imaging and accurate positioning in three-dimensional space.
Peng Fangyu believes that in the future, the manufacturing industry will present a new paradigm of ubiquitous robots, ubiquitous sensing, and ubiquitous intelligence.
The key technology trends in humanoid robots span three major aspects, namely physical capabilities, skills, and intelligence, and involve the skeleton, muscles, lower limbs, upper limbs, brain, and nervous system.
The first key technology is the structure and architecture of the skeleton. The technological trend aims to address the issue of lightweight design. With the emergence of new technologies such as 3D printing, it is expected that in the future, the integration of structure, function, and materials with high performance will be achieved.
The second key technology is the muscles, with core components including electric drive joints, integrated hydraulic cylinders with functional structures, power units, and array technology. Currently, relatively few research teams and companies in China are doing disruptive design research on these core components using new structures and principles.
The third key technology is the lower limbs, which involves dynamics and reinforcement learning. Domestic attention to end-to-end control is still insufficient.
The fourth key technology is the upper limbs, mainly robotic arms and dexterous hands. The upper limbs have already achieved the ability to pick up eggs and play the piano, but in engineering applications, they are mainly used to replace people in assembly engineering. This requires two robotic arms with 10 fingers, involving the design of dexterous hands, visual flexible sensing, and grasping strategies.
Finally, there is the perception and integration of the brain and nervous system. How to achieve visual intelligent perception will present new challenges.
The future market for humanoid robots is considerable, and upstream core component manufacturers are gradually achieving technological breakthroughs and capacity increases. This is crucial if humanoid robots are to reach real application scenarios with low cost and high reliability.
III. Wang Tianmiao, Dean of the Zhongguancun Zhiyou Research Institute: Sorting out 4 major embodied intelligence research hotspots, as competition shifts from large models back to small models
Robots are physical tools in physical space. From the perspective of intelligent systems, embodied intelligence enables physical entities to interact with the environment and, through learning and reasoning, to generalize and adapt to it. Embodied intelligence is an irreplaceable tool built from AI + robots.
Wang Tianmiao, Honorary Director of the Robotics Institute at Beihang University and Dean of the Zhongguancun Zhiyou Research Institute, said that current hot topics in embodied intelligence innovation research include the brain, cerebellum, spatial intelligence, limbs, and upstream core components. The brain involves general robot large models, data simulators, data manufacturing factories, and end-to-end computing power chips; the cerebellum covers motion, spatial intelligence, and visual recognition, acquisition, modeling, and understanding capabilities; the limbs involve research on the related core components.
Under China's national system, there are two clear trends in how organization, innovation, and startup incubation and investment work. One is that the supply chain is still evolving. Wang Tianmiao said that this includes performance and cost improvements based on innovative underlying design, the evolution of industrial robots, the evolution of humanoid robot structures and supply chains, and the evolution of the core component supply chain for all-electric drive.
In addition, large models have strong natural language interaction and generalization capabilities, but their understanding of visual space is limited, which leads to weaker operability, safety, and dexterity in manipulation. This year, competition has shifted from large models back toward small models, and exploring specialized, multimodal lightweight models has become a trend.
Application scenarios for humanoid robots have not yet taken shape, but price competition has already begun. Wang Tianmiao believes that, on the one hand, a disruptive technology needs a bubble, which brings in resources that accelerate the conversion of research results into products and applications. On the other hand, entrepreneurs without rich resources should focus on applications in niche areas, while entrepreneurs backed by resource-rich platforms can actively pursue breakthroughs but must stay highly vigilant about how to build competitive advantages once investment dries up.
For application scenarios of embodied intelligence, he mainly considers two pairs of dimensions: pain points and maturity, and difficulty and scale. In terms of pain points and maturity, commercial and industrial applications will arrive fastest. In terms of difficulty and scale, work such as grinding, polishing, welding, and handling can have strong commercial prospects using small models, while reaching L3 and L4 commercially requires close cooperation with large factories.
IV. Zhang Zhengyou of Tencent: True intelligence requires autonomous learning, proposing a hierarchical embodied intelligence system
Zhang Zhengyou, Tencent's Chief Scientist and Director of the Tencent Robotics X Laboratory, first described generative AI as a functional relationship: a model maps an input X to an output Y. The model in the middle is now basically a Transformer, and the inputs and outputs are usually text, images, audio, and video.
In generative AI for robots, the inputs and outputs are more complex, as shown in the figure below: the inputs can be 3D environments, body states, and tasks, and the outputs can be motor torques, sub-task sequences, answers, and so on.
The earliest robots on production lines performed a fixed series of actions in a fixed environment and had zero intelligence. In the era of large models, some believed that putting a large model on a robot would achieve embodied intelligence. However, this is currently like placing a 20-year-old brain on a 3-year-old body, because some of the robot's operational capabilities are still relatively weak. True intelligence requires autonomous learning and problem-solving, as well as automatically adjusting plans in response to environmental changes.
Therefore, he believes that embodied intelligence is a very important process towards achieving AGI (Artificial General Intelligence).
Embodied intelligence acquires knowledge through human-like perceptual channels, such as hearing and vision, and abstracts it into representations and semantics to understand the world. Currently, the challenges facing embodied intelligence include complex perceptual abilities, powerful execution capabilities, learning abilities, adaptability, efficient collaboration and integration of multiple abilities, data scarcity and privacy protection, safety and reliability, and social and ethical issues. Only when intelligence and the body are organically integrated can robots truly exhibit intelligence in their interactions with the environment.
Zhang Zhengyou proposed the A2G theory in 2018. In it, A, B, and C constitute the basic capability layer, while D, E, and F represent the interaction between robots and the physical world: through communication with the environment, robots enhance their capabilities, engage in deep companionship and communication, and flexibly grasp objects. G allows information to be exchanged between sensors and robots.
He classifies autonomous robots into two types: reactive autonomy and conscious autonomy. The realization of embodied intelligence requires a change in the control paradigm. The traditional paradigm is perception, planning, and action.
He mentioned another paradigm, SLAP: sensing (S), learning (L), acting (A), and planning (P). It lets robots couple perception and action closely, respond in real time to a constantly changing environment, and integrate learning into each module.
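The difference between planning once and coupling sensing tightly to acting can be illustrated with a toy example. The following Python sketch is an assumption made here for illustration, not Zhang Zhengyou's implementation: a one-dimensional robot chases a drifting target, and the learning component of the paradigm is deliberately left out. The function names and numbers are hypothetical.

```python
def clamp(value, limit):
    """Limit a desired move to what the robot can do in one tick."""
    return max(-limit, min(limit, value))


def plan_then_execute(start, target_positions, step=1.0):
    """Perceive once, plan toward that snapshot, then act without re-sensing."""
    position = start
    goal = target_positions[0]  # single perception at the first tick
    for _ in target_positions:
        position += clamp(goal - position, step)
    return position


def sense_and_act(start, target_positions, step=1.0):
    """Re-sense the target on every tick and act on the fresh observation."""
    position = start
    for goal in target_positions:
        position += clamp(goal - position, step)
    return position


# Hypothetical environment: the target drifts by 0.5 units per tick.
targets = [5.0 + 0.5 * t for t in range(20)]

open_loop_final = plan_then_execute(0.0, targets)
closed_loop_final = sense_and_act(0.0, targets)

print("target ends at", targets[-1])
print("plan-then-execute ends at", open_loop_final)
print("sense-and-act ends at", closed_loop_final)
```

In this sketch the plan-once robot stops where the target used to be, while the robot that senses on every tick ends up on the target, which is the behavior the tighter coupling is meant to buy.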
The Tencent Robotics X Laboratory has developed a hierarchical embodied intelligence system with three levels. The bottom level is Proprioception, the robot's perception and control of its own state. The second level is Exteroception, the perception of the environment, which lets the robot know which capabilities to call upon to complete a task. The top level is the Strategic Level planner, which lets the robot plan and solve problems for specific tasks and environments.
In this hierarchical system, knowledge at each level can be continuously updated and accumulated, and the levels are decoupled, so updating one level will basically not affect the existing knowledge of the others.
V. Vision, Dexterous Hands... Core Component Manufacturers Compete for Supremacy
Players in the industry chain such as makers of dexterous hands and vision perception systems are also key links in accelerating the embodied intelligence industry.
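The decoupling between levels can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names follow the three levels named above, but every method, skill name, and task here is hypothetical and does not describe Tencent's actual system.

```python
class Proprioception:
    """Bottom level: the robot's own state and low-level control."""

    def execute(self, skill):
        return f"joint controller ran '{skill}'"


class Exteroception:
    """Middle level: perceives the environment and selects a skill."""

    def __init__(self, terrain_to_skill):
        self.terrain_to_skill = dict(terrain_to_skill)

    def select_skill(self, terrain):
        return self.terrain_to_skill.get(terrain, "walk")


class StrategicPlanner:
    """Top level: turns a task into a sequence of terrain segments."""

    def __init__(self, routes):
        self.routes = dict(routes)

    def plan(self, task):
        return list(self.routes.get(task, []))


class Robot:
    """Wires the three levels together through narrow interfaces."""

    def __init__(self, planner, exteroception, proprioception):
        self.planner = planner
        self.exteroception = exteroception
        self.proprioception = proprioception

    def run(self, task):
        log = []
        for terrain in self.planner.plan(task):
            skill = self.exteroception.select_skill(terrain)
            log.append(self.proprioception.execute(skill))
        return log


planner = StrategicPlanner({"deliver parcel": ["flat", "stairs", "gap"]})
robot = Robot(planner, Exteroception({"stairs": "climb"}), Proprioception())
before = robot.run("deliver parcel")

# Update only the middle level; the planner and controller are untouched.
robot.exteroception = Exteroception({"stairs": "climb", "gap": "jump"})
after = robot.run("deliver parcel")

print(before)
print(after)
```

Because each level only talks to its neighbor through one method, replacing the middle level changes how the gap is handled without altering the plan or the low-level controller.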
1. Dexterous Hands Need More Than 20 Degrees of Freedom to Fully Map Human Movements
Zhou Yong, co-founder and CTO of Lingxin QiaoShou, said that "dexterous operations" in daily life will be the hallmark of the true arrival of the embodied intelligence era. The next step for embodied intelligence is multimodal perception and interaction algorithms, and implementing them requires dexterous hands with multimodal perception capabilities.
Language models alone can hardly perceive the texture, density, friction, and other properties of objects, so they must be supplemented by vision and touch. The next generation of dexterous hands for embodied intelligence therefore needs high degrees of freedom, multiple sensors, and sufficient data tied to real deployment scenarios. Only by achieving all three can it power the realization of embodied intelligence.
Current humanoid robot hands have around 6 degrees of freedom. Zhou Yong believes that more than 20 degrees of freedom are needed to fully map human movements, and that the sensors need to perceive position, force, and touch.
2. Multimodal Large Models Can Help Robots Understand Instructions, AI Applications May Bring in More Than 300 Million in Revenue
Mech-Mind is a company with AI + 3D vision at its core. Its multimodal large model, MechGPT, lets robots understand natural language instructions, reason and make decisions based on multimodal information such as vision and drawings, and decide intelligently how to complete tasks. Its application is not limited to a specific type of robot; it can serve humanoid, service, collaborative, and industrial robots.
Shao Tianlan, Mech-Mind's founder and CEO, stated that AI matters to robots because, across a large number of business practices, it lets them solve complex and variable problems that rules cannot easily handle. Revenue of 300 to 400 million is expected this year.
The challenges in mass deployment of intelligent robots include a long technology chain and many technologies that are still evolving rapidly, while customer requirements are high. In this context, Mech-Mind's experience is to focus on sensing, perception, and planning, and to work with ecosystem partners to form a positive cycle of technology, product, business, and capital.
3. Dong Xiaojian of Weijing Intelligence: Embodied intelligence relies on the continuous growth of intelligent agents, and only three-dimensional vision can provide environmental perception
Dong Xiaojian, founder and CEO of Weijing Intelligence, said that the entire process of embodied intelligence is one of continuous learning among "machines, people, and intelligent agents". Sufficient sensors are the foundation of machine learning, and learning requires not only digital samples from the Internet but also data from real scenes. Embodied intelligence is a gradual learning process: data from vision, touch, inertial, distance, and other sensors is filtered into useful data and becomes part of the intelligent agent.
The intelligent agent takes in hearing, vision, touch, and other inputs, and outputs sound, operation, perception, motion, and decision-making. He therefore believes that the core of embodied intelligence is the continuous growth of the intelligent agent.
The most important sensor for humanoid robots is vision, which should be three-dimensional, in color, equipped with memory, and able to deepen gradually. He believes that only three-dimensional data can give robots environmental visual perception, operational perception, and guidance. Weijing Intelligence's new generation of humanoid robots carries four pairs of three-dimensional vision systems, which can be adapted to all practical application scenarios.
VI. Embodied intelligence is the mainstream direction of AI industry development, and interdisciplinary integration is the main line of research
In the last session of the morning, a panel discussed "Boiling Embodied Intelligence: Disruption, Bottlenecks and Technical Ambitions". The panelists were Fang Bin, Professor at Beijing University of Posts and Telecommunications; Liu Huaping, Professor at Tsinghua University; Tao Yong, Associate Professor and Doctoral Supervisor at the School of Mechanical Engineering at Beihang University; Gao Fei, Associate Professor at the College of Control Science and Engineering at Zhejiang University; Wang He, Assistant Professor at Peking University and Director of the Peking University Galaxy General Embodied Intelligence Joint Laboratory; and Ruan Lecheng, Researcher and Director of the Intelligent Center of the Peking University Human-Machine Integration Laboratory.
The concept of embodied intelligence has been a hot topic in recent years. Liu Huaping stated that the surge in embodied intelligence is closely related to breakthroughs in AI. In a sense, these technologies have radiated into robotics and automation, bringing those fields back into the public eye.
The outbreak of embodied intelligence involves interdisciplinary integration. Gao Fei believes that robots themselves are the intersection of disciplines such as electronics, mechanics, and control. Embodied intelligence is the intersection of AI and traditional robotics. Good results definitely require interdisciplinary integration. Persisting in interdisciplinary integration and multi-angle fusion is a main thread.
For general-purpose robots, Wang He said that generalization is a turning point in the robotics industry. General-purpose robots can be applied to more flexible scenarios and interact better with humans. At the same time, the surrounding environment is designed for the human body. General-purpose robots are consistent with the human body form, which is also a major reason for the development of humanoid robots.
For young researchers now entering embodied intelligence, Tao Yong gave two pieces of advice. First, they need to understand the international frontier and current development hotspots. Second, they need to find their own research positioning, in more specific directions such as multimodal fusion or skill learning and transfer.
At present, the development of embodied intelligence in China and the United States is in the exploratory stage. Ruan Lecheng mentioned that there are significant differences, as domestic embodied intelligence companies are backed by academic institutions, while American companies are mostly pure corporate forms. He believes that the integration of industry, academia, and research should be promoted to advance research and development.
All four guests reached a consensus that embodied intelligence is the mainstream direction of AI industry development. The field will develop toward technological breakthroughs, application scenarios, integrated development of multi-industry modules, and security and ethics.
VII. Embodied Intelligence Super Scenarios: A market of hundreds of millions of units must be To C
In the afternoon session, a dialogue took place around "Looking for Super Scenarios: Where is embodied intelligence used?". The participants were Liu Jinchang, a researcher at the National Natural Science Foundation's High-Tech Center and a second-level expert in professional technology of the Ministry of Science and Technology; Zhao Mingguo, a researcher at the Department of Automation of Tsinghua University, director of the Robotics Control Laboratory, and chief scientist of Accelerated Evolution; Ji Chao, chief scientist of iFLYTEK Robotics; Duo Shao, vice president of Xiaomi Mobile Department and general manager of Xiaomi Robotics Business Department; and Shi Zhe, chief digital officer of Foxconn Technology Group and CEO of Yunzhihui.
Zhao Mingguo first distinguished the concepts of embodied intelligence and humanoid robots. He stated that embodied intelligence is a foundational concept, while humanoid robots are a product concept. Humanoid robots act as carriers of embodied intelligence, and the intersection of the two happens to be what the industry currently recognizes most.
Duo Shao also mentioned that with the support of reinforcement learning, humanoid robots will achieve freedom of movement, and that embodied intelligence has to some extent achieved freedom of interaction. Combining freedom of movement with freedom of interaction makes the integration of embodiment possible. In addition, humanoids attract more public attention, which is another reason embodied intelligence is being applied to humanoid robots.
Ji Chao believes that embodied intelligence is a platform-level, virtual concept, while humanoid robots are its application carriers. Combining the platform with the track means integrating embodied intelligence with humanoid robots to unify hardware and software.
In terms of transforming traditional industrial robots, Zhao Mingguo believes that embodied intelligence can solve more complex problems and will extend the concept from simple control perception, combined with visual sensors, to uncertain interactions with the environment.
Duo Shao mentioned the trend of reducing the cost of FAE (industrial production line). With embodied intelligence, FAE time can be shortened and some cross-station transfer issues can be solved, which will greatly improve factory management efficiency.
Shi Zhe, Chief Digital Officer of Foxconn Technology Group and CEO of Yunzhihui, said that with embodied intelligence assisting optimization, automation designers can dare to lay out a production line in a different way. Industrial robots need perception, vision, and feedback to work with traditional mechanical arms and complete automated tasks under complex conditions.
Ji Chao compared general-purpose robots with special-purpose robots. General-purpose robots can be compatible with multiple workflows and can solve continuous problems by sacrificing certain efficiency and combining prior knowledge to achieve unification in multiple task scenarios.
Regarding the current situation where it is difficult for embodied intelligence to be implemented, Zhao Mingguo believes that the development of embodied intelligence is in its infancy, and a complete theoretical system has not yet been formed. In addition, its supply chain is very long and involves a wide range of technologies, which is in a stage of contention among various schools of thought.
Liu Jinchang summarized that the reasons behind this are, first, large models do not yet have deductive capabilities, and large models for humanoid robots or intelligent equipment need further development. The second is the cost-benefit ratio, which does not have market competitiveness.
Ji Chao believes the promising scenarios fall into two categories: the first is high-value scenarios such as military and national defense, and the second is scenarios that require a high degree of flexibility, where robots replace labor in general and repetitive work. If the market reaches hundreds of millions of units, it will ultimately follow a To C logic, with one robot for each person in every household.
Zhao Mingguo added that on its path to large-scale application, embodied intelligence will not emerge in existing markets, because there it would be compared directly with traditional methods and would need a significant technological breakthrough to win. He therefore believes that people are most willing to pay for genuinely new things.
VIII. Wang Xingxing of Yushu Technology: Embodied intelligence is the most effective way to achieve AGI, and a general-purpose robot model will appear by the end of next year
The development of embodied intelligence is at its starting point, and at this stage there are greater opportunities for small and medium-sized enterprises. Wang Xingxing believes that embodied intelligence is the most effective way to achieve AGI.
Since its establishment in 2013, Yushu Technology has released high-performance quadruped robots, humanoid robots, and other products. Wang Xingxing revealed that the humanoid robot H1 has already been produced and shipped in small batches, and that Yushu Technology ships more quadruped robots than any other company in the world.
In the first half of last year, Yushu Technology released the quadruped robot Go2, which used the OpenAI interface to support voice interaction and large-model planning and execution, but the final effect was not ideal. Once a request goes beyond the range that has been planned for, the robot dog performs very poorly. He believes that current large language models are not the most ideal way to achieve AGI.
Many people think that AI control is a black box with low reliability, but he believes that AI training will be much more reliable than human coding. When an AI is trained, it also undergoes a large number of tests at the same time. Once a software system's complexity reaches a certain scale, as in autonomous driving, it is beyond human maintenance. AI, by contrast, only needs to be given sufficient computing power.
Future technical trends include network model architectures for deep reinforcement learning. A large amount of human work is still needed today in generalization, fine-tuning, and refining the models, as well as in end-to-end perception and planning and in movement over more complex terrain.
Current large language models understand the world through language or voice input, without cognition or understanding of the real world. AGI needs a physical robot for deployment in the physical world, so that it can collect the latest data in real time for training, take part in physical interaction with the whole world, and experience and understand human emotions and personality.
Wang Xingxing believes that the dawn of AGI has arrived, and that at least one company will achieve a general-purpose robot model before the end of next year.
Conclusion: Technology, Scenarios, and Policies Advance in Parallel, Embodied Intelligence Becomes a New Engine for Industrial Transformation
Embodied intelligence, as the next wave of AI, deeply integrates multiple disciplines including AI and robotics, and is accelerating the arrival of the intelligent economy era. In the field of robotics, an increasing number of domestic manufacturers involving robot bodies, core components, and software algorithms have emerged, and they continue to produce innovative results.
Robots have already reached practical application in industrial and other scenarios, but the embodied intelligence industry still needs to improve in converting research results, developing application scenarios, and building business models. This is related to the long development cycles of the large models and robot products that embodied intelligence depends on.
Against this background, China has a solid foundation in robot hardware manufacturing, rich and diverse application scenarios, and corresponding policy support, which together have raised the curtain on embodied intelligence, one of the most epoch-making technologies.