As China’s artificial intelligence industry has exploded, the need for data labeling has grown exponentially. Every year, more than 2 million hours of voice data and hundreds of millions of images need to be labeled so that machines can learn how to recognize them.
Collecting, sorting, cleaning, and labeling this basic data is a prerequisite for training artificial intelligence models. According to a recent report, China’s artificial intelligence data service market was worth RMB 2.586 billion in 2018, of which 86% came from customized data resource services. The market is expected to exceed RMB 11.3 billion in 2025.
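As a rough check on these figures, the growth rate implied by the report's two data points can be worked out directly. The sketch below assumes smooth compound growth between 2018 and 2025 and treats RMB 11.3 billion as the 2025 value, although the report only says the market will exceed it. The function and variable names are illustrative and do not come from the report.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article, in RMB billion.
MARKET_2018 = 2.586
MARKET_2025 = 11.3          # assumption: used as a lower bound ("will exceed")
CUSTOMIZATION_SHARE = 0.86  # share of the 2018 market from customization services

# Portion of the 2018 market attributed to data resource customization.
customization_2018 = MARKET_2018 * CUSTOMIZATION_SHARE

# Minimum annual growth needed to get from the 2018 figure to the 2025 figure.
cagr = implied_cagr(MARKET_2018, MARKET_2025, 2025 - 2018)

print(f"2018 customization services: about RMB {customization_2018:.2f} billion")
print(f"Implied minimum annual growth, 2018 to 2025: {cagr:.1%}")
```

Under these assumptions the market would need to grow by somewhat more than 23% a year, and customization services would account for a little over RMB 2.2 billion of the 2018 total.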
In the Alibaba ecosystem alone, more than 200,000 people work as artificial intelligence data labeling trainers. It is estimated that by 2022 the number of such workers at home and abroad will reach 5 million. By then, the AI data training workforce may have new characteristics, but for now almost 100% of China’s AI data labeling work is done by young employees from China’s small cities and towns.
The skill threshold for AI labeling work was not high. Workers mainly collected data with crawlers and carried out mechanical, repetitive tasks, which attracted a large number of "youth from small towns" without much professional or technical training. The AI trainer industry was once considered the "Foxconn of the AI industry", meaning that it was labor-intensive and required few qualifications or professional skills.
According to Alipay’s survey data on new occupations, "youth from small towns" are the main force in more than 40 new occupations, including AI labeling.
According to the announcement of the China Employment Training Technical Guidance Center, the precise definition of artificial intelligence trainers is: "Personnel who use intelligent training software to carry out database management, algorithm parameter setting, human-computer interaction design, performance test tracking, and other auxiliary work."
In the "Notice on the Promulgation of Public Information of New Occupations", the work content of this special group of trainers is described as:
1. Annotate and process the raw data of pictures, text, voice and other services;
2. Analyze and refine the characteristics of professional fields, train and evaluate related algorithms, performance and functions of artificial intelligence products;
3. Design interactive processes and application solutions for artificial intelligence products;
4. Monitor, analyze, and manage the application data of artificial intelligence products;
5. Adjust and optimize the parameters and configuration of artificial intelligence products.
Their work is similar to that of software operations and maintenance engineers. They take part in every step, from initial data annotation to product parameter optimization. They play a key part in moving algorithms and technology from theory to application, and are an indispensable link in the industrialization of AI technology.
Supplying robots with "feed" data and strengthening the training of semantic understanding models, so that robots can better understand humans, is the most important part of the work. However, as artificial intelligence enters the stage of commercial application, data for vertical use cases has become a major requirement, and the requirements for data type, quality, and accuracy have risen significantly.
Voice, image, and NLP datasets have begun to emerge, and the strength of leading companies and professional third-party companies in the data service field has gradually become apparent. According to related reports, in 2018 about 34% of business volume went to third-party companies that specialize in data acquisition. The demand for professional data is evident.
Raising the professionalism and accuracy of data also requires practitioners to have relevant domain knowledge and to exercise creativity in order to meet users’ customized needs.
The labeling process is no longer rough and general, such as labeling images as "sky", "vehicle", and "crowd". Instead, labeling dimensions have become more finely subdivided and vertical. Faces now need to be labeled for emotion detection and for deeper subdivisions such as micro-expression recognition, which requires data service practitioners to have the corresponding domain knowledge.
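To make the contrast concrete, here is a minimal sketch of how a coarse image label and a fine-grained face annotation might differ as data records. The class names, field names, and label values are hypothetical illustrations and are not taken from any real labeling platform.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CoarseLabel:
    """Rough, image-level label of the older style."""
    image_id: str
    category: str  # e.g. "sky", "vehicle", "crowd"


@dataclass
class FaceAnnotation:
    """Finer-grained, vertical annotation for a single face (hypothetical schema)."""
    image_id: str
    bounding_box: Tuple[int, int, int, int]  # x, y, width, height in pixels
    emotion: str                              # e.g. "surprise"
    micro_expressions: List[str] = field(default_factory=list)
    annotator_domain: Optional[str] = None    # domain background of the labeler

    def is_fine_grained(self) -> bool:
        # True only when micro-expression detail has been recorded.
        return bool(self.micro_expressions)


coarse = CoarseLabel(image_id="img_001", category="crowd")
fine = FaceAnnotation(
    image_id="img_001",
    bounding_box=(40, 32, 96, 96),
    emotion="surprise",
    micro_expressions=["brow_raise", "lip_part"],
    annotator_domain="psychology",
)

print(coarse)
print(fine.is_fine_grained())
```

The second record carries several judgments that a general-purpose labeler could not reliably make, which is why the fine-grained schema also tracks the annotator's domain background.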
In this context, the highly fragmented form that the data labeling industry originally took has begun to lose relevance. Data labeling has gradually shifted from labor-intensive to skill-intensive. AI trainers who once did assembly-line work have moved to more professional and sophisticated ways of working. They are slowly becoming "experts" in this field.
In the long run, although AI is becoming more and more intelligent and can assist in data labeling, it is still difficult for machines to judge many situations. For example, AI’s recognition of text evolution and emotion is still weak. In the future, AI will have to deal with more complicated industries. Human perception and judgment cannot be replaced.
By 2025, basic data services for autonomous driving alone will exceed RMB 2.4 billion, and the total number of data tasks in that industry will exceed 100 million. With the widespread application of artificial intelligence in intelligent manufacturing, intelligent transportation, smart cities, intelligent medical treatment, intelligent agriculture, intelligent logistics, intelligent finance, and other industries, the artificial intelligence data labeling workforce is set for explosive growth.