Job Responsibilities:
Build a data-centric, full-modality data closed loop for large models, spanning pre-training data, instruction data, and live-network return data; consolidate data science and engineering practice; and enhance Xiaoyi's product competitiveness and user experience. The work includes:
...
1. Data technology pre-research: track the evolution of large model data technology and solve medium- and long-term data problems for large models, including but not limited to data curriculum learning, multimodal data alignment, large model Agent data capabilities, and instruction data theory and practice;
2. Data cleaning and quality evaluation: formulate detailed, comprehensive, and executable standards for full-modality data acquisition, cleaning, and quality; build a data cleaning engineering platform; classify training data; establish a per-category data quality and optimization system; and resolve issues such as data topic distribution and content compliance;
3. Data annotation: own the full-modality data annotation platform, including interaction design and interface construction; iterate on annotation tools, annotation standards, and quality inspection standards; and continuously improve the quality and efficiency of annotating difficult data;
4. Instruction data: design and research alignment data, including its construction, its use, and the evaluation of its impact on model capabilities; explore how to optimize the combination of pre-training data and alignment data to improve overall model capabilities;
5. Large model evaluation: design model evaluation dimensions, improve evaluation methods, assess the capability boundaries of large models, and use the evaluation results to continuously improve data diversity, data quality, and data mixing strategies;
6. Live-network return data: design the live-network data system for Xiaoyi return data, investigate live-network problems, formulate data repair standards and methods, and feed high-value live-network data back into pre-training and alignment data to close the loop, effectively improving model capabilities and the live-network user experience.
Job Requirements:
1. Bachelor's degree or above in computer science, mathematics, pattern recognition, statistics, or a related major; a doctoral degree is preferred;
2. Preference will be given to candidates who have published papers at top conferences and have experience at large companies in large model development, LLM/VLM algorithm development, data development, or platform development;
3. Ability to quickly read and reproduce papers, systematically compare relevant results in the research field, and formulate improvement plans for large model data;
4. Strong teamwork and communication skills, logical thinking, and business analysis ability, together with a strong ability to innovate and drive implementation.