Job Search
| This posting is managed by: | 4M Career KK |
|---|---|
| Company Name | Company is not publicly visible |
| Job Type | IT (Mainframe) - Programmer; IT (PC, Web, Unix) - Web Application SE |
| Industry | Telecommunications/Information Services |
| Location | Asia, Japan, Tokyo |
| Job Description |
Position Details
Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
High-Quality Dataset Creation & Curation: Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora. Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis. Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
Data Job Management: Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle. Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
Data Infrastructure & Orchestration: Build and maintain scalable data warehouses and data lakes specifically designed for LLM data in both on-premise and public cloud environments. Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
Mandatory Qualifications:
- Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, and 3+ years of professional experience in Data Engineering, with a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
- Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
- Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
- Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
- Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective.
- Experience designing and implementing data annotation workflows and pipelines.
- Strong proficiency in Python, and extensive experience with its data ecosystem.
- Proficiency in SQL and a good understanding of data warehousing concepts, data modeling, and schema design. |
| Company Info |
Join a global tech-driven powerhouse headquartered in Tokyo, with over 70 services spanning e-commerce, fintech, mobile, and digital content. Operating in 30+ countries and serving 2 billion users, we foster an entrepreneurial culture built on innovation, diversity, and speed. With English as our official language and a team from 100+ nationalities, we value open communication and global collaboration. From powering Japan’s largest online marketplace to pioneering cloud-native telecom networks, we’re creating an ecosystem where ideas scale and people thrive. If you're passionate about building impactful products and shaping the future of digital services, this is your place to grow. |
| English Level | Business Conversation Level (TOEIC 735-860) |
| Japanese Level | None |
| Salary | JPY (Japanese Yen) 6,000K - 10,000K |
| Other Salary Description |
Social Insurance, Commuting/Transportation Allowance, Relaxation Facilities |
| Holidays |
Five-Day Workweek, Summer Holidays, Winter Holidays, Refresh Holidays, Sick Leave, Paid Holidays, Child-care Leave |