Hi, I'm Santosh Yadav 👋😊

I design and build reliable, scalable data and machine learning infrastructure.

With a strong foundation in data engineering and machine learning operations (MLOps), I enjoy solving the "last mile" problems of getting models and pipelines into production. Whether it's building real-time data pipelines with Kafka, orchestrating workflows with Airflow, or deploying models using Docker, Kubernetes, and MLflow—I'm all about creating systems that are robust, automated, and easy to maintain.

  • 🏗️ Designing data pipelines that scale
  • 🚀 Operationalizing ML models in production environments
  • 📊 Monitoring, observability, and continuous delivery in ML workflows
  • 🤝 Bridging the gap between data science and production engineering
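
To give a flavor of the Kafka piece of that, here's a minimal sketch of producing telemetry events to a topic. It's illustrative only, assuming the kafka-python client, a broker on localhost:9092, and a made-up `telemetry` topic and event shape:

```python
# Minimal sketch: pushing telemetry events onto a Kafka topic.
# Assumes the kafka-python package and a local broker; the topic
# name and event shape are illustrative, not a real system's.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"vehicle_id": "truck-042", "speed_kmh": 83.5, "ts": time.time()}
producer.send("telemetry", value=event)  # async send; batched internally
producer.flush()  # block until the event is actually delivered
```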

I believe in writing clean code, documenting processes, and collaborating across teams to bring data-driven products to life.

Current Role & Focus

I currently work at Scania (Traton Group), helping drive innovation in Autonomous Transport Solutions by building robust data infrastructure for large-scale machine learning systems.
My focus lies in developing reliable, scalable, and automated data pipelines to power intelligent mobility systems. From ingesting and transforming terabytes of sensor and telemetry data to orchestrating machine learning workflows in production, I build the backend that keeps autonomous systems smart and responsive.

🚛 Current Focus
  • Designing and managing distributed data pipelines using PySpark on Databricks (see the sketch after this list)
  • Leveraging AWS cloud infrastructure for compute, storage, and automation
  • Building infrastructure-as-code with Terraform, orchestrated through GitLab CI/CD
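
As a taste of the first bullet, here's a minimal PySpark sketch of the kind of batch transform this involves. It's a sketch under assumptions, not production code: the S3 paths, column names, and filter thresholds are all illustrative.

```python
# Minimal sketch of a batch clean-up job on Databricks.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telemetry-clean").getOrCreate()

# Ingest raw, semi-structured telemetry dropped by upstream producers.
raw = spark.read.json("s3://example-bucket/raw/telemetry/")

# Deduplicate, derive a partition column, and drop implausible readings.
clean = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date("event_ts"))
       .filter(F.col("speed_kmh").between(0, 200))
)

# Write partitioned Delta output for downstream ML feature pipelines.
(clean.write.format("delta")
      .mode("overwrite")
      .partitionBy("event_date")
      .save("s3://example-bucket/curated/telemetry/"))
```
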
💡 What I care about
  • Data quality, reproducibility, and observability
  • End-to-end automation and deployment of ML workflows
  • Collaboration between data, ML, and platform teams

Outside of work, I'm constantly learning about MLOps, streaming architectures, and scaling data systems for real-time decision making.

Experience

  • 2014–2019: Developer, Capital Eye Solutions
  • 2019–2021: Researcher, University of Skövde
  • 2021–2022: Data Engineer, Kambi Group
  • 2022–Present: Data Engineer, Scania Group

Tech Stack

Languages & Tools

Python
SQL

Frameworks & Platforms

Apache Airflow
dbt

Projects

Agentic AI with MCP

Agentic AI

The Model Context Protocol (MCP) provides a standard way for AI models to interact seamlessly with different data sources and applications.
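
As a rough illustration, here's what a tiny MCP server exposing one tool can look like with the official `mcp` Python SDK's FastMCP interface; the server name and the tool itself are hypothetical:

```python
# Minimal sketch of an MCP server exposing a single tool,
# assuming the official `mcp` Python SDK. The tool is a mock.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("pipeline-status")

@mcp.tool()
def pipeline_status(pipeline_name: str) -> str:
    """Report the (mocked) status of a named data pipeline."""
    # A real server would query Airflow, Databricks, etc. here.
    return f"Pipeline '{pipeline_name}' is healthy."

if __name__ == "__main__":
    mcp.run()  # stdio transport; an MCP-capable model can now call the tool
```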

Data Pipeline that scales

Data Pipeline

A pipeline that accepts raw data from various sources, processes it into meaningful information, and pushes it into storage such as a data lake or data warehouse.
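
In its simplest form, that extract-transform-load shape looks something like the sketch below, here with pandas; the file paths and column names are illustrative:

```python
# Minimal ETL sketch: accept raw data, process it, push it to storage.
# Paths and columns are hypothetical; pandas stands in for the real engine.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Accept raw data, here from a CSV drop zone."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw rows into meaningful, analysis-ready records."""
    df = raw.dropna(subset=["order_id"]).drop_duplicates("order_id")
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df

def load(df: pd.DataFrame, path: str) -> None:
    """Push the result into warehouse-friendly storage (Parquet here)."""
    df.to_parquet(path, index=False)

load(transform(extract("raw/orders.csv")), "curated/orders.parquet")
```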

Real-Time Analytics Dashboard

Real-Time Analytics

Lakehouse Apps look set to streamline the handoff of data from data engineers, and may even let downstream teams run their own data ingestion.

Latest Articles

Building Scalable Data Pipelines

A comprehensive guide to building and maintaining production-grade data pipelines.

Read More →

Certifications

Databricks Certified Data Engineer
Databricks Generative AI Fundamentals
Databricks Lakehouse Fundamentals
AWS Certified Cloud Practitioner

Get in Touch

Want to collaborate or just chat data? Let's connect.