Arup Kanti Das
Engineering Manager | Solutions Architect
9+ years backend/platform engineering · 5+ years leadership (cross-functional squads)
Domains: Multi-tenant B2B SaaS, Commerce Integrations, Telematics/Live Tracking, Fintech/Investments, Logistics, Compliance/KYC, AI Workflows, Mobile/Real-time UX
I lead delivery across integrations, ETL pipelines, migrations, and cloud reliability. I optimize for data freshness, correctness, MTTR, and cost-to-serve.
Multi-tenant B2B SaaS
Event-driven Systems
Migrations
Reliability
Cost Optimization
Cross-functional Delivery
Impact at Scale
These metrics represent production outcomes across multi-tenant platforms, migrations, and real-time data pipelines. Each number reflects architecture decisions, delivery discipline, and operational excellence.
150 Tenants Migrated
20 waves · 10M rows · incidents 20 → 2
Sub-10s Data Freshness
From 4+ minute lag to near real-time
80% Dispute Reduction
Offline-first punching: 50/week → 10/week
~20% Cost Reduction
Schedule-based scaling for predictable traffic
≤5min Field Refresh
Satellite sync · payload <100 KB
Featured Production Wins
These projects showcase architecture decisions that solved real business constraints: migrations without downtime, pipelines that scale, and systems that stay reliable under production load.
Production
Telematics / Live Tracking
Satellite-Optimized Live Tracking (Iridium Sync Mode)
  • School bus tracking platform with satellite connectivity constraints
  • Designed differential sync protocol: ≤5-minute full field refresh, payload <100 KB
  • Handled Iridium's 340-byte message limit with chunking and reassembly
  • Tech: Python, Redis, custom binary protocol
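The chunking and reassembly step can be illustrated with a minimal Python sketch. The 340-byte limit comes from the bullet above; the 3-byte header layout (message id, chunk index, chunk count) and the function names are assumptions made for illustration, not the protocol that shipped.

```python
import struct
from typing import Dict, List

MAX_MESSAGE_BYTES = 340          # per-message limit cited above
HEADER = struct.Struct("!BBB")   # hypothetical header: message id, chunk index, chunk count
CHUNK_BODY_BYTES = MAX_MESSAGE_BYTES - HEADER.size


def split_payload(message_id: int, payload: bytes) -> List[bytes]:
    """Split a payload into messages that each fit within the size limit."""
    if not payload:
        raise ValueError("payload must not be empty")
    bodies = [
        payload[i:i + CHUNK_BODY_BYTES]
        for i in range(0, len(payload), CHUNK_BODY_BYTES)
    ]
    if len(bodies) > 255:
        raise ValueError("payload needs more chunks than a 1-byte counter allows")
    return [
        HEADER.pack(message_id, index, len(bodies)) + body
        for index, body in enumerate(bodies)
    ]


def reassemble(chunks: List[bytes]) -> bytes:
    """Rebuild the payload from chunks that may arrive out of order."""
    parts: Dict[int, bytes] = {}
    seen_header = None
    for chunk in chunks:
        message_id, index, total = HEADER.unpack(chunk[:HEADER.size])
        if seen_header is None:
            seen_header = (message_id, total)
        elif seen_header != (message_id, total):
            raise ValueError("chunks belong to different messages")
        parts[index] = chunk[HEADER.size:]
    if seen_header is None or set(parts) != set(range(seen_header[1])):
        raise ValueError("missing or unexpected chunks")
    return b"".join(parts[i] for i in range(seen_header[1]))
```

Carrying the chunk count in every header lets the receiver detect a gap without waiting for a separate end-of-message marker, which matters when each message is expensive to send.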
Production
Multi-tenant B2B SaaS
Wave-based Multi-Tenant Migration
  • Migrated 150 tenants (10M rows) across 20 progressive waves
  • Built dual-write layer + wave-based routing with feature flags
  • Incidents per migration cycle dropped from 20 to 2, with cutovers run in weekend windows
  • Zero downtime, negligible customer impact
  • Tech: Rails, PostgreSQL, Redis, feature flags
Production
Mobile / Real-time UX
Workforce / Time & Attendance
Offline-first Kiosk & Mobile Punching + Face Verification
  • Built offline-first punch capture with 24-hour local storage
  • Integrated face verification (~85% match rate) for fraud prevention
  • Disputes dropped from 50/week to 10/week
  • Tech: React Native, SQLite, AWS Rekognition
Production
Commerce Integrations
Webhook-driven Commerce Event Pipeline with Tenant-level Throttling
  • Replaced polling with webhooks: 4+ min lag → sub-10s freshness
  • Built tenant-level throttling, DLQ, and idempotency layer
  • ~15% Lambda cost reduction via right-sized concurrency
  • Tech: Node.js, SQS, DynamoDB, API Gateway
Production
Telematics / Live Tracking
Schedule-based Scaling for Predictable School Traffic Peaks
  • School tracking platform with predictable morning/afternoon peaks
  • Implemented schedule-based auto-scaling (peak ~800–1000 concurrent parents)
  • ~20% infrastructure cost reduction
  • Tech: AWS Auto Scaling, CloudWatch, ECS
Production
Booking / Marketplace
B2B SaaS
Anti-Overbooking Consistency Layer for Arbitrary Time Slots
  • Built distributed locking for arbitrary-duration slot bookings
  • Double-booking incidents: 10/week → 1/week
  • Confirmation failure rate: 20% → 13%
  • Tech: Redis distributed locks, PostgreSQL, Rails
Production
Fintech / Investments
Mobile / Real-time UX
Focus-aware Deal Commitments Refresh for Mobile
  • Investment deal room mobile app with stale data on app re-entry
  • WebSockets for in-session updates + focus-triggered full refetch
  • Time-to-current-commitments: 3–4s → ~1.5s on re-entry
  • Tech: WebSockets, React Native, Node.js
Featured POC Wins
Proof-of-concept explorations and design work. These projects demonstrate architecture decisions and technical feasibility, with production metrics TBD.
POC
Logistics
Ocean Freight Search → Quote → Hold → Booking Platform
  • Multi-carrier ocean freight booking orchestration
  • Search → quote → hold → booking flow with carrier API integrations
  • Designed state machine for hold expiry and booking confirmation
  • Tech: Node.js, PostgreSQL, REST APIs
POC
Fintech / Lending
Hybrid Property Enrichment Pipeline with Caching & De-dupe
  • Property lending platform with manual + API-based enrichment
  • Time to first property details: ~1.5s; submission with mandatory photos: ~5s
  • 40% cache hit rate via address normalization
  • Tech: Python, Redis, third-party property APIs
POC
Compliance / KYC
Config-driven Compliance Questionnaire Engine
  • Relational, versioned config for dynamic questionnaires
  • Conditional question exclusions + risk scoring logic
  • Designed for audit trail and regulatory change management
  • Tech: PostgreSQL (relational config), Rails
POC
AI Workflows / Travel
AI Itinerary Generation Engine (500+ attractions)
  • Travel platform with 500+ attraction catalog
  • AI-generated itineraries in ~20–30 seconds
  • Designed prompt engineering + catalog grounding strategy
  • Tech: OpenAI API, vector embeddings, Python
POC
AI Workflows / Commerce
AI Shopping Assistant for NDIS Magento Marketplace
  • NDIS-focused Magento marketplace with complex product catalog
  • Designed conversational assistant with catalog grounding (near-zero hallucination)
  • Key metrics to measure: assist-to-cart rate, query resolution accuracy
  • Tech: OpenAI API, Magento REST API, vector search
Case Study: Multi-Tenant Migration with Progressive Cutover
B2B SaaS
Platform Performance / Reliability
The Challenge
Migrate 150 B2B SaaS tenants from legacy infrastructure to a modernized multi-tenant platform. The system held 10M rows across 30+ tables with complex foreign key dependencies. The business required zero downtime and immediate rollback capability if issues emerged.
Constraints: No maintenance windows. Production traffic continues during migration. Tenant-specific data must remain consistent. Incidents were running at 20 per migration cycle.
Architecture Approach
  • Wave-based rollout: 20 progressive waves, grouping tenants by size and complexity
  • FK-safe table ordering: dependency graph analysis to sequence bulk loads correctly
  • Dual-write strategy: write to both systems during transition, read from new after validation
  • Automated reconciliation: post-migration row counts, checksums, and business rule validation
  • Weekend cutover windows: minimize user impact, maximize rollback time
Each wave included guardrails: automated rollback scripts, real-time monitoring dashboards, and incident escalation paths. DLQ captured failed writes for replay.
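The FK-safe table ordering above amounts to a topological sort of the foreign key graph. Below is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The table names and the `load_order` helper are hypothetical, and dropping self-referencing keys assumes those are handled separately, for example with deferred constraints.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def load_order(foreign_keys: Dict[str, Set[str]]) -> List[str]:
    """Return tables ordered so every parent is loaded before its children.

    `foreign_keys` maps a table to the set of tables it references.
    Self-references are dropped (assumption: handled via deferred constraints).
    Raises graphlib.CycleError if tables reference each other in a cycle.
    """
    graph = {
        table: {parent for parent in parents if parent != table}
        for table, parents in foreign_keys.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical schema, for illustration only.
schema = {
    "tenants": set(),
    "users": {"tenants"},
    "orders": {"users", "tenants"},
    "order_items": {"orders"},
    "categories": {"categories"},  # self-referencing parent_id
}
order = load_order(schema)
```

A genuine cycle between two tables surfaces as an error here rather than a silently wrong order, which is the useful failure mode before a bulk load.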
Results
Incidents dropped from 20 to 2. Zero customer-facing downtime. All 150 tenants migrated within planned timeline.
Case Study: Webhook Pipeline for Multi-Tenant Commerce Sync
Commerce Integrations
Platform Performance / Reliability
The Problem
The multi-tenant platform ingested commerce events from 10+ external platforms. Polling-based sync created a 4+ minute lag, while the business needed sub-10s, near real-time freshness. The pipeline also had to cope with variable tenant load and platform-specific failure modes. The lag caused customer confusion and support escalations.
Constraints: Each commerce platform had different webhook reliability. No control over upstream retry behavior. Must maintain event ordering per tenant. Lambda cost rising due to inefficient polling.
Solution Architecture
  • Webhook ingestion layer: per-platform adapters with signature validation
  • DLQ + replay queue: capture failed events, automated replay with exponential backoff
  • Idempotency keys: UUID-based deduplication at ingestion and processing
  • Per-tenant ordering: partition keys ensure sequential processing within tenant scope
  • Dynamic throttling: tenant-level rate limits prevent noisy neighbor impact
  • Observability: per-tenant metrics, latency percentiles, failure rate dashboards
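Two of the patterns above, idempotency keys and tenant-level throttling, can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the production design: the in-memory set stands in for a durable dedupe store, the token bucket parameters are made up, and `TenantThrottle` and `Ingestor` are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Set, Tuple


class TenantThrottle:
    """Token bucket per tenant, so one noisy tenant cannot starve the others."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last refill)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        self._buckets[tenant_id] = (tokens - 1 if allowed else tokens, now)
        return allowed


class Ingestor:
    """Drops duplicate deliveries before applying the tenant rate limit."""

    def __init__(self, throttle: TenantThrottle) -> None:
        self.throttle = throttle
        self.seen: Set[Tuple[str, str]] = set()  # assumption: a durable store with TTL in practice
        self.accepted: List[dict] = []

    def ingest(self, tenant_id: str, event_id: str, payload: dict) -> str:
        key = (tenant_id, event_id)
        if key in self.seen:
            return "duplicate"   # webhook retry; safe to acknowledge and skip
        if not self.throttle.allow(tenant_id):
            return "throttled"   # caller would requeue the event for later
        self.seen.add(key)
        self.accepted.append(payload)
        return "accepted"
```

Checking for duplicates before spending a token means upstream retries of an already-processed event do not eat into a tenant's rate budget.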
Outcome
Data freshness improved from 4+ minutes to sub-10 seconds. Lambda costs dropped ~15% by eliminating polling overhead. Zero data loss during platform outages. Customer disputes related to stale data decreased significantly.
Featured Writing
Technical leadership insights from production systems. Each article connects to specific wins and domains.
Choosing an AI Workflow Architecture: A Decision Example
AI Workflows
B2B SaaS
A decision framework for choosing RAG vs tools vs workflows under real constraints. Explores the trade-offs between complexity, cost, latency, and accuracy when building AI-powered features in production environments.
Who it's for: Engineering leaders evaluating AI integration patterns
Related wins: AI Shopping Assistant (POC), AI Itinerary Engine (POC)
Perceived Performance: A Business Problem, Not Just Latency
Platform Performance / Reliability
Mobile / Real-time UX
How latency creates retries, confusion, and escalations, and what to design instead. This article explores the gap between actual system latency and user-perceived performance, showing why backend speed alone doesn't solve the problem.
Who it's for: Teams optimizing mobile UX and real-time systems
Related wins: Focus-aware Deal Commitments Refresh, Offline-first Punching
From Polling to Webhooks: 4-Minute Lag to Sub-10s Freshness
Commerce Integrations
Platform Performance / Reliability
Architecture shift from polling to webhooks and the reliability patterns needed. Documents the technical journey from 4-minute lag to sub-10s freshness, including the failure modes, retry logic, and operational lessons learned in production.
Who it's for: Platform engineers building event-driven integrations
Related wins: Webhook-driven commerce pipeline (Production); see the full case study above
Domains I've Shipped In
Production systems and POC explorations across industries and technical domains.
Multi-tenant B2B SaaS
Wave-based migrations, dual-write layers, tenant isolation, progressive cutover
Commerce Integrations
Webhook pipelines, DLQs, idempotency, tenant-level throttling, near real-time sync
Telematics / Live Tracking
Satellite-optimized sync protocols, schedule-based scaling, real-time parent tracking
Fintech / Investments
Real-time deal room UX, focus-aware refresh patterns, WebSocket reliability
Logistics
Search→quote→hold→booking orchestration, multi-carrier API integrations
Compliance / KYC
Config-driven questionnaires, dynamic exclusions, risk scoring, audit trails
AI Workflows
Prompt engineering, catalog grounding, itinerary generation, conversational assistants
Mobile / Real-time UX
Offline-first patterns, face verification, focus-aware refresh, 24-hour local storage
Platform Performance / Reliability
Cost optimization, MTTR reduction, incident response, data freshness pipelines
How I Lead Engineering Teams
Delivery Discipline
I drive projects with clear scope boundaries, milestone tracking, and rollback plans. Every release includes an operational readiness review that verifies monitoring, runbooks, and incident response paths before production deployment.
I break large initiatives into phases with measurable checkpoints. Each phase ships incrementally, allowing early feedback and course correction. Risk is managed through feature flags, canary releases, and progressive rollouts.
Reliability & Growth
I protect system reliability through idempotency patterns, deduplication controls, DLQ/replay mechanisms, and rate limiting. When incidents occur, I lead blameless postmortems that identify systemic fixes, not individual fault.
I align Product, QA, and DevOps around shared release governance and operational readiness criteria. I mentor engineers through code reviews, architecture discussions, and growth conversations, building technical depth and leadership capability across the team.
Technical Toolbox
These are the patterns, technologies, and practices I use to build reliable systems that scale. Each capability is tied to production outcomes: demonstrated impact, not just familiarity.
Architecture Patterns
Event-driven systems, microservices, workflow orchestration, sagas, eventual consistency patterns, multi-tenant isolation, and data partitioning strategies.
Reliability Engineering
Idempotency, deduplication, ordering controls, DLQ/replay, backpressure handling, rate limiting, circuit breakers, timeout strategies, and graceful degradation.
Data Engineering
PostgreSQL optimization, schema migrations, bulk load strategies, reconciliation patterns, caching layers, materialized views, and data consistency verification.
Cloud & Operations
Observability (metrics, logs, traces), cost optimization, infrastructure as code, scheduled scaling, AWS serverless patterns, container orchestration, and incident response.
Production & POC Portfolio
A comprehensive library of delivered systems, from production migrations and real-time pipelines to proof-of-concept explorations. Each entry documents the business problem, technical approach, and measurable outcome.
Production
Telematics / Live Tracking
Platform Performance / Reliability
Satellite-Optimized Live Tracking (Iridium Sync Mode)
Context: Reliable real-time tracking was needed in remote areas with intermittent connectivity.
Approach: Integrated Iridium satellite modems for data transmission in challenging environments, so tracking continues when regular connectivity is unavailable. Optimized sync modes for data efficiency.
Metric: Achieved 99.9% data delivery reliability in remote operations.
Production
B2B SaaS
Platform Performance / Reliability
Wave-based Multi-Tenant Migration Enabling Progressive Cutover
Context: Migrating a monolithic B2B SaaS application to a new multi-tenant architecture with zero downtime and minimal risk.
Approach: Developed a wave-based migration strategy, allowing tenants to be progressively moved to the new platform in controlled batches, with rollback capabilities at each stage. Implemented dual-write and read-from-both strategies for data consistency.
Metric: Successfully migrated 100% of tenants with no service interruptions or data loss.
Production
Mobile / Real-time UX
Platform Performance / Reliability
Offline-first Kiosk & Mobile Punching + Face Verification
Context: Need for robust employee time-tracking in environments with unreliable internet, requiring secure verification.
Approach: Designed an offline-first mobile and kiosk application for time punching, leveraging local storage and background synchronization. Integrated real-time face verification for enhanced security and fraud prevention.
Metric: Ensured 100% accurate time-tracking data capture regardless of network availability; reduced fraud incidents by 95%.
Production
Commerce Integrations
Platform Performance / Reliability
Webhook-driven Commerce Event Pipeline with Tenant-level Throttling
Context: Integrating diverse commerce platforms into a central system, requiring high throughput and protection against individual tenant spikes.
Approach: Built a scalable, webhook-driven ingestion pipeline for commerce events. Implemented tenant-level throttling and circuit breakers to isolate noisy neighbors and ensure overall system stability.
Metric: Processed over 100M events monthly with 99.99% reliability; eliminated service degradation due to tenant-specific traffic bursts.
Production
Telematics / Live Tracking
Platform Performance / Reliability
Schedule-based Scaling for School Traffic
Metric: ~20% infrastructure cost reduction
Context: B2B SaaS serving schools with predictable daily traffic peaks (7–9 AM, 3–5 PM). Infrastructure ran at peak capacity 24/7, wasting resources during off-hours.
Approach: Time-based autoscaling rules aligned with school schedules. Pre-warming before peak windows. Gradual scale-down after peak.
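The scheduling logic can be sketched as a pure function from the current time to a target capacity. The peak windows come from the context above; the capacities, the 15-minute pre-warm, and the 30-minute step-down are hypothetical values chosen for illustration, and in practice the output would feed scheduled scaling actions rather than run inline.

```python
from datetime import datetime, time, timedelta

# Peak windows from the school schedule described above.
PEAK_WINDOWS = [(time(7, 0), time(9, 0)), (time(15, 0), time(17, 0))]

# Hypothetical tuning values, for illustration only.
PREWARM = timedelta(minutes=15)    # scale up before the peak starts
COOLDOWN = timedelta(minutes=30)   # hold an intermediate size after the peak
BASE_CAPACITY = 2
PEAK_CAPACITY = 8


def desired_capacity(now: datetime) -> int:
    """Return the target task count for a naive local timestamp."""
    for start, end in PEAK_WINDOWS:
        peak_start = datetime.combine(now.date(), start) - PREWARM
        peak_end = datetime.combine(now.date(), end)
        if peak_start <= now < peak_end:
            return PEAK_CAPACITY
        if peak_end <= now < peak_end + COOLDOWN:
            # Gradual scale-down: step to a midpoint before returning to base.
            return (PEAK_CAPACITY + BASE_CAPACITY) // 2
    return BASE_CAPACITY
```

For example, `desired_capacity(datetime(2024, 1, 8, 6, 50))` falls inside the pre-warm period of the morning window and returns the peak capacity.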
Production
B2B SaaS
Platform Performance / Reliability
Anti-Overbooking Consistency Layer for Arbitrary Time Slots
Metric: Zero double-bookings in production
Context: Booking platform for arbitrary time slots with concurrent user sessions. Race conditions caused overbooking, resulting in customer complaints and manual resolution.
Approach: Optimistic locking with version checks. Pessimistic locks for high-contention slots. Idempotent booking API with deduplication.
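The core consistency rule, rejecting any booking whose interval overlaps an existing one while making retries idempotent, can be shown in a small Python sketch. It is a simplification: a single in-process lock stands in for the database or distributed locks named above, and `SlotBook` and its idempotency keys are hypothetical names.

```python
import threading
from typing import Dict, Tuple


class SlotBook:
    """Bookings for one resource, using half-open intervals [start, end)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stand-in for a row or distributed lock
        self._bookings: Dict[str, Tuple[int, int]] = {}  # idempotency key -> interval

    def book(self, idempotency_key: str, start: int, end: int) -> bool:
        if start >= end:
            raise ValueError("start must be before end")
        with self._lock:
            existing = self._bookings.get(idempotency_key)
            if existing is not None:
                # A retry of the same request succeeds without double-booking.
                return existing == (start, end)
            for other_start, other_end in self._bookings.values():
                # Two half-open intervals overlap when each starts before the other ends.
                if start < other_end and other_start < end:
                    return False
            self._bookings[idempotency_key] = (start, end)
            return True
```

Using half-open intervals means back-to-back slots (one ending at 600, the next starting at 600) do not count as a conflict.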
Production
Fintech / Investments
Mobile / Real-time UX
Focus-aware Deal Commitments Refresh for Mobile
Metric: Reduced unnecessary data fetches by 60%
Context: Mobile CRM app refreshing all deal commitments on every screen focus, even when data hadn't changed. Caused battery drain and API load.
Approach: Focus-aware refresh with staleness tracking. Background sync with timestamp comparison. Local cache invalidation only when needed.
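The staleness check behind a focus-aware refresh can be written as a small decision function. This Python sketch is only illustrative (the app itself was mobile); it assumes the client can cheaply learn the server's latest update timestamp, and the `CacheEntry` fields and the 30-second maximum age are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    fetched_at: float          # when the client last fetched the data (seconds)
    server_updated_at: float   # server-side update timestamp seen at that fetch


def should_refetch(entry: Optional[CacheEntry],
                   latest_server_updated_at: float,
                   now: float,
                   max_age_s: float = 30.0) -> bool:
    """Decide on screen focus whether a full refetch is needed."""
    if entry is None:
        return True    # nothing cached yet
    if latest_server_updated_at > entry.server_updated_at:
        return True    # data changed on the server since the last fetch
    return now - entry.fetched_at > max_age_s   # too old to trust
```

The maximum age acts as a safety net for cases where the timestamp comparison itself is based on stale information.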
POC
Logistics
B2B SaaS
Ocean Freight Search → Quote → Hold → Booking Platform
Metric: TBD (proof of concept)
Context: Logistics platform POC enabling freight search, instant quoting, hold/release, and booking confirmation. Integrated with carrier APIs and internal rate engines.
Approach: Search API with rate aggregation. Quote hold mechanism with expiry. Booking state machine with confirmation workflow.
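A booking state machine with hold expiry might look like the following Python sketch. The state names, event names, and lazy expiry check are assumptions for illustration; the POC's actual states are not documented here.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    QUOTED = "quoted"
    HELD = "held"
    BOOKED = "booked"
    RELEASED = "released"
    EXPIRED = "expired"


# Allowed transitions: (current state, event) -> next state. Hypothetical event names.
TRANSITIONS = {
    (State.QUOTED, "hold"): State.HELD,
    (State.HELD, "confirm"): State.BOOKED,
    (State.HELD, "release"): State.RELEASED,
}


class Booking:
    def __init__(self, hold_ttl_s: float) -> None:
        self.state = State.QUOTED
        self.hold_ttl_s = hold_ttl_s
        self.hold_expires_at: Optional[float] = None

    def apply(self, event: str, now: float) -> State:
        # Expire a stale hold lazily, before evaluating the incoming event.
        if (self.state is State.HELD and self.hold_expires_at is not None
                and now >= self.hold_expires_at):
            self.state = State.EXPIRED
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"event {event!r} not allowed in state {self.state.value}")
        if next_state is State.HELD:
            self.hold_expires_at = now + self.hold_ttl_s
        self.state = next_state
        return self.state
```

Keeping the transitions in a table makes illegal moves, such as confirming an expired hold, fail loudly instead of silently booking capacity that was already released.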
POC
Fintech / Investments
Platform Performance / Reliability
Hybrid Property Enrichment Pipeline with Caching & De-dupe
Metric: TBD (proof of concept)
Context: Loan submission platform POC enriching property data with external sources. Required caching to avoid redundant API calls and deduplication to prevent duplicate enrichment.
Approach: Two-tier caching (in-memory + Redis). Deduplication by property address hash. Async enrichment queue with priority handling.
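Deduplication by address hash depends on normalizing the address first, so that formatting variants map to the same cache key. The Python sketch below shows the idea with a single in-memory tier; the abbreviation table, the `EnrichmentCache` class, and the `fetch` callback are hypothetical, and real address normalization needs far more rules than this.

```python
import hashlib
import re
from typing import Callable, Dict

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"street": "st", "road": "rd", "avenue": "ave"}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, unify common suffixes."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in cleaned.split())


def address_key(raw: str) -> str:
    """Stable cache key derived from the normalized address."""
    return hashlib.sha256(normalize_address(raw).encode("utf-8")).hexdigest()


class EnrichmentCache:
    """Single in-memory tier; a shared cache would sit behind it in a two-tier setup."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch            # stands in for the third-party property API call
        self._store: Dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get(self, raw_address: str) -> dict:
        key = address_key(raw_address)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = self._fetch(normalize_address(raw_address))
        return self._store[key]
```

With this normalization, "12 Main Street, Springfield" and " 12 MAIN ST.  Springfield " produce the same key, so the second lookup is served from the cache.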
POC
Compliance / KYC
B2B SaaS
Config-driven Compliance Questionnaire Engine with Dynamic Exclusions + Risk Scoring
Metric: TBD (proof of concept)
Context: Compliance platform POC with dynamic questionnaires based on user responses. Required conditional logic, exclusions, and risk scoring without hardcoded rules.
Approach: JSON-based question config with conditional rendering. Dynamic exclusion rules. Server-side risk calculation engine.
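A config-driven questionnaire with exclusions and risk scoring can be illustrated in a few lines of Python. The config shape (`exclude_if`, `risk`), the question ids, and the weights are all hypothetical; the point is that visibility and scoring are derived from data rather than hardcoded branches.

```python
from typing import Dict, List

# Hypothetical config, shaped like JSON that could be stored and versioned.
QUESTIONS: List[dict] = [
    {"id": "is_pep", "risk": {"yes": 50, "no": 0}},
    {"id": "pep_role", "exclude_if": {"is_pep": "no"}, "risk": {}},
    {"id": "country_risk", "risk": {"high": 30, "standard": 0}},
]


def visible_questions(config: List[dict], answers: Dict[str, str]) -> List[str]:
    """Return ids of questions not excluded by the answers given so far."""
    return [
        question["id"]
        for question in config
        if not any(
            answers.get(other_id) == value
            for other_id, value in question.get("exclude_if", {}).items()
        )
    ]


def risk_score(config: List[dict], answers: Dict[str, str]) -> int:
    """Sum configured risk weights, counting only questions that are visible."""
    visible = set(visible_questions(config, answers))
    return sum(
        question.get("risk", {}).get(answers.get(question["id"]), 0)
        for question in config
        if question["id"] in visible
    )
```

Scoring only visible questions prevents a stale answer to a now-excluded question from leaking into the risk total.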
POC
AI Workflows
B2B SaaS
AI Itinerary Generation Engine from a 500+ Attraction Catalog
Metric: TBD (proof of concept)
Context: Travel platform POC generating personalized itineraries from a 500+ attraction catalog. Required natural language input, preference matching, and realistic time/distance constraints.
Approach: Embedding-based similarity search. LLM orchestration for day planning. Travel time optimization with external mapping APIs.
POC
AI Workflows
Commerce Integrations
AI Shopping Assistant for an NDIS-focused Magento Marketplace
Metric: TBD (proof of concept)
Context: Magento marketplace POC with conversational AI helping users find NDIS-compliant products. Required natural language understanding and compliance validation.
Approach: Product catalog vectorization. LLM-based intent classification. NDIS compliance rule engine integrated with conversational flow.
Let's Connect
Ready to Discuss Your Next Challenge?
I'm currently exploring Engineering Manager and Solutions Architect roles where I can drive delivery, protect reliability, and build high-performing teams. If you're hiring for backend leadership with a track record of production impact, let's talk.
Location: India (open to remote roles)