Data Engineer who delivers infrastructure, not just queries.
I design and maintain scalable pipelines, architect modern data platforms on AWS & GCP, and enforce rigorous governance — so Analytics teams make decisions with data they trust.
About
My edge is combining Software Engineering rigor with Data Engineering scalability. I lead global technical projects, architecting platforms to ensure resilient orchestration, high availability, and cost efficiency.
I believe Data Engineering is the foundation for modern AI. Recently, I led the modernization of an AI agent architecture for a Global SaaS Startup, where I reduced extraction latency by 75% by consolidating fragmented pipelines into a Single Structured-Output architecture with structured RAG.
- Engineering & Platform: Specialist in high-volume GCP and AWS ecosystems with Databricks.
- Orchestration: Airflow (MWAA/Composer) in Kubernetes/Docker at global scale.
- Processing: Scalable ETL/ELT pipelines with PySpark for Kafka ↔ BigQuery/Redshift integration.
- LLMOps & AI: Dynamic metadata injection (>15K tokens) for real-time entity resolution.
- Quality: Observability via fail-fast checks, with unit tests as a delivery standard.
João Saraiva
Data Engineer · Avdata Consulting
github.com/jsaraivx
Experience
Sep 2024 — Present
Data Engineer
Avdata Consultoria LTDA · Remote
- Accelerated pipeline deployment by 80% by developing a metadata-driven DAG generation system for Apache Airflow (MWAA/GKE), automating complex backfills across global data sources.
- Achieved zero data loss during streaming outages by designing a high-performance resilience layer using DuckDB and S3-compatible storage to recover 100% of events into the Data Warehouse.
- Reduced AI agent latency by 86% (6s → 0.8s) by optimizing production architectures with FastAPI and async Python for global SaaS clients.
- Stabilized cloud costs (FinOps) and unlocked analytical consumption by orchestrating Kafka ↔ BigQuery integration via custom DAG Generators.
- Enabled international metrics tracking by architecting the full data cycle — from RAW ingestion to structured Data Marts in BigQuery — unlocking executive-level analytics.
- Improved the accuracy of global leadership decisions by implementing Star Schema dimensional modeling and complex DAX calculations in Power BI.
Nov 2024 — Sep 2025
Software Developer (Freelance)
Tropical Bud · Portugal (Remote)
- Accelerated the European expansion sales cycle by developing custom Liquid scripts and advanced Shopify platform integrations.
- Centralized commercial KPI visibility by connecting the store backend to data engineering pipelines for performance dashboard creation.
Aug 2023 — Jul 2024
Co-Founder | Software Engineer
Ultraform Supplements
- Enabled the brand launch by developing end-to-end tech infrastructure, from database modeling to logistics flow integration via the Bling ERP.
- Centralized multi-marketplace inventory management by fully deploying and parameterizing the Bling ERP across multiple seller accounts.
Nov 2021 — May 2024
Software & Data Engineer
Independent Consulting · Freelance
- Reduced manual workload for a team of 100+ salespeople by building scalable Python bots for marketplace ad management via REST APIs.
- Structured market intelligence databases from non-standardized sources by developing robust Web Scraping routines and log sanitization pipelines.
01 / Cloud
Cloud Platforms
02 / Orchestration
Orchestration & Streaming
03 / Compute
Python & Data
04 / AI
AI Engineering
05 / BI
Visualization & BI
06 / Software
Software Engineering
07 / Architecture
Architecture & Modeling
08 / Infra
Infrastructure
Selected Projects
VectorHire
Intelligent ATS with RAG Pipeline
End-to-end recruitment system that matches resumes against job postings using Retrieval-Augmented Generation. Extracts text from PDFs with PyMuPDF, applies Semantic Chunking, vectorizes locally with sentence-transformers, and runs similarity search via PostgreSQL + pgvector. Gemini 2.5 Flash generates structured assessments (Pydantic). Async API with FastAPI, persistence via SQLAlchemy (Repository Pattern).
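The similarity step can be illustrated in plain Python. This is a minimal sketch, not the project's code: in VectorHire the search runs inside PostgreSQL via pgvector, and the names `cosine_similarity` and `rank_chunks`, plus the toy two-dimensional vectors, are hypothetical stand-ins for real sentence-transformers embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    job_vec: list[float],
    chunks: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, most similar first."""
    scored = [(cid, cosine_similarity(job_vec, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical embeddings: one job posting and three resume chunks.
    resume_chunks = {"skills": [1.0, 0.0], "hobbies": [0.0, 1.0], "summary": [1.0, 1.0]}
    print(rank_chunks([1.0, 0.0], resume_chunks, top_k=2))
```

A database-side search gives the same ordering without pulling vectors into the application, which is presumably why the project delegates this step to pgvector.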
Repository
AI Architecture & LLMOps
Modernization & Structured Output
Re-engineered an AI agent ecosystem focused on model consolidation for performance gains and tech debt reduction. Migrated sequential pipelines to a Single Structured-Output architecture, reducing "cold-cache" latency by 75%. Implemented Structured RAG with dynamic metadata injection in context windows exceeding 15K tokens and strict JSON Schema validations.
Professional Project
Global Data Orchestration
Resilience at Global Scale
Implemented data resilience for a Global SaaS Startup. Designed an automated backfill system using Apache Airflow on Kubernetes (GKE), reducing manual technical interventions by 80%. Established a batch reconciliation mesh as a contingency for failures in the continuous Kafka flow, ensuring 100% data integrity.
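One way to picture batch reconciliation is a per-partition count comparison between the source and the warehouse. The sketch below is an assumption about the general technique, not the production design: `partitions_to_backfill` and the day-keyed count dictionaries are hypothetical.

```python
def partitions_to_backfill(
    source_counts: dict[str, int],
    warehouse_counts: dict[str, int],
) -> list[str]:
    """Return the partitions (e.g. ISO dates) whose row counts disagree.

    A partition missing from the warehouse is treated as having 0 rows.
    """
    return sorted(
        day
        for day, expected in source_counts.items()
        if warehouse_counts.get(day, 0) != expected
    )


if __name__ == "__main__":
    # Hypothetical counts: the second day never reached the warehouse.
    source = {"2025-01-01": 10, "2025-01-02": 5}
    warehouse = {"2025-01-01": 10}
    print(partitions_to_backfill(source, warehouse))
```

The resulting list could then drive a backfill run for only the affected partitions.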
Professional Project
Stream-Guard-Kafka
Real-time Fraud Detection
Kappa Architecture pipeline with hybrid processing: ksqlDB applies Tumbling Windows for high-speed transactional anomaly detection, while Python processes complex business rules (blacklist, context). Kafka 3.6 in KRaft mode (no ZooKeeper), realistic synthetic data via Faker, typed contracts with Pydantic, and PostgreSQL persistence via SQLAlchemy. Fully orchestrated with Docker Compose.
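A tumbling window assigns each event to exactly one fixed-size, non-overlapping time bucket. The sketch below shows that bucketing in plain Python rather than ksqlDB; the function names, the 60-second window, and the threshold of 3 transactions are illustrative assumptions, not the project's actual rules.

```python
from collections import Counter


def tumbling_window_counts(
    events: list[tuple[str, int]],
    window_seconds: int,
) -> Counter:
    """Count events per (account_id, window_start).

    Each event is (account_id, epoch_seconds). The window start is the
    timestamp rounded down to a multiple of window_seconds, so windows
    never overlap.
    """
    counts: Counter = Counter()
    for account_id, ts in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(account_id, window_start)] += 1
    return counts


def flag_anomalies(
    events: list[tuple[str, int]],
    window_seconds: int = 60,
    max_per_window: int = 3,
) -> list[tuple[str, int]]:
    """Return the (account_id, window_start) pairs that exceed the threshold."""
    counts = tumbling_window_counts(events, window_seconds)
    return sorted(key for key, n in counts.items() if n > max_per_window)


if __name__ == "__main__":
    # Hypothetical transactions: acc1 fires four times inside the first minute.
    txs = [("acc1", 0), ("acc1", 10), ("acc1", 20), ("acc1", 59), ("acc1", 60), ("acc2", 5)]
    print(flag_anomalies(txs))
```

Note that the event at second 60 falls into the next window, which is the defining difference from a sliding window.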
Repository
BQ Schema Migrator
BigQuery Schema Governance
Migration tool inspired by Flyway/Liquibase, purpose-built for BigQuery. Each SQL script runs exactly once with idempotency control via SHA256 checksums and a control table. Scripts with @scheduled headers are auto-deployed as Scheduled Queries via the Data Transfer API. Supports placeholders (${PROJECT}, ${DATASET}), credential auto-discovery, and deterministic testing with Pytest.
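The run-once behavior can be sketched with a checksum per script and a control record. This is an illustration under stated assumptions, not the tool's implementation: a plain dict stands in for the BigQuery control table, `execute` is a hypothetical callback in place of a real query client, and raising on a changed checksum is borrowed from how Flyway-style tools typically behave.

```python
import hashlib
from string import Template
from typing import Callable


def checksum(sql: str) -> str:
    """SHA256 hex digest of the raw script text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def render(sql: str, placeholders: dict[str, str]) -> str:
    """Fill ${NAME} placeholders; raises KeyError if one is undefined."""
    return Template(sql).substitute(placeholders)


def apply_migrations(
    scripts: dict[str, str],
    control: dict[str, str],
    execute: Callable[[str], None],
    placeholders: dict[str, str],
) -> list[str]:
    """Run each script at most once, in name order; return the names applied now."""
    applied = []
    for name in sorted(scripts):
        digest = checksum(scripts[name])
        if name in control:
            # Assumption: an edited, already-applied script is an error.
            if control[name] != digest:
                raise RuntimeError(f"{name} changed after it was applied")
            continue
        execute(render(scripts[name], placeholders))
        control[name] = digest  # record only after a successful run
        applied.append(name)
    return applied


if __name__ == "__main__":
    executed: list[str] = []
    control_table: dict[str, str] = {}
    scripts = {"V1__create.sql": "CREATE TABLE `${PROJECT}.${DATASET}.t` (id INT64)"}
    values = {"PROJECT": "demo-project", "DATASET": "demo_ds"}
    print(apply_migrations(scripts, control_table, executed.append, values))
    print(apply_migrations(scripts, control_table, executed.append, values))
```

The checksum is taken over the unrendered script, so the same file stays recognizable across projects and datasets.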
Repository
What-Price
Crypto Market ELT Pipeline
End-to-end pipeline that extracts market data for the top 20 cryptocurrencies from the CoinGecko API every hour. Orchestrated with Apache Airflow 3 (TaskFlow API + Astro), it loads into Serverless PostgreSQL (Neon) with idempotent ingestion and exposes metrics in a Streamlit dashboard with Plotly: market dominance, performance index, 24h volatility, and ATH distance. Bronze/Silver architecture inspired by Medallion.
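Two of the dashboard metrics have common textbook definitions, sketched below. The formulas are assumptions about how such metrics are usually computed, not taken from the project: ATH distance as the percentage gap to the all-time high, and dominance as each coin's share of the tracked coins' combined market cap.

```python
def ath_distance_pct(price: float, ath: float) -> float:
    """Percentage distance from the all-time high (negative when below it)."""
    if ath <= 0:
        raise ValueError("ath must be positive")
    return (price - ath) / ath * 100.0


def market_dominance_pct(market_caps: dict[str, float]) -> dict[str, float]:
    """Each coin's share (in %) of the summed market cap of the tracked coins."""
    total = sum(market_caps.values())
    if total <= 0:
        raise ValueError("total market cap must be positive")
    return {coin: cap / total * 100.0 for coin, cap in market_caps.items()}


if __name__ == "__main__":
    # Hypothetical figures, not real market data.
    print(ath_distance_pct(price=50.0, ath=100.0))
    print(market_dominance_pct({"coin_a": 75.0, "coin_b": 25.0}))
```

With only the top 20 coins in scope, dominance here is relative to that basket rather than to the whole market.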
Repository
Certifications & Education
Google Cloud Computing Foundations
Google · Nov 2025
Data Engineering with dbt
LinkedIn Learning · Mar 2026
Databricks Fundamentals
Academy Accreditation · Feb 2026
Introduction to Apache Kafka
DataCamp · Feb 2026
Databricks for Data Engineering
Databricks · Dec 2025
Data Engineer Path
Borderless · Nov 2025
HackerRank SQL (Intermediate)
HackerRank · Oct 2025
Software Engineering — Bachelor's
UNOPAR · Expected Jun/2027
Claude Code in Action
Anthropic Academy · 2026
Intro to Machine Learning with Python
USP / ESALQ · Mar 2026