Available for projects

Data Engineer who delivers infrastructure, not just queries.

I design and maintain scalable pipelines, architect modern data platforms on AWS and GCP, and enforce rigorous governance, so analytics teams can make decisions with data they trust.

About

My edge is combining Software Engineering rigor with Data Engineering scalability. I lead global technical projects, architecting platforms to ensure resilient orchestration, high availability, and cost efficiency.

I believe Data Engineering is the foundation for modern AI. Recently, I led the modernization of an AI agent architecture for a Global SaaS Startup, where I reduced extraction latency by 75% by consolidating fragmented pipelines into a Single Structured-Output architecture with structured RAG.

Engineering & Platform: Specialist in high-volume GCP and AWS ecosystems with Databricks.
Orchestration: Airflow (MWAA/Composer) in Kubernetes/Docker at global scale.
Processing: Scalable ETL/ELT pipelines with PySpark for Kafka ↔ BigQuery/Redshift integration.
LLMOps & AI: Dynamic metadata injection (>15K tokens) for real-time entity resolution.
Quality: Observability via fail-fast checks, with unit tests as a delivery standard.

github.com/jsaraivx

Experience

Sep 2024 — Present

Data Engineer

Avdata Consultoria LTDA · Remote

Accelerated pipeline deployment by 80% by developing a metadata-driven DAG generation system for Apache Airflow (MWAA/GKE), automating complex backfills across global data sources.
Achieved zero data loss during streaming outages by designing a high-performance resilience layer using DuckDB and S3-compatible storage to recover 100% of events into the Data Warehouse.
Reduced AI agent latency by about 87% (6 s → 0.8 s) by optimizing production architectures with FastAPI and async Python for global SaaS clients.
Stabilized cloud costs (FinOps) and unlocked analytical consumption by orchestrating Kafka ↔ BigQuery integration via custom DAG Generators.
Enabled international metrics tracking by architecting the full data cycle — from RAW ingestion to structured Data Marts in BigQuery — unlocking executive-level analytics.
Increased global leadership decision accuracy by implementing Star Schema dimensional modeling and complex DAX calculations in Power BI.
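The metadata-driven DAG generation described in the first bullet above can be pictured with a small sketch. This is not the production system: it uses only the Python standard library rather than Airflow, and the `PIPELINES` metadata, its field names, and the `build_plan` helper are all hypothetical. The point it tries to show is that a declarative config, rather than hand-written code per pipeline, decides which tasks exist, in what order they run, and which dates a backfill covers.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Hypothetical pipeline metadata; a real setup might keep this in YAML or a table.
# Each task maps to the list of tasks that must finish before it.
PIPELINES = {
    "orders": {
        "tasks": {"extract": [], "load": ["extract"], "validate": ["load"]},
        "backfill_start": date(2024, 1, 1),
        "backfill_end": date(2024, 1, 3),
    },
}


def daily_range(start: date, end: date) -> list[date]:
    """Return every date from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def build_plan(name: str, spec: dict) -> list[tuple[str, str]]:
    """Expand one metadata entry into ordered (run_date, task_id) pairs."""
    # static_order() yields each task only after all of its upstream tasks.
    order = list(TopologicalSorter(spec["tasks"]).static_order())
    plan = []
    for run_date in daily_range(spec["backfill_start"], spec["backfill_end"]):
        for task in order:
            plan.append((run_date.isoformat(), f"{name}.{task}"))
    return plan


plan = build_plan("orders", PIPELINES["orders"])
```

Adding a pipeline or widening a backfill then means editing metadata only; in an Airflow deployment the same loop would emit DAG and task objects instead of tuples.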

Nov 2024 — Sep 2025

Software Developer (Freelance)

Tropical Bud · Portugal (Remote)

Accelerated the European expansion sales cycle by developing custom Liquid scripts and advanced Shopify platform integrations.
Centralized commercial KPI visibility by connecting the store backend to data engineering pipelines for performance dashboard creation.

Aug 2023 — Jul 2024

Co-Founder | Software Engineer

Ultraform Supplements

Enabled the brand launch by developing end-to-end tech infrastructure, from database modeling to logistics flow integration via the Bling ERP.
Centralized multi-marketplace inventory management by deploying and configuring the Bling ERP across multiple seller accounts.

Nov 2021 — May 2024

Software & Data Engineer

Independent Consulting · Freelance

Reduced manual workload for a sales team of 100+ people by building scalable Python bots that manage marketplace ads via REST APIs.
Structured market intelligence databases from non-standardized sources by developing robust Web Scraping routines and log sanitization pipelines.

01 / Cloud: Cloud Platforms
02 / Orchestration: Orchestration & Streaming
03 / Compute: Python & Data
04 / AI: AI Engineering
05 / BI: Visualization & BI
06 / Software: Software Engineering
07 / Architecture: Architecture & Modeling
08 / Infra: Infrastructure

Selected Projects

VectorHire

Intelligent ATS with RAG Pipeline

End-to-end recruitment system that matches resumes against job postings using Retrieval-Augmented Generation. It extracts PDF text with PyMuPDF, applies semantic chunking, embeds locally with sentence-transformers, and runs similarity search via PostgreSQL + pgvector. Gemini 2.5 Flash generates structured assessments (Pydantic). Async API with FastAPI, persistence via SQLAlchemy (Repository Pattern).
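As a rough illustration of the retrieval step, the sketch below ranks resume chunks against a job posting by cosine similarity. It is a pure-Python stand-in, not the project's code: the real system delegates this to pgvector, the description does not say which distance metric is used (cosine is an assumption here), and the three-dimensional vectors and chunk names are made up in place of sentence-transformers embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), chunk_id) for chunk_id, vec in chunks.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical values).
RESUME_CHUNKS = {
    "python_airflow": [0.9, 0.1, 0.0],
    "sales_experience": [0.0, 0.2, 0.9],
    "spark_etl": [0.8, 0.3, 0.1],
}
job_posting = [1.0, 0.2, 0.0]

matches = top_k(job_posting, RESUME_CHUNKS, k=2)
print(matches)
```

The retrieved chunks would then be passed to the language model as context for the structured assessment.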

Repository

AI Architecture & LLMOps

Modernization & Structured Output

Re-engineered an AI agent ecosystem focused on model consolidation for performance gains and tech debt reduction. Migrated sequential pipelines to a Single Structured-Output architecture, reducing "cold-cache" latency by 75%. Implemented Structured RAG with dynamic metadata injection in context windows exceeding 15K tokens and strict JSON Schema validations.
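The strict validation of structured output can be sketched as follows. This is a hand-rolled check using only the standard library, not a JSON Schema implementation and not the project's actual contract; the `EXTRACTION_FIELDS` mapping and the `parse_extraction` helper are hypothetical. It shows the general shape: a single model response is parsed once and rejected outright if any key is missing, unexpected, or of the wrong type.

```python
import json

# Hypothetical contract for one extraction result: field name -> required type.
EXTRACTION_FIELDS = {"entity": str, "category": str, "confidence": float}


def parse_extraction(raw: str) -> dict:
    """Parse model output and raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # invalid JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    mismatched = set(data) ^ set(EXTRACTION_FIELDS)
    if mismatched:
        raise ValueError(f"missing or unexpected keys: {sorted(mismatched)}")
    for key, expected_type in EXTRACTION_FIELDS.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} must be {expected_type.__name__}")
    return data


result = parse_extraction(
    '{"entity": "Acme Corp", "category": "customer", "confidence": 0.93}'
)
```

Failing fast at this boundary keeps malformed responses out of downstream steps; a production setup would more likely rely on a schema library than on a manual check like this one.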

Professional Project

Global Data Orchestration

Resilience at Global Scale

Implemented data resilience for a Global SaaS Startup. Designed an automated backfill system using Apache Airflow on Kubernetes (GKE), reducing manual technical interventions by 80%. Established a batch reconciliation mesh as contingency for continuous Kafka flow failures, ensuring 100% data integrity.
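The batch reconciliation idea can be reduced to a set difference: compare what the source emitted with what reached the warehouse, then reload only the gap. The sketch below assumes events carry a unique id and uses in-memory dictionaries in place of the real source and warehouse; `find_missing_events` and `reconcile` are hypothetical names, not the project's API.

```python
def find_missing_events(source_ids: set[str], warehouse_ids: set[str]) -> list[str]:
    """Return ids present at the source but absent from the warehouse."""
    return sorted(source_ids - warehouse_ids)


def reconcile(source_events: dict[str, dict], warehouse: dict[str, dict]) -> int:
    """Copy missing events into the warehouse and return how many were recovered."""
    missing = find_missing_events(set(source_events), set(warehouse))
    for event_id in missing:
        warehouse[event_id] = source_events[event_id]
    return len(missing)


# Toy data: the stream delivered only e1 before an outage (made-up values).
SOURCE = {"e1": {"amount": 10}, "e2": {"amount": 20}, "e3": {"amount": 30}}
warehouse = {"e1": {"amount": 10}}

recovered = reconcile(SOURCE, warehouse)
recovered_again = reconcile(SOURCE, warehouse)  # a second pass finds nothing to do
```

Because the job only loads the difference, it can run on a schedule after every outage without creating duplicates.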

Professional Project

Stream-Guard-Kafka

Real-time Fraud Detection

Kappa Architecture pipeline with hybrid processing: ksqlDB applies Tumbling Windows for high-speed transactional anomaly detection, while Python processes complex business rules (blacklist, context). Kafka 3.6 in KRaft mode (no ZooKeeper), realistic synthetic data via Faker, typed contracts with Pydantic, and PostgreSQL persistence via SQLAlchemy. Fully orchestrated with Docker Compose.
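To make the tumbling-window idea concrete, here is a minimal pure-Python sketch. The project itself runs this logic in ksqlDB; the window size, the threshold, the rule "too many transactions per card per window", and the sample data below are all assumptions chosen for illustration. A tumbling window is a fixed-size, non-overlapping time bucket, so each event belongs to exactly one window.

```python
from collections import Counter

WINDOW_SECONDS = 60      # assumed window size
MAX_TX_PER_WINDOW = 3    # assumed threshold


def window_start(timestamp: int, size: int = WINDOW_SECONDS) -> int:
    """Map a timestamp (in seconds) to the start of its tumbling window."""
    return timestamp - (timestamp % size)


def flag_bursts(
    transactions: list[tuple[str, int]],
    size: int = WINDOW_SECONDS,
    limit: int = MAX_TX_PER_WINDOW,
) -> list[tuple[str, int]]:
    """Return (card, window_start) pairs with more than `limit` transactions."""
    counts = Counter((card, window_start(ts, size)) for card, ts in transactions)
    return sorted(key for key, n in counts.items() if n > limit)


# (card_id, timestamp_in_seconds); synthetic values for the example.
TRANSACTIONS = [
    ("card_a", 5), ("card_a", 20), ("card_a", 41), ("card_a", 59),  # window [0, 60)
    ("card_a", 61),                                                 # window [60, 120)
    ("card_b", 10), ("card_b", 70),
]

alerts = flag_bursts(TRANSACTIONS)
print(alerts)
```

Only `card_a` in the first window exceeds the threshold; its fifth transaction falls into the next window and is counted separately, which is the defining property of tumbling windows.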

Repository

BQ Schema Migrator

BigQuery Schema Governance

Migration tool inspired by Flyway/Liquibase, purpose-built for BigQuery. Each SQL script runs exactly once with idempotency control via SHA256 checksums and a control table. Scripts with @scheduled headers are auto-deployed as Scheduled Queries via the Data Transfer API. Supports placeholders (${PROJECT}, ${DATASET}), credential auto-discovery, and deterministic testing with Pytest.
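The run-once behavior can be sketched with the standard library alone. This is an illustration of the checksum and control-table idea, not the tool's code: a dictionary stands in for the BigQuery control table, `run_sql` is a caller-supplied function, the script name is invented, and raising on a changed checksum is an assumption borrowed from how Flyway-style tools usually behave.

```python
import hashlib
from string import Template
from typing import Callable


def checksum(sql: str) -> str:
    """SHA256 hex digest of the raw script text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def render(sql: str, project: str, dataset: str) -> str:
    """Fill ${PROJECT} and ${DATASET}; string.Template uses the same ${NAME} syntax."""
    return Template(sql).safe_substitute(PROJECT=project, DATASET=dataset)


def apply_migrations(
    scripts: dict[str, str],
    control: dict[str, str],
    run_sql: Callable[[str], None],
    project: str,
    dataset: str,
) -> list[str]:
    """Run each script at most once, in name order; return the names applied."""
    applied = []
    for name in sorted(scripts):
        digest = checksum(scripts[name])
        if name in control:
            # Assumption: an edited, already-applied script is treated as an error.
            if control[name] != digest:
                raise ValueError(f"{name} changed after it was applied")
            continue
        run_sql(render(scripts[name], project, dataset))
        control[name] = digest
        applied.append(name)
    return applied


# Hypothetical script and identifiers for the example.
SCRIPTS = {"V001__create_events.sql": "CREATE TABLE `${PROJECT}.${DATASET}.events` (id STRING)"}
executed: list[str] = []
control_table: dict[str, str] = {}

first = apply_migrations(SCRIPTS, control_table, executed.append, "demo-project", "analytics")
second = apply_migrations(SCRIPTS, control_table, executed.append, "demo-project", "analytics")
```

The second call applies nothing because the script's checksum is already recorded, which is what makes reruns safe.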

Repository

What-Price

Crypto Market ELT Pipeline

End-to-end pipeline that extracts hourly market data for the top 20 cryptocurrencies via the CoinGecko API. Orchestrated with Apache Airflow 3 (TaskFlow API + Astro), it loads into serverless PostgreSQL (Neon) with idempotent ingestion and exposes metrics in a Streamlit dashboard with Plotly: market dominance, performance index, 24h volatility, and ATH distance. Bronze/Silver architecture inspired by Medallion.
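Idempotent ingestion usually comes down to an upsert keyed on what makes a row unique. The sketch below assumes that key is (coin, hour) and uses the standard library's `sqlite3` in place of PostgreSQL so it runs anywhere; the table name, columns, and prices are invented for the example. SQLite and PostgreSQL share the `ON CONFLICT ... DO UPDATE` form used here.

```python
import sqlite3


def ingest(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    """Insert rows, overwriting any existing row for the same coin and hour."""
    conn.executemany(
        """
        INSERT INTO prices (coin_id, fetched_hour, price_usd)
        VALUES (?, ?, ?)
        ON CONFLICT (coin_id, fetched_hour)
        DO UPDATE SET price_usd = excluded.price_usd
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prices ("
    "coin_id TEXT, fetched_hour TEXT, price_usd REAL, "
    "PRIMARY KEY (coin_id, fetched_hour))"
)

# Made-up placeholder values, not real market data.
batch = [("coin_a", "2025-01-01T10:00", 100.0), ("coin_b", "2025-01-01T10:00", 10.0)]

ingest(conn, batch)
ingest(conn, batch)  # simulated retry of the same hourly run

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

A retried or re-triggered hourly run rewrites the same rows instead of appending duplicates, so the table stays at one row per coin per hour.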

Repository

Certifications & Education

Google Cloud Computing Foundations

Google · Nov 2025

Data Engineering with dbt

LinkedIn Learning · Mar 2026

Databricks Fundamentals

Academy Accreditation · Feb 2026

Introduction to Apache Kafka

DataCamp · Feb 2026

Databricks for Data Engineering

Databricks · Dec 2025

Data Engineer Path

Borderless · Nov 2025

HackerRank SQL (Intermediate)

HackerRank · Oct 2025

Software Engineering — Bachelor's

UNOPAR · Expected Jun/2027

Claude Code in Action

Anthropic Academy · 2026

Intro to Machine Learning with Python

USP / ESALQ · Mar 2026

João Saraiva
