Available for projects

Data Engineer who delivers infrastructure, not just queries.

I design and maintain scalable pipelines, architect modern data platforms on AWS & GCP, and enforce rigorous governance, so analytics teams can make decisions with data they trust.

About

My edge is combining Software Engineering rigor with Data Engineering scalability. I lead global technical projects, architecting platforms to ensure resilient orchestration, high availability, and cost efficiency.

I believe Data Engineering is the foundation for modern AI. Recently, I led the modernization of an AI agent architecture for a Global SaaS Startup, where I reduced extraction latency by 75% by consolidating fragmented pipelines into a Single Structured-Output architecture with structured RAG.

  • Engineering & Platform: Specialist in high-volume GCP and AWS ecosystems with Databricks.
  • Orchestration: Airflow (MWAA/Composer) on Kubernetes/Docker at global scale.
  • Processing: Scalable ETL/ELT pipelines with PySpark for Kafka ↔ BigQuery/Redshift integration.
  • LLMOps & AI: Dynamic metadata injection (>15K tokens) for real-time entity resolution.
  • Quality: Observability via fail-fast checks, with unit tests as the delivery standard.
João Saraiva

Data Engineer · Avdata Consulting

4+ years bridging software and data architectures
GitHub contributions

github.com/jsaraivx

Experience

Sep 2024 — Present

Data Engineer

Avdata Consultoria LTDA · Remote

  • Accelerated pipeline deployment by 80% by developing a metadata-driven DAG generation system for Apache Airflow (MWAA/GKE), automating complex backfills across global data sources.
  • Achieved zero data loss during streaming outages by designing a high-performance resilience layer using DuckDB and S3-compatible storage to recover 100% of events into the Data Warehouse.
  • Reduced AI agent latency by 86% (6s → 0.8s) by optimizing production architectures with FastAPI and async Python for global SaaS clients.
  • Stabilized cloud costs (FinOps) and unlocked analytical consumption by orchestrating Kafka ↔ BigQuery integration via custom DAG Generators.
  • Enabled international metrics tracking by architecting the full data cycle — from RAW ingestion to structured Data Marts in BigQuery — unlocking executive-level analytics.
  • Increased global leadership decision accuracy by implementing Star Schema dimensional modeling and complex DAX calculations in Power BI.
GCP · AWS · PySpark · Databricks · Airflow · Kubernetes · Kafka · Python · BigQuery · Power BI · DAX

Nov 2024 — Sep 2025

Software Developer (Freelance)

Tropical Bud · Portugal (Remote)

  • Accelerated the European expansion sales cycle by developing custom Liquid scripts and advanced Shopify platform integrations.
  • Centralized commercial KPI visibility by connecting the store backend to data engineering pipelines for performance dashboard creation.
Shopify · Liquid · JavaScript · HTML · Analytics

Aug 2023 — Jul 2024

Co-Founder | Software Engineer

Ultraform Supplements

  • Enabled brand launch by developing end-to-end tech infrastructure — from database modeling to logistics flow integration via ERP Bling.
  • Centralized multi-marketplace inventory management by fully deploying and parameterizing the ERP Bling system across multiple seller accounts.
Systems Architecture · APIs · Shopify · ERP Bling · Power BI

Nov 2021 — May 2024

Software & Data Engineer

Independent Consulting · Freelance

  • Reduced manual workload for a 100+ salesperson team by building scalable Python bots for marketplace ad management via REST APIs.
  • Structured market intelligence databases from non-standardized sources by developing robust Web Scraping routines and log sanitization pipelines.
Python · REST APIs · Web Scraping · SQL · Full-stack

01 / Cloud

Cloud Platforms

Google BigQuery · Google Cloud Storage (GCS) · Google Cloud Composer · Google Pub/Sub · Google Dataflow · Google Data Transfer API · AWS Glue · Amazon S3 · Amazon Redshift · Amazon Athena · AWS Lambda · Amazon Kinesis · AWS Step Functions · Amazon MWAA (Airflow)

02 / Orchestration

Orchestration & Streaming

Apache Airflow (Astro) · Apache Kafka (KRaft) · ksqlDB · Docker · Docker Compose

03 / Compute

Python & Data

Python 3.10+ · Advanced SQL · Pandas · dbt (Data Build Tool) · Apache PySpark · Databricks · FastAPI

04 / AI

AI Engineering

RAG Pipelines · pgvector · Google Gemini API · sentence-transformers · Semantic Chunking · PyMuPDF

05 / BI

Visualization & BI

Microsoft Power BI (DAX) · Dimensional Modeling · Workspace Governance · Google Looker · Streamlit · Plotly

06 / Software

Software Engineering

Git & GitHub · CI/CD (GitHub Actions) · Pytest · Pydantic · SQLAlchemy · HashiCorp Terraform (IaC) · Scrum / Kanban

07 / Architecture

Architecture & Modeling

Medallion (Bronze/Silver/Gold) · Star & Snowflake Schema · Data Warehouse · Data Lakehouse · Kappa Architecture · Data Mesh (concepts) · FinOps

08 / Infra

Infrastructure

Docker · Kubernetes (AWS EKS / Google GKE) · Helm · CI/CD (GitHub Actions) · HashiCorp Terraform

Selected Projects

VectorHire

Intelligent ATS with RAG Pipeline

End-to-end recruitment system that matches resumes against job postings using Retrieval-Augmented Generation. It extracts text from PDFs with PyMuPDF, applies Semantic Chunking, embeds the chunks locally with sentence-transformers, and runs similarity search with PostgreSQL + pgvector. Gemini 2.5 Flash generates structured assessments validated with Pydantic. Async API with FastAPI, persistence via SQLAlchemy (Repository Pattern).

FastAPI · pgvector · RAG · Gemini API · sentence-transformers · PyMuPDF · Pydantic · SQLAlchemy
Repository
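
The retrieval step described above can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking metric and uses tiny hand-written vectors in place of real sentence-transformers embeddings; the function names are hypothetical and are not taken from the repository. In the project itself this ranking would be delegated to pgvector inside PostgreSQL rather than done in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunks, top_k=3):
    """Rank (text, vector) pairs by similarity to the query vector, best first.

    `chunks` stands in for resume chunks that were already embedded;
    this helper is illustrative only.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `rank_chunks(job_vector, resume_chunks, top_k=5)` would return the five chunks closest to the job posting, which is the context a generation step could then receive.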

AI Architecture & LLMOps

Modernization & Structured Output

Re-engineered an AI agent ecosystem, consolidating models to improve performance and reduce tech debt. Migrated sequential pipelines to a Single Structured-Output architecture, reducing "cold-cache" latency by 75%. Implemented Structured RAG with dynamic metadata injection into context windows exceeding 15K tokens and strict JSON Schema validation.

LLMOps · RAG · Function Calling · JSON Schema · Docker · Claude Code
Professional Project
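
To illustrate what strict validation of a single structured output can look like, here is a minimal standard-library sketch. The schema, field names, and `parse_structured_output` helper are hypothetical and only stand in for the JSON Schema validation mentioned above; the production setup is not described in enough detail to reproduce.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
EXTRACTION_SCHEMA = {"entity": str, "category": str, "confidence": float}


def parse_structured_output(raw, schema=EXTRACTION_SCHEMA):
    """Parse a model response and reject anything that deviates from the schema.

    Raises ValueError on invalid JSON (json.JSONDecodeError is a subclass),
    on missing or unexpected fields, and on wrong field types.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = schema.keys() - data.keys()
    unexpected = data.keys() - schema.keys()
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
    for name, expected in schema.items():
        value = data[name]
        # JSON has a single number type, so accept plain ints for float fields.
        # type() is used on purpose so that booleans are not treated as ints.
        if expected is float and type(value) is int:
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"field {name!r} should be {expected.__name__}")
        data[name] = value
    return data
```

Failing fast on any deviation keeps malformed responses out of downstream steps, at the cost of needing a retry or fallback path when validation fails.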

Global Data Orchestration

Resilience at Global Scale

Implemented data resilience for a Global SaaS Startup. Designed an automated backfill system using Apache Airflow on Kubernetes (GKE), reducing manual technical interventions by 80%. Established a batch reconciliation mesh as a contingency for failures in the continuous Kafka flow, ensuring 100% data integrity.

Apache Airflow · GCP · Kubernetes (GKE) · Kafka · Data Resilience · Backfill
Professional Project
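
One way to picture batch reconciliation feeding an automated backfill is the sketch below. It assumes per-window event counts are available for both the source and the warehouse, and it merges adjacent gaps into ranges that a scheduler could re-run. The function name, the hourly window, and the count-based comparison are all assumptions made for illustration, not the project's actual design.

```python
from datetime import datetime, timedelta


def find_backfill_windows(source_counts, warehouse_counts, window=timedelta(hours=1)):
    """Return merged (start, end) ranges where the warehouse holds fewer events than the source.

    Both arguments map a window start (datetime) to an event count.
    Windows absent from the warehouse are treated as holding zero events.
    """
    gaps = sorted(
        start
        for start, expected in source_counts.items()
        if warehouse_counts.get(start, 0) < expected
    )
    merged = []
    for start in gaps:
        if merged and merged[-1][1] == start:
            # This gap begins where the previous range ends, so extend that range.
            merged[-1] = (merged[-1][0], start + window)
        else:
            merged.append((start, start + window))
    return merged
```

Merging contiguous gaps keeps the number of backfill runs small after a long outage, since one run can cover several consecutive windows.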

Stream-Guard-Kafka

Real-time Fraud Detection

Kappa Architecture pipeline with hybrid processing: ksqlDB applies Tumbling Windows for high-speed transactional anomaly detection, while Python processes complex business rules (blacklist, context). Kafka 3.6 in KRaft mode (no ZooKeeper), realistic synthetic data via Faker, typed contracts with Pydantic, and PostgreSQL persistence via SQLAlchemy. Fully orchestrated with Docker Compose.

Kafka 3.6 (KRaft) · ksqlDB · Python · PostgreSQL · Docker Compose · Pydantic · SQLAlchemy · Faker
Repository
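
The tumbling-window idea can be sketched in plain Python, independent of ksqlDB. Events are grouped into fixed, non-overlapping windows, and any account exceeding a threshold within one window is flagged. The 60-second window, the threshold, and the `flag_bursts` name are assumptions for illustration; the project's real rules live in ksqlDB and are not reproduced here.

```python
from collections import Counter


def flag_bursts(events, window_seconds=60, max_per_window=3):
    """Flag (account_id, window_start) pairs with more than max_per_window events.

    `events` is an iterable of (account_id, epoch_seconds) tuples. Each event
    falls into exactly one window, which is what makes the windows "tumbling".
    """
    counts = Counter()
    for account_id, ts in events:
        # Align the timestamp down to the start of its fixed-size window.
        window_start = ts - (ts % window_seconds)
        counts[(account_id, window_start)] += 1
    return sorted(key for key, n in counts.items() if n > max_per_window)
```

Because windows never overlap, a burst that straddles a window boundary can be split across two windows and missed, which is one reason to pair this check with slower, context-aware rules.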

BQ Schema Migrator

BigQuery Schema Governance

Migration tool inspired by Flyway/Liquibase, purpose-built for BigQuery. Each SQL script runs exactly once with idempotency control via SHA256 checksums and a control table. Scripts with @scheduled headers are auto-deployed as Scheduled Queries via the Data Transfer API. Supports placeholders (${PROJECT}, ${DATASET}), credential auto-discovery, and deterministic testing with Pytest.

GCP · BigQuery · Data Transfer API · Python · Pytest · SHA256 · CI/CD
Repository
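
A minimal sketch of checksum-based run-once planning follows. It assumes the control table can be represented as a mapping from script name to the SHA256 recorded when the script was applied, and that an already-applied script whose content later changed should raise an error (the Flyway-style behavior; the tool may handle this case differently). All function names are hypothetical, and `string.Template` is used here only because it happens to share the `${NAME}` placeholder syntax.

```python
import hashlib
from string import Template


def render(sql, params):
    """Fill ${PROJECT}-style placeholders; raises KeyError if one is missing."""
    return Template(sql).substitute(params)


def checksum(sql):
    """SHA256 hex digest of the script text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def plan_migrations(scripts, applied):
    """Return the names of scripts that still need to run, in name order.

    scripts: script name -> SQL text.
    applied: script name -> checksum stored in the control table (assumed shape).
    """
    pending = []
    for name in sorted(scripts):
        if name not in applied:
            pending.append(name)
        elif applied[name] != checksum(scripts[name]):
            # Assumption: editing an applied script is an error, not a re-run.
            raise ValueError(f"{name} changed after it was applied")
    return pending
```

Running the planner twice against the same control table yields the same result, which is what makes a re-triggered deployment safe.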

What-Price

Crypto Market ELT Pipeline

End-to-end pipeline that extracts hourly market data for the top 20 cryptocurrencies from the CoinGecko API. Orchestrated with Apache Airflow 3 (TaskFlow API + Astro), it loads into Serverless PostgreSQL (Neon) with idempotent ingestion and exposes metrics in a Streamlit dashboard with Plotly: market dominance, performance index, 24h volatility, and ATH distance. Bronze/Silver architecture inspired by Medallion.

Airflow 3 (Astro) · PostgreSQL (Neon) · SQLAlchemy · Streamlit · Plotly · Pandas · Docker Compose
Repository
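
Two of the dashboard metrics can be sketched as simple arithmetic. The definitions below are assumptions: dominance is taken as a coin's share of the combined market cap of the tracked coins, and ATH distance as the percentage gap between the current price and the all-time high. The record keys and the `market_metrics` name are illustrative, and the project may define these metrics differently.

```python
def market_metrics(coins):
    """Compute dominance and ATH distance (both in percent) per coin.

    Each coin is a dict with "id", "market_cap", "current_price", and "ath"
    keys (an assumed record shape for this sketch).
    """
    total_cap = sum(c["market_cap"] for c in coins)
    if total_cap == 0:
        return {}
    metrics = {}
    for c in coins:
        metrics[c["id"]] = {
            # Share of the combined market cap of the coins being tracked.
            "dominance_pct": round(100 * c["market_cap"] / total_cap, 2),
            # Negative values mean the price sits below the all-time high.
            "ath_distance_pct": round(100 * (c["current_price"] - c["ath"]) / c["ath"], 2),
        }
    return metrics
```

Note that dominance computed this way is relative to the tracked set only, so it overstates each coin's share of the whole market.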

Certifications & Education

Google Cloud Computing Foundations

Google · Nov 2025

Data Engineering with dbt

LinkedIn Learning · Mar 2026

Databricks Fundamentals

Academy Accreditation · Feb 2026

Introduction to Apache Kafka

DataCamp · Feb 2026

Databricks for Data Engineering

Databricks · Dec 2025

Data Engineer Path

Borderless · Nov 2025

HackerRank SQL (Intermediate)

HackerRank · Oct 2025

Software Engineering — Bachelor's

UNOPAR · Expected Jun/2027

Claude Code in Action

Anthropic Academy · 2026

Intro to Machine Learning with Python

USP / ESALQ · Mar 2026