Open-source ETL (extract, transform, load) pipeline for LLM data processing with a block-based interface. Supports multi-source ingestion, Spark-based distributed processing, and privacy-aware filtering. Accepted to the NAACL 2025 demo track.

Paper

arXiv: 2403.19340

Venue: NAACL 2025

Library

GitHub Repository

data, open-source, research