Xiangpeng Hao
@xiangpeng.systems
Database/storage
Flight/DataFusion/Arrow/Parquet
PhD student@UW-Madison

https://xiangpeng.systems
Nice to see this getting shared! 🙌 Now I’m even more motivated to turn it into a full course.
For everyone interested in data infra who wants a quick sense of how big data works, how data systems are designed, and what the tradeoffs are: start with this share from @xiangpeng.systems. Really nice intro!

intro-data-system.xiangpeng.systems
October 29, 2025 at 7:01 PM
Just like other big cities, Madison is getting its own systems talk series. Come join us!
October 24, 2025 at 8:08 PM
LiquidCache: a distributed pushdown cache for DataFusion, designed to cut down S3 requests for diskless databases.

💻 Code: github.com/XiangpengHao...
📄 Paper (VLDB 2026): github.com/XiangpengHao...
These slides explain what LiquidCache is: what-is-liquid-cache.xiangpeng.systems

BTW @xiangpeng.systems is looking for some early adopters who want to be on the bleeding edge. Hit me up if interested
September 10, 2025 at 8:48 PM
Join my PhD prelim talk next Monday:

Data-Aware Caching for Cloud Analytics

🕐 May 19, 1PM CDT
📍 CS2310 or Zoom: uwmadison.zoom.us/j/3081128886
May 16, 2025 at 12:52 AM
Reposted by Xiangpeng Hao
My manifesto on optimizing SQL and DataFrames in query engines (including an explanation of why Apache DataFusion doesn't have a complex join ordering algorithm):
www.influxdata.com/blog/optimiz...
Optimizing SQL (and DataFrames) in DataFusion: Part 1
This post reviews what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames. It also describes how industrial Query Optimizers are structured and standard optimization class...
www.influxdata.com
April 4, 2025 at 4:41 PM
New blog post: "Build your own S3-Select in 400 lines of Rust"

Check it out 😉: blog.xiangpeng.systems/posts/build-...
Build your own S3-Select in 400 lines of Rust – Xiangpeng’s blog
DataFusion is ALL YOU NEED
blog.xiangpeng.systems
March 24, 2025 at 2:14 PM
I submitted a PR that cuts average ClickBench latency by 15% for DataFusion! Reviewing it wasn't straightforward because performance tuning involves complex dynamics, so I wrote a blog post to explain why it works. Check it out: blog.xiangpeng.systems/posts/parque...
Efficient Filter Pushdown in Parquet – Xiangpeng’s blog
How to implement efficient filter pushdown in Parquet readers and why it’s challenging in practice.
blog.xiangpeng.systems
March 13, 2025 at 6:36 PM
Reposted by Xiangpeng Hao
We are excited to share Fray Debugger (aoli.al/blogs/deadlo...), an IntelliJ plugin that allows you to control concurrent execution deterministically!

We have translated the Deadlock Empire (deadlockempire.github.io) into Java to demonstrate how to use Fray Debugger.
Evil Scheduler: Mastering Concurrency Through Interactive Debugging – Ao Li
TLDR Watch the video below to see how Fray debugger works! I enjoy the concept of Deadlock Empire, an interactive game that teaches the semantics of locks and other concurrency primitives. The core id...
aoli.al
March 12, 2025 at 7:25 PM
Reposted by Xiangpeng Hao
@xiangpeng.systems shared a great post about system researchers. I wrote a comment on it and would like to share some thoughts here and offer complementary ideas.

In short: build your papers with open source.

xuanwo.io/links/2025/0...
March 10, 2025 at 7:26 AM
Wrote a blog post reflecting on DeepSeek, NSF funding, and the systems research community in general. Apologies for the bold claims; I hope they invite some discussion.
blog.xiangpeng.systems/posts/system...
Where are we now, system researchers? – Xiangpeng’s blog
blog.xiangpeng.systems
March 10, 2025 at 4:49 AM
Check out the underlying framework: github.com/cmu-pasta/fray
Looking forward to future Rust support 😉
February 22, 2025 at 4:50 PM
My weekend project now comes with AI superpowers! You can now explore Parquet data with natural language: parquet-viewer.haoxp.xyz
November 24, 2024 at 7:58 PM
This is amazing: an open source query engine built on open standards is now the fastest, and it is in Rust! datafusion.apache.org/blog/2024/11...
Apache DataFusion is now the fastest single node engine for querying Apache Parquet files
datafusion.apache.org
November 21, 2024 at 11:22 PM
Reposted by Xiangpeng Hao
New blog post on the fun new hardware advancements which databases can leverage for great gains, and why the cloud means it doesn't matter that they exist. 🫠

transactional.blog/b...
November 20, 2024 at 12:13 AM
Reposted by Xiangpeng Hao
My shoddy ASCII art about writing data to disk was surprisingly popular, so I finished off an old set of notes talking more comprehensively about durably writing data to disk and added a better version of the diagram.

transactional.blog/h...
November 6, 2024 at 11:58 PM
Check out my weekend project, a Parquet explorer: xiangpenghao.github.io/parquet-expl...
It compiles Rust Parquet into WebAssembly, allowing you to explore the structure of Parquet files directly in your browser!
November 4, 2024 at 2:23 AM
New blog post on caching in DataFusion! See how my research is advancing DataFusion’s capabilities and what’s next:
blog.haoxp.xyz/posts/cachin...
October 28, 2024 at 4:07 AM
DataFusion implements one of the most advanced Parquet readers 🚀. Check out how:
blog.haoxp.xyz/posts/parque...
October 25, 2024 at 1:24 AM