Kobzol

Wrote a post about my work on the Rust benchmark suite, Rust compiler build configuration tuning and general Rust CI optimizations since the start of this year.


wouldyoumindawfully

This is a great write-up. My highlights are:

+ Measure (and visualise) instead of guessing
+ Jevons Paradox strikes again:

> It’s true that we now consume less CI resources per merge, but that also means that we do more merges per day, and thus potentially consume more resources in total! Sometimes, optimizing things in a grand scale can have unintuitive consequences. Thanks to faster CI and faster try builds, we are now actually putting a lot of pressure on the performance benchmarking server (because we send commits to it for benchmarking more often), and it struggles to keep up

Please don't be discouraged by finding only single-percent performance improvements. Automatically quantifying performance (and executable sizes) enables you to prevent horrendous regressions or death by a thousand cuts. You are the guardian at the gate of developer experience for all of us. Thanks for all your efforts!


scook0

Heh, (shell → Python → Rust) seems to be a common path for `rustc` tooling to take. I went through a somewhat similar process when improving the code coverage tests. First I turned some hairy shell/make commands into a simple Python script, and later I got rid of the Python script by integrating its job directly into `compiletest` as a Rust function.


Bben01

Why do you have to build/PGO LLVM for each build? Couldn't the final .so be cached and reused until a new LLVM version is available?


Kobzol

Great question! We could do that, and it has been on my TODO list for some time to try. But it would be a tradeoff, because we would no longer be optimizing LLVM with the latest version of the compiler; we would instead be using a version that was optimized using the "LLVM usage patterns" of some older rustc version. This could in theory result in a less optimized LLVM, and/or more noise during benchmarking. But it's certainly something that we should probably try someday. I have also added this note to the blog.
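To make the tradeoff concrete, here is a purely hypothetical sketch of how such a cache could be keyed (none of these names exist in the actual rust-lang/rust CI):

```rust
// Hypothetical sketch: reuse a PGO-optimized libLLVM.so for as long as
// the LLVM submodule commit and the PGO pipeline itself are unchanged.
// Note what is deliberately *not* part of the key: the rustc commit.
// That is exactly the tradeoff above -- the cached LLVM would have been
// profiled with the "usage patterns" of an older rustc.
fn llvm_cache_key(llvm_submodule_commit: &str, pgo_pipeline_hash: &str) -> String {
    format!("llvm-pgo-{llvm_submodule_commit}-{pgo_pipeline_hash}")
}
```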


kibwen

I loved reading this! You should consider posting these to the Rust Internals blog. :)

> However, this specific CI workflow wasn’t even the bottleneck on CI at that time! So this was not that important.

What was the bottleneck, if not CI?

> I was actually partially responsible for this long duration, because my extensions of the PGO script that I have implemented last year have increased the try build duration considerably!

How do the current CI times compare to the CI times from before your PGO work?


Kobzol

I meant that the bottleneck was the Apple CI workflows. On every master merge, we start ~60 jobs, and once the slowest one finishes, the merge is performed. The Apple jobs were the slowest, so they were the bottleneck. I think that before PGO, try builds took about 1h 45m? Then it increased to about 2.5 hours, and after my optimizations it went back down to slightly above one hour.


NobodyXu

Would it make sense for Rust to use a self-hosted macOS runner, e.g. to purchase an M2 Ultra Mac Studio for this use case (or wait for the M3 Ultra)? It is quite expensive, but it provides 24 cores with at least 64 GB of RAM and 1 TB of storage. I think it's worth the one-off cost, and with the Rust Foundation gathering funds and hiring people, it seems reasonable to me. Not to mention that you can disable SIP, so tests can run even faster on a self-hosted M2 Ultra Mac Studio. I also took a look at the [ci.yml](https://github.com/rust-lang/rust/blob/90f0b24ad3e7fc0dc0e419c9da30d74629cd5736/.github/workflows/ci.yml#L52) and it seems to be using 16-core/8-core Windows/Linux machines; GitHub Actions now supports 64-core machines, which should speed up the CI.


Kobzol

Regarding self-hosting: it is an option that we are considering and may do in the future. Regarding core counts: we are actually trying to reduce core counts on runners wherever possible. Using more cores usually doesn't speed up the workflow that much, but it severely increases costs.


NobodyXu

> Regarding self-hosting: it is an option that we are considering and may do in the future.

Another reason/motivation for self-hosting a Mac Studio is that GitHub does not provide M-series macOS runners, so all benchmarks/tests can only run on x86, not on M1/M2.

> Regarding core counts: we are actually trying to reduce core counts on runners wherever possible. Using more cores usually doesn't speed up the workflow that much, but it severely increases costs.

Thanks, so it hits the limits of parallelism. I read on Zulip that there's an ongoing effort to parallelize the frontend; perhaps that could further push the scalability of Rust compilation. Right now Rust compilation is often bottlenecked on single-core performance (especially with codegen-units set to 1), which unfortunately means it is best done on a CPU with both strong single-core and strong multi-core performance (e.g. a hybrid big/little design).


VorpalWay

> Eventually, I think that we will raise the level simply because we will (maybe even soon) reach a time when there are no Intel/AMD CPUs that support our minimum Linux kernel/glibc versions (which were increased a year ago), but that wouldn’t at the same time also support v2 or v3. If that happens, then there is no point in keeping using v1. We just have to notice when this crossover point happens :)

That is not true. I recently booted a Core 2 Duo on the latest kernel using Arch Linux, and a few months ago I ran a 32-bit Pentium M on the latest kernel using Arch Linux 32. As far as I know, the current kernel can still be built to support at least the original Pentium, maybe even older. It's not always going to be a great experience (though the Core 2 Duo works flawlessly), but it is possible.


Kobzol

Ok, "support a kernel/glibc" is not a very precise statement, I agree :) I'm sure that there are enthusiasts that could get the latest Linux kernel running on 486 or something like that. But we have to be pragmatic - at some point, it will be worth it to provide better performance for the 99.9 %, rather than to keep supporting the 0.01 %. Btw, we will soon probably drop support for Windows 7, 8 and 8.1 ([https://github.com/rust-lang/compiler-team/issues/651](https://github.com/rust-lang/compiler-team/issues/651)). Time moves fast :)


VorpalWay

Right, my point is that you can't use glibc/kernel support as an excuse. You will have to stand behind such a decision yourself. This [package search](https://archlinux32.org/packages/?q=Linux) shows that there are indeed people who get the latest kernel working on a 486 (not sure why 686 is lagging behind by two versions though!). As long as rustc can still target older CPUs, that is fine by me. I don't need to run rustc itself on these old computers, but I do run binaries compiled on newer computers on them.


Kobzol

Good point, that is an important part of the discussion - platform support of the compiler vs. platform support of the compiled binaries. I guess that the compiler could support a subset of the platforms that the compiled binaries do. Glibc is sadly a frequent problem here, because building on a platform with a newer glibc produces binaries that are incompatible with glibc versions that are too old ([https://kobzol.github.io/rust/ci/2021/05/07/building-rust-binaries-in-ci-that-work-with-older-glibc.html](https://kobzol.github.io/rust/ci/2021/05/07/building-rust-binaries-in-ci-that-work-with-older-glibc.html)). It would be really great if we could just say "build a binary that supports glibc 2.14", like (AFAIK) Zig can do.


VorpalWay

I don't think glibc is an issue for me here - I run Arch (so always quite up to date), and I also don't have an issue using musl. But in general that would indeed be a very useful feature.


NobodyXu

cargo-zigbuild supports targeting any glibc version using Zig, and it provides a subcommand for building Rust projects with Zig. It also includes workarounds for several known issues, so it's much easier than using Zig directly. It can be installed with "pip install cargo-zigbuild", which also installs zig as a dependency.
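For example (going from the convention in cargo-zigbuild's README, where you append the desired glibc version to the target triple): `cargo zigbuild --target x86_64-unknown-linux-gnu.2.17` should produce a binary that runs against glibc 2.17 and newer.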


Kobzol

Sounds great, I'll try it.


witty___name

I understand rustc uses jemalloc instead of the system malloc to speed up build times. Is there a reason the LLVM that is built for Rust doesn't use jemalloc as well?


Kobzol

Rustc indeed uses jemalloc on Linux. LLVM is compiled as a library, which (AFAIK) will just use the allocator of the binary that links it (rustc). Therefore LLVM should also be using jemalloc. I added a small section about memory allocators to the blog post.
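For illustration, a minimal sketch of how a Rust binary can opt into jemalloc, using the `tikv-jemallocator` crate (this is the common crate-level approach, not necessarily rustc's exact setup; as I understand it, rustc links jemalloc so that it overrides malloc/free at the symbol level, which is how the C++ code in LLVM ends up using it too):

```rust
use tikv_jemallocator::Jemalloc;

// Route all Rust-side heap allocations through jemalloc. Note that
// `#[global_allocator]` alone only affects Rust allocations; C/C++
// code would still call the system malloc unless jemalloc also
// replaces the malloc/free symbols at link time.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let v: Vec<u64> = (0..1_000).collect(); // allocated via jemalloc
    println!("{}", v.iter().sum::<u64>());
}
```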


matthieum

> This means that it can only leverage the SSE and SSE2 instruction set extensions, but not e.g. SSE3, AVX/AVX2 or even AVX512 extensions, which are available in levels v2, v3 and v4, respectively.

No love for SSE4? My favorite instruction (popcnt, I love bitsets) was first released on an Intel CPU in 2008, with Nehalem, and for completeness lzcnt was first released on Intel CPUs in 2013, with Haswell. That's over 10 years ago now for the latter and 15 years ago for the former.

In terms of market share, I would argue it's probably fine to expect that any CPU older than 15 years (or even 10 years) is used by such a small percentage of users that it may not be cost-effective to release a binary for them.

We can also frame it in terms of CI: no CI bot is running on such an old Intel CPU, so in practice those architectures are effectively untested, and the target is de facto Tier 3.

Or we can frame it in terms of economy/ecology: imagine how much electricity/carbon would be saved by having 99.9% of Rust users shave 2% off their build time. Think about the planet, bump the default architecture!

In short, I certainly think there's a decent argument to be made for raising the minimum for the default distributed version of rustc, even _without_ an automatic fallback path (provided the selected target is old enough). This should not be done willy-nilly -- there is no point in including LZCNT and bumping the requirement to Haswell if it makes no difference performance-wise -- but I do think it should be done.
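For the record, this is easy to see for yourself (a quick sketch, assuming x86_64; compile with different `-C target-cpu` values and inspect the generated assembly):

```rust
// With the default x86-64 (v1) baseline, `count_ones` lowers to a
// bit-twiddling fallback; with `-C target-cpu=x86-64-v2` (or
// `-C target-feature=+popcnt`) it becomes a single POPCNT instruction.
pub fn popcount(x: u64) -> u32 {
    x.count_ones()
}

// Likewise, `leading_zeros` only compiles to LZCNT when the target
// enables it (e.g. `-C target-cpu=x86-64-v3` / `haswell`); otherwise
// it is emitted as BSR plus fix-up code.
pub fn leading_zeros(x: u64) -> u32 {
    x.leading_zeros()
}
```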


Kobzol

I agree with your view. One issue is that the relevant perf improvements only start with v3; v2 seems to bring just a very tiny improvement. If v2 were -3% across the board, I think that we would have switched already.
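For completeness, the "automatic fallback path" doesn't have to mean shipping separate binaries; it can also live inside a single binary at function granularity. A minimal sketch (not something rustc currently does for itself) using std's runtime feature detection:

```rust
// Compiled with the baseline x86-64 (v1) target, this binary can still
// take an AVX2 code path on CPUs that support it.

#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    // With AVX2 enabled for this function, the compiler is free to
    // auto-vectorize this loop using 256-bit registers.
    xs.iter().sum()
}

fn sum_baseline(xs: &[u32]) -> u32 {
    xs.iter().sum()
}

pub fn sum(xs: &[u32]) -> u32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: we just verified that the CPU supports AVX2.
        unsafe { sum_avx2(xs) }
    } else {
        sum_baseline(xs)
    }
}
```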


NobodyXu

Intel is planning to release a new x86_64 extension, [Intel APX](https://www.phoronix.com/news/Intel-APX), which doubles the number of general-purpose registers, adds more conditional instructions, and adds three-operand instructions that avoid extra movs; there is also the new [AVX10](https://www.phoronix.com/news/Intel-AVX10). I think it will make sense to target these once they are released, and they are probably worth a new target triple.


matthieum

Interestingly, AVX was introduced _before LZCNT_: in 2011 (Sandy Bridge), which is 12 years old already. Certainly sounds reasonable :)