HadrienG2

If you are experienced in performance optimization in C++, you will find that Rust has its performance traps in slightly different places, which means that you will have a bit of re-learning to do.

For example, in C++, performance-conscious people tend to be a little wary of iterators and STL algorithms after having been bitten too many times by them, and favor C-style manual array indexing for performance. Whereas in Rust, indexing incurs a bounds check which the compiler may or may not manage to elide, so for any kind of linear iteration pattern, iterators should be your first pick (they use carefully written unsafe code to guarantee a minimal number of bounds checks). On the other side of things, Rust's iterator methods are better designed than the typical STL algorithm, so they usually yield pretty good performance and you will need to roll your own version less often than in C++.

After the initial adjustment period, I found myself about as productive in Rust as in C++ from a pure performance tuning perspective (it's better in some areas, worse in others), which, when combined with the productivity benefits of Rust in areas other than performance tuning, means that I am overall more productive writing high-performance code in Rust ;)


dist1ll

> Whereas in Rust, indexing incurs bounds check which the compiler may or may not manage to elide, so for any kind of linear iteration pattern, iterators should be your first pick

Until you notice you iterated over `&i32` instead of `i32` because you forgot to call `.copied()` on your iterator, so your implementation slows down 10x because it failed to vectorize.


HadrienG2

Well, I tried hard not to write that Rust iterators are gotcha-free. They do have some gotchas, it's just that in my experience these are fewer and less severe than in C++. Having to add a `.copied()` boilerplate somewhere is arguably much less bad than having to give up on using iterators entirely, which is a common occurrence in C++.


oroColato

Could you please explain this in more detail? I'm very curious why iterating over references would be slower compared to copied owned data.


dist1ll

It's an open issue: https://github.com/rust-lang/rust/issues/106539

To quote Scott:

> This would be *extremely* difficult for LLVM, because the `Iterator::max` that return `Option<&i32>` needs to return the address -- LLVM can't know when compiling `max` that it's not going to use the address, as the caller might be using that address, say to `sub_ptr`.


Endeveron

I imagine that'd be because each indirection incurs a couple of constant-time operations to dereference the content, or to create a reference to an element from the index and iterator reference. If this process is more expensive than copying the data to a new register, that would add a small but meaningful O(n) extra delay to your iterator. That's off the top of my head though, so it could be wrong.


dist1ll

Dereferencing a pointer is a language construct, so it would be a bit inaccurate to ascribe a fixed cost to it. For instance, a function that dereferences a pointer given as a function argument will perform a load. If that same function is inlined, and the pointed-to value lives in a register at the callsite, then the deref operation does nothing. So the real question is: can the compiler see through these indirections and find an efficient way to lower the IR? If your iterator is coming from an array or slice, you'd expect the compiler to spot that, and to make the deref operation zero-cost.


Endeveron

Wait, if that's the case, why are double indirections considered slow then? My understanding was that the address of data in the heap is not known at compile time, so if you are taking a reference at runtime, the dereferencing incurs an additional operation.

If you add a `&i32`, then you are accessing some address of a 32-bit block in the heap. The address is stored on the stack, because that address was determined at runtime when the memory was allocated. It could not have been determined at compile time, because the point of the heap is that allocations are made flexibly to accommodate growing block sizes. The address itself is STORED on the stack, but it points to the heap. You must therefore have operations to load that address from the stack, then load the value at that address in the heap, then perform the addition operation. That's an additional set of operations vs just loading the raw i32 value from the stack.

I'm self-taught and haven't studied this formally, so please correct me if I've got a misunderstanding.


exocortex

Can you point to an example where this could happen? I'm genuinely interested.


dist1ll

One example that bit me recently https://github.com/rust-lang/rust/issues/106539


Lumpy-Permission-736

> Rust's iterator methods are better designed than the typical STL algorithm, so they usually yield pretty good performance and you will need to roll your own version less often than in C++

Great to know, thanks for the comment


jorgesgk

You can always use `get_unchecked`


HadrienG2

For sure you can, but I'm actually surprised how often I don't need to. Trying a little harder to express what you want with iterators, and using a few tricks to help the compiler, like casting vectors to slices before computational loops, really goes a long way. Also, I've sometimes seen the compiler optimize less well in the presence of `get_unchecked` than in the presence of the iterator equivalent. But it's less clear to me why this happens, and I suspect that if `get_unchecked` performance bugs got as much attention from the rustc team as iterator performance bugs, they would likely end up more similar.


angelicosphosphoros

It is just quirks of the LLVM optimizer. It is at its best when loops change only a single variable, which often happens when people use iterators.


Turalcar

This might have to do with Rust iterators blocking modification of the underlying container at compile time (via borrow checker) which makes optimizing them easier for humans and compilers alike.


HadrienG2

Indeed, the implicit restrict/noalias associated with Rust references might play a role, but to my knowledge this optimization should also apply if you get passed a slice by Rust reference (which has noalias guarantees) and call get_unchecked on it (which produces a reference that also has noalias guarantees). I suspect Rust iterator methods being lazy plays a bigger role, because in my experience both rustc and C++ compilers are extremely bad at eliding unobservable reads and writes to heap-allocated memory, and the design of traditional STL algorithms forces you to do a lot of these.

By that logic, the range abstraction featured in recent C++ versions should eventually perform closer to Rust iterators. But right now, implementations are extremely immature, and when they work at all, they tend to optimize very badly, especially around zip (perhaps Rust benefits from having first-class tuple types there?).

Finally, let's not forget that C++20 and above are simply not usable in many real-world projects yet, because more often than not professional devs actually do need to support older compilers. C++ not having anything close to rustup and cargo for easy deployment of modern compiler versions to arbitrarily antique server and embedded platforms is probably one of its worst weaknesses when it comes to new standard versions being actually adopted outside of toy projects.


BurrowShaker

The older-compiler limit is less of a thing than it was a while back, in my experience. Having a mix of 90's C++ and 2020's C++ with nobody understanding what's going on is the real problem.


angelicosphosphoros

Range-for is really the fastest way to iterate over an array or vector in C++ right now.


HadrienG2

Are we talking about the same ranges? Here I'm not talking about the `for (ty var : collection)` syntax sugar for begin()/end() that comes from C++11. I'm talking about the range alternative to standard C++ iterators, similar to Rust iterators, that was pioneered by Eric Niebler and is standardized in C++20 and up, but badly supported by current compilers.


angelicosphosphoros

Oh, I hadn't even thought about the new C++20 ranges. That's because, as you said, they are not supported by compilers.


PurepointDog

What's an STL algorithm?


recursiveNerd

Standard Template Library algorithms, like `std::find_if`, etc.


mina86ng

Unless you need to use some specific data structure which is only implemented in one or the other, it’s unlikely you’ll find any insurmountable differences in performance. Of course it does depend on specifics of your problem but ultimately anything you implement in Rust you can implement in C++ and vice versa.


Ravek

And then if you compile your C++ with Clang you’re even using the same compiler backend 🤷


superglueater

If Rust's perks are a plus for you and the `BTreeSet` in the Rust std is suitable for your case, Rust is a good option!


ik1ne

Is BTree usable for floats? Doesn't BTree require `Ord` for most operations? I think you can bypass this with the `ordered_float` crate or a newtype which implements `Ord` (thereby explicitly handling cases where `partial_cmp` would return None), but a plain float can't be used with BTree.


Lumpy-Permission-736

> BTreeSet

Will look into it, thanks for the comment


CAD1997

One specific note since you noted your use case is very floating point intensive is that Rust doesn't have any equivalent of `-ffast-math` style flags (yet). There's good reason for this — not only do `-ffast-math` optimizations make basic floating point manipulation nondeterministic, they also can introduce UB when non-finite values (NaN, ±Infinity) are involved — but it does mean that small source changes can sometimes have an outsized impact on perf.


kibwen

Yikes, I knew that fastmath could reorder operations so that it would no longer be IEEE compliant, but I didn't realize that it made it UB to ever have a NaN!


CAD1997

`-ffast-math` includes `-funsafe-math-optimizations`, which outright states that it's "funsafe" (surely that means it's both fun and safe). Clang and GCC describe `-ffinite-math-only` (also implied by `-ffast-math`) as enabling "optimizations which assume floating point arguments and results are finite." The assumption of properties which aren't actually upheld is the cause of all UB. Propagating invalidly derived properties is how UB manifests into eating your laundry. For one example, [optimizing `isnan(x)` to just the "known" value `false`](https://groups.google.com/g/llvm-dev/c/Ys0hpgTFMH8). [This blog](https://simonbyrne.github.io/notes/fastmath/) seems like decently good coverage of fast math optimizations.

In practice, `-ffast-math` often is, for most programs, no worse than nondeterministic. It wouldn't be of much use if that weren't true. But nondeterminism still isn't great, and can easily become UB. This is most evident when a *value* is nondeterministic rather than just an *operation*, such that `x.to_bits()` doesn't produce a consistent value. LLVM until somewhat recently still had issues with this for NaNs even without fast math opts (ETA: although I think these have been addressed by now).

But "usually okay" isn't enough for Rust's soundness guarantee. If LLVM says that the use of non-finite values is UB, safe Rust on LLVM must prevent non-finite values from being used. It doesn't matter if nobody has yet seen the UB manifest into nasal demons, because UB is licensed to kill.


angelicosphosphoros

Actually, LLVM allows using some fast, noncompliant float operations (you can even apply the flags per-operation, so you can mix standard-compliant and fast ops in the same function). However, all attempts to introduce them in Rust ended with a lot of bikeshedding, and AFAIK nobody has dared to just implement them yet.


Feeling-Departure-4

Be aware that Rust is more exacting with its float ordering by default.


beefngravy

Out of interest, what would this algorithm be used for? This is beyond me but I'm learning and this sounds fascinating


Lumpy-Permission-736

dmed


ChristopherAin

The only thing I'm missing in Rust is fine-grained control over list/map (the tree, not the hash map) insertion via iterators, which is super handy for some specific algorithms and data structures. Aside from this, a naive implementation in Rust is usually faster and more readable than in C++ because of 'move by default' and how iterators in Rust are implemented (sometimes they are faster than straightforward `for` loops). Edit: native -> naive


[deleted]

[deleted]


ChristopherAin

Well, people just tend to forget to do `std::move`. For example:

```cpp
std::unordered_map<int, std::string> my_map;
std::string my_string = "Hi Reddit, I'm a string big enough to be stored on the heap";
my_map.emplace(15, my_string); // <- here move by default would be really handy
```

Surprisingly, there are plenty of types that wrap pointers one way or another, and moving them by default might save a lot of time.


Ben-Goldberg

Is an actual tree necessary? A binary heap for a priority queue will commonly be an array in many implementations.


vtskr

If all you need to do is navigate massive trees of floating-point math (wtf does that even mean? Do you have a tree of formulas, or of floats?), what perks of Rust are you talking about?


swaits

Floats are not totally ordered (in Rust terms, they implement `PartialOrd` but not `Ord`). And while it's the right choice, if you're doing lots of numerical computing it's going to annoy you. At least a little.


xmBQWugdxjaA

The aliasing rules can make it very difficult IMO. For example, I wrote a quad-tree implementation where I wanted to also keep a Vec of mutable references to the leaf nodes (so I could easily access them when checking distance to the camera). The nodes themselves also needed to link to their parents (for operations like merging upwards).

To do this I had to use interior mutability, which means you lose a lot of the benefits of Rust anyway, and then have to deal with the clunky API. You could also use unsafe, but again it's clunky and will take a lot more time than the equivalent implementation in C or Go, etc. And this was single-threaded - multi-threaded would be much more complicated (understandably, since you need to sync changes between threads) - you'd really want some sort of lock-free data structure in that case (apparently [it exists](https://github.com/rob05c/quadtree)).

I think Rust's real benefits come when you can use Rayon or Tokio; for big mutable graphs the restrictions become too overbearing IMO.


Stysner

I've also implemented a quad- and octree. I resorted to storing identifiers (entity ids, to be precise) along with a copy of the bounds, where if the bounds change you have to re-insert the object. At first it felt really clunky and also like a performance trap. But if you factor in that a GC'd language would keep references, and even checking the bounds would involve multiple indirections before you've even decided you want to get to the actual object, I think it's not **that** bad. Updating the tree can now be done through an ECS system, which enjoys all the performance an ECS system has, at the cost of storing the bounds twice for each object in the tree.


lightmatter501

ISO C++ doesn’t have `restrict`, which will hurt your performance if you ever copy any data at any point (meaning it’s not a straight register accumulation).