diff --git a/standards/cpp.md b/standards/cpp.md
index 8ef7042..6d6c623 100755
--- a/standards/cpp.md
+++ b/standards/cpp.md
@@ -48,7 +48,6 @@ for (int i = 0; i < N; i++)
 }
 ```
-
 
 Branchless code
 
 ```c++
@@ -57,6 +56,22 @@ for (int i = 0; i < N; i++)
 }
 ```
 
+### Instruction latency table
+
+| Instruction | Latency | RThroughput |
+|-------------|---------|-------------|
+| `jmp`       | -       | 2           |
+| `mov r, r`  | -       | 1/4         |
+| `mov r, m`  | 4       | 1/2         |
+| `mov m, r`  | 3       | 1           |
+| `add`       | 1       | 1/3         |
+| `cmp`       | 1       | 1/4         |
+| `popcnt`    | 1       | 1/4         |
+| `mul`       | 3       | 1           |
+| `div`       | 13-28   | 13-28       |
+
+https://www.agner.org/optimize/instruction_tables.pdf
+
 ### Cache line sharing between CPU cores
 
 When working with multi-threading you may choose to use atomic variables and atomic operations to reduce the locking in your application. You may think that a value `a[0]` used by thread 1 on core 1 and a value `a[1]` used by thread 2 on core 2 will have no performance impact on each other. However, this is wrong. Core 1 and core 2 each have their own L1 and L2 caches, but the CPU doesn't load individual variables, it loads entire cache lines (e.g. 64 bytes). This means that if you define `int a[2]`, both elements have a high chance of being on the same cache line, and therefore thread 1 and thread 2 have to wait on each other when doing atomic writes.