# C/C++

The C/C++ code should favor simplicity over "modern solutions". This often means heavily restricting the language features that are used. The following rules of thumb apply:

1. C99 and C++11 should be used where reasonable
2. C++ may be used in places where external libraries effectively require C++

The reason for the strong focus on C is that we **personally** believe that C is simpler and easier to understand than the various abstractions provided by C++.

## Operating system support

C/C++ solutions should be valid on Windows 10+ and Linux.
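
Where platform specific code is unavoidable, it can be guarded with the compilers' standard predefined macros, for example:

```c++
#if defined(_WIN32)
    // Windows 10+ specific implementation
#elif defined(__linux__)
    // Linux specific implementation
#else
    #error "Unsupported platform"
#endif
```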

## Performance

When writing code, keep the following topics in mind:

* Branching / Branchless programming
* Instruction tables and their latency / throughput
* Cache Sizes
* Cache Line Size
* Cache Locality
* Cache Associativity
* Memory Bandwidth
* Memory Latency
* Prefetching
* Alignment / Packing
* Array of Structs vs Struct of Arrays
* SIMD
* Choosing correct data types
* Data Type Sizes (e.g. 32 bit vs 64 bit)
* Containers (e.g. arrays vs vectors)
* Signed vs unsigned
* Threading
* Cost of abstractions
* Atomics vs locking (mutex) vs producer/consumer
* Cache line sharing between CPU cores

### Branching / Branchless programming

Branched code

```c++
if (a < 50) {
    b += a;
}
```

Branchless code

```c++
b += (a < 50) * a;
```

### Instruction table latency

| Instruction | Latency | RThroughput |
|-------------|---------|:------------|
| `jmp`       | -       | 2           |
| `mov r, r`  | -       | 1/4         |
| `mov r, m`  | 4       | 1/2         |
| `mov m, r`  | 3       | 1           |
| `add`       | 1       | 1/3         |
| `cmp`       | 1       | 1/4         |
| `popcnt`    | 1       | 1/4         |
| `mul`       | 3       | 1           |
| `div`       | 13-28   | 13-28       |

https://www.agner.org/optimize/instruction_tables.pdf
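
The large gap between `div` and the other instructions is one reason why divisions by constant powers of two are usually expressed as shifts (compilers typically do this automatically for constants). A small illustration, assuming unsigned operands:

```c++
unsigned int divide_by_8(unsigned int value) {
    // For unsigned values, shifting right by 3 produces the same result
    // as dividing by 8 while avoiding the comparatively slow div.
    return value >> 3;
}
```
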
### CPU stats

| CPU Category | Typical value |
|--------------|---------------|
| L1 Cache     | 32 - 48 KB    |
| L2 Cache     | 2 - 4 MB      |
| L3 Cache     | 8 - 36 MB     |
| L4 Cache     | 0 - 128 MB    |
| Clock speed  | 3.5 - 6.2 GHz |
| Cache Line   | 64 B          |
| Page Size    | 4 KB          |

### Cache locality

Column-wise traversal

```c++
void process_columns(int matrix[1000][1000]) {
    // Strided access: each inner iteration jumps a full row ahead,
    // touching a different cache line almost every time
    for (int col = 0; col < 1000; ++col) {
        for (int row = 0; row < 1000; ++row) {
            matrix[row][col] *= 2;
        }
    }
}
```

Row-wise traversal

```c++
void process_rows(int matrix[1000][1000]) {
    // Sequential access: the inner loop walks the row linearly,
    // so most accesses hit an already loaded cache line
    for (int row = 0; row < 1000; ++row) {
        for (int col = 0; col < 1000; ++col) {
            matrix[row][col] *= 2;
        }
    }
}
```

### Data Padding

Wasting 6 bytes

```c++
struct Data {
    char a; // 1 byte + 3 bytes padding
    int b;  // 4 bytes
    char c; // 1 byte + 3 bytes trailing padding
};          // sizeof(Data) == 12 on typical targets
```

Wasting 2 bytes

```c++
struct Data {
    char a; // 1 byte
    char c; // 1 byte + 2 bytes padding
    int b;  // 4 bytes
};          // sizeof(Data) == 8 on typical targets
```
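
The effect is easy to verify by printing the struct sizes (a quick check, with the two structs above renamed so they can coexist):

```c++
#include <stdio.h>

struct PaddedData { char a; int b; char c; };  // 12 bytes on typical targets
struct PackedData { char a; char c; int b; };  // 8 bytes on typical targets

int main(void) {
    printf("%zu %zu\n", sizeof(struct PaddedData), sizeof(struct PackedData));

    return 0;
}
```
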
### SIMD

Performs every addition one after another:

```c++
void add_arrays(float* a, float* b, float* result, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

> The code above may already get auto-vectorized by the compiler because it is very simple

Performs 8 additions at the same time:

```c++
#include <immintrin.h> // AVX intrinsics

void add_arrays(float* a, float* b, float* result, size_t size) {
    size_t i = 0;

    // Process 8 floats per iteration using 256 bit wide AVX registers
    for (; i < size - (size % 8); i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vr = _mm256_add_ps(va, vb);

        _mm256_storeu_ps(&result[i], vr);
    }

    // Handle the remainder if size is not divisible by 8
    for (; i < size; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

### Locks vs lockless

Locked version

```c++
#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;

void increment_counter() {
    pthread_mutex_lock(&mtx);
    ++counter;
    pthread_mutex_unlock(&mtx);
}
```

Lockless version

```c++
#include <stdatomic.h> // C11 atomics

atomic_int counter = 0;

void increment_counter() {
    atomic_fetch_add(&counter, 1);
}
```

### Cache line sharing between CPU cores

When working with multi-threading you may choose to use atomic variables and atomic operations to reduce the locking in your application. You may think that a variable `a[0]` used by thread 1 on core 1 and a variable `a[1]` used by thread 2 on core 2 will have no performance impact. However, this is wrong. Core 1 and core 2 have separate L1 and L2 caches, BUT the CPU doesn't load individual variables, it loads entire cache lines (e.g. 64 bytes). This means that if you define `int a[2]`, both elements have a high chance of being on the same cache line, and therefore thread 1 and thread 2 have to wait on each other when doing atomic writes. This effect is known as false sharing.
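
A common mitigation is to align or pad the per-thread data so that each value gets its own cache line. A minimal C11-style sketch (the struct name is made up, and 64 bytes is the assumed cache line size of a typical x86-64 CPU):

```c++
#include <stdatomic.h>

// Each counter is aligned to its own 64 byte cache line, so atomic writes
// from one core no longer invalidate the cache line used by the other core.
// _Alignas is C11; in C++ use alignas(64) together with std::atomic<int>.
struct PaddedCounter {
    _Alignas(64) atomic_int value;
};

struct PaddedCounter counters[2]; // counters[0] for thread 1, counters[1] for thread 2

void increment_counter(int thread_index) {
    atomic_fetch_add(&counters[thread_index].value, 1);
}
```
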
## Namespace

### use

Namespaces must never be imported globally. For example, `using namespace std;` is prohibited; functions from the standard namespace must be prefixed with `std::` instead.
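
For illustration (the function is a made-up example):

```c++
#include <string>

// Prohibited:
// using namespace std;

// Allowed: standard library names are prefixed explicitly
bool is_empty_text(const std::string& text) {
    return text.empty();
}
```
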
## Templates

Don't use C++ templates.

## Allocation

Use memory arenas instead of repeatedly allocating memory manually.

However, if necessary, use the C allocation functions for heap allocation.
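
A minimal sketch of such an arena (the names are made up): one upfront heap allocation via `malloc`, cheap pointer-bump allocations from it, and a single `free` at the end.

```c++
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t* memory;
    size_t   size;
    size_t   offset;
} Arena;

int arena_create(Arena* arena, size_t size) {
    arena->memory = (uint8_t*) malloc(size);
    arena->size   = size;
    arena->offset = 0;

    return arena->memory != NULL;
}

void* arena_alloc(Arena* arena, size_t bytes) {
    // Align every allocation to 8 bytes
    size_t aligned = (arena->offset + 7) & ~((size_t) 7);
    if (aligned + bytes > arena->size) {
        return NULL; // Arena is exhausted
    }

    arena->offset = aligned + bytes;

    return arena->memory + aligned;
}

void arena_destroy(Arena* arena) {
    free(arena->memory);
    arena->memory = NULL;
    arena->size   = 0;
    arena->offset = 0;
}
```
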
## Functions

### C++ functions

Don't use C++ standard library functions, or C++ functions provided by other C++ headers, unless you have to work with C++ types, which is often required when working with third-party libraries.

### Parameters

Generally, functions that take pointers to non-scalar types should modify the data in place instead of allocating new memory **IF** reasonable. This forces programmers to consciously create copies before passing data **IF** they need the original data. To indicate that a reference/pointer is not modified by a function, declare it as `const`!

We believe this approach provides a framework for better memory management and better performance in general.

Examples of this include (see the sketch after the list):

* Matrix multiplication with a scalar
* Sorting data (depends on sorting algorithm)
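
A minimal sketch of the first example (the function names are made up), modifying the matrix in place and marking read-only input as `const`:

```c++
#include <stddef.h>

// Modifies the caller's data in place instead of allocating a new matrix
void matrix_mult_scalar(float* matrix, size_t elements, float scalar) {
    for (size_t i = 0; i < elements; ++i) {
        matrix[i] *= scalar;
    }
}

// The const pointer documents that the function never modifies the data
float matrix_sum(const float* matrix, size_t elements) {
    float sum = 0.0f;
    for (size_t i = 0; i < elements; ++i) {
        sum += matrix[i];
    }

    return sum;
}
```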