Developer-Guide/standards/cpp.md

# C/C++

The C/C++ code should focus on using simplicity over "modern solutions". This may often mean to heavily restrict the code. The following rule of thumb applies:

1. C99 and C++11 should be used where reasonable
2. C++ may be used in places where external libraries basically require C++

The reason for the strong focus on C is that we **personally** believe that C is simpler and easier to understand than the various abstractions provided by C++.

## Operating system support

C/C++ solutions should be valid on Windows 10+ and Linux.

## Performance

When writing code keep the following topics in mind:

* Branching / Branchless programming
* Instruction tables and their latency / throughput
* Cache Sizes
* Cache Line Size
* Cache Locality
* Cache Associativity
* Memory Bandwidth
* Memory Latency
* Prefetching
* Alignment / Packing
* Array of Structs vs Struct of Arrays
* SIMD
* Choosing correct data types
  * Data Type Sizes (e.g. 32 bit vs 64 bit)
  * Containers (e.g. arrays vs vectors)
  * Signed vs unsigned
* Threading
* Cost of abstractions
* atomics vs locking (mutex)
* Cache line sharing between CPU cores

### Branching / Branchless programming

Branched code

```c++
if (a < 50) {
    b += a;
}
```

Branchless code

```c++
b += (a < 50) * a;
```

### Instruction table latency

| Instruction | Latency | RThroughput |
|-------------|---------|:------------|
| `jmp`       | -       | 2           |
| `mov r, r`  | -       | 1/4         |
| `mov r, m`  | 4       | 1/2         |
| `mov m, r`  | 3       | 1           |
| `add`       | 1       | 1/3         |
| `cmp`       | 1       | 1/4         |
| `popcnt`    | 1       | 1/4         |
| `mul`       | 3       | 1           |
| `div`       | 13-28   | 13-28       |

https://www.agner.org/optimize/instruction_tables.pdf

### CPU stats

| CPU Category | Stat |
|--------------|---------|
| L1 Cache | 32 - 48 KB |
| L2 Cache | 2 - 4 MB |
| L3 Cache | 8 - 36 MB |
| L4 Cache | 0 - 128 MB |
| Clock speed | 3.5 - 6.2 Ghz  |
| Cache Line | 64 B |
| Page Size | 4 KB |

### Cache locality

Column wise traversal

```c++
void process_columns(int matrix[1000][1000]) {
    for (int col = 0; col < 1000; ++col) {
        for (int row = 0; row < 1000; ++row) {
            matrix[row][col] *= 2;
        }
    }
}
```

Row wise traversal

```c++
void process_rows(int matrix[1000][1000]) {
    for (int row = 0; row < 1000; ++row) {
        for (int col = 0; col < 1000; ++col) {
            matrix[row][col] *= 2;
        }
    }
}
```

### Data Padding

Wasting 6 bytes

```c++
struct Data {
    char a;
    int b;
    char c;
};
```

Wasting 2 bytes

```c++
struct Data {
    char a;
    char c;
    int b;
};
```

### SIMD

Performs every addition one after another:

```c++
void add_arrays(float* a, float* b, float* result, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

> The code above may actually get optimized by the compiler because it is very simple

Performs 8 additions at the same time:

```c++
void add_arrays(float* a, float* b, float* result, size_t size) {
    size_t i = 0;
    for (; i < size - (size % 8); i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vr = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&result[i], vr);
    }

    // Handle the remainder if size is not dividable by 8
    for (; i < size; ++i) {
        result[i] = a[i] + b[i];
    }
}
```

### Locks vs lockless

Locked version

```c++
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;

void increment_counter() {
    pthread_mutex_lock(&mtx);
    ++counter;
    pthread_mutex_unlock(&mtx);
}
```

Lockless version

```c++
atomic_int counter = 0;

void increment_counter() {
    atomic_fetch_add(&counter, 1);  // Atomic operation, no locking needed
}
```

### Cache line sharing between CPU cores

When working with multi-threading you may choose to use atomic variables and atomic operations to reduce the locking in your application. You may think that a variable value `a[0]` used by thread 1 on core 1 and a variable value `a[1]` used by thread 2 on core 2 will have no performance impact. However, this is wrong. Core 1 and core 2 both have different L1 and L2 caches BUT the CPU doesn't just load individual variables, it loads entire cache lines (e.g. 64 bytes). This means that if you define `int a[2]`, it has a high chance of being on the same cache line and therfore thread 1 and thread 2 both have to wait on each other when doing atomic writes.

## Namespace

### use

Namespaces must never be globally used. This means for example `use namespace std;` is prohibited and functions from the standard namespace should be prefixed instead `std::`

## Templates

Don't use C++ templates.

## Allocation

Use memory arenas instead of over and over manually allocating memory.

However, if neccessary use C allocation methods for heap allocation.

## Functions

### C++ function

Don't use C++ standard library functions or C++ functions provided by other C++ header files unless you have to work with C++ types which is often required when working with third party libraries.

### Parameters

Generally, functions that take pointers to non-scalar types should modify the data instead of allocating new memory **IF** reasonable. This forces programmers to consciously create copies before passing data **IF** they need the original data. To indicate that a reference/pointer is not modified by a function define them as const!

We believe this approach provides a framework for better memory management and better performance in general.

Examples for this can be:

* Matrix multiplication with a scalar
* Sorting data (depends on sorting algorithm)