Commit 48b1560

Merge pull request #64 from synopse/main
abouchez / mormot: new -c option, new rounding method and README enhancements

2 parents ccf0377 + 6b35009

2 files changed: 43 additions & 50 deletions

entries/abouchez/README.md

Lines changed: 25 additions & 18 deletions
@@ -22,35 +22,40 @@ Here are the main ideas behind this implementation proposal:
 - **mORMot** makes cross-platform and cross-compiler support simple - e.g. `TMemMap`, `TDynArray`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing;
 - The entire 16GB file is `memmap`ed at once into memory - it won't work on a 32-bit OS, but avoids any `read` syscall or memory copy;
-- Process file in parallel using several threads - configurable via the `-t=` switch, default being the total number of CPUs reported by the OS;
-- Input is fed into each thread as 64MB chunks: because thread scheduling is unbalanced, it is inefficient to pre-divide the size of the whole input file into the number of threads;
+- The file is processed in parallel using several threads - configurable via the `-t=` switch, the default being the total number of CPUs reported by the OS;
+- Input is fed into each thread as 4MB chunks (see also the `-c` command line switch): because thread scheduling is unbalanced, it is inefficient to pre-divide the size of the whole input file into the number of threads;
 - Each thread manages its own `Station[]` data, so there is no lock until the thread is finished and data is consolidated;
 - Each `Station[]` information is packed into a record of exactly 16 bytes, with no external pointer/string, to leverage the CPU L1 cache size (64 bytes) for efficiency;
-- Maintain a `StationHash[]` hash table for the name lookup, with crc32c perfect hash function - no name comparison nor storage is needed with a perfect hash (see below);
-- On Intel/AMD/AARCH64 CPUs, *mORMot* uses hardware SSE4.2 opcodes for this crc32c computation;
-- Store values as 16-bit or 32-bit integers, as temperature multiplied by 10;
-- Parse temperatures with a dedicated code (expects single decimal input values);
+- An O(1) hash table is maintained for the name lookup, with a crc32c perfect hash function - no name comparison nor storage is needed with a perfect hash (see below);
+- On Intel/AMD/AARCH64 CPUs, *mORMot* offers hardware SSE4.2 opcodes for this crc32c computation;
+- The hash table does not directly store the `Station[]` data, but uses a separate `StationHash[]` lookup array of 16-bit indexes (as our `TDynArray` does) to leverage the CPU caches;
+- Values are stored as 16-bit or 32-bit integers, as the temperature multiplied by 10;
+- Temperatures are parsed with dedicated code (expects single decimal input values);
 - The station names are stored as UTF-8 pointers to the memmap location where they appear first, in `StationName[]`, to be emitted eventually for the final output, not during temperature parsing;
 - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process, to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
 - The Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target) - perhaps making it less readable, because we used pointer arithmetic where it matters (I like to think of such low-level Pascal code as [portable assembly](https://sqlite.org/whyc.html#performance), similar to "unsafe" code in managed languages);
-- Can optionally output timing statistics and resultset hash value on the console to debug and refine settings (with the `-v` command line switch);
-- Can optionally set each thread affinity to a single core (with the `-a` command line switch).
+- It can optionally output timing statistics and the resultset hash value on the console, to debug and refine settings (with the `-v` command line switch);
+- It can optionally set each thread's affinity to a single core (with the `-a` command line switch).
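The dedicated single-decimal temperature parsing mentioned above can be sketched as follows - an illustrative C reimplementation, not the project's Pascal code. It assumes the 1brc input format: an optional sign, one or two integer digits, a dot, and exactly one decimal digit.

```c
#include <assert.h>

// Hypothetical sketch of a dedicated "temperature * 10" parser, assuming the
// 1brc format: optional '-', one or two integer digits, '.', one decimal.
static int parse_temp10(const char *p)
{
    int neg = (*p == '-');      // optional sign
    if (neg)
        p++;
    int v = *p++ - '0';         // first integer digit
    if (*p != '.')              // optional second integer digit
        v = v * 10 + (*p++ - '0');
    p++;                        // skip '.'
    v = v * 10 + (*p - '0');    // single decimal digit
    return neg ? -v : v;
}
```

Because the format is fixed, no generic `strtod`-style loop or error handling is needed, which is what makes such a parser branch-light and fast.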

 If you are not convinced by the "perfect hash" trick, you can define the `NOPERFECTHASH` conditional, which forces full name comparison, but is noticeably slower. Our algorithm is safe with the official dataset, and gives the expected final result - which was the goal of this challenge: compute the right data reduction in as little time as possible, with all possible hacks and tricks. A "perfect hash" is a well-known hacking pattern, used when the dataset is validated in advance. And since our CPUs offer `crc32c`, which happens to be perfect for our dataset... let's use it! https://en.wikipedia.org/wiki/Perfect_hash_function ;)
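For illustration, here is a portable bit-by-bit C version of the same CRC-32C (Castagnoli) checksum that the SSE4.2 `crc32` opcode computes in hardware - far slower than the hardware instruction, but producing identical values:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Portable (slow) CRC-32C: polynomial 0x1EDC6F41, reflected form 0x82F63B78,
// the same function the SSE4.2 crc32 instruction implements in hardware.
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for the ASCII string `123456789` is `0xE3069283`, which is a quick way to validate any CRC-32C implementation against the hardware opcode.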

 ## Why L1 Cache Matters

-Taking special care of the "64 bytes cache line" is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
+Taking special care of the "64 bytes cache line" does make a noticeable difference in performance. Even the fastest Java implementations of the 1brc challenge try to regroup the data in memory.

 The L1 cache is well known in the performance-hacking literature to be the main bottleneck of any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.

-Min/max values will be reduced as 16-bit smallint - resulting in temperature range of -3276.7..+3276.8 which seems fair on our planet according to the IPCC. ;)
+Min/max values have been reduced to 16-bit `smallint` - resulting in a temperature range of -3276.7..+3276.8, which seems fair on our planet according to the IPCC. ;)

 As a result, each `Station[]` entry takes only 16 bytes, so we can fit exactly 4 entries in a single CPU L1 cache line. To be fair, if we put some more data into the record (e.g. use `Int64` instead of `smallint`/`integer`), the performance degrades only by a few percent. The main factor seems to be that the entry is likely to fit into a single cache line, even if filling two cache lines may sometimes be needed for misaligned data.
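A 16-byte per-station record could look like this in C; the exact field order below is a guess, not a copy of the Pascal record, but the sizes match the description above (32-bit hash, 32-bit count, 32-bit sum, two 16-bit extremes, no pointer or string):

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical C equivalent of the 16-byte Station record described above.
typedef struct {
    uint32_t hash;   // crc32c of the station name (perfect hash, no name kept)
    uint32_t count;  // number of measurements
    int32_t  sum;    // accumulated temperature * 10
    int16_t  min;    // lowest temperature * 10
    int16_t  max;    // highest temperature * 10
} Station;

// 4 such entries fit exactly in one 64-byte L1 cache line
_Static_assert(sizeof(Station) == 16, "Station must stay 16 bytes");
```

Keeping the record a power-of-two size also means an entry can never straddle more than one extra cache line, whatever its index in the array.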

 In our first attempt (see "Old Version" below), we stored the name into the `Station[]` array, so that each entry was exactly 64 bytes long. But since `crc32c` is a perfect hash function for our dataset, it is enough to just store the 32-bit hash instead, and not the actual name.

-Note that if we reduce the number of stations from 41343 to 400, the performance is much higher, also with a 16GB file as input. The reason is that since 400x16 = 6400, each dataset could fit entirely in each core L1 cache. No slower L2/L3 cache is involved, therefore performance is better. The cache memory seems to be the bottleneck of our code. Which is a good sign.
+We tried to remove the `StationHash[]` array-of-`word` lookup table. It made one data read less, but performed almost three times slower. Data locality and cache pollution prevail over the absolute number of memory reads: it is faster to access memory twice, if that memory can remain in the CPU caches. Only profiling and timing would show this. The shortest code is not the fastest with modern CPUs.
+
+Note that if we reduce the number of stations from 41343 to 400 (as other languages' 1brc projects do), the performance is much higher, also with a 16GB file as input. My guess is that since 400x16 = 6400, each dataset could fit entirely in each core's L1 cache. No slower L2/L3 cache is involved, therefore performance is better.
+
+The cache memory seems to be the bottleneck of our code. Which is a good sign, even if it may be difficult to make it any faster. But who knows?
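The `StationHash[]` indirection can be sketched as follows. The table size, probing scheme, and names are assumptions, but the idea is the one described above: the hash table stores only 16-bit indexes (0 meaning "empty"), while the compact 16-byte records stay packed together in `Station[]`.

```c
#include <assert.h>
#include <stdint.h>

#define HASHSIZE 65536  // power of two, so masking replaces modulo (assumption)

typedef struct {
    uint32_t hash;      // crc32c of the name: no name storage needed
    uint32_t count;
    int32_t  sum;
    int16_t  min, max;
} Station;

static Station  station[45000];        // packed 16-byte records
static uint16_t stationhash[HASHSIZE]; // 0 = empty, else Station index + 1
static uint32_t stations;

// Return the record for a given (perfect) hash, registering it when unseen:
// linear probing over the small word array keeps lookups cache-friendly.
static Station *lookup(uint32_t hash)
{
    uint32_t slot = hash & (HASHSIZE - 1);
    for (;;) {
        uint16_t ndx = stationhash[slot];
        if (ndx == 0) {                  // free slot: new station
            stationhash[slot] = (uint16_t)(++stations);
            Station *s = &station[stations - 1];
            s->hash = hash;
            s->min = INT16_MAX;
            s->max = INT16_MIN;
            return s;
        }
        if (station[ndx - 1].hash == hash)
            return &station[ndx - 1];    // perfect hash: no name comparison
        slot = (slot + 1) & (HASHSIZE - 1);
    }
}
```

The extra `word` read costs one indirection, but the 128KB index array and the hot subset of the record array both stay resident in cache, which is exactly the trade-off the paragraph above measured.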

 ## Usage

@@ -70,8 +75,10 @@ Options:
   -h, --help display this help

 Params:
-  -t, --threads <number> (default 16)
+  -t, --threads <number> (default 20)
       number of threads to run
+  -c, --chunk <megabytes> (default 4)
+      size in megabytes used for per-thread chunking
 ```
 We will use these command-line switches for local (dev PC) and benchmark (challenge HW) analysis.

@@ -110,14 +117,14 @@ This is the expected behavior, and will be fine with the benchmark challenge, wh

 On my Intel 13th-gen processor with E-cores and P-cores, forcing thread-to-core affinity does not make any huge difference (we are within the error margin):
 ```
-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v
-Processing measurements.txt with 20 threads and affinity=false
-result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -v
+Processing measurements.txt with 20 threads, chunkmb=4 and affinity=false
+result hash=85614446, result length=1139418, stations count=41343, valid utf8=1
 done in 2.36s 6.6 GB/s

-ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -t=10 -v -a
-Processing measurements.txt with 20 threads and affinity=true
-result hash=8A6B746A, result length=1139418, stations count=41343, valid utf8=1
+ab@dev:~/dev/github/1brc-ObjectPascal/bin$ ./abouchez measurements.txt -v -a
+Processing measurements.txt with 20 threads, chunkmb=4 and affinity=true
+result hash=85614446, result length=1139418, stations count=41343, valid utf8=1
 done in 2.44s 6.4 GB/s
 ```
 Affinity may help on the Ryzen 9, because its Zen 3 architecture is made of 16 identical cores with 32 threads, not this Intel E/P-cores mess. But we will validate that on real hardware - no premature guess!
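The `-a` switch relies on mORMot's `SetThreadCpuAffinity`; on Linux, the underlying mechanism is the one sketched below (illustrative C, not the project's code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core, so the scheduler cannot migrate it
// (and thereby invalidate its warm L1/L2 caches); returns 0 on success.
static int pin_current_thread(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

On hybrid Intel parts, pinning can backfire by tying a worker to a slow E-core, which is consistent with the mixed timings measured above.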

entries/abouchez/src/brcmormot.lpr

Lines changed: 18 additions & 32 deletions
@@ -47,13 +47,13 @@ TBrcMain = class
     fEvent: TSynEvent;
     fRunning, fMax: integer;
     fCurrentChunk: PByteArray;
-    fCurrentRemain: PtrUInt;
+    fCurrentRemain, fChunkSize: PtrUInt;
     fList: TBrcList;
     fMem: TMemoryMap;
     procedure Aggregate(const another: TBrcList);
     function GetChunk(out start, stop: PByteArray): boolean;
   public
-    constructor Create(const fn: TFileName; threads, max: integer;
+    constructor Create(const fn: TFileName; threads, chunkmb, max: integer;
       affinity: boolean);
     destructor Destroy; override;
     procedure WaitFor;
@@ -183,7 +183,7 @@ procedure TBrcThread.Execute;

 { TBrcMain }

-constructor TBrcMain.Create(const fn: TFileName; threads, max: integer;
+constructor TBrcMain.Create(const fn: TFileName; threads, chunkmb, max: integer;
   affinity: boolean);
 var
   i, cores, core: integer;
@@ -193,6 +193,7 @@ constructor TBrcMain.Create(const fn: TFileName; threads, max: integer;
   if not fMem.Map(fn) then
     raise ESynException.CreateUtf8('Impossible to find %', [fn]);
   fMax := max;
+  fChunkSize := chunkmb shl 20;
   fList.Init(fMax);
   fCurrentChunk := pointer(fMem.Buffer);
   fCurrentRemain := fMem.Size;
@@ -217,11 +218,6 @@ destructor TBrcMain.Destroy;
   fEvent.Free;
 end;

-const
-  CHUNKSIZE = 64 shl 20; // fed each TBrcThread with 64MB chunks
-  // it is faster than naive parallel process of size / threads input because
-  // OS thread scheduling is never fair so some threads will finish sooner
-
 function TBrcMain.GetChunk(out start, stop: PByteArray): boolean;
 var
   chunk: PtrUInt;
@@ -232,9 +228,9 @@ function TBrcMain.GetChunk(out start, stop: PByteArray): boolean;
   if chunk <> 0 then
   begin
     start := fCurrentChunk;
-    if chunk > CHUNKSIZE then
+    if chunk > fChunkSize then
     begin
-      stop := pointer(GotoNextLine(pointer(@start[CHUNKSIZE])));
+      stop := pointer(GotoNextLine(pointer(@start[fChunkSize])));
       chunk := PAnsiChar(stop) - PAnsiChar(start);
     end
     else
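The chunk-dispatch idea behind `GetChunk` can be sketched in C as follows. The names and the mutex-based locking are assumptions (the Pascal code uses its own synchronization): a shared cursor hands each worker the next `chunk_size` bytes, extended to the end of the current line so no line is ever split between threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const char *cur;     // next unprocessed byte of the memory-mapped file
static size_t remain;       // bytes left to hand out
static size_t chunk_size;   // e.g. 4 << 20 for 4MB chunks

// Hand out the next chunk [start, stop): returns 0 once the input is consumed.
static int get_chunk(const char **start, const char **stop)
{
    int ok = 0;
    pthread_mutex_lock(&lock);
    if (remain != 0) {
        *start = cur;
        size_t n = remain;
        if (n > chunk_size) {
            // extend past chunk_size to the end of the line, like GotoNextLine()
            const char *nl = memchr(cur + chunk_size, '\n', remain - chunk_size);
            n = nl ? (size_t)(nl + 1 - cur) : remain;
        }
        *stop = cur + n;
        cur += n;
        remain -= n;
        ok = 1;
    }
    pthread_mutex_unlock(&lock);
    return ok;
}
```

Small chunks pulled on demand keep all workers busy until the very end, whereas pre-dividing the file into one slice per thread leaves the fastest threads idle while the slowest finish.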
@@ -310,23 +306,6 @@ procedure AddTemp(w: TTextWriter; sep: AnsiChar; val: PtrInt);
   w.Add(AnsiChar(val - d10 * 10 + ord('0')));
 end;

-function Average(sum, count: PtrInt): PtrInt;
-// sum and result are temperature * 10 (one fixed decimal)
-var
-  x, t: PtrInt; // temperature * 100 (two fixed decimals)
-begin
-  x := (sum * 10) div count; // average
-  // this weird algo follows the "official" PascalRound() implementation
-  t := (x div 10) * 10; // truncate
-  if abs(x - t) >= 5 then
-    if x < 0 then
-      dec(t, 10)
-    else
-      inc(t, 10);
-  result := t div 10; // truncate back to one decimal (temperature * 10)
-  //ConsoleWrite([sum / (count * 10), ' ', result / 10]);
-end;
-
 function ByStationName(const A, B): integer; // = StrComp() but ending with ';'
 var
   pa, pb: PByte;
@@ -354,6 +333,11 @@ function ByStationName(const A, B): integer; // = StrComp() but ending with ';'
     result := 1;
 end;

+function ceil(x: double): PtrInt; // "official" rounding method
+begin
+  result := trunc(x) + ord(frac(x) > 0); // using FPU is fast enough here
+end;
+
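The new rounding method replaces the old integer `Average()` with a plain ceiling of the mean, i.e. rounding toward positive infinity on the `temperature * 10` fixed-point values. A C transcription of that helper (names are ours, the logic mirrors the Pascal `trunc(x) + ord(frac(x) > 0)`):

```c
#include <assert.h>

// Mirror of the Pascal ceil() helper: trunc toward zero, then add 1 only when
// a strictly positive fractional part remains (round toward +infinity).
static long ceil10(double x)
{
    long t = (long)x;                  // truncates toward zero, like trunc()
    return t + (x - (double)t > 0);
}

// average of a station: sum and result are both temperature * 10
static long average10(long sum, long count)
{
    return ceil10((double)sum / (double)count);
}
```

Note the asymmetry this implies: a mean of 2.5 rounds to 3, but a mean of -2.5 rounds to -2, since the fractional part of a negative quotient is not strictly positive.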
 function TBrcMain.SortedText: RawUtf8;
 var
   c: PtrInt;
@@ -385,7 +369,7 @@ function TBrcMain.SortedText: RawUtf8;
     p := fList.StationName[n^];
     w.AddNoJsonEscape(p, NameLen(p));
     AddTemp(w, '=', s^.Min);
-    AddTemp(w, '/', Average(s^.Sum, s^.Count));
+    AddTemp(w, '/', ceil(s^.Sum / s^.Count));
     AddTemp(w, '/', s^.Max);
     dec(c);
     if c = 0 then
@@ -409,7 +393,7 @@ function TBrcMain.SortedText: RawUtf8;

 var
   fn: TFileName;
-  threads: integer;
+  threads, chunkmb: integer;
   verbose, affinity, help: boolean;
   main: TBrcMain;
   res: RawUtf8;
@@ -427,6 +411,8 @@ function TBrcMain.SortedText: RawUtf8;
   Executable.Command.Get(
     ['t', 'threads'], threads, '#number of threads to run',
     SystemInfo.dwNumberOfProcessors);
+  Executable.Command.Get(
+    ['c', 'chunk'], chunkmb, 'size in #megabytes used for per-thread chunking', 4);
   help := Executable.Command.Option(['h', 'help'], 'display this help');
   if Executable.Command.ConsoleWriteUnknown then
     exit
@@ -438,11 +424,11 @@ function TBrcMain.SortedText: RawUtf8;
   end;
   // actual process
   if verbose then
-    ConsoleWrite(['Processing ', fn, ' with ', threads, ' threads',
-      ' and affinity=', BOOL_STR[affinity]]);
+    ConsoleWrite(['Processing ', fn, ' with ', threads, ' threads, chunkmb=',
+      chunkmb, ' and affinity=', BOOL_STR[affinity]]);
   QueryPerformanceMicroSeconds(start);
   try
-    main := TBrcMain.Create(fn, threads, {max=}45000, affinity);
+    main := TBrcMain.Create(fn, threads, chunkmb, {max=}45000, affinity);
     // note: current stations count = 41343 for 2.5MB of data per thread
     try
       main.WaitFor;
