Performance Guide
Tips for maximizing Buffer performance across platforms.
Key Principles
- Avoid allocations in hot paths - use buffer pools
- Prefer zero-copy operations - slicing over copying
- Use largest primitives - readLong() over 8x readByte()
- Use bulk operations - multi-byte reads/writes
- Choose the right allocation zone - Direct for I/O
Buffer Pooling
Allocation is expensive. Pool buffers in hot paths:
// Bad: allocate per request
fun handleRequest(data: ByteArray) {
    val buffer = PlatformBuffer.allocate(data.size) // Allocation!
    buffer.writeBytes(data)
    process(buffer)
}

// Good: pool buffers
val pool = BufferPool(defaultBufferSize = 8192)

fun handleRequest(data: ByteArray) {
    pool.withBuffer(data.size) { buffer ->
        buffer.writeBytes(data)
        process(buffer)
    } // Returned to pool
}
Zero-Copy Operations
Slicing
Create views without copying:
// Zero-copy: slice shares memory
val slice = buffer.slice()
// Copy: creates new buffer
val copy = PlatformBuffer.allocate(buffer.remaining())
copy.write(buffer)
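To see why slicing is cheap, here is a JVM-only sketch using java.nio.ByteBuffer directly (that Direct PlatformBuffers wrap ByteBuffer on the JVM is our assumption for illustration): a slice is a new view over the same backing memory, so no bytes move.

```kotlin
import java.nio.ByteBuffer

fun main() {
    val buffer = ByteBuffer.allocate(8)
    val slice = buffer.slice() // view over the same backing array, no copy

    // A write through the slice is visible through the original buffer:
    slice.put(0, 42.toByte())
    println(buffer.get(0)) // prints 42
}
```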
StreamProcessor Zero-Copy
StreamProcessor returns slices when data is contiguous:
// Zero-copy when data is in single chunk
val payload = processor.readBuffer(length)
// Returns slice if possible, copies only when spanning chunks
Use Largest Primitives
Reading/writing larger primitives is significantly faster than byte-by-byte operations. The CPU processes 8 bytes in a single Long operation vs 8 separate Byte operations.
WebSocket XOR Masking Example
WebSocket requires XOR masking of payload data. The mask is always exactly 4 bytes, so use Int instead of ByteArray:
// Slow: XOR each byte individually with ByteArray mask
fun maskPayloadSlow(payload: PlatformBuffer, maskKey: ByteArray) {
    var i = 0
    while (payload.remaining() > 0) {
        val b = payload.readByte()
        payload.position(payload.position() - 1)
        payload.writeByte((b.toInt() xor maskKey[i % 4].toInt()).toByte())
        i++
    }
}

// Fast: Use Int for mask, expand to Long, process 8 bytes at a time
fun maskPayload(payload: PlatformBuffer, maskKey: Int) {
    // Expand 4-byte Int mask to 8-byte Long mask by duplicating
    val maskLong = (maskKey.toLong() and 0xFFFFFFFFL) or
        ((maskKey.toLong() and 0xFFFFFFFFL) shl 32)
    // Process 8 bytes at a time
    while (payload.remaining() >= 8) {
        val value = payload.readLong()
        payload.position(payload.position() - 8)
        payload.writeLong(value xor maskLong)
    }
    // Handle remaining 0-7 bytes
    var shift = 24
    while (payload.remaining() > 0) {
        val b = payload.readByte()
        val maskByte = ((maskKey ushr shift) and 0xFF).toByte()
        payload.position(payload.position() - 1)
        payload.writeByte((b.toInt() xor maskByte.toInt()).toByte())
        shift = (shift - 8) and 31 // Wrap around: 24 -> 16 -> 8 -> 0 -> 24...
    }
}
// Read mask as Int directly from buffer instead of ByteArray
val maskKey = buffer.readInt() // Not buffer.readBytes(4)
Why this is faster:
- Int mask avoids array indexing and bounds checks
- Expanding Int to Long is a single bitwise operation
- Processing 8 bytes per iteration instead of 1
- Remaining 0-7 bytes use bit shifts on the Int mask (no array access)
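The mask expansion is easy to sanity-check in plain Kotlin, independent of any buffer types (expandMask is our own helper, mirroring the expression in maskPayload above):

```kotlin
// Standalone check of the Int -> Long mask expansion used above.
fun expandMask(maskKey: Int): Long =
    (maskKey.toLong() and 0xFFFFFFFFL) or
        ((maskKey.toLong() and 0xFFFFFFFFL) shl 32)

fun main() {
    val expanded = expandMask(0x11223344)
    println(expanded.toString(16)) // 1122334411223344
}
```

The 4-byte pattern is duplicated into both halves of the Long, so XOR-ing a Long of payload applies the same repeating 4-byte mask as byte-by-byte masking would.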
General Pattern: Largest to Smallest
Process data using the largest primitive that fits, then fall back to smaller ones:
fun processBuffer(buffer: ReadBuffer) {
    // Process 8 bytes at a time
    while (buffer.remaining() >= 8) {
        val value = buffer.readLong()
        process8Bytes(value)
    }
    // Process 4 bytes at a time
    while (buffer.remaining() >= 4) {
        val value = buffer.readInt()
        process4Bytes(value)
    }
    // Process remaining bytes
    while (buffer.remaining() > 0) {
        val value = buffer.readByte()
        process1Byte(value)
    }
}
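The same pattern works anywhere, not just with PlatformBuffer. Here is a self-contained JVM sketch that XOR-folds a ByteArray 8 bytes at a time, then 4, then 1 (xorChecksum is our own example function, not part of the buffer library), and checks it against the naive byte-by-byte result:

```kotlin
import java.nio.ByteBuffer

// Largest-to-smallest: consume Longs, then Ints, then Bytes.
fun xorChecksum(bytes: ByteArray): Byte {
    val buf = ByteBuffer.wrap(bytes)
    var acc = 0L
    while (buf.remaining() >= 8) acc = acc xor buf.getLong()
    while (buf.remaining() >= 4) acc = acc xor (buf.getInt().toLong() and 0xFFFFFFFFL)
    while (buf.remaining() > 0) acc = acc xor (buf.get().toLong() and 0xFF)
    // Fold the 8 byte lanes of the accumulator down to a single byte
    acc = acc xor (acc ushr 32)
    acc = acc xor (acc ushr 16)
    acc = acc xor (acc ushr 8)
    return (acc and 0xFF).toByte()
}

fun main() {
    val data = ByteArray(13) { it.toByte() } // 13 bytes: one Long, one Int, one Byte
    val naive = data.fold(0) { a, b -> a xor b.toInt() }.toByte()
    println(xorChecksum(data) == naive) // prints true
}
```

XOR works lane-by-lane, so the wide accumulator can be folded back down at the end; the per-iteration loop overhead drops by roughly 8x on the bulk of the data.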
Built-in xorMask() (Fastest)
The buffer library provides a SIMD-optimized xorMask() method that eliminates all the overhead above:
// Best: built-in SIMD-optimized XOR mask (36x faster on Native)
fun maskPayload(payload: PlatformBuffer, maskKey: Int) {
    payload.xorMask(maskKey) // SIMD-accelerated on Native, uses Long ops on JVM
}
This uses platform-specific optimizations:
- Native (Apple/Linux): C cinterop functions auto-vectorized to NEON/SSE2 by Clang
- JVM: Long-based XOR with hardware byte swapping
- JS: Int32 DataView operations (native to V8)
SIMD-Accelerated Bulk Operations
On native platforms (Apple ARM64, Linux x86_64), Direct buffers use SIMD-optimized C functions that Clang auto-vectorizes to NEON or SSE2/AVX2 instructions.
macOS ARM64 Benchmark Results (64KB buffers)
Comparison uses the same Direct buffer type — "Baseline" is the old Kotlin-only implementation
(Long-based reads/writes via getLong/set) that was used before the SIMD overrides:
| Operation | SIMD | Baseline (Kotlin-only) | Speedup |
|---|---|---|---|
| xorMask | 635K ops/s | 4.3K ops/s | 146x |
| contentEquals | 626K ops/s | 8.9K ops/s | 70x |
| fill | 944K ops/s | 17.2K ops/s | 55x |
| mismatch | 351K ops/s | 9.0K ops/s | 39x |
| indexOf(Int) | 16.1M ops/s | 1.1M ops/s | 14x |
| indexOf(Long) | 16.2M ops/s | 1.2M ops/s | 14x |
| indexOf(Byte) | 42.4M ops/s | 3.9M ops/s | 11x |
| indexOf(Int, aligned) | 36.9M ops/s | — | — |
| indexOf(Long, aligned) | 43.8M ops/s | — | — |
| bufferCopy | 939K ops/s | — | — |
Key takeaways:
- Use AllocationZone.Direct on native platforms for bulk operations (11-146x faster)
- xorMask() gains the most because SIMD avoids byte-order swapping overhead
- The aligned flag enables even faster SIMD scanning when data alignment is known
Running Benchmarks
# All platforms
./gradlew bulkBenchmark
# Platform-specific
./gradlew macosArm64BenchmarkBulkBenchmark
./gradlew jvmBenchmarkBulkBenchmark
./gradlew jsBenchmarkBulkBenchmark
Bulk Operations
Multi-byte operations are faster than byte-by-byte:
// Slow: byte-by-byte
for (b in bytes) {
    buffer.writeByte(b)
}

// Fast: bulk write
buffer.writeBytes(bytes)

// Also fast: buffer-to-buffer
destBuffer.write(sourceBuffer)
Platform-Specific Tips
JVM
- Use Direct for NIO channel I/O
- Pool Direct buffers (allocation is slow)
- Heap buffers require copying for native I/O
Android
- Use SharedMemory for IPC
- Pool camera/video frame buffers
- Watch memory pressure on low-end devices
JavaScript
- Serve cross-origin isolation headers (COOP/COEP) so SharedArrayBuffer is available
- Pool for WebSocket message handling
- Batch operations to reduce JS interop
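For reference, the browser requirement behind the first tip is cross-origin isolation: SharedArrayBuffer is only exposed when the page is served with these two response headers (a web-platform rule, not something this library controls):

```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```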
Native (Linux/Apple)
- Use Direct buffers for SIMD-accelerated bulk operations (11-146x faster; see benchmarks above)
- Buffer pooling is critical (avoid GC pressure from Kotlin/Native)
- xorMask(), contentEquals(), mismatch(), indexOf() all use C SIMD functions
- Use aligned=true on indexOf() when data alignment is known (up to 24x faster)
WASM
- Use Direct for JS interop - LinearBuffer shares memory with JavaScript
- Use Heap for compute workloads - ByteArrayBuffer has no memory limits
- LinearBuffer is faster - 25% faster single ops, 2x faster bulk ops
- Pre-allocated memory - 256MB limit due to optimizer bug workaround
// JS interop: use Direct (LinearBuffer)
val interopBuffer = PlatformBuffer.allocate(1024, AllocationZone.Direct)
// Compute workloads: use Heap (ByteArrayBuffer)
val computeBuffer = PlatformBuffer.allocate(1024, AllocationZone.Heap)
WASM benchmark results:
| Operation | LinearBuffer | ByteArrayBuffer | Speedup |
|---|---|---|---|
| Single int ops | 91.1M ops/s | 73.2M ops/s | 1.24x |
| Bulk ops (256 ints) | 2.0M ops/s | 967K ops/s | 2.04x |
Profiling Tips
Measure Allocation Rate
val stats = pool.stats()
val hitRate = stats.poolHits.toDouble() / (stats.poolHits + stats.poolMisses)
println("Pool hit rate: ${hitRate * 100}%")
// Target: >90% hit rate
Benchmark Critical Paths
For accurate performance measurements, use proper benchmarking frameworks:
- JVM/Multiplatform: kotlinx-benchmark
- Android: Jetpack Benchmark
// Using kotlinx-benchmark
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
class BufferBenchmark {
    private lateinit var pool: BufferPool

    @Setup
    fun setup() {
        pool = BufferPool(defaultBufferSize = 1024)
    }

    @Benchmark
    fun pooledBufferReadWrite() {
        pool.withBuffer(1024) { buffer ->
            buffer.writeInt(42)
            buffer.resetForRead()
            buffer.readInt()
        }
    }
}
measureTimeMillis in a loop doesn't account for JVM warmup, GC pauses, or inlining. Use a proper benchmarking library for reliable results.
Common Anti-Patterns
Unnecessary Copies
// Anti-pattern: copy to ByteArray then wrap
val bytes = buffer.readByteArray(length)
val newBuffer = PlatformBuffer.wrap(bytes)

// Better: slice directly
val slice = buffer.readBytes(length)
Allocation in Loops
// Anti-pattern
repeat(1000) {
    val buffer = PlatformBuffer.allocate(1024)
    // ...
}

// Better: pool or reuse
val buffer = PlatformBuffer.allocate(1024)
repeat(1000) {
    buffer.resetForWrite()
    // ...
}
Ignoring Position/Limit
// Anti-pattern: reading beyond limit
while (buffer.position() < buffer.capacity()) {
    buffer.readByte() // May read garbage!
}

// Correct: respect limit
while (buffer.remaining() > 0) {
    buffer.readByte()
}
Summary
| Optimization | Impact | Effort |
|---|---|---|
| Buffer pooling | High | Low |
| SIMD bulk ops (Native Direct) | High | Low |
| Use largest primitives | High | Low |
| Zero-copy slicing | High | Low |
| xorMask() for WebSocket | High | Low |
| Bulk operations | Medium | Low |
| Direct allocation | Medium | Low |
| indexOf(aligned=true) | Medium | Low |