Performance Guide
Tips for maximizing Buffer performance across platforms.
Key Principles
- Avoid allocations in hot paths - use buffer pools
- Prefer zero-copy operations - slicing over copying
- Use largest primitives - readLong() over 8x readByte()
- Use bulk operations - multi-byte reads/writes
- Choose the right allocation zone - Direct for I/O
Buffer Pooling
Allocation is expensive. Pool buffers in hot paths:
// Bad: allocate per request
fun handleRequest(data: ByteArray) {
    val buffer = PlatformBuffer.allocate(data.size) // Allocation!
    buffer.writeBytes(data)
    process(buffer)
}

// Good: pool buffers
val pool = BufferPool(defaultBufferSize = 8192)

fun handleRequest(data: ByteArray) {
    pool.withBuffer(data.size) { buffer ->
        buffer.writeBytes(data)
        process(buffer)
    } // Returned to pool
}
Zero-Copy Operations
Slicing
Create views without copying:
// Zero-copy: slice shares memory
val slice = buffer.slice()
// Copy: creates new buffer
val copy = PlatformBuffer.allocate(buffer.remaining())
copy.write(buffer)
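To see why slicing is cheap, here is a JVM-only sketch using java.nio.ByteBuffer directly (that Direct PlatformBuffers wrap ByteBuffer on the JVM is our assumption for illustration): a slice is a new view over the same backing memory, so no bytes move.

```kotlin
import java.nio.ByteBuffer

fun main() {
    val buffer = ByteBuffer.allocate(8)
    val slice = buffer.slice() // view over the same backing array, no copy

    // A write through the slice is visible through the original buffer:
    slice.put(0, 42.toByte())
    println(buffer.get(0)) // prints 42
}
```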
StreamProcessor Zero-Copy
StreamProcessor returns slices when data is contiguous:
// Zero-copy when data is in single chunk
val payload = processor.readBuffer(length)
// Returns slice if possible, copies only when spanning chunks
Use Largest Primitives
Reading/writing larger primitives is significantly faster than byte-by-byte operations. The CPU processes 8 bytes in a single Long operation vs 8 separate Byte operations.
WebSocket XOR Masking Example
WebSocket requires XOR masking of payload data. The mask is always exactly 4 bytes, so use Int instead of ByteArray:
// Slow: XOR each byte individually with ByteArray mask
fun maskPayloadSlow(payload: PlatformBuffer, maskKey: ByteArray) {
    var i = 0
    while (payload.remaining() > 0) {
        val b = payload.readByte()
        payload.position(payload.position() - 1)
        payload.writeByte((b.toInt() xor maskKey[i % 4].toInt()).toByte())
        i++
    }
}

// Fast: Use Int for mask, expand to Long, process 8 bytes at a time
fun maskPayload(payload: PlatformBuffer, maskKey: Int) {
    // Expand 4-byte Int mask to 8-byte Long mask by duplicating
    val maskLong = (maskKey.toLong() and 0xFFFFFFFFL) or
        ((maskKey.toLong() and 0xFFFFFFFFL) shl 32)
    // Process 8 bytes at a time
    while (payload.remaining() >= 8) {
        val value = payload.readLong()
        payload.position(payload.position() - 8)
        payload.writeLong(value xor maskLong)
    }
    // Handle remaining 0-7 bytes
    var shift = 24
    while (payload.remaining() > 0) {
        val b = payload.readByte()
        val maskByte = ((maskKey ushr shift) and 0xFF).toByte()
        payload.position(payload.position() - 1)
        payload.writeByte((b.toInt() xor maskByte.toInt()).toByte())
        shift = (shift - 8) and 31 // Wrap around: 24 -> 16 -> 8 -> 0 -> 24...
    }
}
// Read mask as Int directly from buffer instead of ByteArray
val maskKey = buffer.readInt() // Not buffer.readBytes(4)
Why this is faster:
- Int mask avoids array indexing and bounds checks
- Expanding Int to Long is a single bitwise operation
- Processing 8 bytes per iteration instead of 1
- Remaining 0-7 bytes use bit shifts on the Int mask (no array access)
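The mask expansion is easy to sanity-check in plain Kotlin, independent of any buffer types (expandMask is our own helper, mirroring the expression in maskPayload above):

```kotlin
// Standalone check of the Int -> Long mask expansion used above.
fun expandMask(maskKey: Int): Long =
    (maskKey.toLong() and 0xFFFFFFFFL) or
        ((maskKey.toLong() and 0xFFFFFFFFL) shl 32)

fun main() {
    val expanded = expandMask(0x11223344)
    println(expanded.toString(16)) // 1122334411223344
}
```

The 4-byte pattern is duplicated into both halves of the Long, so XOR-ing a Long of payload applies the same repeating 4-byte mask as byte-by-byte masking would.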
General Pattern: Largest to Smallest
Process data using the largest primitive that fits, then fall back to smaller ones:
fun processBuffer(buffer: ReadBuffer) {
    // Process 8 bytes at a time
    while (buffer.remaining() >= 8) {
        val value = buffer.readLong()
        process8Bytes(value)
    }
    // Process 4 bytes at a time
    while (buffer.remaining() >= 4) {
        val value = buffer.readInt()
        process4Bytes(value)
    }
    // Process remaining bytes
    while (buffer.remaining() > 0) {
        val value = buffer.readByte()
        process1Byte(value)
    }
}
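The same pattern works anywhere, not just with PlatformBuffer. Here is a self-contained JVM sketch that XOR-folds a ByteArray 8 bytes at a time, then 4, then 1 (xorChecksum is our own example function, not part of the buffer library), and checks it against the naive byte-by-byte result:

```kotlin
import java.nio.ByteBuffer

// Largest-to-smallest: consume Longs, then Ints, then Bytes.
fun xorChecksum(bytes: ByteArray): Byte {
    val buf = ByteBuffer.wrap(bytes)
    var acc = 0L
    while (buf.remaining() >= 8) acc = acc xor buf.getLong()
    while (buf.remaining() >= 4) acc = acc xor (buf.getInt().toLong() and 0xFFFFFFFFL)
    while (buf.remaining() > 0) acc = acc xor (buf.get().toLong() and 0xFF)
    // Fold the 8 byte lanes of the accumulator down to a single byte
    acc = acc xor (acc ushr 32)
    acc = acc xor (acc ushr 16)
    acc = acc xor (acc ushr 8)
    return (acc and 0xFF).toByte()
}

fun main() {
    val data = ByteArray(13) { it.toByte() } // 13 bytes: one Long, one Int, one Byte
    val naive = data.fold(0) { a, b -> a xor b.toInt() }.toByte()
    println(xorChecksum(data) == naive) // prints true
}
```

XOR works lane-by-lane, so the wide accumulator can be folded back down at the end; the per-iteration loop overhead drops by roughly 8x on the bulk of the data.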
Built-in xorMask() (Fastest)
The buffer library provides a SIMD-optimized xorMask() method that eliminates all the overhead above:
// Best: built-in SIMD-optimized XOR mask (36x faster on Native)
fun maskPayload(payload: PlatformBuffer, maskKey: Int) {
    payload.xorMask(maskKey) // SIMD-accelerated on Native, uses Long ops on JVM
}
This uses platform-specific optimizations:
- Native (Apple/Linux): C cinterop functions auto-vectorized to NEON/SSE2 by Clang
- JVM: Long-based XOR with hardware byte swapping
- JS: Int32 DataView operations (native to V8)
SIMD-Accelerated Bulk Operations
On native platforms (Apple ARM64, Linux x86_64), Direct buffers use SIMD-optimized C functions that Clang auto-vectorizes to NEON or SSE2/AVX2 instructions.
macOS ARM64 Benchmark Results (64KB buffers)
Comparison uses the same Direct buffer type — "Baseline" is the old Kotlin-only implementation
(Long-based reads/writes via getLong/set) that was used before the SIMD overrides:
| Operation | SIMD | Baseline (Kotlin-only) | Speedup |
|---|---|---|---|
| xorMask | 635K ops/s | 4.3K ops/s | 146x |
| contentEquals | 626K ops/s | 8.9K ops/s | 70x |
| fill | 944K ops/s | 17.2K ops/s | 55x |
| mismatch | 351K ops/s | 9.0K ops/s | 39x |
| indexOf(Int) | 16.1M ops/s | 1.1M ops/s | 14x |
| indexOf(Long) | 16.2M ops/s | 1.2M ops/s | 14x |
| indexOf(Byte) | 42.4M ops/s | 3.9M ops/s | 11x |
| indexOf(Int, aligned) | 36.9M ops/s | — | — |
| indexOf(Long, aligned) | 43.8M ops/s | — | — |
| bufferCopy | 939K ops/s | — | — |
Key takeaways:
- Use AllocationZone.Direct on native platforms for bulk operations (11-146x faster)
- xorMask() gains the most because SIMD avoids byte-order swapping overhead
- The aligned flag enables even faster SIMD scanning when data alignment is known
Running Benchmarks
# All platforms
./gradlew bulkBenchmark
# Platform-specific
./gradlew macosArm64BenchmarkBulkBenchmark
./gradlew jvmBenchmarkBulkBenchmark
./gradlew jsBenchmarkBulkBenchmark
Bulk Operations
Multi-byte operations are faster than byte-by-byte:
// Slow: byte-by-byte
for (b in bytes) {
    buffer.writeByte(b)
}

// Fast: bulk write
buffer.writeBytes(bytes)

// Also fast: buffer-to-buffer
destBuffer.write(sourceBuffer)
Platform-Specific Tips
JVM
- Use Direct for NIO channel I/O
- Pool Direct buffers (allocation is slow)
- Heap buffers require copying for native I/O
Android
- Use SharedMemory for IPC
- Pool camera/video frame buffers
- Watch memory pressure on low-end devices
JavaScript
- Serve cross-origin isolation headers (COOP/COEP) so SharedArrayBuffer is available
- Pool for WebSocket message handling
- Batch operations to reduce JS interop
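For reference, the browser requirement behind the first tip is cross-origin isolation: SharedArrayBuffer is only exposed when the page is served with these two response headers (a web-platform rule, not something this library controls):

```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```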
Native (Linux/Apple)
- Use Direct buffers for SIMD-accelerated bulk operations (11-146x faster; see benchmarks above)
- Buffer pooling is critical (avoid GC pressure from Kotlin/Native)
- xorMask(), contentEquals(), mismatch(), indexOf() all use C SIMD functions
- Use aligned=true on indexOf() when data alignment is known (up to 24x faster)
WASM
- Use Direct for JS interop - LinearBuffer shares memory with JavaScript
- Use Heap for compute workloads - ByteArrayBuffer has no memory limits
- LinearBuffer is faster - 25% faster single ops, 2x faster bulk ops
- Pre-allocated memory - 256MB limit due to optimizer bug workaround
// JS interop: use Direct (LinearBuffer)
val interopBuffer = PlatformBuffer.allocate(1024, AllocationZone.Direct)
// Compute workloads: use Heap (ByteArrayBuffer)
val computeBuffer = PlatformBuffer.allocate(1024, AllocationZone.Heap)
WASM benchmark results:
| Operation | LinearBuffer | ByteArrayBuffer | Speedup |
|---|---|---|---|
| Single int ops | 91.1M ops/s | 73.2M ops/s | 1.24x |
| Bulk ops (256 ints) | 2.0M ops/s | 967K ops/s | 2.04x |
Profiling Tips
Measure Allocation Rate
val stats = pool.stats()
val hitRate = stats.poolHits.toDouble() / (stats.poolHits + stats.poolMisses)
println("Pool hit rate: ${hitRate * 100}%")
// Target: >90% hit rate
Benchmark Critical Paths
For accurate performance measurements, use proper benchmarking frameworks:
- JVM/Multiplatform: kotlinx-benchmark
- Android: Jetpack Benchmark
// Using kotlinx-benchmark
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
class BufferBenchmark {
    private lateinit var pool: BufferPool

    @Setup
    fun setup() {
        pool = BufferPool(defaultBufferSize = 1024)
    }

    @Benchmark
    fun pooledBufferReadWrite() {
        pool.withBuffer(1024) { buffer ->
            buffer.writeInt(42)
            buffer.resetForRead()
            buffer.readInt()
        }
    }
}
measureTimeMillis in a loop doesn't account for JVM warmup, GC pauses, or inlining. Use a proper benchmarking library for reliable results.
Common Anti-Patterns
Unnecessary Copies
// Anti-pattern: copy to ByteArray then wrap
val bytes = buffer.readByteArray(length)
val newBuffer = PlatformBuffer.wrap(bytes)

// Better: slice directly
val slice = buffer.readBytes(length)
Allocation in Loops
// Anti-pattern
repeat(1000) {
    val buffer = PlatformBuffer.allocate(1024)
    // ...
}

// Better: pool or reuse
val buffer = PlatformBuffer.allocate(1024)
repeat(1000) {
    buffer.resetForWrite()
    // ...
}
Ignoring Position/Limit
// Anti-pattern: reading beyond limit
while (buffer.position() < buffer.capacity()) {
    buffer.readByte() // May read garbage!
}

// Correct: respect limit
while (buffer.remaining() > 0) {
    buffer.readByte()
}
Summary
| Optimization | Impact | Effort |
|---|---|---|
| Buffer pooling | High | Low |
| SIMD bulk ops (Native Direct) | High | Low |
| Use largest primitives | High | Low |
| Zero-copy slicing | High | Low |
| xorMask() for WebSocket | High | Low |
| Bulk operations | Medium | Low |
| Direct allocation | Medium | Low |
| indexOf(aligned=true) | Medium | Low |