Performance stress testing and tuning of calling Lua from Go

Translation · wujiuye · 2022-12-29

This article is a translation of the original text, which can be found at the following link: https://www.wujiuye.com/article/c1b3d030fd764a48bd5f457bdddd109d

Source: 吴就业的网络日记 (Wujiuye's web diary)
This article is an original work by the blogger and may not be reproduced without the blogger's permission.

Purpose

  1. Tune the performance of invoking Lua scripts via the gopher-lua library: decide whether to pool virtual machines and, if so, which pooling strategy to use.
  2. Output a performance test report to users (developers).

Test Description

Concurrent test cases are written on top of Go's benchmarking support. To rule out the performance impact of the script itself, the script implements only trivial logic and is pre-compiled. By varying the virtual machine pool strategy, the CPU count, and the parallelism, we record the average time per Lua script call and the memory it consumes.

Test Script

function helloLua(n)
    goSayHello("hello", "my name is lua") -- Call a Go method
    return n, 100000
end

Benchmark Code

var luaMng = NewLuaPreCompileManager(NewLStatePool)

func init() {
   err := luaMng.CompileLua("test.lua", script)
   if err != nil {
      panic(err)
   }
}

func invokeLua() {
   result, err := luaMng.InvokeScriptFunc("test.lua", "helloLua", 30*time.Second, 2, 1)
   if err != nil {
      panic(err)
   }
   fmt.Println(result[0], result[1])
}

// go test -bench='Parallel$' -cpu=2 -benchtime=5s -count=3 -benchmem
func BenchmarkLuaPreCompileManager_InvokeScriptFunc_Parallel(b *testing.B) {
   b.ReportAllocs()
   b.ResetTimer()
   b.SetParallelism(2000)
   b.RunParallel(func(pb *testing.PB) {
      for pb.Next() {
         invokeLua()
      }
   })
}

Virtual Machine Configuration:

return NewLState(lua.Options{
   CallStackSize:       32,   // Maximum call stack depth: at most 32 nested calls
   MinimizeStackMemory: true, // Let the call stack grow and shrink as needed, up to CallStackSize
})

Report Data

A few concepts:

| Virtual machine pool | Benchmark CPU count | Duration | Runs | Parallelism (goroutines) | ms/op | Memory/op | Peak CPU | Peak memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No pooling | 2 | 10s | 5 | 1000 | 0.18455 | 159.5KB | 190% | 476.9M |
| No pooling | 2 | 10s | 5 | 2000 | 0.168622 | 159.5KB | 191% | 935.8M |
| No pooling | 2 | 10s | 5 | 4000 | 0.175112 | 159.6KB | 190% | 1.82G |
| Pooling, variable size | 2 | 10s | 5 | 1000 | 0.065165 | 6.53KB | 44% | 291M |
| Pooling, variable size | 2 | 10s | 5 | 2000 | 0.073247 | 6.50KB | 50% | 560M |
| Pooling, variable size | 2 | 10s | 5 | 4000 | 0.077863 | 6.47KB | 52% | 1.08G |
| Fixed core 1000 + unlimited non-core | 2 | 10s | 5 | 4000 | 0.046725 | 7.4KB | 90% | 883M |
| Fixed core 2000 + unlimited non-core | 2 | 10s | 5 | 4000 | 0.045968 | 6.8KB | 66% | 962M |
| Fixed core 1000 + blocking wait | 2 | 10s | 5 | 4000 | 0.048416 | 6.52KB | 70% | 326M |
| Fixed core 2000 + blocking wait | 2 | 10s | 5 | 4000 | 0.04729 | 6.52KB | 72% | 652M |
| Fixed core 1000 + blocking wait | 4 | 10s | 5 | 4000 | 0.046915 | 6.52KB | 100% | 348M |
| Fixed core 2000 + blocking wait | 4 | 10s | 5 | 4000 | 0.047518 | 6.52KB | 102% | 649M |
| Fixed core 1000 + non-core 2000 + blocking wait | 2 | 10s | 5 | 4000 | 0.048806 | 7.2KB | 84% | 682M |

Report Analysis

The time taken for a single script call is less than 0.2ms.

Comparison between Pooling and Non-Pooling:

  1. There is a significant difference in CPU usage; creating virtual machines is very CPU-intensive.
  2. The average time per script call differs by about 0.1ms; pooling improves performance.
  3. At the same level of parallelism, the total memory usage is significantly different; pooling can reduce memory consumption.

Pooling with a variable (unbounded) pool size:

  1. Supporting 1000 parallel goroutines takes 291M of memory, and each time the parallelism doubles, memory usage roughly doubles as well.
  2. The higher the parallelism, the higher the average time per call.
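A variable-size pool of the kind benchmarked above can be sketched as a mutex-guarded free list (the pattern the gopher-lua documentation suggests for state pooling); here the vm type stands in for *lua.LState. Note the lack of an upper bound: a traffic spike creates one virtual machine per concurrent caller, which matches the memory growth seen in the table:

```go
package main

import (
	"fmt"
	"sync"
)

// vm stands in for *lua.LState.
type vm struct{ id int }

// variablePool reuses returned vms but has no size cap: when the
// free list is empty it simply creates another vm.
type variablePool struct {
	mu    sync.Mutex
	free  []*vm
	made  int        // how many vms have ever been created
	newVM func() *vm // factory, e.g. a configured lua.NewState
}

func (p *variablePool) Get() *vm {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.free); n > 0 {
		v := p.free[n-1]
		p.free = p.free[:n-1]
		return v
	}
	p.made++ // unbounded: a spike creates one vm per concurrent caller
	return p.newVM()
}

func (p *variablePool) Put(v *vm) {
	p.mu.Lock()
	p.free = append(p.free, v)
	p.mu.Unlock()
}

func main() {
	p := &variablePool{newVM: func() *vm { return &vm{} }}
	a := p.Get() // free list empty: creates a new vm
	p.Put(a)
	b := p.Get() // reuses a
	fmt.Println(a == b, p.made) // true 1
}
```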

Pooling with a fixed core count + unlimited non-core count: at the same parallelism, a lower core count means higher CPU usage and a higher average time per call.

Pooling with a fixed core count + blocking wait: at the same parallelism, a lower core count means a higher average time per call, but lower total memory usage.

Fixed core count + unlimited non-core count vs fixed core count + blocking wait: with the same core count of 2000 and 4000 parallel goroutines, blocking wait uses less memory than unlimited non-core, but its average time per call is slightly higher.

Optimization Plan

  1. Adopt pooling: it reduces CPU consumption and memory usage and lowers the average call time.
  2. The maximum pool size must be capped to prevent OOM under sudden traffic spikes.
  3. Consider a fixed core count + maximum non-core count + blocking wait strategy (configure the core and maximum non-core counts from the process's average concurrency and memory budget):
    1. Performance is best when concurrency stays below the core count;
    2. When concurrency is within the maximum non-core range, blocking waits are reduced;
    3. When concurrency exceeds the maximum non-core count, the maximum memory usage is bounded.
  4. To avoid the pool holding on to memory after script rules have been unloaded, implement an idle-check-and-release mechanism.