Go Calling Lua: Performance Stress Testing and Tuning

Translation · wujiuye · 2022-12-29

This article is a translation of the original text, which can be found at the following link: https://www.wujiuye.com/article/c1b3d030fd764a48bd5f457bdddd109d

Author: wujiuye
Link: https://www.wujiuye.com/article/c1b3d030fd764a48bd5f457bdddd109d
Source: Wujiuye's Web Diary (吴就业的网络日记)
This article is the blogger's original work and may not be reproduced without the blogger's permission.

Purpose

  1. Tune the performance of invoking Lua scripts through the gopher-lua library: decide whether to pool virtual machines and, if so, which pooling strategy to use.
  2. Output a performance test report to users (developers).

Test Description

Concurrent test cases are written with Go's built-in benchmarking support. To rule out the performance impact of the script itself, the script implements only trivial logic and is pre-compiled. By varying the virtual machine pool strategy, the CPU count, and the level of parallelism, the benchmark reports the average time per Lua script call and the memory consumed.

Test Script

function helloLua(n)
    goSayHello("hello", "my name is lua") -- Call a Go method
    return n, 100000
end
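
The script calls goSayHello, a function exported from the Go side. The registration code is not included in the article; with gopher-lua it could be registered roughly as in the sketch below (the function body here is only illustrative):

import (
   "fmt"

   lua "github.com/yuin/gopher-lua"
)

// registerGoFuncs exposes Go functions as Lua globals in the given LState.
func registerGoFuncs(L *lua.LState) {
   L.SetGlobal("goSayHello", L.NewFunction(func(L *lua.LState) int {
      a := L.CheckString(1) // first argument passed from Lua
      b := L.CheckString(2) // second argument passed from Lua
      fmt.Println(a, b)
      return 0 // number of values returned to Lua
   }))
}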

Benchmark Code

var luaMng = NewLuaPreCompileManager(NewLStatePool)

func init() {
   // script is a string constant holding the Lua source shown above
   err := luaMng.CompileLua("test.lua", script)
   if err != nil {
      panic(err)
   }
}

func invokeLua() {
   result, err := luaMng.InvokeScriptFunc("test.lua", "helloLua", 30*time.Second, 2, 1)
   if err != nil {
      panic(err)
   }
   fmt.Println(result[0], result[1])
}

// go test -bench='Parallel$' -cpu=2 -benchtime=5s -count=3 -benchmem
func BenchmarkLuaPreCompileManager_InvokeScriptFunc_Parallel(b *testing.B) {
   b.ReportAllocs()
   b.ResetTimer()
   b.SetParallelism(2000)
   b.RunParallel(func(pb *testing.PB) {
      for pb.Next() {
         invokeLua()
      }
   })
}
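
NewLuaPreCompileManager, CompileLua, and InvokeScriptFunc are the author's own wrappers and their implementation is not included in the article. For reference, pre-compilation with gopher-lua generally means parsing and compiling the script once into a *lua.FunctionProto and loading that proto into each LState, roughly as in this sketch (the function names here are illustrative):

import (
   "os"

   lua "github.com/yuin/gopher-lua"
   "github.com/yuin/gopher-lua/parse"
)

// compileLua parses and compiles a Lua file once; the resulting FunctionProto
// can be shared by any number of LStates.
func compileLua(path string) (*lua.FunctionProto, error) {
   f, err := os.Open(path)
   if err != nil {
      return nil, err
   }
   defer f.Close()
   chunk, err := parse.Parse(f, path)
   if err != nil {
      return nil, err
   }
   return lua.Compile(chunk, path)
}

// loadProto pushes the pre-compiled chunk into an LState and runs it,
// which defines helloLua as a global in that state.
func loadProto(L *lua.LState, proto *lua.FunctionProto) error {
   L.Push(L.NewFunctionFromProto(proto))
   return L.PCall(0, lua.MultRet, nil)
}

// callHelloLua invokes the Lua function and reads its two return values.
func callHelloLua(L *lua.LState, n int) (lua.LValue, lua.LValue, error) {
   if err := L.CallByParam(lua.P{
      Fn:      L.GetGlobal("helloLua"),
      NRet:    2,
      Protect: true,
   }, lua.LNumber(n)); err != nil {
      return nil, nil, err
   }
   second, first := L.Get(-1), L.Get(-2) // return values sit on the stack, last one on top
   L.Pop(2)
   return first, second, nil
}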

Virtual Machine Configuration:

return NewLState(lua.Options{
   CallStackSize:       32,   // Maximum call stack depth, i.e. at most 32 nested function calls
   MinimizeStackMemory: true, // Let the call stack grow and shrink on demand, up to CallStackSize
})

Report Data

A few concepts:

Virtual Machine Pool | CPUs | Duration | Count | Parallelism (goroutines) | Avg Time (ms/op) | Memory/op | Peak CPU | Peak Memory
---|---|---|---|---|---|---|---|---
No Pooling | 2 | 10s | 5 | 1000 | 0.18455 | 159.5KB | 190% | 476.9M
No Pooling | 2 | 10s | 5 | 2000 | 0.168622 | 159.5KB | 191% | 935.8M
No Pooling | 2 | 10s | 5 | 4000 | 0.175112 | 159.6KB | 190% | 1.82G
Pooling, Variable Size | 2 | 10s | 5 | 1000 | 0.065165 | 6.53KB | 44% | 291M
Pooling, Variable Size | 2 | 10s | 5 | 2000 | 0.073247 | 6.50KB | 50% | 560M
Pooling, Variable Size | 2 | 10s | 5 | 4000 | 0.077863 | 6.47KB | 52% | 1.08G
Fixed Core 1000 + Unlimited Non-core | 2 | 10s | 5 | 4000 | 0.046725 | 7.4KB | 90% | 883M
Fixed Core 2000 + Unlimited Non-core | 2 | 10s | 5 | 4000 | 0.045968 | 6.8KB | 66% | 962M
Fixed Core 1000 + Blocking Wait | 2 | 10s | 5 | 4000 | 0.048416 | 6.52KB | 70% | 326M
Fixed Core 2000 + Blocking Wait | 2 | 10s | 5 | 4000 | 0.04729 | 6.52KB | 72% | 652M
Fixed Core 1000 + Blocking Wait | 4 | 10s | 5 | 4000 | 0.046915 | 6.52KB | 100% | 348M
Fixed Core 2000 + Blocking Wait | 4 | 10s | 5 | 4000 | 0.047518 | 6.52KB | 102% | 649M
Fixed Core 1000 + Non-core 2000 + Blocking Wait | 2 | 10s | 5 | 4000 | 0.048806 | 7.2KB | 84% | 682M

Report Analysis

The time taken for a single script call is less than 0.2ms.

Comparison between Pooling and Non-Pooling:

  1. There is a significant difference in CPU usage; creating virtual machines is very CPU-intensive.
  2. The average time per script call differs by roughly 0.1 ms; pooling improves performance.
  3. At the same level of parallelism, the total memory usage is significantly different; pooling can reduce memory consumption.
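
The "Pooling, Variable Size" rows reuse LStates instead of creating one per call. A minimal variable-size pool in the spirit of the gopher-lua documentation might look like the sketch below (simplified; it omits the pre-compilation, timeouts, and sizing logic of the author's pool):

import (
   "sync"

   lua "github.com/yuin/gopher-lua"
)

// lStatePool is a variable-size pool: Get reuses an idle LState when one is
// available and creates a new one otherwise; Put returns a state for reuse.
type lStatePool struct {
   mu    sync.Mutex
   saved []*lua.LState
}

func (p *lStatePool) Get() *lua.LState {
   p.mu.Lock()
   defer p.mu.Unlock()
   if n := len(p.saved); n > 0 {
      L := p.saved[n-1]
      p.saved = p.saved[:n-1]
      return L
   }
   // Pool is empty: pay the cost of creating a new virtual machine.
   return lua.NewState(lua.Options{CallStackSize: 32, MinimizeStackMemory: true})
}

func (p *lStatePool) Put(L *lua.LState) {
   p.mu.Lock()
   defer p.mu.Unlock()
   p.saved = append(p.saved, L)
}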

Pooling without Fixed Pool Size:

  1. A parallelism of 1000 requires about 291M of memory, and each doubling of the parallelism roughly doubles the memory usage.
  2. The higher the level of parallelism, the higher the average time taken.

Pooling with Fixed Core Count + Unlimited Non-core Count: At the same level of parallelism, the lower the core count, the higher the CPU usage and the average time taken.

Pooling with Fixed Core Count + Blocking Wait: At the same level of parallelism, the lower the core count, the higher the average time taken, but the total memory usage is lower.

Pooling with Fixed Core Count + Unlimited Non-core Count vs Fixed Core Count + Blocking Wait: with the same core count of 2000 and a parallelism of 4000, blocking wait uses less memory than unlimited non-core, but its average time per call is slightly higher.

Optimization Plan

  1. Adopt pooling to reduce CPU consumption and memory usage and to lower the average call time.
  2. Cap the maximum size of the pool to prevent OOM under sudden traffic spikes.
  3. Optimize pool performance with a fixed core count + maximum non-core count + blocking wait strategy, sizing the core and non-core limits from the average concurrency and memory budget of a single process (see the sketch after this list):
    1. Performance is best when the concurrency stays below the core count;
    2. When the concurrency is within the non-core limit, the extra non-core states reduce blocking waits;
    3. When the concurrency exceeds the core plus non-core limit, blocking wait caps the maximum memory usage.
  4. To keep the pool from holding memory after script rules have been unloaded, implement an idle-check and release mechanism.
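
As a reference for item 3, here is a minimal sketch of a fixed core + maximum non-core + blocking wait pool. It is illustrative only and assumes names (boundedPool, newBoundedPool) that do not appear in the article: core states are created up front and reused, up to a fixed number of temporary non-core states may be created under load, and once both limits are reached callers block until a core state is returned.

import (
   "sync/atomic"

   lua "github.com/yuin/gopher-lua"
)

// boundedPool implements the "fixed core + max non-core + blocking wait" strategy.
type boundedPool struct {
   core     chan *lua.LState   // pre-created core states, reused for the pool's lifetime
   nonCore  int32              // number of temporary non-core states currently alive
   maxNon   int32              // upper bound on non-core states
   newState func() *lua.LState // factory, e.g. wrapping lua.NewState with the options above
}

func newBoundedPool(coreSize, maxNonCore int, factory func() *lua.LState) *boundedPool {
   p := &boundedPool{
      core:     make(chan *lua.LState, coreSize),
      maxNon:   int32(maxNonCore),
      newState: factory,
   }
   for i := 0; i < coreSize; i++ {
      p.core <- factory()
   }
   return p
}

// Get returns an idle core state if available, otherwise creates a non-core state
// while under the limit, otherwise blocks until a core state is put back.
func (p *boundedPool) Get() (*lua.LState, bool) {
   select {
   case L := <-p.core:
      return L, true // reuse an idle core state
   default:
   }
   if atomic.AddInt32(&p.nonCore, 1) <= p.maxNon {
      return p.newState(), false // create a temporary non-core state
   }
   atomic.AddInt32(&p.nonCore, -1) // over the limit: undo the reservation
   return <-p.core, true           // blocking wait caps total states at coreSize+maxNonCore
}

// Put returns a state to the pool: core states are kept for reuse,
// non-core states are closed so their memory is released immediately.
func (p *boundedPool) Put(L *lua.LState, isCore bool) {
   if isCore {
      p.core <- L
      return
   }
   atomic.AddInt32(&p.nonCore, -1)
   L.Close()
}

The idle-check mechanism of item 4 could be layered on top of such a pool, e.g. by periodically closing core states that have not been used for a while and re-creating them on demand.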
