+------------------------------------------------------------------------------+
|                                                                              |
|  Dhananjaya D R                                                              |
|                                                                              |
+------------------------------------------------------------------------------+


OpenBLAS pthread Resource Exhaustion
________________________________________________________________________________

An AI/ML application running in Kubernetes had one pod failing continuously for 
3 days and 4 hours, with over 27K failed attempts, while the other pods worked 
fine.

  
The error message was
________________________________________________________________________________
OpenBLAS blas_thread_init: pthread_create failed for thread 11 of 16: Resource 
temporarily unavailable
OpenBLAS blas_thread_init: ensure that your address space and process count 
limits are big enough (ulimit -a)
OpenBLAS blas_thread_init: RLIMIT_NPROC 1048576 current, 1048576 max

The confusing and interesting part: `RLIMIT_NPROC` showed a limit of over 1 
million processes, yet thread creation was failing with "Resource temporarily 
unavailable" (EAGAIN).

  
RCA
________________________________________________________________________________
The issue wasn't hardware resources or VM limitations. It was a container-level 
resource constraint that OpenBLAS doesn't account for. OpenBLAS creates a pool 
of worker threads for performance, in this case 16. When `pthread_create` 
fails, it reports `RLIMIT_NPROC` (1M+ here), which makes the process limit look 
generous. That rlimit is not the limit that was hit, though: the container has 
much stricter limits imposed by Kubernetes resource constraints, and those are 
enforced through cgroups rather than through rlimits.
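
To see both numbers side by side from inside the pod, a small script can print 
the rlimit next to the cgroup limit. This is a minimal sketch, assuming the 
process limit is enforced by the cgroup pids controller (which counts threads 
as well as processes). The two `pids.max` paths are assumptions about where 
cgroup v2 or v1 is mounted in the container, and the helper names are made up 
for illustration.

```python
import resource
from pathlib import Path

# Candidate locations of the pids controller limit (cgroup v2 first, then v1).
# These paths are assumptions; the mount layout can differ between runtimes.
PIDS_MAX_PATHS = (
    Path("/sys/fs/cgroup/pids.max"),
    Path("/sys/fs/cgroup/pids/pids.max"),
)


def parse_pids_max(text):
    """Return the limit as an int, or None when the file says 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_pids_max(paths=PIDS_MAX_PATHS):
    """Return the first readable pids limit.

    None means either "unlimited" or "no pids.max file found".
    """
    for path in paths:
        try:
            return parse_pids_max(path.read_text())
        except (OSError, ValueError):
            continue
    return None


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print(f"RLIMIT_NPROC:    {soft} current, {hard} max")
    print(f"cgroup pids.max: {read_pids_max()}")
```

If the second number is small while the first is in the millions, the rlimit 
printed by OpenBLAS is not the one to look at.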

  
Reproduction
________________________________________________________________________________
I created a test to reproduce the exact error with a container that had limited 
resources, numpy with OpenBLAS, high thread count, and process limit 
restrictions. The reproduction setup is available at k8pthreadfail
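
Independent of that setup, the failure mode can be probed without numpy at 
all. The sketch below is not the code from the reproduction repository; it is 
a hypothetical helper that tries to start a given number of idle threads and 
reports how many the environment allowed. Run inside a container with a tight 
process limit, it would be expected to stop short of the target.

```python
import threading


def count_startable_threads(target):
    """Try to start `target` idle threads; return how many actually started."""
    release = threading.Event()
    started = []
    try:
        for _ in range(target):
            worker = threading.Thread(target=release.wait)
            try:
                worker.start()
            except RuntimeError:
                # CPython raises RuntimeError ("can't start new thread")
                # when the underlying thread creation fails.
                break
            started.append(worker)
    finally:
        # Let every idle thread finish, then wait for all of them.
        release.set()
        for worker in started:
            worker.join()
    return len(started)


if __name__ == "__main__":
    target = 16  # the thread count OpenBLAS attempted in the error above
    print(f"started {count_startable_threads(target)} of {target} threads")
```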


Why This Happens
________________________________________________________________________________
[1] OpenBLAS automatically detects CPU cores and creates threads accordingly. 
    On a 16-core host, it tries to create 16 threads, even when the container 
    is allowed far less.
[2] Kubernetes uses cgroups and namespaces to isolate containers. A container 
    might only be allowed 0.5 CPU cores and 10 processes, regardless of host 
    resources.
[3] The `RLIMIT_NPROC` value that OpenBLAS reports says nothing about the 
    container's cgroup limits, and those are what thread creation actually 
    hits.
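
The gap between the detected core count and the container's allowance can be 
made visible with a few lines of Python. This is a sketch under assumptions: 
it reads the cgroup v2 `cpu.max` file at its usual mount point (cgroup v1 
uses different files), and the function names are illustrative only.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit in cores, or None when no quota is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def read_cpu_limit(path=Path("/sys/fs/cgroup/cpu.max")):
    """Return the container CPU limit in cores, or None if unavailable."""
    try:
        return parse_cpu_max(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # os.cpu_count() reports the machine's CPUs and ignores the cgroup quota.
    print(f"detected CPUs: {os.cpu_count()}")
    print(f"cgroup limit:  {read_cpu_limit()}")
```

A quota of `50000 100000` corresponds to the 0.5 CPU example above, while the 
detected CPU count on the same node would still be 16.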


The Fix
________________________________________________________________________________
[1] Set explicit thread limits for OpenBLAS and other math libraries using 
    environment variables such as `OPENBLAS_NUM_THREADS` and 
    `OMP_NUM_THREADS`.
[2] Review and adjust container resource limits to match application 
    requirements.
[3] Match thread count to allocated CPU resources rather than relying on 
    auto-detection.
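
One way to apply the thread limit from inside the application is to set the 
variables before numpy is imported, since OpenBLAS reads them when it starts 
its thread pool. The sketch below is illustrative: the helper names are made 
up, rounding the CPU limit down (with a floor of one thread) is an assumed 
policy, and setting the same variables in the pod spec works just as well.

```python
import math
import os

# Thread-count variables read by OpenBLAS, OpenMP runtimes, and MKL.
THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def thread_env(cpu_limit):
    """Map a CPU limit in cores to thread-count environment variables."""
    threads = max(1, math.floor(cpu_limit))
    return {name: str(threads) for name in THREAD_ENV_VARS}


def apply_thread_env(cpu_limit, environ=os.environ):
    """Set the variables unless the deployment already defines them."""
    for name, value in thread_env(cpu_limit).items():
        environ.setdefault(name, value)


if __name__ == "__main__":
    apply_thread_env(0.5)  # the 0.5 CPU example from above -> 1 thread
    # import numpy  # must come after the variables are set
    print({name: os.environ[name] for name in THREAD_ENV_VARS})
```

Using `setdefault` keeps any value already set on the container, so the 
deployment configuration stays the source of truth.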


+------------------------------------------------------------------------------+
  
Thanks to Henna Rose Joshi for bringing this problem to my attention.

+------------------------------------------------------------------------------+