Issue Loading Fairseq Model in Nomad

Good day!

I have created a new Docker image based on pytorch/torchserve:0.4.2-cpu. It adds some extra dependencies (most importantly a modified version of fairseq) and an entrypoint script that downloads the Model Archive (.mar) file from AWS S3, places it in /home/model-server/model-store/, and then starts TorchServe like so:

    torchserve --start --ts-config /secrets/config.properties
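For reference, the entrypoint script is roughly the following. The bucket name is a placeholder, and the sketch writes the script out and syntax-checks it rather than executing it, since actually running it needs AWS credentials and TorchServe installed:

```shell
#!/usr/bin/env bash
# Rough sketch of the container entrypoint (s3://MY_BUCKET is a placeholder).
cat > /tmp/entrypoint.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# 1. Pull the model archive from S3 into TorchServe's model store.
aws s3 cp "s3://MY_BUCKET/en-de.mar" /home/model-server/model-store/en-de.mar

# 2. Start TorchServe. --start forks into the background, so keep the
#    container's PID 1 alive by following the server log.
torchserve --start --ts-config /secrets/config.properties
tail -f /home/model-server/logs/ts_log.log
EOF

# Verify the sketch parses without running it.
bash -n /tmp/entrypoint.sh && echo "entrypoint syntax OK"
```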

When I run this Docker image on my development machine, everything works great. The .mar file is downloaded from S3, TorchServe starts up, loads the model, and successfully responds to prediction requests.

However, when I try to deploy this Docker image via HashiCorp's Nomad orchestrator, I run into a problem: the .mar file is still downloaded from S3 successfully and TorchServe starts up, but as it loads the model, the worker blows up without any helpful error (that I can see).

Here is the output I see in the logs for the Nomad allocation:

2021-09-30 19:34:50,836 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2021-09-30 19:34:50,999 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.4.2
TS Home: /usr/local/lib/python3.6/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 512 M
Python executable: /usr/bin/python3
Config file: /secrets/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: en-de.mar
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server/model-store
Model config: {"en-de": {"0.0.1": {"defaultVersion": true,"marName": "en-de.mar","minWorkers": 1,"maxWorkers": 1,"responseTimeout": 5000}}}
2021-09-30 19:34:51,019 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2021-09-30 19:34:51,021 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: en-de.mar
2021-09-30 19:35:09,606 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 0.0.1 for model en-de
2021-09-30 19:35:09,606 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 0.0.1 for model en-de
2021-09-30 19:35:19,441 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model en-de loaded.
2021-09-30 19:35:19,442 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: en-de, count: 1
2021-09-30 19:35:19,455 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-09-30 19:35:19,527 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-09-30 19:35:19,527 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2021-09-30 19:35:19,528 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2021-09-30 19:35:19,528 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-09-30 19:35:19,529 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2021-09-30 19:35:19,741 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2021-09-30 19:35:19,822 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,823 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:150.82331466674805|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,823 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:56.5672492980957|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,823 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:27.3|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,824 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:70207.8671875|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,824 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:24832.73828125|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:19,824 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:26.3|#Level:Host|#hostname:a6c81b0f4290,timestamp:1633030519
2021-09-30 19:35:20,182 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2021-09-30 19:35:20,182 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - [PID]105
2021-09-30 19:35:20,182 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - Torch worker started.
2021-09-30 19:35:20,182 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - Python runtime: 3.6.9
2021-09-30 19:35:20,183 [DEBUG] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - W-9000-en-de_0.0.1 State change null -> WORKER_STARTED
2021-09-30 19:35:20,187 [INFO ] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2021-09-30 19:35:20,197 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2021-09-30 19:35:20,236 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - model_name: en-de, batchSize: 1
2021-09-30 19:35:21,193 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - Sentencepiece model path: /home/model-server/tmp/models/0b779a55a9e84e42996d07ab39e39a31/spm.model
2021-09-30 19:35:24,147 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - [en] dictionary: 32000 types
2021-09-30 19:35:24,147 [INFO ] W-9000-en-de_0.0.1-stdout MODEL_LOG - [de] dictionary: 32000 types
2021-09-30 19:35:25,304 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /10.136.1.138:16775 "GET /models HTTP/1.1" 200 4
2021-09-30 19:35:25,304 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:Unknown,timestamp:null
2021-09-30 19:35:26,570 [INFO ] epollEventLoopGroup-4-1 ACCESS_LOG - /10.136.1.138:49125 "GET /metrics HTTP/1.1" 200 1
2021-09-30 19:35:26,571 [INFO ] epollEventLoopGroup-4-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:Unknown,timestamp:null
2021-09-30 19:35:27,779 [INFO ] W-9000-en-de_0.0.1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-en-de_0.0.1-stdout
2021-09-30 19:35:27,779 [INFO ] W-9000-en-de_0.0.1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-en-de_0.0.1-stderr
2021-09-30 19:35:27,780 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2021-09-30 19:35:27,780 [DEBUG] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2021-09-30 19:35:27,781 [DEBUG] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
	at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:188)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
2021-09-30 19:35:27,782 [WARN ] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: en-de, error: Worker died.
2021-09-30 19:35:27,782 [DEBUG] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - W-9000-en-de_0.0.1 State change WORKER_STARTED -> WORKER_STOPPED
2021-09-30 19:35:27,783 [WARN ] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-en-de_0.0.1-stderr
2021-09-30 19:35:27,783 [WARN ] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-en-de_0.0.1-stdout
2021-09-30 19:35:27,783 [INFO ] W-9000-en-de_0.0.1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.

Any idea what might be going on, or any way to get more information about the failure?

Thanks in advance!

Update: this turned out to be a memory problem. Despite sizing the Nomad allocation's memory limit to about 2x the size of the loaded model, some initialization steps briefly required more than that, and the worker process was killed without any error surfacing in the TorchServe logs.
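In case it helps anyone else debug this: a process killed for exceeding its memory limit dies without a Python traceback, which is why nothing shows up in the TorchServe logs. One way to confirm the kill (assuming you can reach the Nomad client host) is:

```shell
# On the Nomad client host, look for kernel OOM-killer activity around the
# time the worker died (needs host access; exact wording varies by kernel):
(dmesg 2>/dev/null || true) | grep -iE 'out of memory|oom-kill' \
  || echo "no OOM events found (or kernel log not readable)"

# If you can locate the container, Docker records the kill directly:
#   docker inspect --format '{{.State.OOMKilled}}' <container-id>
# Nomad's own task events ("nomad alloc status <alloc-id>") may also show
# an "OOM Killed" event when the docker driver reports one.
```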

Increasing the memory limit to 3x the size of the loaded model made this work.
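For anyone hitting the same thing, the fix amounted to raising the memory in the task's resources stanza of the Nomad jobspec, roughly like this (the task name, image, and values here are illustrative, not my actual jobspec):

```hcl
task "torchserve" {
  driver = "docker"

  config {
    image = "my-registry/torchserve-fairseq:latest"  # placeholder image
  }

  resources {
    cpu    = 4000
    memory = 6144  # MB; roughly 3x the loaded model size rather than 2x
  }
}
```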