Backend worker monitoring thread interrupted or backend worker process died

I’m testing TorchServe with the resnet-18 tutorial at this link: https://github.com/pytorch/serve/tree/master/examples/image_classifier/resnet_18

The tutorial works perfectly fine as is. The problem starts when I put resnet.py (copied from torchvision), index_to_name.json, resnet18-5c106cde.pth, and model.py in the same folder and change model.py to:

from resnet import ResNet, BasicBlock

class ImageClassifier(ResNet):
    def __init__(self):
        super(ImageClassifier, self).__init__(BasicBlock, [2, 2, 2, 2])

Then I used the following commands to archive the model and start TorchServe:

torch-model-archiver --model-name resnet-18 --version 1.0 --model-file model.py --serialized-file resnet18-5c106cde.pth --handler image_classifier --extra-files index_to_name.json

mkdir model_store

mv resnet-18.mar model_store/

torchserve --start --model-store model_store --models resnet-18=resnet-18.mar

Here is the log after TorchServe starts:

> 2020-09-07 15:49:36,082 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
> 2020-09-07 15:49:36,082 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]18279
> 2020-09-07 15:49:36,082 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
> 2020-09-07 15:49:36,082 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STOPPED -> WORKER_STARTED
> 2020-09-07 15:49:36,082 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
> 2020-09-07 15:49:36,082 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
> 2020-09-07 15:49:36,083 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server()
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket)
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
> 2020-09-07 15:49:36,546 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg)
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_loader.py", line 117, in load
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - model_service.initialize(service.context)
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 50, in initialize
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.model = self._load_pickled_model(model_dir, model_file, model_pt_path)
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 74, in _load_pickled_model
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - module = importlib.import_module(model_file.split(".")[0])
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - return _bootstrap._gcd_import(name[level:], package, level)
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 994, in _gcd_import
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 971, in _find_and_load
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap_external>", line 678, in exec_module
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/tmp/models/492344f70f0f41beb13b7ed4335b3a18/model.py", line 1, in <module>
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - from resnet import ResNet, BasicBlock
> 2020-09-07 15:49:36,547 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - ModuleNotFoundError: No module named 'resnet'
> 2020-09-07 15:49:36,548 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
> 2020-09-07 15:49:36,551 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
> 2020-09-07 15:49:36,552 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
> java.lang.InterruptedException
> at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
> at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
> at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
> at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> 2020-09-07 15:49:36,552 [WARN ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: resnet-18, error: Worker died.
> 2020-09-07 15:49:36,552 [DEBUG] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-resnet-18_1.0 State change WORKER_STARTED -> WORKER_STOPPED
> 2020-09-07 15:49:36,552 [WARN ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet-18_1.0-stderr
> 2020-09-07 15:49:36,552 [WARN ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-resnet-18_1.0-stdout
> 2020-09-07 15:49:36,553 [INFO ] W-9000-resnet-18_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 55 seconds.
> 2020-09-07 15:49:36,565 [INFO ] W-9000-resnet-18_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stderr
> 2020-09-07 15:49:36,565 [INFO ] W-9000-resnet-18_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-resnet-18_1.0-stdout
> 2020-09-07 15:49:38,081 [INFO ] main org.pytorch.serve.ModelServer - Torchserve stopped.
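
The traceback shows model.py being imported from the extracted archive under /tmp/models/ and failing on `from resnet import ...`, which suggests resnet.py may not have been packaged alongside it. A minimal sketch for checking this, assuming the .mar file is an ordinary zip archive (the archiver's default format, as far as I know); the helper name `missing_local_imports` is hypothetical and not part of TorchServe:

```python
import ast
import importlib.util
import zipfile


def missing_local_imports(mar_path, model_file="model.py"):
    """Return top-level modules imported by model_file that are neither
    packaged in the archive nor importable in the current environment.

    Hypothetical helper for illustration; assumes the .mar is a zip file.
    """
    with zipfile.ZipFile(mar_path) as archive:
        packaged = set(archive.namelist())
        source = archive.read(model_file).decode("utf-8")

    # Collect the top-level names of all absolute imports in the model file.
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported.add(node.module.split(".")[0])

    missing = []
    for name in sorted(imported):
        # A module counts as packaged if name.py or a name/ directory is in the archive.
        in_archive = f"{name}.py" in packaged or any(
            entry.startswith(f"{name}/") for entry in packaged
        )
        # find_spec returns None for a top-level module that cannot be located.
        if not in_archive and importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

Calling `missing_local_imports("model_store/resnet-18.mar")` from a directory that does not itself contain resnet.py would list any module that model.py imports but that the worker would not find next to it. The check only looks at model.py itself, not at modules those imports pull in.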

I’m using:

  • torchserve version: 0.2.0
  • torch version: 1.6.0
  • java version: 11.0.8
  • Operating System and version: Ubuntu 18.04.5

It seems the model cannot be loaded. What did I do wrong, and how can I fix this?

Thanks.