Securely serializing/loading untrusted PyTorch models?

I am currently organizing a NeurIPS competition in which participants might be submitting PyTorch models to our evaluation server for the leaderboard.

Is there a secure way of serializing/loading untrusted PyTorch models?

Are there alternatives to pickle, which can be insecure? Is there a PyTorch model format we can insist upon that is secure?

I believe that ONNX and TensorFlow use protobuf, which avoids the security issues of pickling.

There has been some discussion around it: pickle is a security issue · Issue #52596 · pytorch/pytorch · GitHub
But I don’t think there is a simple solution.

I guess you could require them to export their model to ONNX. But that might lead to some restrictions, as not all PyTorch models can be exported to ONNX :confused:
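As a rough sketch of what that export step could look like on the participant side (the model, shapes, and file name are just placeholders):

```python
import torch
import torch.nn as nn

# Stand-in for a participant's model; any nn.Module whose forward pass
# can be traced should work.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
model.eval()

# torch.onnx.export traces the model with a dummy input of the expected shape.
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model,
    dummy_input,
    "submission.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # allow a variable batch size
)
```

The server could then evaluate the file with onnxruntime (`onnxruntime.InferenceSession("submission.onnx")`) without ever unpickling participant code.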


In addition to what @albanD said, I think you could try to secure your environment by e.g. running submissions in a container (Singularity or Docker) without any permissions that could affect your bare-metal systems. Of course, a system is never 100% secure and in the worst case users might try to escape the container, but you could at least reduce the probability of malicious code doing harm.
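For what it’s worth, here is a minimal sketch of how the server could launch one submission in such a locked-down container (the image name, mount paths, resource limits, and entrypoint are all placeholders):

```python
import subprocess

# Launch one submission in a locked-down Docker container.
cmd = [
    "docker", "run", "--rm",
    "--network=none",             # no network: blocks exfiltration attempts
    "--cap-drop=ALL",             # drop all Linux capabilities
    "--security-opt", "no-new-privileges",
    "--read-only",                # read-only root filesystem
    "--memory=8g", "--cpus=4",    # resource limits
    "-v", "/data/eval:/data:ro",  # evaluation data mounted read-only
    "-v", "/submissions/123:/submission:ro",
    "eval-image:latest",          # placeholder evaluation image
    "python", "/app/evaluate.py", "/submission/model.onnx",
]
subprocess.run(cmd, check=True, timeout=3600)  # terminate runaway submissions
```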

EDIT: I’m unsure about the background of the competition, but in the past (a couple of years ago) I’ve hosted a small closed competition on Kaggle. Maybe it’s still possible and would fit your use case?

EDIT2: You could also think about just accepting raw text data (as CSV files) containing the validation predictions instead of the models.
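A sketch of what scoring such a predictions-only submission could look like (the column names and metric are assumptions):

```python
import pandas as pd

# Load the submitted predictions and the held-out ground truth.
submission = pd.read_csv("submission.csv")      # assumed columns: id, prediction
ground_truth = pd.read_csv("ground_truth.csv")  # assumed columns: id, label

# Align on the sample id so the row order of the submission doesn't matter.
merged = ground_truth.merge(submission, on="id", validate="one_to_one")
accuracy = (merged["label"] == merged["prediction"]).mean()
print(f"accuracy: {accuracy:.4f}")
```

This way no participant code is executed at all; only their predictions are parsed.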


@ptrblck thanks for the feedback. What sort of performance issues will I encounter versus bare metal? My main experience has been that loading from disk is slower.

Right now we are leaning towards keeping the evaluation tasks secret, so we don’t want to share the evaluation data until the competition is over. I’m curious about what code you have for that.

Docker is also useful because I can disable networking, which mitigates the risk that anyone tries to exfiltrate the blinded evaluation data. I’m not really familiar with Singularity; what are the pros and cons versus Docker for a competition like this?

Let me know if you want to hear more about the challenge offline. My email is username@gmail.com

I’m not sure if I understand the question correctly. What do you mean by “from disk”? Do you mean the performance difference between the Kaggle kernels with their (free) GPU runtime vs. a bare-metal machine? If so, I cannot comment on the performance of the free Kaggle kernels and am not sure what is being used currently.

Kaggle is using an approach where the users submit the predictions and get a score based on a secret part of the submission samples. E.g. the score of only 10% of the submission will be reported, while internally the score of the entire submission will be stored. This is known as the “public” and “private” leaderboard. Since the number of submissions is limited, users (usually) cannot brute-force the private leaderboard samples. However, as so often, users might find a data leak which could allow them to somehow get more information about the public/private split, but issues with data leaks are sometimes hard to predict. :wink:
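As a sketch of that mechanism (the 10% fraction and the accuracy metric are just examples), the split can be drawn once with a secret seed and reused for every submission:

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Draw the public subset once with a secret seed and keep it server-side.
rng = np.random.default_rng(12345)         # seed kept secret in practice
n_samples = 10_000
public_mask = rng.random(n_samples) < 0.1  # ~10% "public" subset

# y_true is the hidden ground truth, y_pred a participant's submission
# (random stand-ins here).
y_true = rng.integers(0, 2, n_samples)
y_pred = rng.integers(0, 2, n_samples)

public_score = accuracy(y_true[public_mask], y_pred[public_mask])  # shown on the leaderboard
private_score = accuracy(y_true, y_pred)                           # stored internally, revealed at the end
```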

I’m not deeply familiar with Singularity either, but have heard that it can be “more secure”, so you would need to check some comparisons online or blog posts etc.

In Docker, I’ve found that shared mounts sometimes have slow performance (particularly for writes). So if the dataset lives outside the container, reads might be slow?

But then participants would need to be provided with the input data? We would prefer to avoid this.

I will look up Singularity.

OK, if participants don’t have the input data, I assume you are setting up machines for them to train and evaluate their models?

I haven’t seen this issue in Docker and am using it heavily for my workloads.
Slow data loading can of course occur if you are using a mounted folder from a network drive, but that depends on the network speed, not on Docker, so I would recommend copying the data to fast local storage.
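A trivial sketch of that copy step (the paths are placeholders):

```python
import shutil
from pathlib import Path

network_data = Path("/mnt/network_share/dataset")  # slow network mount
local_data = Path("/scratch/dataset")              # assumed fast local SSD/NVMe

# Copy once up front, then point the Dataset/DataLoader at local_data.
if not local_data.exists():
    shutil.copytree(network_data, local_data)
```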
