Failure of Tesla K80 for training

I’ve had a bad experience with the Tesla K80 and would say it is not suitable for training.

First, it can end with an unknown error, and second, it hangs at the end of an epoch when using the PyTorch Lightning Trainer. I had no such issues on a regular GTX 970, which is also faster by comparison.
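
For context, the training is driven through the standard Lightning flow, roughly like this minimal sketch (toy model and data just to show the shape of the setup, not my actual code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

train_loader = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=32)
trainer = pl.Trainer(gpus=1, max_epochs=5)  # older Lightning API; newer releases use accelerator="gpu", devices=1
trainer.fit(TinyModel(), train_loader)      # this is where the run hangs at the end of an epoch on the K80
```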

Can anyone help fix this?

How can I speed it up?

Are you using proper cooling?

This sounds misleading as K80s were heavily used in e.g. Colab for all kinds of ML/DL use cases.

I’m unsure what exactly is failing, as mentioning random errors is not really actionable.
However, note that K80 support has been dropped in the latest CUDA releases, as Kepler GPUs are too old.
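
As a quick sanity check (just a sketch, assuming a CUDA-enabled PyTorch build), you can compare the card’s compute capability against the architectures your PyTorch build was compiled for:

```python
import torch

# The K80 is a Kepler card with compute capability 3.7 (sm_37).
print(torch.version.cuda)                   # CUDA version this PyTorch build was compiled against
print(torch.cuda.get_device_name(0))        # should report "Tesla K80"
print(torch.cuda.get_device_capability(0))  # expected (3, 7) for a K80
print(torch.cuda.get_arch_list())           # architectures the build ships kernels for
```

If sm_37 is missing from that last list, the installed build doesn’t ship kernels for the K80, and you would need an older PyTorch/CUDA combination that still supports Kepler.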

So improper cooling is what degrades the performance?

FYI, I’m using this fan, but I changed it from 12 V to 5 V so it just blows the hot air out (rather than blowing cool air in).
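
To check whether this is actually keeping the card cool enough, I can log temperatures during a run; a rough sketch using the NVML Python bindings (assumes the nvidia-ml-py package is installed, and the sampling interval is arbitrary):

```python
import time
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()          # the K80 is a dual-GPU board, so it shows up as two devices
for _ in range(10):                          # sample for roughly ten seconds
    temps = []
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temps.append(pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU))
    print("GPU temperatures (C):", temps)    # steadily climbing values under load point to a cooling problem
    time.sleep(1)
pynvml.nvmlShutdown()
```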

Are you using the K80 in a proper server or in a desktop? You can run into a lot of problems in a desktop, as the card isn’t designed for that.

You need something like a Supermicro 2U SuperServer 2027GR-TRF, which will have proper cooling, power, and connectivity; you can likely find one on eBay. Get the proper RAM and CPUs as well, and then you will also need to set up IPMI so the fans ramp up at the appropriate times.

FYI, it is used in a desktop :) Maybe I need to modify it to improve the cooling.

I think we’ve found your problem. You’ll get quite mixed results using a K80 in a desktop; I speak from experience.

At least in a proper server, you can go days without an error interrupting training. But just note server fans are about as noisy as a prop plane.

Yeah, agreed. I recently tested my fans at 12 V (before that I had swapped them to 5 V on the Molex connector), and now all training runs go smoothly to the end without any problems, just a lot of noise coming from these two fans :)

Cooling is absolutely important for a compute device if you want to train without any problems.