Three player adversarial games

AjayTalati · July 14, 2017, 2:22am

Hello

this probably sounds quite vague, but I wonder if anyone has managed to train three nets using adversarial training? Here’s the general algorithm

E,F and D are nets, with F and D being simple MLPs, and E is an encoder with an application specific architecture. In the inner loop, E and F are trained co-operatively, and in the outer loop they are trained adversarially against D.
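A rough sketch of the alternating scheme described above, for anyone who wants something to poke at. This is not the paper's algorithm: it assumes binary labels and binary source IDs, swaps all three nets for linear/logistic models in plain NumPy with hand-written gradients, lets D see only the encoding, and uses a fixed number of discriminator steps in place of the algorithm's stopping rule. The names `train_three_player`, `lam` and `d_steps` are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    # Clip the logits so exp() cannot overflow
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def bce(p, t):
    # Binary cross-entropy, clipped away from log(0)
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

def train_three_player(x, y, s, k=4, lam=0.5, lr=0.1, epochs=100, d_steps=5, seed=0):
    """x: (n, d) samples, y: (n,) binary labels, s: (n,) binary source IDs."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W_e = rng.normal(scale=0.1, size=(d, k))  # encoder E (linear, assumption)
    w_f = np.zeros(k)                         # predictor F (logistic)
    w_d = np.zeros(k)                         # discriminator D (logistic)
    history = []
    for _ in range(epochs):
        # E and F step: both lower the label loss L_f,
        # and E additionally tries to raise D's source loss L_d
        z = x @ W_e
        g_f = (sigmoid(z @ w_f) - y) / n      # dL_f / d(label logit)
        g_d = (sigmoid(z @ w_d) - s) / n      # dL_d / d(source logit)
        grad_w_f = z.T @ g_f
        grad_z = np.outer(g_f, w_f) - lam * np.outer(g_d, w_d)
        W_e -= lr * (x.T @ grad_z)
        w_f -= lr * grad_w_f
        # D steps: with E held fixed, D lowers its own source loss
        z = x @ W_e
        for _ in range(d_steps):
            w_d -= lr * (z.T @ ((sigmoid(z @ w_d) - s) / n))
        history.append((bce(sigmoid(z @ w_f), y), bce(sigmoid(z @ w_d), s)))
    return W_e, w_f, w_d, history

# Toy data: column 0 carries the label, column 1 the source, column 2 is noise
rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n).astype(float)
s = rng.integers(0, 2, n).astype(float)
x = np.column_stack([y + 0.3 * rng.normal(size=n),
                     s + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])
W_e, w_f, w_d, history = train_three_player(x, y, s)
print("final (label loss, source loss):", history[-1])
```

Setting `lam=0` and `d_steps=0` reduces this to plain co-operative training of E and F with no adversary.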

The convergence/stability theory/proof is from a paper on A conditional adversarial architecture. The actual application is not so relevant; it’s the solid proof that’s interesting.

I was simply wondering if anyone had managed to train such a three-headed beast?

@smth, @tom any thoughts on this one?

You guys are the expert adversarial training people, so I thought I should ask you first

smth · July 14, 2017, 3:21am

i haven’t managed to train such a thing (but I didn’t try). Seems very very mildly interesting, let me know how it goes

tom · July 14, 2017, 9:49am

Awesome, thanks for the link. I had not seen it before, but I would not consider myself an expert (just yet).
It certainly feels like a very different use of adversarial:
So you have a training set of (sample, source, label) and you want to produce an encoding e = e(sample) that keeps as much information about label as possible, but such that source is as independent from (encoding, label) as possible. The idea is that this helps classification when you have new sources.

Interesting. Do you have a use case in mind? I wonder if it could be used for advanced style fingerprinting, too, when you switch source and label.

Best regards

Thomas
[edit: fixed the error Ajay pointed out below]

AjayTalati · July 14, 2017, 4:41pm

Hi Tom,

Wow - that is a perfect summary - it took me nearly a full day’s reading to come to the same level of understanding.

Indeed, in this case the purpose of the discriminator network D is to remove conditional dependencies on the sources (i.e. sleeping subject + measurement environment).

Perhaps you meant e = e(sample) rather than e=e(source)? The encoder is fed the sequence of spectrograms X = [x_1,x_2,…,x_t] in Omega_x, up to some time t, in the notation of the start of section 3, Model.

At the moment my use/test case is pretty much the same as the original paper - but due to the generality of this domain adaptation/source-conditionally-independent type of setup, I guess it could be applied to many use cases in general time-series classification and prediction?

The minimum requirements would be, as you clearly summarised, (a time-series sample X, source ID s, instantaneous event labels y). So I guess there could be applications to claim modelling/event classification & prediction in insurance - the solid proof/performance guarantee of the paper makes it a good candidate?

My guess is due to the generality of this setup it could eventually be extended to extra modalities, by simply training more independent encoders say a, b, c, on each of them, and then concatenating the outputs of these encoders - and projecting this vector, e.g.

E_canonical (X_a, X_b, X_c) = concatenate [ E_a(X_a), E_b(X_b), E_c(X_c)]

onto a latent embedding/manifold. The projection from this canonical space, could then be fed into the predictor and discriminator networks as before. Some notion of how perhaps to attempt this is given here - Conditional generation of multi-modal data using constrained embedding space mapping. It’s just an idea at the moment though. Perhaps it could again be very applicable in actuarial science, maybe as an alternative to classical Gaussian/Levy/Poisson Process models - given enough data it should hopefully pick up notions of correlations in both time and across modalities?
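To make the concatenate-then-project idea concrete, here is a small sketch. Everything in it is an assumption for illustration: the per-modality encoders are random linear maps with a tanh (stand-ins for trained nets), and `make_linear_encoder`, `canonical_encoding`, the input sizes and the projection matrix are hypothetical names and values.

```python
import numpy as np

def make_linear_encoder(in_dim, out_dim, rng):
    """Hypothetical stand-in for a trained per-modality encoder."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    return lambda x: np.tanh(x @ W)

def canonical_encoding(inputs, encoders, projection):
    """Concatenate the per-modality encodings, then project to the latent space."""
    parts = [enc(x) for enc, x in zip(encoders, inputs)]
    concat = np.concatenate(parts, axis=-1)   # E_canonical(X_a, X_b, X_c)
    return concat @ projection

rng = np.random.default_rng(0)
dims = {"a": 16, "b": 8, "c": 4}              # illustrative input sizes per modality
enc_dim, latent_dim, batch = 5, 3, 10
encoders = [make_linear_encoder(d, enc_dim, rng) for d in dims.values()]
inputs = [rng.normal(size=(batch, d)) for d in dims.values()]
projection = rng.normal(size=(enc_dim * len(dims), latent_dim))

z = canonical_encoding(inputs, encoders, projection)
print(z.shape)  # (10, 3)
```

The projected `z` is what would be handed to the predictor and discriminator in place of a single encoder's output.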

I’m not sure about the application to style fingerprinting, so maybe we could just try it and see?

I have some test data which is similar to that used in this paper. The difference is that the available data uses accelerometer recordings (rather than radio frequency modulations) of sleep study participants, and it also has their “gold standard” labels from polysomnography (PSG).

https://es.informatik.uni-freiburg.de/datasets/ichi2014

So converting this 3 channel accelerometer data to sequences of spectrograms is reasonably simple to do. After that the paper’s encoder architecture could be applied, and its algorithm should be reproducible without any modifications - I hope

Hopefully in a few months’ time I should have access to some SOTA clinical grade wearable sensors, which record ECG, electrodermal activity, respiration rate/depth - generally multi-modal streaming bio-markers of sympathetic nervous system and cardio-respiratory activities - similar to the Verily Study Watch.

I’m guessing something similar to that will be the new standard recording instrument in health insurance studies, in a few years’ time?

Great to be chatting with you again

Best regards,

Ajay

tom · July 14, 2017, 7:52pm

The other association that comes to my mind is that it seems like a generalised information bottleneck - except that you are not necessarily trying to compress the sample, but lose mutual information between the sample and the source, so you want to minimize (I(Encoding, Source) - β I(Encoding, Target)) with β=1/λ or so. But the notation is too thick for me today…
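Written out, the objective hinted at above might look as follows. Here E = e(X) is the encoding, S the source and Y the target; the identification β ≈ 1/λ is the tentative one from the post and is not derived here.

```latex
\min_{e}\; I(E; S) \;-\; \beta\, I(E; Y), \qquad E = e(X), \quad \beta \approx 1/\lambda
```

For comparison, the classical information bottleneck minimises I(X; T) − β I(T; Y) over the representation T, i.e. it compresses the sample itself rather than removing information about a separate source variable.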

AjayTalati · July 14, 2017, 8:22pm

Hi, @smth of course I’ll let you know how it goes - thanks a lot for the reply

AjayTalati · July 14, 2017, 8:34pm

Hi Thomas

thanks a lot for the insight, I hadn’t thought about it in this way. It does sound right - we want to remove the effects of individual sources from a general representation of the samples (i.e. the encoder), and from the predictor.

The three player game idea seems to be the easiest way to get a simple algorithm with an equilibrium/convergence proof. I don’t know how to derive something like that from basic information theory - that’s kind of why I like this paper.

The only other references to adversarial training with three networks are in the domain adaptation literature, and also Triple Generative Adversarial Nets - it has code - but I don’t understand the paper at the moment

tom · July 16, 2017, 8:36pm

Hi Ajay, @AjayTalati

looking at this some more, I wonder whether the guarantees are really that strong in practice -

if … have enough capacity and is trained to reach optimum

seems like a pretty strong assumption (it would seem the WGAN equivalent would be not far away from “the discriminator has learnt a test function reproducing the real Wasserstein distance” and “the generator has been trained so that the Wasserstein distance is 0” - at which point you automatically have success).
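For reference, the WGAN statement alluded to here rests on the Kantorovich-Rubinstein duality, with P_r the real distribution, P_g the generator's distribution and f ranging over 1-Lipschitz test functions:

```latex
W(P_r, P_g) \;=\; \sup_{\|f\|_{L} \le 1} \; \mathbb{E}_{x \sim P_r}\big[f(x)\big] \;-\; \mathbb{E}_{x \sim P_g}\big[f(x)\big]
```

The analogous "optimal discriminator" assumption is that the critic actually attains this supremum, and the "optimal generator" assumption is that W(P_r, P_g) = 0, which is already the desired result.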

The code for the article accompanying the ichi2014 dataset you linked seems to be downloadable from this uni-siegen page. Maybe it is useful as a baseline. I’m not sure that I quite understand the spectrogram conversion from accelerometer data - this seems quite different from RF data.

Best regards

Thomas

AjayTalati · July 17, 2017, 12:53am

Hi Thomas,

thanks a lot for your insight - I never thought of comparing with the WGAN

So yes !!! I’m keen to do the experiment you suggest i.e. comparing the three-player game setup in this paper with the WGAN. I’m very curious how the two compare! Will need to think a bit more carefully though about a reasonable architecture for the WGAN - I’ve got no experience of training sequence GANs so I’m not sure how to do it yet.

I’ve got a reasonable amount of the architecture and training code for the 3 player game paper done now. What I’ve got so far is that the Encoder E is basically just a standard image captioning CNN-LSTM architecture with the 2D residual CNN replaced with a 1D residual CNN. The Predictor F can just be a simple MLP.

So this architecture can be trained using the algorithm posted at the top of this thread, minus the discriminator loop, i.e. just ignoring the lines repeat, Update discriminator D: until

In the paper, in section 4.5 - Role of Our Adversarial Discriminator, this simple setup (of just the Encoder and Predictor) is referred to as the “baseline model”, and its performance is compared with the full setup, i.e. including the Discriminator. It seems the performance of the baseline is not so bad on its own, but the addition of the discriminator has some important effects. In particular, it allows learning the transitions between the labels, which is a hidden/latent category never presented to the predictor network - so that’s empirically quite interesting

So hopefully we should be able to test this baseline model soon - i.e the next few days, I’ll try it and get back to you - I’m guessing that it should train reasonably quickly compared to larger image/more complicated GAN models?

As you pointed out, looking into the data pre-processing in more detail, it appears there are a few different reasonable ways to convert the accelerometer data into spectrograms. I believe the ichi2014 dataset has a 100 Hz sampling frequency, so slicing this up into 30 second windows should give 3000 data points per window. I’m hoping that this will be good enough to get a reasonable spectrogram using scipy.signal.spectrogram, which (with the segment length set to the full window) should output a single 1D vector of the amount of energy in each frequency “bin” for each 30 second window. Alternatively, I’ll just have to experiment with different sized windows, as I’ve not seen the spectrogram method applied to accelerometer sleep data before. Though I believe it’s quite common for accelerometer activity recognition data - so perhaps I/we could have a look into that if this doesn’t work? Alternatively, the authors of the MIT paper have said they will release their RF and polysomnography data, but I’m guessing if that does happen it won’t be till mid August, after they present the paper at ICML17

Here’s a nice picture of the time evolution of a spectrogram “window”,
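A quick sketch of the windowing described above, assuming the 100 Hz rate and 30 second windows from the post and using a synthetic signal in place of real accelerometer data. The helper name `window_spectra` is hypothetical. Note that `scipy.signal.spectrogram` with default settings cuts its input into short 256-sample segments, so `nperseg` has to be set to the window length to get one power vector per window.

```python
import numpy as np
from scipy import signal

FS = 100                    # Hz, sampling rate assumed in the post
WINDOW_SECONDS = 30
N = FS * WINDOW_SECONDS     # 3000 samples per window

def window_spectra(x, fs=FS, n=N):
    """Split a 1D recording into non-overlapping windows, one power vector each."""
    n_windows = len(x) // n
    windows = np.reshape(x[: n_windows * n], (n_windows, n))
    # nperseg=n turns each window into a single segment,
    # so the trailing time axis of sxx has length 1
    f, _, sxx = signal.spectrogram(windows, fs=fs, nperseg=n, noverlap=0, axis=-1)
    return f, sxx[..., 0]   # shape (n_windows, n // 2 + 1)

# Synthetic stand-in for one accelerometer channel: 5 minutes of a 2 Hz tone plus noise
rng = np.random.default_rng(0)
t = np.arange(10 * N) / FS
x = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.normal(size=t.size)

f, spectra = window_spectra(x)
print(spectra.shape)                                  # (10, 1501)
print("peak frequency of first window:", f[np.argmax(spectra[0])])

# With default settings a single 30 s window comes back 2D (frequency x time)
_, seg_t, sxx_default = signal.spectrogram(x[:N], fs=FS)
print(sxx_default.shape)
```

Shorter segments within each window trade frequency resolution for a time axis, which is the other reasonable conversion to try.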

If I’m not going crazy - each single spectrogram is a slice through this 3D plot at a set time. This slice is then simply a vector of real numbers, where each number in the vector is the power in a particular frequency bin, i.e. the height plotted on the amplitude, z-axis

Since there are 3 channels in the accelerometer data, I’m guessing I’m either going to train three separate Encoders, and perhaps share the weights? Or, alternatively and more simply, I could just add the spectrograms, to get the energies summed across all three channels, for each given bin. I’m not too sure about these two design decisions so I guess I just have to try them both?

I’ll post all the code (loading-preprocessing-spectrogram, models and the training scripts) in a Github repo as soon as it’s working/worthwhile to share.
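The two options can be compared shape-wise with a few lines of NumPy. This is only a sketch under assumptions: the spectrograms are random stand-ins, the "encoder" is a single shared linear map with a tanh rather than the CNN-LSTM, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_windows, n_bins, enc_dim = 3, 10, 1501, 8
# Stand-in for per-channel spectrograms (x, y, z axes of the accelerometer)
spec = rng.random((n_channels, n_windows, n_bins))

# Option 1: sum the energy in each bin over the channels, then use one encoder
summed = spec.sum(axis=0)                             # (n_windows, n_bins)

# Option 2: one encoder with shared weights applied to every channel,
# with the three per-channel encodings concatenated afterwards
W_shared = rng.normal(scale=n_bins ** -0.5, size=(n_bins, enc_dim))
per_channel = np.tanh(spec @ W_shared)                # (n_channels, n_windows, enc_dim)
shared = np.concatenate(list(per_channel), axis=-1)   # (n_windows, 3 * enc_dim)

print(summed.shape, shared.shape)  # (10, 1501) (10, 24)
```

Summing discards which axis the energy came from, while the shared-weight version keeps the channels apart at the cost of a wider encoding.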

Really nice to be working with you again

Best regards,

Ajay

PS - thank you very much for the link to the code on the uni-siegen page - somehow I missed that?

PS2 - there’s a nice video about the RF device implementation here - https://www.youtube.com/watch?v=BhSL7AILTzE - it doesn’t talk about the three player game or really any deep learning methods, but it’s interesting background for this particular use case