Which advanced ML network would be best for my use case?

Hi all,

I would like to get some guidance on improving the ML side of a problem I’m working on in experimental quantum physics.

I am generating 2D light patterns (images) that we project into a vacuum chamber to trap neutral atoms. These light patterns are created via Spatial Light Modulators (SLM) – essentially programmable phase masks that control how the laser light is shaped. The key is that we want to generate a phase-only hologram (POH), which is a 2D array of phase values that, when passed through optics, produces the desired light intensity pattern (tweezer array) at the target plane.

Right now, this phase-only hologram is usually computed via iterative algorithms (like Gerchberg-Saxton), but these are relatively slow and brittle for real-time applications. So the idea is to replace them with a neural network that maps directly from a desired target light pattern (e.g. a 2D array of bright spots where we want tweezers) to the corresponding POH in a single fast forward pass.
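
For concreteness, the core GS loop looks roughly like this (a minimal numpy sketch of the idea; a real implementation adds padding, shifts, and calibration):

```python
import numpy as np

def gerchberg_saxton(target_amplitude, n_iters=50, seed=0):
    """Minimal GS sketch: find an SLM phase whose far field (FFT)
    approximates target_amplitude."""
    rng = np.random.default_rng(seed)
    incident = np.ones_like(target_amplitude)  # uniform illumination
    slm = incident * np.exp(1j * rng.uniform(0, 2 * np.pi, target_amplitude.shape))
    for _ in range(n_iters):
        img = np.fft.fft2(slm)                               # SLM -> image plane
        img = target_amplitude * np.exp(1j * np.angle(img))  # enforce target amplitude
        slm = np.fft.ifft2(img)                              # image -> SLM plane
        slm = incident * np.exp(1j * np.angle(slm))          # enforce phase-only SLM
    return np.mod(np.angle(slm), 2 * np.pi)                  # POH in [0, 2*pi)
```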

There’s already some work showing this is feasible using relatively simple U-Net architectures (example: https://arxiv.org/pdf/2401.06014). This U-Net takes as input:

  • The target light intensity pattern (e.g. desired tweezer array shape)

and outputs:

  • The corresponding phase mask (POH) that drives the SLM.

They train on simulated data: target intensity ↔ GS-generated phase. The model works, but:

  • The U-Net is relatively shallow.

  • The output uniformity isn’t that good (only 10%).

  • They aren’t fully exploiting modern network architectures.

I want to push this problem further by leveraging better architectures but I’m not an expert on the full design space of modern generative / image-to-image networks.

My specific use case is:

  • This is essentially a structured regression problem:

  • Input: target intensity image (2D array, typically sparse — tweezers sit at specific pixel locations).

  • Output: phase image (continuous value in [0, 2pi] per pixel).

  • The output is sensitive: small phase errors lead to distortions in the real optical system.

  • The model should capture global structure (because far-field interference depends on phase across the whole aperture), not just local pixel-wise mappings.

  • Ideally real-time inference speed (single forward pass, no iterative loops).

  • I am fine generating datasets from simulations (no data limitation), and we have physical hardware for evaluation.
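
For example, a simulated training pair could be generated along these lines (a sketch reusing the gerchberg_saxton function above; the size and spot count are illustrative, not from the paper):

```python
import numpy as np

def random_tweezer_target(size=60, n_spots=10, seed=None):
    # Sparse target intensity: bright spots at random pixel locations.
    rng = np.random.default_rng(seed)
    img = np.zeros((size, size))
    img[rng.integers(0, size, n_spots), rng.integers(0, size, n_spots)] = 1.0
    return img

# One (input, label) pair; GS acts on amplitude, i.e. sqrt of intensity.
target = random_tweezer_target(seed=0)
label_phase = gerchberg_saxton(np.sqrt(target))
```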

Since this resembles many problems in vision and generative modeling, I’m looking for suggestions on what architectures might be best suited for this type of task. For example:

  • Are there architectures from diffusion models or implicit neural representations that might be useful even though we are doing deterministic inference?

  • Are there any spatial-aware regression architectures that could capture both global coherence and local details?

  • Should I be thinking in terms of Fourier-domain models?

I would really appreciate your thoughts on which directions could be most promising.

Hi Thunlok!

Thank you for the clarification relative to your previous thread.

As I understand your use case, you have a target normal image (that might be restricted to
being a collection of bright spots on a black background) and you want to predict from the
normal image the “phase-only hologram” image that reproduces the normal image.

I don’t want to rain on your parade (or on the parade of the authors of the paper you linked
to – which, as a matter of personal forum-participation practice, I didn’t look at), but there
are lots of problems for which neural networks / machine learning are not appropriate.

For example, we wouldn’t, for practical purposes, train a neural network to invert matrices.

(If you really want to have some fun, train a network to find prime factorizations of large
integers.)

When there’s a well-defined concrete conventional algorithm that solves your problem, getting
specific quantitative details correct with high precision, you generally don’t want to build a
neural-network for it.

Neural networks are good in cases where we want to “discover” the “algorithm” because we
don’t have a fully-baked conceptual algorithm in mind. (For example, describe to me in words
an algorithm for distinguishing images of cats from those of dogs.) They’re also good for
wading through lots of data looking for patterns that we don’t already have a detailed handle
on.

Neural networks / machine learning is quite fashionable these days (for good reason, I might
add) and as a consequence we see many papers being written applying such techniques to
problems for which they aren’t really appropriate.

So, yeah, for your use case I would advise you to use “old-school” scientific computing …

Having said that:

“Better” architectures are those that work better on your specific use case. “Generative /
image-to-image networks” are quite the fad right now (for good reason), but they’re directed
at things like noise reduction and inpainting (and deepfakes), where you’re trying to produce
an output image that “looks good” according to rather “soft” criteria. You’re not trying to predict
the value of a specific output pixel to several digits of accuracy.

Even though these architectures are very good for certain (very interesting) use cases, I don’t
see them as being likely to be a good choice for you.

Note, this speaks against something like U-Net (but see below), as the general structure of
U-Net is to predict values for output pixels based on the “regions” around the corresponding
input pixels.

Are you willing to have an expensive training process (for example, training on lots of data),
provided your inference is (relatively) cheap?

I understand this to mean that you have (i.e., can generate) very large amounts of realistic
annotated training data.

Note, you’d better be sure that your simulated datasets pass your “physical-hardware”
evaluation with flying colors. Even if you succeed in effectively training your network
with lots of simulated data, the network’s output can only be as good as that simulated
data. If your physical-hardware evaluation doesn’t like your simulated data that much,
it also won’t like the output of even a well-trained network.

This just doesn’t seem like a promising avenue to me.

Something like that could make sense.

Since, figuratively speaking, the hologram image (that you want to predict) is the Fourier
transform of the normal image you will be using as input, you might try along these lines:

Using U-Net as the context for the discussion (without suggesting that U-Net would be a
better choice than other convolutional or image-to-image architectures), you might augment
your input image with its Fourier transform.

That is, assuming that your input image is single-channel gray-scale, you might use a
three-channel input image that consists of the original input image together with the amplitude
and phase of its Fourier transform.

The Fourier transform is likely very relevant to the desired phase-only hologram output, and
you don’t want to, in effect, train your network to compute the Fourier transform. (Might as
well just train a network to invert matrices …) Instead, just input the Fourier transform up front.

As a variation on this theme, you might use multiple Fourier-transform-like derivatives of your
input image as additional input channels.

One scheme: If you have some single-pass (or otherwise adequately cheap) algorithms
for producing one or more approximate results for your desired phase-only hologram, you
could input those as additional input channels. The idea is that the U-Net would be tasked
not with predicting the phase-only hologram, but rather with predicting a correction to an
approximate phase-only hologram.
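
Schematically, something like this (a rough sketch; unet and cheap_phase are placeholder names for whatever image-to-image model and fast approximate-POH routine you use):

```python
import torch

def refine(unet, target, cheap_phase):
    # Stack the target and a cheap approximate hologram as two input channels.
    approx = cheap_phase(target)                    # (H, W) rough hologram
    x = torch.stack([target, approx], dim=0)[None]  # (1, 2, H, W)
    return unet(x)[0, 0]                            # refined hologram, (H, W)
```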

Again, my gut tells me that you want to use old-school scientific computing for this. If you
need more speed, your efforts are probably better directed at improving the old-school
algorithm, improving the efficiency of its implementation, or building a parallel-processing
or CUDA-style implementation.

However, if you can afford lots of computational power for training, I could see you getting
something like a U-Net (or another convolutional network) working if you train it on lots of
(high-quality) simulated data and augment the input images with companion Fourier-transform
or approximate-solution images. (And be prepared to play around with the depth and / or other
internal details of your network.)

Best.

K. Frank

Of course it’s a bit of a bummer to read that old-school ways would work best, as I think it would be quite fun to do ML haha. But I sincerely appreciate your advice. The way you put it, that ML is good when you don’t know the algorithm, is enlightening. Actually, the only problem with the Gerchberg-Saxton algorithm is that it takes a long time to run, so it’s difficult to run it multiple times during experiments. It’s just computationally demanding. So people have been trying alternatives to GS, and one was this neural-network approach: essentially learn the GS mapping, but at a lower computational cost when using the network during an experiment. That’s essentially the only reason people have been trying ML for this.

In fact, what would be a dream is an ML network (or something else) that gets close to what GS or weighted GS can achieve. I understand that a network can’t get better than what it’s trained on, but it could get close and also offer other additions, such as reducing flickering intensity (this is a big problem at the moment) from phase hologram to phase hologram. The flickering happens when one updates the phase hologram on the SLM: because each phase hologram is made via the GS algorithm, its phase is random, i.e. from one phase hologram to the next there is no shared information, so there are phase jumps, which cause the light to flicker on the camera. So if there is a way for a network or some ML tool to learn to smooth out the phase holograms from step to step, then that could be a big benefit as well. This could be expanded even further so that the network outputs all the phase-hologram steps with lower flickering, say 30 optimal phase holograms. These would be my next steps once I get a simple ML network/tool to work for the simple case of making a GS-like phase hologram. Given this, do you see more of a reason to try ML?

I would like to also mention that the GS algorithm (Phase retrieval - Wikipedia) is a phase retrieval algorithm, i.e. an algorithm for finding solutions to the phase problem (also in the link above). So when you state that there is an algorithm that does it already: this algorithm is, I would say, the best we have so far. But I’ve been looking into using deep learning for phase retrieval, and this would be something I’d be interested in investigating, i.e. completely removing the GS algo and using just machine learning for the phase problem.

Then this is extremely valuable. It’s essentially saying “Hey, it’s about 3% as precise as the WGS, but with that you get less flickering.”

Correct

What do you mean by augment the input image with its FT?

Have you heard of GitHub - NVlabs/edm: Elucidating the Design Space of Diffusion-Based Generative Models (EDM)?
It was recommended to me to use for a U-Net, but I have no clue how to use it. Do you think using this as the U-Net would be a good idea?

Huh…interesting. So, the goal was to have the neural network train on the GS-made holograms, i.e. the best holograms currently known in the field. But how could it predict a correction? This is a good idea, because essentially I could make a non-ideal phase hologram very, very fast. So if I can put that into the network and get an optimized phase hologram, then I would say that is what I want! :smiley:

Hi Thunlok!

This is a side comment:

You say that the (overall?) phase of the phase hologram is random. Is it the case that there
is an overall random phase in the phase hologram? For example, is it the case that the only
thing that matters is the phase differences between various points in the hologram?

I’m not familiar with your GS algorithm, but it would seem plausible that such an algorithm
could be modified (without much increase in cost) to take in some kind of “reference phase”
that the output of the algorithm then matches. For example, could the previously-generated
hologram be used as the algorithm is generating the new hologram to guide the new hologram
to have a consistent “low-flicker” phase?

If the notion that the algorithm produces output with an unconstrained overall phase is
figuratively correct, perhaps one could tweak the algorithm to fix that overall phase “by hand.”
Perhaps the phase-fixing condition could simply be requiring the upper-left corner of the
generated hologram to have phase 0. Or if fixing that upper-left corner would be too much
of “the tail wagging the dog,” perhaps you could require the average phase – the phase of
each pixel in the hologram averaged over all of the pixels – to be 0.
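
Concretely, the average-phase version might look like this (a minimal numpy sketch):

```python
import numpy as np

def fix_overall_phase(phase):
    # Shift the hologram so its circular-mean phase is 0, then rewrap.
    mean = np.angle(np.exp(1j * phase).mean())
    return np.mod(phase - mean, 2 * np.pi)
```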

Note, even if you go with some machine learning approach, training on “fixed-phase” holograms
could teach your network to generate fixed-phase holograms. (You could even add a term to
the loss function that penalizes the difference between the “overall phase” (e.g., average phase)
and a fixed overall-phase benchmark (such as 0).)
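
Such a loss term might look like this (a sketch; the weight is just an illustrative starting point):

```python
import math
import torch

def fixed_phase_loss(pred, target, overall_weight=0.1):
    # Per-pixel phase error, wrapped so that 0 and 2*pi compare as equal.
    d = torch.remainder(pred - target + math.pi, 2 * math.pi) - math.pi
    pixel = (d ** 2).mean()
    # Penalize deviation of each image's circular-mean phase from 0.
    overall = torch.angle(torch.exp(1j * pred).mean(dim=(-2, -1)))
    return pixel + overall_weight * (overall ** 2).mean()
```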

Is your goal to find a faster algorithm that reproduces the results of the GS algorithm, or are
you looking to generate phase holograms that are better (by whatever measure) than those
produced by the GS algorithm?

Still commenting on the side topic of flickering: If flickering is the only criterion (in addition to
speed) by which you want to do better than the GS algorithm, my gut tells me that there is likely
a (low-cost) tweak to the algorithm that will reduce or eliminate flickering.

Working in the context that your ground-truth target data are the GS holograms – that if you
can reproduce the GS holograms, you’re done:

So your baseline problem is that you want to input a single-channel normal image (the
bright spots) and output a single-channel (phase-only) hologram “image” that will reproduce
the input image.

Even though these two images are in different “domains” – the input image in the space
domain and the output image in the frequency domain – I imagine that the two images have
the same shape. (In your original example, you had a 60x60 input image and what I assumed
was a 60x60 output hologram “image.”)

Whether it’s U-Net or some other image-to-image model, it’s very natural for such a model
to take in and output multi-channel images.

So by “augment” the input image I mean pass in a multi-channel image that contains not
only the original normal image, but its Fourier transform, as well. Concretely, I have in
mind the first channel of the multi-channel input image being the original image – the
brightness of its pixels. The second channel would be the amplitude of the corresponding
Fourier transform, and the third channel would be the phase of the Fourier transform.

(The motivation for augmenting with the Fourier transform is that it’s hard to train a network
to perform Fourier transforms. Since the hologram is, morally speaking, the Fourier transform
of the normal image, the Fourier transform is likely to be an important part of the “workflow” the
network performs when it generates the hologram. Rather than force the network to learn
to perform the Fourier transform – or some other broadly-equivalent sub-processing – just
give it the result of the processing up front.)
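
In code, the augmentation could be as simple as this (a sketch; the normalizations are illustrative):

```python
import torch

def augment_with_fft(img):
    # img: (H, W) target intensity. Returns a (3, H, W) input image:
    # [original image, FFT amplitude (normalized), FFT phase].
    f = torch.fft.fftshift(torch.fft.fft2(img))
    amp = f.abs()
    amp = amp / (amp.max() + 1e-8)
    return torch.stack([img, amp, f.angle()], dim=0)
```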

Again, by augment I mean input to the network various other useful information by including
it as additional channels in the multi-channel input image.

So you could augment your normal input image with an inexpensive approximate GS hologram
(in addition to or instead of the Fourier transform, depending on which works best). If you train
your network to produce near-perfect GS holograms (because those are your target images),
you will, by definition, be training your network to predict a correction to the approximate
hologram (input as one of the channels of the augmented input image). Sure, this correction
is packaged as the actual target hologram, rather than a set of “correction data,” but these
are substantively the same thing.

(As an aside, you might get some minor benefit by training the network to explicitly predict
the correction, for example by training it to predict the difference between the target GS
hologram and the approximation that was input. But regardless of whether one or the other
of these approaches is better, the difference won’t be great because they are substantively
the same problem.)
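
If you do try the explicit-correction variant, the training target would be something like the wrapped difference (a sketch):

```python
import math
import torch

def wrapped_residual(gs_phase, approx_phase):
    # Wrapped difference between the GS hologram and the cheap approximation.
    d = gs_phase - approx_phase
    return torch.remainder(d + math.pi, 2 * math.pi) - math.pi

# At inference: hologram = torch.remainder(approx_phase + net(x), 2 * math.pi)
```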

Of course, it’s a legitimate avenue to consider augmenting the input image with multiple
useful “things.” For example, if approximation A captures some useful features of the target
hologram and approximation B captures other useful features, it would be very useful to
augment with both. It becomes an empirical issue as to whether additional augmentation
improves the resulting prediction or just confuses the issue.

To me, a hologram includes both phase and amplitude and both carry important information.
Generally, giving more (useful) information to a network helps it work better. (That’s why I
recommended that if you augment your input with its Fourier transform, you augment with
the amplitude as well as the phase.)

In a similar vein, if your GS (or similar adequately-inexpensive) algorithm can give you an
approximate amplitude-phase hologram whose phase channel is a good approximation to
your target phase-only hologram, you will probably do better if you augment your input with
both the amplitude of that approximate hologram and its phase.

Lastly, I understand that your goal is to reproduce the GS phase-only holograms, so that’s
what you should train on. But you can often train a network more effectively by training it to
predict more than the output you will actually be using.

So if your GS phase-only-hologram algorithm can be extended to also produce amplitude
information, it will probably be beneficial to train your network to jointly predict both. (With
U-Net or similar image-to-image model, your output would now be a two-channel image with
one channel being the phase and the other, the amplitude. Having a two-channel output adds
no substantive complication to the network architecture.) If predicting the amplitude “distracts”
from making the best prediction for the phase, you can always weight the phase prediction
more heavily in your loss function.
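
A weighted two-channel loss along these lines (a sketch; the weights are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def joint_loss(pred, target, phase_weight=1.0, amp_weight=0.3):
    # pred, target: (B, 2, H, W) with channel 0 = phase, channel 1 = amplitude.
    d = pred[:, 0] - target[:, 0]
    phase_err = torch.remainder(d + math.pi, 2 * math.pi) - math.pi
    return (phase_weight * (phase_err ** 2).mean()
            + amp_weight * F.mse_loss(pred[:, 1], target[:, 1]))
```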

The point is that when training your network to predict the amplitude, it will likely build “features”
internally that are also helpful for predicting the phase. In a sense, training your network to
also predict the amplitude will tell your network to look at this additional structural information
in the amplitude-prediction problem that is helpful for predicting the phase, but is somewhat
hidden – or perhaps even absent – in the pure phase problem.

As an analogy, consider predicting the height of a person from an image of that person. If
in addition to height data, you also have ground-truth data available for weight, you could
well believe that training to predict height and weight jointly could give you a better height
prediction than just training on the height. By training on the weight, you direct the network
to learn about various weight-centric clues in the image that are also relevant to predicting
height.

Best.

K. Frank

Sorry, let me explain further. It’s a bit difficult because I switch between Fourier space and real space. If we are in the space of the CCD camera, we have light shining on specific pixels. Using the weighted GS algorithm, the phase of these intensities is what I’m referring to, and the wGS uses those phases as additional parameters to create uniformity of the intensities. This phase is the phase of the “design” on the SLM. Say the design is a blazed grating: where the start point of this periodic design sits, that’s the phase. But there are many of these blazed gratings, so in essence the phase at the pixels is random, because you have many blazed gratings pushing the light to those specific tweezers. I hope this makes more sense :smiley:

This is exactly what I want to do! It would be really great if I can find a way to do this. The only thing is that I don’t know what phase hologram would give the lowest flickering from one phase hologram to the next. This is what I hope a network could learn: we have one phase hologram as a start point, and the network should give the best low-flicker phase for the next intermediate step, and so on.

Huh…this sounds interesting. Are you referring to setting a constraint in the GS algorithm that defines a fixed zero point?

What type of network do you think would be good for this?

Right now, just faster. If I can find something better, by whatever measure, then that would really be amazing and a big result. Of course I want the latter, but I would be happy with the former.

The wGS algorithm (wGS is much better than GS, as it uses active feedback that improves the weights of the active pixels) is a one-phase-hologram algorithm. So you can only improve the phase hologram that is actively being shown on the CCD camera. One idea I had: phase hologram 1 is the starting point, and I have the defined phases of the active pixels, which set the starting point of the phases. When I then change to phase hologram 2, the phases at the active pixels can only change within maybe ±pi/8 of the previous phase. This allows more control of the phase and less flickering. This is something I’m going to try. So let’s say this works and I get less flickering; then I can make a neural network that is faster than the wGS algorithm.
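
Something like this is what I have in mind for the clamping step (a rough numpy sketch):

```python
import numpy as np

def clamp_phase_step(new_phase, prev_phase, max_step=np.pi / 8):
    # Wrap the per-pixel change to (-pi, pi], then clip it to +/- max_step.
    diff = np.angle(np.exp(1j * (new_phase - prev_phase)))
    return np.mod(prev_phase + np.clip(diff, -max_step, max_step), 2 * np.pi)
```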

Another thing I didn’t mention haha is that in this field, rearrangement of the pixels, i.e. of the light intensities, is crucial. Let’s say we have an initial pixel design and we want the pixels to move into a diamond shape: which pixels of light move where, and in what steps, to ensure the most efficient rearrangement? Assignment algorithms such as the Hungarian algorithm and the Jonker-Volgenant algorithm are typically O(N^3), where N is the number of pixels/light intensities that will move. So if there were a way for a network to take in initial and final pixel designs and produce X intermediate phase-hologram steps that take flickering between steps into account, while staying on par with wGS phase holograms, then this would be superb. Sorry for not mentioning this before; I just simplified things, as too much info at once convolutes things. But this would be more of a dream in a sense, and getting to this point would come once the network can handle all of the preliminary parts.
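
For reference, the classical assignment step I mean looks like this with scipy (the coordinates are just illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # modified Jonker-Volgenant

# Current and desired tweezer sites, shape (N, 2).
start = np.array([[0, 0], [0, 4], [4, 0], [4, 4]], dtype=float)
goal = np.array([[2, 0], [0, 2], [4, 2], [2, 4]], dtype=float)

# Cost of moving tweezer i to site j: squared travel distance.
cost = ((start[:, None, :] - goal[None, :, :]) ** 2).sum(axis=-1)
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
# Tweezer start[rows[k]] should move to site goal[cols[k]].
```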

Yes, same shape.

Correct, and yes, I think it is best to give as input as much information about the system as possible. So putting in the phase and the amplitude, as well as the phase hologram, as inputs could be advantageous and is what I was thinking. However, it is redundant in the sense that the FT of the phase hologram is the amplitude and phase.

Yes, this is what I did in my CNN. I had the amplitude and the phase as inputs.