First, by “CNN” I assume you mean a “convolutional neural
network,” that is, a network whose first few layers consist of
convolutions. And by “naive DNN” I assume you mean a
network consisting of fully-connected layers.
Yes, your expectation is correct – a CNN will not work well
on a problem whose input doesn’t have “locality.”
Let’s say that your input consists of 1024 values, and that
the relationship between, say, inputs #7 and #505 is just as
likely to carry useful information as the relationship between
inputs #7 and #8. The constrained structure of the CNN – its
“assumption” of locality – is going to make it difficult for the
network to discover the relationship between inputs #7 and #505.
(You can implement a convolutional layer as a fully-connected
layer where the fully-connected weight matrix consists of a
bunch of copies of the convolutional block, with zeros elsewhere.
The weights that would mix inputs #7 and #505 together and
potentially discover these inputs’ relationships will be zero in
this case – so no mixing, at least in this layer.)
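This equivalence is easy to check numerically. Here is a minimal sketch in 1-d with a size-3 kernel and no padding (the sizes and the random inputs are just illustrative assumptions): the convolutional layer's output equals a matrix-vector product where each row of the weight matrix is a shifted copy of the kernel, with zeros everywhere else.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # 1-d input of 8 values
k = rng.normal(size=3)   # convolutional block (kernel) of size 3

# Convolutional layer applied directly: slide the kernel across the input.
conv_out = np.array([x[i:i + 3] @ k for i in range(6)])

# The same layer as a fully-connected weight matrix: each row is a
# shifted copy of the kernel, zero elsewhere.
W = np.zeros((6, 8))
for i in range(6):
    W[i, i:i + 3] = k

fc_out = W @ x
assert np.allclose(conv_out, fc_out)

# The weight that would mix, say, input 0 with input 7 is identically
# zero, so those two inputs cannot interact in this layer.
assert W[0, 7] == 0.0
```

Note that a general fully-connected layer would have all 48 entries of `W` free to train, while the convolutional layer has only the 3 kernel values.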
Also note that in addition to this constraint of locality, a
convolutional layer also has translational invariance. (That
is, the convolutional block is translated unchanged across
the input, which is to say, it is convolved with the input.)
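This translational structure can also be checked numerically in 1-d: because the same kernel is applied at every position, shifting the input simply shifts the feature map. (A minimal sketch, using a circular shift for simplicity and comparing away from the wrap-around boundary.)

```python
import numpy as np

def conv1d(x, k):
    """Valid cross-correlation of input x with kernel k."""
    n = len(x) - len(k) + 1
    return np.array([x[i:i + len(k)] @ k for i in range(n)])

rng = np.random.default_rng(1)
x = rng.normal(size=16)
k = rng.normal(size=3)

out = conv1d(x, k)
out_shifted = conv1d(np.roll(x, 2), k)  # translate the input by 2

# Away from the wrap-around boundary, the output is just shifted too.
assert np.allclose(out_shifted[2:], out[:-2])
```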
For example, let’s say your input is a 2-d image. We expect
that edges within the image that form potentially useful
features have more or less the same structure and meaning
in the lower left corner of the image as in the upper right.
Or we may expect to be able to detect the image of a dog
equally well in the lower left and upper right corners of the image.
If the local, translationally-invariant structure of the convolutional
layers matches the structure of the input data, then you come out
ahead. If it doesn’t match, then you come out behind, and you
are better off training fully-connected layers that can learn the
(non-local, non-translationally-invariant) structure of the input data.