88-Character String Dataset Analysis

I have a collection of about 20K strings produced by a black-box function. Each string is 88 ASCII characters long, stored one string per line in a file.

My initial thought was to convert each line in the file to its numerical ASCII values and feed that into an 88-neuron-wide input layer; for example:

>>> s = 'AQCfRmKcnAruzC0L8ree7ToDczoHj8b-S30pFqhKu2qldji8awymRepScQGB6my3HoSkXkcse6BlllLm62rnPnxw'
>>> [ord(c) for c in s]
[65, 81, 67, 102, 82, 109, 75, 99, 110, 65, 114, 117, 122, 67, 48, 76, 56, 114, 101, 101, 55, 84, 111, 68, 99, 122, 111, 72, 106, 56, 98, 45, 83, 51, 48, 112, 70, 113, 104, 75, 117, 50, 113, 108, 100, 106, 105, 56, 97, 119, 121, 109, 82, 101, 112, 83, 99, 81, 71, 66, 54, 109, 121, 51, 72, 111, 83, 107, 88, 107, 99, 115, 101, 54, 66, 108, 108, 108, 76, 109, 54, 50, 114, 110, 80, 110, 120, 119]

The first 90% of the file would be used to train the net, with the remaining 10% used as the validation set (see the sketch below). I am not sure how deep the ANN should be, but I plan on experimenting with those parameters once I get an initial proof of concept running.
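For the record, the split itself is trivial; something like this is what I have in mind (the filename is just a placeholder):

with open("strings.txt") as f:            # placeholder filename
    lines = [line.rstrip("\n") for line in f if line.strip()]

split = int(len(lines) * 0.9)             # first 90% for training
train, val = lines[:split], lines[split:]
print(len(train), len(val))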

Any ideas? Thanks in advance.

  1. I’m not sure what function you want to approximate. Is the 88-character sequence your input or your output? What should be the other one? If there is no input, are you trying to come up with a generative model for your sequences? If so, look into things like GANs and VAEs.
  2. Using the ordinal number as input is a bad idea, because the characters are not indications of some quantity. The usual way to deal with such data is one-hot encoding, since the number of ASCII characters is small.
  3. GANs, LSTMs, and MDNs are all different things. Among them, only the LSTM is an architecture. You probably don’t want to use an LSTM, since the length of the sequence is fixed. A GAN is a generative network (regardless of architecture) whose loss function is defined by a simultaneously trained discriminator network. An MDN is a way to parametrize a neural network’s output to represent a probability density function.

I’m not sure what function you want to approximate.

Neither am I, as I don’t have access to the logic that is generating the sequences.

Using the ordinal number as input is a bad idea, because the characters are not indications of some quantity. The usual way to deal with such data is one-hot encoding, since the number of ASCII characters is small.

Ok, good to know.

The usual way to deal with such data is one-hot encoding, since the number of ASCII characters is small.

So are you suggesting one-hot encoding to represent each ASCII character that is present in the total 20K dataset? E.g. if only ABCdef123 is used in the entire set, then:

A = 1 0 0 0 0 0 0 0 0
B = 0 1 0 0 0 0 0 0 0
C = 0 0 1 0 0 0 0 0 0

etc?

Thanks for your feedback.

You are correct about one-hot encoding.
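Roughly along these lines (a sketch only; I’m assuming NumPy and that the alphabet is built from whatever characters actually occur in your corpus):

import numpy as np

def build_alphabet(strings):
    # every distinct character that occurs anywhere in the corpus
    return sorted(set("".join(strings)))

def one_hot(s, alphabet):
    # (len(s) x alphabet_size) binary matrix, flattened into one input vector
    index = {c: i for i, c in enumerate(alphabet)}
    mat = np.zeros((len(s), len(alphabet)), dtype=np.float32)
    for pos, c in enumerate(s):
        mat[pos, index[c]] = 1.0
    return mat.ravel()

With your toy ABCdef123 alphabet (9 characters), each 88-character string would become an 88 * 9 = 792-dimensional input vector; with the real corpus alphabet it is 88 times however many distinct characters you find.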

Have you run any statistical analysis on the sentences in order to estimate the difficulty of the task?

For instance, the first letter seems to always be A and the second always Q; what about the 10th? Is the tenth character equally likely to be any of the possibilities?

What about bigrams (pairs of letters) and trigrams? Are some more common than others? Does that depend on the position in the sentence?
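Something like this would answer the positional question (strings here is just a stand-in name for your 20K lines read into a list):

from collections import Counter

position_counts = [Counter() for _ in range(88)]
for s in strings:
    for pos, c in enumerate(s):
        position_counts[pos][c] += 1

print(position_counts[0].most_common(5))   # presumably all 'A'
print(position_counts[9].most_common(5))   # the 10th character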

I have not performed any in-depth statistical analysis on the collection of strings beyond Shannon entropy and a chi-squared test.

The Shannon entropy score is ~5.528939470219318, and according to the benchmark I am using:

* Standard English text usually falls somewhere between 3.5 and 5.
* Properly encrypted or compressed data of a reasonable length should have an entropy of over 7.5.

So at least from a Shannon entropy perspective, there seems to be a lot of repetition in the data set.
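For reference, the number above is roughly equivalent to a character-level calculation over the whole corpus, something like this (strings again being the list of lines):

import math
from collections import Counter

def shannon_entropy(text):
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy("".join(strings)))   # ~5.53 bits per character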

What about bigrams (pairs of letters) and trigrams? Are some more common than others? Does that depend on the position in the sentence?

Are you suggesting building the one-hot encoded classes from bigrams / trigrams / quadgrams instead of the individual letters that exist in the corpus alphabet?

Did you remove the first two characters of each sequence before calculating the Shannon entropy? I don’t know how much the predictability of the first two characters would reduce the total entropy score.

I am not familiar with Gold/Kasami coding, and I wouldn’t try it unless I could find reference to anyone using those codings for sequence to sequence character or language stuff.

I can’t really make any decent recommendations for statistical analysis beyond very basic stuff, such as a scatter plot of character position against ordinal value, and perhaps counting the different bigrams and plotting their frequencies (a rough sketch of both follows).
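In code, both of those basic checks are only a few lines (again assuming a strings list; matplotlib for the scatter plot):

from collections import Counter
import matplotlib.pyplot as plt

bigrams = Counter()
for s in strings:
    for i in range(len(s) - 1):
        bigrams[s[i:i + 2]] += 1
print(bigrams.most_common(20))

sample = strings[:200]                 # subsample so the plot stays readable
xs = [pos for s in sample for pos in range(len(s))]
ys = [ord(c) for s in sample for c in s]
plt.scatter(xs, ys, s=1)
plt.xlabel("character position")
plt.ylabel("ordinal value")
plt.show()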

Did you remove the first two characters of each sequence before calculating the Shannon entropy? I don’t know how much the predictability of the first two characters would reduce the total entropy score.

It’s about the same with the first two characters removed, 5.449065917291543.

Here are the character counts for the dataset:

50299 _
50641 -
50047 0
50124 1
50589 2
50316 3
50572 4
49662 5
50656 6
50376 7
50049 8
50247 9
50219 a
97719 A
50164 b
59783 B
50132 c
59543 C
50120 d
59407 D
50594 e
49896 E
50455 f
50510 F
50187 g
50773 G
50136 h
50063 H
49745 i
50329 I
50122 j
50192 J
50172 k
50747 K
50425 l
50078 L
50375 m
50179 M
49786 n
50611 N
50463 o
50444 O
49837 p
50506 P
50438 q
88051 Q
50120 r
50138 R
49962 s
50501 S
50322 t
50234 T
50242 u
50697 U
50433 v
50557 V
49755 w
50425 W
50049 x
50122 X
49906 y
50392 Y
50334 z
50392 Z

When you want to approximate a black-box function, you need to have access to its inputs and outputs. If you don’t, the problem of approximating it is ill defined. For instance, is the function “i => the i-th string in your data” a desired function? Surely it outputs some pretty damn good results.

That is why I suggested looking into generative models, i.e. models that describe how a dataset could be generated, with the goal that the resulting model should be able to generalize. One example of such a model is a deep function that maps standard Normal random values N(0, 1) to the type of data you want to model.

It depends. It makes sense in the VAE case, but makes less sense in the GAN case.

An MDN’s number of mixture components scales with the number of outputs. The data you have is 88-dimensional and quite multimodal (approximately uniform, even, since it appears to come from a pseudo-random function). So an MDN is likely a bad fit here.
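To put a rough number on that scaling (assuming a diagonal-Gaussian MDN, where each of the K components needs a mean and a variance per output dimension plus a mixing weight):

D = 88                    # output dimensions
K = 32                    # mixture components -- an arbitrary example value
print(K * (2 * D + 1))    # 5664 output units just to parametrize the density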

One example of such a model is a deep function that maps standard Normal random values N(0, 1) to the type of data you want to model.

Hmm. What would some specific ANN architectures be?

The architecture surely matters, but usually the formulation is what matters most. The approach I described is called a Variational Autoencoder, or VAE. It uses two deep networks to represent a probabilistic model, and tries to find a high-likelihood solution by optimizing a lower bound on the likelihood. Architectural choices for these networks can vary. :)
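As a concrete starting point, here is a minimal sketch of that formulation in PyTorch (my choice of framework here, not a requirement); the layer widths, latent size, and alphabet size are placeholders, and the input is the flattened one-hot encoding discussed earlier:

import torch
import torch.nn as nn
import torch.nn.functional as F

SEQ_LEN = 88      # characters per string
ALPHABET = 64     # placeholder -- use the alphabet size found in your corpus
LATENT = 32       # dimensionality of z ~ N(0, I)

class SeqVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(SEQ_LEN * ALPHABET, 512), nn.ReLU())
        self.mu = nn.Linear(512, LATENT)
        self.logvar = nn.Linear(512, LATENT)
        self.dec = nn.Sequential(
            nn.Linear(LATENT, 512), nn.ReLU(),
            nn.Linear(512, SEQ_LEN * ALPHABET),   # one set of logits per position
        )

    def forward(self, x):                         # x: (batch, SEQ_LEN * ALPHABET)
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(logits, targets, mu, logvar):
    # targets: (batch, SEQ_LEN) integer character indices
    recon = F.cross_entropy(logits.view(-1, ALPHABET), targets.view(-1),
                            reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

After training, sampling is just drawing z from N(0, I), running the decoder, and taking the argmax character at each of the 88 positions.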