
I'm new to TensorFlow and machine learning.

My task is to predict the type of a given string input. Here's an example of the training data (with the output already one-hot encoded):

const training = [
    { x: '622-49-7314', y: [1,0,0,0] },      // "ssn"
    { x: '1234 Elm Street', y: [0,1,0,0] },  // "street-address"
    { x: '(419) 555-5555', y: [0,0,1,0] },   // "phone-number"
    { x: 'Jane Doe', y: [0,0,0,1] },         // "full-name"
    { x: 'José García', y: [0,0,0,1] },      // "full-name"
    // ... and millions more examples...
]

My first problem is how to encode the input, since this isn't a typical text/dictionary problem (a sequence of words) but rather a variable-length sequence of characters.

I've tried 3 encoding approaches for the input string:

Encoding 1, standard text embeddings:

const use = require('@tensorflow-models/universal-sentence-encoder');

async function encodeData(data) {
    // Embed each lowercased string with the Universal Sentence Encoder (512-d vectors)
    const sentences = data.map(str => str.toLowerCase());
    const model = await use.load();
    const embeddings = await model.embed(sentences);
    return embeddings;
}

Encoding 2, padded Unicode buffers and normalized exponential (softmax):

function encodeStr(str, pad = 512) {
    // Pad to a fixed length, then read the UTF-16LE bytes one by one
    // (each character contributes 2 bytes, so the output length is 2 * pad)
    let arr = Array.from(
        new Int32Array(Buffer.from(str.padEnd(pad, '\0'), 'utf16le'))
    );
    // Softmax-normalize the byte values
    const sum = arr.reduce((t, v) => t + Math.exp(v), 0);
    arr = arr.map(el => Math.exp(el) / sum);
    return arr;
}

Encoding 3, a locality-sensitive hash (Nilsimsa), whose 64-character hex digest is split into 32 byte values and softmax-normalized:

const { Nilsimsa } = require('nilsimsa');

function encodeHash(str) {
    // 256-bit Nilsimsa digest as a 64-character hex string,
    // split into 32 two-character chunks parsed as byte values
    const hash = new Nilsimsa(str).digest('hex'),
        vals = hash.split(/(?<=^(?:.{2})+)(?!$)/).map(el => parseInt(el, 16));

    // Softmax-normalize the 32 byte values
    const sum = vals.reduce((t, v) => t + Math.exp(v), 0),
        normArr = vals.map(el => Math.exp(el) / sum);
    return normArr;
}

Then I used a simple model:

const inputSz = 512; // or 128 for encodeStr, or 32 for encodeHash
const outputSz = 4;  // size of the one-hot output (potentially could be >1000)

const model = tf.sequential();

model.add(
    tf.layers.dense({
        inputShape: [inputSz],
        activation: 'softmax',
        units: outputSz
    })
);

model.add(
    tf.layers.dense({
        inputShape: [outputSz],
        activation: 'softmax',
        units: outputSz
    })
);

model.add(
    tf.layers.dense({
        inputShape: [outputSz],
        activation: 'softmax',
        units: outputSz
    })
);

model.compile({
    loss: 'meanSquaredError',
    optimizer: tf.train.adam(0.06)
});

The model is trained like this:

const trainingTensor = tf.tensor2d(data.map(_ => encodeInput(_.x)));
const [encodedOut, outputIndex, outSz] = encodeOutput(data.map(_ => _.y));
const outputData = tf.tensor2d(encodedOut);
const history = await model.fit(trainingTensor, outputData, { epochs: 50 });

But the results are all very poor, averaging loss = 0.165. I've tried different configurations of the approaches above, e.g. "softmax" and "sigmoid" activations and more or fewer dense layers, but I just can't figure it out.

  • What's the best way to encode strings that are not just text?
  • What's the correct network type and model configuration for this type of classification?

Any help or some direction here would be appreciated as I can't find good examples to base my solution on.

1 Answer

About the model

The softmax activation returns a probability (a value between 0 and 1) and is mostly used as the activation of the last layer in a classification problem; for the hidden layers, the relu activation can be used instead. Additionally, for the loss function, categoricalCrossentropy is better suited than meanSquaredError.
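For instance, a minimal sketch of those suggestions (relu on the hidden layers, softmax only on the output, categoricalCrossentropy as the loss; the hidden-layer sizes and learning rate are assumptions, not tuned values):

const tf = require('@tensorflow/tfjs');

const inputSz = 512; // as in the question
const outputSz = 4;

const model = tf.sequential();
// relu on the hidden layers; softmax only on the last (classification) layer
model.add(tf.layers.dense({ inputShape: [inputSz], units: 64, activation: 'relu' }));
model.add(tf.layers.dense({ units: 32, activation: 'relu' }));
model.add(tf.layers.dense({ units: outputSz, activation: 'softmax' }));

model.compile({
    loss: 'categoricalCrossentropy',
    optimizer: tf.train.adam(0.001),
    metrics: ['accuracy']
});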

LSTM and/or bidirectional LSTM layers can be added to the model to take the context of the data into account. If used, they should be the first layers of the model, so the context of the data isn't lost before it is passed on to the dense layers.
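A sketch of what that could look like on per-character input (maxLen and charDim are assumed values; each input string would be encoded as a [maxLen, charDim] matrix of character features):

const tf = require('@tensorflow/tfjs');

const maxLen = 64;   // assumed fixed sequence length (pad or truncate inputs)
const charDim = 16;  // assumed per-character feature size
const outputSz = 4;

const model = tf.sequential();
// The bidirectional LSTM comes first so the dense layers receive
// sequence context instead of a flattened input.
model.add(tf.layers.bidirectional({
    layer: tf.layers.lstm({ units: 32 }),
    inputShape: [maxLen, charDim]
}));
model.add(tf.layers.dense({ units: outputSz, activation: 'softmax' }));

model.compile({ loss: 'categoricalCrossentropy', optimizer: tf.train.adam(0.001) });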

About the encoding

Since Nilsimsa is an algorithmic technique that hashes similar input items into the same "buckets" with high probability, it can also be used for clustering and text classification, though I haven't used it myself.

The first encoding (the Universal Sentence Encoder) tries to preserve the distance between words when creating tokens from the sentence.

Encoding the data as binary is less common in NLP. However, in this case, since the classifier essentially needs to figure out whether digits appear among the text to determine the label, a binary encoding can produce tensors where the Euclidean distance between inputs of different labels is high.
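To illustrate: digits (code points 48-57) and lowercase letters (97-122) sit in disjoint Unicode ranges, so a quick check in plain Node shows why digit-bearing inputs end up far from purely alphabetic ones:

console.log('1'.charCodeAt(0)); // 49
console.log('a'.charCodeAt(0)); // 97
console.log('Jane Doe'.split('').map(c => c.charCodeAt(0)));
// [ 74, 97, 110, 101, 32, 68, 111, 101 ]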

Last but not least, another criterion for comparing the encodings is the time it takes to create the tensors from the input string.

4 Comments

After lots of trial and error, I got to >99.9% accuracy for most datasets with a simple model: just one layer with activation: "sigmoid", loss: "meanSquaredError" and optimizer: tf.train.adam(0.02). The encoding worked best by converting each char to an array of its 16 Unicode bits and flattening into a single tensor, i.e. tf.tensor1d([ 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* char=A */ 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 /* char=B */ ]) (see the sketch after these comments). No other configuration worked for the model or encoding.
Okay, so the model just needs to figure out whether or not there is a number in the sentence. And the distance between the Unicode characters of letters and digits is higher than between digits alone or letters alone. I will edit the answer.
Actually, your proposed model worked to an average of 97% accuracy, which is not bad at all.
I updated my answer to take into account the fact that the binary encoding might give better results in your case.
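
For reference, a minimal sketch of the bit-level encoding described in the first comment (the helper name encodeBits and the fixed maxLen are assumptions; bits are emitted least-significant-first to match the char=A example above):

// Convert each character to its 16 Unicode bits and flatten; pad or
// truncate to a fixed number of characters for a constant input size.
function encodeBits(str, maxLen = 32) {
    const bits = [];
    for (let i = 0; i < maxLen; i++) {
        const code = i < str.length ? str.charCodeAt(i) : 0;
        for (let b = 0; b < 16; b++) {
            bits.push((code >> b) & 1);
        }
    }
    return bits; // length is maxLen * 16
}

// e.g. const xs = tf.tensor2d(training.map(t => encodeBits(t.x)));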
