I'm new to Tensorflow and machine learning.
My task is to predict the type of a given string input. Here's an example of the training data (with the output already one-hot encoded):
const training = [
  { x: '622-49-7314',     y: [1,0,0,0] }, // "ssn"
  { x: '1234 Elm Street', y: [0,1,0,0] }, // "street-address"
  { x: '(419) 555-5555',  y: [0,0,1,0] }, // "phone-number"
  { x: 'Jane Doe',        y: [0,0,0,1] }, // "full-name"
  { x: 'José García',     y: [0,0,0,1] }, // "full-name"
  // ... and millions more examples...
];
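For clarity, the one-hot positions map back to labels like this (the `LABELS` array and `decode` helper are just illustrative, not part of my pipeline):
const LABELS = ['ssn', 'street-address', 'phone-number', 'full-name'];

// Map a one-hot (or softmax) output vector back to its label name.
const decode = y => LABELS[y.indexOf(Math.max(...y))];

decode([0, 0, 1, 0]); // 'phone-number'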
My first problem is how to encode the input, since this isn't a typical text/dictionary problem (a sequence of words) but rather a variable-length sequence of characters.
I've tried 3 encoding approaches for the input string:
Encoding 1, standard text embeddings (the Universal Sentence Encoder):
const use = require('@tensorflow-models/universal-sentence-encoder');

async function encodeData(data) {
  const sentences = data.map(str => str.toLowerCase());
  const model = await use.load();
  // Tensor of shape [sentences.length, 512].
  const embeddings = await model.embed(sentences);
  return embeddings;
}
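Calling it on the sample above gives a [5, 512] tensor, since USE always produces a 512-dimensional embedding per string:
const embeddings = await encodeData(training.map(d => d.x));
console.log(embeddings.shape); // [5, 512]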
Encoding 2, padded Unicode code units with a normalized exponential (softmax):
function encodeStr(str, pad = 512) {
  const buf = Buffer.from(str.padEnd(pad, '\0'), 'utf16le');
  // View the bytes as 16-bit code units so each element is one character
  // (Int32Array over the raw Buffer would enumerate single bytes instead).
  const arr = Array.from(new Uint16Array(buf.buffer, buf.byteOffset, pad));
  // Softmax; subtract the max so Math.exp never overflows to Infinity.
  const max = Math.max(...arr);
  const exps = arr.map(el => Math.exp(el - max));
  const sum = exps.reduce((t, v) => t + v, 0);
  return exps.map(el => el / sum);
}
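I call it with pad = 128 (hence the 128 in inputSz below), e.g.:
const v = encodeStr('(419) 555-5555', 128);
console.log(v.length);                     // 128
console.log(v.reduce((t, x) => t + x, 0)); // ≈ 1 (softmax output sums to 1)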
Encoding 3, a locality-sensitive hash (Nilsimsa), whose 64-character hex digest I split into 32 byte values and normalize with softmax:
const { Nilsimsa } = require('nilsimsa');

function encodeHash(str) {
  // 256-bit digest → 64 hex characters → 32 byte values (0–255).
  const hash = new Nilsimsa(str).digest('hex');
  const vals = hash.match(/.{2}/g).map(el => parseInt(el, 16));
  // Softmax; subtract the max for numerical stability.
  const max = Math.max(...vals);
  const exps = vals.map(el => Math.exp(el - max));
  const sum = exps.reduce((t, v) => t + v, 0);
  return exps.map(el => el / sum);
}
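So every string becomes a fixed 32-value vector:
const h = encodeHash('Jane Doe');
console.log(h.length); // 32 — matches the inputSz of 32 below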
Then I used a simple model:
const inputSz = 512; // 512 for the USE embeddings; 128 for encodeStr (pad = 128); 32 for encodeHash
const outputSz = 4;  // width of the one-hot vectors, e.g. [0,0,0,1] (could grow to >1000 classes)
const tf = require('@tensorflow/tfjs-node'); // or '@tensorflow/tfjs'

const model = tf.sequential();
model.add(
  tf.layers.dense({
    inputShape: [inputSz], // only the first layer needs an inputShape
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    activation: 'softmax',
    units: outputSz
  })
);
model.add(
  tf.layers.dense({
    activation: 'softmax',
    units: outputSz
  })
);
model.compile({
  loss: 'meanSquaredError',
  optimizer: tf.train.adam(0.06)
});
Which is trained like this (encodeInput stands for whichever of the encoders above I'm testing):
const trainingTensor = tf.tensor2d(data.map(d => encodeInput(d.x)));
const outputData = tf.tensor2d(data.map(d => d.y)); // y is already one-hot encoded
const history = await model.fit(trainingTensor, outputData, { epochs: 50 });
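The shapes check out before fitting (numExamples is just a placeholder for the row count):
console.log(trainingTensor.shape); // [numExamples, inputSz]
console.log(outputData.shape);     // [numExamples, outputSz]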
But the results are all very poor, with loss plateauing around 0.165. I've tried different configurations of the approaches above, i.e. 'softmax' vs. 'sigmoid' activations and more or fewer dense layers, but I just can't figure it out.
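For example, one such variant just swaps the hidden activations (a sketch of the first-layer change only):
// Same stack as above, but with sigmoid instead of softmax in the layers:
model.add(tf.layers.dense({ inputShape: [inputSz], units: outputSz, activation: 'sigmoid' }));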
- What's the best way to encode strings that are not just text?
- What's the correct network type and model configuration for this type of classification?
Any help or direction here would be appreciated, as I can't find good examples to base a solution on.