Tensorflow - String processing in Dataset API

Question

I have .txt files in a directory of format <text>\t<label>. I am using the TextLineDataset API to consume these text records:

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]

dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)

dataset = dataset.flat_map(
    lambda filename: (
        tf.contrib.data.TextLineDataset(filename)
        .map(_parse_data)))

def _parse_data(line):   
    line_split = tf.string_split([line], '\t')
    features = {"raw_text": tf.string(line_split.values[0].strip().lower()),
                "label": tf.string_to_number(line_split.values[1], 
                    out_type=tf.int32)}
    parsed_features = tf.parse_single_example(line, features)
    return parsed_features["raw_text"], raw_features["label"]

I would like to do some string cleaning/processing on the raw_text feature. When I try to run line_split.values[0].strip().lower(), I get the following error:

AttributeError: 'Tensor' object has no attribute 'strip'

mrry · Accepted Answer · 2017-10-31 16:11:57Z

The object line_split.values[0] is a tf.Tensor object representing the 0th split from line. It is not a Python string, so it does not have a .strip() or .lower() method. Instead, you will have to apply TensorFlow operations to the tensor to perform the conversion.

TensorFlow currently doesn't have very many string operations, but you can use the tf.py_func() op to run some Python code on a tf.Tensor:

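import tensorflow as tf

def _parse_data(line):
    # Split the "<text>\t<label>" record on the tab character.
    line_split = tf.string_split([line], '\t')

    # tf.py_func expects its inputs as a list of tensors; the lambda runs as
    # ordinary Python code on the value of the tensor.
    raw_text = tf.py_func(
        lambda x: x.strip().lower(), [line_split.values[0]], tf.string)

    label = tf.string_to_number(line_split.values[1], out_type=tf.int32)

    return {"raw_text": raw_text, "label": label}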
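The function wrapped by tf.py_func runs as plain Python on the tensor's value, not on a tf.Tensor. For a scalar tf.string tensor that value is expected to arrive as a bytes object, which is why .strip() and .lower() are available inside the lambda. A minimal sketch of the same cleaning step as a named function that can be checked without TensorFlow (clean_text is a hypothetical name, not part of the answer):

```python
def clean_text(raw):
    """Strip surrounding whitespace and lowercase one record's text.

    Assumption: `raw` is a bytes object, as tf.py_func is expected to pass
    for a scalar tf.string tensor. A str is also accepted for convenience.
    Note that bytes.lower() only lowercases ASCII characters.
    """
    if isinstance(raw, str):
        raw = raw.encode("utf-8")
    return raw.strip().lower()


# Quick check in plain Python, outside of any TensorFlow graph.
print(clean_text(b"  Hello World \n"))
```

Under that assumption, the helper could be passed in place of the lambda, for example tf.py_func(clean_text, [line_split.values[0]], tf.string).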

Note that there are a couple of other problems with the code in the question:

Don't use tf.parse_single_example(). This op is only used for parsing tf.train.Example protocol buffer strings; you do not need to use it when parsing text, and you can return the extracted features directly from _parse_data().
Use dataset.map() instead of dataset.flat_map(). You only need to use flat_map() when the result of your mapping function is a Dataset object (and hence the return values need to be flattened into a single dataset). You must use map() when the result is one or more tf.Tensor objects.
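The difference between the two methods can be illustrated with a plain-Python analogy. This is only a sketch of the semantics using lists, not the Dataset API itself; dataset_map, dataset_flat_map, and the file contents are hypothetical:

```python
from itertools import chain


def dataset_map(elements, fn):
    """Analogy for map(): fn returns exactly one result per input element."""
    return [fn(e) for e in elements]


def dataset_flat_map(elements, fn):
    """Analogy for flat_map(): fn returns a sequence per input element,
    and the sequences are concatenated into one flat result."""
    return list(chain.from_iterable(fn(e) for e in elements))


# Hypothetical file contents standing in for real text files.
files = {"file1.txt": ["a\t1", "b\t0"], "file2.txt": ["c\t1"]}

# Each filename expands to many lines, so the results need flattening.
lines = dataset_flat_map(files, lambda name: files[name])

# Each line becomes exactly one parsed record, so a plain map is enough.
parsed = dataset_map(lines, lambda line: tuple(line.split("\t")))
```

In this analogy, a filename that expands to several lines corresponds to a function returning a Dataset, while parsing a single line corresponds to a function returning tensors.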

2 Comments

Brian Over a year ago

Thank you for providing some clarity here. I had been trying to use py_func but was encountering some errors. Your code works for me. Also, I've decided to convert my .txt data to TFRecord format. For future reference, should I use Python to integerize my data before using TensorFlow, or are there good patterns for doing all of this in tf? Right now, I use Python with a VocabularyProcessor initialized with my self-built CategoricalVocabulary.

javidcf Over a year ago

For anyone coming across this answer, note that a fair number of string operations have been recently introduced in TensorFlow under the namespace tf.strings. The transformation in this question in particular is still a bit complicated, because regex_replace, based on re2, does not support converting to lowercase, but at least you can do the stripping part.
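As a follow-up to the comment above: later TensorFlow releases also provide tf.strings.lower, so the whole transformation can be written without tf.py_func. The following is a minimal sketch assuming TensorFlow 2.x with eager execution; parse_line and the sample lines are hypothetical and stand in for the records read by TextLineDataset:

```python
import tensorflow as tf


def parse_line(line):
    """Parse one "<text>\t<label>" record using only tf.strings ops."""
    # For a scalar input, tf.strings.split returns a 1-D string tensor.
    parts = tf.strings.split(line, sep="\t", maxsplit=1)
    # Strip surrounding whitespace, then lowercase the text.
    text = tf.strings.lower(tf.strings.strip(parts[0]))
    label = tf.strings.to_number(parts[1], out_type=tf.int32)
    return {"raw_text": text, "label": label}


# Hypothetical in-memory lines used in place of real files.
lines = tf.data.Dataset.from_tensor_slices(["  Hello World \t1", "FOO bar\t0"])
records = list(lines.map(parse_line).as_numpy_iterator())
```

Each element of records is a dict holding the cleaned text as bytes and the label as a 32-bit integer.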
