
I have .txt files in a directory, where each line has the format <text>\t<label>. I am using the TextLineDataset API to consume these text records:

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]

dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)

dataset = dataset.flat_map(
    lambda filename: (
        tf.contrib.data.TextLineDataset(filename)
        .map(_parse_data)))

def _parse_data(line):   
    line_split = tf.string_split([line], '\t')
    features = {"raw_text": tf.string(line_split.values[0].strip().lower()),
                "label": tf.string_to_number(line_split.values[1], 
                    out_type=tf.int32)}
    parsed_features = tf.parse_single_example(line, features)
    return parsed_features["raw_text"], raw_features["label"]

I would like to do some string cleaning/processing on the raw_text feature. When I try to run line_split.values[0].strip().lower(), I get the following error:

AttributeError: 'Tensor' object has no attribute 'strip'

1 Answer


The object line_split.values[0] is a tf.Tensor object representing the 0th split from line. It is not a Python string, so it does not have a .strip() or .lower() method. Instead, you will have to apply TensorFlow operations to the tensor to perform the conversion.

TensorFlow currently doesn't have very many string operations, but you can use the tf.py_func() op to run some Python code on a tf.Tensor:

def _parse_data(line):
    # Split "<text>\t<label>" into its two fields.
    line_split = tf.string_split([line], '\t')

    # tf.py_func() takes its inputs as a list of tensors.
    raw_text = tf.py_func(
        lambda x: x.strip().lower(), [line_split.values[0]], tf.string)

    label = tf.string_to_number(line_split.values[1], out_type=tf.int32)

    return {"raw_text": raw_text, "label": label}
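If the cleaning logic grows beyond a one-line lambda, it may be easier to move it into a named function and pass that to tf.py_func(). The sketch below is plain Python and assumes that tf.py_func() hands a scalar string tensor to the function as a bytes object, which is worth verifying on your TensorFlow version. The name clean_text is hypothetical.

```python
def clean_text(raw):
    """Strip surrounding whitespace and lowercase one text field.

    Assumption: the value arrives as UTF-8 encoded bytes (as from a string
    tensor); a plain str is also accepted so the function is easy to test.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    # Return bytes, since string tensors hold byte strings.
    return raw.strip().lower().encode("utf-8")
```

It could then stand in for the lambda, for example tf.py_func(clean_text, [line_split.values[0]], tf.string). Decoding first also makes the lowercasing Unicode-aware, whereas bytes.lower() only changes ASCII letters.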

Note that there are a couple of other problems with the code in the question:

  • Don't use tf.parse_single_example(). This op is only used for parsing tf.train.Example protocol buffer strings; you do not need to use it when parsing text, and you can return the extracted features directly from _parse_data().
  • Use dataset.map() instead of dataset.flat_map(). You only need to use flat_map() when the result of your mapping function is a Dataset object (and hence the return values need to be flattened into a single dataset). You must use map() when the result is one or more tf.Tensor objects.
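The difference between the two can be illustrated with a plain-Python analogy. This is not TensorFlow code: map_elements and flat_map_elements are hypothetical helpers that only mimic the shape of what Dataset.map() and Dataset.flat_map() do.

```python
from itertools import chain

def map_elements(fn, elements):
    # One output per input element, like Dataset.map().
    return [fn(e) for e in elements]

def flat_map_elements(fn, elements):
    # fn returns a whole sequence per element; the sequences are
    # concatenated into one flat result, like Dataset.flat_map().
    return list(chain.from_iterable(fn(e) for e in elements))

lines = ["Good movie\t1", "Bad movie\t0"]

# map: each line becomes exactly one (text, label) pair.
pairs = map_elements(lambda l: tuple(l.split("\t")), lines)

# flat_map: each line becomes a sequence of words, flattened together.
words = flat_map_elements(lambda l: l.split("\t")[0].split(), lines)
```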

2 Comments

Thank you for providing some clarity here. I had been trying to use py_func but was encountering some errors. Your code works for me. Also, I've decided to convert my .txt data to TFRecord format. For future reference, should I use Python to integerize my data before using TensorFlow, or are there good patterns for doing all of this in tf? Right now, I use Python with a VocabularyProcessor initialized with my self-built CategoricalVocabulary.
For anyone coming across this answer, note that a fair number of string operations have been recently introduced in TensorFlow under the namespace tf.strings. The transformation in this question in particular is still a bit complicated, because regex_replace, based on re2, does not support converting to lowercase, but at least you can do the stripping part.
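In more recent TensorFlow releases, tf.strings also includes a lower() op, so the whole transformation can be sketched without tf.py_func(). The following is a minimal sketch that assumes TensorFlow 2.x with eager execution. The in-memory lines stand in for the .txt files, and the name parse_line is illustrative.

```python
import tensorflow as tf

def parse_line(line):
    # Split "<text>\t<label>" into its two fields.
    fields = tf.strings.split(line, sep="\t")
    # Strip surrounding whitespace, then lowercase the text field.
    raw_text = tf.strings.lower(tf.strings.strip(fields[0]))
    label = tf.strings.to_number(fields[1], out_type=tf.int32)
    return {"raw_text": raw_text, "label": label}

# In-memory lines used here in place of files on disk.
lines = tf.data.Dataset.from_tensor_slices(["  Hello World \t1", "FOO\t0"])
dataset = lines.map(parse_line)
```

To read from files instead, tf.data.TextLineDataset(filenames).map(parse_line) should follow the same pattern.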
