tfrecord write results in no data but no error #46

@dennisobrien

Hi -- I am trying to use spark-tfrecord with Spark 3.1.2, but the files written have no data.

  • Spark 3.1.2
  • Python 3.8.10
  • Java 1.8.0
  • Scala 2.12.10

I'm using the latest version available from the Maven repository:

<dependency>
    <groupId>com.linkedin.sparktfrecord</groupId>
    <artifactId>spark-tfrecord_2.12</artifactId>
    <version>0.3.4</version>
</dependency>

Following the pyspark example from the README but simplified further:

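# Assumes `spark` is an active SparkSession, as in the pyspark shell.
from pyspark.sql.types import FloatType, IntegerType, StringType, StructField, StructType

path = "/tmp/test-output.tfrecord"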

fields = [
    StructField("a", IntegerType()),
    StructField("b", FloatType()),
    StructField("c", StringType()),
]
schema = StructType(fields)
test_rows = [
    [1, 0.5, 'x'],
    [2, 1.5, 'y'],
    [3, 2.5, 'z'],
]
rdd = spark.sparkContext.parallelize(test_rows)
df = spark.createDataFrame(rdd, schema)
df.show()

Outputs:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|0.5|  x|
|  2|1.5|  y|
|  3|2.5|  z|
+---+---+---+

Saving the spark dataframe to tfrecord does not throw an error.

path = "/tmp/test-output.tfrecord/"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(path)

But the directory contains only a _SUCCESS flag and its .crc file, with no data files.

ls -la /tmp/test-output.tfrecord/
total 12
drwxr-xr-x.  2 build build 4096 Feb 19 19:00 .
drwxrwxrwx. 11 root  root  4096 Feb 19 19:00 ..
-rw-r--r--.  1 build build    0 Feb 19 19:00 _SUCCESS
-rw-r--r--.  1 build build    8 Feb 19 19:00 ._SUCCESS.crc
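
For anyone trying to reproduce this, the check above can be scripted. The following is only a minimal sketch using the Python standard library: `list_data_files` is a hypothetical helper name (not part of spark-tfrecord or Spark), and it assumes the output directory is on the local filesystem of the machine running the script.

```python
from pathlib import Path


def list_data_files(output_dir):
    """Return sorted (name, size_in_bytes) pairs for data files in output_dir.

    Hypothetical helper for illustration. It skips Spark's bookkeeping files:
    names starting with "_" (such as _SUCCESS) and names starting with "."
    (such as the ._SUCCESS.crc checksum file). A missing directory yields [].
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(
        (p.name, p.stat().st_size)
        for p in root.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
```

With the listing shown above, `list_data_files("/tmp/test-output.tfrecord/")` returns an empty list, since the only entries are `_SUCCESS` and `._SUCCESS.crc`.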

And of course, trying to read the file fails.

spark.read.format('tfrecord').option('recordType', 'Example').load(path).show()

Error:

AnalysisException: Unable to infer schema for TFRECORD. It must be specified manually.

Let me know if there is more system/config information that could help to debug this.

FWIW, I saw exactly the same behavior when testing spark-tensorflow-connector, which I built from source. I assumed something was wrong with my dependencies and decided to try this project instead.

thanks,
Dennis
