How to avoid the aggregate (shuffle) step when processing TFRecord files? #201

I have a very large TFRecord directory and need to filter it by a column to generate new TFRecord files.

The code looks like this:
![image](https://user-images.githubusercontent.com/4220102/209424378-1a364259-7ca5-4793-a5ba-39405fff389d.png)

When I run it on a Spark cluster, the job runs in two stages:
![image](https://user-images.githubusercontent.com/4220102/209424413-a4532044-c9e0-4d07-a98c-4299fbede776.png)

I checked the code in ```https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L39``` and found that schema inference runs an aggregate step over the data.

Can I avoid it?
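
One thing that might avoid the extra stage, if I read the connector correctly: the aggregate belongs to schema inference, so passing the schema explicitly to the reader should let Spark skip that pass. Below is a minimal PySpark sketch of that idea. The column names and types in `SCHEMA_DDL`, the helper name `filter_tfrecords`, and the paths and filter condition are all hypothetical placeholders, and I have not confirmed that this removes the stage on every connector version.

```python
# Hypothetical schema: replace the column names and types with the real
# features stored in the TFRecord files.
SCHEMA_DDL = "id BIGINT, label BIGINT, score FLOAT"


def filter_tfrecords(spark, src, dst, condition):
    """Read TFRecords with an explicit schema, filter rows, write new TFRecords.

    spark     -- an existing SparkSession with spark-tensorflow-connector on the classpath
    src, dst  -- input and output directories (placeholders)
    condition -- a SQL filter expression, e.g. "label = 1"
    """
    df = (
        spark.read.format("tfrecords")
        .schema(SCHEMA_DDL)               # user-supplied schema, so no inference is requested
        .option("recordType", "Example")
        .load(src)
    )
    (
        df.filter(condition)
        .write.format("tfrecords")
        .option("recordType", "Example")
        .save(dst)
    )
```

Usage would be something like `filter_tfrecords(spark, "/path/in", "/path/out", "label = 1")`. The trade-off is that the schema must be known up front and kept in sync with the data; a mismatch would only surface when the records are actually read.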

