Build #5,031

License check with RAT and Checkstyle

Build: #5031 was successful Changes by albertshau <>

Build result summary


5 minutes
b1f303d704750c8a8fad879020e8a874ea4d8eff b1f303d704750c8a8fad879020e8a874ea4d8eff
Successful since
#4616 ()

Code commits

Author Commit Message Commit date
albertshau <> albertshau <> b1f303d704750c8a8fad879020e8a874ea4d8eff b1f303d704750c8a8fad879020e8a874ea4d8eff Merge pull request #12558 from cdapio/feature/CDAP-17078-spark-stage-consolidation
CDAP-17078 consolidate stages within a group
albertshau <> albertshau <> 7e4f1e1fe9cf01b9fa3efeb05b3c76865a0633bb m 7e4f1e1fe9cf01b9fa3efeb05b3c76865a0633bb CDAP-17078 consolidate stages within a group
Changed the SparkPipelineRunner to use a CombinerDag to group
sinks and their preceding transforms together. These grouped
stages are treated similarly to how a single sink is treated,
with flatMapToPair() called on the input RDD to transform it
into a PairRDD, then calling save() to write the RDD out.
This capability is off by default, but can be turned on by
setting a runtime argument.

Instead of flatMapToPair() calling just the sink's transform
method, a new MultiSinkFunction class is used to direct incoming
records to the correct logical branches of the pipeline.
This requires that each input be tagged with which stage it
came from (stage and port), as well as its type (output, or error).
In order to do this, refactored the SparkPipelineRunner a bit
to maintain the RDD<RecordInfo> for each stage rather than
RDD<StructuredRecord>, as the RecordInfo class contains that
extra information.

Also added a MultiOutputFormat that will take the output of the
MultiSinkFunction and delegate writes to the correct underlying
OutputFormat. Since the OutputFormat lives in the pipeline
app, this approach means CDAP datasets cannot be combined.
This caused a problem with dataset lineage, since it is
implemented by implemented by wrapping OutputFormats into a hidden
ExternalDataset class in CDAP. Instead of doing this indirect
wrapping, changed the SparkSinkFactory class to explicitly
register lineage through direct calls instead of hiding it
under several layers of abstraction.

JIRA issues

Unknown Issue TypeCDAP-17078Could not obtain issue details from JIRA