albertshau <ashau@google.com>: Author Summary

Builds triggered by albertshau <ashau@google.com>

Builds triggered by an author are those builds that contain changes committed by the author.
1352
368 (27%)
984 (73%)

Breakages and fixes

Broken means the build failed but the previous build was successful.
Fixed means the build was successful but the previous build had failed.
Broken: 115 (9% of all builds triggered)
Fixed: 113 (8% of all builds triggered)
Difference (fixed minus broken): -2
Build Completed Code commits Tests
IT › ITM › #239 2 weeks ago
Merge pull request #1061 from cdapio/try-ordering-gcp-tests
explicitly list tests in the GCP suite
Merge pull request #1040 from cdapio/remove-metric-assertions-datastore
remove datastore metric assertions
Merge pull request #1055 from cdapio/bugfix/CDAP-16430-fix-hub-plugin-deploy
CDAP-16430 fix deploy hub plugin
increase dataproc cluster size
This fixes an issue where some pipelines in some environments
are stuck forever with MapReduce jobs that stay in pending state.
remove datastore metric assertions
Checking metrics is very unreliable due to race conditions. The
assertions can fail incorrectly because the metrics are not polled
for long enough, and they can pass incorrectly because the number
being checked is too low.
fix integration tests
Fix the google library dependencies to versions that all use
compatible dependencies, and remove the direct gax-grpc dependency
that was causing Spanner to fail.

Also use macros in the BQ tests, both to test macros and to
work around an issue when the CDAP instance is running on a GCE
VM.
Merge pull request #1035 from cdapio/remove-validator-test-6-1
remove validator test
make dataproc worker resources configurable
Add a way to modify the dataproc worker cpus and memory through
passing in system properties. Also set the default to a higher
value, since some pipelines seem to be stuck waiting for
resources.
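A minimal sketch of reading worker resources from system properties, as the commit above describes. The property names and default values here are hypothetical; the commit message does not state the real ones.

```java
// Hypothetical sketch: the property names and defaults below are assumptions,
// not the ones used by the actual integration tests.
final class WorkerResourceSketch {
  static final String CPUS_PROPERTY = "example.dataproc.worker.cpus";
  static final String MEMORY_MB_PROPERTY = "example.dataproc.worker.memory.mb";

  private WorkerResourceSketch() { }

  // Integer.getInteger reads a system property and falls back to the default
  // when the property is missing or is not a valid integer.
  static int workerCpus() {
    return Integer.getInteger(CPUS_PROPERTY, 4);
  }

  static int workerMemoryMb() {
    return Integer.getInteger(MEMORY_MB_PROPERTY, 15360);
  }
}
```

With this shape, a run could override the default by passing something like -Dexample.dataproc.worker.cpus=8 to the JVM.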
CDAP-16430 fix deploy hub plugin
Fix the deploy hub plugin method to use the public CDAP REST APIs
instead of some UI endpoint that is subject to change and which is
currently broken.
don't wait for status before starting
Merge branch 'release/6.1' into merge-release-6-1
Merge pull request #1062 from cdapio/increase-cluster-size
increase dataproc cluster size
Merge pull request #1056 from cdapio/merge-release-6-1
Merge release 6 1
CDAP-15804 use macros in tests for gcp plugins
Enhanced bigquery, datastore, and spanner tests to use macros.
disable yarn pmem check
CDAP services are getting killed by yarn due to physical memory
constraints. Disable the pmem check to avoid this.
Merge pull request #1063 from cdapio/fix-integration-tests
fix integration tests
Merge pull request #1036 from cdapio/CDAP-15804-gcp-test-macros
CDAP-15804 use macros in tests for gcp plugins
Merge pull request #1037 from cdapio/disable-yarn-pmem-check
disable yarn pmem check
remove validator test
the validator plugin was removed, so removing the test as well
Merge pull request #1060 from cdapio/increase-time-between-polls
CDAP-16454 increase time between status polls
CDAP-16429 log a warning if pipeline cleanup fails
The pubsub test was flaky because it would occasionally get a 400
error that the pipeline is not running when the stop call is made.
Changed the test to just log a warning in this scenario since the
test was actually successful. CDAP may have some underlying bugs
causing this behavior though.
Merge pull request #1068 from cdapio/bugfix/dlp-test
CDAP-16604 fix dlp test
CDAP-16454 increase time between status polls
Increase the time between program status polls, since most programs
take minutes to run and don't need to be polled every 50 milliseconds.
This also greatly reduces the amount of logging, making it
easier to investigate failures.
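As a rough illustration of the polling change in CDAP-16454, the sketch below waits for a status with a configurable pause between polls. StatusPoller and waitForStatus are hypothetical names, and the code is not taken from the CDAP test utilities.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not the actual CDAP test utility.
final class StatusPoller {
  private StatusPoller() { }

  // Polls the supplier until it returns the expected status or the timeout
  // elapses. Returns the number of polls that were made.
  static int waitForStatus(Supplier<String> statusSupplier, String expected,
                           long timeoutMillis, long pollIntervalMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    int polls = 0;
    while (true) {
      polls++;
      if (expected.equals(statusSupplier.get())) {
        return polls;
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new IllegalStateException(
            "Status was not " + expected + " after " + polls + " polls");
      }
      try {
        // A longer pause means fewer polls and fewer log lines per run.
        Thread.sleep(pollIntervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for " + expected, e);
      }
    }
  }
}
```

For a program that runs for minutes, a poll interval of a second or more makes far fewer calls than a 50 millisecond interval, which is the effect the commit describes.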
Merge pull request #1058 from cdapio/increase-dataproc-resources
make dataproc worker resources configurable
CDAP-16604 fix dlp test
Fixed the dlp test, which was incorrectly counting an empty file
as an output line of an empty string.
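The commit message does not say how the miscount arose. One common way this happens in Java is that splitting an empty string yields a single empty element. The sketch below shows that behavior and a guard against it; OutputLineCounter and its methods are hypothetical and are not the test's actual code.

```java
// Hypothetical illustration, not the actual DLP test code.
final class OutputLineCounter {
  private OutputLineCounter() { }

  // Naive count: "".split("\n") returns one empty element,
  // so an empty file is counted as one line.
  static int naiveCount(String fileContent) {
    return fileContent.split("\n").length;
  }

  // Guarded count: an empty file has zero output lines.
  static int countLines(String fileContent) {
    return fileContent.isEmpty() ? 0 : fileContent.split("\n").length;
  }
}
```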
Merge pull request #1053 from cdapio/bugfix/CDAP-16429-fix-pubsub-stop-flakiness
CDAP-16429 log a warning if pipeline cleanup fails
explicitly list tests in the GCP suite
Explicitly listing out tests in the GCP suite to control the
order and see if some failures can be explained by ordering.
45 passed
IT › UPD2 › #647 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
Testless build
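One reading of the CDAP-17237 fix above is an ordering problem: clearing the configuration before adding sink properties discards the cluster defaults. The sketch below illustrates that reading, with plain maps standing in for a Hadoop configuration. All names are hypothetical, and the real code may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Plain maps stand in for a Hadoop Configuration; all names are hypothetical.
final class SinkConfSketch {
  private SinkConfSketch() { }

  // Problematic ordering: clearing first wipes the cluster defaults.
  static Map<String, String> clearThenAdd(Map<String, String> clusterConf,
                                          Map<String, String> sinkProps) {
    Map<String, String> conf = new HashMap<>(clusterConf);
    conf.clear();
    conf.putAll(sinkProps);
    return conf;
  }

  // Corrected ordering: keep the defaults and layer the sink properties on top.
  static Map<String, String> addOnTopOfDefaults(Map<String, String> clusterConf,
                                                Map<String, String> sinkProps) {
    Map<String, String> conf = new HashMap<>(clusterConf);
    conf.putAll(sinkProps);
    return conf;
  }
}
```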
CDAP › URUT › #1124 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
Testless build
CDAP › DUT › #3061 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
2356 passed
CDAP › RUT › #1246 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
2839 passed
CDAP › UDUT › #1116 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
Testless build
CDAP › DRC › #5083 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
Testless build
IT › UPD2 › #627 2 months ago
Merge pull request #12362 from cdapio/bugfix/CDAP-16760-set-resources-for-sparkprogram-plugins
CDAP-16760 set resources for sparkprogram plugins
CDAP-16760 set resources for sparkprogram plugins
also set properties in the engine config.
Testless build
CDAP › RUT › #1236 2 months ago
CDAP-16760 set resources for sparkprogram plugins
also set properties in the engine config.
Merge pull request #12362 from cdapio/bugfix/CDAP-16760-set-resources-for-sparkprogram-plugins
CDAP-16760 set resources for sparkprogram plugins
2839 passed
CDAP › URUT › #1108 2 months ago
CDAP-16760 set resources for sparkprogram plugins
also set properties in the engine config.
Merge pull request #12362 from cdapio/bugfix/CDAP-16760-set-resources-for-sparkprogram-plugins
CDAP-16760 set resources for sparkprogram plugins
Testless build
Build Completed Code commits Tests
CDAP › DUT › #3061 1 month ago
Merge pull request #12763 from cdapio/bugfix/CDAP-17237-fix-pipeline-hconf
CDAP-17237 fix pipeline hconf clearing
CDAP-17237 fix pipeline hconf clearing
Fixed a bug where the hadoop conf is cleared before adding sink
specific properties. This ensures that cluster specific defaults
are correctly included in the conf instead of being wiped.
2356 passed
CDAP › DUT › #3022 2 months ago
Merge pull request #12558 from cdapio/feature/CDAP-17078-spark-stage-consolidation
CDAP-17078 consolidate stages within a group
CDAP-17078 consolidate stages within a group
Changed the SparkPipelineRunner to use a CombinerDag to group
sinks and their preceding transforms together. These grouped
stages are treated similarly to how a single sink is treated,
with flatMapToPair() called on the input RDD to transform it
into a PairRDD, then calling save() to write the RDD out.
This capability is off by default, but can be turned on by
setting a runtime argument.

Instead of flatMapToPair() calling just the sink's transform
method, a new MultiSinkFunction class is used to direct incoming
records to the correct logical branches of the pipeline.
This requires that each input be tagged with which stage it
came from (stage and port), as well as its type (output, or error).
In order to do this, refactored the SparkPipelineRunner a bit
to maintain the RDD<RecordInfo> for each stage rather than
RDD<StructuredRecord>, as the RecordInfo class contains that
extra information.

Also added a MultiOutputFormat that will take the output of the
MultiSinkFunction and delegate writes to the correct underlying
OutputFormat. Since the OutputFormat lives in the pipeline
app, this approach means CDAP datasets cannot be combined.
This caused a problem with dataset lineage, since it is
implemented by wrapping OutputFormats into a hidden
ExternalDataset class in CDAP. Instead of doing this indirect
wrapping, changed the SparkSinkFactory class to explicitly
register lineage through direct calls instead of hiding it
under several layers of abstraction.
1 of 1969 failed
CDAP › DUT › #3003 3 months ago
Merge pull request #12494 from cdapio/feature/CDAP-17078-transform-executor-refactor
CDAP-17078 refactored MapReduceTransformExecutorFactory
CDAP-17078 refactored MapReduceTransformExecutorFactory
Refactored the transform executor factory used in mapreduce so
that much of the logic can be re-used in Spark as well.
1 of 1965 failed
CDAP › RUT › #1147 4 months ago
CDAP-16943 use byte[] instead of ByteBuffer for record conversion
When converting from a spark Row to a cdap StructuredRecord,
use a byte[] for byte fields instead of a ByteBuffer.
This is because downstream plugins are less likely to have
issues dealing with byte[] and because ByteBuffer is not
serializable, which can cause issues in certain Spark pipelines.
CDAP-16955 add metrics for records into an autojoin
Added a no-op map to auto-joiner input to count records in for
the stage, similar to what is done for normal joiners.
Enhanced autojoin unit tests to check values of
records.in and records.out metrics
Merge pull request #12311 from cdapio/bugfix/CDAP-16955-autojoin-records-in
CDAP-16955 add metrics for records into an autojoin
Merge pull request #12325 from cdapio/feature/CDAP-16943-dataframes-bytearr
CDAP-16943 use byte[] instead of ByteBuffer for record conversion
1 of 1974 failed
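The serialization point in CDAP-16943 above can be checked with plain Java serialization: java.nio.ByteBuffer does not implement Serializable, while byte[] does. The sketch below is only an illustration. ByteFieldSketch and its methods are hypothetical and are not the conversion code from the commit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration, not the record conversion code from the commit.
final class ByteFieldSketch {
  private ByteFieldSketch() { }

  // Copies the remaining bytes into a new array. Reading from a duplicate
  // leaves the position of the original buffer unchanged.
  static byte[] toByteArray(ByteBuffer buffer) {
    byte[] bytes = new byte[buffer.remaining()];
    buffer.duplicate().get(bytes);
    return bytes;
  }

  // Returns true if Java serialization accepts the value.
  static boolean isJavaSerializable(Object value) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(value);
      return true;
    } catch (IOException e) {
      // NotSerializableException is a subclass of IOException.
      return false;
    }
  }
}
```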
HYP › BAD › #300 4 months ago
CDAP-16875 changed joiner plugin to use new auto-join API
Changed the joiner to use the new AutoJoiner API to take
advantage of the performance improvements it allows in Spark
pipelines.

Introduced two new optional properties. The first determines
whether the join will use null safe equality, and the second
provides a hint to the execution engine about which input datasets
should be broadcast to perform an in-memory join.

Removed much of the existing join logic because it has been moved
into the application code. Updated the documentation to use
the property names that show up in the UI, to display the properties
in the same order that they appear in the UI, and to include
descriptions of the two new properties.
Merge pull request #1108 from cdapio/feature/CDAP-16875-auto-join
CDAP-16875 changed joiner plugin to use new auto-join API
Testless build
IT › UPD2 › #540 4 months ago
CDAP-16852 handle dynamic schemas in auto join
Properly handle the case when the inputs into an auto-join stage
have null schemas due to the schema being unknown at deploy time,
usually because of macros.

In order to support this type of use case, added a way for a
plugin to specify the output schema in the JoinDefinition.
This should be used when the output schema cannot be derived
at deployment time due to macros. This is what the existing
Joiner currently requires the user to do for dynamic join
use cases.

For Spark, the implementation becomes significantly more difficult
due to the fact that the schema of each input stage needs to be
known in the Spark driver, in order to convert RDDs into
DataFrames.

Added logic that derives the input schema using the output schema,
selected fields, and join keys. It is possible to derive
a usable schema when all the join keys are present in the final
output schema.
Merge pull request #12252 from cdapio/feature/CDAP-16852-auto-join-macro-schemas
CDAP-16852 handle dynamic schemas in auto join
Testless build
CDAP › RUT › #1134 4 months ago
Merge pull request #12252 from cdapio/feature/CDAP-16852-auto-join-macro-schemas
CDAP-16852 handle dynamic schemas in auto join
CDAP-16852 handle dynamic schemas in auto join
Properly handle the case when the inputs into an auto-join stage
have null schemas due to the schema being unknown at deploy time,
usually because of macros.

In order to support this type of use case, added a way for a
plugin to specify the output schema in the JoinDefinition.
This should be used when the output schema cannot be derived
at deployment time due to macros. This is what the existing
Joiner currently requires the user to do for dynamic join
use cases.

For Spark, the implementation becomes significantly more difficult
due to the fact that the schema of each input stage needs to be
known in the Spark driver, in order to convert RDDs into
DataFrames.

Added logic that derives the input schema using the output schema,
selected fields, and join keys. It is possible to derive
a usable schema when all the join keys are present in the final
output schema.
2765 passed
CDAP › RUT › #1130 4 months ago
Merge pull request #12246 from cdapio/feature/CDAP-16875-join-error-export
CDAP-16875 export join error in hydrator test base
CDAP-16875 export join error in hydrator test base
export the package for join error so that it is accessible to
plugins in unit tests.
4 of 2779 failed
CDAP › RUT › #1127 4 months ago
CDAP-16709 implement manual broadcasts
Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.
CDAP-16711 implement null safe keys as an option
Changed the 'dropNullKeys' property to 'nullSafe' since it more
accurately describes what is happening. Null keys are not dropped
in outer joins, they just do not count as being equal to a null
key on the other side.

Implemented by using Spark's null safe equality when configured
to do so and normal equality otherwise.
CDAP-16709 batch spark auto-join implementation
Implemented auto join for batch spark pipelines.

Added a join method to SparkCollection that takes in the list of
other SparkCollections that it should be joined to.
RDDCollection converts RDDs into Datasets and uses the Dataset
join method to implement the join. This allows Spark to broadcast
small datasets automatically, and to use sort merge join instead
of shuffle hash join, which has better memory characteristics.

As part of this, added a separate RDDCollection implementation for
Spark1 and Spark2, since the Spark API for joins is not compatible.
CDAP-16708 add API for autojoiner
Add a new set of classes for AutoJoiner, which can be used instead
of the current Joiner interface. This new API leaves all of the
implementation details up to the application, which will allow the
app to perform the join in better ways. For example, in the Spark
program, it will allow using broadcast joins.

Plugin developers are responsible for returning a JoinDefinition
based on information about incoming stages.

This change includes the JoinDefinition as well as all the classes
required to create a definition. It also includes validation logic
to make sure the plugin cannot create a definition that tries to
join on a field that doesn't exist, or tries to join on fields
that have mismatched types, or any other type of error.
CDAP-16708 rename Field to JoinField
Merge pull request #12234 from cdapio/feature/CDAP-16709-pipeline-performance-cp
Feature/cdap 16709 pipeline performance cp
CDAP-16711 implemented null safe equality in mapreduce
Implemented the nullSafe flag for mapreduce auto join.
This was done by filtering out records on the map side if they
come from an optional stage and have a null key or a field in
the key that is null.
CDAP-16709 implement auto join for mapreduce
Implemented AutoJoiner for mapreduce by creating a bridge that
implements the old Joiner API using the JoinDefinition from the
new AutoJoiner API.
CDAP-16709 implemented auto-join for spark streaming
Implemented auto-join for spark streaming by using the same
JoinerBridge that is used for MapReduce. This means auto-joins
in streaming pipelines will have the same characteristics as normal
joins, meaning they will be executed as shuffle hash joins.

This is probably ok, as only data within the micro batch is being
joined, which means it shouldn't be too likely to go OOM assuming
there is enough executor memory.
2760 passed
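The difference between normal and null safe key equality, as described in the CDAP-16711 commits in this build, can be sketched in plain Java. This is an analogy only. The commits use Spark's null safe equality and a map-side filter in MapReduce, and JoinKeyEquality is a hypothetical name.

```java
import java.util.Objects;

// Hypothetical illustration of the two equality modes for a single join key.
final class JoinKeyEquality {
  private JoinKeyEquality() { }

  static boolean keysMatch(Object left, Object right, boolean nullSafe) {
    if (nullSafe) {
      // Null safe equality: a null key matches a null key on the other side.
      return Objects.equals(left, right);
    }
    // Normal equality: a null key never matches anything, including null.
    return left != null && left.equals(right);
  }
}
```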
CDAP › DUT › #2935 4 months ago
CDAP-16708 add API for autojoiner
Add a new set of classes for AutoJoiner, which can be used instead
of the current Joiner interface. This new API leaves all of the
implementation details up to the application, which will allow the
app to perform the join in better ways. For example, in the Spark
program, it will allow using broadcast joins.

Plugin developers are responsible for returning a JoinDefinition
based on information about incoming stages.

This change includes the JoinDefinition as well as all the classes
required to create a definition. It also includes validation logic
to make sure the plugin cannot create a definition that tries to
join on a field that doesn't exist, or tries to join on fields
that have mismatched types, or any other type of error.
CDAP-16709 implement auto join for mapreduce
Implemented AutoJoiner for mapreduce by creating a bridge that
implements the old Joiner API using the JoinDefinition from the
new AutoJoiner API.
CDAP-16709 implemented auto-join for spark streaming
Implemented auto-join for spark streaming by using the same
JoinerBridge that is used for MapReduce. This means auto-joins
in streaming pipelines will have the same characteristics as normal
joins, meaning they will be executed as shuffle hash joins.

This is probably ok, as only data within the micro batch is being
joined, which means it shouldn't be too likely to go OOM assuming
there is enough executor memory.
CDAP-16711 implement null safe keys as an option
Changed the 'dropNullKeys' property to 'nullSafe' since it more
accurately describes what is happening. Null keys are not dropped
in outer joins, they just do not count as being equal to a null
key on the other side.

Implemented by using Spark's null safe equality when configured
to do so and normal equality otherwise.
CDAP-16709 batch spark auto-join implementation
Implemented auto join for batch spark pipelines.

Added a join method to SparkCollection that takes in the list of
other SparkCollections that it should be joined to.
RDDCollection converts RDDs into Datasets and uses the Dataset
join method to implement the join. This allows Spark to broadcast
small datasets automatically, and to use sort merge join instead
of shuffle hash join, which has better memory characteristics.

As part of this, added a separate RDDCollection implementation for
Spark1 and Spark2, since the Spark API for joins is not compatible.
CDAP-16711 implemented null safe equality in mapreduce
Implemented the nullSafe flag for mapreduce auto join.
This was done by filtering out records on the map side if they
come from an optional stage and have a null key or a field in
the key that is null.
Merge pull request #12234 from cdapio/feature/CDAP-16709-pipeline-performance-cp
Feature/cdap 16709 pipeline performance cp
CDAP-16708 rename Field to JoinField
CDAP-16709 implement manual broadcasts
Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.
2760 passed
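The ordering tweak described in the manual broadcast commit above (join all non-broadcast inputs first) could look roughly like the sketch below. The Input class and the order method are hypothetical stand-ins. The real implementation works on DataFrames and is not shown in the commit message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of ordering join inputs; not the actual CDAP code.
final class JoinOrderSketch {
  private JoinOrderSketch() { }

  // Stand-in for a join input and its broadcast flag.
  static final class Input {
    final String name;
    final boolean broadcast;

    Input(String name, boolean broadcast) {
      this.name = name;
      this.broadcast = broadcast;
    }
  }

  // Returns the inputs with non-broadcast ones first, preserving the
  // original relative order within each group.
  static List<Input> order(List<Input> inputs) {
    List<Input> ordered = new ArrayList<>();
    for (Input input : inputs) {
      if (!input.broadcast) {
        ordered.add(input);
      }
    }
    for (Input input : inputs) {
      if (input.broadcast) {
        ordered.add(input);
      }
    }
    return ordered;
  }
}
```

Joining in this order means each broadcast input is joined against the result of the earlier joins, so the two sides of a join are never both broadcast as long as at least one input is not broadcast.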
Build Completed Code commits Tests
CDAP › RUT › #1234 2 months ago
CDAP-17232 wait longer for program state to reduce flakiness
Increase the time to wait for program state in gateway tests from
10 seconds to 30 seconds to reduce test flakiness on slow machines.
Merge pull request #12700 from cdapio/bugfix/CDAP-17232-wait-longer-for-programs
CDAP-17232 wait longer for program state to reduce flakiness
2839 passed
CDAP › DUT › #3028 2 months ago
Merge pull request #12594 from cdapio/feature/CDAP-17078-consolidate-multi-output-sinks
CDAP-17078 consolidate multiple outputs from same sink
CDAP-17078 consolidate multiple outputs from same sink
Some sinks have multiple outputs, which can cause a lot of
data recomputation. Added logic to consolidate these outputs.
2827 passed
CDAP › DUT › #3010 3 months ago
CDAP-17078 add a combiner dag to consolidate nodes
Merge pull request #12483 from cdapio/feature/CDAP-17078-dag-consolidation-logic
CDAP-17078 add a combiner dag to consolidate nodes
2805 passed
CDAP › RUT › #1172 4 months ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
2788 passed
CDAP › RUT › #1160 4 months ago
CDAP-17000 increase spark network timeout by default
Merge pull request #12376 from cdapio/feature/CDAP-17000-spark-network-timeout
CDAP-17000 increase spark network timeout by default
2787 passed
CDAP › DUT › #2967 4 months ago
CDAP-16935 partition dataframes before join
Partition dataframes right before the join using the same
partitioning as the join would, except using the number of
partitions specified by the plugin instead of a global number
defined by the spark conf.
Merge pull request #12361 from cdapio/feature/CDAP-16935-autojoin-set-partitions
CDAP-16935 partition dataframes before join
2787 passed
CDAP › DUT › #2937 4 months ago
CDAP-16875 include more detailed information in join exception
Added a list of JoinErrors to the InvalidJoinException and
refactored the join validation logic to include all the
errors found in that list. This allows plugins to
translate the errors found by the platform to problems
with plugin specific configuration properties.
Without this, plugins would have to re-implement all of the
validation checks that are performed by the platform.

Enhance the AutoJoinerContext to include a way to get a
FailureCollector to allow plugins to handle validation of the
join definition in the same way as other validation errors.
Merge pull request #12242 from cdapio/feature/CDAP-16875-join-exception-detail
CDAP-16875 include more detailed information in join exception
2759 passed
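The idea in CDAP-16875 above, collecting every validation problem and attaching the full list to the exception, can be sketched as follows. JoinValidationException, the string-typed schemas, and the error messages are hypothetical simplifications. They are not the InvalidJoinException or JoinError classes from CDAP.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified sketch; not the CDAP join validation code.
final class JoinValidationSketch {
  private JoinValidationSketch() { }

  // Carries all errors found, so a caller can map each one to its own
  // configuration property.
  static final class JoinValidationException extends RuntimeException {
    private final List<String> errors;

    JoinValidationException(List<String> errors) {
      super(errors.size() + " join error(s): " + errors);
      this.errors = Collections.unmodifiableList(new ArrayList<>(errors));
    }

    List<String> getErrors() {
      return errors;
    }
  }

  // Schemas are modeled as field name to type name. Every key field is
  // checked, and all problems are collected rather than failing on the first.
  static void validate(Map<String, String> leftSchema, Map<String, String> rightSchema,
                       List<String> keyFields) {
    List<String> errors = new ArrayList<>();
    for (String field : keyFields) {
      String leftType = leftSchema.get(field);
      String rightType = rightSchema.get(field);
      if (leftType == null || rightType == null) {
        errors.add("Field '" + field + "' does not exist in both inputs");
      } else if (!leftType.equals(rightType)) {
        errors.add("Field '" + field + "' has mismatched types "
            + leftType + " and " + rightType);
      }
    }
    if (!errors.isEmpty()) {
      throw new JoinValidationException(errors);
    }
  }
}
```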
CDAP › RUT › #1129 4 months ago
Merge pull request #12242 from cdapio/feature/CDAP-16875-join-exception-detail
CDAP-16875 include more detailed information in join exception
CDAP-16875 include more detailed information in join exception
Added a list of JoinErrors to the InvalidJoinException and
refactored the join validation logic to include all the
errors found in that list. This allows plugins to
translate the errors found by the platform to problems
with plugin specific configuration properties.
Without this, plugins would have to re-implement all of the
validation checks that are performed by the platform.

Enhance the AutoJoinerContext to include a way to get a
FailureCollector to allow plugins to handle validation of the
join definition in the same way as other validation errors.
2759 passed
CDAP › RUT › #1095 6 months ago
Merge pull request #12065 from cdapio/fix/more-vcpu-options
Added more options (16, 32, 64, 96) for master and worker vCPUs
2704 passed
You have insufficient permissions to see all of the builds.