albertshau <ashau@google.com>: Author Summary

Builds triggered by albertshau <ashau@google.com>

Builds triggered by an author are those builds that contain changes committed by the author.
Total: 1272
Failed: 351 (28%)
Successful: 921 (72%)

Breakages and fixes

Broken means the build has failed but the previous build was successful.
Fixed means that the build was successful but the previous build failed.
Broken: 112 (9% of all builds triggered)
Fixed: 109 (9% of all builds triggered)
Fixed minus broken: -3
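The percentages above can be re-derived from the raw counts. A minimal Python sketch, assuming 1272 is the total that the two counts beneath it add up to, and that the final figure is the second breakage count minus the first. The helper name `pct` is illustrative and not part of the CI tool:

```python
def pct(part: int, total: int) -> int:
    """Share of `total` as a whole-number percentage, rounded half up."""
    return (200 * part + total) // (2 * total)


total = 1272                      # total builds triggered (copied from the summary)

# The two counts listed under the total add up to it.
print(351 + 921 == total)

# Whole-number percentages of the total for each count in the summary.
print(pct(351, total), pct(921, total))
print(pct(112, total), pct(109, total))

# Difference between the two breakage counts.
print(109 - 112)
```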
Build Completed Code commits Tests
HYP › WT › #340 9 hours ago
Merge pull request #419 from data-integrations/bugfix/CDAP-17029-gcs-ignore-last-empty-line
CDAP-17029 ignore last empty line of text gcs files
CDAP-17029 ignore last empty line of text gcs files
350 passed
IT › UPD2 › #562 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
Testless build
CDAP › UDUT › #1009 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
Testless build
CDAP › URUT › #1018 1 week ago
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Testless build
CDAP › RUT › #1172 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
2788 passed
CDAP › DUT › #2981 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
2788 passed
CDAP › BPP › #1250 1 week ago
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Testless build
CDAP › DRC › #4972 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
Testless build
IT › UPD2 › #554 2 weeks ago
CDAP-16994 fix schema equality check during pipeline deployment
Fixed a bug that would cause pipeline deployment to fail if
two stages have the same output stage and their output schemas
are exactly the same except for the schema name.

Implemented a schema equality check that ignores record name
when checking equality, and used it for the validation checks.
CDAP-17000 increase spark network timeout by default
Merge pull request #12376 from cdapio/feature/CDAP-17000-spark-network-timeout
CDAP-17000 increase spark network timeout by default
Merge pull request #12360 from cdapio/bugfix/CDAP-16994-fix-same-schema-comparison
CDAP-16994 fix schema equality check during pipeline deployment
CDAP-16935 partition dataframes before join
partition dataframes right before the join using the same
partitioning as the join would, except using the number of
partitions specified by the plugin instead of a global number
defined by the spark conf.
Merge pull request #12361 from cdapio/feature/CDAP-16935-autojoin-set-partitions
CDAP-16935 partition dataframes before join
Testless build
CDAP › DUT › #2969 2 weeks ago
CDAP-17000 increase spark network timeout by default
Merge pull request #12376 from cdapio/feature/CDAP-17000-spark-network-timeout
CDAP-17000 increase spark network timeout by default
2787 passed
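The CDAP-16994 commit message in the table above describes a schema equality check that ignores the record name. The idea can be sketched in Python as follows. This is not the CDAP implementation: the dict-based, Avro-style schema layout and the function names `strip_record_names` and `equal_ignoring_record_name` are assumptions made for illustration only.

```python
def strip_record_names(schema):
    """Return a copy of `schema` with the name of every record removed."""
    if isinstance(schema, list):
        return [strip_record_names(item) for item in schema]
    if isinstance(schema, dict):
        stripped = {key: strip_record_names(value) for key, value in schema.items()}
        if schema.get("type") == "record":
            stripped.pop("name", None)  # drop the record name, keep field names
        return stripped
    return schema  # primitive type name such as "string"


def equal_ignoring_record_name(a, b):
    """Compare two schemas, treating record names as irrelevant."""
    return strip_record_names(a) == strip_record_names(b)


# Hypothetical schemas: identical fields, different record names.
left = {"type": "record", "name": "outputA",
        "fields": [{"name": "id", "type": "long"},
                   {"name": "email", "type": "string"}]}
right = {"type": "record", "name": "outputB",
         "fields": [{"name": "id", "type": "long"},
                    {"name": "email", "type": "string"}]}
# Hypothetical schema with a genuinely different field.
other = {"type": "record", "name": "outputA",
         "fields": [{"name": "id", "type": "long"},
                    {"name": "phone", "type": "string"}]}

print(left == right)                             # strict equality
print(equal_ignoring_record_name(left, right))   # record name ignored
print(equal_ignoring_record_name(left, other))   # field names still matter
```

Strict equality rejects `left` and `right` because only the record name differs, while the relaxed check accepts them and still rejects `other`.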
Build Completed Code commits Tests
CDAP › RUT › #1147 3 weeks ago
CDAP-16943 use byte[] instead of ByteBuffer for record conversion
When converting from a spark Row to a cdap StructuredRecord,
use a byte[] for byte fields instead of a ByteBuffer.
This is because downstream plugins are less likely to have
issues dealing with byte[] and because ByteBuffer is not
serializable, which can cause issues in certain Spark pipelines.
CDAP-16955 add metrics for records into an autojoin
Added a no-op map to auto-joiner input to count records in for
the stage, similar to what is done for normal joiners.
Enhanced autojoin unit tests to check values of
records.in and records.out metrics
Merge pull request #12311 from cdapio/bugfix/CDAP-16955-autojoin-records-in
CDAP-16955 add metrics for records into an autojoin
Merge pull request #12325 from cdapio/feature/CDAP-16943-dataframes-bytearr
CDAP-16943 use byte[] instead of ByteBuffer for record conversion
1 of 1974 failed
HYP › BAD › #300 3 weeks ago
CDAP-16875 changed joiner plugin to use new auto-join API
Changed the joiner to use the new AutoJoiner API to take
advantage of the performance improvements it allows in Spark
pipelines.

Introduced two new optional properties. The first determines
whether the join will use null safe equality, and the second
provides a hint to the execution engine about which input datasets
should be broadcast to perform an in-memory join.

Removed much of the existing join logic because it has been moved
into the application code. Updated the documentation to use
the property names that show up in the UI, to display the properties
in the same order that they appear in the UI, and to include
descriptions of the two new properties.
Merge pull request #1108 from cdapio/feature/CDAP-16875-auto-join
CDAP-16875 changed joiner plugin to use new auto-join API
Testless build
IT › UPD2 › #540 1 month ago
CDAP-16852 handle dynamic schemas in auto join
Properly handle the case when the inputs into an auto-join stage
have null schemas due to the schema being unknown at deploy time,
usually because of macros.

In order to support this type of use case, added a way for a
plugin to specify the output schema in the JoinDefinition.
This should be used when the output schema cannot be derived
at deployment time due to macros. This is what the existing
Joiner currently requires the user to do for dynamic join
use cases.

For Spark, the implementation becomes significantly more difficult
due to the fact that the schema of each input stage needs to be
known in the Spark driver, in order to convert RDDs into
DataFrames.

Added logic that derives the input schema using the output schema,
selected fields, and join keys. It is possible to derive
a usable schema when all the join keys are present in the final
output schema.
Merge pull request #12252 from cdapio/feature/CDAP-16852-auto-join-macro-schemas
CDAP-16852 handle dynamic schemas in auto join
Testless build
CDAP › RUT › #1134 1 month ago
Merge pull request #12252 from cdapio/feature/CDAP-16852-auto-join-macro-schemas
CDAP-16852 handle dynamic schemas in auto join
CDAP-16852 handle dynamic schemas in auto join
Properly handle the case when the inputs into an auto-join stage
have null schemas due to the schema being unknown at deploy time,
usually because of macros.

In order to support this type of use case, added a way for a
plugin to specify the output schema in the JoinDefinition.
This should be used when the output schema cannot be derived
at deployment time due to macros. This is what the existing
Joiner currently requires the user to do for dynamic join
use cases.

For Spark, the implementation becomes significantly more difficult
due to the fact that the schema of each input stage needs to be
known in the Spark driver, in order to convert RDDs into
DataFrames.

Added logic that derives the input schema using the output schema,
selected fields, and join keys. It is possible to derive
a usable schema when all the join keys are present in the final
output schema.
2765 passed
CDAP › RUT › #1130 1 month ago
Merge pull request #12246 from cdapio/feature/CDAP-16875-join-error-export
CDAP-16875 export join error in hydrator test base
CDAP-16875 export join error in hydrator test base
export the package for join error so that it is accessible to
plugins in unit tests.
4 of 2779 failed
CDAP › RUT › #1127 1 month ago
CDAP-16709 implement manual broadcasts
Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.
CDAP-16711 implement null safe keys as an option
Changed the 'dropNullKeys' property to 'nullSafe' since it more
accurately describes what is happening. Null keys are not dropped
in outer joins, they just do not count as being equal to a null
key on the other side.

Implemented by using Spark's null safe equality when configured
to do so and normal equality otherwise.
CDAP-16709 batch spark auto-join implementation
Implemented auto join for batch spark pipelines.

Added a join method to SparkCollection that takes in the list of
other SparkCollections that it should be joined to.
RDDCollection converts RDDs into Datasets and uses the Dataset
join method to implement the join. This allows Spark to broadcast
small datasets automatically, and to use sort merge join instead
of shuffle hash join, which has better memory characteristics.

As part of this, added a separate RDDCollection implementation for
Spark1 and Spark2, since the Spark API for joins is not compatible.
CDAP-16708 add API for autojoiner
Add a new set of classes for AutoJoiner, which can be used instead
of the current Joiner interface. This new API leaves all of the
implementation details up to the application, which will allow the
app to perform the join in better ways. For example, in the Spark
program, it will allow using broadcast joins.

Plugin developers are responsible for returning a JoinDefinition
based on information about incoming stages.

This change includes the JoinDefinition as well as all the classes
required to create a definition. It also includes validation logic
to make sure the plugin cannot create a definition that tries to
join on a field that doesn't exist, or tries to join on fields
that have mismatched types, or any other type of error.
CDAP-16708 rename Field to JoinField
Merge pull request #12234 from cdapio/feature/CDAP-16709-pipeline-performance-cp
Feature/cdap 16709 pipeline performance cp
CDAP-16711 implemented null safe equality in mapreduce
Implemented the nullSafe flag for mapreduce auto join.
This was done by filtering out records on the map side if they
come from an optional stage and have a null key or a field in
the key that is null.
CDAP-16709 implement auto join for mapreduce
Implemented AutoJoiner for mapreduce by creating a bridge that
implements the old Joiner API using the JoinDefinition from the
new AutoJoiner API.
CDAP-16709 implemented auto-join for spark streaming
Implemented auto-join for spark streaming by using the same
JoinerBridge that is used for MapReduce. This means auto-joins
in streaming pipelines will have the same characteristics as normal
joins, meaning they will be executed as shuffle hash joins.

This is probably ok, as only data within the micro batch is being
joined, which means it shouldn't be too likely to go OOM assuming
there is enough executor memory.
2760 passed
CDAP › DUT › #2935 1 month ago
CDAP-16708 add API for autojoiner
Add a new set of classes for AutoJoiner, which can be used instead
of the current Joiner interface. This new API leaves all of the
implementation details up to the application, which will allow the
app to perform the join in better ways. For example, in the Spark
program, it will allow using broadcast joins.

Plugin developers are responsible for returning a JoinDefinition
based on information about incoming stages.

This change includes the JoinDefinition as well as all the classes
required to create a definition. It also includes validation logic
to make sure the plugin cannot create a definition that tries to
join on a field that doesn't exist, or tries to join on fields
that have mismatched types, or any other type of error.
CDAP-16709 implement auto join for mapreduce
Implemented AutoJoiner for mapreduce by creating a bridge that
implements the old Joiner API using the JoinDefinition from the
new AutoJoiner API.
CDAP-16709 implemented auto-join for spark streaming
Implemented auto-join for spark streaming by using the same
JoinerBridge that is used for MapReduce. This means auto-joins
in streaming pipelines will have the same characteristics as normal
joins, meaning they will be executed as shuffle hash joins.

This is probably ok, as only data within the micro batch is being
joined, which means it shouldn't be too likely to go OOM assuming
there is enough executor memory.
CDAP-16711 implement null safe keys as an option
Changed the 'dropNullKeys' property to 'nullSafe' since it more
accurately describes what is happening. Null keys are not dropped
in outer joins, they just do not count as being equal to a null
key on the other side.

Implemented by using Spark's null safe equality when configured
to do so and normal equality otherwise.
CDAP-16709 batch spark auto-join implementation
Implemented auto join for batch spark pipelines.

Added a join method to SparkCollection that takes in the list of
other SparkCollections that it should be joined to.
RDDCollection converts RDDs into Datasets and uses the Dataset
join method to implement the join. This allows Spark to broadcast
small datasets automatically, and to use sort merge join instead
of shuffle hash join, which has better memory characteristics.

As part of this, added a separate RDDCollection implementation for
Spark1 and Spark2, since the Spark API for joins is not compatible.
CDAP-16711 implemented null safe equality in mapreduce
Implemented the nullSafe flag for mapreduce auto join.
This was done by filtering out records on the map side if they
come from an optional stage and have a null key or a field in
the key that is null.
Merge pull request #12234 from cdapio/feature/CDAP-16709-pipeline-performance-cp
Feature/cdap 16709 pipeline performance cp
CDAP-16708 rename Field to JoinField
CDAP-16709 implement manual broadcasts
Honor the broadcast flag set in the JoinDefinition when joining
multiple DataFrames. Added a small tweak to the join logic to
join all non-broadcasted datasets first in order to ensure that
both sides of the join are not broadcast, and to reduce the amount
of data that is being shuffled in non-broadcast intermediate joins.
2760 passed
CDAP › RUT › #1108 2 months ago
Merge branch 'release/6.2' into merge-6.2
1 of 1139 failed
You have insufficient permissions to see all of the builds.
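The CDAP-16711 commit messages above distinguish null-safe key equality from normal equality, under which a null key never matches a null key on the other side. A minimal Python sketch of that distinction on an inner join, assuming rows are plain dicts. The names `keys_match` and `inner_join` are hypothetical and do not come from CDAP or Spark:

```python
def keys_match(left_key, right_key, null_safe):
    """Compare two join keys, with None standing in for a null key."""
    if left_key is None or right_key is None:
        # Normal equality: a null key matches nothing.
        # Null-safe equality: null matches null only.
        return null_safe and left_key is None and right_key is None
    return left_key == right_key


def inner_join(left_rows, right_rows, key, null_safe):
    """Naive nested-loop inner join on a single key field."""
    return [(left, right)
            for left in left_rows
            for right in right_rows
            if keys_match(left[key], right[key], null_safe)]


# Hypothetical input data with one null key on each side.
users = [{"id": 1, "name": "a"}, {"id": None, "name": "b"}]
orders = [{"id": 1, "item": "x"}, {"id": None, "item": "y"}]

print(len(inner_join(users, orders, "id", null_safe=False)))
print(len(inner_join(users, orders, "id", null_safe=True)))
```

With `null_safe=False` only the rows with `id` 1 pair up. With `null_safe=True` the two rows with a null `id` also pair with each other.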
Build Completed Code commits Tests
CDAP › RUT › #1172 1 week ago
CDAP-17024 set spark sql case sensitivity
Set spark sql to be case sensitive so that autojoins behave in
the same way as batchjoiners.
Merge pull request #12412 from cdapio/bugfix/CDAP-17024-autojoin-case-sensitive
CDAP-17024 set spark sql case sensitivity
2788 passed
CDAP › RUT › #1160 2 weeks ago
CDAP-17000 increase spark network timeout by default
Merge pull request #12376 from cdapio/feature/CDAP-17000-spark-network-timeout
CDAP-17000 increase spark network timeout by default
2787 passed
CDAP › DUT › #2967 2 weeks ago
CDAP-16935 partition dataframes before join
partition dataframes right before the join using the same
partitioning as the join would, except using the number of
partitions specified by the plugin instead of a global number
defined by the spark conf.
Merge pull request #12361 from cdapio/feature/CDAP-16935-autojoin-set-partitions
CDAP-16935 partition dataframes before join
2787 passed
CDAP › DUT › #2937 1 month ago
CDAP-16875 include more detailed information in join exception
Added a list of JoinErrors to the InvalidJoinException and
refactored the join validation logic to include all the
errors found in that list. This allows plugins to
translate the errors found by the platform to problems
with plugin specific configuration properties.
Without this, plugins would have to re-implement all of the
validation checks that are performed by the platform.

Enhance the AutoJoinerContext to include a way to get a
FailureCollector to allow plugins to handle validation of the
join definition in the same way as other validation errors.
Merge pull request #12242 from cdapio/feature/CDAP-16875-join-exception-detail
CDAP-16875 include more detailed information in join exception
2759 passed
CDAP › RUT › #1129 1 month ago
Merge pull request #12242 from cdapio/feature/CDAP-16875-join-exception-detail
CDAP-16875 include more detailed information in join exception
CDAP-16875 include more detailed information in join exception
Added a list of JoinErrors to the InvalidJoinException and
refactored the join validation logic to include all the
errors found in that list. This allows plugins to
translate the errors found by the platform to problems
with plugin specific configuration properties.
Without this, plugins would have to re-implement all of the
validation checks that are performed by the platform.

Enhance the AutoJoinerContext to include a way to get a
FailureCollector to allow plugins to handle validation of the
join definition in the same way as other validation errors.
2759 passed
CDAP › RUT › #1095 2 months ago
Merge pull request #12065 from cdapio/fix/more-vcpu-options
Added more options (16, 32, 64, 96) for master and worker vCPUs
2704 passed
You have insufficient permissions to see all of the builds.