Commits on Dec 27, 2016
  1. [SPARK-19006][DOCS] mention spark.kryoserializer.buffer.max must be l…

    …ess than 2048m in doc
    
    ## What changes were proposed in this pull request?
    
    On the configuration doc page, https://spark.apache.org/docs/latest/configuration.html,
    we describe spark.kryoserializer.buffer.max as: Maximum allowable size of Kryo serialization buffer. This must be larger than any object you attempt to serialize. Increase this if you get a "buffer limit exceeded" exception inside Kryo.
    The source code, however, has a hard-coded upper limit:
    ```
    val maxBufferSizeMb = conf.getSizeAsMb("spark.kryoserializer.buffer.max", "64m").toInt
    if (maxBufferSizeMb >= ByteUnit.GiB.toMiB(2)) {
      throw new IllegalArgumentException("spark.kryoserializer.buffer.max must be less than " +
        s"2048 mb, got: + $maxBufferSizeMb mb.")
    }
    ```
    We should mention "this value must be less than 2048 mb" on the configuration doc page as well.
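    
    For reference, a minimal sketch of keeping this setting below the hard-coded cap (the value shown is illustrative, not a recommendation):
    
    ```scala
    import org.apache.spark.SparkConf
    
    // spark.kryoserializer.buffer.max must stay below 2048m.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "1024m")
    ```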
    
    ## How was this patch tested?
    
    None, since it's a minor doc change.
    
    Author: Yuexin Zhang <[email protected]>
    
    Closes #16412 from cnZach/SPARK-19006.
    Yuexin Zhang committed with srowen Dec 27, 2016
  2. [SPARK-18842][TESTS] De-duplicate paths in classpaths in processes fo…

    …r local-cluster mode in ReplSuite to work around the length limitation on Windows
    
    ## What changes were proposed in this pull request?
    
    Tests in `ReplSuite` hang due to the length limitation on Windows, with the exception below:
    
    ```
    Spark context available as 'sc' (master = local-cluster[1,1,1024], app id = app-20161223114000-0000).
    Spark session available as 'spark'.
    Exception in thread "ExecutorRunner for app-20161223114000-0000/26995" java.lang.OutOfMemoryError: GC overhead limit exceeded
    	at java.util.Arrays.copyOf(Arrays.java:3332)
    	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:622)
    	at java.lang.StringBuilder.append(StringBuilder.java:202)
    	at java.lang.ProcessImpl.createCommandLine(ProcessImpl.java:194)
    	at java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
    	at java.lang.ProcessImpl.start(ProcessImpl.java:137)
    	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    	at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
    	at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
    ```
    
    The reason is that it keeps failing and goes into an infinite loop. It fails because the tests use the paths obtained from URLs (via `getFile`) while some paths added afterward are plain local paths.
    (`url.getFile` gives `/C:/a/b/c`, whereas some paths are added later in the `C:\a\b\c` format.)
    
    So, many classpath entries are duplicated because plain local paths and paths from URLs are mixed. The resulting classpath is up to 40K characters, which hits the 32K length limit on Windows.
    
    The full command line built here is - https://gist.github.com/HyukjinKwon/46af7946c9a5fd4c6fc70a8a0aba1beb
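    
    A minimal sketch of the de-duplication idea (illustrative only, not the exact change in `ReplSuite`): normalize each entry so the URL-style and local-style spellings of the same jar collapse into one.
    
    ```scala
    import java.io.File
    
    // Hypothetical helper: drop the leading "/" that URL.getFile leaves before a
    // Windows drive letter, canonicalize, then de-duplicate preserving order.
    def normalize(entry: String): String = {
      val dropped = if (entry.matches("/[A-Za-z]:.*")) entry.substring(1) else entry
      new File(dropped).getCanonicalPath
    }
    
    def dedupClasspath(entries: Seq[String]): Seq[String] =
      entries.map(normalize).distinct
    ```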
    
    ## How was this patch tested?
    
    Manually via AppVeyor.
    
    **Before**
    https://ci.appveyor.com/project/spark-test/spark/build/395-find-path-issues
    
    **After**
    https://ci.appveyor.com/project/spark-test/spark/build/398-find-path-issues
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #16398 from HyukjinKwon/SPARK-18842-more.
    HyukjinKwon committed with srowen Dec 27, 2016
  3. [SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset

    ## What changes were proposed in this pull request?
    
    Currently `DatasetBenchmark` uses `case class Data(l: Long, s: String)` as the record type for both `RDD` and `Dataset`, which introduces serialization overhead only for `Dataset` and is unfair.
    
    This PR uses `Long` as the record type, to be fairer to `Dataset`.
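    
    A rough illustration of the difference (assuming a running `SparkSession` named `spark`; this is not the benchmark code itself):
    
    ```scala
    import spark.implicits._
    
    // With Long as the element type, neither side pays a per-record object
    // serialization cost, so the comparison focuses on the execution paths.
    val rdd = spark.sparkContext.range(0L, 1000000L)   // RDD[Long], plain JVM longs
    val ds  = spark.range(0L, 1000000L).as[Long]       // Dataset[Long], primitive encoder
    
    rdd.map(_ * 2).count()
    ds.map(_ * 2).count()
    ```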
    
    ## How was this patch tested?
    
    existing tests
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16391 from cloud-fan/benchmark.
    cloud-fan committed Dec 27, 2016
  4. [SPARK-19004][SQL] Fix `JDBCWriteSuite.testH2Dialect` by removing `ge…

    …tCatalystType`
    
    ## What changes were proposed in this pull request?
    
    `JDBCSuite` and `JDBCWriteSuite` each have their own `testH2Dialect` for their testing purposes.
    
    This PR fixes `testH2Dialect` in `JDBCWriteSuite` by removing its `getCatalystType` implementation so that correct types are returned. Currently, it always returns `Some(StringType)` incorrectly. Note that for the `testH2Dialect` in `JDBCSuite`, this behavior is intentional because of the test case `Remap types via JdbcDialects`.
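    
    For context, a sketch of the problematic pattern (illustrative only, not the suite's exact dialect):
    
    ```scala
    import org.apache.spark.sql.jdbc.JdbcDialect
    import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}
    
    // A test dialect that unconditionally maps every JDBC type to StringType
    // hides the real column types coming back from H2.
    object BadTestH2Dialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")
      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
        Some(StringType)  // always StringType, regardless of the actual column type
    }
    ```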
    
    ## How was this patch tested?
    
    This is a test-only update.
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #16409 from dongjoon-hyun/SPARK-H2-DIALECT.
    dongjoon-hyun committed with gatorsmile Dec 27, 2016
  5. [SPARK-18999][SQL][MINOR] simplify Literal codegen

    ## What changes were proposed in this pull request?
    
    `Literal` can use `CodegenContext.addReferenceObj` to implement codegen, instead of `CodegenFallback`. This also simplifies the generated code a little bit: previously we generated `((Expression) references[1]).eval(null)`, now it's just `references[1]`.
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16402 from cloud-fan/minor.
    cloud-fan committed with gatorsmile Dec 27, 2016
Commits on Dec 26, 2016
  1. [SPARK-18989][SQL] DESC TABLE should not fail with format class not f…

    …ound
    
    ## What changes were proposed in this pull request?
    
    When we describe a table, we only want to see the table's information, not read its data, so it's OK even if the format class is not present on the classpath.
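    
    A minimal illustration (assuming a `SparkSession` named `spark` and a hypothetical Hive table `some_table`):
    
    ```scala
    // DESCRIBE only reads catalog metadata, so it should succeed even if the
    // table's input/output format classes are missing from the classpath.
    spark.sql("DESC TABLE some_table").show(truncate = false)
    ```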
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16388 from cloud-fan/hive.
    cloud-fan committed with gatorsmile Dec 26, 2016
  2. [SPARK-18980][SQL] implement Aggregator with TypedImperativeAggregate

    ## What changes were proposed in this pull request?
    
    Currently we implement `Aggregator` with `DeclarativeAggregate`, which serializes/deserializes the buffer object every time we process an input.
    
    This PR implements `Aggregator` with `TypedImperativeAggregate` and avoids serializing/deserializing the buffer object many times. The benchmark shows we get about a 2x speedup.
    
    For simple buffer objects that don't need serialization, we still go with `DeclarativeAggregate`, to avoid a performance regression.
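    
    A minimal `Aggregator` with a non-trivial buffer object, for illustration only (not the benchmark code): under the old `DeclarativeAggregate`-based implementation this buffer would be serialized/deserialized per input row, while the `TypedImperativeAggregate` path keeps it as a JVM object between rows.
    
    ```scala
    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator
    
    case class SumAndCount(var sum: Long, var count: Long)
    
    object AvgAgg extends Aggregator[Long, SumAndCount, Double] {
      def zero: SumAndCount = SumAndCount(0L, 0L)
      def reduce(b: SumAndCount, a: Long): SumAndCount = { b.sum += a; b.count += 1; b }
      def merge(b1: SumAndCount, b2: SumAndCount): SumAndCount = {
        b1.sum += b2.sum; b1.count += b2.count; b1
      }
      def finish(b: SumAndCount): Double = if (b.count == 0) 0.0 else b.sum.toDouble / b.count
      def bufferEncoder: Encoder[SumAndCount] = Encoders.product[SumAndCount]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }
    ```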
    
    ## How was this patch tested?
    
    N/A
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #16383 from cloud-fan/aggregator.
    cloud-fan committed Dec 26, 2016
  3. [SPARK-17755][CORE] Use workerRef to send RegisterWorkerResponse to a…

    …void the race condition
    
    ## What changes were proposed in this pull request?
    
    The root cause of this issue is that RegisterWorkerResponse and LaunchExecutor are sent via two different channels (TCP connections) and their order is not guaranteed.
    
    This PR changes the master and worker codes to use `workerRef` to send RegisterWorkerResponse, so that RegisterWorkerResponse and LaunchExecutor are sent via the same connection. Hence `LaunchExecutor` will always be after `RegisterWorkerResponse` and never be ignored.
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16345 from zsxwing/SPARK-17755.
    zsxwing committed Dec 26, 2016
Commits on Dec 24, 2016
  1. [SPARK-18943][SQL] Avoid per-record type dispatch in CSV when reading

    ## What changes were proposed in this pull request?
    
    `CSVRelation.csvParser` does type dispatch for each value in each row. We can avoid this because the schema is already kept in `CSVRelation`.
    
    So, this PR proposes that converters are created first according to the schema and then applied to each record.
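    
    A hedged sketch of the idea (not the actual `CSVRelation` code): build one converter per field up front from the schema, then apply the prebuilt converters to every row instead of matching on the `DataType` for every single value.
    
    ```scala
    import org.apache.spark.sql.types._
    
    def makeConverter(dt: DataType): String => Any = dt match {
      case IntegerType => (s: String) => s.toInt
      case DoubleType  => (s: String) => s.toDouble
      case BooleanType => (s: String) => s.toBoolean
      case _           => (s: String) => s            // fall back to the raw string
    }
    
    def buildConverters(schema: StructType): Array[String => Any] =
      schema.fields.map(f => makeConverter(f.dataType))
    
    // per row: tokens.zip(converters).map { case (token, convert) => convert(token) }
    ```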
    
    I ran some small benchmarks as below, after reproducing the logic in https://github.com/apache/spark/blob/7c33b0fd050f3d2b08c1cfd7efbff8166832c1af/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L170-L178, to test the updated logic.
    
    ```scala
    test("Benchmark for CSV converter") {
      var numMalformedRecords = 0
      val N = 500 << 12
      val schema = StructType(
        StructField("a", StringType) ::
        StructField("b", StringType) ::
        StructField("c", StringType) ::
        StructField("d", StringType) :: Nil)
    
      val row = Array("1.0", "test", "2015-08-20 14:57:00", "FALSE")
      val data = spark.sparkContext.parallelize(List.fill(N)(row))
      val parser = CSVRelation.csvParser(schema, schema.fieldNames, CSVOptions())
    
      val benchmark = new Benchmark("CSV converter", N)
      benchmark.addCase("cast CSV string tokens", 10) { _ =>
        data.flatMap { recordTokens =>
          parser(recordTokens, numMalformedRecords)
        }.collect()
      }
      benchmark.run()
    }
    ```
    
    **Before**
    
    ```
    CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    cast CSV string tokens                        1061 / 1130          1.9         517.9       1.0X
    ```
    
    **After**
    
    ```
    CSV converter:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    cast CSV string tokens                         940 / 1011          2.2         459.2       1.0X
    ```
    
    ## How was this patch tested?
    
    Tests in `CSVTypeCastSuite` and `CSVRelation`
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #16351 from HyukjinKwon/type-dispatch.
    HyukjinKwon committed with cloud-fan Dec 24, 2016
  2. [SPARK-18837][WEBUI] Very long stage descriptions do not wrap in the UI

    ## What changes were proposed in this pull request?
    
    This issue was reported by wangyum.
    
    In the AllJobsPage, JobPage, and StagePage, the description width used to be limited, as follows.
    
    ![ui-2 0 0](https://cloud.githubusercontent.com/assets/4736016/21319673/8b225246-c651-11e6-9041-4fcdd04f4dec.gif)
    
    But recently, the limitation seems to have been accidentally removed.
    
    ![ui-2 1 0](https://cloud.githubusercontent.com/assets/4736016/21319825/104779f6-c652-11e6-8bfa-dfd800396352.gif)
    
    The cause is that some tables no longer have the `sortable` class although they used to; the `sortable` class not only marks tables as sortable but also limits the width of their child `td` elements.
    Some tables lost the `sortable` class because another sorting mechanism was introduced by #13620 and #13708 along with the pagination feature.
    
    To fix this issue, I've introduced a new class, `table-cell-width-limited`, which limits the width of the description cell so the description renders as it did before.
    
    <img width="1260" alt="2016-12-20 1 00 34" src="https://cloud.githubusercontent.com/assets/4736016/21320478/89141c7a-c654-11e6-8494-f8f91325980b.png">
    
    ## How was this patch tested?
    
    Tested manually with my browser.
    
    Author: Kousuke Saruta <[email protected]>
    
    Closes #16338 from sarutak/SPARK-18837.
    sarutak committed with srowen Dec 24, 2016
  3. [SPARK-18800][SQL] Correct the assert in UnsafeKVExternalSorter which…

    … ensures array size
    
    ## What changes were proposed in this pull request?
    
    `UnsafeKVExternalSorter` uses `UnsafeInMemorySorter` to sort the records of `BytesToBytesMap` if it is given a map.
    
    Currently we use the number of keys in `BytesToBytesMap` to determine whether the array used for sorting is large enough. We have an assert that ensures the size of the array is enough: `map.numKeys() <= map.getArray().size() / 2`.
    
    However, each record in the map takes two entries in the array: one is the record pointer, the other is the key prefix. So the correct assert should be `map.numKeys() * 2 <= map.getArray().size() / 2`.
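    
    A small numeric illustration of the corrected bound (the values are hypothetical):
    
    ```scala
    // Each record occupies two entries in the point array (record pointer + key
    // prefix), so the usable capacity in records is getArray().size() / 4.
    val numKeys   = 1000L
    val arraySize = 4096L                   // entries in the map's array
    assert(numKeys * 2 <= arraySize / 2)    // the corrected assert from this PR
    ```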
    
    ## How was this patch tested?
    
    N/A
    
    Author: Liang-Chi Hsieh <[email protected]>
    
    Closes #16232 from viirya/SPARK-18800-fix-UnsafeKVExternalSorter.
    viirya committed with srowen Dec 24, 2016
  4. [SPARK-18911][SQL] Define CatalogStatistics to interact with metastor…

    …e and convert it to Statistics in relations
    
    ## What changes were proposed in this pull request?
    
    Statistics in LogicalPlan should use attributes to refer to columns rather than column names, because two columns from two relations can have the same column name. But CatalogTable has no concept of attributes or broadcast hints in its Statistics, so putting Statistics in CatalogTable is confusing.
    
    We define a separate statistics structure for CatalogTable, which is only responsible for interacting with the metastore and is converted to Statistics in LogicalPlan when used.
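    
    A rough sketch of the split described above (names and fields are illustrative, not the exact Spark definitions): a metastore-facing statistics holder that knows nothing about attributes, converted to plan-level statistics only when the relation is resolved.
    
    ```scala
    case class PlanStatisticsSketch(sizeInBytes: BigInt, rowCount: Option[BigInt])
    
    // Catalog-side statistics: only what the metastore can store.
    case class CatalogStatisticsSketch(sizeInBytes: BigInt, rowCount: Option[BigInt]) {
      def toPlanStats: PlanStatisticsSketch = PlanStatisticsSketch(sizeInBytes, rowCount)
    }
    ```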
    
    ## How was this patch tested?
    
    add test cases
    
    Author: wangzhenhua <[email protected]>
    Author: Zhenhua Wang <[email protected]>
    
    Closes #16323 from wzhfy/nameToAttr.
    wzhfy committed with cloud-fan Dec 24, 2016
Commits on Dec 23, 2016
  1. [SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use Conc…

    …urrentHashMap to make it faster
    
    ## What changes were proposed in this pull request?
    
    The time complexity of `ConcurrentHashMap`'s `remove` is O(1), whereas removal from a `ConcurrentLinkedQueue` requires a linear scan. Changing ContextCleaner.referenceBuffer's type from `ConcurrentLinkedQueue` to a set backed by `ConcurrentHashMap` makes the removal much faster.
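    
    A sketch of the data-structure change (simplified; not the actual ContextCleaner code):
    
    ```scala
    import java.util.Collections
    import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue}
    
    final case class CleanupRef(id: Long)   // stand-in for the real reference type
    
    // before: removal scans the queue, O(n)
    val queueBuffer = new ConcurrentLinkedQueue[CleanupRef]()
    // after: removal is O(1) on a set backed by ConcurrentHashMap
    val setBuffer = Collections.newSetFromMap(new ConcurrentHashMap[CleanupRef, java.lang.Boolean])
    
    val ref = CleanupRef(1L)
    queueBuffer.add(ref); queueBuffer.remove(ref)
    setBuffer.add(ref);   setBuffer.remove(ref)
    ```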
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16390 from zsxwing/SPARK-18991.
    zsxwing committed Dec 23, 2016
  2. [SPARK-18963] o.a.s.unsafe.types.UTF8StringSuite.writeToOutputStreamI…

    …ntArray test
    
    fails on big endian. Only change byte order on little endian
    
    ## What changes were proposed in this pull request?
    
    Fix test to only change byte order on LE platforms
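    
    A sketch of the guard (illustrative, not the suite's exact code):
    
    ```scala
    import java.nio.ByteOrder
    
    // Only swap bytes when the platform is little-endian, so the expected values
    // also match on big-endian machines.
    def maybeReverseBytes(v: Int): Int =
      if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) Integer.reverseBytes(v) else v
    ```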
    
    ## How was this patch tested?
    
    Test run on Big Endian and Little Endian platforms
    
    Author: Pete Robbins <[email protected]>
    
    Closes #16375 from robbinspg/SPARK-18963.
    robbinspg committed with srowen Dec 23, 2016
  3. [SPARK-18958][SPARKR] R API toJSON on DataFrame

    ## What changes were proposed in this pull request?
    
    This makes it easier to integrate with other components expecting a row-based JSON format.
    It replaces the non-public toJSON RDD API.
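    
    For illustration, the Scala analogue of the behavior (assuming a `SparkSession` named `spark`); the new R function exposes the same row-per-JSON-string output:
    
    ```scala
    // Each row becomes one JSON string, which makes row-based integration easy.
    val df = spark.range(3).selectExpr("id", "id * 2 AS doubled")
    df.toJSON.show(truncate = false)
    ```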
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <[email protected]>
    
    Closes #16368 from felixcheung/rJSON.
    felixcheung committed with Felix Cheung Dec 23, 2016
  4. [SPARK-18972][CORE] Fix the netty thread names for RPC

    ## What changes were proposed in this pull request?
    
    Right now the names of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`, which is pretty confusing.
    
    This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes:
    
    - TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods.
    - Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements. Otherwise, if the caller calls `next` without checking `hasNext`, it will just hang (see the sketch below).
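    
    A minimal sketch of the iterator contract in question (illustrative only, not ShuffleBlockFetcherIterator itself):
    
    ```scala
    // next() signals exhaustion with NoSuchElementException instead of blocking.
    class SafeIterator[T](underlying: Iterator[T]) extends Iterator[T] {
      override def hasNext: Boolean = underlying.hasNext
      override def next(): T = {
        if (!hasNext) throw new NoSuchElementException("no more elements")
        underlying.next()
      }
    }
    ```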
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16380 from zsxwing/SPARK-18972.
    zsxwing committed Dec 23, 2016
  5. [SPARK-18985][SS] Add missing @InterfaceStability.Evolving for Struct…

    …ured Streaming APIs
    
    ## What changes were proposed in this pull request?
    
    Add missing InterfaceStability.Evolving for Structured Streaming APIs
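    
    For illustration, what the annotation looks like on a public class (the class name is hypothetical):
    
    ```scala
    import org.apache.spark.annotation.InterfaceStability
    
    // Marks the API as public but still subject to change between minor releases.
    @InterfaceStability.Evolving
    class SomeStructuredStreamingApi
    ```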
    
    ## How was this patch tested?
    
    Compiling the code.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16385 from zsxwing/SPARK-18985.
    zsxwing committed Dec 23, 2016
Commits on Dec 22, 2016
  1. [SPARK-18537][WEB UI] Add a REST api to serve spark streaming informa…

    …tion
    
    ## What changes were proposed in this pull request?
    
    This PR inherits from #16000 and completes #15904.
    
    **Description**
    
    - Augment the `org.apache.spark.status.api.v1` package for serving streaming information.
    - Retrieve the streaming information through StreamingJobProgressListener.
    
    > this API should cover exactly the same amount of information as you can get from the web interface
    > the implementation is based on the current REST implementation of spark-core
    > and will be available for running applications only
    >
    > https://issues.apache.org/jira/browse/SPARK-18537
    
    ## How was this patch tested?
    
    Local test.
    
    Author: saturday_s <[email protected]>
    Author: Chan Chor Pang <[email protected]>
    Author: peterCPChan <[email protected]>
    
    Closes #16253 from saturday-shi/SPARK-18537.
    saturday-shi committed with vanzin Dec 22, 2016
  2. [SPARK-18975][CORE] Add an API to remove SparkListener

    ## What changes were proposed in this pull request?
    
    In current Spark we can add a customized SparkListener through the `SparkContext#addSparkListener` API, but there is no equivalent API to remove a registered one. In our scenario, SparkListeners are added repeatedly as the environment changes. Without the ability to remove listeners, many registered listeners can accumulate, which is unnecessary and potentially affects performance. So this PR proposes adding an API to remove a registered listener.
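    
    A hedged usage sketch (assuming a running `SparkContext` named `sc`): register a listener, then remove it once it is no longer needed so listeners do not accumulate.
    
    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
    
    val listener = new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} finished")
    }
    sc.addSparkListener(listener)
    // ... later, when the environment changes:
    sc.removeSparkListener(listener)
    ```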
    
    ## How was this patch tested?
    
    Added a unit test to verify it.
    
    Author: jerryshao <[email protected]>
    
    Closes #16382 from jerryshao/SPARK-18975.
    jerryshao committed with rxin Dec 22, 2016
  3. [SPARK-18973][SQL] Remove SortPartitions and RedistributeData

    ## What changes were proposed in this pull request?
    SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with global flag false) that subsumed SortPartitions.
    
    ## How was this patch tested?
    Also updated test cases to reflect the removal.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #16381 from rxin/SPARK-18973.
    rxin committed with hvanhovell Dec 22, 2016
  4. [SPARK-16975][SQL][FOLLOWUP] Do not duplicately check file paths in d…

    …ata sources implementing FileFormat
    
    ## What changes were proposed in this pull request?
    
    This PR cleans up the duplicated checking of file paths in the data sources implementing `FileFormat` and prevents attempting to list files twice in the ORC data source.
    
    #14585 handles a problem with partition column names containing `_`, and that issue itself is resolved correctly. However, it seems the data sources implementing `FileFormat` are validating the paths redundantly. Judging from the comment in `CSVFileFormat`, `// TODO: Move filtering.`, I guess we don't have to perform this check twice.
    
       Currently, this filtering seems to happen in `PartitioningAwareFileIndex.shouldFilterOut` and `PartitioningAwareFileIndex.isDataPath`. So, `FileFormat.inferSchema` will always receive leaf files. For example, running the code below:
    
       ``` scala
       spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
       spark.read.parquet("/tmp/parquet")
       ```
    
       gives the paths below without directories but just valid data files:
    
       ``` bash
       /tmp/parquet/_col=0/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
       /tmp/parquet/_col=1/part-r-00000-094a8efa-bece-4b50-b54c-7918d1f7b3f8.snappy.parquet
       /tmp/parquet/_col=2/part-r-00000-25de2b50-225a-4bcf-a2bc-9eb9ed407ef6.snappy.parquet
       ...
       ```
    
       to `FileFormat.inferSchema`.
    
    ## How was this patch tested?
    
    Unit test added in `HadoopFsRelationTest` and related existing tests.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #14627 from HyukjinKwon/SPARK-16975.
    HyukjinKwon committed with rxin Dec 22, 2016
  5. [SPARK-18922][TESTS] Fix more resource-closing-related and path-relat…

    …ed test failures in identified ones on Windows
    
    ## What changes were proposed in this pull request?
    
    There are several tests failing due to resource-closing-related and path-related problems on Windows, as below.
    
    - `SQLQuerySuite`:
    
    ```
    - specifying database name for a temporary table is not allowed *** FAILED *** (125 milliseconds)
      org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    ```
    
    - `JsonSuite`:
    
    ```
    - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 milliseconds)
      org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    ```
    
    - `StateStoreSuite`:
    
    ```
    - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
      java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
      at org.apache.hadoop.fs.Path.initialize(Path.java:206)
      at org.apache.hadoop.fs.Path.<init>(Path.java:116)
      at org.apache.hadoop.fs.Path.<init>(Path.java:89)
      ...
      Cause: java.net.URISyntaxException: Relative path in absolute URI: StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?????-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
      at java.net.URI.checkPath(URI.java:1823)
      at java.net.URI.<init>(URI.java:745)
      at org.apache.hadoop.fs.Path.initialize(Path.java:203)
    ```
    
    - `HDFSMetadataLogSuite`:
    
    ```
    - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
      java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
      at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
      at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
      at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
    
    - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
      java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
      at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
      at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
      at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
    ```
    
    And there are some tests failing due to the length limitation of cmd on Windows, as below:
    
    - `LauncherBackendSuite`:
    
    ```
    - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
      The code passed to eventually never returned normally. Attempted 283 times over 30.0960053 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
      org.scalatest.exceptions.TestFailedDueToTimeoutException:
      at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
    
    - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 milliseconds)
      The code passed to eventually never returned normally. Attempted 282 times over 30.037987100000002 seconds. Last failure message: The reference was null. (LauncherBackendSuite.scala:56)
      org.scalatest.exceptions.TestFailedDueToTimeoutException:
      at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
      at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
    ```
    
    The executed command is https://gist.github.com/HyukjinKwon/d3fdd2e694e5c022992838a618a516bd, which is about 16K characters long; however, the length limit is 8K on Windows, so the launch fails.
    
    This PR proposes to fix the test failures on Windows and to skip the tests that fail due to the length limitation.
    
    ## How was this patch tested?
    
    Manually tested via AppVeyor
    
    **Before**
    
    `SQLQuerySuite`: https://ci.appveyor.com/project/spark-test/spark/build/306-pr-references
    `JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/307-pr-references
    `StateStoreSuite` : https://ci.appveyor.com/project/spark-test/spark/build/305-pr-references
    `HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/304-pr-references
    `LauncherBackendSuite`: https://ci.appveyor.com/project/spark-test/spark/build/303-pr-references
    
    **After**
    
    `SQLQuerySuite`: https://ci.appveyor.com/project/spark-test/spark/build/293-SQLQuerySuite
    `JsonSuite`: https://ci.appveyor.com/project/spark-test/spark/build/294-JsonSuite
    `StateStoreSuite`: https://ci.appveyor.com/project/spark-test/spark/build/297-StateStoreSuite
    `HDFSMetadataLogSuite`: https://ci.appveyor.com/project/spark-test/spark/build/319-pr-references
    `LauncherBackendSuite`: failed test skipped.
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #16335 from HyukjinKwon/more-fixes-on-windows.
    HyukjinKwon committed with srowen Dec 22, 2016
  6. [SPARK-18953][CORE][WEB UI] Do not show the link to a dead worker on …

    …the master page
    
    ## What changes were proposed in this pull request?
    
    For a dead worker, we will not be able to see its worker page anyway. This PR removes the links to dead workers from the master page.
    
    ## How was this patch tested?
    
    Since this is a UI change, please do the following steps manually.
    
    **1. Start a master and a slave**
    
    ```
    sbin/start-master.sh
    sbin/start-slave.sh spark://10.22.16.140:7077
    ```
    
    ![1_live_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373572/d7e187d6-c6d4-11e6-9110-f4371d215dec.png)
    
    **2. Stop the slave**
    ```
    sbin/stop-slave.sh
    ```
    
    ![2_dead_worker_a](https://cloud.githubusercontent.com/assets/9700541/21373579/dd9e9704-c6d4-11e6-9047-a22cb0aa83ed.png)
    
    **3. Start a slave**
    
    ```
    sbin/start-slave.sh spark://10.22.16.140:7077
    ```
    
    ![3_dead_worder_a_and_live_worker_b](https://cloud.githubusercontent.com/assets/9700541/21373582/e1b207f4-c6d4-11e6-89cb-6d8970175a5e.png)
    
    **4. Stop the slave**
    
    ```
    sbin/stop-slave.sh
    ```
    
    ![4_dead_worker_a_and_b](https://cloud.githubusercontent.com/assets/9700541/21373584/e5fecb4e-c6d4-11e6-95d3-49defe366946.png)
    
    **5. Driver list testing**
    
    Do the following, then stop the slave within a minute with `sbin/stop-slave.sh`.
    
    ```
    sbin/start-master.sh
    sbin/start-slave.sh spark://10.22.16.140:7077
    bin/spark-submit --master=spark://10.22.16.140:7077 --deploy-mode=cluster --class org.apache.spark.examples.SparkPi examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar 10000
    ```
    
    ![5_dead_worker_in_driver_list](https://cloud.githubusercontent.com/assets/9700541/21401320/be6cc9fc-c768-11e6-8de7-6512961296a5.png)
    
    Author: Dongjoon Hyun <[email protected]>
    
    Closes #16366 from dongjoon-hyun/SPARK-18953.
    dongjoon-hyun committed with srowen Dec 22, 2016
  7. [DOC] bucketing is applicable to all file-based data sources

    ## What changes were proposed in this pull request?
    Starting with Spark 2.1.0, the bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that.
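    
    A hedged usage sketch (assuming a `SparkSession` named `spark`; the table name is illustrative):
    
    ```scala
    // Bucketing now works with any file-based data source, e.g. Parquet here.
    spark.range(100)
      .write
      .bucketBy(4, "id")
      .sortBy("id")
      .format("parquet")
      .saveAsTable("bucketed_ids")
    ```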
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <[email protected]>
    
    Closes #16349 from rxin/ds-doc.
    rxin committed Dec 22, 2016
  8. [SQL] Minor readability improvement for partition handling code

    ## What changes were proposed in this pull request?
    This patch includes minor changes to improve the readability of the partition handling code. I'm in the middle of implementing a new feature and found some of the naming / implicit type inference not as intuitive as it could be.
    
    ## How was this patch tested?
    This patch should have no semantic change and the changes should be covered by existing test cases.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #16378 from rxin/minor-fix.
    rxin committed with cloud-fan Dec 22, 2016
  9. [SPARK-18908][SS] Creating StreamingQueryException should check if lo…

    …gicalPlan is created
    
    ## What changes were proposed in this pull request?
    
    This PR audits places using `logicalPlan` in StreamExecution and ensures they all handle the case where `logicalPlan` cannot be created.
    
    In addition, this PR also fixes the following issues in `StreamingQueryException`:
    - `StreamingQueryException` and `StreamExecution` are cyclically dependent because `StreamingQueryException`'s constructor calls `StreamExecution`'s `toDebugString`, which uses `StreamingQueryException`. Hence it would output a `null` value in the error message.
    - The stack trace was duplicated when calling Throwable.printStackTrace because StreamingQueryException's toString contains the stack trace.
    
    ## How was this patch tested?
    
    The updated `test("max files per trigger - incorrect values")`. I found this issue when I switched from `testStream` to the real code to verify the failure in this test.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16322 from zsxwing/SPARK-18907.
    zsxwing committed Dec 22, 2016
  10. [BUILD] make-distribution should find JAVA_HOME for non-RHEL systems

    ## What changes were proposed in this pull request?
    
    make-distribution.sh should find JAVA_HOME for Ubuntu, Mac and other non-RHEL systems
    
    ## How was this patch tested?
    
    Manually
    
    Author: Felix Cheung <[email protected]>
    
    Closes #16363 from felixcheung/buildjava.
    felixcheung committed with Felix Cheung Dec 22, 2016
  11. [FLAKY-TEST] InputStreamsSuite.socket input stream

    ## What changes were proposed in this pull request?
    
    https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.InputStreamsSuite&test_name=socket+input+stream
    
    ## How was this patch tested?
    
    Tested 2,000 times.
    
    Author: Burak Yavuz <[email protected]>
    
    Closes #16343 from brkyvz/sock.
    brkyvz committed with tdas Dec 22, 2016
  12. [SPARK-18903][SPARKR] Add API to get SparkUI URL

    ## What changes were proposed in this pull request?
    
    API for SparkUI URL from SparkContext
    
    ## How was this patch tested?
    
    manual, unit tests
    
    Author: Felix Cheung <[email protected]>
    
    Closes #16367 from felixcheung/rwebui.
    felixcheung committed with Felix Cheung Dec 22, 2016
  13. [SPARK-18528][SQL] Fix a bug to initialise an iterator of aggregation…

    … buffer
    
    ## What changes were proposed in this pull request?
    This PR fixes a `NullPointerException` caused by the following `limit + aggregate` query:
    ```
    scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value")
    scala> df.limit(2).groupBy("id").count().show
    WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    ```
    The root culprit is that [`$doAgg()`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L596) skips the initialization of [the buffer iterator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L603); `BaseLimitExec` sets `stopEarly=true` and `$doAgg()` exits in the middle without the initialization.
    
    ## How was this patch tested?
    Added a test to check if no exception happens for limit + aggregates in `DataFrameAggregateSuite.scala`.
    
    Author: Takeshi YAMAMURO <[email protected]>
    
    Closes #15980 from maropu/SPARK-18528.
    maropu committed with hvanhovell Dec 22, 2016
  14. [SPARK-18234][SS] Made update mode public

    ## What changes were proposed in this pull request?
    
    Made update mode public. As part of that, here are the changes:
    - Update DataStreamWriter to accept "update" (see the usage sketch below)
    - Changed package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
    - Added update-mode state removal with watermark to StateStoreSaveExec
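    
    A hedged usage sketch of the now-public mode (the source and sink choices are illustrative):
    
    ```scala
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()
    
    // In update mode, only the rows updated since the last trigger are emitted.
    val query = lines.groupBy("value").count()
      .writeStream
      .outputMode("update")
      .format("console")
      .start()
    ```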
    
    ## How was this patch tested?
    
    Added new tests in changed modules
    
    Author: Tathagata Das <[email protected]>
    
    Closes #16360 from tdas/SPARK-18234.
    tdas committed Dec 22, 2016
  15. [SPARK-17807][CORE] split test-tags into test-JAR

    Remove spark-tags' compile-scope dependency (and, indirectly, spark-core's compile-scope transitive dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.
    
    Alternative to #16303.
    
    Author: Ryan Williams <[email protected]>
    
    Closes #16311 from ryan-williams/tt.
    ryan-williams committed with vanzin Dec 22, 2016
Commits on Dec 21, 2016
  1. [SPARK-18588][SS][KAFKA] Create a new KafkaConsumer when error happen…

    …s to fix the flaky test
    
    ## What changes were proposed in this pull request?
    
    When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.
    
    This PR also assigns a new group id to the newly created consumer to guard against a possible race condition: the broken consumer cannot talk to the Kafka cluster in `close` but the new consumer can. I'm not sure if this will happen or not; it's just for safety, to avoid the Kafka cluster thinking there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)
    
    ## How was this patch tested?
    
    In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console , it ran this flaky test 120 times and all passed.
    
    Author: Shixiong Zhu <[email protected]>
    
    Closes #16282 from zsxwing/kafka-fix.
    zsxwing committed with tdas Dec 21, 2016
  2. [SPARK-18775][SQL] Limit the max number of records written per file

    ## What changes were proposed in this pull request?
    Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.
    
    This patch introduces a new write config option `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the maximum number of records written to a single file. A non-positive value indicates there is no limit (the same behavior as not having this flag).
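    
    A hedged usage sketch (the values and output path are illustrative):
    
    ```scala
    // Session-wide default via the new SQL config.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)
    
    // Per-write override via the new write option.
    spark.range(5000000L)
      .write
      .option("maxRecordsPerFile", 1000000L)
      .parquet("/tmp/capped-output")
    ```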
    
    ## How was this patch tested?
    Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #16204 from rxin/SPARK-18775.
    rxin committed with hvanhovell Dec 21, 2016