Spark First Experience: Overview

    // Excerpt: flatten each source row into one Row per litigant name.
    import org.apache.spark.sql.Row

    def parseLitigant(row: Row): Array[Row] = {
      val relateInfo = ... // how relateInfo is derived from the row is not shown in the excerpt
      if (relateInfo.length != 0) { // guard reconstructed; the condition is truncated in the excerpt
        val docId = row.getAs[String]("doc_id")
        val appellor = relateInfo(0).getAs[String]("value")
        val appellorList = appellor.split(",")
        for (name <- appellorList) yield Row(docId, name, "")
      } else {
        Array[Row]()
      }
    }

    val testRDD = testDF.rdd.flatMap(row => parseLitigant(row))
    val newDF = spark.createDataFrame(testRDD, schema)
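The excerpt references a schema without showing it. A minimal sketch of one that matches the three string fields in Row(docId, name, "") above might look like the following; the column names are assumptions for illustration, not taken from the article:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical schema for the three string fields produced above.
    val schema = StructType(Seq(
      StructField("doc_id", StringType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("relation", StringType, nullable = true)
    ))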
Investigation of Dynamic Allocation in Spark
http://bbzoh.cn/content/16/1102/18/16883405_603440248.shtml
2016/11/2 18:16:00
In Spark the unit of resource is the executor; an executor combines a number of CPU cores with an amount of memory. We have already described how to calculate the desired resources (the number of executors); the next step is to issue these resource requests to the cluster manager so it can allocate or deallocate resources. Here we introduce how YARN supports resource allocation and deallocation.
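A minimal sketch of turning on dynamic allocation against YARN; the property names are Spark's standard dynamic-allocation settings, while the executor bounds and timeout are illustrative values rather than ones from the article:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")               // external shuffle service is required on YARN
      .set("spark.dynamicAllocation.minExecutors", "1")           // lower bound on executors
      .set("spark.dynamicAllocation.maxExecutors", "20")          // upper bound on executors
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")  // release executors idle longer than this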
Using the Spark ML Pipeline for Machine Learning
http://bbzoh.cn/content/16/1102/16/16883405_603420250.shtml
2016/11/2 16:45:52
Using the Spark ML Pipeline for Machine Learning. 1. About the Spark ML pipeline and machine learning: a typical machine learning build comprises several stages: (1) ETL of the source data; (2) data preprocessing; (3) feature selection; (4) model training and validation. These four steps can be abstracted as a multi-step pipeline that runs from data collection through to the final output we need.
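For example, a minimal Spark ML pipeline covering these stages might look like the sketch below; trainingDF and testDF are assumed to be DataFrames with "text" and "label" columns, and the stages (Tokenizer, HashingTF, LogisticRegression) are illustrative choices rather than the article's exact ones:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Preprocessing and feature-extraction stages.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    // Model-training stage.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    // Chain the stages into a single Pipeline, then fit it and apply the fitted model.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(trainingDF)
    val predictions = model.transform(testDF)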
How To Improve Deep Learning Performance
http://bbzoh.cn/content/16/1017/16/16883405_599122738.shtml
2016/10/17 16:11:48
If you have one more idea, or an extension of one of the ideas listed, let me know; I and all readers would benefit! Try a batch size of one (online learning). Here is how to handle the overwhelm: pick one group (Data, Algorithms, Tuning, or Ensembles); pick one method from that group; pick one thing to try from the chosen method; compare the results and keep the change if there was an improvement; repeat. Share your results.
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

    val rows: RDD[Vector] = ... // an RDD of local vectors
    // Create a RowMatrix from an RDD[Vector].
    val mat: RowMatrix = new RowMatrix(rows)

    val indexedRows: RDD[IndexedRow] = ... // an RDD of indexed rows
    // Create an IndexedRowMatrix from an RDD[IndexedRow].
    val indexedMat: IndexedRowMatrix = new IndexedRowMatrix(indexedRows)
Experiments with HBase, Phoenix, and SQL at Scale
http://bbzoh.cn/content/16/1008/09/16883405_596604361.shtml
2016/10/8 9:34:50
Experiments with HBase, Phoenix, and SQL at Scale. While this is not a lot of data at all for this type of cluster (it all fit easily in the HBase block cache), it nonetheless lets us gauge how Phoenix and HBase can scale their workloads across the cluster. Increase phoenix.query.threadPoolSize (1000, 2000, or 4000) and phoenix.query.queueSize (perhaps 100000). Phoenix and HBase do quite well in terms of scaling.
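As a sketch, these Phoenix client settings typically go in the client-side hbase-site.xml; the values shown are the ones the excerpt suggests trying, and should be tuned for the actual workload:

    <!-- client-side hbase-site.xml; values from the excerpt, tune per workload -->
    <property>
      <name>phoenix.query.threadPoolSize</name>
      <value>2000</value>
    </property>
    <property>
      <name>phoenix.query.queueSize</name>
      <value>100000</value>
    </property>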