Some big data (Spark/Hadoop) interview questions I ran into recently

jasonbetter 2017-06-07

Company A:

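1. Tell me about the projects you have worked on. What were the hard parts, the key points, and the things to watch out for?
2. Talk about multithreading. If it were up to you, how would you implement a thread pool?
3. Explain the principles and mechanisms of MapReduce or HDFS. (Map tasks read the input splits.)
4. What is shuffle? How do you tune it?
5. What language is the project written in? Scala? What are Scala's characteristics, and how does it differ from Java?
6. How is your theoretical foundation, for example data structures: quicksort, or trees? Tell me what you know about trees.
7. How is your math?
8. Talk about databases and SQL: the left outer join, its principle and its implementation.
9. What else do you know about data? Database engines?
10. How are the Hadoop racks configured?
11. What lessons have you learned about HBase design?
12. Do you operate on HBase through an API or through some tool?
13. How do you understand scheduling? Do you use any tool for it?
14. Do you use a tool like Kettle, or do you write your own programs? How does your company do it?
15. How long is the development cycle of your data center?
16. What kind of data do you store in HBase?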
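
For question 2 (implementing a thread pool yourself), here is a minimal sketch in Python of the usual idea: a fixed set of worker threads pulling tasks from a shared blocking queue. The class name SimpleThreadPool and its methods are my own hypothetical names, not a library API, and a real pool would also need things like bounded queues and rejection policies.

```python
import queue
import threading


class SimpleThreadPool:
    """Hypothetical fixed-size pool: workers pull callables from a shared queue."""

    _STOP = object()  # sentinel telling a worker to exit

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            item = self._tasks.get()  # blocks until a task or sentinel arrives
            if item is self._STOP:
                break
            func, args, result = item
            try:
                result["value"] = func(*args)
            except Exception as exc:  # a failing task must not kill the worker
                result["error"] = exc
            finally:
                result["done"].set()

    def submit(self, func, *args):
        """Queue a task; the returned dict holds a 'done' event and the outcome."""
        result = {"done": threading.Event()}
        self._tasks.put((func, args, result))
        return result

    def shutdown(self):
        """Let already queued tasks finish, then stop every worker."""
        for _ in self._workers:
            self._tasks.put(self._STOP)
        for worker in self._workers:
            worker.join()


pool = SimpleThreadPool(num_workers=3)
handles = [pool.submit(pow, n, 2) for n in range(5)]
pool.shutdown()
squares = [h["value"] for h in handles]
print(squares)
```

Because the queue is FIFO, the stop sentinels are only reached after every task queued before shutdown() has been picked up, so all results are available once shutdown() returns.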

Second round. Three interviewers.

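1. Tell me about your projects.
2. How do you usually handle multithreading? How do you think about asynchrony? How have you dealt with locks, for example when two people operate on the same thing at the same time? How do you handle variables that are touched by concurrent operations?
3. You mostly use HTTP, right? Any special headers? Describe your understanding of TCP/IP.
4. Have you used ZooKeeper? What are its use cases? (HA, state maintenance, distributed locks, global configuration file management.) What do you use to operate ZooKeeper?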

On Spark:

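5. Spark development has two sides. Which two?
6. Take an operation that reads a file on HDFS and then counts how many lines it has. Can you describe the process? Is the count computed in memory or on disk? (My answer: on disk.)
7. Is Spark faster than MapReduce? Why is it faster, and where does the speed come from? (1. In-memory iteration. 2. The RDD design. 3. The design of the operators.)
8. Why is Spark SQL faster than Hive?
10. What does the RDD data structure look like? (An array of partitions, plus dependencies.)
11. What about the Hadoop ecosystem? Describe your understanding. (HDFS: underlying storage. HBase: database. Hive: data warehouse. ZooKeeper: distributed locks. Spark: big data analytics.)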
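
To make questions 6 and 10 more concrete, here is a toy model in Python, not Spark code. ToyRDD and text_file are hypothetical names of mine. It only illustrates the shape described above: a list of partitions, a list of parent dependencies, and a compute function per partition. The count is taken by streaming each partition through an iterator and adding up the partial counts, so no partition has to be held in full.

```python
class ToyRDD:
    """Hypothetical miniature of an RDD: partitions, dependencies, compute."""

    def __init__(self, num_partitions, compute, dependencies=()):
        self.partitions = list(range(num_partitions))  # partition indices
        self.dependencies = list(dependencies)         # parent ToyRDDs
        self._compute = compute                        # index -> iterator

    def iterator(self, index):
        return self._compute(index)

    def map(self, func):
        parent = self
        # Narrow dependency: child partition i reads only parent partition i.
        return ToyRDD(
            len(parent.partitions),
            lambda i: (func(x) for x in parent.iterator(i)),
            dependencies=[parent],
        )

    def count(self):
        # Each "task" streams one partition and returns a single number;
        # the "driver" only adds up the partial counts.
        return sum(sum(1 for _ in self.iterator(i)) for i in self.partitions)


def text_file(blocks):
    """Stand-in for reading a file: one partition per block of lines."""
    return ToyRDD(len(blocks), lambda i: iter(blocks[i]))


lines = text_file([["a", "b"], ["c"], ["d", "e", "f"]])
upper = lines.map(str.upper)
total = upper.count()
print(total, upper.dependencies == [lines])
```

Nothing is computed until count() pulls on the iterators, which is the lazy behaviour the lineage design is meant to give.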

Company B:

1. The workflow of a Spark job.

Submitting the job.
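The user submits a job. The entry point is sc (the SparkContext). sc creates a TaskScheduler; depending on the submit mode, the corresponding TaskSchedulerImpl does the task scheduling.
At the same time it creates the SchedulerBackend and the DAGScheduler. The DAGScheduler divides the job into stages according to the wide or narrow dependencies of the RDDs. Once a stage is worked out, its tasks are put into a TaskSet and handed to the TaskScheduler.
The AppClient registers with the Master. Data locality is checked first, and the best available locality level is preferred for execution.
Executors are spread out across the workers and suitable Executors are chosen to run the tasks. The ExecutorRunner creates the CoarseGrainedExecutorBackend process, which runs tasks in a thread pool.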

In the reverse direction:
Executors register back with the SchedulerBackend.

In Spark on YARN mode, the driver is responsible for scheduling the computation and the ApplicationMaster is responsible for requesting resources.
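
The stage division step can be sketched like this. It is a toy walk over a hand-built lineage in Python, not the DAGScheduler itself; Node and build_stages are hypothetical names. The rule it shows is the one described above: follow narrow dependencies inside the current stage, and cut a new parent stage at every wide (shuffle) dependency.

```python
class Node:
    """Hypothetical lineage node."""

    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)  # (parent Node, is_shuffle) pairs


def build_stages(final):
    """Return stages as lists of node names, parent stages first."""
    stages = []
    built = set()  # names of nodes that already head a stage

    def stage_for(root):
        if root.name in built:
            return
        built.add(root.name)
        members, pending = [], [root]
        while pending:
            node = pending.pop()
            if node.name in members:
                continue
            members.append(node.name)
            for parent, is_shuffle in node.deps:
                if is_shuffle:
                    stage_for(parent)       # wide dependency: new parent stage
                else:
                    pending.append(parent)  # narrow dependency: same stage
        stages.append(list(reversed(members)))

    stage_for(final)
    return stages


# Word-count shaped lineage; only the reduceByKey-like step shuffles.
lines = Node("lines")
words = Node("words", [(lines, False)])
pairs = Node("pairs", [(words, False)])
counts = Node("counts", [(pairs, True)])
out = Node("out", [(counts, False)])

stages = build_stages(out)
print(stages)
```

Here the three narrow steps before the shuffle land in one stage and the steps after it in another, and the parent stage is listed first because it has to finish before its child can run.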

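2. The process of an HBase PUT.
3. Inside an RDD operator you operate on an external map, for example by putting data into it, and then you iterate over the map outside the operator. Is there any problem with that?
4. The shuffle process, and how to tune it.
5. The numbers 1 to 10 are distributed over 5 partitions. Use operators to compute the maximum or the sum, without broadcast variables, accumulators, or sortByKey.
6. Joining a large table with a small table.
7. Do you know how Spark reads HBase? Spark on HBase, the one from Huawei.
8. Have you built secondary indexes on HBase?
9. What are the advantages of sort shuffle?
10. How are stages divided? What are wide and narrow dependencies?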
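
For question 5, one way to answer is "reduce inside each partition first, then combine the partial results". The sketch below simulates that in plain Python with lists standing in for partitions. The helper aggregate is my own and only mirrors the zero value / per-partition function / combine function shape; how the numbers are spread over the partitions is an assumption too.

```python
from functools import reduce

# Assumed layout: 1..10 spread over 5 partitions.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]


def aggregate(parts, zero, seq_op, comb_op):
    """Fold each partition on its own, then merge the partial results."""
    partials = [reduce(seq_op, part, zero) for part in parts]  # "executor" side
    return reduce(comb_op, partials, zero)                     # "driver" side


total = aggregate(partitions, 0, lambda acc, x: acc + x, lambda a, b: a + b)
largest = aggregate(partitions, float("-inf"), max, max)
print(total, largest)
```

No shared state is needed: each partition only returns one value, which is why neither a broadcast variable nor an accumulator comes into it.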

Company W:

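1. Tell me about the projects you have done (the overall approach).
2. Questions about the general situation: the cluster size at the company, the HBase data volume, the data scale.
3. Then they picked the data factory project and asked about it in detail. Questions about HBase, mixed with some small talk.
4. What is secondary sort? What is top N? Which interface do you have to implement for secondary sort?
5. Where does the data you compute on come from?
6. What is Kafka direct? Why use it, and what are its advantages? How does it differ from the other approaches?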

http://blog.csdn.net/erfucun/article/details/52275369

  /**
   * Create an input stream that directly pulls messages from Kafka Brokers
   * without using any receiver. This stream can guarantee that each message
   * from Kafka is included in transformations exactly once (see points below).
   *
   * Points to note:
   *  - No receivers: This stream does not use any receiver. It directly queries Kafka
   *  - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
   *    by the stream itself. For interoperability with Kafka monitoring tools that depend on
   *    Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
   *    You can access the offsets used in each batch from the generated RDDs (see
   *    [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
   *  - Failure Recovery: To recover from driver failures, you have to enable checkpointing
   *    in the [[StreamingContext]]. The information on consumed offset can be
   *    recovered from the checkpoint. See the programming guide for details (constraints, etc.).
   *  - End-to-end semantics: This stream ensures that every records is effectively received and
   *    transformed exactly once, but gives no guarantees on whether the transformed data are
   *    outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
   *    that the output operation is idempotent, or use transactions to output records atomically.
   *    See the programming guide for more details.
   *
   * @param ssc StreamingContext object
   * @param kafkaParams Kafka <a href="http://kafka./documentation.html#configuration">
   *    configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
   *    host1:port1,host2:port2 form.
   * @param fromOffsets Per-topic/partition Kafka offsets defining the (inclusive)
   *    starting point of the stream
   * @param messageHandler Function for translating each message and metadata into the desired type
   */
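
The "End-to-end semantics" point in the comment above is worth a small illustration. This is a plain Python simulation, not the Kafka or Spark API: log, sink, process_batch and the topic name "events" are all hypothetical. It assumes each batch covers an offset range with an inclusive start and an exclusive end, and it makes the output idempotent by keying every write on (topic, partition, offset), so a replayed batch overwrites rows instead of duplicating them.

```python
def process_batch(log, from_offset, until_offset, sink):
    """Transform records in [from_offset, until_offset) and upsert by offset."""
    for offset in range(from_offset, until_offset):
        # The key identifies the source record, so writing it twice
        # leaves exactly one row in the sink.
        sink[("events", 0, offset)] = log[offset].upper()


log = ["a", "b", "c", "d", "e"]  # one hypothetical topic partition
sink = {}                        # stand-in for a key-value store

process_batch(log, 0, 3, sink)   # batch 1
process_batch(log, 3, 5, sink)   # batch 2; assume a crash before it is recorded
process_batch(log, 3, 5, sink)   # batch 2 replayed after recovery

values = [sink[("events", 0, i)] for i in range(len(log))]
print(values, len(sink))
```

The other route the comment mentions, writing the results and the offsets in one transaction, is not shown here.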