Hadoop Fundamental

HDFS

Block size 设置大点减小seek time

Namenodes 不持久化保存 block 位置,在启动的时候从datanode获取

Namenode 单点问题

  • namenode 异步原子写状态到本地或远程存储上
  • secondary namenode

Block cache

  • Managed by namenode
  • Configurable on a per-file basis

Architecture

Namenode federation

  • Multi namespace multi namenodes

HDFS HA

  • NFS filer
  • QJM

Failover and fencing

  • failover controller (default zookeeper)
  • fencing QJM only allow one namenode to write to the edit log at one time
  • client dns map

Read/Write

YARN

  • ResourceManager
  • ApplicationMaster
  • Resource Model
    • Resource name
    • Amount of memory
    • CPUs
    • Eventually resources such as disk/network I/O, GPUs, and more

Hadoop I/O

Data Integrity

CRC-32 循环检测

Compression

Native lib CodecPool

Serialization

Writable Interface

MapReduce

  • MRUnit
  • Web UI
  • Debug (Remote)
  • Log
  • Tuning a Job, Profile
  • Workflow

How work

  • Job Submission
  • Job initialization
  • Ubertask for small application
  • Task assignment
  • Task execution
  • Progress and Status Updates

Phases of a Hadoop job

FileInputFormat splits max(minimumSize, min(maximumsize, blockSize) small files CombineFileInputFormat

日志收集系统

results matching ""

    No results matching ""