yarn - Why DistributedCache is caching all files directly into the root of the tmp storage dir in Hadoop 2
I'm migrating Hadoop 1.0.4 code to the Hadoop 2.3 platform, and ran into a strange change in DistributedCache behavior:
In Hadoop 1, if I want to cache the file at /user/foo/file/bar/name.avro, DistributedCache copies the file into the local cache folder and recreates the same subdirectories. The file is stored at /[root_of_tmp_cache_dir]/user/foo/file/bar/name.avro.
Now the same code in Hadoop 2 places the file directly in the root folder without creating any subdirectories. The cached file is stored at /[root_of_tmp_cache_dir]/name.avro.
This causes name conflicts when caching multiple files that all have names like part-r-00000.avro.
Of course, applying a symlink and renaming the cached file to a unique name is one way to solve the problem. More generally, though, creating unique names is not trivial in many cases, especially when the names must be guaranteed unique across different mappers/reducers. I'm wondering whether there are other ways to change this behavior, such as creating a folder within the tmp dir, or perhaps tuning a MapReduce configuration parameter?
One thing I tried was to create the URI as "path#path", doing the linking ourselves, but that throws the following exception:
14-10-2014 16:05:41 PDT admm_train INFO - Caused by: java.lang.IllegalArgumentException: Resource name must be relative
14-10-2014 16:05:41 PDT admm_train INFO - at org.apache.hadoop.mapreduce.v2.util.MRApps.parseDistributedCacheArtifacts(MRApps.java:489)
14-10-2014 16:05:41 PDT admm_train INFO - at org.apache.hadoop.mapreduce.v2.util.MRApps.setupDistributedCache(MRApps.java:430)
14-10-2014 16:05:41 PDT admm_train INFO - at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:455)
14-10-2014 16:05:41 PDT admm_train INFO - at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:283)
14-10-2014 16:05:41 PDT admm_train INFO - at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
This is caused by a change in the internal behavior of DistributedCache between Hadoop 1 and Hadoop 2.
In Hadoop 1, the file to be cached is stored in the local tmp directory with its original path structure retained. For example, if you cache hdfs:///foo/bar/file1 in Hadoop 1, it is stored at /[some tmp path]/foo/bar/file1.
In Hadoop 2, DistributedCache strips the path structure: hdfs:///foo/bar/file1 is stored directly at /[some tmp path]/file1.
Also, when symlink names are used, Hadoop 2 renames the file to the link name while Hadoop 1 does not. This results in compatibility problems when switching from Hadoop 1 to Hadoop 2.
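To make the difference concrete, here is a small model of the two layouts as described above. It is only an illustration, not Hadoop's actual code: the class CacheLayoutModel, both helper methods and the /tmp/cache root are made up for this sketch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical model of the two cache layouts described above (not Hadoop code).
final class CacheLayoutModel {

    // Hadoop 1 as described: the full source path is kept under the cache root.
    static Path hadoop1Location(String cacheRoot, String hdfsPath) {
        String relative = hdfsPath.replaceAll("^/+", ""); // drop leading slashes
        return Paths.get(cacheRoot, relative);
    }

    // Hadoop 2 as described: only the file name is kept under the cache root.
    static Path hadoop2Location(String cacheRoot, String hdfsPath) {
        return Paths.get(cacheRoot).resolve(Paths.get(hdfsPath).getFileName());
    }

    public static void main(String[] args) {
        String root = "/tmp/cache"; // stand-in for the real tmp cache dir
        String a = "/foo/run1/part-r-00000.avro";
        String b = "/foo/run2/part-r-00000.avro";

        // Distinct locations under the Hadoop 1 layout.
        System.out.println(hadoop1Location(root, a));
        System.out.println(hadoop1Location(root, b));

        // Same location under the Hadoop 2 layout, hence the name conflict.
        System.out.println(hadoop2Location(root, a).equals(hadoop2Location(root, b)));
    }
}
```

Two files with the same name in different directories end up at the same location in the second layout, which is exactly the part-r-00000.avro conflict from the question.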
An easy solution is to use a symlink name and access the file via that name. However the files are stored, you can still access them in the same fashion.
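As a rough sketch of the symlink approach, the link name can be derived from the full HDFS path so that it is both relative (the exception in the question says the resource name must be relative) and different for files in different directories. The class CacheLinkNames and its helpers are hypothetical, and flattening '/' to '_' is just one possible naming scheme. It assumes the paths contain no characters that need URI escaping, and two paths that differ only in '/' versus '_' would still collide.

```java
import java.net.URI;

// Hypothetical helper: builds "path#linkName" URIs with a relative, flat link name.
final class CacheLinkNames {

    // "/user/foo/file/bar/name.avro" -> "user_foo_file_bar_name.avro"
    static String linkNameFor(String hdfsPath) {
        String relative = hdfsPath.replaceAll("^/+", ""); // must not start with '/'
        return relative.replace('/', '_');
    }

    // Appends the link name as the URI fragment.
    static URI withLinkName(String hdfsPath) {
        return URI.create(hdfsPath + "#" + linkNameFor(hdfsPath));
    }

    public static void main(String[] args) {
        URI a = withLinkName("/user/foo/run1/part-r-00000.avro");
        URI b = withLinkName("/user/foo/run2/part-r-00000.avro");

        // Same file name, different link names.
        System.out.println(a.getFragment());
        System.out.println(b.getFragment());

        // In a Hadoop 2 driver these URIs would then be registered with
        // job.addCacheFile(a) and job.addCacheFile(b) (not called here so the
        // sketch runs without Hadoop on the classpath).
    }
}
```

Inside a task, the file would then be opened by its link name, for example new File(a.getFragment()), on the assumption that the link is created in the task's working directory.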
yarn hadoop2 distributed-cache