yarn - Why DistributedCache is caching all files directly into the root of the tmp storage dir in Hadoop 2 -

I'm migrating Hadoop 1.0.4 code to a Hadoop 2.3 platform, and I ran into a strange change in the behavior of DistributedCache:

In Hadoop 1, if I want to cache a file at /user/foo/file/bar/name.avro, DistributedCache re-creates the file in the local cache folder and creates the same subdirectories accordingly. The file is stored at /[root_of_tmp_cache_dir]/user/foo/file/bar/name.avro.

Now the same code in Hadoop 2 puts the file directly in the root folder without creating the subdirectories. The cached file is stored at /[root_of_tmp_cache_dir]/name.avro.

This causes name conflicts when caching multiple files that share a file name, such as part-r-00000.avro.

Of course, applying a symlink and renaming the cached file to a unique name is one way to solve the problem. More generally, though, creating unique names is not trivial in many cases, especially when the names must be guaranteed unique across different mappers/reducers. I'm wondering whether there are other ways to change this behavior, such as creating a folder within the tmp dir, or maybe tuning a MapReduce configuration parameter?

One thing I tried was to create the URI as "path#path", doing the linking ourselves, but that throws the following exception:

14-10-2014 16:05:41 PDT admm_train INFO - Caused by: java.lang.IllegalArgumentException: Resource name must be relative
14-10-2014 16:05:41 PDT admm_train INFO -   at org.apache.hadoop.mapreduce.v2.util.MRApps.parseDistributedCacheArtifacts(MRApps.java:489)
14-10-2014 16:05:41 PDT admm_train INFO -   at org.apache.hadoop.mapreduce.v2.util.MRApps.setupDistributedCache(MRApps.java:430)
14-10-2014 16:05:41 PDT admm_train INFO -   at org.apache.hadoop.mapred.YARNRunner.createApplicationSubmissionContext(YARNRunner.java:455)
14-10-2014 16:05:41 PDT admm_train INFO -   at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:283)
14-10-2014 16:05:41 PDT admm_train INFO -   at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)

This is due to a change in the internal behavior of DistributedCache between Hadoop 1 and Hadoop 2.

In Hadoop 1, the file to be cached is stored in the local tmp directory with its original path structure retained. For example, if you cache hdfs:///foo/bar/file1 in Hadoop 1, it is stored at /[some tmp path]/foo/bar/file1.

In Hadoop 2, DistributedCache strips the path structure. If you cache hdfs:///foo/bar/file1, it is stored directly at /[some tmp path]/file1.
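The practical effect of the flattening can be illustrated without Hadoop at all. The following is only a sketch of the naming rule as described above, not Hadoop's actual implementation; the class `CacheNameCollision` and its methods are hypothetical names made up for this illustration. It keeps just the last path component of each file and reports which local names would be claimed by more than one cached file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: models the flattened naming only.
public class CacheNameCollision {

    // Keep only the last path component, as the Hadoop 2 behavior is described.
    static String flatName(String hdfsPath) {
        return hdfsPath.substring(hdfsPath.lastIndexOf('/') + 1);
    }

    // Return every flattened name that more than one input path maps to.
    static List<String> collisions(List<String> hdfsPaths) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String path : hdfsPaths) {
            String name = flatName(path);
            // Set.add returns false when the name was already present.
            if (!seen.add(name) && !duplicates.contains(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/user/foo/run1/part-r-00000.avro",
                "/user/foo/run2/part-r-00000.avro");
        System.out.println(collisions(paths)); // prints [part-r-00000.avro]
    }
}
```

Under the Hadoop 1 layout both files would have kept distinct local paths, so the same check on full paths would report nothing.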

Also, when symlink names are used, Hadoop 2 renames the file to the link name, while Hadoop 1 does not. This results in compatibility conflicts when switching from Hadoop 1 to Hadoop 2.

An easy solution is to always use a symlink name and access the file via that name. Although the files are stored differently, you can still access them in the same fashion.
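One possible way to derive such link names is sketched below. It uses only `java.net.URI`; the class `CacheLinks` and the underscore-joining scheme are assumptions made for this illustration, not something Hadoop prescribes. The link name goes into the URI fragment (the part after `#`) and is kept relative, since the "Resource name must be relative" exception in the question suggests that a fragment starting with `/` is rejected.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop.
public class CacheLinks {

    // Turn "/foo/bar/file1" into the relative name "foo_bar_file1".
    // Assumption: joining components with '_' is unique enough for the job;
    // "/a/b_c" and "/a_b/c" would still collide under this scheme.
    static String linkName(String path) {
        String withoutLeadingSlash = path.replaceAll("^/+", "");
        return withoutLeadingSlash.replace('/', '_');
    }

    // Append the link name as the URI fragment: "<source>#<linkName>".
    // Assumes the source URI has no fragment of its own.
    static URI withLink(URI source) {
        return URI.create(source.toString() + "#" + linkName(source.getRawPath()));
    }

    public static void main(String[] args) {
        URI cached = withLink(URI.create("hdfs:///foo/bar/file1"));
        System.out.println(cached); // prints hdfs:///foo/bar/file1#foo_bar_file1
        // In a real job this URI would then be registered with the cache,
        // e.g. via Job.addCacheFile(URI) in the Hadoop 2 mapreduce API, and
        // the task would open the file by its link name, "foo_bar_file1".
    }
}
```

Because the name is computed from the full HDFS path, every mapper and reducer that knows the source path can recompute the same link name without any coordination.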

yarn hadoop2 distributed-cache
