Hadoop Streaming not working with python script

Hi,

I tried to write a simple mapper and reducer in python. Checked if they are working using files as input, working as expected. But while trying to run through hadoop stream it is giving error.
May be there is something I missed. Please help me out with this.

mapper.py

#!/usr/bin/python
import sys
import re

def read_input(file):
for line in file:
line = re.split(’[\t]+’,line.lower())
yield line[0] , line[1].strip()
def main():
file = sys.stdin
#file = open(“test.txt”,“r”)
for u , dna in read_input(file):
print dna +’\t’+ u

if name == “main”:
main()

reducer.py

#!/usr/bin/python
import sys
import re
from itertools import groupby
from operator import itemgetter

def read_mapper(file):
for line in file:
line = re.split(’[\t]+’,line.lower())
yield line[0] , line[1].strip()
def main():
file = sys.stdin
#file = open(“test.txt”,“r”)
dna_dict = {}
for dna , u in read_mapper(file):
if dna not in dna_dict:
dna_dict[dna] = set()
dna_dict[dna].add(u)
for k in dna_dict:
#print(k +’\t’+ list(dna_dict[k]))
print “%s%s%s” % (k, ‘\t’, list(dna_dict[k]))

if name == “main”:
main()

Execution command:

hadoop jar /usr/hdp/2.6.2.0-205/hadoop-mapreduce/hadoop-streaming.jar -input /data/mr/dna/dna.txt -output mapreduce-progra
mming/same_dna_test -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py

Error:

20/05/16 17:59:09 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py] [/usr/hdp/2.6.2.0-205/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.2.0-205.jar] /tmp/streamjob5818579733745706316.jar tmpDir=null
20/05/16 17:59:10 INFO client.RMProxy: Connecting to ResourceManager at cxln2.c.thelab-240901.internal/10.142.1.2:8050
20/05/16 17:59:10 INFO client.AHSProxy: Connecting to Application History server at cxln2.c.thelab-240901.internal/10.142.1.2:10200
20/05/16 17:59:10 INFO client.RMProxy: Connecting to ResourceManager at cxln2.c.thelab-240901.internal/10.142.1.2:8050
20/05/16 17:59:10 INFO client.AHSProxy: Connecting to Application History server at cxln2.c.thelab-240901.internal/10.142.1.2:10200
20/05/16 17:59:11 INFO mapred.FileInputFormat: Total input paths to process : 1
20/05/16 17:59:11 INFO mapreduce.JobSubmitter: number of splits:2
20/05/16 17:59:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1588872731573_3067
20/05/16 17:59:12 INFO impl.YarnClientImpl: Submitted application application_1588872731573_3067
20/05/16 17:59:12 INFO mapreduce.Job: The url to track the job: http://cxln2.c.thelab-240901.internal:8088/proxy/application_1588872731573_3067/
20/05/16 17:59:12 INFO mapreduce.Job: Running job: job_1588872731573_3067
20/05/16 17:59:20 INFO mapreduce.Job: Job job_1588872731573_3067 running in uber mode : false
20/05/16 17:59:20 INFO mapreduce.Job: map 0% reduce 0%
20/05/16 17:59:26 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

20/05/16 17:59:26 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

20/05/16 17:59:31 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000000_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

20/05/16 17:59:33 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000001_1, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

20/05/16 17:59:36 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000000_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)

Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

20/05/16 17:59:38 INFO mapreduce.Job: Task Id : attempt_1588872731573_3067_m_000001_2, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
20/05/16 17:59:44 INFO mapreduce.Job: map 100% reduce 100%
20/05/16 17:59:44 INFO mapreduce.Job: Job job_1588872731573_3067 failed with state FAILED due to: Task failed task_1588872731573_3067_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
20/05/16 17:59:44 INFO mapreduce.Job: Counters: 17
Job Counters
Failed map tasks=7
Killed map tasks=1
Killed reduce tasks=1
Launched map tasks=8
Other local map tasks=6
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=433596
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=36133
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=36133
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=55500288
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
20/05/16 17:59:44 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

Hi,

subproccess failed error comes when there is an error in the code. Could you please take inspirations from the MapReduce code from this GitHub repository

Please keep us updated.