How to run a Python program in CloudXLab?

Arman_Kanooni · January 2, 2021, 8:39am

Hi,

I am asking for guidance on running a Python program on CloudXLab, following this pattern:

Assume my Python code, which can be a custom map/reduce process, is called mypython.py.
Assume my input folder is the sub-directory /myinputfolder.
Assume my input file is /myinputfolder/myinputdata.txt.

Can someone provide the right command to run this Python program using the Hadoop streaming jar file?

The following is an example of a command that I used; the system came back with a file-not-found error:

python /user/drarmankanooni3849/RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /hdp/apps/2.3.4.0-3485/mapreduce/hadoop-streaming.jar /user/drarmankanooni3849/movielens/u.data

Thank you,
Arman
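
For reference, a minimal sketch of how an mrjob invocation like the one above is usually assembled. It assumes the script sits on the local filesystem of the machine where the command is run (mrjob starts the script locally and ships it to the cluster itself) and that the input is addressed with an hdfs:// URI. The helper name build_mrjob_command and the local script location are hypothetical; the jar and data paths are copied from the command above and have not been checked against CloudXLab, so the jar path in particular may need adjusting.

```python
import shlex
import sys


def build_mrjob_command(script, streaming_jar, input_uri, python=sys.executable):
    """Return the argument list for running an mrjob script with the Hadoop runner."""
    return [
        python, script,
        "-r", "hadoop",                           # select mrjob's Hadoop runner
        "--hadoop-streaming-jar", streaming_jar,  # one option, no space before "jar"
        input_uri,
    ]


cmd = build_mrjob_command(
    script="RatingsBreakdown.py",  # assumption: a local copy in the current directory
    streaming_jar="/hdp/apps/2.3.4.0-3485/mapreduce/hadoop-streaming.jar",
    input_uri="hdfs:///user/drarmankanooni3849/movielens/u.data",
)

# Print the command for inspection; subprocess.run(cmd, check=True) would launch it.
print(shlex.join(cmd))
```

If the file-not-found error comes from the script path, one possible cause is that /user/... exists only in HDFS, which the local python command cannot read.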

sandeepgiri · January 3, 2021, 7:11am

This should help you:

sandeepgiri · January 3, 2021, 7:12am

After that, go through this one:

Arman_Kanooni · January 3, 2021, 8:53am

Hi Sandeep,

I appreciate your video link on running a map reduce job using a generic library.

In my Python program RatingsBreakdown.py, shown below, I am using the MRJob library (from mrjob.job), which is different from the ones you mentioned. This library also lets me create programs other than a word count.

I am looking for a step-by-step process to do this. Does CloudXLab already have the MRJob library installed?

Please HELP
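
One way to check whether the library is present, as a sketch: ask the Python interpreter on the lab machine directly. The helper name is_installed is hypothetical, and whether CloudXLab ships mrjob is not stated in this thread.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("mrjob"):
    print("mrjob is available")
else:
    # Assumption: pip is available and user-level installs are permitted.
    print("mrjob not found; try: pip install --user mrjob")
```

The check has to be run with the same Python interpreter that will later run the job, since each interpreter has its own set of installed packages.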

=========================

from mrjob.job import MRJob
from mrjob.step import MRStep

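class RatingsBreakdown(MRJob):
    def steps(self):
        # Single map/reduce step: emit (rating, 1) per line, then sum per rating.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

    def mapper_get_ratings(self, _, line):
        # Each input line is: userID <tab> movieID <tab> rating <tab> timestamp
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer_count_ratings(self, key, values):
        # Count how many times each rating value occurs.
        yield key, sum(values)

if __name__ == '__main__':
    RatingsBreakdown.run()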
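
Before submitting to Hadoop, the mapper and reducer logic can be sanity-checked in plain Python without mrjob. The sketch below imitates the map, group, and reduce phases by hand; the three sample rows are illustrative values in the tab-separated layout the mapper expects, not results from a real run, and the grouping loop is a stand-in for what Hadoop does between the two phases.

```python
from collections import defaultdict

# Illustrative rows: userID <tab> movieID <tab> rating <tab> timestamp
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]


def mapper_get_ratings(line):
    """Emit (rating, 1) for one input line, like the mrjob mapper."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reducer_count_ratings(key, values):
    """Sum the counts collected for one rating, like the mrjob reducer."""
    yield key, sum(values)


# Shuffle phase stand-in: group mapper output by key.
grouped = defaultdict(list)
for line in sample_lines:
    for key, value in mapper_get_ratings(line):
        grouped[key].append(value)

# Reduce phase: one reducer call per distinct rating, in sorted key order.
counts = dict(
    pair
    for key in sorted(grouped)
    for pair in reducer_count_ratings(key, grouped[key])
)
print(counts)  # {'1': 1, '3': 2}
```

Ratings stay strings here because the mapper yields the raw field from line.split, which matches the behaviour of the job above.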