Hi, to start with I trained a logistic regression model (scikit-learn) on my dataset and used it in the script below, but running it gives me the error
“No module named ‘sklearn’”. I have installed the package, but it still doesn’t work. Can someone please tell me what can be done? Here is the script, which I found in this article (Deploy a Python model (more efficiently) over Spark | by Schaun Wheeler | Towards Data Science):
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql.window import Window as w
from pyspark.ml.feature import VectorAssembler
from sklearn.linear_model import LogisticRegression

# fit the scikit-learn model on the driver (X and Y are my local training data)
model = LogisticRegression(C=1e5)
model.fit(X, Y)
# assemble all non-id, non-label columns into a single 'features' vector column
vectorAssembler = VectorAssembler(
    inputCols=[col for col in df.columns if '_id' not in col and 'label' not in col],
    outputCol="features"
)
features_vectorized = vectorAssembler.transform(df)

# ship the fitted model to every executor once
model_broadcast = sc.broadcast(model)
# UDF that scores one batch of rows on an executor using the broadcast model
def predict_new(feature_map):
    # feature_map is a list of {id: feature_vector} dicts collected per group
    ids, features = zip(*[
        (k, v) for d in feature_map for k, v in d.items()
    ])
    # index of the positive class (1.0) in the model's class list
    ind = model_broadcast.value.classes_.tolist().index(1.0)
    probs = [
        float(v) for v in
        model_broadcast.value.predict_proba(features)[:, ind]
    ]
    return dict(zip(ids, probs))

predict_new_udf = f.udf(
    predict_new,
    t.MapType(t.LongType(), t.FloatType())
)
# set the number of prediction groups to create
nparts = 5000

# put everything together: map each row to {id: features}, bucket the rows
# into nparts groups, collect each group into a list, score it with the UDF,
# and explode the returned {id: probability} map back into rows
outcome_sdf = (
    features_vectorized.select(
        f.create_map(f.col('id'), f.col('features')).alias('feature_map'),
        (f.row_number().over(w.partitionBy(f.lit(1)).orderBy(f.lit(1))) % nparts).alias('grouper')
    )
    .groupby(f.col('grouper'))
    .agg(f.collect_list(f.col('feature_map')).alias('feature_map'))
    .select(predict_new_udf(f.col('feature_map')).alias('results'))
    .select(f.explode(f.col('results')).alias('unique_id', 'probability_estimate'))
)
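From what I understand, Spark evaluates this pipeline lazily, so I believe the sklearn error only actually surfaces on the workers once I call an action on outcome_sdf, for example:

# trigger execution of the lazy pipeline (an action)
outcome_sdf.show(5)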
I don’t know much about clusters. I read that I need to install sklearn on all the cluster nodes, so I ran ‘pip install -U scikit-learn’ and did the same for spark-sklearn. Can you help me run this?
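In case it helps, here is a small check I was thinking of running to see whether the executors can actually import sklearn, and which Python interpreter they use. This is only a sketch and assumes `sc` is the active SparkContext:

# Sketch: run `import sklearn` inside each executor's Python process and
# report hostname, interpreter path, and sklearn version (or the error).
def check_sklearn(_rows):
    import socket, sys
    try:
        import sklearn
        yield (socket.gethostname(), sys.executable, sklearn.__version__)
    except ImportError as err:
        yield (socket.gethostname(), sys.executable, str(err))

n = sc.defaultParallelism
for host, exe, info in sc.parallelize(range(n), n).mapPartitions(check_sklearn).collect():
    print(host, exe, info)

If the executors print a different sys.executable than the driver, I suspect my ‘pip install -U scikit-learn’ went into the wrong Python environment.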