Broadcast variable in Spark

I have a cluster with 8 nodes, each with 116 GB of RAM and 16 cores. I am reading table X, which is 250 GB in size, and joining it with table Y 10 times to derive 10 columns. Table Y is 100 MB in size.

Now my question is: when I broadcast table Y and also explicitly cache it, the script takes around 20 hours, but when I only broadcast table Y without caching it, the complete process takes only 1 hour.

I am unable to understand what actually causes the extra time when we explicitly cache a 100 MB table after broadcasting it.

Please help.

Let me try to explain when to use caching and when to use broadcasting. In your case, broadcasting is the solution and caching is unnecessary overhead.

Broadcasting

Say there is some data (such as lookup tables, libraries, or ML models) that you want to push to all machines so that processing can become really fast. This is also what map-side joins need.

Caching
If you have an RDD that takes a long time to compute and you want to reuse it, you would like to compute it only once. In such cases, you cache it. Caching essentially keeps the result of the first execution so that later actions do not recompute it.

It would be great if you could provide code snippets. Are you using DataFrame joins or RDD-based joins? Also, could you help me understand which cluster manager you are using?


@sgiri thanks a lot for your reply.

I am currently using an Azure Databricks cluster, and I am using DataFrames. There is no fancy join involved. As I mentioned, I take table X (250 GB) and join it with table Y 10 times. Unfortunately, I cannot share the code, as it belongs to my workplace.

FYI, the surprising thing to me is that if I both broadcast and cache this 100 MB table, processing takes 20 hours, but if it is only broadcast, it takes only one hour.

I have gone through the Apache Spark documentation and found nothing relevant that could help me.