The parameter "spark.sql.autoBroadcastJoinThreshold", which caps the size of a table that Spark will broadcast automatically, is set to 10 MB by default. The REPARTITION_BY_RANGE hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. This hint is equivalent to the repartitionByRange Dataset API.

If both sides of the join have broadcast hints, the one with the smaller size (based on stats) will be broadcast. Broadcasting has the advantage that the other side of the join doesn't require any shuffle. This is especially beneficial when the other side is very large, because skipping the shuffle brings a notable speed-up compared to algorithms that have to shuffle. Example: I have used broadcast here, but either of the mapjoin/broadcastjoin hints will result in the same explain plan.

The join method joins with another DataFrame, using the given join expression. BNLJ (broadcast nested loop join) will be chosen if one side can be broadcast, similarly to the case of BHJ (broadcast hash join). Imagine a situation like this: in this query we join two DataFrames, where the second, dfB, is the result of some expensive transformations, in which a user-defined function (UDF) is called and then the data is aggregated. The reason why SMJ (sort-merge join) is preferred by default is that it is more robust with respect to OoM errors.

You can hint to Spark SQL that a given DataFrame should be broadcast for a join by calling the broadcast method on the DataFrame before joining it. Broadcast join is an optimization technique in the Spark SQL engine that is used to join two DataFrames. We can pass a sequence of columns with the shortcut join syntax to automatically drop the duplicate column.

Hints let you make decisions that are usually made by the optimizer while generating an execution plan. If the data is not local, various shuffle operations are required, which can have a negative impact on performance. Using the hints in Spark SQL gives us the power to affect the physical plan.
If you are using Spark 2.2+, you can use any of the MAPJOIN, BROADCAST, or BROADCASTJOIN hints.
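To make the "no shuffle on the other side" point concrete, here is a minimal sketch in plain Python of the idea behind a broadcast hash join. It is a toy model, not Spark's implementation: the function name `broadcast_hash_join`, the list-of-lists representation of partitions, and the sample `orders` and `customers` data are all assumptions made up for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Toy inner join: every large-side partition probes one shared hash table."""
    # Build phase: hash the small side once. In Spark, this table is what
    # would be shipped (broadcast) to every executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: each large-side partition is processed where it already
    # lives, so its rows are never moved between partitions (no shuffle).
    joined = []
    for partition in large_partitions:
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


# Hypothetical data: the large side is already split into two partitions.
orders = [
    [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 15}],
    [{"cust_id": 1, "amount": 5}, {"cust_id": 3, "amount": 99}],
]
customers = [{"cust_id": 1, "name": "Ana"}, {"cust_id": 2, "name": "Bo"}]

result = broadcast_hash_join(orders, customers, "cust_id")
```

The output keeps the partition layout of the large side, which is the point of the technique: only the small side is copied around. The small side has to fit in memory on every worker, which is why a size threshold exists.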