Spark – Adding a Literal or Constant Column to a DataFrame:
The Spark SQL functions lit() and typedLit() are used to add a new column to a DataFrame by assigning a literal or constant value. Both functions return a Column.
Both of these are available in Spark by importing
org.apache.spark.sql.functions
lit() – Syntax:
def lit(literal: scala.Any): org.apache.spark.sql.Column
Creates a Column of literal value. The passed-in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is also converted into a Column. Otherwise, a new Column is created to represent the literal value.
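To make those three cases concrete, here is a minimal sketch (assuming org.apache.spark.sql.functions._ is imported; Symbol(...) is used because the bare 'symbol syntax is deprecated in Scala 2.13):
lit(col("EmpId"))    // already a Column: returned directly
lit(Symbol("EmpId")) // a Scala Symbol: converted into a column reference
lit(5)               // anything else: wrapped as a new literal Column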
Let’s take a look at some Spark code:
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
val data = Seq(("111",50000),("222",60000),("333",40000))
val df = data.toDF("EmpId","Salary")
Example 1: Simple usage of the lit() function:
Let’s see a Scala example of how to create a new column with a constant value using the lit() Spark SQL function. In the snippet below, we create a new column by adding the literal "1" (a string) to the Spark DataFrame.
val df2 = df.select(col("EmpId"),col("Salary"),lit("1").as("lit_column1"))
df2.show()
+-----+------+-----------+
|EmpId|Salary|lit_column1|
+-----+------+-----------+
|  111| 50000|          1|
|  222| 60000|          1|
|  333| 40000|          1|
+-----+------+-----------+
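Note that lit("1") produces a string column because the Scala value passed in is a String; lit() infers the column's data type from the value it is given. A quick sketch (exact output may vary slightly by Spark version):
df.select(lit("1"), lit(1), lit(2.0)).printSchema()
// root
//  |-- 1: string (nullable = false)
//  |-- 1: integer (nullable = false)
//  |-- 2.0: double (nullable = false)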
Adding the same constant literal to every record is rarely useful on its own, so let’s look at another example.
Example 2: lit() with withColumn():
The following Scala example shows how to use the lit() Spark SQL function together with withColumn() to derive a new column based on conditions.
import org.apache.spark.sql.types.IntegerType

// Add another constant column
val df3 = df2.withColumn("lit_column2", lit("2"))
df3.show()

// Derive a column conditionally with when()/otherwise()
val df4 = df3.withColumn(
  "lit_column3",
  when(col("Salary") >= 40000 && col("Salary") <= 50000, lit("100").cast(IntegerType))
    .otherwise(lit("200").cast(IntegerType)))
df4.show()
DF3:
+-----+------+-----------+-----------+
|EmpId|Salary|lit_column1|lit_column2|
+-----+------+-----------+-----------+
|  111| 50000|          1|          2|
|  222| 60000|          1|          2|
|  333| 40000|          1|          2|
+-----+------+-----------+-----------+

DF4:
+-----+------+-----------+-----------+-----------+
|EmpId|Salary|lit_column1|lit_column2|lit_column3|
+-----+------+-----------+-----------+-----------+
|  111| 50000|          1|          2|        100|
|  222| 60000|          1|          2|        200|
|  333| 40000|          1|          2|        100|
+-----+------+-----------+-----------+-----------+
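As an aside, casting the string literal "100" to IntegerType is a roundabout way of getting an integer column; passing an Int to lit() directly should produce the same lit_column3. A sketch of that alternative, using Column.between for the range check:
val df4b = df3.withColumn(
  "lit_column3",
  when(col("Salary").between(40000, 50000), lit(100))
    .otherwise(lit(200)))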
typedLit() – Syntax:
def typedLit[T : TypeTag](literal: T): org.apache.spark.sql.Column
The difference between lit() and typedLit() is that typedLit() can handle parameterized Scala types, e.g. List, Seq, and Map.
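For example, passing a Scala collection to lit() fails at analysis time on many Spark versions, while typedLit() works because it captures the full type through a TypeTag (a sketch; the exact error message depends on your Spark version):
// val bad = df.withColumn("seq", lit(Seq(1, 2, 3)))  // may throw "Unsupported literal type ..."
val ok = df.withColumn("seq", typedLit(Seq(1, 2, 3))) // works: full element type is known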
Example 3: typedLit() with Seq, Map, and struct:
The following example shows how to create new columns with collection values using the typedLit() function. In the snippet below, we add the collection literals Seq(1, 2, 3) and Map("a" -> 1, "b" -> 2) and the structure ("a", 2, 1.0) as columns of the Spark DataFrame.
val df5 = df4.withColumn("typedLit_seq", typedLit(Seq(1, 2, 3)))
  .withColumn("typedLit_map", typedLit(Map("a" -> 1, "b" -> 2)))
  .withColumn("typedLit_struct", typedLit(("a", 2, 1.0)))
df5.printSchema()
df5.show()
df5.printSchema():
root
 |-- EmpId: string (nullable = true)
 |-- Salary: integer (nullable = false)
 |-- lit_column1: string (nullable = false)
 |-- lit_column2: string (nullable = false)
 |-- lit_column3: integer (nullable = true)
 |-- typedLit_seq: array (nullable = false)
 |    |-- element: integer (containsNull = false)
 |-- typedLit_map: map (nullable = false)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- typedLit_struct: struct (nullable = false)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = false)
 |    |-- _3: double (nullable = false)

df5.show():
+-----+------+-----------+-----------+-----------+------------+----------------+---------------+
|EmpId|Salary|lit_column1|lit_column2|lit_column3|typedLit_seq|    typedLit_map|typedLit_struct|
+-----+------+-----------+-----------+-----------+------------+----------------+---------------+
|  111| 50000|          1|          2|        100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  222| 60000|          1|          2|        200|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
|  333| 40000|          1|          2|        100|   [1, 2, 3]|[a -> 1, b -> 2]|    [a, 2, 1.0]|
+-----+------+-----------+-----------+-----------+------------+----------------+---------------+
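Once created, these behave like ordinary array, map, and struct columns and can be queried with the standard collection functions; a small sketch (assuming Spark 2.4+, where element_at() is available):
df5.select(
  element_at(col("typedLit_seq"), 1).as("first_elem"),   // 1-based index -> 1
  element_at(col("typedLit_map"), "a").as("a_value"),    // map lookup    -> 1
  col("typedLit_struct").getField("_1").as("struct_1")   // struct field  -> "a"
).show()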
Complete Example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

object litTypeLit extends App {

  val spark = SparkSession.builder()
    .appName("sparkbyexamples.com")
    .master("local")
    .getOrCreate()

  import spark.sqlContext.implicits._
  import org.apache.spark.sql.functions._

  val data = Seq(("111", 50000), ("222", 60000), ("333", 40000))
  val df = data.toDF("EmpId", "Salary")

  val df2 = df.select(col("EmpId"), col("Salary"), lit("1").as("lit_column1"))
  df2.show()

  val df3 = df2.withColumn("lit_column2", lit("2"))
  df3.show()

  val df4 = df3.withColumn(
    "lit_column3",
    when(col("Salary") >= 40000 && col("Salary") <= 50000, lit("100").cast(IntegerType))
      .otherwise(lit("200").cast(IntegerType)))
  df4.show()

  val df5 = df4.withColumn("typedLit_seq", typedLit(Seq(1, 2, 3)))
    .withColumn("typedLit_map", typedLit(Map("a" -> 1, "b" -> 2)))
    .withColumn("typedLit_struct", typedLit(("a", 2, 1.0)))
  df5.printSchema()
  df5.show()
}
Conclusion:
You have learned multiple ways to add a constant literal value to a DataFrame using the Spark SQL lit() function, as well as the difference between lit() and typedLit().
When possible, prefer the predefined Spark SQL functions: they give you a little more compile-time safety and perform better than user-defined functions. If your application is performance-critical, avoid custom UDFs, since Spark cannot optimize them and their performance is not guaranteed.
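To illustrate the point, the conditional column from Example 2 could also be written as a UDF, but the UDF body is opaque to the Catalyst optimizer, so the built-in when()/otherwise() version used above is preferable. A sketch of the UDF alternative, for comparison only:
import org.apache.spark.sql.functions.udf

// UDF version of Example 2: a black box to the optimizer
val salaryBand = udf((salary: Int) =>
  if (salary >= 40000 && salary <= 50000) 100 else 200)

val df4Udf = df3.withColumn("lit_column3", salaryBand(col("Salary")))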

