Tuesday, February 22, 2022

3.1. Spark (as a processing engine) - 1> Spark-shell (With RDD)


val inputFileRDD = sc.textFile("wordcountSimple.txt")
inputFileRDD.collect

val tokenizedRDD = inputFileRDD.flatMap(x => x.split(" "))   //x is each element or row
//OR               inputFileRDD.flatMap(_.split(" "))


tokenizedRDD.collect

val tokenizedMap = tokenizedRDD.map(x => (x,1) )   //emit a (word, 1) tuple for each word, e.g. (abc,1), (def,1)
//OR               tokenizedRDD.map((_, 1))


tokenizedMap.collect

val reducedRDD = tokenizedMap.reduceByKey( (x,y) => (x+y) )   //sums all the values belonging to the same key
//OR             tokenizedMap.reduceByKey(_ + _)
//press Tab after tokenizedMap.reduc to autocomplete the exact method name



reducedRDD.collect

reducedRDD.saveAsTextFile("sparkWordCount")  //sparkWordCount is a folder; the output is stored under it as part-NNNNN files
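To see what each step does without a cluster, the same pipeline can be sketched on a plain Scala collection. This is only an illustration of the transformations, not Spark itself; the sample lines are hypothetical stand-ins for wordcountSimple.txt, and `reduceByKey` is mimicked here with `groupBy` plus a sum.

```scala
// Hypothetical sample lines standing in for wordcountSimple.txt
val lines = List("spark makes counting easy", "counting words with spark")

val tokens = lines.flatMap(_.split(" "))      // like inputFileRDD.flatMap: one flat list of words
val pairs  = tokens.map(w => (w, 1))          // like tokenizedRDD.map: (word, 1) tuples
val counts = pairs.groupBy(_._1)              // reduceByKey equivalent: group by word, sum the 1s
                  .map { case (w, ps) => (w, ps.map(_._2).sum) }

counts.toList.sortBy(_._1).foreach(println)
```

The key difference from Spark is that these collection operations run eagerly on one machine, whereas RDD transformations are lazy and distributed until an action like `collect` runs.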



From hands-on
----------------

hdfs dfs -put u.data


spark-shell --master local[*]    //first, this drops you at the scala> prompt
val lines = sc.textFile("u.data")
val movies = lines.map(x => (x.split("\t")(1).toInt, 1))    //take the 2nd column (movie ID) and return (movieId, 1) for each row
movies.take(5)
val movieCounts = movies.reduceByKey( (x, y) => x + y )
movieCounts.collect
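The same movie-count logic can be traced on a local collection; the rows below are hypothetical examples assuming u.data's tab-separated layout (userId, movieId, rating, timestamp), and `reduceByKey` is again mimicked with `groupBy`.

```scala
// Hypothetical rows mimicking u.data's tab-separated format
val rows = List(
  "196\t242\t3\t881250949",
  "186\t302\t3\t891717742",
  "22\t242\t1\t878887116"
)

val movies      = rows.map(x => (x.split("\t")(1).toInt, 1))   // (movieId, 1) per rating row
val movieCounts = movies.groupBy(_._1)                         // reduceByKey equivalent
                        .map { case (id, ps) => (id, ps.size) }

// Sort descending by count to see the most-rated movie first
val top = movieCounts.toList.sortBy(-_._2)
println(top)
```

The same descending sort works on the real RDD with `movieCounts.sortBy(-_._2)` before a `take` or `collect`.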
