3.1. Spark (as a processing engine) - 1> Spark-shell (With RDD)
val inputFileRDD = sc.textFile("wordcountSimple.txt")
inputFileRDD.collect
val tokenizedRDD = inputFileRDD.flatMap(x => x.split(" ")) //x is each line of the file; split it into words
//OR inputFileRDD.flatMap(_.split(" "))
tokenizedRDD.collect
val tokenizedMap = tokenizedRDD.map(x => (x,1) ) //pair each word with 1 to get (abc,1), (def,1), etc.
//OR tokenizedRDD.map((_, 1))
tokenizedMap.collect
val reducedRDD = tokenizedMap.reduceByKey( (x,y) => (x+y) ) //sums the values for each key, giving the count per word
//OR tokenizedMap.reduceByKey(_ + _)
//press Tab after a partial name (e.g. tokenizedMap.reduc) to see the exact method names
reducedRDD.collect
reducedRDD.saveAsTextFile("sparkWordCount") //sparkWordCount is a folder; the output is stored inside it as part-* files
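Without a cluster, the same pipeline can be traced on a plain Scala collection: flatMap and map behave as in the RDD API, and groupBy plus a per-key sum stands in for reduceByKey. A minimal sketch (the sample lines and the object name are made up, standing in for wordcountSimple.txt):

```scala
object WordCountSketch {
  // groupBy + sum stands in for Spark's reduceByKey on a plain collection
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))                            // tokenize, like flatMap on the RDD
      .map((_, 1))                                         // (word, 1) pairs
      .groupBy(_._1)                                       // gather pairs sharing a key
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the 1s per word

  def main(args: Array[String]): Unit = {
    // sample lines are hypothetical
    val counts = wordCount(Seq("to be or not to be", "not to be"))
    counts.toSeq.sortBy(-_._2).foreach(println)
  }
}
```

The key idea is the same as in Spark: every transformation up to the sum is lazy-friendly and per-element, and only the grouping step needs to see all pairs with the same key together.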
----------------
hdfs dfs -put u.data //copy the ratings file u.data into HDFS
From the hands-on session:
--------------------
spark-shell --master local[*] //first start the shell; this drops you at the scala> prompt
val lines = sc.textFile("u.data")
val movies = lines.map(x => (x.split("\t")(1).toInt, 1)) //take the 2nd column (movie ID) and emit (movieId, 1) for each rating
movies.take(5)
val movieCounts = movies.reduceByKey( (x, y) => x + y )
movieCounts.collect
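A natural next step is sorting the counts to find the most-rated movies; in the shell, something like movieCounts.sortBy(_._2, ascending = false).take(5) would do it with Spark's RDD.sortBy. The parsing, counting, and sorting can also be traced on a plain Scala collection. In this sketch the object name, helper names, and sample rows are made up; the rows just follow the u.data column order (userId, movieId, rating, timestamp, tab-separated):

```scala
object TopMoviesSketch {
  // groupBy + size stands in for reduceByKey over the (movieId, 1) pairs
  def countByMovie(lines: Seq[String]): Map[Int, Int] =
    lines.map(x => (x.split("\t")(1).toInt, 1))     // (movieId, 1), as in the shell
      .groupBy(_._1)                                // gather pairs per movie ID
      .map { case (id, pairs) => (id, pairs.size) } // count ratings per movie

  // sort descending by count to surface the most-rated movies
  def topN(counts: Map[Int, Int], n: Int): Seq[(Int, Int)] =
    counts.toSeq.sortBy(-_._2).take(n)

  def main(args: Array[String]): Unit = {
    // hypothetical sample rows in the u.data column order
    val sample = Seq("196\t242\t3\t881250949",
                     "186\t302\t3\t891717742",
                     "22\t242\t1\t878887116")
    topN(countByMovie(sample), 2).foreach(println)
  }
}
```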