Thursday, June 11, 2020

Reading a CSV file into a Spark DataFrame from spark-shell (Spark 2.3 on the Hortonworks HDP sandbox), with the same read repeated on an older Spark 1.6 cluster further down:
[e023586@sandbox-hdp ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://sandbox-hdp.hortonworks.com:4040
Spark context available as 'sc' (master = yarn, app id = application_1591918179668_0006).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1.3.0.1.0-187
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = spark.read.options(Map("header" -> "true",
"inferSchema" -> "true",
"nullValue" -> "NA",
"timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
"mode" -> "failfast")
).csv("/user/e023586/notebook/spark/survey.csv")

// Exiting paste mode, now interpreting.

df: org.apache.spark.sql.DataFrame = [Timestamp: timestamp, Age: bigint ... 25 more fields]
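
The "mode" option decides what happens to rows that do not fit the expected layout: failfast (used above) aborts the read on the first malformed row, permissive (the default) keeps the row and nulls out the fields it cannot parse, and dropmalformed silently skips the row. A minimal sketch of the lenient variant (dfLenient is just an illustrative name):

    // Same read, but malformed rows are dropped instead of failing the job.
    val dfLenient = spark.read.options(Map(
        "header" -> "true",
        "inferSchema" -> "true",
        "nullValue" -> "NA",
        "timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss",
        "mode" -> "dropmalformed"))
      .csv("/user/e023586/notebook/spark/survey.csv")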

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/user/e023586/notebook/spark/survey.csv")
.load()

// Exiting paste mode, now interpreting.

df: org.apache.spark.sql.DataFrame = [Timestamp: timestamp, Age: bigint ... 25 more fields]
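
This chained-option form is equivalent to the options-map version above; likewise, setting option("path", ...) and calling an empty load() is interchangeable with handing the path to load() directly. A sketch of the same read in that style (df2 is just an illustrative name):

    // Equivalent read: the path goes straight into load().
    val df2 = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "NA")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .option("mode", "failfast")
      .load("/user/e023586/notebook/spark/survey.csv")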

scala> df.rdd.getNumPartitions
res0: Int = 1

scala> val df3 = df.repartition(3).toDF
df3: org.apache.spark.sql.DataFrame = [Timestamp: timestamp, Age: bigint ... 25 more fields]

scala> df3.rdd.getNumPartitions
res1: Int = 3
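
Going the other way, coalesce reduces the partition count without the full shuffle that repartition performs, so it is the cheaper choice when only shrinking. A sketch (df1 is an illustrative name):

    // coalesce merges existing partitions in place, no shuffle needed.
    val df1 = df3.coalesce(1)
    df1.rdd.getNumPartitions   // expected: 1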

scala> df.select("Timestamp","Age","remote_work","leave").filter("Age >30").show
20/06/12 02:18:04 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+-------------------+---+-----------+------------------+
|          Timestamp|Age|remote_work|             leave|
+-------------------+---+-----------+------------------+
|2014-08-27 11:29:31| 37|         No|     Somewhat easy|
|2014-08-27 11:29:37| 44|         No|        Don't know|
|2014-08-27 11:29:44| 32|         No|Somewhat difficult|
|2014-08-27 11:29:46| 31|         No|Somewhat difficult|
|2014-08-27 11:30:22| 31|        Yes|        Don't know|
|2014-08-27 11:31:22| 33|         No|        Don't know|
|2014-08-27 11:31:50| 35|        Yes|Somewhat difficult|
|2014-08-27 11:32:05| 39|        Yes|        Don't know|
|2014-08-27 11:32:39| 42|         No|    Very difficult|
|2014-08-27 11:32:44| 31|        Yes|        Don't know|
|2014-08-27 11:33:23| 42|         No|Somewhat difficult|
|2014-08-27 11:33:26| 36|         No|        Don't know|
|2014-08-27 11:34:37| 32|         No|        Don't know|
|2014-08-27 11:34:53| 46|        Yes|         Very easy|
|2014-08-27 11:35:08| 36|        Yes|     Somewhat easy|
|2014-08-27 11:35:24| 31|        Yes|Somewhat difficult|
|2014-08-27 11:35:48| 46|        Yes|        Don't know|
|2014-08-27 11:36:24| 41|         No|        Don't know|
|2014-08-27 11:36:48| 33|         No|        Don't know|
|2014-08-27 11:37:08| 35|         No|         Very easy|
+-------------------+---+-----------+------------------+
only showing top 20 rows
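
The same query can be written with Column expressions instead of a SQL string, which lets the compiler catch typos in operators. spark-shell already provides the needed implicits, but the import is shown for completeness (a sketch):

    import spark.implicits._

    // Column-expression form of the filter above.
    df.select($"Timestamp", $"Age", $"remote_work", $"leave")
      .filter($"Age" > 30)
      .show()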


scala> df.printSchema
root
 |-- Timestamp: timestamp (nullable = true)
 |-- Age: long (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- state: string (nullable = true)
 |-- self_employed: string (nullable = true)
 |-- family_history: string (nullable = true)
 |-- treatment: string (nullable = true)
 |-- work_interfere: string (nullable = true)
 |-- no_employees: string (nullable = true)
 |-- remote_work: string (nullable = true)
 |-- tech_company: string (nullable = true)
 |-- benefits: string (nullable = true)
 |-- care_options: string (nullable = true)
 |-- wellness_program: string (nullable = true)
 |-- seek_help: string (nullable = true)
 |-- anonymity: string (nullable = true)
 |-- leave: string (nullable = true)
 |-- mental_health_consequence: string (nullable = true)
 |-- phys_health_consequence: string (nullable = true)
 |-- coworkers: string (nullable = true)
 |-- supervisor: string (nullable = true)
 |-- mental_health_interview: string (nullable = true)
 |-- phys_health_interview: string (nullable = true)
 |-- mental_vs_physical: string (nullable = true)
 |-- obs_consequence: string (nullable = true)
 |-- comments: string (nullable = true)
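
Since the schema is now known, it can be handed back to the reader as an explicit StructType, which skips the extra pass over the file that inferSchema costs. A sketch spelling out only the first three of the 27 fields; a real schema would list them all:

    import org.apache.spark.sql.types._

    // Explicit schema: fields are nullable by default.
    val surveySchema = StructType(Seq(
      StructField("Timestamp", TimestampType),
      StructField("Age", LongType),
      StructField("Gender", StringType)
      // ... the remaining 24 fields follow the same pattern
    ))

    val dfTyped = spark.read
      .schema(surveySchema)   // no inferSchema pass needed
      .option("header", "true")
      .option("nullValue", "NA")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .csv("/user/e023586/notebook/spark/survey.csv")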


scala>

On an older cluster where Spark 1.6 is still the default (HDP 2.6), the same file can be read with the external spark-csv package, pulled in at launch time with --packages:
[nareshjella@gw02 ~]$ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
Spark1 will be picked by default
Ivy Default Cache set to: /home/nareshjella/.ivy2/cache
The jars for the packages stored in: /home/nareshjella/.ivy2/jars
:: loading settings :: url = jar:file:/usr/hdp/2.6.5.0-292/spark/lib/spark-assembly-1.6.3.2.6.5.0-292-hadoop2.7.3.2.6.5.0-292.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.10;1.5.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
downloading https://repo1.maven.org/maven2/com/databricks/spark-csv_2.10/1.5.0/spark-csv_2.10-1.5.0.jar ...
        [SUCCESSFUL ] com.databricks#spark-csv_2.10;1.5.0!spark-csv_2.10.jar (36ms)
downloading https://repo1.maven.org/maven2/org/apache/commons/commons-csv/1.1/commons-csv-1.1.jar ...
        [SUCCESSFUL ] org.apache.commons#commons-csv;1.1!commons-csv.jar (14ms)
downloading https://repo1.maven.org/maven2/com/univocity/univocity-parsers/1.5.1/univocity-parsers-1.5.1.jar ...
        [SUCCESSFUL ] com.univocity#univocity-parsers;1.5.1!univocity-parsers.jar (28ms)
:: resolution report :: resolve 1500ms :: artifacts dl 82ms
        :: modules in use:
        com.databricks#spark-csv_2.10;1.5.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        3 artifacts copied, 0 already retrieved (342kB/6ms)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/home/nareshjella/notebook/spark/survey.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nn01.itversity.com:8020/home/nareshjella/notebook/spark/survey.csv
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
        ...
        at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:242)
        ...
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
        ... (remaining REPL and spark-submit frames trimmed)

Note the resolved URI in the message: the local-style /home/... path was interpreted against the cluster's default file system, hdfs://nn01.itversity.com:8020, where it does not exist. Pointing at the copy uploaded to the HDFS home directory fixes the read:


scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/user/nareshjella/notebook/spark/survey.csv")
df: org.apache.spark.sql.DataFrame = [Timestamp: timestamp, Age: bigint, Gender: string, Country: string, state: string, self_employed: string, family_history: string, treatment: string, work_interfere: string, no_employees: string, remote_work: string, tech_company: string, benefits: string, care_options: string, wellness_program: string, seek_help: string, anonymity: string, leave: string, mental_health_consequence: string, phys_health_consequence: string, coworkers: string, supervisor: string, mental_health_interview: string, phys_health_interview: string, mental_vs_physical: string, obs_consequence: string, comments: string]
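Alternatively, if the file is reachable on local disk from wherever the tasks actually run (safe with a local master; on YARN it would have to exist on every node), the path can carry an explicit file:// scheme. A sketch (dfLocal is an illustrative name):

    // Bypass the default HDFS resolution with an explicit scheme.
    val dfLocal = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file:///home/nareshjella/notebook/spark/survey.csv")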

scala> df.select("Timestamp","Age","remote_work","leave").filter("Age >30").show
+--------------------+---+-----------+------------------+
|           Timestamp|Age|remote_work|             leave|
+--------------------+---+-----------+------------------+
|2014-08-27 11:29:...| 37|         No|     Somewhat easy|
|2014-08-27 11:29:...| 44|         No|        Don't know|
|2014-08-27 11:29:...| 32|         No|Somewhat difficult|
|2014-08-27 11:29:...| 31|         No|Somewhat difficult|
|2014-08-27 11:30:...| 31|        Yes|        Don't know|
|2014-08-27 11:31:...| 33|         No|        Don't know|
|2014-08-27 11:31:...| 35|        Yes|Somewhat difficult|
|2014-08-27 11:32:...| 39|        Yes|        Don't know|
|2014-08-27 11:32:...| 42|         No|    Very difficult|
|2014-08-27 11:32:...| 31|        Yes|        Don't know|
|2014-08-27 11:33:...| 42|         No|Somewhat difficult|
|2014-08-27 11:33:...| 36|         No|        Don't know|
|2014-08-27 11:34:...| 32|         No|        Don't know|
|2014-08-27 11:34:...| 46|        Yes|         Very easy|
|2014-08-27 11:35:...| 36|        Yes|     Somewhat easy|
|2014-08-27 11:35:...| 31|        Yes|Somewhat difficult|
|2014-08-27 11:35:...| 46|        Yes|        Don't know|
|2014-08-27 11:36:...| 41|         No|        Don't know|
|2014-08-27 11:36:...| 33|         No|        Don't know|
|2014-08-27 11:37:...| 35|         No|         Very easy|
+--------------------+---+-----------+------------------+
only showing top 20 rows
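
The timestamps end in "..." because show truncates cell values longer than 20 characters by default. Passing truncate = false prints them in full:

    // Same query, untruncated output.
    df.select("Timestamp", "Age", "remote_work", "leave")
      .filter("Age > 30")
      .show(20, false)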


scala> 
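
Finally, the filtered result can be written back out as CSV through the same package. A sketch; the output directory name survey_over30 is hypothetical:

    // spark-csv also supports writes; the target is an HDFS directory.
    df.select("Timestamp", "Age", "remote_work", "leave")
      .filter("Age > 30")
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/user/nareshjella/notebook/spark/survey_over30")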
