How to exclude csv columns in spark?

Question

My RDD contains entire CSV records, and I can't find a way to exclude particular columns from it.

I tried drop().

For example, the CSV file has three columns: no, name, and age.

Now I need to exclude two of them, no and name:

val excluColumns = "no,name"
rdd.drop(excluColumns)

This causes an error in my code.

I am new to Spark. Can anyone guide me on how to do this?

EDIT-1

val cols="no,name"
val excluColumns= Seq(cols)
df.drop(excluColumns:_*)
  .show()

This leads to a conversion issue.

Can you share how you created the RDD? RDDs don't have column names.
– Anahcolus, commented Apr 18, 2018 at 6:53

Anahcolus · Accepted Answer · 2018-04-18 07:19:09Z

3

RDDs don't have column names, so you will have to read the file as a DataFrame and use drop on it (assuming the file has a header row).

The file should look like this:

no,name,age
1,bill,23
2,charles,24
3,gates,45

Read it into a DataFrame as follows:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", true).load("File.csv")

which should give you

+---+-------+---+
|no |name   |age|
+---+-------+---+
|1  |bill   |23 |
|2  |charles|24 |
|3  |gates  |45 |
+---+-------+---+

Then you can build a sequence of the columns to be dropped and pass it to drop as below:

val excluColumns= "no,name".split(",")
df.drop(excluColumns:_*)
  .show()

This should leave you with only the age column:

+---+
|age|
+---+
| 23|
| 24|
| 45|
+---+
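For readers on Spark 2.0 or later, the same steps can be sketched with SparkSession and the built-in CSV reader. This is a minimal sketch, not the answerer's code: it assumes a local Spark session, the sample File.csv shown above, and a hypothetical helper parseColumnNames that turns the question's single comma-separated string into separate names.

```scala
import org.apache.spark.sql.SparkSession

object DropCsvColumns {
  // Hypothetical helper (not part of Spark): turns "no,name" into Array("no", "name"),
  // trimming whitespace and skipping empty entries.
  def parseColumnNames(csv: String): Array[String] =
    csv.split(",").map(_.trim).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.0+ running locally; change master/appName for a real cluster.
    val spark = SparkSession.builder()
      .appName("drop-csv-columns")
      .master("local[*]")
      .getOrCreate()

    // Built-in CSV reader; the first line of File.csv is treated as the header.
    val df = spark.read
      .option("header", "true")
      .csv("File.csv")

    // Column names arrive as one comma-separated string, as in the question.
    val excluColumns = parseColumnNames("no,name")

    // drop(colNames: String*) ignores names that are not in the schema.
    df.drop(excluColumns: _*).show()

    spark.stop()
  }
}
```

Because drop quietly ignores unknown names, a misspelled or unsplit column name produces no error and leaves the DataFrame unchanged, so it is worth checking df.columns when a drop seems to have no effect.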

edited Apr 18, 2018 at 7:19

answered Apr 18, 2018 at 7:03

Anahcolus


2 Comments

Mahendran V M Over a year ago

As in my question, the column names are in a single string val. How do I pass that string to Seq()? It causes an issue.

Mahendran V M Over a year ago

I have added EDIT-1 to my question. Also, my column names come as a single string only, not as separate values.
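A note on the difference between EDIT-1 and the answer's version, as a plain Scala sketch with no Spark dependency (the object and value names here are illustrative only): Seq(cols) wraps the whole string as one element, so drop would look for a single column literally named "no,name", whereas split(",") yields one element per column name.

```scala
object SplitColumnNames {
  val cols = "no,name"

  // One element: the entire string "no,name".
  val asSingleName: Seq[String] = Seq(cols)

  // Two elements: "no" and "name".
  val asSeparateNames: Seq[String] = cols.split(",").map(_.trim).toList

  def main(args: Array[String]): Unit = {
    println(asSingleName.length)    // prints 1
    println(asSeparateNames.length) // prints 2
  }
}
```

Passing asSeparateNames: _* to drop therefore supplies each column name as its own argument.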

Mohammad Neamul Islam · Accepted Answer · 2018-04-18 06:57:28Z

1

// C# (ASP.NET), not Spark: builds a CSV in memory and sends it as a file download.
StringWriter sw = new StringWriter();

// Header row.
sw.WriteLine("\"Id No\",\"Customer Name\",\"Customer Mobile No\",\"Customer BusinessZone\"");

Response.ClearContent();
Response.AddHeader("content-disposition", "attachment;filename=Security_User.csv");
Response.ContentType = "text/csv";

// One quoted CSV row per customer, containing only the chosen fields.
foreach (var user in _securityUserService.GetAllCustomer())
{
    sw.WriteLine(string.Format("\"{0}\",\"{1}\",\"{2}\",\"{3}\"",
                               user.Id,
                               user.FullName,
                               user.Phone,
                               user.BusinessZones.Name));
}

Response.Write(sw.ToString());
Response.End();

answered Apr 18, 2018 at 6:57

Mohammad Neamul Islam


1 Comment

Mahendran V M Over a year ago

Is this possible in Spark?
