1

I have a data in CSV in below format:

"/some/page-1.md","title","My title 1"
"/some/page-1.md","description","My description 1"
"/some/page-1.md","type","Tutorial"
"/some/page-1.md","index","True"
"/some/page-2.md","title","My title 2"
"/some/page-2.md","description","My description 2"
"/some/page-2.md","type","Tutorial"
"/some/page-2.md","index","False"
"/some/page-2.md","custom_1","abc"
"/some/page-3.md","title","My title 3"
"/some/page-3.md","description","My description 3"
"/some/page-3.md","type","Tutorial"
"/some/page-3.md","index","True"
"/some/page-3.md","custom_2","def"

I am reading it to Pandas DataFrame:

df = pd.read_csv(csvFile, index_col=False, dtype=object, header=None)
print(df)

Output is following:

                  0            1                 2
0   /some/page-1.md        title        My title 1
1   /some/page-1.md  description  My description 1
2   /some/page-1.md         type          Tutorial
3   /some/page-1.md        index              True
4   /some/page-2.md        title        My title 2
5   /some/page-2.md  description  My description 2
6   /some/page-2.md         type          Tutorial
7   /some/page-2.md        index             False
8   /some/page-2.md     custom_1               abc
9   /some/page-3.md        title        My title 3
10  /some/page-3.md  description  My description 3
11  /some/page-3.md         type          Tutorial
12  /some/page-3.md        index              True
13  /some/page-3.md     custom_2               def

I'd like to transform it to DataFrame in below format, where first header is "file" and values are from column 0. Other headers are taken from column 1 and values from column 2:

              file       title       description      type  index  custom_1  custom_2
0  /some/page-1.md  My title 1  My description 1  Tutorial   True       NaN       NaN
1  /some/page-2.md  My title 2  My description 2  Tutorial  False       abc       NaN
2  /some/page-3.md  My title 3  My description 3  Tutorial   True       NaN       def

Is there a way to do this with Pandas?

1
  • Use, e.g.: df.pivot(index=0, columns=1, values=2).rename_axis(index='file', columns=None).reset_index() Commented Apr 3, 2024 at 11:57

1 Answer 1

0

I have changed your first column names to file, header, and value. So, can easily handle what u want. You need to use pivot_table method to reach your goal. The final code is shown in below.

df = pd.DataFrame(data, columns=["file", "header", "value"])


result = df.pivot_table(index='file', columns='header', values='value', aggfunc='first').reset_index()

result = result[result.index.notna()]

Your output will be like that. Soi we need to remove "header" label .

header             file custom_1 custom_2       description  index       title      type
0       /some/page-1.md      NaN      NaN  My description 1   True  My title 1  Tutorial
1       /some/page-2.md      abc      NaN  My description 2  False  My title 2  Tutorial
2       /some/page-3.md      NaN      def  My description 3   True  My title 3  Tutorial

For removing "header" label u need to use :

result.columns.name = None

Final output will be like

              file custom_1 custom_2       description  index       title      type
0  /some/page-1.md      NaN      NaN  My description 1   True  My title 1  Tutorial
1  /some/page-2.md      abc      NaN  My description 2  False  My title 2  Tutorial
2  /some/page-3.md      NaN      def  My description 3   True  My title 3  Tutorial
Sign up to request clarification or add additional context in comments.

4 Comments

You don't need pivot_table, unless there are duplicates, and if there would be, it is not at all clear from the OP's question whether first should then be the aggfunc. Anyway, the question is a common duplicate.
yeah I most of time use pivot_table for any case. So, far it worked for me
My point was that simply df.pivot also works if there are no duplicates (and there aren't any in the OP's example). Otherwise, the OP hasn't given an indication on what s/he wants to be the output if there are duplicates. Your code would show no error, and pick the first. Maybe OP wants that, maybe not. Maybe it should raise an error, maybe the aggfunc should be a list. So, saying it "works", doesn't mean it leads a result that the OP expects. Anyway, maybe at least add that this code will pick the first when there are duplicates :).
thanks for your advice. I got your point and you are right. I appreciated @ouroboros1

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.