2

I am new to python and pandas and have created a test web page with html code to use to help with learning how to pull the data and then format into CSV for use in excel. Below is the code I have come up with that puts it into a nice format but I am stuck on how to format it into a CSV file to import.

Code:

# Importing pandas 
import pandas as pd 

# The webpage URL whose table we want to extract 
url = "/home/dvm01/e007"

# Assign the table data to a Pandas dataframe 
table = pd.read_html(url,**index_col=0**)[0]
#table2 = pd.read_html(url)[0],pd.read_html(url)[1],pd.read_html(url)[6]

# Print the dataframe 
print(table)
#print(table2)

# Store the dataframe in Excel file 
#table.to_excel("data.xlsx") 

Output:

            Account                                          Account.1
ID:                                         e007
Description:  ABST: 198, SUR: J DOUTHIT
Geo ID:                            014.0198.0000

What I am trying to figure out is how to remove the index for the rows and make the text before the first: to be a column header. In row 1 I have two: but everything after the first: should be the data for the column header.

I would like to take the above current output and and have ID, Description, and Geo ID as the column headers and the text that comes after the ':' to be the data for each of the headers.

I do not need 'Account' and 'Account.1' I believe these are being recognized as column headers. Below is what I would like the output to look like in Excel, but I cannot figure out how to format it correctly to export out to a CSV that can be imported. Maybe I do not even need to import or format into a CSV, the 'table.to_excel' function seems to not need that step.

+------+---------------------------+---------------+
| ID   | Description               | Geo ID        |
+------+---------------------------+---------------+
| e007 | ABST: 198, SUR: J Douthit | 014.0198.0000 |
+------+---------------------------+---------------+

I was able to remove the index numbers, by using index_col=0 above where I define the dfs variable. Not sure that is the best way but it does do what I was trying to accomplish for that portion.

Since I am new to python I am having a hard time formatting my question into Google or StackOverflow to get the answers I am looking for. If someone could just point me in the right direction in what I am looking for, that would work but examples would be nice as well.

Thanks for any guidance

4
  • 1
    If you're trying to drop the index in your exports from pandas, you need to use the index=False keyword argument in your output function, such as to_csv or to_excel. A sample of how you want your output to look vs. how it looks now is always good practice. Commented Oct 18, 2019 at 18:18
  • 1
    + @mayosten don't worry about this, when I started to learn python (and pandas at the same time) didn't know how to remove the index for a long time, heck didn't even know it was called an index. You'll get there mate! Commented Oct 18, 2019 at 18:21
  • I am trying to format what I would like the output to look like but not having much success. Reading through the stackoverflow.com/editing-help#code I should be able to use <br> tags or <pre> tags to format text but it is not working Commented Oct 18, 2019 at 18:39
  • 1
    @Datanovice, thanks for the formatting, I was going crazy trying to figure out how to make that look correct. Commented Oct 18, 2019 at 19:38

1 Answer 1

2

so to format your questions you can show us an example of what you want. try something like this:

|id|name|data1|data2|date3|-url-|
|--|----|-----|-----|-----|-----|
|1 |xyz |datax|datay|dataz|x:url|
|2 |xyz |datax|datay|dataz|x:url|
|3 |xyz |datax|datay|dataz|x:url|
...

Then you can ask questions about how to create the right Dataframe output that fits your desired design:)

you can also use this generator online: https://www.tablesgenerator.com/text_tables

+----+------+-------+-------+-------+------+
| Id | Name | Data1 | Data2 | Data3 | Url  |
+----+------+-------+-------+-------+------+
| 1  | xyz  | datax | datay | dataz | xurl | 
+----+------+-------+-------+-------+------+
| 2  | xyz  | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+
| 3  | xyz  | datax | datay | dataz | xurl |
+----+------+-------+-------+-------+------+

Ok now that you have your Data Table design. next I would ask you to try using Jupyter Notebook. This will let you test your dataframes line by line. Each Test should be a new transfermation of the dataset.

How I see the workflow going to get to your needs: 1. See test and see what your current DF columns are:

print(df.columns)

2. use this command to edit your columns:

df.rename(columns={'old column 1':'ID',
                          'old column 2':'Description',
                          'old column 3':'Geo ID'}, 
                 inplace=True)
  1. use this command change an index data

    df.rename(index={0:'zero',1:'one'}, inplace=True)

  2. use this command to change a row

    df.loc['--insert_Column_here--', '--insert_row_here--'] = new_value
    
Sign up to request clarification or add additional context in comments.

2 Comments

thank you, I was seriously beginning to lose it trying to get that table to format.
@Rey youre welcome please help me by upvoting and or accepting my answer:)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.