Questions tagged [pandas]
Pandas is a Python data analysis library.
608 questions
4
votes
3
answers
84
views
Optimizing DataFrame iteration when generating large hierarchical text files
I have a custom object which stores dataframes in memory given a certain hierarchy, and I want to store this data in a file while maintaining the hierarchy. This hierarchy involved parents, children, ...
4
votes
4
answers
1k
views
8
votes
3
answers
491
views
Interpolating based on non-diagonal neighboring values
I have a comma-separated value (CSV) file as input, and I am supposed to interpolate all missing (nan) values based on neighboring non-diagonal values.
The CSV ...
4
votes
2
answers
353
views
rolling quarterly mean
I want to calculate the quarterly average of a time-indexed dataframe column in a rolling fashion. The mean at any timestamp should not contain information about future timestamps.
This is a code to ...
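A minimal sketch of the idea in this excerpt, not the asker's code: a trailing time-based window averages only the current and earlier rows. The column name `value`, the sample data, and the use of 90 days as a stand-in for one quarter are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical daily series; the column name "value" is an assumption.
idx = pd.date_range("2023-01-01", periods=200, freq="D")
df = pd.DataFrame({"value": range(200)}, index=idx)

# A trailing time-based window only looks backwards: the window ending at
# timestamp t covers (t - 90 days, t], so no future rows are included.
# Treating 90 days as "one quarter" is an approximation.
df["quarterly_mean"] = df["value"].rolling("90D").mean()
```

Early rows are averaged over whatever history exists, because offset-based windows default to `min_periods=1`.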
3
votes
3
answers
140
views
Increase time efficiency when writing arrays to CSV file
I have the following code to amend two rows of "test_base.csv" with the entries of the arrays "a_temp" and "b_temp," saving the result into "result.csv." .csv ...
4
votes
1
answer
214
views
Finding specific promotions from two columns [closed]
I'm trying to build a function that identifies those who are promoted into a list of jobcodes, or are promoted within that list of jobcodes.
Initially I was using ...
0
votes
1
answer
123
views
What's the fastest way to get "postcodes" for thousands of coordinates (latitudes & longitudes) in Python? [closed]
I have a dataset that contains 750,000 rows. I want to query each row and get the postcodes using the latitudes and longitudes.
Problem:
The code is executing very fast when I query like 100 rows, and ...
1
vote
1
answer
105
views
Replace iterrow loops in pandas matrices with something else to shorten the running time
This post is modified from this one: https://codereview.stackexchange.com/posts/292885/edit (Alternatives to iterrow loops in python pandas dataframes).
I have a piece of code to calculate price ...
6
votes
2
answers
748
views
Alternatives to iterrow loops in python pandas dataframes
I have a piece of code to calculate price sensitivity based on the product and its rating.
Below is the original data set with product type, reported year, customer’s rating, price per unit, and ...
2
votes
1
answer
58
views
Maintain a log containing values if certain conditions are met
I'm trying to capture profits and set a stop loss in my trading strategy. I want the stop loss to be set daily based on the past data and if the current price, i.e., price for the date falls below the ...
2
votes
1
answer
253
views
Python using generators with Excelwriter - Performance
I'm looking to understand if my code has an obvious blockage or performance pain point that will cause it to operate slower or use more memory than it should.
The current Excel file I am processing ...
3
votes
1
answer
296
views
Transferring dataframe columns into dataframe rows
I have the following data:
...
1
vote
1
answer
149
views
Custom neural network implementation in TensorFlow to compare normalisation vs. no normalisation on data
I am performing a sports prediction multi-class classification problem, and wanted to compare the differences in model performance between normalised and non-normalised data. You can see the 2 ...
3
votes
1
answer
245
views
Machine learning training, hyperparameter tuning and testing with 3 different models
I am trying to solve a multi-class classification problem that involves predicting the outcome of a football match (target variable = Win, Lose or Draw). With a dataset of 2280 rows, which is 6 seasons of ...
3
votes
1
answer
89
views
Calculating premium splits for policies
I am looking for a better way to write the transformation below using Python. Is it possible to avoid the loop and still achieve the desired output?
It is too slow for 10 million rows.
...
6
votes
2
answers
131
views
Creating csvs using Pandas on large dataset for document retrieval
I am trying to build a useable NLP corpus but getting bottlenecked by how long the program takes (200 hours). With so much data I know that optimizing my code even a little bit will net me huge time ...
2
votes
1
answer
90
views
Extending die roll simulations for complex data science tasks
I've developed a Python script that simulates die rolls and analyses the results. I'm now looking to extend and modify this code for more complex data science tasks and simulations.
Is this code ...
3
votes
3
answers
203
views
Syntactic sugar for derived variables from Pandas DataFrame columns
Update: Okay, after trying to use this for a while, I think it's probably a bad idea. Please use (lambda x: x["a"] + x["b"])(df) if really ...
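For reference, the expression quoted in this excerpt can be compared with `DataFrame.assign`, which also accepts a callable. This is only an illustrative sketch: the columns `a` and `b` come from the excerpt, while the sample values and the new column name `c` are assumptions.

```python
import pandas as pd

# Sample data is hypothetical; columns "a" and "b" come from the excerpt.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# The immediately-invoked lambda from the excerpt returns a Series.
total = (lambda x: x["a"] + x["b"])(df)

# assign() takes the same kind of callable and returns a new DataFrame
# with the derived column added (the name "c" is an assumption).
df2 = df.assign(c=lambda x: x["a"] + x["b"])
```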
0
votes
2
answers
161
views
Optimize a Python code which indicates duplicated values in an excel file [closed]
I wrote this code to indicate duplicated values. It actually works but I hope to know if there's another possible solution to optimize this process. Thanks.
...
2
votes
1
answer
101
views
Combined or separate data-cleaning routine
I am a junior data engineer who has 3 years of experience with Python. I write a lot of Python code for my job and I came up with this question I can't solve on my own. I don't have the chance to ...
2
votes
1
answer
75
views
Use row data from a database to find rows in dataframes that match and use data to generate a separate dataframe
I have a DataFrame (database_df) that contains the general record with the IDs that are the same team in each of the lines, containing these values I need to find ...
1
vote
2
answers
97
views
Improve performance when updating DataFrame rows based on complex criteria
My question got rejected the last time so I am trying a better approach to getting a solution:
...
2
votes
1
answer
210
views
Is this the right implementation for Linear Programming (PuLP) in Python?
I have created an LP function to help maximize a set of features. This is my first time using this library and also my first time doing LP.
Variables:
Number of features => X
Number of Categories => Y
...
2
votes
1
answer
74
views
Pandas to combine data files & add new calculated columns to result
I currently have the following Python code that adds a few calculated columns to my consol file. Essentially it combines all the sales files into one combined DF and then adds 4 new sales columns with ...
1
vote
1
answer
159
views
Using Pandas to group data based on name and see if column value is greater than or equal to values based on group names
As you'll see from the below code, I'm creating separate data frames of a much larger data frame, then updating a column for each one. What I'm doing is looking at the second column and checking to ...
1
vote
2
answers
106
views
Split Pandas dataset column based on values (suffixes: string operation)
In Python using Pandas, I am splitting a dataset column into 4 lists based on the suffix of the values. For the 3 suffixes I am using a list comprehension then for the 4th one, a set operation that ...
2
votes
1
answer
92
views
Generating Test Data with Python
Background: I'm a BI developer building a new dashboard for a client. They want to track performance for the week/month/year to date against the prior period. Unfortunately, I don't have direct access ...
2
votes
1
answer
177
views
Flag tukey outliers using python pandas groupby
I'm new to Python and pandas.
I would like to use pandas groupby() to flag values in a df that are outliers. I think I've got it working, but as I'm new to Python, ...
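One common way to do what this excerpt describes is sketched below. It is not the asker's code: it assumes the usual Tukey fences of 1.5 × IQR beyond the quartiles, and the column names `group` and `value` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical data: one obvious outlier in each group.
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 6,
    "value": [1, 2, 3, 4, 5, 100, 10, 11, 12, 13, 14, -50],
})

grouped = df.groupby("group")["value"]

# transform() broadcasts each group's quartile back onto its own rows,
# so the results align with the original index.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Tukey fences: flag values more than 1.5 * IQR outside the quartiles.
df["is_outlier"] = (df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)
```

Using `transform` rather than `apply` keeps the result the same length as `df`, so the flag can be assigned directly as a new column.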
2
votes
1
answer
173
views
Applying cointegration function from statsmodels on a large dataframe
I need to apply the coint function from the statsmodels library to 207 time series with 1397 points each, two by two.
Currently, it takes between 35-40 minutes on my computer with an Intel 24 Cores ...
1
vote
1
answer
405
views
Finding highly correlated variables in a dataframe by evaluating its correlation matrix's values
I read data from Excel into a Pandas DataFrame, so that every column represents a different variable, and every row represents a different sample. I made the function below to identify potential ...
7
votes
2
answers
632
views
groupby in pandas and plot
I have a csv file that looks like this:
...
1
vote
1
answer
420
views
Protecting functions from empty DataFrames
Pandas likes to throw cryptic errors when you feed its functions with empty DataFrames saying nothing that would help you to identify the root cause. In order to ...
2
votes
2
answers
2k
views
Mapping pandas' Series to dataclasses
I've got something really simple this time where I'm mapping pandas' Series to dataclasses with a oneliner helper function (as ...
1
vote
2
answers
221
views
Replace personal names and addresses with company ones
The problem:
I am given a data frame. Somewhere in that dataframe there is 3*N
number of columns that I need to modify based on a condition. The
columns of interest look like this:
names_1
address_1
...
5
votes
1
answer
219
views
constraint solving graduation using HTML Parsing, pandas, and z3
I'm not sure if this project fits on Code Review, but my code is getting extremely messy, and I would love some tips to clean it up!
Overview
The project is designed to take in an HTML file (a degree audit),...
1
vote
1
answer
100
views
simulated samples for central limit theorem
I am trying to help students visualize the central limit theorem and wanted to do this with simulated data.
I created a population dataset with three variables:
...
1
vote
1
answer
389
views
Pandas Upsampling Time Series Splitting Equally the values through the weeks starting on monday
I built my code by studying this question: "Divide total sum equally to higher sampled time periods when upsampling with pandas".
I am wondering whether the code can be improved and whether it is correct.
...
1
vote
1
answer
111
views
Write a Python script to generate a random DataFrame based on specific inputs
I have often found myself trying to generate fake DataFrames in pandas. I decided, just for fun, to write a script for which I can specify some inputs and ...
4
votes
2
answers
239
views
Unstructured to Structured TOC
The following code tries to convert an unstructured TOC with bounding box layout data given by the output of pdftotext -bbox-layout -f 11 -l 13 new_book.pdf toc.html...
4
votes
1
answer
1k
views
Python BeautifulSoup - preparing HTML rows and td tags for Pandas
I'm using BeautifulSoup to parse a bunch of combined tables' rows, row by row, column by column to prepare it for import into Pandas. I can't use to_html() because ...
1
vote
1
answer
200
views
Efficient way to read files python - 10 folders with 100k txt files in each one
I am looking for an efficient way to read and append the text of .txt files to a dataframe. I currently have 10 folders with 100k documents each.
What I specifically need to do is:
getting the names of ...
1
vote
1
answer
121
views
Make unique id based on text data column with similarity scoring
I have the following dataframe:
...
1
vote
1
answer
100
views
Find profitable bets from historic results
Each line in my CSV is a possible investment that I record in a history, but I would only make the investment if, in the existing history (previous lines), the sum of the results is above ...
2
votes
1
answer
84
views
Create new columns in a DataFrame using functions and reposition the new columns
I would like a review regarding the method I use to create the new columns and then reposition them in the correct place where they should be.
The new column called ...
-4
votes
1
answer
49
views
Find characters from same homeworld as Chewbacca [closed]
The problem is
Find the names of all characters which are from the same homeworld as Chewbacca
My code is
...
5
votes
1
answer
203
views
Web scraper for data sources from Statistics Canada
I've written a parser to scrape data from the Canadian Statistics Bureau.
...
2
votes
1
answer
194
views
Efficient List comprehension with multiple conditions using shift? [closed]
I am new to Python.
I am trying to get the total number of failures by first checking how the Failure Sensor column transitioned, then creating the Start column from devicetimestamp if the ...
3
votes
1
answer
83
views
Cleaning Float Column of Longitude
I am cleaning a dataset where the lat and long columns contain some values multiplied by 10. Not only by 10, but by varying powers 10^n. I wrote the code below. I am not sure if it is the best way, but is ...
1
vote
0
answers
57
views
BoundingBox dataclass implementation with cupy, cudf, and nvector
The dataset I'm working with is rather large so I've been experimenting with cudf and cupy. Here you can find instructions for ...
2
votes
1
answer
335
views
python: requests large.zip -> unzip -> fix -> filter ->gunzip
I wrote a function to download a large zip file (5-7 GB) from the Iowa State MRMS data archive.
The zip files appear to be malformed and result in a BadZipFileError hence ...