Questions tagged [python]
Use for data science questions related to the programming language Python. Not intended for general coding questions (which should be asked on Stack Overflow).
6,626 questions
0
votes
0
answers
12
views
Unexpected Feature Importance Pattern in Random Forest Classification of MNIST Digits 0 and 1
I performed Random Forest–based feature importance analysis on the MNIST dataset, focusing only on digits 0 and 1.
When I visualize the importance map (see image below), it doesn’t resemble the ...
0
votes
0
answers
10
views
How can I group transcribed phrases into meaningful chunks without using complex models?
I have a large set of phrases obtained via Azure Fast Transcription, and I need to group them into coherent semantic chunks (to use later in a RAG pipeline).
Initially, I tried grouping phrases based ...
0
votes
0
answers
17
views
How to extract my fingerprint from my laptop's finger sensor
I have a set of fingerprints as a dataset (provided by my college). I want to use these fingerprints to train a model to understand the different things. That is beside the point. ...
1
vote
0
answers
34
views
How to identify and quantify main tendencies across participants from cluster membership heatmaps?
I'd appreciate your thoughts on the following problem.
I've created a heatmap plot (attached) showing the cluster membership ratio for each participant (in separate subplots) and condition (η).
Now, I'...
1
vote
0
answers
11
views
How to interpret an unstable learning curve on a model tuned with Hyperband Tuning?
I have used Hyperband automatic tuning for an ANN model to predict price. After running the model with the automatic tuning, I am obtaining an R2 score of 1.00 that suggests overfitting, however, I am ...
4
votes
0
answers
29
views
Time-efficient parallelization of masks for pre-processing a dataset
I have a large dataset (~10M points) in Python and I want to filter it using a large number of different custom masks, as part of calculations to create a new but related dataset. Because the dataset ...
5
votes
1
answer
59
views
Jupyter notebooks compiled from different building blocks
I use Jupyter notebooks to teach programming, using markdown in text cells, and I want to separate the concepts by level-1 headings (starting with # Heading), for ...
4
votes
1
answer
76
views
RAG Chatbot does not keep track of chat session history
I built a RAG chatbot in Python with LangChain, an OpenAI LLM, and FAISS for the vector store. The data is stored as JSON. The chatbot does not always keep track of the inputs and outputs.
Here is an ...
2
votes
1
answer
40
views
Is it possible to make the Python Script widget in Orange both give output and receive input (in the same widget)?
I'm working on a project that involves loop control. When I try to implement it in the Orange platform, I'm unable to connect one widget (Python Script) to another in a loop, as the connection is ...
0
votes
1
answer
58
views
NLP: How to clean the data of a conversation correctly?
Say we have the data as follows
Input
...
2
votes
0
answers
65
views
RAG Chatbot does not answer paraphrased questions
I built a RAG chatbot in Python with LangChain and FAISS for the vector store.
The data is stored as JSON.
The chatbot sometimes refuses to answer when a question is rephrased.
Here are two ...
0
votes
0
answers
45
views
Qiskit Problem: this solution is a bit slow, is there a way to make it faster and increase the accuracy a little bit?
I'm currently making a small binary classification program using Quantum Machine Learning (EstimatorQNN to be more specific). My program classifies data inside the Wisconsin Breast Cancer database and ...
5
votes
1
answer
119
views
How to get MLFlow built container to listen on 0.0.0.0?
I'm following this tutorial and am stuck on step 8: https://mlflow.org/docs/latest/ml/getting-started/hyperparameter-tuning/#test-your-container
The inference server is listening on ...
7
votes
1
answer
76
views
Time series imputation using transformers and LLMs
I am working with multivariate time-series data. Is it possible to impute or interpolate the missing data using transformers or pre-trained, fine-tuned LLMs?
Any insights would be appreciated.
...
8
votes
1
answer
150
views
How to correctly implement the loss function for my distillation of Mask2Former?
I have a Mask2Former model fine-tuned on my own custom dataset and it is working nicely. I want to play around with knowledge distillation and use my pretrained ...
4
votes
2
answers
441
views
NLP of noisy unpredictable text to extract dates--just regex?
Question: Are there better approaches than regex for extracting event dates (including relative) from noisy text? Are there NLP tools that can help disambiguate multiple date mentions in various ...
9
votes
2
answers
215
views
Is it best practice to remove outliers from transaction data used for training?
I am building a random forest regression model. The goal is to predict the maximum each customer will spend in a single transaction during the next 90 days.
I have transaction data for 7m customers, ...
1
vote
0
answers
29
views
Issue with running training on multigpu using DDP
I am training a classifier model but since it is taking far too long I want to use multigpu for the training. The current code is
...
10
votes
1
answer
5k
views
Is CUDA 13 a thing (or am I misinterpreting something)?
A few days ago I installed my new NVIDIA GeForce RTX 5090 and I can't get PyTorch to work on my Win11 Desktop (just background info, the question is not directly ...
4
votes
1
answer
49
views
Is there a way to programmatically schedule jobs on Airflow or Cron Daemon?
The question is more data engineering related than data science, but since there is no data engineering Stack Exchange, I thought I would ask it here.
Basically, as the title says. So, as part of a ...
9
votes
1
answer
1k
views
What should a typical reward curve look like while training an RL model?
I have set up a DQN with TorchRL to solve a problem where the agent can move in a square grid and pick some rewards scattered randomly on it. Right now, I am using a 5x5 grid and have 3 rewards on it. ...
7
votes
1
answer
296
views
SciPy's dendrogram method depicts two cluster merges as one
I am following the example code in the linkage documentation:
...
6
votes
1
answer
87
views
SciPy's linkage method should take 1D condensed distance matrix of length n choose 2
I am educating myself on hierarchical clustering and the relevant SciPy methods.
The 1st argument of the linkage method is a 1D condensed distance matrix $X$ of ...
1
vote
1
answer
70
views
Improving a GenAI Tool to Explain XGBoost Model Outputs for Individual Predictions
I have developed an XGBoost model to predict a target variable based on a set of input indicators. I'm now building a Generative AI-based tool that can take an individual's data—i.e., the values of ...
0
votes
0
answers
21
views
Runtime complexity of scikit-learn’s One-vs-Rest LogisticRegression (LBFGS) vs. RidgeClassifier
I’m working through the runtime analysis of scikit-learn’s OneVsRestClassifier for two cases:
LogisticRegression (solver=lbfgs, ...
1
vote
0
answers
50
views
How to improve a classification model (item will sell that day or not) for a dataset with multiple sparse time series?
I am trying to create one big model (LightGBM) that forecasts sales for each product for a cosmetics chain store. The dataset I am working with covers the last 5 years and has these columns:
...
4
votes
1
answer
214
views
Why do I need to call np.transpose() on this?
I have the following Python script:
...
0
votes
0
answers
30
views
expected the model to forecast resolution time more accurately based on past ticket patterns. I was also hoping to unde
I want to build a model that forecasts ticket resolution time for data science software support tickets. I’ve calculated queuing time and resolution time from ...
2
votes
1
answer
90
views
How to visualize the images?
Suppose we have 24 images per day, one per hour. Each image is a 24×24 CSV file.
I do the following transformation for every day:
The first image is unchanged.
For the second image, move
column i ...
2
votes
1
answer
41
views
N-BEATS, PyTorch Forecasting: predictions are slightly shifted
I am applying the N-BEATS model of the pytorch-forecasting package to a traffic dataset. I am doing single-step prediction with a context length of 5. Now the prediction is unfortunately slightly ...
1
vote
1
answer
137
views
How to preprocess code samples for a neural network to detect AI-generated code?
I’m building a plagiarism detector to identify AI-generated code on platforms like Codeforces. I’ve scraped 1,193 human and AI-generated code samples (Python, C++, Java) for the same problems. My goal ...
1
vote
0
answers
127
views
ML models that train on graphs but infer without any edges (edge prediction task)
I'm exploring a machine learning research direction and I'm looking for ideas or pointers to existing models/projects that fit the following setup:
The model is trained on graphs with edge information ...
1
vote
0
answers
29
views
Time series OLS: Stationarity transformation
I am building a time-series forecasting model using OLS.
For preparation, I am making all series stationary (for now).
What I don't understand:
Series should be stationary
Achieving stationarity can ...
0
votes
0
answers
31
views
Predicting dependency links between industrial tasks using a transformer (CamemBERT) — poor results
I'm working on a machine learning project aimed at automatically predicting dependency links between tasks in industrial maintenance procedures in a group of tasks called gamme.
Each gamme consists of ...
4
votes
2
answers
144
views
Quants: Beta calculation using pandas
Editing to add one key piece of information (df and dailyRet); I realized how important it is only after solving this issue.
...
1
vote
0
answers
44
views
VARMA runtime issues: fixed window rolling forecasting
I'm currently exploring a couple of statistical forecasting methods. My problem arose when using a VARMA(2,1) fixed-window rolling forecast. The example code that I'm using is the following:
Here I only ...
3
votes
1
answer
112
views
Which model is the best suitable for generating edges?
I'm trying to develop a model that can generate dependencies between industrial tasks. To do that, I went for the GNN solution: I have nodes = tasks, dependencies = edges, and have ...
0
votes
1
answer
37
views
Why is my upscaling GAN not working?
I have been trying to code an upscaling GAN, but while the code runs, I almost always end up with terrible results when the GAN doesn't collapse, and collapse happens often.
I previously tried to ...
2
votes
0
answers
48
views
Need help with model architecture and sampling negative edges
I am currently training a graph transformer model in order to develop an AI that can generate edges on an unseen graph (link dependencies between texts using historical data).
I divided my ...
1
vote
0
answers
40
views
GNN Loss NaN after first training example?
I am trying to train a GNN but am getting a NaN loss immediately after the first training example. Below I have included all of the pertinent code. My input is 385 points in 3D space confined ...
1
vote
1
answer
56
views
Need help straightening and cropping images properly for a computer vision requirement
My requirement:
I need to extract license plates without duplicates and store the images in a folder, then apply OCR to extract text from the images.
What I have achieved:
I am able to detect license plates ...
1
vote
0
answers
38
views
How to correctly use a transformer model for a dependency generation project
I'm currently trying to train a model to predict dependencies between texts (here, industrial tasks) based on historical data. The goal is to learn that "Task A precedes Task B for ...
2
votes
1
answer
129
views
XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed
I'm currently working on a Parallel and Distributed Computing project where I'm comparing the performance of both XGBoost and CatBoost when trained on CPU vs GPU. The goal is to demonstrate how GPU ...
3
votes
0
answers
46
views
Suppose one category in a variable creates data leakage: can we use the other categories of the same variable as dummies to predict?
We are predicting conversion. Conversion means a customer converted from paying one-off to paying regularly (subscribing).
Suppose one feature is the categorical feature "Activity", consisting of 15+ ...
3
votes
0
answers
43
views
How can I plot large .nc files with xarray and matplotlib?
I have an 11 GB .nc file with lon/lat positions and particle trajectories on the ocean surface for a timespan of 40 days.
For small files (approx. 140 MB) I use xarray, netCDF4, matplotlib and cartopy to ...
8
votes
1
answer
179
views
How to correctly perform link prediction inference on a new, unseen graph?
I'm working on an industrial AI use case where I train a Graph Neural Network (GCN) for link prediction — specifically, to predict successor tasks in project planning graphs (e.g., for construction or ...
2
votes
0
answers
36
views
Anomaly detection time in time-series for drops
I am looking into different statistical methods for determining a decrease in a numeric "count" feature across a time-series dataset. The dataset is relatively small (about 50 records), and ...
4
votes
0
answers
27
views
Low Accuracy from Geospatial Random forest ML modeling problem - Training Exported from qGIS, SCP
I am doing a geospatial assessment integrated with ML modeling. The problem is the very low accuracy, which gets lower as the number of training features increases. What could be the solution to such ...
1
vote
0
answers
37
views
Isolation Forest sample size
I am using sklearn's Isolation Forest as a model to detect anomalies. My dataset is relatively small, 50 records with only 2-3 features.
To prevent any overfitting, what would you recommend to tune ...
2
votes
0
answers
41
views
What's wrong with my ML implementation? (from a technical report)
I came across a (short and curt) technical report that claims to be SOTA on keyword spotting, but it didn't share its code and had a very short explanation of its network. I implemented the model, but ...