I have to test for linearity between different pairs of variables. Because I have many variables, I'm using ggpairs(data) to do it. But many of my variables contain zeros, so a lot of points sit at x=0 or y=0. My graphs look similar to this:
I would say that there is a positive correlation, but because of the x=0 and y=0 values, I'm no longer sure how to interpret it. So, my questions are:
Do we have to remove these points (the y=0 and x=0 values) from the scatter plot when we test for linearity and when we calculate the Pearson correlation coefficient, or should we include them?
If we need to exclude them, how can we do it so that only the y=0 and x=0 points are removed from the corresponding scatter plot, without removing the entire row from the data set and without affecting the other scatter plots?
As an example, we can use the data set below. The variables that I have used for the scatter plots (one plot per pair) are D_biologie, D_chimie, D_math, D_physic...., which give the duration of work in each field, in years.
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), D_biologie = c(0,
1, 5, 2, 3, 12, 0, 4, 0, 0), D_chimie = c(2, 9, 0, 4, 0, 40,
0, 6, 9, 0), D_math = c(5, 2, 0, 6, 0, 30, 10, 7, 0, 50), D_physic = c(12,
3, 5, 7, 12, 5, 0, 9, 40, 6), D_french = c(40, 4, 35, 9, 40,
0, 4, 4, 5, 7), D_eng = c(30, 0, 0, 10, 30, 4, 2, 0, 0, 50),
D_hist = c(5, 6, 0, 4, 5, 0, 6, 7, 0, 0), D_geo = c(0, 8,
2, 0, 0, 0, 9, 1, 0, 0)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L))
y = x. Absent more context, I would assume a correlation should include all the data.
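If you do decide to exclude the zeros, here is a minimal sketch of one way to do it pair by pair. It assumes that a zero should be treated as "no value" for that pair, which is my assumption and not something the data tells us. The names `df`, `df_na`, `cor_all` and `cor_nonzero` are made up for the example, and only three of the columns are used to keep it short. The idea is to recode 0 as NA in a copy and let `cor()` drop incomplete observations separately for each pair, so no row is deleted from the original data.

```r
# Three columns from the question's data set, as a plain data frame
df <- data.frame(
  D_biologie = c(0, 1, 5, 2, 3, 12, 0, 4, 0, 0),
  D_chimie   = c(2, 9, 0, 4, 0, 40, 0, 6, 9, 0),
  D_math     = c(5, 2, 0, 6, 0, 30, 10, 7, 0, 50)
)

# Work on a copy so the original data stays untouched,
# and recode every 0 as NA (assumption: 0 means "no value here")
df_na <- df
df_na[df_na == 0] <- NA

# Pearson correlation using all rows, zeros included
cor_all <- cor(df, method = "pearson")

# Pearson correlation where each pair of variables uses only the rows
# in which both values are present (i.e. both were non-zero)
cor_nonzero <- cor(df_na, use = "pairwise.complete.obs", method = "pearson")

print(round(cor_all, 2))
print(round(cor_nonzero, 2))
```

With only 10 rows, some pairs keep very few points once the zeros are dropped (D_biologie and D_chimie keep 4), so the resulting coefficients should be read with caution. I believe passing `df_na` to `ggpairs()` would likewise drop the missing points panel by panel, but check the warnings it prints to confirm.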