My R experience is pretty limited. I'm working on some textual analysis of ~11,000 survey comments. I'm guided primarily by the Silge & Robinson "Text Mining with R" book. Anyway....
There are several different locations in the dataset and I have split the data into a number of frames representing "Location_X" and "Not_X", "Location_Y" and "Not_Y" etc. I've then calculated the relative frequency of words (starting with individual words) and wind up with a dataframe named scatter_frequency that looks like
+---------------+--------------+--------------+
| word | location_x | not_x |
+---------------+--------------+--------------+
| acceptance | 1.538130e-04 | 8.972231e-05 |
| accepted | 1.076691e-04 | 1.794446e-04 |
| accepting | 1.768850e-04 | 1.794446e-04 |
| access | 8.305903e-04 | 8.075008e-04 |
| accessible | 1.461224e-04 | 4.486115e-05 |
| accident | 7.690651e-06 | 4.486115e-05 |
| accolades | 7.690651e-06 | 4.486115e-05 |
| accommodate | 2.307195e-05 | 4.486115e-05 |
| accommodating | 1.538130e-05 | 4.486115e-05 |
| accomplish | 4.460578e-04 | 7.626396e-04 |
| accomplished | 3.614606e-04 | 3.140281e-04 |
+---------------+--------------+--------------+
and so on for ~4,000 rows
I then plot
library(ggplot2)
library(scales)  # for percent_format()

ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
  geom_abline(color="gray40", lty=2) +
  geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
  geom_text(aes(label=word), check_overlap = TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  # no color aesthetic is mapped above, so this scale has no visible effect
  scale_color_gradient(limits=c(0, 0.001),
                       low="darkslategray4", high="gray75") +
  theme(legend.position = "none") +
  labs(x="Location X", y="Not X")
and produce this plot
You can see where I blurred out some identifying terms, but this is pretty representative.
So far so good...we can now see which terms appeared frequently (further to the right) and more frequently in one data set than the other (further away from the line). What's interesting are the terms that appear furthest from the line, as they are either conspicuously common or uncommon at location x. The terms near the line aren't all that interesting. This was a survey on management, so it's no surprise "leadership" and "management" appear. But the fact that "abusive" is much more common at location x than the other locations IS interesting. And I'd like to know what word corresponds to the dot that is well off the line below and to the left of "shop".
So my question is, is there a programmatic way to restrict labeling to those "interesting" points? As in, choose which points are labeled based on their distance from the line?
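To make the idea concrete, here is a rough sketch of the kind of thing I imagine, not something I know to be the right approach. It assumes "distance from the line" means the perpendicular distance from y = x measured on the log10 scale (since both axes are log-scaled). The `dist` column, the `cutoff` value of 0.3 and the `interesting` data frame are names and numbers I made up for illustration, and the toy data frame just reuses a few rows from the table above. I also swapped `geom_jitter` for `geom_point` here so the labels would sit on their points.

```r
# Toy stand-in for scatter_frequency (a few rows copied from the table above)
scatter_frequency <- data.frame(
  word       = c("acceptance", "accepted", "accessible", "accident", "access"),
  location_x = c(1.538130e-04, 1.076691e-04, 1.461224e-04, 7.690651e-06, 8.305903e-04),
  not_x      = c(8.972231e-05, 1.794446e-04, 4.486115e-05, 4.486115e-05, 8.075008e-04),
  stringsAsFactors = FALSE
)

# Perpendicular distance from the y = x line on the log10 scale
# (hypothetical definition of "distance"; a plain log ratio would rank words the same way)
scatter_frequency$dist <- abs(log10(scatter_frequency$location_x) -
                              log10(scatter_frequency$not_x)) / sqrt(2)

# Hypothetical cutoff in log10 units; would need tuning by eye
cutoff <- 0.3
interesting <- subset(scatter_frequency, dist > cutoff)

# Plot every point, but pass only the "interesting" rows to geom_text()
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(scatter_frequency, aes(x = location_x, y = not_x)) +
    geom_abline(color = "gray40", lty = 2) +
    geom_point(alpha = 0.1, size = 2.5) +
    geom_text(data = interesting, aes(label = word),
              check_overlap = TRUE, vjust = 1.5) +
    scale_x_log10(labels = scales::percent_format()) +
    scale_y_log10(labels = scales::percent_format()) +
    labs(x = "Location X", y = "Not X")
  # print(p) to draw the plot
}
```

Instead of a fixed cutoff, I suppose one could also sort by `dist` and keep the top N rows, e.g. `head(scatter_frequency[order(-scatter_frequency$dist), ], 20)`.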
This may not be the best formed question...thanks in advance for your patience.

