I’m building a plagiarism detector to identify AI-generated code on platforms like Codeforces. I’ve scraped 1,193 human- and AI-generated code samples (Python, C++, Java) for the same problems. My goal is to train a neural network to distinguish them. I’ve tokenized the code using Python’s tokenize module, but am unsure how to handle multi-language code or convert it into features like ASTs or embeddings. What’s the best way to preprocess these samples for a binary classification model? I’m using Python. Any advice on feature extraction, on tools like tree-sitter, or on which neural-network architecture would best suit this task would help!
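For context, my tokenization step currently looks roughly like this (Python samples only; this is a sketch, and handling the C++/Java samples is exactly what I'm unsure about):

```python
import io
import tokenize

def code_tokens(source):
    """Turn Python source into a flat list of "TYPE:string" tokens,
    skipping layout-only tokens that carry no content."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        tokens.append(tokenize.tok_name[tok.type] + ":" + tok.string)
    return tokens

print(code_tokens("x = 1 + 2"))
# → ['NAME:x', 'OP:=', 'NUMBER:1', 'OP:+', 'NUMBER:2']
```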
-
$\begingroup$ What you are trying to do is what LLMs are for. Kindly get a basic understanding of what an ANN can and can't do. These problems require you to train on sequences of tokens, for which you use Transformers and RNNs. Kindly go through this course to tackle the problem - onlinecourses.nptel.ac.in/noc25_cs45/announcements?force=true $\endgroup$ The_Data_Scientist_Man – Jun 7, 2025 at 12:53
-
$\begingroup$ Yeah, but this can't provide the accuracy I want, and when it comes to detection it is crucial to focus on accuracy. $\endgroup$ vinod pandey – Jun 7, 2025 at 15:19
1 Answer
You are unlikely to get good results in such a short period of time; however, this problem can be viewed as an NLP task, for which you can use an NLP package like spaCy. One, mostly off-the-shelf approach would be:
- Post-process the output of Python's `ast` module to create spaCy `Doc` objects. (You may find the `asttokens` module helpful here.)
- Add a label distinguishing "AI-generated span" and "manually-written span" to `.cats` of each `Doc`.
- Put the `Doc`s in two `DocBin`s, ensuring that you've got roughly 50% manually-written and 50% AI-generated in each. (Aim for exactly 50%: that's easier.) Thus produce two serialised `.spacy` files: `training_data.spacy` and `evaluation_data.spacy`, which should not share any `Doc`s. (Double-check this by loading them!)
- Devise a pipeline. I'd recommend turning the `tokenize` module into a custom spaCy `Tokenizer`, then training a `Tagger`, `DependencyParser`, `EntityRecognizer`, and finally a `TextCategorizer` to make the verdict.
- Describe the pipeline in a training config file.
- Train a model on your training data, using the config file.
- Evaluate your model on your evaluation data.
- If it looks like it's working, combine all your data together into `all_data.spacy` and produce a final "best" model. (You won't be able to evaluate this model properly without new test cases.)
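The first three steps might look roughly like this (a sketch: the category names `AI_GENERATED`/`HUMAN_WRITTEN` and the toy token lists are mine; your real version would build the `Doc`s from the `ast`/`asttokens` output, and write an `evaluation_data.spacy` the same way):

```python
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")

def make_doc(token_strings, is_ai):
    # Build a Doc directly from pre-split tokens, then attach the
    # classification label to .cats (step 2).
    doc = Doc(nlp.vocab, words=token_strings)
    doc.cats = {"AI_GENERATED": float(is_ai), "HUMAN_WRITTEN": float(not is_ai)}
    return doc

# Step 3: pack the labelled Docs into a DocBin and serialise it.
samples = [(["x", "=", "1"], True), (["y", "=", "2"], False)]  # toy data
db = DocBin()
for tokens, label in samples:
    db.add(make_doc(tokens, label))
db.to_disk("training_data.spacy")
```

Loading the file back with `DocBin().from_disk(...)` and iterating `get_docs(nlp.vocab)` is also how you'd double-check that the two files don't share any `Doc`s.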
I would not expect this to give good results: automatic detection of AI-generated code is very hard, in a world with bad actors such as OpenAI who try to make the "best possible" generative AI systems. (The Efficient Market Hypothesis is an analogy: if an automated detection strategy works, they can specifically target that strategy and render it unusable.) Manual detection works because human readers can comprehend the meaning (or lack thereof) of the code, in ways that computer systems currently can't. Therefore, don't be too disappointed if you're barely hitting more than 50% accuracy in your evaluation: a statistically-significant 55% or 60% would be very good for the simplistic approach I've described.
If you have time, I might also recommend using a pre-trained transformer or a spaCy core model to process the contents of strings and comments, and giving this as an additional input to your TextCategorizer. However, I don't know how you'd begin doing that: unless you find a clear explanation for how to do this in the spaCy documentation, you should only begin this once you've thoroughly completed the simpler version.
If we're being particularly pedantic, you should train a DependencyParser capable of identifying the referent (if any) of comments, and combining multi-line comments where warranted. However, unless there's a clever trick I'm missing, labelling the data would be a very large amount of work indeed.
-
$\begingroup$ I’m using a non-NN method with high accuracy but limited upgrade scope. It uses Jaccard + cosine similarity and n-grams. I found key patterns in 95%+ of files, and compare an input's deviation from those and its closeness to AI. If close to AI, it's cheating; else, legit. But my main goal was to create an improvable model, which is a NN, so I read your comment, and there's one thing I didn't understand: by "training your model", what do you mean - a classic LSTM, CNN, or RNN? $\endgroup$ vinod pandey – Jun 8, 2025 at 14:43
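(For reference, the Jaccard-over-n-grams part of a method like the one this comment describes can be sketched as follows; the function names and the choice of character trigrams are illustrative, not the commenter's actual code.)

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a source file."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of two files' character n-gram sets:
    |intersection| / |union|, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty files are identical by convention
    return len(ga & gb) / len(ga | gb)

print(jaccard("x = 1 + 2", "x = 1 + 3"))  # high: near-duplicate code
print(jaccard("x = 1 + 2", "def f(): pass"))  # low: unrelated code
```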
-
$\begingroup$ @vinodpandey Cosine similarity of what embedding? $\endgroup$ wizzwizz4 – Jun 8, 2025 at 14:50
-
$\begingroup$ Also: if you've got a non-NN method with high accuracy, then that's amazing and (if it continues to work on new datapoints) definitely publishable: I've never heard of such a thing. It doesn't matter that it has "limited upgrade scope": BRAIN Co.'s BakeryScan was much the same, and yet they turned it into one of the best cancer-detection systems out there. $\endgroup$ wizzwizz4 – Jun 8, 2025 at 14:53
-
$\begingroup$ By "train a model", I mean asking spaCy to do the magic. It's not actually one model, but a pipeline made up of multiple models: the details are described in spaCy's documentation, if you dig deep enough. (I recommend looking at an example configuration file, perhaps for the English core models.) $\endgroup$ wizzwizz4 – Jun 8, 2025 at 14:56
-
$\begingroup$ Alright, so my non-NN model is ready with 95% accuracy. Now I want to do something for the SO community with my AI - what can I do with it? On Codeforces I have already pinged Mike, the owner of Codeforces, and he agreed to use it after a lot of testing, and he took my model code from me. $\endgroup$ vinod pandey – Jun 8, 2025 at 17:11