
I’m building a plagiarism detector to identify AI-generated code on platforms like Codeforces. I’ve scraped 1,193 human-written and AI-generated code samples (Python, C++, Java) for the same problems, and my goal is to train a neural network to distinguish them. I’ve tokenized the code using Python’s tokenize module, but I’m unsure how to handle multiple languages or how to convert the tokens into features such as ASTs or embeddings. What’s the best way to preprocess these samples for a binary classification model? I’m using Python. Any advice on feature extraction, tools like tree-sitter, or which neural-network architecture would work best would help!
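Roughly, my tokenization step looks like this (simplified; the snippet being tokenized is a made-up example):

```python
import io
import tokenize

code = "def add(a, b):\n    return a + b\n"

# Turn Python source into (token type name, token string) pairs,
# which serve as the raw sequence for later feature extraction.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(code).readline)
]
print(tokens[:4])  # → [('NAME', 'def'), ('NAME', 'add'), ('OP', '('), ('NAME', 'a')]
```

This only works for Python source, which is part of my problem: C++ and Java samples would need a different tokenizer.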

  • What you are trying to do is what LLMs are for. Kindly get a basic understanding of what an ANN can and can’t do. These problems require you to train on sequences of tokens, for which you use Transformers and RNNs. Kindly go through this course to tackle the problem - onlinecourses.nptel.ac.in/noc25_cs45/announcements?force=true Commented Jun 7 at 12:53
  • Yeah, but this can’t provide the accuracy I want, and when it comes to detection it is crucial to focus on accuracy. Commented Jun 7 at 15:19

1 Answer


You are unlikely to get good results in such a short period of time; however, this problem can be viewed as an NLP task, for which you can use an NLP package like spaCy. One mostly off-the-shelf approach would be:

  • Post-process the output of Python's ast module to create spaCy Doc objects. (You may find the asttokens module helpful here.)
  • Add a label distinguishing "AI-generated span" and "manually-written span" to .cats of each Doc.
  • Put the Docs in two DocBins, ensuring that you've got roughly 50% manually-written and 50% AI-generated in each. (Aim for exactly 50%: that's easier.) Thus produce two serialised .spacy files: training_data.spacy and evaluation_data.spacy, which should not share any Docs. (Double-check this by loading them!)
  • Devise a pipeline. I'd recommend turning the tokenize module into a custom spaCy Tokenizer, then training a Tagger, DependencyParser, EntityRecognizer, and finally a TextCategorizer to make the verdict.
  • Describe the pipeline in a training config file.
  • Train a model on your training data, using the config file.
  • Evaluate your model on your evaluation data.
  • If it looks like it's working, combine all your data together into all_data.spacy and produce a final "best" model. (You won't be able to evaluate this model properly without new test-cases.)
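A minimal sketch of the data-preparation steps above (the two samples and the category names are placeholders, and I've used a blank pipeline where your custom code tokenizer would go):

```python
import spacy
from spacy.tokens import DocBin

# Blank pipeline as a stand-in for a pipeline with a custom code tokenizer.
nlp = spacy.blank("en")

# Hypothetical labelled samples: (source code text, is_ai_generated).
samples = [
    ("def add(a, b):\n    return a + b", True),
    ("x = input()\nprint(int(x) * 2)", False),
]

db = DocBin()
for text, is_ai in samples:
    doc = nlp.make_doc(text)
    # Mutually exclusive categories for the TextCategorizer.
    doc.cats = {"AI_GENERATED": float(is_ai), "HUMAN_WRITTEN": float(not is_ai)}
    db.add(doc)

# Serialise for use with `spacy train`.
db.to_disk("training_data.spacy")
```

You'd build `evaluation_data.spacy` the same way from a disjoint set of samples, then check both files by loading them back with `DocBin().from_disk(...)`.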

I would not expect this to give good results: automatic detection of AI-generated code is very hard, in a world with bad actors such as OpenAI who try to make the "best possible" generative AI systems. (The Efficient Market Hypothesis is an analogy: if an automated detection strategy works, they can specifically target that strategy and render it unusable.) Manual detection works because human readers can comprehend the meaning (or lack thereof) of the code, in ways that computer systems currently can't. Therefore, don't be too disappointed if you're barely hitting more than 50% accuracy in your evaluation: a statistically-significant 55% or 60% would be very good for the simplistic approach I've described.


If you have time, I might also recommend using a pre-trained transformer or a spaCy core model to process the contents of strings and comments, and giving this as an additional input to your TextCategorizer. However, I don't know how you'd begin doing that: unless you find a clear explanation for how to do this in the spaCy documentation, you should only begin this once you've thoroughly completed the simpler version.
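As a starting point, the strings and comments can be pulled straight out of the token stream (a rough sketch for Python sources only; wiring the result into the TextCategorizer is the part I can't spell out):

```python
import io
import tokenize

def strings_and_comments(code):
    """Pull out string literals and comments — the natural-language
    parts of a program that a pre-trained English model could process."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type in (tokenize.STRING, tokenize.COMMENT):
            out.append(tok.string)
    return out

print(strings_and_comments('# compute the sum\nmsg = "hello"\n'))
```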

If we're being particularly pedantic, you should train a DependencyParser capable of identifying the referent (if any) of comments, and combining multi-line comments where warranted. However, unless there's a clever trick I'm missing, labelling the data would be a very large amount of work indeed.

  • I’m using a non-NN method with high accuracy but limited upgrade scope. It uses Jaccard and cosine similarity over n-grams: I found key patterns in 95%+ of the files, and I compare an input’s deviation from those patterns and its closeness to the AI ones. If it’s close to the AI patterns, it’s cheating; otherwise, it’s legit. But my main goal was to create an improvable model, which means an NN, so I read your comment, and there’s one thing I didn’t understand: by “training your model”, do you mean a classic LSTM, CNN, or RNN? Commented Jun 8 at 14:43
  • $\begingroup$ @vinodpandey Cosine similarity of what embedding? $\endgroup$ Commented Jun 8 at 14:50
  • $\begingroup$ Also: if you've got a non-NN method with high accuracy, then that's amazing and (if it continues to work on new datapoints) definitely publishable: I've never heard of such a thing. It doesn't matter that it has "limited upgrade scope": BRAIN Co.'s BakeryScan was much the same, and yet they turned it into one of the best cancer-detection systems out there. $\endgroup$ Commented Jun 8 at 14:53
  • $\begingroup$ By "train a model", I mean asking spaCy to do the magic. It's not actually one model, but a pipeline made up of multiple models: the details are described in spaCy's documentation, if you dig deep enough. (I recommend looking at an example configuration file, perhaps for the English core models.) $\endgroup$ Commented Jun 8 at 14:56
  • Alright, so my non-NN model is ready with 95% accuracy. Now I want to do something for the SO community with my AI: what can I do using it? On Codeforces I have already pinged MIKE, the owner of Codeforces, and he agreed to use it after a lot of testing, and he took my model code from me. Commented Jun 8 at 17:11
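For readers following the comment thread: a Jaccard-over-n-grams similarity of the kind described above can be sketched in a few lines (character trigrams are one common, language-agnostic choice; the two snippets compared are made up):

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

s1 = char_ngrams("for i in range(n):")
s2 = char_ngrams("for j in range(m):")
print(round(jaccard(s1, s2), 3))  # → 0.455
```

Identical strings score 1.0 and disjoint strings score 0.0, so a threshold on this score is what a "closeness to AI patterns" rule would amount to.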
