Welcome to AI Blitz XII! | Starter Kit For This Challenge!
Introduction
Open-source project repositories need thorough documentation. However, with the large number of programming languages that organizations use today, simply working out which language a given part of the codebase is written in can take a tedious amount of time.
Can NLP help solve this problem by classifying programming languages? The first AI Blitz puzzle sets out to answer that question.
You are presented with a corpus containing over 45628 lines of code, written in 15 different programming languages. Your model should distinguish between these languages as accurately as possible.
Check out the Starter Code for context on the problem statement and clear steps for solving it.
Getting Started
This puzzle is a classification problem similar to the Emotion Detection problem from AI Blitz 9, which aims to teach a computer to distinguish between positive and negative emotions. Can you use the resources and tools from that problem to come up with a unique solution for this puzzle?
Here's how you can classify the corpus into various programming languages. Our Starter Kit comes with an implementation of a Multinomial Naive Bayes classifier paired with Count Vectorizer and TF-IDF Transformer. Multinomial Naive Bayes is a popular probabilistic learning method used mostly in NLP. Derived from the classic Naive Bayes algorithm, it calculates the probability of each tag (here, the language) for a given sample and outputs the tag with the highest probability.
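To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes over raw token counts. It is not the Starter Kit's code: it skips the TF-IDF step, and the tokenizer, the class name TinyMultinomialNB, the smoothing value, and the toy training snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(code):
    # Hypothetical tokenizer: identifiers, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code)


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)       # documents per language
        self.token_counts = defaultdict(Counter)  # token frequencies per language
        self.vocab = set()
        for code, label in zip(snippets, labels):
            tokens = tokenize(code)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, code):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(code) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / n_docs)  # log prior
            for tok in tokens:                    # add log likelihoods
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, code):
        # Return the language with the highest (log) probability.
        scores = self.log_scores(code)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
train_snippets = [
    "def add(a, b):\n    return a + b",
    "import os\nprint(os.getcwd())",
    "#include <stdio.h>\nint main() { return 0; }",
    'int x = 0; printf("%d", x);',
]
train_labels = ["python", "python", "c", "c"]

model = TinyMultinomialNB().fit(train_snippets, train_labels)
print(model.predict("print(len(x))"))
print(model.predict('int main() { printf("hi"); }'))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. For real experiments, a library implementation such as scikit-learn's MultinomialNB is likely a better starting point than this sketch.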
Dataset
The dataset contains code snippets written by developers around the world in different programming languages, together with the language each snippet is written in. There are snippets from a total of 15 programming languages. The columns present in the dataset are:
- id: unique identifier of the sample
- Code: the code snippet
- Language (target): the programming language the code snippet is written in
Files
The following files are available in the resources section:
- train.csv (45628 samples): This CSV file contains all three columns: id, code, and language.
- test.csv (4277 samples): This CSV file contains two columns, sample_id and code. You need to predict the language that the code corresponds to.
- sample_submission.csv: It contains random labels for the data in test.csv in the desired submission format.
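As a rough sketch of how the training file could be read, the snippet below parses a small in-memory stand-in for train.csv using Python's standard csv module. The exact header spelling (id, code, language) and the three sample rows are assumptions for illustration; check the real file's header before relying on these names.

```python
import csv
import io
from collections import Counter

# Stand-in for train.csv; the rows are invented and the header spelling is assumed.
SAMPLE_TRAIN_CSV = '''id,code,language
0,"print(""hello"")",python
1,"int main() { return 0; }",c
2,"console.log(1);",javascript
'''


def load_rows(handle):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_TRAIN_CSV))
codes = [row["code"] for row in rows]
labels = [row["language"] for row in rows]
label_counts = Counter(labels)  # class distribution, useful for spotting imbalance

print(len(rows), dict(label_counts))
```

With the real file, the same function can be called on `open("train.csv", newline="", encoding="utf-8")`; passing `newline=""` lets the csv module handle code snippets that contain line breaks inside quoted fields.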
Submission
- Create a submission directory.
- Use sample_submission.csv to create your submission. The column headers should be "id" and "prediction".
- Save the CSV in the submission directory. The file should be named submission.csv.
- Inside the submission directory, put the .ipynb notebook that you used to train the model and run inference, saved as original_notebook.ipynb.
Overall, your submission directory should contain submission.csv and original_notebook.ipynb.
- Zip the submission directory!
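The steps above can be sketched with the Python standard library. The helper name build_submission, the directory name "submission", the placeholder notebook, and the choice to place both files at the root of the zip are assumptions; the platform's exact expectations may differ.

```python
import csv
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_submission(ids, predictions, notebook_path, work_dir):
    """Write submission.csv, copy the notebook next to it, and zip the directory."""
    sub_dir = Path(work_dir) / "submission"  # directory name is an assumption
    sub_dir.mkdir(parents=True, exist_ok=True)

    with open(sub_dir / "submission.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction"])  # required column headers
        writer.writerows(zip(ids, predictions))

    shutil.copy(notebook_path, sub_dir / "original_notebook.ipynb")

    # make_archive appends ".zip" to the base name and returns the archive path.
    return shutil.make_archive(str(Path(work_dir) / "submission"), "zip", root_dir=sub_dir)


# Demo in a temporary directory with a placeholder notebook and made-up predictions.
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "my_notebook.ipynb"
    notebook.write_text("{}", encoding="utf-8")
    archive = build_submission([0, 1], ["python", "c"], notebook, tmp)
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
        csv_text = zf.read("submission.csv").decode("utf-8")

print(names)
```

In practice, `ids` would come from the id column of test.csv and `predictions` from your trained model.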
Make your first submission here!
Evaluation Criteria
During evaluation, the F1 score (weighted average) and the accuracy score will be used to measure the performance of the model.
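For a local sanity check, both metrics can be computed by hand. The sketch below assumes the standard definitions: accuracy is the fraction of correct predictions, and weighted F1 is the per-class F1 averaged with weights equal to each class's number of true samples. The function names and the example labels are made up, and the official scorer may differ in details.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += n_true * f1
    return total / len(y_true)


# Made-up labels for illustration.
y_true = ["python", "python", "python", "c"]
y_pred = ["python", "python", "c", "c"]

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(acc, wf1)
```

Because the weights follow class frequency, frequent languages influence the weighted F1 more than rare ones, so it can differ from accuracy when errors are spread unevenly across classes.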
Contact
- Aditya Jha
- Shubhamai