In this assignment you will implement a C++ program that trains a POS tagger with a simple feedforward neural network. Since this is not a programming course, the assignment focuses on AI research methodology rather than on advanced programming skills. A library for training feedforward neural networks is therefore provided to make life easier, so that you can focus on the fun part of AI research.
You need to complete this assignment on a CSE Lab2 machine. You don't need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can log into remotely with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory, which is shared among all the CSE Lab2 machines. Your home directory has a storage limit of 100MB, which is enough for this assignment.
For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html
In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.
Please download the starting pack (version 3) and unzip it to your home directory on a CSE Lab2 machine. This starting pack contains the skeleton code, the neural network library and the dataset. The starting pack has the following structure:
assignment1/
├── include/
│   └── token_feedforward_classifier.hpp
├── lib/
│   └── libtoken_feedforward_classifier.a
└── src/
    ├── report.txt (your report goes here)
    ├── assignment.cpp (your code goes here)
    ├── traindata.xml
    ├── devdata.xml
    ├── assignment.hpp
    ├── main.cpp
    ├── obj/
    └── makefile
The only two files you need to touch are report.txt and assignment.cpp. Please do not touch the other files.
After you unzip the starting pack, go into its src directory and run make. This is what you will get:
csl2wk14:yyanaa:106> make
g++8 -c -o obj/assignment.o assignment.cpp -std=c++17 -I../include -L../lib -ltoken_feedforward_classifier
assignment.cpp: In function ‘std::vector<std::__cxx11::basic_string<char> > test(const std::vector<std::__cxx11::basic_string<char> >&)’:
assignment.cpp:45:1: warning: no return statement in function returning non-void [-Wreturn-type]
 }
 ^
g++8 -o main -std=c++17 -I../include -L../lib main.cpp obj/assignment.o -ltoken_feedforward_classifier
As you can see, the compiler warns that one function does not have a return statement. This is expected, because implementing that function is part of your assignment. The compiler still generates a binary called main in the src directory.
csl2wk14:yyanaa:107> ls
assignment.cpp assignment.hpp devdata.xml main main.cpp makefile obj traindata.xml
Running this binary will throw an error for now, but its existence indicates that your development environment is good to go.
You are provided with a traindata.xml and a devdata.xml. They represent the training set and the development test set, respectively.
Both training set and development test set are in XML format, in which:
An example of such an xml file is provided here:
<?xml version="1.0"?>
<dataset>
  <sent>
    <nonterminal value="TOP">
      <nonterminal value="NN">time</nonterminal>
      <nonterminal value="VB">flies</nonterminal>
      <nonterminal value="PP">like</nonterminal>
      <nonterminal value="DT">an</nonterminal>
      <nonterminal value="NN">arrow</nonterminal>
    </nonterminal>
  </sent>
  <sent>
    <nonterminal value="TOP">
      <nonterminal value="NN">fruit</nonterminal>
      <nonterminal value="NN">flies</nonterminal>
      <nonterminal value="VB">like</nonterminal>
      <nonterminal value="DT">an</nonterminal>
      <nonterminal value="NN">apple</nonterminal>
    </nonterminal>
  </sent>
</dataset>
You don't need to worry about how to read these XML files into C++ data structures. This dirty work is handled by our library. However, you need to observe and think about the data. These observations are part of the assignment.
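One way to start observing the data is to check which tokens appear with more than one tag. The sketch below is only an illustration: the helper names tags_per_token and ambiguous_tokens are hypothetical and not part of the provided library, and it assumes the tokens and their tags arrive as two parallel lists, as they do in the train() function of assignment.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of the provided library).
// For each token, collect the set of distinct tags it appears with.
// `tokens` and `oracles` are parallel lists of equal length.
std::map<std::string, std::set<std::string>> tags_per_token(
    const std::vector<std::string>& tokens,
    const std::vector<std::string>& oracles) {
  assert(tokens.size() == oracles.size());
  std::map<std::string, std::set<std::string>> result;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    result[tokens[i]].insert(oracles[i]);
  }
  return result;
}

// Hypothetical helper: tokens that occur with more than one tag,
// returned in alphabetical order (the iteration order of std::map).
std::vector<std::string> ambiguous_tokens(
    const std::map<std::string, std::set<std::string>>& tags) {
  std::vector<std::string> result;
  for (const auto& entry : tags) {
    if (entry.second.size() > 1) {
      result.push_back(entry.first);
    }
  }
  return result;
}
```

On the two example sentences above, "flies" and "like" each occur with two different tags, so both would be reported as ambiguous. If you call helpers like these while exploring, remember to remove any printing before you submit.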
You will be using a simple token-level feedforward neural network classifier to perform POS tagging. The model is token-level, which means that each token is classified separately, totally unaware of its context. In other words, when the model predicts a token's POS tag, it is just focusing on the token itself, without paying attention to the rest of the sentence.
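To make "token-level" concrete, here is a minimal sketch of a context-unaware tagger. It is deliberately NOT the provided neural network: it swaps in a simple most-frequent-tag lookup, and the class name most_frequent_tag_baseline is hypothetical. What it shares with the provided model is the shape of the prediction: the output depends on the token alone.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical baseline, NOT the provided neural network:
// remembers how often each tag was seen for each token.
class most_frequent_tag_baseline {
  std::map<std::string, std::map<std::string, unsigned>> counts_;

public:
  // Count (token, tag) pairs from two parallel lists.
  void train(const std::vector<std::string>& tokens,
             const std::vector<std::string>& oracles) {
    assert(tokens.size() == oracles.size());
    for (std::size_t i = 0; i < tokens.size(); ++i) {
      ++counts_[tokens[i]][oracles[i]];
    }
  }

  // Predict from the token alone; the rest of the sentence is never seen.
  // Ties go to the alphabetically first tag; unseen tokens get "".
  std::string predict(const std::string& token) const {
    auto it = counts_.find(token);
    if (it == counts_.end()) {
      return "";
    }
    std::string best;
    unsigned best_count = 0;
    for (const auto& tag_count : it->second) {
      if (tag_count.second > best_count) {
        best = tag_count.first;
        best_count = tag_count.second;
      }
    }
    return best;
  }
};
```

Because predict() receives only the token, "flies" gets the same tag in "time flies like an arrow" and in "fruit flies like an apple", so at most one of those two occurrences can be tagged correctly. Any token-level model has this limitation, which is worth keeping in mind when you interpret your results.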
To keep things simple for now, we have already implemented the hard part of the neural network model in the provided token_feedforward_classifier.hpp. You just need to feed it the training data. Before you write any code, first look at what is provided in token_feedforward_classifier.hpp:
//
// Created by Dekai WU and YAN Yuchen on 20190304.
//

#ifndef _TOKEN_FEEDFORWARD_CLASSIFIER_HPP_
#define _TOKEN_FEEDFORWARD_CLASSIFIER_HPP_

#include <unordered_set>
#include <vector>
#include <string>

/**
 * a model that takes a token and predicts a label
 */
class token_feedfoward_classifier {
  std::string id_m;
public:
  token_feedfoward_classifier() = delete;
  token_feedfoward_classifier(const token_feedfoward_classifier&) = delete;
  token_feedfoward_classifier(token_feedfoward_classifier&&) noexcept = default;
  token_feedfoward_classifier &operator=(const token_feedfoward_classifier&) = delete;
  token_feedfoward_classifier &operator=(token_feedfoward_classifier&&) noexcept = default;

  /**
   * constructs the model
   * \param vocab the set of all possible tokens
   * \param embedding_size word embedding size
   * \param num_hidden_layers number of hidden layers in between
   * \param labels the set of all possible labels
   */
  token_feedfoward_classifier(const std::vector<std::string> &vocab, unsigned embedding_size, unsigned num_hidden_layers, const std::vector<std::string> &labels);

  /**
   * given a training set, train the model
   * \param training_set the set of all training tokens
   * \param training_oracles the desired label of the training tokens
   * \param num_epochs number of iterations to train on the training set
   * \return an aggregated loss for each epoch
   */
  std::vector<float> train(const std::vector<std::string> &training_set, const std::vector<std::string> &training_oracles, unsigned num_epochs);

  /**
   * given a test set, and predict their labels
   * \param test_set
   * \return the predicted label
   */
  std::vector<std::string> test(const std::vector<std::string> &test_set) const;
};

#endif
This model is pretty self-explanatory and straightforward to use. There are a few things to note:
After reading through token_feedforward_classifier.hpp, you can start implementing the training workflow. The training workflow has 3 steps: initialization, training and testing. You need to implement them in assignment.cpp.
You need to put your student ID in the global variable STUDENT_ID.
// TODO: put your student ID in this variable
const char* STUDENT_ID = "20026012";
In the initialization function, create a token_feedforward_classifier object with appropriate hyperparameters. Note that you don't need to parse the training set XML file yourself. The vocab and labels are already extracted and passed to your init() function as parameters.
/**
 * initialize your model. this function will be called before the "train" function
 * \param vocab the vocabulary, represented as a list of unique strings
 * \param labels the set of all possible labels
 */
void init(const std::vector<std::string>& vocab, const std::vector<std::string>& labels) {
  // TODO: do whatever necessary to initialize your model
  // hint: you need to choose an appropriate embedding size and number of hidden layers
}
In the training function, use your token_feedforward_classifier object to train on the given training set. The training set and training oracles have been extracted for you and are passed as parameters.
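One practical detail of init(): the provided class has a deleted default constructor and cannot be copied, so a global model object cannot simply be declared and filled in later. A minimal sketch of one way around this, using std::unique_ptr, is shown below. To keep the sketch compilable without the library, it uses a hypothetical stand_in_classifier with the same constructor parameters as the class in the header. In assignment.cpp you would include the provided header and use its class instead (note that the header spells the class name token_feedfoward_classifier). The hyperparameter values are placeholders for illustration only, not recommendations.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in so this sketch compiles on its own. It mirrors only
// the constructor parameters of the provided class and does no learning.
struct stand_in_classifier {
  std::size_t vocab_size;
  unsigned embedding;
  unsigned hidden_layers;
  std::size_t num_labels;

  stand_in_classifier() = delete;
  stand_in_classifier(const std::vector<std::string>& vocab,
                      unsigned embedding_size,
                      unsigned num_hidden_layers,
                      const std::vector<std::string>& labels)
      : vocab_size(vocab.size()),
        embedding(embedding_size),
        hidden_layers(num_hidden_layers),
        num_labels(labels.size()) {}
};

// Starts out empty; init() constructs the model once vocab and labels
// are known.
std::unique_ptr<stand_in_classifier> model;

void init(const std::vector<std::string>& vocab,
          const std::vector<std::string>& labels) {
  assert(!vocab.empty() && !labels.empty());
  // Placeholder hyperparameters (assumptions): choosing good values is
  // part of the assignment.
  const unsigned embedding_size = 32;
  const unsigned num_hidden_layers = 1;
  model = std::make_unique<stand_in_classifier>(
      vocab, embedding_size, num_hidden_layers, labels);
}
```

Since the makefile compiles with -std=c++17, a std::optional filled with emplace() would be another way to hold the model. Either way, train() and test() can then reach the model through the global.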
/**
 * train your model with a training set
 * \param tokens the list of all training tokens
 * \param oracles the list of the desired label for the training tokens
 */
void train(const std::vector<std::string> &tokens, const std::vector<std::string> &oracles) {
  // TODO: complete this training function
  // hint: you need to choose an appropriate number of epochs
}
In the testing function, use your token_feedforward_classifier object to predict POS tags on the given tokens. The development test set has been extracted for you and is passed as a parameter.
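According to the header, the model's train() returns an aggregated loss for each epoch, and that list can help you pick the number of epochs. The sketch below is one possible heuristic, not a prescribed method: the function name epochs_until_plateau and the min_improvement threshold are assumptions introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the per-epoch losses, return how many epochs
// ran before the loss first improved by less than `min_improvement` over
// the previous epoch. Returns losses.size() if that never happens.
std::size_t epochs_until_plateau(const std::vector<float>& losses,
                                 float min_improvement) {
  assert(min_improvement >= 0.0f);
  for (std::size_t i = 1; i < losses.size(); ++i) {
    if (losses[i - 1] - losses[i] < min_improvement) {
      return i;
    }
  }
  return losses.size();
}
```

A result equal to losses.size() suggests the loss was still dropping when training stopped. Keep in mind that this only looks at the training loss: a lower training loss does not guarantee better tags on the development test set, so treat the number as a starting point for your experiments.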
/**
 * use your model to predict POS tag
 * \param tokens the list of tokens to perform POS tag
 * \return the list of predicted POS tags
 */
std::vector<std::string> test(const std::vector<std::string> &tokens) {
  // TODO: complete this testing function
}
Great, you have constructed the model. Now you can compile and run it to see how it performs. If you are curious, feel free to print out the loss values in your training function, or the predicted POS tags in your testing function. However, if you do so, please remember to remove those print statements from your final submission. Your functions shouldn't produce any additional console output.
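If you want a single number for "how it performs", token-level accuracy is a natural choice for this kind of tagger. How the provided main.cpp scores your predictions is not described here, so the sketch below is an assumption rather than the official metric, and the function name accuracy is hypothetical. It expects the predicted tags and the reference tags as two parallel lists.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fraction of positions where the predicted tag equals
// the reference tag. Returns 0.0 for empty input to avoid dividing by zero.
double accuracy(const std::vector<std::string>& predicted,
                const std::vector<std::string>& oracles) {
  assert(predicted.size() == oracles.size());
  if (predicted.empty()) {
    return 0.0;
  }
  std::size_t correct = 0;
  for (std::size_t i = 0; i < predicted.size(); ++i) {
    if (predicted[i] == oracles[i]) {
      ++correct;
    }
  }
  return static_cast<double>(correct) / static_cast<double>(predicted.size());
}
```

The function itself prints nothing, so it does not add console output on its own; any printing you wrap around it while experimenting should still be removed before submission.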
You need to submit a zip archive named assignment1.zip via CASS. The zip archive should ONLY contain the following two files: report.txt and assignment.cpp.
Due date: 2019 Mar 15 23:00:00