COMP4221 Assignment 1

Yuchen Yan

Release date: 2019 Mar 08

Objective

In this assignment you will implement a C++ program that trains a POS tagger with a simple feedforward neural network. Since this is not a programming course, the assignment focuses on AI research methodology rather than on crazy programming skills. A library for training feedforward neural networks has therefore been provided to make life easier, so that you can focus on enjoying the fun part of AI research.

Development Environment

You need to complete this assignment on a CSE Lab2 machine. You don't need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can log into remotely with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory, which is shared among all the CSE Lab2 machines. Your home directory has a storage limit of 100MB, which is enough for this assignment.

For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html

In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.

Assignment Starting Pack

Please download the starting pack (version 3), and unzip it to your home directory on a CSE Lab2 machine. This starting pack contains the skeleton code, the neural network library and the dataset. The starting pack has the following structure:

assignment1/
├── include/
│   └── token_feedforward_classifier.hpp
├── lib/
│   └── libtoken_feedforward_classifier.a
└── src/
    ├── report.txt     (your report goes here)
    ├── assignment.cpp (your code goes here)
    ├── traindata.xml
    ├── devdata.xml
    ├── assignment.hpp
    ├── main.cpp
    ├── obj/
    └── makefile

The only two files you need to touch are report.txt and assignment.cpp. Please do not touch the other files.

Verify your development environment

After you unzip the starting pack, go into its src directory and run make. This is what you will get:

csl2wk14:yyanaa:106> make
g++8 -c -o obj/assignment.o assignment.cpp -std=c++17 -I../include -L../lib -ltoken_feedforward_classifier
assignment.cpp: In function ‘std::vector<std::__cxx11::basic_string<char> > test(const std::vector<std::__cxx11::basic_string<char> >&)’:
assignment.cpp:45:1: warning: no return statement in function returning non-void [-Wreturn-type]
 }
 ^
g++8 -o main -std=c++17 -I../include -L../lib main.cpp obj/assignment.o -ltoken_feedforward_classifier

As you can see, the compiler warns that a function returning non-void has no return statement. This is expected, because implementing that function is your assignment. The compiler still generates a binary called main in this src directory.

csl2wk14:yyanaa:107> ls
assignment.cpp  assignment.hpp  devdata.xml  main  main.cpp  makefile  obj  traindata.xml

Running this binary will throw an error for now, but its existence indicates that your development environment is good to go.

Training and Testing Dataset

You are provided with a traindata.xml and a devdata.xml. They represent the training set and the development test set respectively.

Both training set and development test set are in XML format, in which:

each sent element contains a sentence
each nonterminal element on the leaf contains a token
the value attribute of a nonterminal element specifies the POS tag

An example of such an xml file is provided here:

<?xml version="1.0"?>
<dataset>
    <sent>
        <nonterminal value="TOP">
            <nonterminal value="NN">time</nonterminal>
            <nonterminal value="VB">flies</nonterminal>
            <nonterminal value="PP">like</nonterminal>
            <nonterminal value="DT">an</nonterminal>
            <nonterminal value="NN">arrow</nonterminal>
        </nonterminal>
    </sent>
    <sent>
        <nonterminal value="TOP">
            <nonterminal value="NN">fruit</nonterminal>
            <nonterminal value="NN">flies</nonterminal>
            <nonterminal value="VB">like</nonterminal>
            <nonterminal value="DT">an</nonterminal>
            <nonterminal value="NN">apple</nonterminal>
        </nonterminal>
    </sent>
</dataset>

You don't need to worry about how to read these XML files into C++ data structures. This dirty part is handled by our library. However, you need to observe and think about the data. These observations are part of the assignment.
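One possible way to start observing the data is to check how many tokens appear with more than one POS tag. The sketch below is only an illustration: tags_per_token and count_ambiguous_tokens are hypothetical helpers that are not part of the starting pack, and the sketch assumes the tokens and their tags are available as two parallel lists of strings.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper, not part of the starting pack.
// Maps each distinct token to the set of POS tags it appears with.
std::map<std::string, std::set<std::string>> tags_per_token(
    const std::vector<std::string>& tokens,
    const std::vector<std::string>& tags) {
  assert(tokens.size() == tags.size());  // the two lists are parallel
  std::map<std::string, std::set<std::string>> seen;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    seen[tokens[i]].insert(tags[i]);
  }
  return seen;
}

// Counts the distinct tokens that appear with more than one tag.
std::size_t count_ambiguous_tokens(
    const std::vector<std::string>& tokens,
    const std::vector<std::string>& tags) {
  std::size_t ambiguous = 0;
  for (const auto& entry : tags_per_token(tokens, tags)) {
    if (entry.second.size() > 1) ++ambiguous;
  }
  return ambiguous;
}
```

In the two example sentences above, "flies" (VB, NN) and "like" (PP, VB) are the tokens that carry more than one tag.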

Step by Step Implementation Guide

You will be using a simple token-level feedforward neural network classifier to perform POS tagging. The model is token-level, which means that each token is classified separately, totally unaware of its context. In other words, when the model predicts a token's POS tag, it is just focusing on the token itself, without paying attention to the rest of the sentence.
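Because a token-level model sees only the token, it has to give the same token the same tag every time. The following sketch, with a hypothetical helper name and the same parallel-list assumption for tokens and tags, computes the best accuracy that any such context-free mapping could reach on a given labelled dataset. It is not how the provided library works; it only illustrates the limit that ambiguity imposes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not part of the starting pack.
// For each distinct token, a context-free tagger can be right at most as
// often as that token's most frequent tag occurs. Summing those counts and
// dividing by the number of tokens gives the accuracy ceiling on this data.
double context_free_accuracy_ceiling(
    const std::vector<std::string>& tokens,
    const std::vector<std::string>& tags) {
  assert(tokens.size() == tags.size());  // the two lists are parallel
  if (tokens.empty()) return 0.0;

  // counts[token][tag] = number of times the token carries that tag
  std::map<std::string, std::map<std::string, std::size_t>> counts;
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    ++counts[tokens[i]][tags[i]];
  }

  std::size_t best_total = 0;
  for (const auto& token_entry : counts) {
    std::size_t best = 0;
    for (const auto& tag_entry : token_entry.second) {
      best = std::max(best, tag_entry.second);
    }
    best_total += best;
  }
  return static_cast<double>(best_total) /
         static_cast<double>(tokens.size());
}
```

On the ten tokens of the two example sentences shown earlier, the ceiling is 0.8: "flies" and "like" each occur twice with two different tags, so one occurrence of each must be tagged wrongly.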

To keep things simple for now, we have already implemented the hard part of the neural net model in our provided token_feedforward_classifier.hpp. You just need to feed it with the training data. Before you write any code, first look at what is provided in token_feedforward_classifier.hpp:

//
// Created by Dekai WU and YAN Yuchen on 20190304.
//

#ifndef _TOKEN_FEEDFORWARD_CLASSIFIER_HPP_
#define _TOKEN_FEEDFORWARD_CLASSIFIER_HPP_

#include <unordered_set>
#include <vector>
#include <string>

/**
 * a model that takes a token and predicts a label
 */
class token_feedfoward_classifier {
  std::string id_m;
public:
  token_feedfoward_classifier() = delete;
  token_feedfoward_classifier(const token_feedfoward_classifier&) = delete;
  token_feedfoward_classifier(token_feedfoward_classifier&&) noexcept = default;
  token_feedfoward_classifier &operator=(const token_feedfoward_classifier&) = delete;
  token_feedfoward_classifier &operator=(token_feedfoward_classifier&&) noexcept = default;
  /**
   * constructs the model
   * \param vocab the set of all possible tokens
   * \param embedding_size word embedding size
   * \param num_hidden_layers number of hidden layers in between
   * \param labels the set of all possible labels
   */
  token_feedfoward_classifier(const std::vector<std::string> &vocab,
                              unsigned embedding_size,
                              unsigned num_hidden_layers,
                              const std::vector<std::string> &labels);

  /**
   * given a training set, train the model
   * \param training_set the set of all training tokens
   * \param training_oracles the desired label of the training tokens
   * \param num_epochs number of iterations to train on the training set
   * \return an aggregated loss for each epoch
   */
  std::vector<float> train(const std::vector<std::string> &training_set,
                           const std::vector<std::string> &training_oracles,
                           unsigned num_epochs);

  /**
   * given a test set, and predict their labels
   * \param test_set
   * \return the predicted label
   */
  std::vector<std::string> test(const std::vector<std::string> &test_set) const;
};
#endif

This model is pretty self-explanatory and straightforward to use. There are a few things to note:

in the constructor:
- embedding_size controls how many dimensions the model uses to represent a token. A larger number gives the model more representational flexibility, but makes it slower to train. Modern commercial NLP systems usually use 256 or 512 dimensions. You may want to use a smaller number, because you definitely don't want to spend hours training this model.
- num_hidden_layers controls how many hidden layers there are in your model. A larger number gives the model more computational flexibility, but it takes more epochs to converge. For a simple task like POS tagging, you will need no more than 3 hidden layers.
in train()
- num_epochs controls how many epochs (a.k.a. iterations) to train. In this POS tagging task, you will need no more than 10 epochs.
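Since train() returns an aggregated loss for each epoch, those values can help you judge how many epochs are worthwhile. The sketch below shows one simple heuristic, not something the library provides: first_plateau_epoch is a hypothetical helper and min_improvement is a threshold you would have to choose yourself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the starting pack.
// Returns the index of the first epoch whose loss improves on the previous
// epoch by less than min_improvement. Returns losses.size() if every epoch
// still improves by at least that much.
std::size_t first_plateau_epoch(const std::vector<float>& losses,
                                float min_improvement) {
  assert(min_improvement >= 0.0f);
  for (std::size_t i = 1; i < losses.size(); ++i) {
    if (losses[i - 1] - losses[i] < min_improvement) {
      return i;
    }
  }
  return losses.size();
}
```

If the returned index is well below the num_epochs you passed, the later epochs probably add training time without much benefit.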

After reading through token_feedforward_classifier.hpp, you can start implementing the training workflow. The training workflow has 3 steps: initialization, training and testing. You need to implement them in assignment.cpp.

Step 0: Hardcode your student ID into a string literal

You need to put your student ID in the global variable STUDENT_ID.

// TODO: put your student ID in this variable
const char* STUDENT_ID = "20026012";

Step 1: Initialization function

In the initialization function, create a token_feedforward_classifier object (spelled token_feedfoward_classifier in the header shown above) with appropriate hyperparameters. Note that you don't need to parse the training set XML file yourself. The vocab and labels have already been extracted and are passed to your init() function as parameters.

/**
 * initialize your model. this function will be called before the "train" function
 * \param vocab the vocabulary, represented as a list of unique strings
 * \param labels the set of all possible labels
 */
void init(const std::vector<std::string>& vocab, const std::vector<std::string>& labels) {
  // TODO: do whatever necessary to initialize your model
  // hint: you need to choose an appropriate embedding size and number of hidden layers

}

Step 2: Training function

In the training function, use your token_feedforward_classifier object to train on the given training set. The training set and training oracles have been extracted for you and are passed as parameters.

/**
 * train your model with a training set
 * \param tokens the list of all training tokens
 * \param oracles the list of the desired label for the training tokens
 */
void train(const std::vector<std::string> &tokens, const std::vector<std::string> &oracles) {
  // TODO: complete this training function
  // hint: you need to choose an appropriate number of epochs

}

Step 3: Testing function

In the testing function, use your token_feedforward_classifier object to predict POS tags on the given tokens. The development test set has been extracted for you and is passed as a parameter.

/**
 * use your model to predict POS tag
 * \param tokens the list of tokens to perform POS tag
 * \return the list of predicted POS tags
 */
std::vector<std::string> test(const std::vector<std::string> &tokens) {
  // TODO: complete this testing function

}
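The three steps share one model object, but the classifier has no default constructor, so it cannot simply be declared as an empty global and filled in later. One possible pattern, sketched below, keeps it behind a std::unique_ptr that init() fills in. To stay compilable without the library, the sketch uses stand_in_classifier, a hypothetical dummy class that only mirrors the constructor, train() and test() signatures from the header; in assignment.cpp you would use the real class instead. The namespace, the variable name model and the hyperparameter values 64, 1 and 5 are arbitrary placeholders, not recommendations.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in that mirrors the library class's public interface.
// It does no learning: train() returns zero losses and test() always
// predicts the first label.
class stand_in_classifier {
 public:
  stand_in_classifier() = delete;
  stand_in_classifier(const std::vector<std::string>& /*vocab*/,
                      unsigned /*embedding_size*/,
                      unsigned /*num_hidden_layers*/,
                      const std::vector<std::string>& labels)
      : labels_(labels) {}

  std::vector<float> train(const std::vector<std::string>& /*training_set*/,
                           const std::vector<std::string>& /*oracles*/,
                           unsigned num_epochs) {
    return std::vector<float>(num_epochs, 0.0f);
  }

  std::vector<std::string> test(
      const std::vector<std::string>& test_set) const {
    const std::string guess = labels_.empty() ? "" : labels_.front();
    return std::vector<std::string>(test_set.size(), guess);
  }

 private:
  std::vector<std::string> labels_;
};

// Empty until init() runs; shared by train() and test().
static std::unique_ptr<stand_in_classifier> model;

void init(const std::vector<std::string>& vocab,
          const std::vector<std::string>& labels) {
  // 64 and 1 are placeholder hyperparameters.
  model = std::make_unique<stand_in_classifier>(vocab, 64, 1, labels);
}

void train(const std::vector<std::string>& tokens,
           const std::vector<std::string>& oracles) {
  assert(model);  // init() must have been called first
  model->train(tokens, oracles, 5);  // 5 is a placeholder epoch count
}

std::vector<std::string> test(const std::vector<std::string>& tokens) {
  assert(model);
  return model->test(tokens);
}

}  // namespace sketch
```

A std::optional would serve the same purpose as the std::unique_ptr here; either way the object is constructed only once the vocab and labels are known.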

Explore your model

Great, you have constructed the model. Now you can compile and run it to see how it performs. If you are curious, feel free to print out the loss values in your training function, or print out the predicted POS tags in your testing function. However, if you do so, please remember to remove those printing statements in your final submission. Your functions shouldn't produce any additional console output.
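If you want to measure performance yourself while exploring, tagging accuracy is simply the fraction of tokens whose predicted tag matches the desired one. The helper below is a hypothetical sketch that is not part of the starting pack. It assumes you have the predicted tags and the desired tags as two parallel lists, for example by running your model on the training tokens and comparing against the training oracles.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the starting pack.
// Returns the fraction of positions where the predicted tag equals the
// desired tag, or 0.0 for empty input.
double accuracy(const std::vector<std::string>& predicted,
                const std::vector<std::string>& desired) {
  assert(predicted.size() == desired.size());  // parallel lists
  if (desired.empty()) return 0.0;
  std::size_t correct = 0;
  for (std::size_t i = 0; i < desired.size(); ++i) {
    if (predicted[i] == desired[i]) ++correct;
  }
  return static_cast<double>(correct) /
         static_cast<double>(desired.size());
}
```

As with any other exploratory code, remove any printing of this value before you submit.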

Submission

You need to submit a zip archive named assignment1.zip via CASS. The zip archive should ONLY contain the following two files:

assignment.cpp
report.txt containing answers to the following questions:
- What do you think about the dataset?
- Please explain your observations with different embedding sizes and different numbers of hidden layers.
- How good do you think the token-based feedforward model is? Why?
- Please suggest an alternative model that solves the drawbacks you observe.

Due date: 2019 Mar 15 23:00:00

Grading Scheme

be able to compile a binary that:
- implements all the basic steps - 20 pts
- finishes execution in 60 seconds - 20 pts
- achieves a testing accuracy > 70% - 20 pts
report - 40 pts
- 10 pts for each question