The purpose of this assignment is to give you an intuition about HMMs by way of RNNs. You will therefore be developing an RNN, using the same building blocks provided in the previous assignment, that plays a role analogous to an HMM. Briefly, instead of traversing the states of the HMM trellis, you will be propagating through neural network layers. Since you are already familiar with feedforward neural networks, you are expected to build sequences of feedforward neural networks for the given training data. You will see why you need to build sequences of feedforward neural networks in the next chapters but, as a spoiler, RNNs are just feedforward neural networks in a loop, where each feedforward neural network has two inputs: the output of the previous network at time t-1 and the input element at time t.

The key problem you are going to encounter is dynamic input. You are provided with the same dataset as in your previous assignment, which has a dynamic range of sentence sizes. In order to handle a static input size of 4 (Part A), you need to design your RNN as the example code suggests, and you also need to modify the get_features function accordingly. An RNN with dynamic input (Part B) has no limitation on sentence size, so you need to figure out how to use your get_features function and how to recur the hidden layer as many times as the input size, so that your RNN can handle the variable-sized inputs in the dataset. As a final note, you are again obliged to report your steps and results using scientific methodology.
As you have learned in class, HMMs can be used as generative models where output sequences are randomly generated following the distribution described by the initial probabilities, transition probabilities, and emission probabilities. Recall that the probability distribution for the random variable q (state) describes how the probabilities over the values n (nodes) are distributed, and the probability distribution for the random variable e (emission) describes how the probabilities over the values v (vocabulary) are distributed. Thinking metaphorically in terms of the dice rolling story, at each time step t, we roll a transition dice weighted by the underlying distribution of the random variable qt in order to choose which node n to move to, followed by an emission dice weighted by the underlying distribution of the random variable et in order to choose which vocabulary item v to emit.
Consider the above trellis representing an HMM unrolled in time. In the dice rolling generative story, first a transition dice is rolled to draw the value of q0 following a distribution over POS tags, where the value of the drawn hidden state is ni with the transition probability π = P(q0 = ni). For the sake of concreteness, in the above example let's suppose the dice roll results in NN, which means that ni is NN. The next step is to roll an emission dice weighted by the probability P(e0 = vi) in order to draw the value of e0, which will be the first generated token of the sequence, v0. Again for the sake of concreteness, in the above example let's suppose the dice roll results in "time", which means that v0 is "time". This generative process repeats for each time step t until the end of the trellis, where all emissions are e0, e1 and e2, with values "time", "flies" and "like" respectively.
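To make the dice rolling story concrete, here is a minimal, self-contained sketch of the generative process in plain C++ with the standard library. The tag set, vocabulary and probability values are made-up illustrations, not part of the assignment library:

#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
  std::mt19937 rng(42);

  // made-up state (POS tag) and vocabulary sets, for illustration only
  std::vector<std::string> tags = {"NN", "VBZ", "IN"};
  std::vector<std::string> vocab = {"time", "flies", "like"};

  // initial probabilities pi = P(q0 = n_i)
  std::vector<double> pi = {0.6, 0.2, 0.2};
  // transition probabilities P(q_t = n_j | q_{t-1} = n_i)
  std::vector<std::vector<double>> trans = {{0.1, 0.7, 0.2},
                                            {0.3, 0.1, 0.6},
                                            {0.5, 0.4, 0.1}};
  // emission probabilities P(e_t = v_j | q_t = n_i)
  std::vector<std::vector<double>> emit = {{0.8, 0.1, 0.1},
                                           {0.1, 0.8, 0.1},
                                           {0.1, 0.1, 0.8}};

  // roll the transition dice for q0, then alternate emission and transition dice
  std::discrete_distribution<int> init_dice(pi.begin(), pi.end());
  int state = init_dice(rng);
  for (int t = 0; t < 3; ++t) {
    std::discrete_distribution<int> emit_dice(emit[state].begin(), emit[state].end());
    int token = emit_dice(rng);                 // e_t = v_token
    std::cout << tags[state] << " -> " << vocab[token] << "\n";
    std::discrete_distribution<int> trans_dice(trans[state].begin(), trans[state].end());
    state = trans_dice(rng);                    // draw q_{t+1}
  }
  return 0;
}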
Now as a step toward RNNs, the question is: can we instead somehow use a feedforward network in some analogous way to generate sequences? And if we can, how similar and/or interchangeable are RNNs and HMMs? To answer such questions we first need a basic understanding of feedforward neural networks. Luckily you are already familiar with them as you now have a considerable amount of experience from your previous assignment.
To help us stay oriented, here are the symbol names for random variables and values for HMMs. We're going to ask if we can come up with analogous entities in a feedforward network architecture.
| | HMM symbols | RNN symbols |
|---|---|---|
| Random variable (R.V.) for the states | qt | |
| State values | ni | |
| Random variable (R.V.) for the emissions | et | |
| Emission values | vi | |
First, we need to see if we have a random variable for emissions in a simple feedforward neural network. In the figure below, every rectangle represents a layer. The lower layer is an ordinary dense layer. The upper layer is also a dense layer whose outputs are in a 1-of-V format where V is the number of vocabulary items, with a softmax function so that the output activations can be interpreted as an emission probability distribution over all possible vocabulary items.
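As a concrete illustration, a minimal sketch of such a two-layer stack using the transducer library from the starting pack might look like the following. It only reuses calls that appear in the demo code; the exact layer sizes are assumptions:

// lower layer: an ordinary dense layer with a tanh activation
auto dense_0 = make_dense_feedfwd(EMBEDDING_SIZE, make_tanh());
// upper layer: one output per vocabulary item (+1 for the unknown token),
// with a log softmax so the (exponentiated) outputs can be read as an emission distribution
auto dense_1 = make_dense_feedfwd(vocab.size() + 1, make_log_softmax());
// stack the two layers
auto emitter = compose(dense_0, dense_1);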
(Figure: a table pairing the HMM variable names ni and et with their corresponding RNN variable names; the RNN entries are shown as network layers in the figure and are not reproduced here.)
| | Generation | Recognition |
|---|---|---|
| Input | (none) | Sequence |
| Output | Random sequence | Score/probability |
(Figure: the same table of HMM variable names versus RNN variable names, now with the emission entries et and vi paired with 1-hot vectors; the remaining entries are shown as network layers in the figure and are not reproduced here.)
How do we measure the context captured by our language model? At first glance at n-gram models, you might see that a trigram model is able to capture more context than a bigram model, but how can we be sure about it? In order to understand whether our hypothesis is valid or invalid, we need a standard way of measuring the error. Here we introduce perplexity as a way to evaluate your models.
Let's assume that W = w1 w2 ... wN is the input to your model; then the perplexity of the sentence W is:

PP(W) = P(w1 w2 ... wN)^(-1/N)

The success of your model is based on how it performs on the test sentences. A lower perplexity for each test sentence means that your model performs better. In real life it is difficult to achieve, but you should try to minimize the perplexity per sentence by iterating the experimental steps. Don't forget to report every iteration you did.
We will measure the success of your model by summing up the perplexities of the individual test sentences; the lower the total sum, the better the model. The total perplexity over the M test sentences is calculated as follows:

total perplexity = PP(W1) + PP(W2) + ... + PP(WM)
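A minimal sketch of this evaluation, assuming your model can return P(w_t | context) for each token. The function token_probability below is a hypothetical placeholder (here returning a uniform probability so the sketch is self-contained), not part of the provided library:

#include <cmath>
#include <string>
#include <vector>

// hypothetical placeholder: replace with your model's P(w_t | context)
double token_probability(const std::vector<std::string> &sentence, std::size_t t) {
  (void)sentence; (void)t;
  return 1.0 / 1000.0;  // e.g. uniform over a 1000-word vocabulary
}

// perplexity of one sentence: PP(W) = P(w1 ... wN)^(-1/N)
double sentence_perplexity(const std::vector<std::string> &sentence) {
  double log_prob = 0.0;  // accumulate log P(w1 ... wN) to avoid numerical underflow
  for (std::size_t t = 0; t < sentence.size(); ++t) {
    log_prob += std::log(token_probability(sentence, t));
  }
  return std::exp(-log_prob / sentence.size());
}

// total score reported for the model: the sum of per-sentence perplexities
double total_perplexity(const std::vector<std::vector<std::string>> &test_set) {
  double total = 0.0;
  for (const auto &sentence : test_set) total += sentence_perplexity(sentence);
  return total;
}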
After you have built one of your proposed models in assignment.cpp, you can run make and then ./main to compile and run your model. It will print out each test sentence together with your model's output distance for that sentence, as well as a total score for your model.
Entropy is the average uncertainty of a random variable. In language, entropy is the information that is produced on average for each letter of text in the language. It is also known as the average number of bits that we need to encode the information. By definition, the entropy of a random variable X with distribution P is:

H(X) = - sum_x P(x) log2 P(x)
If you look at the definition of perplexity in Wikipedia, you will see the equation below:

PP(P) = 2^H(P)
Where H(P) is the entropy of the distribution P(X) and X is a random variable over all possible events.
Entropy is the average number of bits needed to encode the information contained in a random variable. Exponentiating the entropy yields the total amount of possible information, i.e. the weighted average number of choices a random variable has. For example, if the average sentence in the test set could be coded in 100 bits, the model perplexity is 2^100 per sentence.
You need to complete this assignment on CSE Lab2 machines. You don't necessarily need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can remotely log into with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory which is shared among all the CSE Lab2 machines.
For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html
In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.
Your home directory in CSE Lab2 has only a 100MB disk quota. To see a report of the disk usage in your home directory, you can run the du command. It is recommended that you have at least 30MB available before you start this assignment.
Please download the starting pack tarball, and extract it to your home directory on CSE Lab2 machine. This starting pack contains the skeleton code, the feedforward network library and the dataset. The starting pack has the following structure:
assignment1/
├── include/*
├── lib/*
├── part_demo/
│   ├── assignment.cpp
│   ├── assignment.hpp
│   ├── main.cpp
│   └── makefile
├── part_a/
│   ├── report.xlsx (your report goes here)
│   ├── assignment.cpp (your code goes here)
│   ├── assignment.hpp
│   ├── main.cpp
│   └── makefile
└── part_b/
    ├── report.xlsx (your report goes here)
    ├── assignment.cpp (your code goes here)
    ├── assignment.hpp
    ├── main.cpp
    └── makefile
The only two files you need to touch are report.xlsx and assignment.cpp. Please do not touch other files.
After downloading the starting pack, you can run tar -xzvf COMP3211_2019Q1_a2.tgz to extract it.
tar is the archiving tool that collected all the files into the single package COMP3211_2019Q1_a2.tgz. The extraction command above does a couple of things: it decompresses the gzip archive (-z) and extracts the files it contains (-x), printing each file name as it goes (-v).
After you extract the starting pack, go into the part_demo directory and run make and then main. This is what you will get:
csl2wk15:skumyol:111> make
make: Warning: File `main' has modification time 20 s in the future
make: `main' is up to date.
make: warning: Clock skew detected. Your build may be incomplete.
csl2wk15:skumyol:112> main
training
prediction saved to predict.xml
total distance: 2779.15
You are provided with a sample RNN architecture in part_demo; please read the code and the provided descriptions carefully. You can take this model as a baseline and compare your outputs against this dummy model.
//
// Created by Dekai WU and YAN Yuchen on 20190424.
//
#include "assignment.hpp"
#include <make_transducer.hpp>
#include <shared_utils.hpp>

using namespace tg;
using namespace std;

namespace part_demo {

unsigned const NUM_EPOCHS = 4;
vector<symbol_t> vocab;
static const unsigned EMBEDDING_SIZE = 32;
transducer_t embedding_layer;
transducer_t one_hot;
transducer_t dense_feedfwd_softmax;
transducer_t dense_feedfwd_tanh;

void init(const vector<sentence_t> &training_set) {
  // first you need to assemble the vocab you need
  // in this simple model, the vocab is the top 1000 most frequent tokens in the training set
  // we provide a frequent_token_collector utility,
  // that can count token frequencies and collect the top X most frequent tokens
  // all out-of-vocabulary tokens will be treated as "unknown token"
  frequent_token_collector vocab_collector;
  for (const auto &sentence : training_set) {
    for (const auto &token : sentence) {
      vocab_collector.add_occurence(token);
    }
  }

  // Here we are going to define the custom layers that we are going to use in each RNN time step.
  // Since each RNN hidden layer is identical, we only need to define each layer once.
  // You might see that we are not defining the dot_product, -log and sum layers, because they are
  // already available in our library and don't require customization.
  // Since all layers are defined here, we can make our RNN by declaring a topology over the layers.
  vocab = vocab_collector.list_frequent_tokens(1000);
  embedding_layer = make_embedding_lookup(EMBEDDING_SIZE, vocab);  // stands for the EMB layer in the figure
  one_hot = make_onehot(vocab);                                    // stands for the 1-hot layer in the figure
  dense_feedfwd_softmax = make_dense_feedfwd(vocab.size() + 1, make_log_softmax());  // stands for dense 1 + log softmax in the figure
  dense_feedfwd_tanh = make_dense_feedfwd(EMBEDDING_SIZE, make_tanh());              // stands for dense 0 + tanh in the figure
}

/**
 * construct an RNN that will evaluate the degree of goodness of a sentence
 * in this demo part, it evaluates the degree of goodness of the first 3 tokens of the sentence
 * \return an RNN model
 */
transducer_t make_rnn_recognizer_3_inputs() {
  auto initial_state = make_const(vector<scalar_t>(EMBEDDING_SIZE, 0));
  auto start_of_sentence = make_const(token_t("<s>"));
  auto cell = compose(make_concatenate(2), dense_feedfwd_tanh);
  auto copy = make_copy(2);
  auto input_pick = make_pick({0, 1, 0, 1, 2});
  auto timestep_0 = compose(group(initial_state, compose(start_of_sentence, embedding_layer)), cell, copy);
  auto timestep_0_to_1 = compose(group(timestep_0, embedding_layer), group(make_identity(), compose(cell, copy)));
  auto timestep_0_to_2 = compose(group(timestep_0_to_1, embedding_layer), group(make_identities(2), cell));
  auto output_pick = make_pick({0, 3, 1, 4, 2, 5});
  auto readout_loss = compose(
      group(compose(dense_feedfwd_softmax, make_tensor_neg()), one_hot),
      make_dot_product());
  return compose(input_pick,
                 group(timestep_0_to_2, make_identities(3)),
                 output_pick,
                 group(readout_loss, readout_loss, readout_loss),
                 make_tensor_add(3));
}

}
You already know that in order to construct a feedforward layer, you need to specify the input dimension, the output dimension and the activation function. The sample code, however, shows a more complex design, so we are now going to break the given code down into parts. If you read the comments on each layer, you might have seen that the layers are shared among time steps and therefore shared by the RNN layers as well. Since the logic of the customized layers is already explained in the code comments, we explain the network topology and its operations here. It is important to mention that timestep_0_to_2 is our whole construction, and it is explained by breaking it down time step by time step below.
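To make the breakdown easier to follow, here is a commented trace of how the demo topology processes a 3-token sentence such as "time flies like". This reflects our reading of the demo code above; the names h0, h1, h2 are just labels for the hidden states, not library identifiers:

// input sentence (after get_features): {w0, w1, w2} = {"time", "flies", "like"}
//
// input_pick = make_pick({0, 1, 0, 1, 2})
//   -> (w0, w1, w0, w1, w2): w0 and w1 feed the recurrence, all three tokens are kept as targets
//
// timestep_0:      h0 = cell(initial_state, EMB("<s>"))   -> outputs (h0, h0)
// timestep_0_to_1: h1 = cell(h0, EMB(w0))                 -> outputs (h0, h1, h1)
// timestep_0_to_2: h2 = cell(h1, EMB(w1))                 -> outputs (h0, h1, h2)
//
// group(timestep_0_to_2, make_identities(3))
//   -> (h0, h1, h2, w0, w1, w2)
//
// output_pick = make_pick({0, 3, 1, 4, 2, 5})
//   -> (h0, w0, h1, w1, h2, w2): each hidden state is paired with the token it should predict
//
// readout_loss applied to each (h_t, w_t) pair:
//   dot(-log_softmax(dense_1(h_t)), one_hot(w_t)) = -log P(w_t | h_t)
//
// make_tensor_add(3) sums the three per-token losses into the sentence score (the "distance")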
In this part, you have to build an RNN to score input sentences with a fixed length of 4 instead of 3.
The first step to scientific research is to observe the data. You are provided with a traindata_length_4.xml and a devdata_length_4.xml for each part of the assignment.
Link to training and test data for you to download:
Traindata: http://hltc.cs.ust.hk/COMP3211_2019Q1_a3/res/traindata_length_4.xml
Devdata: http://hltc.cs.ust.hk/COMP3211_2019Q1_a3/res/devdata_length_4.xml
Both the training set and the development set are in XML format, in which each <sent> element represents a sentence and each <token> element represents a token of that sentence.
An example of such an xml file is provided here:
<dataset>
  <sent>
    <token>RNNs</token>
    <token>are</token>
    <token>good</token>
    <token>models</token>
  </sent>
</dataset>
You don't need to worry about how to read these XML files into C++ data structure. This dirty part will be handled by our library.
Limiting the sentence size introduces a totally different search space from your previous assignment: this dataset might be more uniform and easier to model using a simple representation such as a bigram. But this is not guaranteed; you need to observe the data to get an intuition about what kind of language model might be required for this task.
Find insightful patterns in the data. The main difference in this part is that you now have data with sequences of length 4. As you know from the previous assignment, you need to observe the data to get insight for your design choices, and here the sequence-length limitation is also part of the effect. Moreover, the distribution of your data is also something you may want to observe, where the sequence limitation again has an effect. We can give you some hints about how to seek patterns: the number of distinct tokens that the random variable at time step t can take is one pattern in the data, and you can also visually inspect how sparse the data is in a sequential manner. You should try to answer the question "what might be an RNN model's advantages and disadvantages against an n-gram model?" after you see the patterns.
According to the patterns you have observed so far, please propose your language models. You need to be able to relate your observations of the data to your hypotheses. You should remember that one of the topics covered during the lecture was the bias of RNN models. We strongly suggest that you have a look at the concept of word embeddings taught in class in order to understand this bias, so that you can explain your model's bias.
The next step is to compare those models theoretically, in the following aspects:
Write down your proposed hypotheses together with their pros and cons based on questions above.
Here you need to design experiments that validate/invalidate the hypotheses you came up with.
Write a detailed explanation for each of your models, indicating how they reflect their underlying hypotheses.
Write your own C++ code to define your model.
Build an RNN that reflects your hypothesis. You will be building your own RNN scorers to obtain a score for a given sentence. Recall that in assignment 2, you designed a feedforward neural network scorer which assigns a score to a given input. Here, the artificial neural network you are going to design will take a fixed-size input sequence of length 4. You can think of your network as a sequence scorer, but we can still reach the score of each token at time step t. An RNN handling a fixed-size input is basically a sequence of 4 feedforward neural networks, so you can design your neural network accordingly.
When you look at the sample code, you can see that the recurrent part is called for each time step and that the recurrent part is made of recycled layers. Those recycled layers constitute the hidden layer for time steps 0, 1 and 2; in your topology for input size 4 you need to add the next time step as well. So basically, you are going to recycle the same layer used in the previous time steps once more; a sketch of this extension is given below.
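A minimal sketch of that extra time step, following the demo's pattern. This is one possible topology under our assumptions, not the required solution; the function name make_rnn_recognizer_4_inputs is hypothetical, and the pick indices simply follow the same cross-mapping logic as the demo:

transducer_t make_rnn_recognizer_4_inputs() {
  auto initial_state = make_const(vector<scalar_t>(EMBEDDING_SIZE, 0));
  auto start_of_sentence = make_const(token_t("<s>"));
  auto cell = compose(make_concatenate(2), dense_feedfwd_tanh);
  auto copy = make_copy(2);

  // tokens 0..2 feed the recurrence, all 4 tokens are kept as 1-hot targets
  auto input_pick = make_pick({0, 1, 2, 0, 1, 2, 3});

  auto timestep_0 = compose(group(initial_state, compose(start_of_sentence, embedding_layer)), cell, copy);
  auto timestep_0_to_1 = compose(group(timestep_0, embedding_layer), group(make_identity(), compose(cell, copy)));
  // unlike the demo, time step 2 also keeps a copy of its hidden state, because time step 3 needs it
  auto timestep_0_to_2 = compose(group(timestep_0_to_1, embedding_layer), group(make_identities(2), compose(cell, copy)));
  auto timestep_0_to_3 = compose(group(timestep_0_to_2, embedding_layer), group(make_identities(3), cell));

  // pair hidden state t with the token it should predict: (h0, w0, h1, w1, h2, w2, h3, w3)
  auto output_pick = make_pick({0, 4, 1, 5, 2, 6, 3, 7});

  auto readout_loss = compose(
      group(compose(dense_feedfwd_softmax, make_tensor_neg()), one_hot),
      make_dot_product());

  // assuming group accepts four arguments like compose does in the demo;
  // otherwise nest two-argument groups as suggested in the Part B construction steps
  return compose(input_pick,
                 group(timestep_0_to_3, make_identities(4)),
                 output_pick,
                 group(readout_loss, readout_loss, readout_loss, readout_loss),
                 make_tensor_add(4));
}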
Recall that assignment.cpp in the starting pack already contains an RNN that handles the first 3 tokens of a sentence. You can think of this RNN as a transducer which takes some input (based on your design choice) and gives a score as output. Your model should output a distance given a token and its context. The distance of a sentence is defined as the sum of the emissions generated by your RNN model.
In this part, you have to build an RNN to score input sentences having variable length.
The first step to scientific research is to observe the data. You are provided with a traindata.xml and a devdata.xml for this part of the assignment. While Part A consists of sentences of length 4, Part B has no restriction on the length of sentences.
Link to training and test data for you to download:
Traindata: http://hltc.cs.ust.hk/COMP3211_2019Q1_a3/res/traindata.xml
Devdata: http://hltc.cs.ust.hk/COMP3211_2019Q1_a3/res/devdata.xml
Both the training set and the development set are in XML format, in which each <sent> element represents a sentence and each <token> element represents a token of that sentence.
If you look at the sentence below, you can see that bigram and trigram models might fail to capture the relationships between the tokens: for example, you can represent the sequence "every farmer", but you cannot represent its relation to "donkey beats it". The information we can obtain from such a sequence indeed has a weak representation due to the language bias of bigram and trigram language models. Therefore, the data itself gives us some ideas about what kind of language bias is necessary in order to create a better model. Recall from Part A that limiting the sentence length actually reduces the sparsity and the dependencies; consequently, your search space in this part is significantly different from that of Part A.
<dataset>
  <sent>
    <token>every</token>
    <token>farmer</token>
    <token>has</token>
    <token>a</token>
    <token>donkey</token>
    <token>beats</token>
    <token>it</token>
  </sent>
</dataset>
You don't need to worry about how to read these XML files into C++ data structure. This dirty part will be handled by our library.
Find insightful patterns in the data. The data here is the same as in the previous assignment; you need to observe it to get insight for your design choices. You need to seek patterns in a way that gives you insight about sequences. Please don't forget that the RNN task involves long-distance dependencies between tokens, unlike bigram and trigram models, so you might need to adjust your pattern seeking accordingly.
According to the patterns you have observed so far, please propose your language models. You need to be able to relate your observations of the data to your hypotheses. You should remember that one of the topics covered during the lecture was the bias of RNN models. We strongly suggest that you have a look at the concept of word embeddings taught in class in order to understand this bias, so that you can explain your model's bias.
The next step is to compare those models theoretically, in the following aspects:
Write down your proposed hypotheses together with their pros and cons based on questions above.
Here you need to design experiments that validate/invalidate the hypotheses you came up with.
Write a detailed explanation for each of your models, indicating how they reflect their underlying hypotheses.
Write your own C++ code to define your model.
Build an RNN that reflects your hypothesis. You will be building your own RNN scorers to obtain a score for a given sentence. Recall that in assignment 2, you designed a feedforward neural network scorer which assigns a score to a given input. Here, the artificial neural network you are going to design will take a dynamic-size input sequence instead of length 4. You can think of your network as a sequence scorer, but we can still reach the score of each token at time step t. An RNN handling a dynamic-size input is basically a sequence of as many feedforward neural networks as the input size.
You have 2 ways to build a recursive network: the first way is hardcoding every possible RNN for every different input size, which would be challenging; the other way is using recursion to generate each RNN on the fly. We are going to explain how to create recursive RNNs. First of all, you need to understand that the make_pick, timestep and readout_loss layers are the parts subject to recursion, and we will start from input_pick.
2. The timestep layers are actually only one (recurrent) layer applied repeatedly, except for the first layer that handles the initial state; thus, we suggest you handle the recursion based on the input size, for example:
timesteps = compose(group(timesteps, embedding_layer), group(make_identities(), compose(cell, copy)));
3. As you saw in the pick diagrams, we need to reorganize the information flow again to provide valid information to the readout_loss layer. When you go back to the sample code, you can see the logic of the output pick: it creates a cross-mapping between two layers. If we would like to create such a vector on the fly, we need to think of it as a loop again. At each iteration, the iteration counter i should be pushed to the parameter vector of output_pick, and i + input_length should also be pushed to output_pick; in this way each input will be mapped to its exact corresponding identity. E.g. the input-size-3 RNN in the sample code has a mapping of {0, 3, 1, 4, 2, 5}; if we had an input-size-5 RNN, the output pick mapping would be {0, 5, 1, 6, 2, 7, 3, 8, 4, 9}.
4. The readout_loss layer's input size is always fixed by the output size of the hidden layer, so no identities are needed here either. We can therefore group the readout_loss layer recursively, as many times as the input size, and then use that grouped readout_loss layer to compose the whole structure together. The recursion in the for loop here might look like: layer1 = group(layer1, layer2). A sketch combining all of these steps is given below.
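Here is a minimal sketch of an on-the-fly builder that follows the four steps above. It is an assumption based on the demo's API, not the required solution: the function name make_rnn_recognizer_n_inputs is hypothetical, it assumes make_pick also accepts a std::vector of indices, that make_identities(1) behaves like make_identity(), and that input_length >= 2:

transducer_t make_rnn_recognizer_n_inputs(unsigned input_length) {
  auto initial_state = make_const(vector<scalar_t>(EMBEDDING_SIZE, 0));
  auto start_of_sentence = make_const(token_t("<s>"));
  auto cell = compose(make_concatenate(2), dense_feedfwd_tanh);
  auto copy = make_copy(2);

  // step 1 (input_pick): tokens 0..N-2 feed the recurrence, all N tokens are kept as 1-hot targets
  vector<unsigned> input_indices;
  for (unsigned i = 0; i + 1 < input_length; ++i) input_indices.push_back(i);
  for (unsigned i = 0; i < input_length; ++i) input_indices.push_back(i);
  auto input_pick = make_pick(input_indices);

  // step 2 (timesteps): time step 0 uses the constant initial state and the <s> embedding,
  // then the same cell is recurred once per remaining time step, keeping every hidden state
  auto timesteps = compose(group(initial_state, compose(start_of_sentence, embedding_layer)), cell, copy);
  for (unsigned t = 1; t + 1 < input_length; ++t) {
    timesteps = compose(group(timesteps, embedding_layer),
                        group(make_identities(t), compose(cell, copy)));
  }
  // the last time step does not need an extra copy of its hidden state
  timesteps = compose(group(timesteps, embedding_layer),
                      group(make_identities(input_length - 1), cell));

  // step 3 (output_pick): pair hidden state t with target token t
  vector<unsigned> output_indices;
  for (unsigned i = 0; i < input_length; ++i) {
    output_indices.push_back(i);
    output_indices.push_back(i + input_length);
  }
  auto output_pick = make_pick(output_indices);

  // step 4 (readout_loss): one readout loss per time step, grouped recursively, then summed
  auto readout_loss = compose(
      group(compose(dense_feedfwd_softmax, make_tensor_neg()), one_hot),
      make_dot_product());
  auto grouped_losses = readout_loss;
  for (unsigned i = 1; i < input_length; ++i) grouped_losses = group(grouped_losses, readout_loss);

  return compose(input_pick,
                 group(timesteps, make_identities(input_length)),
                 output_pick,
                 grouped_losses,
                 make_tensor_add(input_length));
}

For input_length = 3 this reproduces the demo's pick vectors {0, 1, 0, 1, 2} and {0, 3, 1, 4, 2, 5}, which is a quick sanity check for your own construction.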
| Graph Representation | Pseudo Code |
|---|---|
| (figure) | A |
| (figure) | compose(A, B) |
| (figure) | compose(A, B, C) // or equivalently compose(compose(A, B), C) |
| (figure) | compose(group(A, B), C) |
| (figure) | compose(group(make_identity(), A), B) |
| (figure) | compose(A, group(B, C)) |
| (figure) | compose(make_copy(2), group(A, B)) |
| (figure) | compose(A, group(make_identity(), compose(B, C))) |
| (figure) | compose(A, group(B, C), D) |
assignment.cpp also contains an example of how to supply a token together with one or two preceding tokens as context:
vector<feature_t> get_features(const vector<string> &sentence) {
  // this starting code demonstrates how to trim (or pad) a sentence into length 3
  // if the sentence is shorter than 3 tokens, add some padding at the end to make it length 3
  if (sentence.empty()) {
    return vector<feature_t>{"</s>", "</s>", "</s>"};
  } else if (sentence.size() == 1) {
    return vector<feature_t>{sentence[0], "</s>", "</s>"};
  } else if (sentence.size() == 2) {
    return vector<feature_t>{sentence[0], sentence[1], "</s>"};
  }
  // if the sentence is longer than 3 tokens, take the first 3 tokens and trash the rest
  else {
    return vector<feature_t>{sentence[0], sentence[1], sentence[2]};
  }
}
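For Part A you will need to adapt this function to length 4, and for Part B to a variable length. A minimal sketch of the length-4 variant, following the same padding/trimming pattern (one possible approach, not the required one):

vector<feature_t> get_features(const vector<string> &sentence) {
  // pad with "</s>" up to length 4, or trim to the first 4 tokens
  vector<feature_t> features;
  for (size_t i = 0; i < 4; ++i) {
    if (i < sentence.size()) features.push_back(sentence[i]);
    else features.push_back("</s>");
  }
  return features;
}

For Part B, you would instead return all tokens of the sentence without trimming, and build the RNN to match the returned length.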
In assignment.cpp, you need to put your student ID in the global variable STUDENT_ID.
const char* STUDENT_ID = "1900909";
After analyzing why your model went wrong, you should be able to form a better hypothesis. Now you have completed one iteration of scientific research. On the next iteration, you compare your new hypothesis theoretically and empirically again, do more error analysis, and come up with an even better model.
Start another scientific method iteration.
You need to submit a tgz archive named assignment3.tgz via CASS. The tgz archive should ONLY contain two folders, part_a and part_b. Each folder should contain the files below.
Due date: 2019 May 12 23:59:00
The initial model that you get from directly modifying the demo code does not count towards a hypothesis. However, we treat the starting code as hypothesis #1, so you need to put the code of your initial model into report.xlsx under iteration #1, hypothesis #1 as a record (although it does not count towards your grade). By observing the outcome of the starting code (hypothesis #1), you can form your own hypothesis, i.e. hypothesis #2; note that you are not required to do error analysis on hypothesis #1. Briefly, your new hypothesis should be based on your observations of the outcome of the previous iteration, which you can find out by running the starting code.
Part A