COMP4221 Assignment 2

Yuchen Yan

Release date: 2019 Apr 03

Your tasks

Report your observations about the data.
Write down your proposed hypotheses together with their pros and cons, their strengths and weaknesses.
Write a detailed explanation for each of your model indicating that how are they reflecting their underlying hypothesis.
Write your own C++ code to define your model.
Conduct experiments on your models.
Report experimental results
Report your error analysis
Do more iterations following the scientific method.

Objectives

Build some POS taggers with your own custom feedforward network and evaluate against other models (beginning with your experimental baseline model, which was the one from assignment 1).
Build some POS taggers with your own custom k nearest neighbors classifier and evaluate against other models.
Practice how to do modeling in machine learning, by following the scientific method to develop and improve models
- Do observations on the given data and propose hypotheses.
- Develop your own model(s) based on your hypotheses.
- Conduct experiments using your models.
- Do error analysis on the results of experiments.
- Propose better hypotheses, conduct experiments and do error analysis again. Repeat this procedure until you don't have a better hypothesis.

The scientific method

1. Observation (first things first! always look at the data)

Seek patterns in the data. If you have already begun coding just stop whatever you are doing and read this text. Imagine that you have a map and you want to head a location, first you need to check what path you need to take before heading.

https://www.oreilly.com/library/view/head-first-data/9780596806224/ch01.html

You might think that "I see a bunch of words that don't make sense to me". Just looking at the data itself is not enough to come up with the problems and solutions. The observation phase is all about identifying problems. Let us give you an insight: when you look at the development data, can you see a pattern between the input data and the oracle data? This stage is also called exploratory data analysis. It is a life-saving skill. It will guide you to generate hypotheses that are worthwhile to test.

2. Generate hypotheses

Generate hypotheses that would explain the pattern. After the previous step you probably have some ideas about what a possible hypothesis might be. In this assignment, your model choices are limited to feedforward neural networks and k nearest neighbors classifiers.

Now you need to hypothesize your own models. You have many options. Generate multiple competing hypotheses; don't just settle for the first option you think of. Some of them might be a good fit to your problem based on your observations. Some of them might not.

3. Compare hypotheses theoretically

Compare and contrast the hypotheses analytically. Now that you have generated a bunch of alternative competing hypotheses, you need to contrast them and discuss their plausibility before testing them. You are already familiar with the terms language bias and search bias. Explain your hypotheses in terms of:

language bias Why is one hypothesized model's language bias better than another hypothesized model's language bias?
search bias Why is one hypothesized model's search bias better than another hypothesized model's search bias?

https://yaboa.de/2018/06/183

4. Compare hypotheses empirically

Compare and contrast the hypotheses experimentally.

Experimental design: Design a methodology for how you will compare your hypothesized models. What metrics will you use to score each hypothesis? Why are those the right metrics? What data will you use? What is that a good choice of data? Etc.
Experimental implementation: Code each hypothesized model.
Experimental execution: Train each hypothesized model.
Experimental results: Test each hypothesized model and record the empirical scores.

5. Observe data

Do error analysis on the experimental results. What patterns did your hypotheses fail to capture? Why not?

6. Iterate

Repeat the scientific method cycle. You have finished one iteration of scientific method. Return to step 2, and generate new hypotheses.

Development environment

You need to complete this assignment on CSE Lab2 machines. You don't necessarily need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can remotely log into with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory which is shared among all the CSE Lab2 machines.

For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html

In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.

Free up space in your home directory

Your home directory in CSE Lab2 has only 100MB disk quota. To see a report of the disk usage in your home directory, you can run du command. It is recommended that you have at least 30MB available before you start this assignment.

Assignment Starting Pack

Please download the starting pack tarball, and extract it to your home directory on CSE Lab2 machine. This starting pack contains the skeleton code, the feedforward network library and the dataset. The starting pack has the following structure:

assignment1/
├── include/*
├──	lib/*
└──	src/
    ├── report.xlsx    (your report goes here)
    ├── assignment.cpp (your code goes here)
    ├── traindata.xml
    ├── devdata.xml
    ├── assignment.hpp
	├── main.cpp
    ├── util.hpp
    └── makefile

The only two files you need to touch are report.xlsx and assignment.cpp. Please do not touch other files.

Extract the starting pack

After downloading the starting pack, you can run tar -xzvf COMP4221_2019Q1_a2.tgz to extract the starting pack

What does the command do?

tar collected all the files into one package, COMP4221_2019Q1_a2.tgz. The command does a couple things:

f: this must be the last flag of the command, and the tar file must be immediately after. It tells tar the name and path of the compressed file.
z: tells tar to decompress the archive using gzip.
x: tar can collect files or extract them. x does the latter.
v: makes tar talk a lot. Verbose output shows you all the files being extracted.

Verify your development environment

After you extract the starting pack, go into its src directory and run make and main. You should see something similar like this:

csl2wk14:yyanaa:355> make
g++8 -std=c++17 -I../include -c -o assignment.o assignment.cpp
g++8 -std=c++17 -I../include -o main main.cpp assignment.o -L../lib -lmake_transducer
csl2wk14:yyanaa:356> main
terminate called after throwing an instance of 'std::runtime_error'
  what():  please fill in your student ID
Abort

Analyzing Data

The first step in scientific research is to observe the data. You are provided with a traindata.xml and a devdata.xml. They represent the training set and the development test set respectively.

Both training set and development set are in XML format, in which:

Both the training set and development test set are in an XML format, in which:

each sent element contains a sentence
each nonterminal element on the leaf contains a token
the value attribute of a nonterminal element specifies the POS tag

An example of such an xml file is provided here:

<?xml version="1.0"?>
<dataset>
    <sent>
        <nonterminal value="TOP">
            <nonterminal value="NN">time</nonterminal>
            <nonterminal value="VB">flies</nonterminal>
            <nonterminal value="PP">like</nonterminal>
            <nonterminal value="DT">an</nonterminal>
            <nonterminal value="NN">arrow</nonterminal>
        </nonterminal>
    </sent>
    <sent>
        <nonterminal value="TOP">
            <nonterminal value="NN">fruit</nonterminal>
            <nonterminal value="NN">flies</nonterminal>
            <nonterminal value="VB">like</nonterminal>
            <nonterminal value="DT">an</nonterminal>
            <nonterminal value="NN">apple</nonterminal>
        </nonterminal>
    </sent>
</dataset>

You don't need to worry about how to read these XML files into C++ data structure. This dirty part will be handled by our library.

Seeking patterns in the data

Find insightful patterns in the data. In the last assignment, most of you just scrolled through the data and concluded something like "the data is from newspaper", which doesn't really help you to design a good model. This time your observations should be insightful so that they can be foundations for your hypotheses. Report the observations accordingly.

For example, an insightful pattern is like this: There are many tokens, such as XXX, whose POS tag depends on whether the sentence is a question or not. Furthermore, a sentence is a question or not can be identified by whether there is a question mark at the end of the sentence.

Good observations comes from statistical analysis. For example, here are some directions may be useful in finding patterns.

What are the frequent tokens? How many times did they occur (and with what relative percentage)? What are the possible POS tags that they can have?
What are the frequent POS tags? How many times did they occur (and with that relative percentage)?
How many tokens have only one POS tag? How many tokens have 2 possible POS tags? How many tokens have 3 possible POS tags? ...
For those tokens that have multiple possible POS tags, what context would be required to make a correct POS tag prediction?

Report your observations about the data. You might use the hints we gave you but you are not limited to them.

Generating hypothesis models

According to the pattern you observed, please propose some POS tagging models. The next step is to compare those models theoretically, in the following aspects:

What language bias does your model have? Why is this language bias better than the other?
What search bias does your model have? Why is this search bias better than the other?

Write down your proposed hypotheses together with their pros and cons based on questions above.

Do experiments using your hypotheses

Here you need to design experiments that validate/invalidate the hypotheses you came up with.

Write a detailed explanation for your each model indicating how are they reflecting their underlying hypothesis

Write your own C++ code to define your model.

Building your own feedforward network

Build a feedforward network that reflects your hypothesis. The assignment.cpp, in the starting pack already contains a fully functioning 2-layer feedforward network as an example:

/**
 * create your custom classifier by combining transducers
 * the input to your classifier will a list of tokens
 * \param vocab a list of all possible words
 * \param postags a list of all possible POS tags
 * \return
 */
transducer_t your_classifier(const vector<token_t> &vocab, const vector<postag_t> &postags) {

  // in this starting code, we demonstrates how to construct a 2-layer feedforward neural network
  // that takes 2 tokens as features

  // embedding lookup layer is mathematically equivalent to
  // a 1-hot layer followed by a dense layer with identity activation
  // but trains faster then composing those two separately
  auto embedding_lookup = make_embedding_lookup(64, vocab);

  // create a concatenation operation
  // this operation can concatenate the 2 tokens (in 1-hot representation) into a big vector feature
  auto concatenate = make_concatenate(2);

  // create the first feedforward layer,
  // with 64 output units and tanh as activation function
  auto dense0 = make_dense_feedfwd(64, make_tanh());

  // create another feedforward layer
  // this is the final layer. so this layer should return the predicted POS tag (in 1-hot representation)
  // the output dimension of your final layer should be the size of all POS tags
  // because each dimension corresponds to a particular choice
  auto dense1 = make_dense_feedfwd((unsigned) postags.size(), make_softmax());

  // this is the inverse 1-hot operation
  // it takes a 1-hot vector feature, and returns a token from a pre-defined vocabulary
  // the 1-hot vector feature doesn't have to be perfect 0 and 1 values.
  // but they have to sum up to 1 (just like probability distribution)
  // usually the this (approximated) 1-hot input comes from a softmax operation
  auto onehot_inverse = make_onehot_inverse(postags);

  // connect these layers together
  // composing A and B means first apply A, and then take the output of A and feed into B
  return compose(group(embedding_lookup, embedding_lookup), concatenate, dense1, onehot_inverse);
}

Conceptually, in order to construct a feedforward layer, you need to specify the input dimension, output dimension and activation function. But in the starting pack code, we can see that only the output dimension and activation function is explicitly specified. This is because the layer will eventually know the input dimension by looking at the input it receives anyway.

How to compose transducers

Graph Representation	Pseudo Code
	A
	compose(A, B)
	compose(A, B, C) // or equivalently compose(compose(A, B), C)
	compose(group(A, B), C)
	compose(group(make_identity(), A), B)
	compose(A, group(B, C))
	compose(make_copy(2), group(A, B))
	compose(A, group(make_identity(), compose(B, C)))
	compose(A, group(B, C), D)

Building your own k nearest neighborss classifier

Instead of doing classification using a feedforward network, you can also use k nearest neighborss. Here is an example on how you can do that.

transducer_t your_classifier_knn(const vector<token_t> &vocab, const vector<postag_t> &postags) {

  // in this starting code, we demonstrates how to construct a KNN that takes two tokens as features

  // a KNN classifier takes a real-valued vector feature and directly returns the predicted class
  auto knn = make_symbolic_k_nearest_neighbors_classifier(5, 2, postags);

  return knn;
}

In the starting pack, this k nearest neighborss classifier will not be invoked at all. This is because, our main.cpp will invoke your_classifier, but the function is named your_classifier_knn. To use this k nearest neighborss classifier, you need to rename your_classifier_knn into your_classifier, and of course you also need to rename/remove the previous feedforward version of your_classifier function.

Defining your own logic on how to supply context

The assignment.cpp in the starting pack also contains an example on how to supply a token together with one preceding token as context:

/**
 * besides the target token to classify, your model may also need other tokens as "context" input
 * this function defines the inputs that your model expects
 * \param sentence the sentence that the target token is coming from
 * \param token_index the position of the target token in the sentence
 * \return a list of tokens to feed to your model
*/
vector<token_t> get_features(const vector<token_t> &sentence, unsigned token_index) {

  // in these starting code, we demonstrate how to feed the target token,
  // together with its preceding token as context.

  if (token_index > 0) {

    // when the target token is not the first token of the sentence, we just need to
    // feed the target token and its previous token to the model
    return vector<token_t>{sentence[token_index], sentence[token_index - 1]};
  } else {

    // in case the target token is the first token, we need to invent a dummy previous token.
    // this is because a feedforward neural network expects consistent input dimensions.
    // if sometimes you give the feedforward neural network 1 token as input,
    // sometimes you give it 2 tokens as input, then the feedforward neural network will be angry.

    // there is nothing special about the string "<s>". you can pick whatever you want as long as
    // it doesn't appear in the vocabulary
    return vector<token_t>{sentence[token_index], "<s>"};
  }
}

Hardcode your student ID into a string literal

In assignment.cpp you need to put your student ID in the global variable STUDENT_ID.

const char* STUDENT_ID = "00000000";

Doing experiments and testing

After you have built one of your proposed models in assignment.cpp, you can run the command make and ./main to compile and run your model. You will see your model's accuracy. Furthermore, you will get a model_prediction.xml, which contains your model prediction on the development test dataset, in exactly the same format as all other data files. Please observe your model output, find out which tokens your model failed to tag, and most importantly, explain why.

Write down your model accuracy in the report.

Find out which test examples your model failed to classify, and why your model failed to classify them. If there are more than 4 failures, you only need to report 4.

Refining your model

After analyzing why your model went wrong, you should be able to make a better hypothesis. Now you have completed one iteration of scientific research. On the next iteration, you compare your new hypothesis theoretically and empirically again, doing more error analysis, and coming up with more awesome model.

Start another scientific method iteration.

Submission

You need to submit a tgz archive named assignment2.tgz via CASS. The tgz archive should ONLY contain the following two files:

assignment.cpp that contains the implementation of your final improved model
report.xlsx contains the record of your scientific method
- This is a template for scientific method.
- After opening this Excel file, you will notice that there are four worksheets. Each worksheet corresponds to one scientific method iteration. You don't need to fill in all of them, and of course you can add more if needed.
- In each worksheet, there are four columns coresponding to 4 hypotheses. You don't have to fill in all of them, and of course you can add more if needed.
- In each worksheet, you don't necessarily have to fill in every blank. Fill in those applicable.
- To create a line break in Excel cell, press Alt + Enter.

The tgz file can be generated using the following command: tar -zcvf assignment2.tgz assignment.cpp report.xlsx

Due date: April 10 23:00

Grading Scheme

For each insightful pattern you observe (no matter which iteration it belongs to), you get 4 pts. After you have obtained 20 pts in this way, you get 1 bonus pts instead.
For each hypothesis you proposed and evaluated (no matter which iteration it belongs to), you get up to 20 pts. After you have obtained 80 pts in this way, you get 5 bonus pts instead. The pts for a hypothesis is broken down as explained below.
- hypothesis description - 1 pts
- hypothesis theoretical analysis - 4 pts
- model description - 1 pts
- code - 5 pts
- experiment score - 1 pts
- error analysis - 8 pts
  - 2 pts for each failure case
If your submission is not in the required format, for example, not in tgz format or contains additional files, then a -5 pts penalty will apply.
You can get a maximum amount of 25 bonus pts