Seek patterns in the data. If you have already begun coding, stop whatever you are doing and read this text. Imagine that you have a map and you want to head to a location: first you need to check which path to take before setting off.
You might think, "I see a bunch of words that don't make sense to me". Merely glancing at the data is not enough to identify the problems, let alone the solutions. The observation phase is all about identifying problems. Let us give you an insight: when you look at the development data, can you see a pattern between the input data and the oracle data? This stage is also called exploratory data analysis. It is a life-saving skill. It will guide you to generate hypotheses that are worthwhile to test.
Generate hypotheses that would explain the pattern. After the previous step you probably have some ideas about what a possible hypothesis might be. In this assignment, your model choices are limited to feedforward neural networks and k nearest neighbors classifiers.
Now you need to hypothesize your own models. You have many options. Generate multiple competing hypotheses; don't just settle for the first option you think of. Some of them might be a good fit to your problem based on your observations. Some of them might not.
Compare and contrast the hypotheses analytically. Now that you have generated a bunch of alternative competing hypotheses, you need to contrast them and discuss their plausibility before testing them. You are already familiar with the terms language bias and search bias. Explain your hypotheses in terms of:
Compare and contrast the hypotheses experimentally.
Do error analysis on the experimental results. What patterns did your hypotheses fail to capture? Why not?
Repeat the scientific method cycle. You have finished one iteration of the scientific method. Return to step 2, and generate new hypotheses.
You need to complete this assignment on CSE Lab2 machines. You don't necessarily need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can remotely log into with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory which is shared among all the CSE Lab2 machines.
For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html
In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.
Your home directory in CSE Lab2 has only a 100MB disk quota. To see a report of the disk usage in your home directory, you can run the du command. It is recommended that you have at least 30MB available before you start this assignment.
Please download the starting pack tarball, and extract it to your home directory on CSE Lab2 machine. This starting pack contains the skeleton code, the feedforward network library and the dataset. The starting pack has the following structure:
assignment1/
├── include/*
├── lib/*
└── src/
    ├── report.xlsx (your report goes here)
    ├── assignment.cpp (your code goes here)
    ├── traindata.xml
    ├── devdata.xml
    ├── assignment.hpp
    ├── main.cpp
    ├── util.hpp
    └── makefile
The only two files you need to touch are report.xlsx and assignment.cpp. Please do not touch other files.
After downloading the starting pack, you can run tar -xzvf COMP4221_2019Q1_a2.tgz to extract it.
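tar collected all the files into one package, COMP4221_2019Q1_a2.tgz. The options passed to the command do a couple of things: -x extracts files from the archive, -z decompresses it with gzip, -v lists each file as it is processed, and -f names the archive file to read.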
After you extract the starting pack, go into its src directory and run make and main. You should see something similar to this:
csl2wk14:yyanaa:355> make
g++8 -std=c++17 -I../include -c -o assignment.o assignment.cpp
g++8 -std=c++17 -I../include -o main main.cpp assignment.o -L../lib -lmake_transducer
csl2wk14:yyanaa:356> main
terminate called after throwing an instance of 'std::runtime_error'
  what(): please fill in your student ID
Abort
The first step in scientific research is to observe the data. You are provided with a traindata.xml and a devdata.xml. They represent the training set and the development test set respectively.
Both the training set and development test set are in an XML format, in which:
An example of such an XML file is provided here:
<?xml version="1.0"?>
<dataset>
  <sent>
    <nonterminal value="TOP">
      <nonterminal value="NN">time</nonterminal>
      <nonterminal value="VB">flies</nonterminal>
      <nonterminal value="PP">like</nonterminal>
      <nonterminal value="DT">an</nonterminal>
      <nonterminal value="NN">arrow</nonterminal>
    </nonterminal>
  </sent>
  <sent>
    <nonterminal value="TOP">
      <nonterminal value="NN">fruit</nonterminal>
      <nonterminal value="NN">flies</nonterminal>
      <nonterminal value="VB">like</nonterminal>
      <nonterminal value="DT">an</nonterminal>
      <nonterminal value="NN">apple</nonterminal>
    </nonterminal>
  </sent>
</dataset>
You don't need to worry about how to read these XML files into C++ data structures. This dirty part will be handled by our library.
Find insightful patterns in the data. In the last assignment, most of you just scrolled through the data and concluded something like "the data is from newspaper", which doesn't really help you to design a good model. This time your observations should be insightful so that they can be foundations for your hypotheses. Report the observations accordingly.
For example, an insightful pattern looks like this: There are many tokens, such as XXX, whose POS tag depends on whether the sentence is a question or not. Furthermore, whether a sentence is a question can be identified by whether there is a question mark at the end of the sentence.
Good observations come from statistical analysis. For example, here are some directions that may be useful in finding patterns.
Report your observations about the data. You might use the hints we gave you but you are not limited to them.
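One statistic of this kind is how many words appear with more than one POS tag. Here is a minimal sketch of how you might count that, assuming the (word, tag) pairs have already been pulled out into plain strings. The names tagged_token, tags_per_word, count_ambiguous_words and example_data are hypothetical and are not part of the starting pack; the sample data is simply the two sentences from the XML example above.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// hypothetical stand-in for one (word, POS tag) pair from the training set
using tagged_token = std::pair<std::string, std::string>;

// the two sentences from the XML example, flattened into (word, tag) pairs
std::vector<tagged_token> example_data() {
  return {{"time", "NN"},  {"flies", "VB"}, {"like", "PP"}, {"an", "DT"}, {"arrow", "NN"},
          {"fruit", "NN"}, {"flies", "NN"}, {"like", "VB"}, {"an", "DT"}, {"apple", "NN"}};
}

// collect, for every word, the set of distinct tags it was seen with
std::map<std::string, std::set<std::string>>
tags_per_word(const std::vector<tagged_token> &data) {
  std::map<std::string, std::set<std::string>> result;
  for (const auto &[word, tag] : data) result[word].insert(tag);
  return result;
}

// count how many words were seen with more than one distinct tag
unsigned count_ambiguous_words(const std::vector<tagged_token> &data) {
  unsigned ambiguous = 0;
  for (const auto &[word, tags] : tags_per_word(data))
    if (tags.size() > 1) ++ambiguous;
  return ambiguous;
}
```

Calling count_ambiguous_words(example_data()) counts "flies" and "like", the two words that carry different tags in the two example sentences. On the real training set, a high count would suggest that the target token alone is not enough to decide the tag.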
According to the pattern you observed, please propose some POS tagging models. The next step is to compare those models theoretically, in the following aspects:
Write down your proposed hypotheses together with their pros and cons based on the questions above.
Here you need to design experiments that validate/invalidate the hypotheses you came up with.
Write a detailed explanation for each of your models, indicating how it reflects its underlying hypothesis.
Write your own C++ code to define your model.
Build a feedforward network that reflects your hypothesis. The assignment.cpp in the starting pack already contains a fully functioning 2-layer feedforward network as an example:
/**
 * create your custom classifier by combining transducers
 * the input to your classifier will be a list of tokens
 * \param vocab a list of all possible words
 * \param postags a list of all possible POS tags
 * \return
 */
transducer_t your_classifier(const vector<token_t> &vocab, const vector<postag_t> &postags) {
  // in this starting code, we demonstrate how to construct a 2-layer feedforward neural network
  // that takes 2 tokens as features

  // embedding lookup layer is mathematically equivalent to
  // a 1-hot layer followed by a dense layer with identity activation
  // but trains faster than composing those two separately
  auto embedding_lookup = make_embedding_lookup(64, vocab);

  // create a concatenation operation
  // this operation can concatenate the 2 tokens (in 1-hot representation) into a big vector feature
  auto concatenate = make_concatenate(2);

  // create the first feedforward layer,
  // with 64 output units and tanh as activation function
  auto dense0 = make_dense_feedfwd(64, make_tanh());

  // create another feedforward layer
  // this is the final layer, so this layer should return the predicted POS tag (in 1-hot representation)
  // the output dimension of your final layer should be the size of all POS tags
  // because each dimension corresponds to a particular choice
  auto dense1 = make_dense_feedfwd((unsigned) postags.size(), make_softmax());

  // this is the inverse 1-hot operation
  // it takes a 1-hot vector feature, and returns a token from a pre-defined vocabulary
  // the 1-hot vector feature doesn't have to be perfect 0 and 1 values,
  // but they have to sum up to 1 (just like a probability distribution)
  // usually this (approximated) 1-hot input comes from a softmax operation
  auto onehot_inverse = make_onehot_inverse(postags);

  // connect these layers together
  // composing A and B means first apply A, and then take the output of A and feed into B
  return compose(group(embedding_lookup, embedding_lookup), concatenate, dense0, dense1, onehot_inverse);
}
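The comment in the code above says that an embedding lookup is mathematically equivalent to a 1-hot layer followed by a dense layer with identity activation. The following sketch illustrates that claim on plain vectors. It assumes a dense layer without a bias term, and the names one_hot, dense_identity and embedding_lookup_row are made up for this illustration; none of this is the library's actual implementation.

```cpp
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using matrix = std::vector<vec>;  // one row per vocabulary entry

// build a 1-hot vector of the given size with a 1 at position index
vec one_hot(std::size_t index, std::size_t size) {
  vec v(size, 0.0);
  v[index] = 1.0;
  return v;
}

// dense layer with identity activation and no bias (an assumption): y = x * W
vec dense_identity(const vec &x, const matrix &W) {
  vec y(W[0].size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < y.size(); ++j)
      y[j] += x[i] * W[i][j];
  return y;
}

// embedding lookup: simply return the row of W that belongs to this word index
vec embedding_lookup_row(std::size_t index, const matrix &W) { return W[index]; }
```

For any weight matrix W with one row per word, dense_identity(one_hot(i, W.size()), W) gives the same vector as embedding_lookup_row(i, W). The lookup just skips all the multiplications by zero, which is one plausible reason why it trains faster.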
Conceptually, in order to construct a feedforward layer, you need to specify the input dimension, output dimension and activation function. But in the starting pack code, we can see that only the output dimension and activation function are explicitly specified. This is because the layer can infer the input dimension from the input it receives.
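To make the inferred dimensions concrete: the example concatenates two embeddings of size 64, so the layer that follows the concatenation receives 2 × 64 = 128 inputs. The sketch below works through this arithmetic. It assumes that a dense layer stores one weight per input/output pair plus one bias per output unit, which is a common convention but is not confirmed for this library; concat_dim and dense_params are hypothetical helper names.

```cpp
#include <cstddef>

// dimension after concatenating n_tokens embeddings of size embed_dim each
constexpr std::size_t concat_dim(std::size_t n_tokens, std::size_t embed_dim) {
  return n_tokens * embed_dim;
}

// number of trainable parameters in a dense layer, assuming a weight matrix
// plus one bias per output unit (an assumption, not confirmed by the library)
constexpr std::size_t dense_params(std::size_t in_dim, std::size_t out_dim) {
  return in_dim * out_dim + out_dim;
}
```

Under these assumptions, a dense layer with 64 output units sitting on top of the concatenation would hold dense_params(concat_dim(2, 64), 64) parameters. Estimates like this are handy when you compare the capacity of competing hypotheses.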
Graph Representation | Pseudo Code
[graph not shown] | A
[graph not shown] | compose(A, B)
[graph not shown] | compose(A, B, C) // or equivalently compose(compose(A, B), C)
[graph not shown] | compose(group(A, B), C)
[graph not shown] | compose(group(make_identity(), A), B)
[graph not shown] | compose(A, group(B, C))
[graph not shown] | compose(make_copy(2), group(A, B))
[graph not shown] | compose(A, group(make_identity(), compose(B, C)))
[graph not shown] | compose(A, group(B, C), D)
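If the pseudo code above feels abstract, the following toy model may help. It is only a guess at the semantics, inferred from how compose and group are used in the starting code, and it is not the library's implementation. Here a "transducer" is just a function from a list of ints to a list of ints, and all names (toy_transducer, toy_compose, toy_group, toy_copy2, add_one, times_two) are invented for this sketch.

```cpp
#include <functional>
#include <vector>

// toy model: a "transducer" maps a list of ints to a list of ints
using toy_transducer = std::function<std::vector<int>(const std::vector<int> &)>;

// compose(A, B): first apply A, then feed A's output into B
toy_transducer toy_compose(toy_transducer a, toy_transducer b) {
  return [a, b](const std::vector<int> &in) { return b(a(in)); };
}

// group(A, B): A handles the first input, B handles the second input,
// and the two outputs are collected side by side
toy_transducer toy_group(toy_transducer a, toy_transducer b) {
  return [a, b](const std::vector<int> &in) {
    std::vector<int> out = a({in.at(0)});
    std::vector<int> rest = b({in.at(1)});
    out.insert(out.end(), rest.begin(), rest.end());
    return out;
  };
}

// copy(2): duplicate a single input so that a following group can use it twice
toy_transducer toy_copy2() {
  return [](const std::vector<int> &in) {
    return std::vector<int>{in.at(0), in.at(0)};
  };
}

// two tiny example transducers used to exercise the combinators
std::vector<int> add_one(const std::vector<int> &in) { return {in.at(0) + 1}; }
std::vector<int> times_two(const std::vector<int> &in) { return {in.at(0) * 2}; }
```

In this toy model, toy_compose(add_one, times_two) applied to {3} first adds one and then doubles, while toy_compose(toy_copy2(), toy_group(add_one, times_two)) sends the same input down two parallel branches, mirroring the compose(make_copy(2), group(A, B)) row of the table.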
Instead of doing classification using a feedforward network, you can also use k nearest neighbors. Here is an example of how you can do that.
transducer_t your_classifier_knn(const vector<token_t> &vocab, const vector<postag_t> &postags) {
  // in this starting code, we demonstrate how to construct a KNN that takes two tokens as features

  // a KNN classifier takes a real-valued vector feature and directly returns the predicted class
  auto knn = make_symbolic_k_nearest_neighbors_classifier(5, 2, postags);

  return knn;
}
In the starting pack, this k nearest neighbors classifier will not be invoked at all. This is because our main.cpp invokes your_classifier, but this function is named your_classifier_knn. To use the k nearest neighbors classifier, you need to rename your_classifier_knn to your_classifier, and of course you also need to rename or remove the previous feedforward version of the your_classifier function.
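To reason about what a k nearest neighbors hypothesis can and cannot capture, it may help to see the idea in miniature. The sketch below is a toy version that works on token features directly: the distance between two feature lists is the number of positions where they differ, and the prediction is a majority vote among the k closest training examples. The library's actual distance measure and the meaning of its constructor arguments are not described here, so treat every name below (features, example, mismatch_count, knn_predict, toy_training_set) as hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using features = std::vector<std::string>;
using example = std::pair<features, std::string>;  // (features, tag)

// a tiny hand-made training set; features are {target token, preceding token}
std::vector<example> toy_training_set() {
  return {
      {{"flies", "time"}, "VB"},
      {{"flies", "fruit"}, "NN"},
      {{"like", "flies"}, "PP"},
  };
}

// number of positions at which two equally long feature lists differ
std::size_t mismatch_count(const features &a, const features &b) {
  std::size_t d = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i] != b[i]) ++d;
  return d;
}

// predict by majority vote among the k training examples closest to the query
std::string knn_predict(const std::vector<example> &train, const features &query, std::size_t k) {
  std::vector<std::pair<std::size_t, std::string>> scored;
  for (const auto &[feats, tag] : train)
    scored.emplace_back(mismatch_count(feats, query), tag);
  // stable sort keeps the training order among equally distant examples
  std::stable_sort(scored.begin(), scored.end(),
                   [](const auto &x, const auto &y) { return x.first < y.first; });
  std::map<std::string, std::size_t> votes;
  for (std::size_t i = 0; i < std::min(k, scored.size()); ++i) ++votes[scored[i].second];
  // ties go to the alphabetically first tag, because std::map iterates in key order
  std::string best;
  std::size_t best_votes = 0;
  for (const auto &[tag, n] : votes)
    if (n > best_votes) { best = tag; best_votes = n; }
  return best;
}
```

With k = 1, the query {"flies", "fruit"} matches the second training example exactly and is tagged accordingly. Notice that a query containing only unseen tokens is equally far from every training example, which is one weakness worth discussing when you compare this hypothesis against a feedforward network.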
The assignment.cpp in the starting pack also contains an example of how to supply a token together with one preceding token as context:
/**
 * besides the target token to classify, your model may also need other tokens as "context" input
 * this function defines the inputs that your model expects
 * \param sentence the sentence that the target token is coming from
 * \param token_index the position of the target token in the sentence
 * \return a list of tokens to feed to your model
 */
vector<token_t> get_features(const vector<token_t> &sentence, unsigned token_index) {
  // in this starting code, we demonstrate how to feed the target token,
  // together with its preceding token as context.

  if (token_index > 0) {
    // when the target token is not the first token of the sentence, we just need to
    // feed the target token and its previous token to the model
    return vector<token_t>{sentence[token_index], sentence[token_index - 1]};
  } else {
    // in case the target token is the first token, we need to invent a dummy previous token.
    // this is because a feedforward neural network expects consistent input dimensions.
    // if sometimes you give the feedforward neural network 1 token as input,
    // sometimes you give it 2 tokens as input, then the feedforward neural network will be angry.
    // there is nothing special about the string "<s>". you can pick whatever you want as long as
    // it doesn't appear in the vocabulary
    return vector<token_t>{sentence[token_index], "<s>"};
  }
}
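If your hypothesis says that the following token matters too, the same padding trick works at the end of the sentence. Below is a sketch of such a variant under a few assumptions: std::string stands in for the starting pack's token_t, the function name get_window_features is hypothetical, and "</s>" is just another dummy string that should not appear in the vocabulary.

```cpp
#include <string>
#include <vector>

// hypothetical variant of get_features that also supplies the following token;
// std::string stands in for the starting pack's token_t here
std::vector<std::string> get_window_features(const std::vector<std::string> &sentence,
                                             unsigned token_index) {
  // dummy tokens for positions that fall outside the sentence
  const std::string before = "<s>";
  const std::string after = "</s>";
  const std::string &prev = token_index > 0 ? sentence[token_index - 1] : before;
  const std::string &next =
      token_index + 1 < sentence.size() ? sentence[token_index + 1] : after;
  // always return exactly 3 tokens so the input dimension stays consistent
  return {sentence[token_index], prev, next};
}
```

Keep in mind that the model has to expect the same number of tokens that this function returns, so a classifier built for 2 tokens would presumably need to be widened to 3 inputs as well.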
In assignment.cpp you need to put your student ID in the global variable STUDENT_ID.
const char* STUDENT_ID = "00000000";
After you have built one of your proposed models in assignment.cpp, you can run the commands make and ./main to compile and run your model. You will see your model's accuracy. Furthermore, you will get a model_prediction.xml, which contains your model's predictions on the development test dataset, in exactly the same format as all other data files. Please examine your model output, find out which tokens your model failed to tag, and most importantly, explain why.
Write down your model accuracy in the report.
Find out which test examples your model failed to classify, and why your model failed to classify them. If there are more than 4 failures, you only need to report 4.
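When there are many failures, it helps to group them before picking the ones to report. The sketch below assumes you have already lined up the gold tags and the predicted tags as two equally long lists of strings; how you extract them from devdata.xml and model_prediction.xml is up to you. The function names accuracy and confusions are hypothetical and not part of the starting pack.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// fraction of positions where the predicted tag equals the gold tag
double accuracy(const std::vector<std::string> &gold,
                const std::vector<std::string> &predicted) {
  if (gold.empty()) return 0.0;
  std::size_t correct = 0;
  for (std::size_t i = 0; i < gold.size(); ++i)
    if (gold[i] == predicted[i]) ++correct;
  return static_cast<double>(correct) / gold.size();
}

// count each (gold tag, predicted tag) pair among the misclassified tokens
std::map<std::pair<std::string, std::string>, unsigned>
confusions(const std::vector<std::string> &gold,
           const std::vector<std::string> &predicted) {
  std::map<std::pair<std::string, std::string>, unsigned> counts;
  for (std::size_t i = 0; i < gold.size(); ++i)
    if (gold[i] != predicted[i]) ++counts[std::make_pair(gold[i], predicted[i])];
  return counts;
}
```

A confusion that shows up again and again, for example verbs that are repeatedly predicted as nouns, usually points to a pattern that your hypothesis failed to capture, and that is exactly what the error analysis should explain.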
After analyzing why your model went wrong, you should be able to make a better hypothesis. Now you have completed one iteration of scientific research. On the next iteration, you compare your new hypothesis theoretically and empirically again, do more error analysis, and come up with an even better model.
Start another scientific method iteration.
You need to submit a tgz archive named assignment2.tgz via CASS. The tgz archive should ONLY contain the following two files:
The tgz file can be generated using the following command: tar -zcvf assignment2.tgz assignment.cpp report.xlsx
Due date: April 10 23:00