COMP3211 Assignment 1

Serkan Kumyol

Release date: 2019 Mar 12

Objective

In this assignment, you have to write a program to find top five anagrams for given input characters using Dijkstra's algorithm. Program should be able to take a set of characters from terminal and output 5 anagrams with their distances.

Development Environment

You need to complete this assignment on CSE Lab2 Machine. You don't necessarily need to go to the physical lab in person (although you could). CSE Lab2 has 54 machines, which you can remotely log into with your CSE account. Their hostnames range from csl2wk00.cse.ust.hk to csl2wk53.cse.ust.hk. Create your project under your home directory which is shared among all the CSE Lab2 machines. Your home directory has a storage limit of 100MB, which is enough for you to do this assignment.

For non-CSE students, visit the following link to create your CSE account. https://password.cse.ust.hk:8443/pass.html

In the registration form, there are three "Set the password of" checkboxes. Please check the second and the third checkboxes.

Assignment Starting Pack

Please download the starting pack, and unzip it to your home directory on CSE Lab2 machine. This starting pack contains the skeleton code and bigram model. The starting pack has the following structure:

assignment1/
├── include/
│   └── language_model.hpp
├──	lib/
│   └── language_model.a
└──	src/
    ├── report.txt

            (your report goes here)
    ├── assignment.cpp
            (your code goes here)
    ├── assignment.hpp
	├── main.cpp
    ├── obj/
    └── makefile

The only file you need to modify is the assignment.cpp. Please do not touch other files.

tar -xvzf COMP3211_2019Q1_a1_20190311.tgz

To explain a little further, tar collected all the files into one package, COMP3211_2019Q1_a1_20190311.tgz. The gzip program applied compression, hence the gz extension. So the command does a couple things:

f: this must be the last flag of the command, and the tar file must be immediately after. It tells tar the name and path of the compressed file.
z: tells tar to decompress the archive using gzip.
x: tar can collect files or extract them. x does the latter.
v: makes tar talk a lot. Verbose output shows you all the files being extracted.

Verify your development environment

After you unzip the starting pack, go into its src directory and run make. This is what you will get:

csl2wk14:skumyol:106> make
g++8 -c -o obj/assignment.o assignment.cpp -std=c++17 -I../include -L../lib -llanguage_model
assignment.cpp: In function ;std::vector<std::__cxx11::basic_string<char> > test(const std::vector<std::__cxx11::basic_string<char> >&)’:
assignment.cpp:45:1: warning: no return statement in function returning non-void [-Wreturn-type]
 }
 ^
g++8 -o main -std=c++17 -I../include -L../lib main.cpp obj/assignment.o -llanguage_model

As you can see, the compiler is complaining that some function does not have a return statement. This is as expected, because it is your assignment to implement that function. But anyway, the compiler generates a binary called main in this src directory.

csl2wk14:skumyol:107> ls
assignment.cpp  assignment.hpp  devdata.xml


            main  main.cpp  makefile  obj  traindata.xml

Although running this binary will throw an error for now, but the existence of this binary indicates that your development is good to go.

Training and Testing Dataset

You are provided with a traindata.xml

Training set is in XML format, in which:

dataset element contains s elements
each s element contains tokens of a word

An example of such an xml file is provided here:

<dataset>
    <s>
        e a t
    </s>
    <s>
        c i n e m a
    </s>
</dataset>

You don't need to worry about how to read these XML files into C++ data structure. This dirty part will be handled by our library. However, you need to write a few lines of comments about "what you think about the dataset" in your report.

Step by Step Implementation Guide

You are already provided with language model therefore your focus is about how to make use of them in calculating degree of goodness of a sequence using the formula presented in the lecture: $d (e_{0} e_{1} e_{2} . . . e_{T - 1}) \approx d (e_{0}) + d (e_{1} | e_{0}) + d (e_{2} | e_{0}) + . . . + d (e_{T - 1} | e_{T - 2})$

Since we have already implemented the language model in language_model.hpp. You can call the related functions using provided library.

//
//
// Created by Dekai WU and Serkan KUMYOL on 20190308.
//

#ifndef _LANGUAGE_MODEL_HPP_
#define _LANGUAGE_MODEL_HPP_

#include <unordered_set>
    #include <vector>
    #include <string>

    /** token_T is designed for holding only one character.
    * The reason token_T is defined as std::string, instead of char,
    * is so that the character can be a Unicode character instead of just an ASCII character.
    * The Unicode character is stored in a std::string by encoding it using UTF-8.
    * */
    using token_T = std::string;

    /**
    * Semiring is the way you choose how to measure distance between tokens.
    */
    enum semiring_t {NEG_LOG_10};

    using std::vector;
    /**
    * a model that takes a token and predicts a label
    */
    class language_model {
    std::string m_id;
    public:
    language_model();
    language_model(const language_model&) = delete;
    language_model(language_model&&) noexcept = default;
    language_model &operator=(const language_model&) = delete;
    language_model &operator=(language_model&&) noexcept = default;

    /**
    * train your model with a training set
    * \param dataset vector of all training data where each s is in vector of strings
    */
    void lm_train(const vector<vector<token_T>> &words);

    /**
    * given compute d(e1 | e0)
    * \param token e1, the token to score
    * \param prefix the context of the token. in the bigram case, it contains only the preceding token e0
    * \param semiring the number domain. your only choice is negative log 10 domain
    * \return score value
    */
    double lm_score_suffix(const token_T &token, const std::vector<token_T> &prefix = {}, semiring_t semiring=NEG_LOG_10) const;

        };

        // below are utility functions that you don't need in assignment 1

        /**
        * Takes vector of pairs of token and scores and prints.
        * \param top5anagrams
        */
        void print_top5anagrams(const std::vector<std::pair<vector<token_T>, double>> &top5anagrams);



            std::vector<std::vector<token_T>> read_dataset(std::istream& is);
#endif

After reading through language_model.hpp, you can start implementing the training workflow. The training workflow has 3 steps: initialization, training and searching. You need to implement them in assignment.cpp.

Step 0: Hardcode your student ID into a string literal

You need to put your student ID in the global variable STUDENT_ID.

// TODO: put your student ID in this variable
const char* STUDENT_ID = "20026012";

Step 1: Initialization function

In the initialization function, create a language_model object without parameters which returns a pointer to the model you created. Note that you don't need to parse the training set XML file by yourself. After parsing xml file using provided library you need to give it as input to model.train function in the form of vector of strings. You can use standart C++ libraries to create such data structure.

/**
 * initialize your model. this function will be called before the "train" function
 *
 */
Void init() {
  // TODO: do whatever necessary to initialize your new model. You can use a global variable to access your model globally.
};

Step 2: Training function

In the training function, use your language_model object to do training on the given training set. Training function takes a vector made of vector of tokens as input which you will obtain from our xml parser and uses them to calculate scores of unigram and bigrams

/**
 * train your model with a training set
 * \param words a vector of strings obtained from training set
 */
void train(const std::vector<std::vector<token_T>> &dataset) {
  // TODO: complete this training function

};

Step 3: Searching

You need to implement Dijkstra's Algorithm to find top 5 anagrams for a given word from terminal.

 /**
 /**
 * function to calculate shortest path of given input word using dijkstra's algorithm
 * \param input word
 * \return the list of top 5 anagrams paired with their scores
 */
	std::vector<std::pair<token_T, double>> dijkstra(std::vector<token_T> &word) {

  // TODO: complete dijkstra's algorithm to find top5 anagrams function
  //
}

Further remarks

Don't worry about number precision All numeric results that differs from the correct answer by less than 0.01 are considered correct.

Your output may contain the original input word If the input word happens to be among the top 5, just put it there. You don't need to rule it out.

It doesn't matter how your break a tie Words that have the same score do not have any preferred order.

Beware of duplicates Every character from the input word must appear exactly once in the anagrams. For example, if the input word is "potato", its' anagrams must contain one b, two o's, two t's, and one a. Note that in the "potato" example, some characters appears more than once. This may cause duplicates in your output. You should pay special attention to them.

Submission

You need to submit a tgz archive named assignment1.tgz via CASS. The zip archive should ONLY contain the following two files:

assignment.cpp
report.txt containing answers to the following questions:
- What you think about the dataset? 10 pts
- What is your observation about the performance of Dijksta's algorithm(pros and cons) 15 pts
- How would you improve the algorithm? 10 pts

Due date: 2019 Mar 20 23:59:00

Grading Scheme

Below you see example scores

We only count the runtime of Dijkstra's algorithm

a working binary that:
- Handle 11-character words under 40 seconds = 65 points
- Handle 10-character words under 20 seconds = 55 points
- Handle 9-character words under 10 seconds = 45 points
- A working binary handling 6 to 8 characters = 35 points
report - 35pts