data_mining_assignment_sample

Overview

In this assignment, you will solve a real-world data mining problem. This assignment requires you to understand the theory discussed in the workshops, conduct some research into the data mining problem to solve, and use the skills that you should have developed through completing practical exercises to perform various data mining tasks.

Problem Description

In this assignment, you will perform predictive analytics. You are given an sqlite3 database file (Assignment2023.sqlite) which contains a total of 5500 samples across two tables. Table “train” contains 5000 samples that have already been categorized into three classes. You are asked to predict the class labels of the 500 samples in the “test” table. You are given the following information:

  • The attribute Class indicates the class label. For the “train” table the class label is either 0, 1 or 2. For the “test” table the class label is missing (NULL).
  • Attributes are either categorical or numeric. Note that some attributes may appear numeric even though they could reasonably be treated as categorical. You will need to decide whether to treat them as numeric or categorical and justify your action.
  • The data is known to contain imperfections:

o There are missing/corrupted entries in the data set.
o There may be duplicate instances and duplicate attributes.
o There are irrelevant attributes that do not contain any information useful for the classification task.
o The labelled data is imbalanced: there is a considerable difference between the number of samples from each class.

Note that the attribute names and their values have been obfuscated. Any pre-processing and analytical steps applied to the data must be based entirely on the values of the attributes. No domain-specific knowledge is available.
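As a starting point, the two tables could be read into pandas DataFrames and the class distribution inspected. The following is a minimal sketch, assuming pandas is available (as it is on Google Colaboratory) and that the tables are named train and test with a Class column as described above. The helper names load_tables and class_distribution are hypothetical and not part of the assignment.

```python
import sqlite3

import pandas as pd


def load_tables(conn):
    """Read the 'train' and 'test' tables into two DataFrames.

    `conn` is an open sqlite3 connection to the assignment database.
    """
    train = pd.read_sql_query("SELECT * FROM train", conn)
    test = pd.read_sql_query("SELECT * FROM test", conn)
    return train, test


def class_distribution(train, label="Class"):
    """Count samples per class label (missing labels are counted too)."""
    return train[label].value_counts(dropna=False).sort_index()
```

For example, opening a connection with `sqlite3.connect("Assignment2023.sqlite")`, calling `load_tables(conn)`, and then printing `class_distribution(train)` would give a first view of how imbalanced the labelled data is.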

Attempt the following:

  • Data Preparation: In this phase, you will need to study the data and address the issues present in the data. At the end of this phase, you will need to obtain a processed version of the original data ready for classification, and suitably divide the data into two subsets: a training set and a test set.
  • Data Classification: In this phase, you will perform analytical processing of the training data, build suitable predictive models, test and validate the models, select the models that you believe are the most suitable for the given data, and then predict the missing labels.
  • Report: You will need to write a complete report documenting the steps taken, from data preparation to classification. In addition, you should also give comments or explain your choice/decision at every step. For example, if an attribute has missing entries, you must describe what strategy was taken to address them, and why you employ that strategy based on the observation of the data. Importantly, the report must also include your prediction of the missing labels.

Tasks

Data Preparation

In this first task, you will examine all data attributes and identify issues present in the data. For each of the issues that you have identified, choose and perform the necessary actions to address it. Note that you will need to apply these actions to both the training and test data at the same time. At the end of this phase, you will have two data sets: one for training and one for the final testing task. Your marks for this task will depend on how well you identify the issues and address them. Below is a list of data preparation issues that you need to address:

  • Identify and remove irrelevant attributes.
  • Detect and handle missing entries.
  • Detect and handle duplicates (both instances and attributes).
  • Select suitable data types for attributes.
  • Perform data transformation (such as scaling/standardization) if needed.
  • Perform other data preparation operations (this is optional; bonus marks will be awarded for novel ideas).

For each of the above issues, your report should:

  • Describe the relevant issue in your own words and explain why it is important to address it. Your explanation must consider the classification task that you will undertake subsequently.
  • Demonstrate clearly that such an issue exists in the data with suitable illustration/evidence.
  • Clearly state and explain your choice of action to address such an issue.
  • Demonstrate convincingly that your action has addressed the issue satisfactorily.

Where applicable, you should provide references to support your arguments.
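To illustrate how evidence for several of these issues might be gathered, here is a minimal sketch that counts missing entries, duplicate rows, duplicate columns, and constant (likely irrelevant) columns. It assumes the feature columns are held in a pandas DataFrame with the label and any row-index column already dropped. The function names are hypothetical, and a constant column is only one possible sign of irrelevance.

```python
import pandas as pd


def find_duplicate_columns(df):
    """Return names of columns whose values exactly repeat an earlier column."""
    duplicates = []
    columns = list(df.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            # Series.equals treats NaNs in the same positions as equal.
            if second not in duplicates and df[first].equals(df[second]):
                duplicates.append(second)
    return duplicates


def find_constant_columns(df):
    """Return names of columns with at most one distinct non-missing value."""
    return [c for c in df.columns if df[c].nunique(dropna=True) <= 1]


def summarize_issues(features):
    """Collect simple evidence of data-quality issues in a feature table."""
    return {
        "missing_per_column": features.isna().sum().to_dict(),
        "duplicate_rows": int(features.duplicated().sum()),
        "duplicate_columns": find_duplicate_columns(features),
        "constant_columns": find_constant_columns(features),
    }
```

Running the same summary again after each cleaning step is one simple way to show that an action has actually addressed the issue.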

Data Classification

For this task, you will demonstrate convincingly how you select, train, and fine-tune your predictive models to predict the missing labels. You must use at least the three (3) classifiers that have been discussed in the workshops, namely k-NN, Naive Bayes, and Decision Trees. You can also select additional classifiers (both base classifiers and meta-classifiers). Attempt and report the following:

  • Class imbalance: the original labelled data is not equally distributed between the three classes. You need to demonstrate that such an issue exists within the data, explain the importance of this issue, and describe how you address this problem.
  • Model training and tuning: Every classifier typically has hyperparameters that need to be tuned. For each classifier, you need to select at least one hyperparameter to tune and explain your choice. You must select and describe a suitable cross-validation/validation scheme that can measure the performance of your model on labelled data well and can address the class imbalance issue. Then you will need to conduct the actual tuning of your model and report the tuning results in detail. You are expected to look at several classification performance metrics and comment on the classification performance of each model. Finally, you will need to clearly indicate and justify the selected values of the tuning hyperparameters of each model.
  • Model comparison: Once you have finished tuning all models, you will need to compare them and explain how you select the best two models for producing the prediction on the 500 test samples.
  • Prediction:

o Use the best two (2) models that you have identified in the previous step to predict the missing class labels of the test samples. Clearly explain in detail how you arrive at the prediction.
o Produce an sqlite3 database file with the name Answers.sqlite that contains your prediction in the format: the first column is the index corresponding to the ‘test’ table, the second and third columns are the predicted class labels. All columns should be integers. This file must be submitted electronically with the electronic copy of the report via Blackboard. An example of such a file is given below:

            index,Predict1,Predict2
            5000,1,1
            5001,1,0
            5002,0,0
            …
            5499,0,1

o You must also indicate clearly in the report your estimated prediction accuracy for each selected model and explain how you arrive at these estimates.

• Other inventive steps: You may also conduct and report other inventive steps not mentioned above (bonus marks will be awarded for novel ideas).
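One possible way to set up the tuning described above is a stratified k-fold grid search scored with macro-averaged F1, which weights all three classes equally regardless of their size. The sketch below assumes scikit-learn is available and uses k-NN as the example. The function name tune_knn, the candidate values of k, and the choice of metric are illustrative assumptions, not requirements of the assignment.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_knn(X, y, k_values=(1, 3, 5, 7), n_splits=5, seed=0):
    """Grid-search the number of neighbours with stratified cross-validation.

    Scaling sits inside the pipeline so that it is re-fitted on each
    training fold and never sees the corresponding validation fold.
    """
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ])
    # Stratification keeps the class proportions similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(
        pipeline,
        param_grid={"knn__n_neighbors": list(k_values)},
        scoring="f1_macro",  # macro average treats each class equally
        cv=cv,
    )
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_`, `search.best_score_`, and `search.cv_results_` hold the selected value, its mean cross-validated score, and the per-candidate results that could be reported. The same pattern would apply to Naive Bayes or Decision Trees by swapping the final pipeline step and the parameter grid.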

Reporting

You will also need to submit a written report. It should serve the following objectives:

  • It demonstrates your understanding of the problem, your research skills, and the necessary steps you have attempted to solve the tasks.
  • It contains information necessary for marking your work.
  • It conforms to the page limit of 20 pages. Anything beyond this limit will not be graded.

What you should include in the report:

• Structure of the report
o Cover page: this must show your identity.
o Summary: briefly list the major findings (data preparation and classification) and the lessons you have learned.
o Methodology: address the requirements described above for
▪ Data preparation
▪ Data classification
o Conclusion: concluding remarks and other comments.
o References: list any relevant work that you refer to.
o Appendices: important things not mentioned above.

• Visual illustrations to support your analysis, which may include tables, figures, plots, diagrams, and screenshots.

Source Code

In addition to the main report which details your analysis of the assignment tasks, you will also need to submit a Jupyter Notebook that will reproduce your prediction. The notebook will:

  • Run without error on Google Colaboratory when you select “restart and run all”. You should assume that the data file for this assignment is in your root directory.
  • Contain a combination of Text and Code cells that describe the tasks that you are attempting. See the practicals for examples of how to do this.
  • Produce the Answers.sqlite file without need for further modification.
  • Be named notebook_surname_SID.ipynb (example notebook_hancock_12345678.ipynb)

Once you have completed your notebook, please go to “Edit -> Clear all outputs”, and then “File -> Download -> Download .ipynb”.

Any notebooks that fail to execute completely, or do not reproduce the submitted prediction file, will lose marks.
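The Answers.sqlite file described under the Prediction task could be produced with Python's built-in sqlite3 module, as in the minimal sketch below. The column names index, Predict1 and Predict2 follow the example given earlier. The table name answers and the function name write_answers are assumptions, since the specification does not name the table.

```python
import sqlite3


def write_answers(indices, predict1, predict2,
                  db_path="Answers.sqlite", table="answers"):
    """Write test indices and two sets of predicted labels as integer columns.

    The table name is an assumption; the assignment only fixes the file
    name and the column layout.
    """
    if not (len(indices) == len(predict1) == len(predict2)):
        raise ValueError("indices and predictions must have the same length")
    # Cast to plain Python ints so NumPy integer types are stored as INTEGER.
    rows = [(int(i), int(a), int(b))
            for i, a, b in zip(indices, predict1, predict2)]
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            # "index" is an SQL keyword, so it must be quoted.
            conn.execute(
                f'CREATE TABLE "{table}" '
                '("index" INTEGER PRIMARY KEY, Predict1 INTEGER, Predict2 INTEGER)'
            )
            conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?, ?)', rows)
    finally:
        conn.close()
    return len(rows)
```

Dropping and recreating the table on each call means that “restart and run all” regenerates the file from scratch rather than appending to a previous run.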
