Topic: Data Analysis


  1. Your ability to correctly use the tools that we covered in the course
  2. Your ability to draw and justify the correct conclusions from these tools
  3. Your ability to justify how conclusions/findings from previous (modelling) steps lead you to the
    actions/choices you made in subsequent (modelling) steps, or the revision of previous decisions you
    have made. Reports that don’t document the steps followed and the reasons why these were chosen
    will receive minimal marks, even if the final answer/ recommended model is sensible. Explain your
    reasoning clearly and in good English.
  4. Your ability to address the questions posed in the coursework based on an intelligent interpretation of
    the evidence provided in the previous two steps.
  5. Your ability to express and justify your key findings succinctly (that is, within the specified page limit)
    rather than report every possible model or figure you have created.
    For all the above reasons, do not just replicate in your report the process followed during the workshops!
    The objective of the workshops is to introduce you to the different techniques discussed during the lectures,
    and not to give you a roadmap on how to answer the coursework.
    You will not be assessed on your knowledge of R or any other software. For this reason do not include
    screenshots from any software or any other information about commands you used, or options to functions,
    or how you drew figures, etc. You will simply be wasting valuable space.
    You are free to use any software you like to do the coursework. However, you cannot excuse failing to
    do a particular task on the grounds that the software you chose does not offer a particular
    capability which we covered in the workshops.
    Page limits
    Your report must be submitted as a PDF file that does not exceed 11 pages, with at least 11 point
    typeface. This limit is strict and it includes appendices (which I strongly recommend that you do not use).
    If your report exceeds the page limit your mark will be affected negatively as you will be failing on the last
    assessment criterion (see above).
    Report Structure
  6. This is not a business report and as such it does not require an executive summary, a cover page, table
    of contents, or even an introduction describing the context of the task.
  7. You do not need to outline the CRISP-DM process, discuss expected project benefits and risks, create a
    cognitive map, etc. This is not relevant in this case since you will not be collaborating with a
    “problem owner” in the process of specifying a data mining project and assessing its feasibility. In the present
    setting the problem is already specified for you.
  8. It is essential to provide a Conclusions section that summarises your findings and how these relate to
    the coursework objectives.
    Plagiarism
    This is an individual piece of assessment, and you should ensure that your report reflects your
    own work exclusively. All reports go through automated software to detect plagiarism from a variety
    of sources (including past and current students’ reports as well as online resources, conference and journal
    publications, etc.). The consequences of plagiarism are very serious.
    • Deadline at 10:00 on the 7th of May
    • Late work will be penalised according to the department code of practice. Any request for an extension
    beyond the deadline will be accepted if appropriate justification is provided in advance.
    Problem Description
    As a data analyst working within the credit scoring division of a bank, you are asked to develop a model that
    will be used to accept or reject applicants for mortgages. The bank has collected data from past applicants
    and has also obtained additional information from a credit bureau. This information is provided in a dataset
    that is unique to each student, and which you will receive by email from Helena Greenwood.
    A description of the variables is provided in Table 1. In addition to the variables described in the table,
    the following four variables: liabilities NA, purpose NA, fraud NA and any missing, are also contained
    in the dataset you have received. These were created because the original dataset contained a number
    of missing values for the variables liabilities, purpose, and fraud. For each variable with missing
    values in the original data (e.g. liabilities) the missing values were replaced, and a binary variable
    (liabilities NA) was created to indicate whether each value of this variable was originally missing (in
    which case liabilities NA=1) or it corresponds to an actual value (in which case liabilities NA=0). In
    other words, the value of a variable like liabilities is the actual, observed, value when liabilities NA=0.
    When liabilities NA=1 the value of the variable liabilities has been predicted through a statistical
    model (and therefore it is not actually observed). The last binary variable any missing is equal to 1 if, for
    a particular observation (row of your data table), at least one variable was originally missing.
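    To make the relationship between a variable and its indicator concrete, the following Python sketch builds
    such indicators on toy data. It rests on several assumptions: the underscore column names (liabilities_NA,
    purpose_NA, any_missing), the toy values, and the median/mode fill are all illustrative stand-ins. The actual
    dataset replaced missing values with predictions from a statistical model that is not described here.

```python
from statistics import median, mode


def add_missing_indicator(rows, column, fill_fn):
    """Return copies of `rows` with `column` filled in and a 0/1 flag added.

    The flag `<column>_NA` is 1 where the original value was missing (None)
    and 0 where it was observed. `fill_fn` is a stand-in for the statistical
    model that produced the replacement values in the real dataset.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = fill_fn(observed)
    result = []
    for r in rows:
        new_row = dict(r)
        was_missing = r[column] is None
        new_row[column + "_NA"] = 1 if was_missing else 0
        if was_missing:
            new_row[column] = fill_value
        result.append(new_row)
    return result


def add_any_missing(rows, flag_columns):
    """Add any_missing = 1 if at least one indicator in the row equals 1."""
    return [
        dict(r, any_missing=int(any(r[c] == 1 for c in flag_columns)))
        for r in rows
    ]


# Toy data, invented purely for illustration.
raw = [
    {"liabilities": 10.0, "purpose": "purchase"},
    {"liabilities": None, "purpose": "purchase"},
    {"liabilities": 30.0, "purpose": None},
    {"liabilities": 50.0, "purpose": "remortgage"},
]

rows = add_missing_indicator(raw, "liabilities", median)
rows = add_missing_indicator(rows, "purpose", mode)
rows = add_any_missing(rows, ["liabilities_NA", "purpose_NA"])

for r in rows:
    print(r)
```

    In the second row the value of liabilities is a filled-in value rather than an observation, and the
    indicator records that fact, so information about the original missingness is not lost.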
    The bank is primarily interested in understanding the main factors that influence whether
    individuals/customers default on their mortgage debt. They are very interested in obtaining actionable insights
    (that is, insights that can be used to attract “good” customers and avoid “bad” ones). The bank managers
    are interested in whether it is feasible, and if so what is the best way, for the bank to use a statistical model
    to maximise the number of loans that it offers subject to the constraint that at least 85% of the customers
    that fail to repay their mortgage are denied loans. If the previous goal were not specified, which statistical
    model would you recommend, and why? Compare this model to the one(s) recommended based on the
    previous objective, and discuss similarities and differences. How many and which are the most important
    variables that determine the repayment behaviour of mortgage customers? (Do these differ depending on the
    objective and/or the classification method used?)
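    The 85% constraint can be read as a choice of cut-off on a model's predicted probability of default. The
    Python sketch below illustrates that reading under stated assumptions: applicants are rejected when their
    score is at or above the cut-off, the scores and outcomes are made up, and the function name
    choose_threshold and its arguments are hypothetical. It is one possible interpretation, not a prescribed method.

```python
def choose_threshold(scores, defaulted, min_reject_rate=0.85):
    """Pick the highest cut-off that still rejects enough defaulters.

    scores    : predicted probability of default for each applicant
    defaulted : 1 if the applicant actually defaulted, else 0
    An applicant is rejected when score >= cut-off. Raising the cut-off
    accepts more applicants, so the highest feasible cut-off maximises
    the number of loans offered under the constraint.
    Returns (cut_off, number_accepted, share_of_defaulters_rejected).
    """
    n_default = sum(defaulted)
    if n_default == 0:
        raise ValueError("no defaulters in the evaluation data")

    # Try candidate cut-offs from the most to the least permissive.
    for cut_off in sorted(set(scores), reverse=True):
        caught = sum(
            1 for s, d in zip(scores, defaulted) if d == 1 and s >= cut_off
        )
        reject_rate = caught / n_default
        if reject_rate >= min_reject_rate:
            accepted = sum(1 for s in scores if s < cut_off)
            return cut_off, accepted, reject_rate

    raise ValueError("constraint cannot be met; check min_reject_rate")


# Made-up scores and outcomes for eight applicants.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
defaulted = [1, 1, 0, 1, 0, 1, 0, 0]

cut_off, accepted, reject_rate = choose_threshold(scores, defaulted)
print(cut_off, accepted, reject_rate)
```

    With only four defaulters in this toy example, rejecting at least 85% of them means rejecting all four,
    which forces a low cut-off and leaves few accepted applicants. A cut-off chosen this way should be
    assessed on data that were not used to fit the model.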
    The list below contains tasks/issues that you should consider and be able to address, but it is very important
    to understand that the list is not exhaustive. This means that the data may contain other interesting
    features that are not mentioned or implied by any of these questions. It is your responsibility to identify
    (any of) those.
    • Exploratory Data Analysis (40 marks).
    Consider each variable and answer the following questions:
    – Does this variable appear to be important for the task at hand, and why? Support your claims
    with appropriate visualisations that document whether and how important each variable is.
    – Is it possible to combine variables or consider interactions between variables in order to obtain a
    better understanding of the relationships in the data, and/or what causes default?
    – Are different variables related, and which variables convey information similar to that provided
    in other variable(s)?
    – Do you identify issues with data quality (e.g. incorrect observations, outliers)?
    – For which variables (or combinations of variables) is the fact that specific values were missing in
    the original dataset informative, and what are the implications of this?
    • Statistical Modelling (60 marks)
    – For the two types of classifiers, logistic regression and decision trees, discuss the approach and
    the different settings you used to develop appropriate models. Emphasis should be placed on
    explaining why you considered your choices important during the model development phase.
    (Consider the choice of variable selection method as part of this question also.)
    – For each classification method develop one or a few candidate models that you think are promising
    before providing a final recommendation of the most appropriate model. You do not need
    to discuss every model you tried in detail, but you must include the results for the important
    steps in the process that led you to the final recommendations. I am particularly interested in
    understanding the steps you followed and the justification for these. (Refer to the CRISP data
    mining process discussed during the lectures and in Chapter 1 of the Guide to Intelligent Data
    – Comment on the generalisation performance of the model(s) you recommend for each type of
    classifier. In other words, how well do you believe these models will perform when they are
    The coursework requires you to write a report explaining your findings. This means that you need to
    explain each figure, table or “number” you include in the report. In other words, including a relevant figure
    without explaining what conclusions follow from it will earn you no marks.