Topic: Data Analysis - Essay Paper

Assessment

• Your ability to correctly use the tools that we covered in the course.
• Your ability to draw and justify the correct conclusions from these tools.
• Your ability to justify how conclusions/findings from previous (modelling) steps lead you to the
actions/choices you made in subsequent (modelling) steps, or to the revision of previous decisions you
have made. Reports that don’t document the steps followed and the reasons why these were chosen
will receive minimal marks, even if the final answer/recommended model is sensible. Explain your
reasoning clearly and in good English.
• Your ability to address the questions posed in the coursework based on an intelligent interpretation of
the evidence provided in the previous two steps.
• Your ability to express and justify your key findings succinctly (that is, within the page limit specified)
rather than report every possible model or figure you have created.
For all the above reasons: In your report do not just replicate the process followed during the workshops!
The objective of the workshops is to introduce you to the different techniques discussed during the lectures,
and not to give you a roadmap on how to answer the coursework.
You will not be assessed on your knowledge of R or any other software. For this reason do not include
screenshots from any software or any other information about commands you used, or options to functions,
or how you drew figures etc. You will be simply wasting valuable space.
You are free to use any software you like to do the coursework. However, the fact that the software
you chose does not offer a particular capability covered in the workshops is not an acceptable excuse
for failing to complete a task.
Page limits
Your report must be submitted as a PDF file that does not exceed 11 pages, with at least 11 point
typeface. This limit is strict and it includes appendices (which I strongly recommend that you do not use).
If your report exceeds the page limit your mark will be affected negatively as you will be failing on the last
assessment criterion (see above).
Report Structure
This is not a business report and as such it does not require an executive summary, a cover page, table
of contents, or even an introduction describing the context of the task.
You do not need to outline the CRISP-DM process, discuss expected project benefits and risks, create a
cognitive map etc. This is not relevant in this case since you will not be collaborating with a “problem-owner” in the process of specifying a data mining project and assessing its feasibility. In the present
setting the problem is already specified for you.
It is essential to provide a Conclusions section that summarises your findings and how these relate to
the coursework objectives.
Plagiarism
This is an individual piece of assessment, and you should ensure that your report reflects your
own work exclusively. All reports go through automated software to detect plagiarism from a variety
of sources (including past and current students’ reports as well as online resources, conference and journal
publications etc.). The consequences of plagiarism are very serious.
Deadline
• Deadline at 10:00 on the 7th of May
• Late work will be penalised according to the department code of practice. Any request for an extension
beyond the deadline will be accepted if appropriate justification is provided in advance.
Problem Description
You are asked as a data analyst working within the credit scoring division of a bank to develop a model that
will be used to accept or reject applicants for mortgages. The bank has collected data from past applicants
and has also obtained additional information from a credit bureau. This information is provided in a dataset
that is unique to each student, and which Helena Greenwood will send you by email.
A description of the variables is provided in Table 1. In addition to the variables described in the table,
the following four variables: liabilities NA, purpose NA, fraud NA and any missing, are also contained
in the dataset you have received. These were created because the original dataset contained a number
of missing values for the variables liabilities, purpose, and fraud. For each variable with missing
values in the original data (e.g. liabilities) the missing values were replaced, and a binary variable
(liabilities NA) was created to indicate whether each value of this variable was originally missing (in
which case liabilities NA=1) or it corresponds to an actual value (in which case liabilities NA=0). In
other words, the value of a variable like liabilities is the actual, observed, value when liabilities NA=0.
When liabilities NA=1 the value of the variable liabilities has been predicted through a statistical
model (and is therefore not actually observed). The last binary variable any missing is equal to 1 if, for
a particular observation (row of your data table), at least one variable was originally missing.
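One way to check whether a missingness indicator is informative is to compare the default rate between rows where the value was observed and rows where it was imputed. The snippet below is a minimal sketch of that comparison in plain Python. The column spelling `liabilities_NA` and the target name `default` are assumptions made for illustration (Table 1 defines the actual names), and the records are made up.

```python
from collections import defaultdict

def default_rate_by_flag(rows, flag, target="default"):
    """Return {flag value: share of rows whose target equals 1}.

    `flag` and `target` are hypothetical column names used for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # flag value -> [defaults, total rows]
    for row in rows:
        counts[row[flag]][0] += row[target]
        counts[row[flag]][1] += 1
    return {value: defaults / total for value, (defaults, total) in counts.items()}

# Made-up toy records, not taken from any real dataset.
rows = [
    {"liabilities_NA": 0, "default": 0},
    {"liabilities_NA": 0, "default": 0},
    {"liabilities_NA": 0, "default": 1},
    {"liabilities_NA": 0, "default": 0},
    {"liabilities_NA": 1, "default": 1},
    {"liabilities_NA": 1, "default": 0},
]

rates = default_rate_by_flag(rows, "liabilities_NA")
print(rates)
```

A large gap between the two rates would suggest that the fact a value was missing carries information in itself. On a real dataset the gap should be judged against the group sizes before drawing conclusions.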
Objectives
• The bank is primarily interested in understanding the main factors that influence whether
individuals/customers default on their mortgage debt. They are very interested in obtaining actionable
insights (that is, insights that can be used to attract “good” customers and avoid “bad” ones).
• The bank managers are interested in whether it is feasible, and if so what is the best way, for the bank
to use a statistical model to maximise the number of loans that it offers subject to the constraint that
at least 85% of the customers that fail to repay their mortgage are denied loans.
• If the previous goal were not specified, which statistical model would you recommend, and why?
Compare this model to the one(s) recommended based on the previous objective, and discuss
similarities and differences.
• How many and which are the most important variables that determine the repayment behaviour of
mortgage customers? (Do these differ depending on the objective and/or the classification method used?)
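The 85% constraint can be read as a choice of cut-off on a model's predicted probability of default: reject every applicant whose score is at or above the cut-off, and pick the cut-off that accepts the most applicants while still rejecting at least 85% of known defaulters. The sketch below illustrates that reading in plain Python. The function name, the scores and the outcomes are all hypothetical, and in practice the cut-off would be chosen on data not used to fit the model.

```python
def best_threshold(scores, defaulted, min_reject_rate=0.85):
    """Pick the cut-off that accepts the most applicants while rejecting
    at least `min_reject_rate` of the defaulters.

    scores: predicted probability of default for each applicant.
    defaulted: 1 if the applicant defaulted, 0 otherwise.
    Applicants with score >= cut-off are rejected.
    Returns (cut-off, number accepted), or None if there are no defaulters.
    """
    n_default = sum(defaulted)
    if n_default == 0:
        return None
    best = None
    for cut in sorted(set(scores)):
        rejected_defaulters = sum(
            1 for s, d in zip(scores, defaulted) if d == 1 and s >= cut
        )
        accepted = sum(1 for s in scores if s < cut)
        if rejected_defaulters / n_default >= min_reject_rate:
            if best is None or accepted > best[1]:
                best = (cut, accepted)
    return best

# Made-up scores and outcomes for twelve applicants.
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
defaulted = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1]

result = best_threshold(scores, defaulted)
print(result)  # (0.4, 6): six of seven defaulters rejected, six applicants accepted
```

Lowering the cut-off rejects more defaulters but also more good customers, so the constraint fixes a minimum sensitivity and the objective then maximises acceptances subject to it.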
Tasks
The list below contains tasks/issues that you should consider and be able to answer, but it is very important
to understand that the list is not exhaustive. This means that the data may contain other interesting
features that are not mentioned or implied by any of these questions. It is your responsibility to identify
(any of) those.
• Exploratory Data Analysis (40 marks).
Consider each variable and answer the following questions:
– Does this variable appear to be important for the task at hand, and why? Support your claims
with appropriate visualisations that document whether and how important each variable is.
– Is it possible to combine variables or consider interactions between variables in order to obtain a
better understanding of the relationships in the data, and/ or what causes default?
– Are different variables related, and which variables convey information similar to that provided
in other variable(s)?
– Can you identify any issues with data quality (e.g. incorrect observations, outliers)?
– For which variables (or combinations of variables) is the fact that specific values were missing in
the original dataset informative, and what are the implications of this?
• Statistical Modelling (60 marks)
– For the two types of classifier, logistic regression and decision trees, discuss the approach and
the different settings you used to develop appropriate models. Emphasis should be placed on
explaining why you considered your choices important during the model development phase.
(Consider the choice of variable selection method as part of this question also.)
– For each classification method develop one or a few candidate models that you think are promising before providing a final recommendation of the most appropriate model. You do not need
to discuss every model you tried in detail, but you must include the results for the important
steps in the process that led you to the final recommendations. I am particularly interested in
understanding the steps you followed and the justification for these. (Refer to the CRISP data
mining process discussed during the lectures and in Chapter 1 of the Guide to Intelligent Data
Analysis).
– Comment on the generalisation performance of the model(s) you recommend for each type of
classifier. In other words, how well do you believe these models will perform when they are
deployed?
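Generalisation performance is commonly estimated by evaluating a model on data that were not used to fit it. The sketch below shows k-fold cross-validation in plain Python as one such approach. It is an assumed technique, not one prescribed by this brief. The "model" is a deliberately trivial placeholder (a cut-off halfway between the two class means) standing in for logistic regression or a decision tree, and all names and data are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and return each fold's accuracy."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return accuracies

# Placeholder model: a cut-off halfway between the two class means.
def fit_cutoff(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_cutoff(cutoff, x):
    return 1 if x >= cutoff else 0

# Made-up single-feature data with two classes.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

fold_acc = cross_validate(xs, ys, fit_cutoff, predict_cutoff, k=5)
print(fold_acc, sum(fold_acc) / len(fold_acc))
```

The spread of the per-fold values gives a rough sense of how stable the estimate is. A large gap between in-sample and held-out performance would indicate overfitting.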
The coursework requires you to write a report explaining your findings. This means that you need to
explain each figure, table or “number” you include in the report. In other words, including a relevant figure
without explaining the conclusions drawn from it will earn you no marks.
