jpg

This project is done to practice data wrangling (or data munging). The truth is that data cannot always be used as it is and needs preparation in a way so that it can be used. Some cleaning and wrangling needs to be done before we can do analysis on them. In this project, we will make use of the the Death Statistics 2019 data obtained from Department of Statistics Malaysia. Also included is a simple visualization to give some insights about the data. All of these process are made using Python.

1. Questions

This analysis will try to answer the following questions:

2. Measurement Priorities

3. Data Collection


Data is obtained from:

JADUAL PENERBITAN PERANGKAAN SEBAB KEMATIAN, MALAYSIA, 2020

Statistics on Causes of Death, Malaysia, 2020 presents statistics on major causes of death by classification for 2019. The Department of Statistics Malaysia (DOSM) uses the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) published by the World Health Organization (2010 ) as reference material for coding the cause of death.

Import Data

import pandas as pd
xls = pd.ExcelFile('dataset\death_cause_malaysia_2020.xlsx')

Data Cleaning

Table 1.1: Deaths (number and rate) by state and sex, Malaysia, 2019

Preview table

df_gen = pd.read_excel(xls,sheet_name="1.1")
df_gen
Jadual 1.1: Kematian (bilangan dan kadar) mengikut negeri dan jantina, Malaysia, 2019 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Table 1.1: Deaths (number and rate) by state a... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN Kematian NaN NaN NaN Kadar kematian kasar NaN NaN NaN NaN
3 NaN NaN Deaths NaN NaN NaN Crude death rate NaN NaN NaN NaN
4 NaN NaN Jumlah Lelaki Perempuan NaN Jumlah Lelaki Perempuan NaN NaN
5 NaN NaN Total Male Female NaN Total Male Female NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 MALAYSIA NaN 173746 99681 74065 NaN 5.342242 5.945819 4.700105 NaN NaN
8 NaN Johor 21776 12743 9033 NaN 5.789692 6.453459 5.056066 NaN NaN
9 NaN Kedah 14292 8027 6265 NaN 6.574828 7.287441 5.842794 NaN NaN
10 NaN Kelantan 11389 6436 4953 NaN 6.045617 6.759489 5.316083 NaN NaN
11 NaN Melaka 5542 3084 2458 NaN 5.96941 6.619759 5.314342 NaN NaN
12 NaN Negeri Sembilan 7157 4097 3060 NaN 6.355022 7.056639 5.608422 NaN NaN
13 NaN Pahang 9165 5381 3784 NaN 5.483302 6.067542 4.822913 NaN NaN
14 NaN Perak 18712 10554 8158 NaN 7.458543 8.322346 6.575591 NaN NaN
15 NaN Perlis 1994 1124 870 NaN 7.84891 8.886218 6.82032 NaN NaN
16 NaN Pulau Pinang 10879 6101 4778 NaN 6.150355 6.862011 5.431132 NaN NaN
17 NaN Sabah 14493 8439 6054 NaN 3.711919 4.142434 3.242217 NaN NaN
18 NaN Sarawak 13990 7888 6102 NaN 4.985706 5.401526 4.534463 NaN NaN
19 NaN Selangor 27700 16142 11558 NaN 4.257525 4.782697 3.69142 NaN NaN
20 NaN Terengganu 7334 4201 3133 NaN 5.893177 6.634565 5.125219 NaN NaN
21 NaN W.P. Kuala Lumpur 8819 5168 3651 NaN 4.94744 5.646853 4.209432 NaN NaN
22 NaN W.P. Labuan 296 182 114 NaN 2.981166 3.637308 2.31458 NaN NaN
23 NaN W.P. Putrajaya 208 114 94 NaN 2.006715 2.347853 1.706082 NaN NaN
24 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 Kadar adalah bagi setiap 1,000 penduduk NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 The rates are per 1,000 population NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Remove unwanted rows and columns

df_gen = df_gen.iloc[8: , :]
df_gen = df_gen.iloc[:-4]
df_gen = df_gen.iloc[:,1:]
df_gen = df_gen.iloc[:,:-2]
df_gen.drop('Unnamed: 5',axis=1,inplace=True)
df_gen
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 6 Unnamed: 7 Unnamed: 8
8 Johor 21776 12743 9033 5.789692 6.453459 5.056066
9 Kedah 14292 8027 6265 6.574828 7.287441 5.842794
10 Kelantan 11389 6436 4953 6.045617 6.759489 5.316083
11 Melaka 5542 3084 2458 5.96941 6.619759 5.314342
12 Negeri Sembilan 7157 4097 3060 6.355022 7.056639 5.608422
13 Pahang 9165 5381 3784 5.483302 6.067542 4.822913
14 Perak 18712 10554 8158 7.458543 8.322346 6.575591
15 Perlis 1994 1124 870 7.84891 8.886218 6.82032
16 Pulau Pinang 10879 6101 4778 6.150355 6.862011 5.431132
17 Sabah 14493 8439 6054 3.711919 4.142434 3.242217
18 Sarawak 13990 7888 6102 4.985706 5.401526 4.534463
19 Selangor 27700 16142 11558 4.257525 4.782697 3.69142
20 Terengganu 7334 4201 3133 5.893177 6.634565 5.125219
21 W.P. Kuala Lumpur 8819 5168 3651 4.94744 5.646853 4.209432
22 W.P. Labuan 296 182 114 2.981166 3.637308 2.31458
23 W.P. Putrajaya 208 114 94 2.006715 2.347853 1.706082

Rename Columns

dict = {'Unnamed: 1': 'state_death',
        'Unnamed: 2': 'total_death',
        'Unnamed: 3': 'male_death',
        'Unnamed: 4': 'female_death',
        'Unnamed: 6': 'total_death_rate',
        'Unnamed: 7': 'male_death_rate',
        'Unnamed: 8': 'female_death_rate',

       }

df_gen.rename(columns=dict,
          inplace=True)

Reset Index

df_gen.reset_index(drop=True,inplace=True)
df_gen
state_death total_death male_death female_death total_death_rate male_death_rate female_death_rate
0 Johor 21776 12743 9033 5.789692 6.453459 5.056066
1 Kedah 14292 8027 6265 6.574828 7.287441 5.842794
2 Kelantan 11389 6436 4953 6.045617 6.759489 5.316083
3 Melaka 5542 3084 2458 5.96941 6.619759 5.314342
4 Negeri Sembilan 7157 4097 3060 6.355022 7.056639 5.608422
5 Pahang 9165 5381 3784 5.483302 6.067542 4.822913
6 Perak 18712 10554 8158 7.458543 8.322346 6.575591
7 Perlis 1994 1124 870 7.84891 8.886218 6.82032
8 Pulau Pinang 10879 6101 4778 6.150355 6.862011 5.431132
9 Sabah 14493 8439 6054 3.711919 4.142434 3.242217
10 Sarawak 13990 7888 6102 4.985706 5.401526 4.534463
11 Selangor 27700 16142 11558 4.257525 4.782697 3.69142
12 Terengganu 7334 4201 3133 5.893177 6.634565 5.125219
13 W.P. Kuala Lumpur 8819 5168 3651 4.94744 5.646853 4.209432
14 W.P. Labuan 296 182 114 2.981166 3.637308 2.31458
15 W.P. Putrajaya 208 114 94 2.006715 2.347853 1.706082

Convert to appropriate data types

cols = df_gen.columns.drop('state_death')
df_gen[cols] = df_gen[cols].apply(pd.to_numeric, errors='coerce')
df_gen.dtypes
state_death           object
total_death            int64
male_death             int64
female_death           int64
total_death_rate     float64
male_death_rate      float64
female_death_rate    float64
dtype: object

Export cleaned table

df_gen
state_death total_death male_death female_death total_death_rate male_death_rate female_death_rate
0 Johor 21776 12743 9033 5.789692 6.453459 5.056066
1 Kedah 14292 8027 6265 6.574828 7.287441 5.842794
2 Kelantan 11389 6436 4953 6.045617 6.759489 5.316083
3 Melaka 5542 3084 2458 5.969410 6.619759 5.314342
4 Negeri Sembilan 7157 4097 3060 6.355022 7.056639 5.608422
5 Pahang 9165 5381 3784 5.483302 6.067542 4.822913
6 Perak 18712 10554 8158 7.458543 8.322346 6.575591
7 Perlis 1994 1124 870 7.848910 8.886218 6.820320
8 Pulau Pinang 10879 6101 4778 6.150355 6.862011 5.431132
9 Sabah 14493 8439 6054 3.711919 4.142434 3.242217
10 Sarawak 13990 7888 6102 4.985706 5.401526 4.534463
11 Selangor 27700 16142 11558 4.257525 4.782697 3.691420
12 Terengganu 7334 4201 3133 5.893177 6.634565 5.125219
13 W.P. Kuala Lumpur 8819 5168 3651 4.947440 5.646853 4.209432
14 W.P. Labuan 296 182 114 2.981166 3.637308 2.314580
15 W.P. Putrajaya 208 114 94 2.006715 2.347853 1.706082
df_gen.to_csv('cleaned_data/death_rate_by_state_and_sex_2019.csv',index=False)

Table 1.3: Deaths by age group and sex, Malaysia, 2019

Preview table

df = pd.read_excel(xls,sheet_name="1.3")
df
Jadual 1.3: Kematian mengikut kumpulan umur dan jantina, Malaysia, 2019 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11
0 Table 1.3: Deaths by age group and sex, Malays... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Kumpulan umur Jumlah Lelaki Perempuan NaN NaN NaN NaN NaN NaN NaN NaN
4 Age group Total Male Female NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 Jumlah/Total 173746 99681 74065 NaN NaN NaN NaN NaN NaN NaN NaN
8 Kurang 1 tahun/Under 1 year 3117 1711 1406 NaN NaN NaN NaN NaN NaN NaN NaN
9 Kurang 1 hari/Under 1 day 652 351 301 NaN NaN NaN NaN NaN NaN NaN NaN
10 1 hari hingga < 1 minggu/1 day to < 1 week 842 467 375 NaN NaN NaN NaN NaN NaN NaN NaN
11 1 minggu hingga < 4 minggu/1 week to < 4 weeks 486 267 219 NaN NaN NaN NaN NaN NaN NaN NaN
12 4 minggu hingga < 3 bulan/4 weeks to < 3 months 459 248 211 NaN NaN NaN NaN NaN NaN NaN NaN
13 3 bulan hingga < 6 bulan/3 months to < 6 months 349 195 154 NaN NaN NaN NaN NaN NaN NaN NaN
14 6 bulan hingga < 9 bulan/6 months to < 9 months 203 120 83 NaN NaN NaN NaN NaN NaN NaN NaN
15 9 bulan hingga < 12 bulan/9 months to < 12 mon... 126 63 63 NaN NaN NaN NaN NaN NaN NaN NaN
16 1 tahun/1 year 167 87 80 NaN NaN NaN NaN NaN NaN NaN NaN
17 2 tahun/2 years 233 135 98 NaN NaN NaN NaN NaN NaN NaN NaN
18 3 tahun/3 years 133 68 65 NaN NaN NaN NaN NaN NaN NaN NaN
19 4 tahun/4 years 115 71 44 NaN NaN NaN NaN NaN NaN NaN NaN
20 0-4 3765 2072 1693 NaN NaN NaN NaN NaN NaN NaN NaN
21 5-9 596 352 244 NaN NaN NaN NaN NaN NaN NaN NaN
22 10-14 656 384 272 NaN NaN NaN NaN NaN NaN NaN NaN
23 15-19 1591 1215 376 NaN NaN NaN NaN NaN NaN NaN NaN
24 20-24 2136 1615 521 NaN NaN NaN NaN NaN NaN NaN NaN
25 25-29 2423 1718 705 NaN NaN NaN NaN NaN NaN NaN NaN
26 30-34 3017 2081 936 NaN NaN NaN NaN NaN NaN NaN NaN
27 35-39 4144 2803 1341 NaN NaN NaN NaN NaN NaN NaN NaN
28 40-44 5591 3778 1813 NaN NaN NaN NaN NaN NaN NaN NaN
29 45-49 7436 4944 2492 NaN NaN NaN NaN NaN NaN NaN NaN
30 50-54 10717 6919 3798 NaN NaN NaN NaN NaN NaN NaN NaN
31 55-59 14572 9334 5238 NaN NaN NaN NaN NaN NaN NaN NaN
32 60-64 18046 11275 6771 NaN NaN NaN NaN NaN NaN NaN NaN
33 65-69 19539 11855 7684 NaN NaN NaN NaN NaN NaN NaN NaN
34 70-74 20351 11901 8450 NaN NaN NaN NaN NaN NaN NaN NaN
35 75-79 19672 10737 8935 NaN NaN NaN NaN NaN NaN NaN NaN
36 80-84 19479 9201 10278 NaN NaN NaN NaN NaN NaN NaN NaN
37 85 dan lebih/85 and over 20015 7497 12518 NaN NaN NaN NaN NaN NaN NaN NaN

Remove unwanted rows and columns

df = df.iloc[8: , :]
df.head()
Jadual 1.3: Kematian mengikut kumpulan umur dan jantina, Malaysia, 2019 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11
8 Kurang 1 tahun/Under 1 year 3117 1711 1406 NaN NaN NaN NaN NaN NaN NaN NaN
9 Kurang 1 hari/Under 1 day 652 351 301 NaN NaN NaN NaN NaN NaN NaN NaN
10 1 hari hingga < 1 minggu/1 day to < 1 week 842 467 375 NaN NaN NaN NaN NaN NaN NaN NaN
11 1 minggu hingga < 4 minggu/1 week to < 4 weeks 486 267 219 NaN NaN NaN NaN NaN NaN NaN NaN
12 4 minggu hingga < 3 bulan/4 weeks to < 3 months 459 248 211 NaN NaN NaN NaN NaN NaN NaN NaN
df = df.iloc[:,:-8]
dict = {'Jadual 1.3: Kematian mengikut kumpulan umur dan jantina, Malaysia, 2019': 'age_group',
        'Unnamed: 1': 'total_death_group',
        'Unnamed: 2': 'male_death_age',
        'Unnamed: 3': 'female_death_age',
        }

df.rename(columns=dict,
          inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()
age_group total_death_group male_death_age female_death_age
0 Kurang 1 tahun/Under 1 year 3117 1711 1406
1 Kurang 1 hari/Under 1 day 652 351 301
2 1 hari hingga < 1 minggu/1 day to < 1 week 842 467 375
3 1 minggu hingga < 4 minggu/1 week to < 4 weeks 486 267 219
4 4 minggu hingga < 3 bulan/4 weeks to < 3 months 459 248 211
df['age_group']=df.age_group.str.replace(".*/", '')
<ipython-input-219-bda0008f5b22>:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df['age_group']=df.age_group.str.replace(".*/", '')
df.head()
age_group total_death_group male_death_age female_death_age
0 Under 1 year 3117 1711 1406
1 Under 1 day 652 351 301
2 1 day to < 1 week 842 467 375
3 1 week to < 4 weeks 486 267 219
4 4 weeks to < 3 months 459 248 211
df.to_csv('cleaned_data/death_rate_by_age_group_and_sex_2019.csv',index=False)

Table 1.6: Ten principal causes of death (medically certified and non-medically certified) by sex, Malaysia, 2019

Preview table

df = pd.read_excel(xls,sheet_name=r"1.6(1)")
df
Jadual 1.6: Sepuluh sebab kematian utama (disahkan dan tidak disahkan secara perubatan) Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
0 NaN mengikut jantina, Malaysia, 2019 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Table 1.6: Ten principal causes of death (medi... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Jumlah/Total NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN Disahkan secara perubatan Bil. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN Medically certified No. % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 1. Ischaemic heart diseases 16325 14.954564 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 2. Pneumonia 13356 12.234803 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 3. Cerebrovascular diseases 8691 7.961416 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 4. Transport accidents 4198 3.84559 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 5. Malignant neoplasm of trachea, bronchus and lung 2602 2.38357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 6. Chronic lower respiratory diseases 2312 2.117914 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 7. Diabetes mellitus 2008 1.839434 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 8. Malignant neoplasm of colon, rectum and anus 1873 1.715767 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 9. Malignant neoplasm of breast 1869 1.712103 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 10. Diseases of the liver 1601 1.466601 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 NaN Keseluruhan sebab ... 109164 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN Tidak disahkan secara perubatan Bil. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 NaN Non-medically certified No. % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 1. Sakit tua 65 tahun dan lebih ... 30665 47.482271 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 2. Darah tinggi ... 2393 3.705367 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 3. Kencing manis ... 1997 3.092193 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 4. Penyakit serebrovaskular ... 1708 2.6447 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 5. Lelah ... 1076 1.66455 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 6. Penyakit jantung iskemia ... 1023 1.584033 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 7. Barah trakea, bronkus dan paru-paru ... 742 1.148927 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 8. Barah kolon, rektum dan dubur ... 715 1.10712 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
30 9. Barah hati ... 533 0.825307 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31 10. Barah payu dara ... 524 0.811372 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 NaN Keseluruhan sebab ... 64582 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Remove unwanted rows and columns

df = df.iloc[:,1:]
df = df.iloc[:,:-11]
df = df.iloc[7: , :]
df
Unnamed: 1 Unnamed: 2 Unnamed: 3
7 Ischaemic heart diseases 16325 14.954564
8 Pneumonia 13356 12.234803
9 Cerebrovascular diseases 8691 7.961416
10 Transport accidents 4198 3.84559
11 Malignant neoplasm of trachea, bronchus and lung 2602 2.38357
12 Chronic lower respiratory diseases 2312 2.117914
13 Diabetes mellitus 2008 1.839434
14 Malignant neoplasm of colon, rectum and anus 1873 1.715767
15 Malignant neoplasm of breast 1869 1.712103
16 Diseases of the liver 1601 1.466601
17 Keseluruhan sebab ... 109164 NaN
18 NaN NaN NaN
19 Tidak disahkan secara perubatan Bil. NaN
20 Non-medically certified No. %
21 NaN NaN NaN
22 Sakit tua 65 tahun dan lebih ... 30665 47.482271
23 Darah tinggi ... 2393 3.705367
24 Kencing manis ... 1997 3.092193
25 Penyakit serebrovaskular ... 1708 2.6447
26 Lelah ... 1076 1.66455
27 Penyakit jantung iskemia ... 1023 1.584033
28 Barah trakea, bronkus dan paru-paru ... 742 1.148927
29 Barah kolon, rektum dan dubur ... 715 1.10712
30 Barah hati ... 533 0.825307
31 Barah payu dara ... 524 0.811372
32 Keseluruhan sebab ... 64582 NaN

Rename Columns

dict = {'Unnamed: 1': 'death_cause',
        'Unnamed: 2': 'death_count',
        'Unnamed: 3': 'percentage',
       }

df.rename(columns=dict,
          inplace=True)

Split tables

df_med = df.iloc[0:10 , :]
df_xmed = df.iloc[15:25 , :]
df
death_cause death_count percentage
7 Ischaemic heart diseases 16325 14.954564
8 Pneumonia 13356 12.234803
9 Cerebrovascular diseases 8691 7.961416
10 Transport accidents 4198 3.84559
11 Malignant neoplasm of trachea, bronchus and lung 2602 2.38357
12 Chronic lower respiratory diseases 2312 2.117914
13 Diabetes mellitus 2008 1.839434
14 Malignant neoplasm of colon, rectum and anus 1873 1.715767
15 Malignant neoplasm of breast 1869 1.712103
16 Diseases of the liver 1601 1.466601
17 Keseluruhan sebab ... 109164 NaN
18 NaN NaN NaN
19 Tidak disahkan secara perubatan Bil. NaN
20 Non-medically certified No. %
21 NaN NaN NaN
22 Sakit tua 65 tahun dan lebih ... 30665 47.482271
23 Darah tinggi ... 2393 3.705367
24 Kencing manis ... 1997 3.092193
25 Penyakit serebrovaskular ... 1708 2.6447
26 Lelah ... 1076 1.66455
27 Penyakit jantung iskemia ... 1023 1.584033
28 Barah trakea, bronkus dan paru-paru ... 742 1.148927
29 Barah kolon, rektum dan dubur ... 715 1.10712
30 Barah hati ... 533 0.825307
31 Barah payu dara ... 524 0.811372
32 Keseluruhan sebab ... 64582 NaN

Certification Label

df_med['certification'] = 'medical'
df_xmed['certification'] = 'non-medical'
<ipython-input-231-738d62170e63>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_med['certification'] = 'medical'
<ipython-input-231-738d62170e63>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_xmed['certification'] = 'non-medical'
df_xmed['death_cause'] = df_xmed.death_cause.str.replace('  +', '/')
<ipython-input-232-58ca563d90ad>:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df_xmed['death_cause'] = df_xmed.death_cause.str.replace('  +', '/')
<ipython-input-232-58ca563d90ad>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_xmed['death_cause'] = df_xmed.death_cause.str.replace('  +', '/')
df_xmed['death_cause']=df_xmed.death_cause.str.replace(".*/", '')
<ipython-input-233-d35e3569a3bb>:1: FutureWarning: The default value of regex will change from True to False in a future version.
  df_xmed['death_cause']=df_xmed.death_cause.str.replace(".*/", '')
<ipython-input-233-d35e3569a3bb>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_xmed['death_cause']=df_xmed.death_cause.str.replace(".*/", '')

Combine DataFrames

df = df_med.append(df_xmed,ignore_index = True)

Reset Index

df.reset_index(drop=True,inplace=True)
df
death_cause death_count percentage certification
0 Ischaemic heart diseases 16325 14.954564 medical
1 Pneumonia 13356 12.234803 medical
2 Cerebrovascular diseases 8691 7.961416 medical
3 Transport accidents 4198 3.84559 medical
4 Malignant neoplasm of trachea, bronchus and lung 2602 2.38357 medical
5 Chronic lower respiratory diseases 2312 2.117914 medical
6 Diabetes mellitus 2008 1.839434 medical
7 Malignant neoplasm of colon, rectum and anus 1873 1.715767 medical
8 Malignant neoplasm of breast 1869 1.712103 medical
9 Diseases of the liver 1601 1.466601 medical
10 Old age 65 years and over 30665 47.482271 non-medical
11 Hypertension 2393 3.705367 non-medical
12 Diabetes mellitus 1997 3.092193 non-medical
13 Cerebrovascular diseases 1708 2.6447 non-medical
14 Asthma 1076 1.66455 non-medical
15 Ischaemic heart diseases 1023 1.584033 non-medical
16 Trachea, bronchus and lung cancer 742 1.148927 non-medical
17 Colon, rectum and anus cancer 715 1.10712 non-medical
18 Liver cancer 533 0.825307 non-medical
19 Breast cancer 524 0.811372 non-medical

Export Cleaned Table

df.to_csv('cleaned_data/death_causes_2019.csv',index=False)

Visualization

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Death by gender in 2019

df = pd.read_csv('cleaned_data/death_rate_by_age_group_and_sex_2019.csv')
df.head()
age_group total_death_group male_death_age female_death_age
0 Under 1 year 3117 1711 1406
1 Under 1 day 652 351 301
2 1 day to < 1 week 842 467 375
3 1 week to < 4 weeks 486 267 219
4 4 weeks to < 3 months 459 248 211
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title("Death by age group in Malaysia")


sns.barplot(data=df, x='age_group',y='total_death_group')
plt.xticks(rotation=90)
plt.show()

png

Description: From the graph, we can see the distribution of total death count by age group in Malaysia. From the graph, we can see that people in Malaysia mostly died at the age 70-74 which is nearly 20k deaths.

fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title("Male Death Age in Malaysia")


sns.barplot(data=df, x='age_group',y='male_death_age',palette='mako')
plt.xticks(rotation=90)
plt.show()

png

Description: From the graph, we can see the distribution of death count of males by age group in Malaysia. From the graph, we can see that men in Malaysia mostly died at the age 70-74 which is nearly 12k deaths.

fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title("Female Death Age in Malaysia")


sns.barplot(data=df, x='age_group',y='female_death_age',palette='plasma')
plt.xticks(rotation=90)
plt.show()

png

Description: From the graph, we can see the distribution of death count of females by age group in Malaysia. From the graph, we can see that women in Malaysia mostly died at the age 85 and over which is more than 12k deaths.

max_male_age = df.loc[df.male_death_age == df.male_death_age.max(), 'age_group'].values[0]
max_female_age = df.loc[df.female_death_age == df.female_death_age.max(), 'age_group'].values[0]

print('Most male in Malaysia died between '+max_male_age)
print('Most female in Malaysia died between '+max_female_age)
Most male in Malaysia died between 70-74 
Most female in Malaysia died between 85 and over
df['male_death_sum']=df['male_death_age'].sum()
df['female_death_sum']=df['female_death_age'].sum()
fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title("Death age by gender in Malaysia")
sns.barplot('age_group', 'value', hue='variable', palette='mako',
             data=pd.melt(df[['male_death_age','female_death_age','age_group']], 'age_group'))
plt.xticks(rotation=90)
plt.show()
C:\Users\Haziq\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

Description: From the graph, we can see the comparison of death count of females and males by age group in Malaysia. From the graph, we can see that more men died and live shorter in comparison women in Malaysia.

Death by gender in 2019

df = pd.read_csv('cleaned_data/death_causes_2019.csv')
df.tail()
death_cause death_count percentage certification
15 Ischaemic heart diseases 1023 1.584033 non-medical
16 Trachea, bronchus and lung cancer 742 1.148927 non-medical
17 Colon, rectum and anus cancer 715 1.107120 non-medical
18 Liver cancer 533 0.825307 non-medical
19 Breast cancer 524 0.811372 non-medical
fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title("Death causes by medical certification in Malaysia")
sns.barplot(data=df, x='death_count',y='death_cause',hue='certification',palette='mako')
plt.xticks(rotation=0)
plt.show()

png

Description: The bar chart shows the major cause of death in Malaysia, grouped into medical and non-medical certified cases. Here, we can see that most of the medically certified death is caused by isochaemic heart diseases. Old age 65 years and over is the most non-medically certified death in Malaysia.

df_med = df[df['certification']=='medical']
df_nmed = df[df['certification']=='non-medical']
df_nmed
death_cause death_count percentage certification
10 Old age 65 years and over 30665 47.482271 non-medical
11 Hypertension 2393 3.705367 non-medical
12 Diabetes mellitus 1997 3.092193 non-medical
13 Cerebrovascular diseases 1708 2.644700 non-medical
14 Asthma 1076 1.664550 non-medical
15 Ischaemic heart diseases 1023 1.584033 non-medical
16 Trachea, bronchus and lung cancer 742 1.148927 non-medical
17 Colon, rectum and anus cancer 715 1.107120 non-medical
18 Liver cancer 533 0.825307 non-medical
19 Breast cancer 524 0.811372 non-medical
import squarify

df_med.sort_values('death_count', ascending=False, inplace=True)

fig, ax = plt.subplots(1, figsize = (26,8))
squarify.plot(sizes=df_med['death_count'], 
                label=df_med['death_cause'][:10], 
                alpha=.8)
plt.axis('off')
plt.show()
<ipython-input-253-49e775037926>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_med.sort_values('death_count', ascending=False, inplace=True)

png

import squarify

df_nmed.sort_values('death_count', ascending=False, inplace=True)

fig, ax = plt.subplots(1, figsize = (26,8))
squarify.plot(sizes=df_nmed['death_count'], 
                label=df_nmed['death_cause'][:10], 
                alpha=.8)
plt.axis('off')
plt.show()
<ipython-input-254-61ce348f60b8>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nmed.sort_values('death_count', ascending=False, inplace=True)

png

Death by state in 2019

df = pd.read_csv('cleaned_data/death_rate_by_state_and_sex_2019.csv')
df.head()
state_death total_death male_death female_death total_death_rate male_death_rate female_death_rate
0 Johor 21776 12743 9033 5.789692 6.453459 5.056066
1 Kedah 14292 8027 6265 6.574828 7.287441 5.842794
2 Kelantan 11389 6436 4953 6.045617 6.759489 5.316083
3 Melaka 5542 3084 2458 5.969410 6.619759 5.314342
4 Negeri Sembilan 7157 4097 3060 6.355022 7.056639 5.608422
fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title(r"Death age by state and gender in Malaysia (2019)")
sns.barplot('state_death', 'value', hue='variable', palette='mako',
             data=pd.melt(df[['male_death','female_death','state_death']], 'state_death'))
plt.xticks(rotation=50)
plt.show()
C:\Users\Haziq\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

Description: From the chart, we can se that Selangor recorded the most death cases in Malaysia.In contrast, W.P. Putrajaya recorded the least death case. Overall, we can see that all of the states recorded more death for men than women.

fig = plt.figure(figsize=(10,5),facecolor='1')
plt.title(r"Death rate per 1000 people by state and gender in Malaysia (2019)")

sns.barplot('state_death', 'value', hue='variable', palette='mako',
             data=pd.melt(df[['male_death_rate','female_death_rate','total_death_rate','state_death']], 'state_death'))
plt.xticks(rotation=50)
plt.show()
C:\Users\Haziq\Anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

png

Description: The bar chart shows the death rate per 1000 people by state and gender in Malaysia. Here, we can see that Perlis recorded the highest total death rate for both gender in Malaysia. In contrast, W.P. Putrajaya have the lowest death rate for both genders. For all of the states, we can see that men have a higher death rate than women.