Lab 6: Machine Learning (Decision Tree)

Course: Artificial Intelligence
Tools: scikit-learn, pandas, numpy, matplotlib, seaborn
Duration: 3 hours

Background Knowledge

What Is a Decision Tree?

A Decision Tree is a Machine Learning algorithm that makes decisions through a hierarchy of if-else rules. At each step the model splits the data on the Feature that provides the most information. It repeats this until every group contains a single class (a Pure Node) or a stopping condition, such as a maximum depth, is reached.

Main structure of a Decision Tree:

Root Node (asks the first question)
├── Internal Node (asks a follow-up question)
│   ├── Leaf Node → prediction
│   └── Leaf Node → prediction
└── Internal Node
    ├── Leaf Node → prediction
    └── Leaf Node → prediction
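
To make the "hierarchy of if-else rules" concrete, the sketch below writes a tiny tree by hand. The rules and the function name `diagnose` are illustrative assumptions, not the output of a trained model. The labels are the ones used by the dataset later in this lab (สูง = high, ปกติ = normal).

```python
# Hand-written rules for illustration only; this is not a trained model.
def diagnose(blood_sugar: str, blood_pressure: str, fever: str) -> str:
    """Walk a tiny decision tree from the root down to a leaf."""
    if blood_sugar == "สูง":         # Root Node: first question
        return "เบาหวาน"            # Leaf Node
    if blood_pressure == "สูง":      # Internal Node: follow-up question
        return "ความดัน"            # Leaf Node
    return "ไข้หวัดใหญ่"             # Leaf Node


print(diagnose("ปกติ", "สูง", "ไม่มี"))
```

The `fever` argument is accepted but never consulted, which mirrors a property of real trees: a feature that no split uses has no influence on the prediction.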

Key Formulas

1. Entropy (the uncertainty in the data)

\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where:

S = the dataset
pᵢ = the proportion of samples in class i
c = the total number of classes
Entropy = 0 when the data is pure, that is, every sample belongs to the same class (terms with pᵢ = 0 are taken as 0)
Entropy reaches its maximum, \log_2 c, when the samples are spread evenly across all classes
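
The formula can be checked numerically with a few lines of Python. This is a minimal sketch: the helper name `entropy` and the toy label lists are assumptions made for illustration. The last call uses the class counts reported in the Dataset section (18, 14, 18).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    # Classes that do not occur never appear in the Counter, so log2(0) is never evaluated
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


pure = ["a"] * 8                      # one class only
balanced = ["a", "b", "c", "d"] * 2   # four equally frequent classes
lab_classes = ["เบาหวาน"] * 18 + ["ความดัน"] * 14 + ["ไข้หวัดใหญ่"] * 18

print(entropy(pure))          # zero for a pure set
print(entropy(balanced))      # log2(4) for four equally likely classes
print(entropy(lab_classes))   # slightly below log2(3), since the classes are almost balanced
```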

2. Information Gain (the benefit of splitting on Attribute A)

IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \cdot \mathrm{Entropy}(S_v)

where:

A = the Attribute (Feature) under consideration
Sᵥ = the subset of S in which Attribute A has value v

Rule: choose the Attribute with the highest Information Gain as the splitting Node.
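
As a worked example of the rule, the sketch below computes Information Gain for each Attribute of this lab's dataset. The per-value class counts were tallied by hand from the 50-record table in the Dataset section. The helper names `entropy` and `information_gain` are assumptions for illustration.

```python
import math


def entropy(counts):
    """Entropy (base 2) from a tuple of class counts; empty classes contribute 0."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder


# Counts are ordered (เบาหวาน, ความดัน, ไข้หวัดใหญ่)
parent = (18, 14, 18)
splits = {
    "น้ำตาลในเลือด": [(18, 0, 0), (0, 14, 18)],   # value สูง, value ปกติ
    "ความดันโลหิต": [(4, 14, 0), (14, 0, 18)],   # value สูง, value ปกติ
    "มีไข้": [(2, 3, 18), (16, 11, 0)],           # value มี, value ไม่มี
}

gains = {attr: information_gain(parent, kids) for attr, kids in splits.items()}
for attr, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{attr}: IG = {gain:.4f}")

best = max(gains, key=gains.get)
print("Split on:", best)
```

Under these counts, น้ำตาลในเลือด has the highest gain because its สูง branch is already pure, so it would be chosen as the Root Node.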

Dataset

Problem Context

The data covers 50 patients from a diagnostic clinic. For each patient, a physician recorded the symptoms and a diagnosis of one of three diseases:

Code	Disease	Description
`เบาหวาน`	Diabetes Mellitus	Blood sugar level is higher than normal
`ความดัน`	Hypertension	Blood pressure is above the normal threshold
`ไข้หวัดใหญ่`	Influenza	Viral infection of the respiratory system

Feature Descriptions (Symptoms)

Feature	Type	Possible values	Description
`น้ำตาลในเลือด`	Categorical	`สูง` (high), `ปกติ` (normal)	Fasting blood sugar level (FPG)
`ความดันโลหิต`	Categorical	`สูง` (high), `ปกติ` (normal)	Blood pressure level (Systolic BP)
`มีไข้`	Categorical	`มี` (present), `ไม่มี` (absent)	Fever, defined as body temperature ≥ 37.5°C
`โรค`	Categorical	`เบาหวาน`, `ความดัน`, `ไข้หวัดใหญ่`	Target Variable (diagnosis)

Data Table (50 Records)

No.	น้ำตาลในเลือด	ความดันโลหิต	มีไข้	โรค (Target)
1	สูง	ปกติ	ไม่มี	เบาหวาน
2	ปกติ	สูง	ไม่มี	ความดัน
3	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
4	สูง	ปกติ	ไม่มี	เบาหวาน
5	ปกติ	สูง	ไม่มี	ความดัน
6	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
7	สูง	สูง	ไม่มี	เบาหวาน
8	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
9	สูง	ปกติ	ไม่มี	เบาหวาน
10	ปกติ	สูง	ไม่มี	ความดัน
11	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
12	สูง	ปกติ	มี	เบาหวาน
13	ปกติ	สูง	ไม่มี	ความดัน
14	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
15	สูง	ปกติ	ไม่มี	เบาหวาน
16	ปกติ	สูง	มี	ความดัน
17	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
18	สูง	ปกติ	ไม่มี	เบาหวาน
19	ปกติ	สูง	ไม่มี	ความดัน
20	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
21	สูง	สูง	ไม่มี	เบาหวาน
22	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
23	สูง	ปกติ	ไม่มี	เบาหวาน
24	ปกติ	สูง	ไม่มี	ความดัน
25	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
26	สูง	ปกติ	ไม่มี	เบาหวาน
27	ปกติ	สูง	ไม่มี	ความดัน
28	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
29	สูง	ปกติ	มี	เบาหวาน
30	ปกติ	สูง	ไม่มี	ความดัน
31	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
32	สูง	ปกติ	ไม่มี	เบาหวาน
33	ปกติ	สูง	มี	ความดัน
34	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
35	สูง	สูง	ไม่มี	เบาหวาน
36	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
37	สูง	ปกติ	ไม่มี	เบาหวาน
38	ปกติ	สูง	ไม่มี	ความดัน
39	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
40	สูง	ปกติ	ไม่มี	เบาหวาน
41	ปกติ	สูง	ไม่มี	ความดัน
42	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
43	สูง	ปกติ	ไม่มี	เบาหวาน
44	ปกติ	สูง	มี	ความดัน
45	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
46	สูง	สูง	ไม่มี	เบาหวาน
47	ปกติ	ปกติ	มี	ไข้หวัดใหญ่
48	สูง	ปกติ	ไม่มี	เบาหวาน
49	ปกติ	สูง	ไม่มี	ความดัน
50	ปกติ	ปกติ	มี	ไข้หวัดใหญ่

Summary statistics:

Disease	Count	Share (%)
เบาหวาน	18	36%
ความดัน	14	28%
ไข้หวัดใหญ่	18	36%
Total	50	100%

Programming

Task 1: Import Libraries and Prepare the Data

Store the 50 records in a DataFrame, convert the Categorical values to Numeric with LabelEncoder, and display basic information about the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# ===== Build the dataset =====
data = {
    'น้ำตาลในเลือด': [
        'สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ',
        'ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ',
        'สูง','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ',
        'ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ','ปกติ','สูง',
        'ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ','ปกติ'
    ],
    'ความดันโลหิต': [
        'ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ','ปกติ','สูง',
        'ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ',
        'สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ','สูง',
        'ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ','ปกติ',
        'สูง','ปกติ','ปกติ','สูง','ปกติ','สูง','ปกติ','ปกติ','สูง','ปกติ'
    ],
    'มีไข้': [
        'ไม่มี','ไม่มี','มี','ไม่มี','ไม่มี','มี','ไม่มี','มี','ไม่มี','ไม่มี',
        'มี','มี','ไม่มี','มี','ไม่มี','มี','มี','ไม่มี','ไม่มี','มี',
        'ไม่มี','มี','ไม่มี','ไม่มี','มี','ไม่มี','ไม่มี','มี','มี','ไม่มี',
        'มี','ไม่มี','มี','มี','ไม่มี','มี','ไม่มี','ไม่มี','มี','ไม่มี',
        'ไม่มี','มี','ไม่มี','มี','มี','ไม่มี','มี','ไม่มี','ไม่มี','มี'
    ],
    'โรค': [
        'เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่',
        'เบาหวาน','ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน',
        'ความดัน','ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน',
        'ความดัน','ไข้หวัดใหญ่','เบาหวาน','ไข้หวัดใหญ่','เบาหวาน','ความดัน',
        'ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน','ความดัน',
        'ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน','ไข้หวัดใหญ่',
        'เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน','ความดัน','ไข้หวัดใหญ่',
        'เบาหวาน','ความดัน','ไข้หวัดใหญ่','เบาหวาน','ไข้หวัดใหญ่','เบาหวาน',
        'ความดัน','ไข้หวัดใหญ่'
    ]
}

df = pd.DataFrame(data)

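# ===== Encoding =====
# LabelEncoder assigns integer codes in sorted (Unicode code point) order of the labels.
# Keep one fitted encoder per column so each mapping can be inspected or reused later.
df_encoded = df.copy()
feature_cols = ['น้ำตาลในเลือด', 'ความดันโลหิต', 'มีไข้']
encoders = {}
for col in feature_cols:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df[col])

# Separate encoder for the target so class names can be recovered via le_target.classes_
le_target = LabelEncoder()
df_encoded['โรค'] = le_target.fit_transform(df['โรค'])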

print("=== ข้อมูล 5 แถวแรก (ก่อน Encode) ===")
print(df.head())
print("\n=== ข้อมูล 5 แถวแรก (หลัง Encode) ===")
print(df_encoded.head())
print("\n=== สถิติเบื้องต้น ===")
print(df_encoded.describe())
print("\n=== จำนวนผู้ป่วยแต่ละโรค ===")
print(df['โรค'].value_counts())

Example output:

=== ข้อมูล 5 แถวแรก (ก่อน Encode) ===
  น้ำตาลในเลือด ความดันโลหิต มีไข้        โรค
0           สูง        ปกติ  ไม่มี   เบาหวาน
1         ปกติ          สูง  ไม่มี    ความดัน
2         ปกติ        ปกติ      มี  ไข้หวัดใหญ่
3           สูง        ปกติ  ไม่มี   เบาหวาน
4         ปกติ          สูง  ไม่มี    ความดัน

=== ข้อมูล 5 แถวแรก (หลัง Encode) ===
   น้ำตาลในเลือด  ความดันโลหิต  มีไข้  โรค
0              1            0      1    1
1              0            1      1    0
2              0            0      0    2
3              1            0      1    1
4              0            1      1    0

=== สถิติเบื้องต้น ===
       น้ำตาลในเลือด  ความดันโลหิต      มีไข้        โรค
count      50.000000     50.000000  50.000000  50.000000
mean        0.360000      0.360000   0.540000   1.080000
std         0.484873      0.484873   0.503457   0.804071
min         0.000000      0.000000   0.000000   0.000000
25%         0.000000      0.000000   0.000000   0.000000
50%         0.000000      0.000000   1.000000   1.000000
75%         1.000000      1.000000   1.000000   2.000000
max         1.000000      1.000000   1.000000   2.000000

=== จำนวนผู้ป่วยแต่ละโรค ===
เบาหวาน       18
ไข้หวัดใหญ่   18
ความดัน       14
Name: โรค, dtype: int64

LabelEncoder numbers the labels in sorted Unicode order, so the codes are ปกติ=0 and สูง=1 for both น้ำตาลในเลือด and ความดันโลหิต, มี=0 and ไม่มี=1 for มีไข้, and ความดัน=0, เบาหวาน=1, ไข้หวัดใหญ่=2 for โรค. The exact layout of the value_counts footer depends on the pandas version.
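
Because the integer codes follow the sorted order of the labels rather than their meaning, it helps to print the mapping instead of assuming it. The sketch below fits a fresh LabelEncoder on each label set. The dictionary names `columns` and `mappings` are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Label sets as they appear in the lab dataset
columns = {
    "น้ำตาลในเลือด": ["สูง", "ปกติ"],
    "มีไข้": ["มี", "ไม่มี"],
    "โรค": ["เบาหวาน", "ความดัน", "ไข้หวัดใหญ่"],
}

mappings = {}
for name, labels in columns.items():
    enc = LabelEncoder().fit(labels)
    # classes_[i] is the label that receives integer code i
    mappings[name] = {str(label): code for code, label in enumerate(enc.classes_)}
    print(name, mappings[name])
```

For features, scikit-learn also provides OrdinalEncoder, which encodes several columns at once; LabelEncoder is documented as intended for the target.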

Task 2: Split the Data and Build Three Models

Split the data into 80% Training and 20% Testing, then build DecisionTreeClassifier models with criterion='entropy', trying three max_depth values: None, 3, and 1.

X = df_encoded[feature_cols].values
y = df_encoded['โรค'].values
class_names = le_target.classes_

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set : {X_train.shape[0]} samples")
print(f"Testing  set : {X_test.shape[0]} samples")

# Build three models that differ only in max_depth
models = {
    'max_depth=None': DecisionTreeClassifier(criterion='entropy', random_state=42),
    'max_depth=3':    DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42),
    'max_depth=1':    DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc  = accuracy_score(y_test,  model.predict(X_test))
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    results[name] = {
        'model': model, 'train_acc': train_acc,
        'test_acc': test_acc, 'cv_mean': cv_scores.mean(), 'cv_std': cv_scores.std()
    }

print(f"\n{'โมเดล':<18} {'Train Acc':>10} {'Test Acc':>10} {'CV Mean':>10} {'CV Std':>9}")
print("-" * 60)
for name, r in results.items():
    print(f"{name:<18} {r['train_acc']:>10.4f} {r['test_acc']:>10.4f} "
          f"{r['cv_mean']:>10.4f} {r['cv_std']:>9.4f}")

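Example output:

Training set : 40 samples
Testing  set : 10 samples

โมเดล              Train Acc   Test Acc    CV Mean    CV Std
------------------------------------------------------------
max_depth=None        1.0000     1.0000     1.0000    0.0000
max_depth=3           1.0000     1.0000     1.0000    0.0000
max_depth=1           0.7250     0.7000     0.7200    0.0400

In this dataset, น้ำตาลในเลือด = สูง occurs only with เบาหวาน, and among the remaining patients ความดันโลหิต separates ความดัน from ไข้หวัดใหญ่. Two levels of splits therefore classify every record, so max_depth=None and max_depth=3 grow the same tree, while max_depth=1 can only separate เบาหวาน from the other two diseases.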
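
To see what max_depth actually limits, the depth and leaf count of each fitted tree can be inspected with get_depth() and get_n_leaves(). The sketch below is self-contained and makes two assumptions. It rebuilds the 50 records from the six distinct symptom patterns tallied from the data table, and it fits on all 50 rows rather than on the training split. The names `patterns` and `shapes` and the hand-written 0/1 codes are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# (sugar, pressure, fever) codes written by hand for this sketch:
# high=1 / normal=0, fever present=1 / absent=0; third field is the record count
patterns = [
    ((1, 0, 0), "เบาหวาน", 12),
    ((1, 1, 0), "เบาหวาน", 4),
    ((1, 0, 1), "เบาหวาน", 2),
    ((0, 1, 0), "ความดัน", 11),
    ((0, 1, 1), "ความดัน", 3),
    ((0, 0, 1), "ไข้หวัดใหญ่", 18),
]
X = np.array([f for f, _, n in patterns for _ in range(n)])
y = np.array([d for _, d, n in patterns for _ in range(n)])

shapes = {}
for depth in (None, 3, 1):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                  random_state=42).fit(X, y)
    shapes[depth] = (tree.get_depth(), tree.get_n_leaves(), tree.score(X, y))
    print(f"max_depth={depth}: depth={shapes[depth][0]}, "
          f"leaves={shapes[depth][1]}, accuracy={shapes[depth][2]:.2f}")
```

max_depth is an upper bound, not a target: a tree stops growing earlier once every leaf is pure.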

Task 3: Evaluate the Best Model

Use the max_depth=3 model, which reaches high Test Accuracy without overfitting, and show its Classification Report and a Confusion Matrix heatmap.

best_model = results['max_depth=3']['model']
y_pred = best_model.predict(X_test)

print("=== Classification Report (max_depth=3) ===")
print(classification_report(y_test, y_pred, target_names=class_names))

cm = confusion_matrix(y_test, y_pred)
print("=== Confusion Matrix ===")
print(cm)

# Draw the confusion matrix heatmap
# (Thai tick labels render only if matplotlib uses a font that contains Thai glyphs)
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel('ทำนาย (Predicted)', fontsize=12)
ax.set_ylabel('จริง (Actual)', fontsize=12)
ax.set_title('Confusion Matrix — max_depth=3', fontsize=13)
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=100)
plt.show()
print("บันทึกภาพ: confusion_matrix.png")

Example output (classes appear in the order stored in le_target.classes_):

=== Classification Report (max_depth=3) ===
              precision    recall  f1-score   support

     ความดัน       1.00      1.00      1.00         3
     เบาหวาน       1.00      1.00      1.00         4
 ไข้หวัดใหญ่       1.00      1.00      1.00         3

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10

=== Confusion Matrix ===
[[3 0 0]
 [0 4 0]
 [0 0 3]]
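
The columns of a Classification Report can be derived directly from a Confusion Matrix. The sketch below uses a hypothetical 3-class matrix with one misclassified sample, chosen only for illustration, and computes precision, recall, F1, and accuracy with numpy.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[3, 1, 0],
               [0, 3, 0],
               [0, 0, 3]])

tp = np.diag(cm)                     # correctly classified samples per class
precision = tp / cm.sum(axis=0)      # column sums = samples predicted as each class
recall = tp / cm.sum(axis=1)         # row sums = actual samples of each class (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

for i in range(len(cm)):
    print(f"class {i}: precision={precision[i]:.2f} "
          f"recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way shows which class loses recall (the row with an off-diagonal entry) and which loses precision (the column that receives it).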

Task 4: Show the Decision Tree Structure and Feature Importance

Show the decision rules as text and as a plot, together with a bar chart of each Feature's importance.

# --- Text Representation ---
print("=== โครงสร้าง Decision Tree (Text) ===")
# class_names (scikit-learn >= 1.3) prints disease names instead of integer class codes
tree_text = export_text(best_model, feature_names=feature_cols,
                        class_names=list(class_names))
print(tree_text)

# --- Tree plot ---
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(best_model,
          feature_names=feature_cols,
          class_names=list(class_names),
          filled=True, rounded=True,
          fontsize=11, ax=ax)
ax.set_title('Decision Tree (max_depth=3, criterion=entropy)', fontsize=13)
plt.tight_layout()
plt.savefig('decision_tree.png', dpi=100)
plt.show()
print("บันทึกภาพ: decision_tree.png")

# --- Feature Importance Bar Chart ---
importances = best_model.feature_importances_
sorted_idx  = np.argsort(importances)[::-1]

print("\n=== Feature Importance ===")
for i in sorted_idx:
    bar = '█' * int(importances[i] * 40)
    print(f"  {feature_cols[i]:<15} {bar} {importances[i]:.4f}")

fig2, ax2 = plt.subplots(figsize=(7, 4))
colors = ['#d79921' if j == 0 else '#83a598' for j in range(len(sorted_idx))]
ax2.barh([feature_cols[i] for i in sorted_idx],
         [importances[i] for i in sorted_idx],
         color=colors, alpha=0.85)
ax2.set_xlabel('Importance Score', fontsize=11)
ax2.set_title('Feature Importance — Decision Tree', fontsize=12)
ax2.invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=100)
plt.show()
print("บันทึกภาพ: feature_importance.png")

Example output:

=== โครงสร้าง Decision Tree (Text) ===
|--- น้ำตาลในเลือด <= 0.50
|   |--- ความดันโลหิต <= 0.50
|   |   |--- class: ไข้หวัดใหญ่
|   |--- ความดันโลหิต >  0.50
|   |   |--- class: ความดัน
|--- น้ำตาลในเลือด >  0.50
|   |--- class: เบาหวาน

=== Feature Importance ===
  น้ำตาลในเลือด   ███████████████████████ 0.5938
  ความดันโลหิต    ████████████████ 0.4062
  มีไข้            0.0000
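
The importance scores are normalized impurity decreases: each split contributes the entropy it removes, weighted by the number of samples that reach it. The sketch below recomputes them from the fitted tree's `tree_` arrays and compares the result with `feature_importances_`. It is self-contained, rebuilding the 50 records from the six patterns in the data table and fitting on all rows rather than the training split, so its values need not match those of a model trained on 40 samples. The names `patterns` and `manual` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# (sugar, pressure, fever) codes written by hand: high=1 / normal=0, fever present=1 / absent=0
patterns = [
    ((1, 0, 0), "เบาหวาน", 12), ((1, 1, 0), "เบาหวาน", 4), ((1, 0, 1), "เบาหวาน", 2),
    ((0, 1, 0), "ความดัน", 11), ((0, 1, 1), "ความดัน", 3),
    ((0, 0, 1), "ไข้หวัดใหญ่", 18),
]
X = np.array([f for f, _, n in patterns for _ in range(n)])
y = np.array([d for _, d, n in patterns for _ in range(n)])

clf = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf: no split, no contribution
        continue
    # Sample-weighted entropy before the split minus after the split
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease
manual /= manual.sum()      # normalize so the scores sum to 1

print("manual      :", np.round(manual, 4))
print("scikit-learn:", np.round(clf.feature_importances_, 4))
```

A feature that no split uses, here the fever column, receives an importance of exactly 0.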

Task 5: Predict New Patients

Use the best model to diagnose three new patients and show the predicted probability of each disease.

# New patient data (encoded values in 'X')
# LabelEncoder numbers labels in sorted Unicode order, so the codes are:
# sugar: ปกติ=0, สูง=1  |  pressure: ปกติ=0, สูง=1  |  fever: มี=0, ไม่มี=1
new_patients = {
    'ผู้ป่วย A': {'น้ำตาลในเลือด': 'สูง',  'ความดันโลหิต': 'ปกติ', 'มีไข้': 'ไม่มี', 'X': [1, 0, 1]},
    'ผู้ป่วย B': {'น้ำตาลในเลือด': 'ปกติ', 'ความดันโลหิต': 'สูง',  'มีไข้': 'ไม่มี', 'X': [0, 1, 1]},
    'ผู้ป่วย C': {'น้ำตาลในเลือด': 'ปกติ', 'ความดันโลหิต': 'ปกติ', 'มีไข้': 'มี',    'X': [0, 0, 0]},
}

print("=== ผลการวินิจฉัยผู้ป่วยใหม่ ===\n")
for patient, info in new_patients.items():
    X_new   = np.array([info['X']])
    pred    = best_model.predict(X_new)[0]
    proba   = best_model.predict_proba(X_new)[0]
    disease = class_names[pred]

    print(f"{'─' * 48}")
    print(f"  {patient}")
    print(f"  อาการ    : น้ำตาล={info['น้ำตาลในเลือด']}, "
          f"ความดัน={info['ความดันโลหิต']}, ไข้={info['มีไข้']}")
    print(f"  วินิจฉัย : ✅ {disease}")
    print(f"  ความน่าจะเป็น :")
    for cname, p in zip(class_names, proba):
        bar = '█' * int(p * 20)
        print(f"    {cname:<14} {bar:<20} {p:.2%}")
print(f"{'─' * 48}")

Example output (probabilities are listed in the order of le_target.classes_):

=== ผลการวินิจฉัยผู้ป่วยใหม่ ===

────────────────────────────────────────────────
  ผู้ป่วย A
  อาการ    : น้ำตาล=สูง, ความดัน=ปกติ, ไข้=ไม่มี
  วินิจฉัย : ✅ เบาหวาน
  ความน่าจะเป็น :
    ความดัน                             0.00%
    เบาหวาน        ████████████████████ 100.00%
    ไข้หวัดใหญ่                          0.00%
────────────────────────────────────────────────
  ผู้ป่วย B
  อาการ    : น้ำตาล=ปกติ, ความดัน=สูง, ไข้=ไม่มี
  วินิจฉัย : ✅ ความดัน
  ความน่าจะเป็น :
    ความดัน        ████████████████████ 100.00%
    เบาหวาน                             0.00%
    ไข้หวัดใหญ่                          0.00%
────────────────────────────────────────────────
  ผู้ป่วย C
  อาการ    : น้ำตาล=ปกติ, ความดัน=ปกติ, ไข้=มี
  วินิจฉัย : ✅ ไข้หวัดใหญ่
  ความน่าจะเป็น :
    ความดัน                             0.00%
    เบาหวาน                             0.00%
    ไข้หวัดใหญ่    ████████████████████ 100.00%
────────────────────────────────────────────────
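
Typing encoded values by hand is error-prone, because the codes depend on how the encoder sorted the labels. A safer pattern is to pass the raw symptom strings through the same fitted encoder that prepared the training data. The sketch below illustrates this with OrdinalEncoder for the features and LabelEncoder for the target. It is self-contained, rebuilding the 50 records from the six patterns in the data table and fitting on all rows rather than the training split. The names `patterns`, `feature_encoder`, and `target_encoder` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# (sugar, pressure, fever) as raw strings, disease, record count
patterns = [
    (("สูง", "ปกติ", "ไม่มี"), "เบาหวาน", 12),
    (("สูง", "สูง", "ไม่มี"), "เบาหวาน", 4),
    (("สูง", "ปกติ", "มี"), "เบาหวาน", 2),
    (("ปกติ", "สูง", "ไม่มี"), "ความดัน", 11),
    (("ปกติ", "สูง", "มี"), "ความดัน", 3),
    (("ปกติ", "ปกติ", "มี"), "ไข้หวัดใหญ่", 18),
]
X_raw = np.array([f for f, _, n in patterns for _ in range(n)])
y_raw = np.array([d for _, d, n in patterns for _ in range(n)])

# Fit the encoders once, then reuse them for every later transform
feature_encoder = OrdinalEncoder().fit(X_raw)
target_encoder = LabelEncoder().fit(y_raw)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
clf.fit(feature_encoder.transform(X_raw), target_encoder.transform(y_raw))

# New patients described with raw labels, not hand-typed codes
new_patients = np.array([
    ["สูง", "ปกติ", "ไม่มี"],
    ["ปกติ", "สูง", "ไม่มี"],
    ["ปกติ", "ปกติ", "มี"],
])
codes = clf.predict(feature_encoder.transform(new_patients))
diagnoses = target_encoder.inverse_transform(codes).tolist()
print(diagnoses)
```

With this pattern a label that was never seen during fitting raises an error at transform time instead of silently producing a wrong code.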
