reveal.js

# Unsupervised Learning: Clustering
**Author:** อรรถพล คงหวาน

---

# Table of Contents

1. Introduction to Unsupervised Learning and Clustering
2. K-Means Clustering
3. Hierarchical Clustering
4. DBSCAN
5. Evaluating Clustering Results
6. Summary and Comparison
7. References

---

# Introduction to Unsupervised Learning and Clustering

---

## 1.1 What Is Unsupervised Learning?

**Unsupervised Learning** is the process by which an algorithm learns from **unlabeled data**, with no predefined answers. The goal is to discover hidden structure, patterns, or relationships in the data.

| Property | Supervised | Unsupervised |
|-----------|-----------|--------------|
| Training data | Labeled | Unlabeled |
| Goal | Predict an output | Discover structure |
| Evaluation | Compare against ground truth | Internal metrics |
| Examples | Classification, Regression | Clustering, Dimensionality Reduction |

---

## 1.2 What Is Clustering?

**Clustering** partitions data into groups such that points within a group are similar to one another (high intra-cluster similarity) and points in different groups are dissimilar (low inter-cluster similarity).

**Properties of a good clustering:**
- **Cohesion:** members of the same cluster lie close together
- **Separation:** different clusters lie far apart
- **Scalability:** works well on large datasets
- **Ability to handle noise:** robust to outliers

---

## 1.3 Types of Clustering Algorithms

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'primaryBorderColor': '#fabd2f', 'lineColor': '#fabd2f', 'secondaryColor': '#3c3836', 'tertiaryColor': '#3c3836'}}}%%
flowchart TD
    A["🔵 Clustering Algorithms"]:::main --> B["Partitional"]:::partition
    A --> C["Hierarchical"]:::hier
    A --> D["Density-Based"]:::density
    A --> E["Model-Based"]:::model
    B --> B1["K-Means"]:::algo
    B --> B2["K-Medoids"]:::algo
    C --> C1["Agglomerative"]:::algo
    C --> C2["Divisive"]:::algo
    D --> D1["DBSCAN"]:::algo
    D --> D2["OPTICS"]:::algo
    E --> E1["HMM"]:::algo
    classDef main fill:#282828,stroke:#fabd2f,color:#ebdbb2,font-weight:bold
    classDef partition fill:#458588,stroke:#83a598,color:#ebdbb2
    classDef hier fill:#98971a,stroke:#b8bb26,color:#ebdbb2
    classDef density fill:#d79921,stroke:#fabd2f,color:#282828
    classDef model fill:#cc241d,stroke:#fb4934,color:#ebdbb2
    classDef algo fill:#3c3836,stroke:#928374,color:#ebdbb2
```

---

## 1.4 A Brief History of Clustering

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart LR
    subgraph era1["Era 1 (1950s-1960s)"]
        A1["1963: Ward's Method"]
        A2["1967: K-Means - MacQueen"]
    end
    subgraph era2["Era 2 (1970s-1980s)"]
        B1["1973: ISODATA - Dunn's Index"]
        B2["1984: Fuzzy C-Means"]
    end
    subgraph era3["Era 3 (1990s-2000s)"]
        C1["1996: DBSCAN"]
        C2["2002: Spectral Clustering"]
    end
    subgraph era4["Era 4 (2010s-present)"]
        D1["2010: Deep Clustering"]
        D2["2020s: Self-Supervised"]
    end
    era1 --> era2 --> era3 --> era4
    style era1 fill:#282828,stroke:#fabd2f,color:#ebdbb2
    style era2 fill:#3c3836,stroke:#83a598,color:#ebdbb2
    style era3 fill:#282828,stroke:#b8bb26,color:#ebdbb2
    style era4 fill:#3c3836,stroke:#fb4934,color:#ebdbb2
```

---

# K-Means Clustering

---

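## 2.1 Core Idea of K-Means

**K-Means Clustering** is a **partitional** algorithm that divides N data points into K clusters. Each cluster has a **centroid**, and every point is assigned to the cluster whose centroid is nearest.

**Objective Function (Within-Cluster Sum of Squares - WCSS):**

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>J</mi><mo>=</mo>
  <munderover><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover>
  <munder><mo>∑</mo><mrow><msub><mi>x</mi><mi>i</mi></msub><mo>∈</mo><msub><mi>C</mi><mi>k</mi></msub></mrow></munder>
  <msup><mrow><mo>‖</mo><msub><mi>x</mi><mi>i</mi></msub><mo>-</mo><msub><mi>μ</mi><mi>k</mi></msub><mo>‖</mo></mrow><mn>2</mn></msup>
</math>

- **J** = the objective function to be minimized
- **C_k** = the set of points assigned to cluster k
- **μ_k** = centroid of cluster k
- **‖x_i − μ_k‖²** = squared Euclidean distance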
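
---

## 2.1 WCSS: Code Sketch

The objective above can be evaluated directly with NumPy. This is a minimal sketch; the function name `wcss` and the four toy points are illustrative assumptions, not part of the original example.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]          # (n, d) offsets to the assigned centroid
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two tight groups on a line
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

J = wcss(X, labels, centroids)
print(J)
```

Every point lies one unit from its centroid, so each contributes 1 to J. In scikit-learn the same quantity is exposed as `KMeans.inertia_`.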

---

## 2.2 K-Means Algorithm Steps

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart TD
    S["🟡 Start - choose the number of clusters K"]:::start
    S --> I["Initialize K centroids randomly"]:::step
    I --> A["Assignment Step - assign each point to its nearest centroid"]:::step
    A --> U["Update Step - recompute the mean of each cluster"]:::step
    U --> C{"Did the centroids change?"}:::decision
    C -->|"Yes"| A
    C -->|"No (Converged)"| E["🟢 End - result: K clusters"]:::done
    classDef start fill:#d79921,stroke:#fabd2f,color:#282828,font-weight:bold
    classDef step fill:#458588,stroke:#83a598,color:#ebdbb2
    classDef decision fill:#98971a,stroke:#b8bb26,color:#ebdbb2
    classDef done fill:#689d6a,stroke:#8ec07c,color:#282828,font-weight:bold
```

---

## 2.3 Worked K-Means Example (6 points, K=2)

**Initial centroids:** μ₁ = A(1,1), μ₂ = D(5,7)

| Point | d(point, μ₁) | d(point, μ₂) | Cluster |
|-----|-----------|-----------|---------|
| A(1,1) | **0.00** | 7.21 | C₁ |
| B(1.5,2) | **1.12** | 6.10 | C₁ |
| C(3,4) | 3.61 | 3.61 | C₁* |
| D(5,7) | 7.21 | **0.00** | C₂ |
| E(3.5,5) | 4.72 | **2.50** | C₂ |
| F(4.5,5) | 5.32 | **2.06** | C₂ |

\* C is equidistant from both centroids; the tie is broken in favor of C₁.
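
---

## 2.3 Assignment Step: Code Sketch

The distance table can be reproduced in a few lines of NumPy. This is a sketch using the six points and the two initial centroids from the example; the variable names are illustrative.

```python
import numpy as np

points = {"A": (1, 1), "B": (1.5, 2), "C": (3, 4),
          "D": (5, 7), "E": (3.5, 5), "F": (4.5, 5)}
X = np.array(list(points.values()), dtype=float)
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])   # mu_1 = A, mu_2 = D

# dist[i, k] = Euclidean distance from point i to centroid k
dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
# argmin returns the first minimum, so the tie for C goes to cluster 0 (C1)
assignment = dist.argmin(axis=1)

for name, d, c in zip(points, dist, assignment):
    print(f"{name}: d1={d[0]:.2f}  d2={d[1]:.2f}  -> C{c + 1}")
```

Because `argmin` picks the first minimum, the tie at point C resolves to the first cluster, matching the table. A different tie-breaking rule would be equally valid.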

---

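## 2.3 Update Centroid (Iteration 1)

**Formula for the new centroid:**

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msub><mi>μ</mi><mi>k</mi></msub><mo>=</mo>
  <mfrac><mn>1</mn><mrow><mo>|</mo><msub><mi>C</mi><mi>k</mi></msub><mo>|</mo></mrow></mfrac>
  <munder><mo>∑</mo><mrow><msub><mi>x</mi><mi>i</mi></msub><mo>∈</mo><msub><mi>C</mi><mi>k</mi></msub></mrow></munder>
  <msub><mi>x</mi><mi>i</mi></msub>
</math>

**Result of Iteration 1:**
- C₁ = {A, B, C} → μ₁ = ((1+1.5+3)/3, (1+2+4)/3) = **(1.83, 2.33)**
- C₂ = {D, E, F} → μ₂ = ((5+3.5+4.5)/3, (7+5+5)/3) = **(4.33, 5.67)**

**Iteration 2:** Repeat the assignment step with the updated centroids. No point changes cluster, so the algorithm has **converged**.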
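
---

## 2.3 Full Iteration: Code Sketch

The assignment and update steps can be combined into a short loop. This sketch runs on the six example points; the cap of 10 iterations is an arbitrary assumption, and empty clusters are not handled because none occur on this data.

```python
import numpy as np

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5]], dtype=float)
centroids = X[[0, 3]].copy()                 # start from A and D, as in the example

for iteration in range(1, 11):               # hypothetical cap of 10 iterations
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)             # assignment step
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])  # update step
    if np.allclose(new_centroids, centroids):
        break                                # centroids stopped moving: converged
    centroids = new_centroids

print(iteration, labels, centroids.round(2))
```

On this data the loop stops in the second iteration with centroids (1.83, 2.33) and (4.33, 5.67), in agreement with the hand calculation.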

---

## 2.4 Python Code Example: K-Means

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.8, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow Method
inertia_values = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertia_values.append(km.inertia_)
```

---

## 2.4 Python Code Example: K-Means (continued)

```python
# Build the K-Means model with K-Means++ initialization
best_k = 4
kmeans = KMeans(
    n_clusters=best_k,
    init='k-means++',   # K-Means++ seeding instead of purely random seeding
    n_init=10,
    max_iter=300,
    random_state=42
)
labels = kmeans.fit_predict(X_scaled)

print(f"WCSS: {kmeans.inertia_:.4f}")
print(f"Iterations: {kmeans.n_iter_}")
for i in range(best_k):
    print(f"Cluster {i}: {(labels==i).sum()} points")
```

---

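## 2.5 K-Means++: Better Centroid Initialization

**Problem:** Purely random initial centroids can lead to slow convergence or a poor local optimum.

**K-Means++ addresses this by** choosing each subsequent centroid with probability proportional to the **squared distance** D(x)²:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>P</mi><mo>(</mo><mi>x</mi><mo>)</mo><mo>=</mo>
  <mfrac>
    <mrow><mi>D</mi><msup><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow><mn>2</mn></msup></mrow>
    <mrow><munder><mo>∑</mo><mrow><msup><mi>x</mi><mo>′</mo></msup><mo>∈</mo><mi>X</mi></mrow></munder><mi>D</mi><msup><mrow><mo>(</mo><msup><mi>x</mi><mo>′</mo></msup><mo>)</mo></mrow><mn>2</mn></msup></mrow>
  </mfrac>
</math>

**D(x)** = distance from point x to the nearest centroid already chosen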
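
---

## 2.5 K-Means++ Seeding: Code Sketch

The seeding rule can be sketched as follows. The helper `kmeans_pp_init`, the seed, and the toy data are assumptions made for illustration; scikit-learn's `init='k-means++'` uses a more refined variant that tries several candidates per step.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using D(x)^2 weighting."""
    n = len(X)
    centroids = [X[rng.integers(n)]]                 # first centroid: uniform at random
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        probs = d2 / d2.sum()                        # P(x) = D(x)^2 / sum of D(x')^2
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)                       # seed chosen only for reproducibility
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
init = kmeans_pp_init(X, k=2, rng=rng)
print(init)
```

A point that is already a centroid has D(x) = 0 and therefore zero probability of being picked again, while far-away points are strongly favored.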

---

## 2.6 Strengths, Weaknesses, and Limitations of K-Means

| Aspect | Details |
|------|-----------|
| ✅ Strength | Simple, fast, and scalable to large datasets |
| ✅ Strength | Guaranteed to converge (to a local optimum, not necessarily the global one) |
| ❌ Weakness | K must be chosen in advance |
| ❌ Weakness | Sensitive to outliers and to the initial centroids |
| ❌ Weakness | Assumes clusters are roughly spherical |
| ❌ Weakness | Poorly suited to clusters of differing sizes or shapes |

---

# Hierarchical Clustering

---

## 3.1 Core Idea of Hierarchical Clustering

**Hierarchical Clustering** builds a hierarchy of clusters, visualized as a **dendrogram**. There are two main approaches:

- **Agglomerative (Bottom-Up):** start with every point as its own cluster, then repeatedly merge the two closest clusters
- **Divisive (Top-Down):** start with all points in a single cluster, then repeatedly split

**Advantages:**
- The number of clusters does not have to be fixed in advance
- Reveals hierarchical relationships in the data

---

## 3.2 Linkage Methods

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart TD
    L["Linkage Methods - how the distance between clusters is measured"]:::main
    L --> SL["Single Linkage - (Minimum) - distance between the closest pair"]:::link
    L --> CL["Complete Linkage - (Maximum) - distance between the farthest pair"]:::link
    L --> AL["Average Linkage - (UPGMA) - mean distance over all pairs"]:::link
    L --> WL["Ward's Linkage - minimizes the increase in total WCSS"]:::link
    SL --> SD["⚠️ Chain Effect"]:::note
    CL --> CD["✅ Compact clusters, avoids chaining"]:::note
    AL --> AD["✅ A compromise between Single and Complete"]:::note
    WL --> WD["⭐ Most widely used"]:::note
    classDef main fill:#282828,stroke:#fabd2f,color:#ebdbb2,font-weight:bold
    classDef link fill:#458588,stroke:#83a598,color:#ebdbb2
    classDef note fill:#3c3836,stroke:#928374,color:#a89984,font-size:11px
```

---

## 3.2 Linkage Method Formulas

**Single Linkage:**
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>d</mi><mo>(</mo><mi>A</mi><mo>,</mo><mi>B</mi><mo>)</mo><mo>=</mo>
  <munder><mo movablelimits="true">min</mo><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>∈</mo><mi>A</mi><mo>,</mo><msub><mi>b</mi><mi>j</mi></msub><mo>∈</mo><mi>B</mi></mrow></munder>
  <mi>d</mi><mo>(</mo><msub><mi>a</mi><mi>i</mi></msub><mo>,</mo><msub><mi>b</mi><mi>j</mi></msub><mo>)</mo>
</math>

**Complete Linkage:**
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>d</mi><mo>(</mo><mi>A</mi><mo>,</mo><mi>B</mi><mo>)</mo><mo>=</mo>
  <munder><mo movablelimits="true">max</mo><mrow><msub><mi>a</mi><mi>i</mi></msub><mo>∈</mo><mi>A</mi><mo>,</mo><msub><mi>b</mi><mi>j</mi></msub><mo>∈</mo><mi>B</mi></mrow></munder>
  <mi>d</mi><mo>(</mo><msub><mi>a</mi><mi>i</mi></msub><mo>,</mo><msub><mi>b</mi><mi>j</mi></msub><mo>)</mo>
</math>

**Ward's Linkage:**
<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>d</mi><mo>(</mo><mi>A</mi><mo>,</mo><mi>B</mi><mo>)</mo><mo>=</mo>
  <mfrac><mrow><mo>|</mo><mi>A</mi><mo>|</mo><mo>⋅</mo><mo>|</mo><mi>B</mi><mo>|</mo></mrow><mrow><mo>|</mo><mi>A</mi><mo>|</mo><mo>+</mo><mo>|</mo><mi>B</mi><mo>|</mo></mrow></mfrac>
  <msup><mrow><mo>‖</mo><msub><mi>μ</mi><mi>A</mi></msub><mo>-</mo><msub><mi>μ</mi><mi>B</mi></msub><mo>‖</mo></mrow><mn>2</mn></msup>
</math>
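
---

## 3.2 Linkage Distances: Code Sketch

The linkage formulas can be checked numerically on two small clusters. The clusters `A` and `B` below are hypothetical, and average linkage is included as the mean over all pairs, as described in the linkage diagram.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])      # hypothetical cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])      # hypothetical cluster B

pair = cdist(A, B)                           # all pairwise Euclidean distances a_i -> b_j
single = pair.min()                          # closest pair
complete = pair.max()                        # farthest pair
average = pair.mean()                        # mean over all pairs (UPGMA)

mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
# Ward: |A||B| / (|A| + |B|) * squared distance between the centroids
ward = len(A) * len(B) / (len(A) + len(B)) * np.sum((mu_a - mu_b) ** 2)

print(single, complete, average, ward)
```

Note that the Ward value is on a squared-distance scale, so it is not directly comparable with the other three.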

---

## 3.3 Worked Agglomerative Example (5 points)

**Initial distance matrix:**

|   | P1 | P2 | P3 | P4 | P5 |
|---|----|----|----|----|-----|
| P1 | 0 | 1.00 | 1.00 | 7.07 | 7.81 |
| P2 | 1.00 | 0 | 1.41 | 5.66 | 6.40 |
| P3 | 1.00 | 1.41 | 0 | 6.40 | 7.07 |
| P4 | 7.07 | 5.66 | 6.40 | 0 | 1.00 |
| P5 | 7.81 | 6.40 | 7.07 | 1.00 | 0 |

**Steps:** merge P1 + P2 → {P1,P2} → add P3 → {P1,P2,P3} → merge P4 + P5 → {P4,P5} → merge the two remaining clusters
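
---

## 3.3 Agglomerative Example: Code Sketch

The same matrix can be passed to SciPy to check the merges. The slide does not state which linkage was used, so single linkage is an assumption here. Several pairs are tied at distance 1.00, so the order of the first merges may differ from the sequence listed above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

D = np.array([
    [0.00, 1.00, 1.00, 7.07, 7.81],
    [1.00, 0.00, 1.41, 5.66, 6.40],
    [1.00, 1.41, 0.00, 6.40, 7.07],
    [7.07, 5.66, 6.40, 0.00, 1.00],
    [7.81, 6.40, 7.07, 1.00, 0.00],
])

# linkage expects the condensed (upper-triangle) form of the matrix
Z = linkage(squareform(D), method="single")   # single linkage is an assumption here
labels = fcluster(Z, t=2, criterion="maxclust")

print(Z[:, 2])    # merge heights
print(labels)     # cluster id for P1..P5
```

Cutting the tree into two clusters separates {P1, P2, P3} from {P4, P5}; under single linkage the final merge happens at 5.66, the smallest distance between the two groups.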

---

## 3.4 Python Code Example: Hierarchical

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# X is the feature matrix created in the K-Means example (make_blobs)
X_scaled = StandardScaler().fit_transform(X)

# Compute the linkage matrix used to draw the dendrogram
Z = linkage(X_scaled, method='ward')

# Build the agglomerative model
model = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward',
    metric='euclidean'  # named `affinity` in scikit-learn < 1.2
)
labels = model.fit_predict(X_scaled)
print(f"Cluster distribution: {dict(zip(*np.unique(labels, return_counts=True)))}")
```

---

## 3.5 Comparing Linkage Methods

| Linkage | Cluster shape | Strengths | Weaknesses |
|---------|--------------|-------|---------|
| Single | Long, thin (chaining) | Fast; can detect non-convex clusters | Chain effect |
| Complete | Round, compact | Avoids chaining | Sensitive to outliers |
| Average | In between | Balanced | Slower to compute |
| Ward | Round, similar sizes | Often the best general-purpose choice | Euclidean distance only |

---

# DBSCAN

---

## 4.1 Core Idea of DBSCAN

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) finds clusters by looking at the density of data points.

**Key advantages:**
- The number of clusters does not have to be specified in advance
- Detects **outliers (noise)** automatically
- Handles clusters of complex (non-convex) shape

**Parameters:**
- **ε (epsilon):** radius of the neighborhood
- **MinPts:** minimum number of points required in the ε-neighborhood

---

## 4.2 Types of Points in DBSCAN

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart LR
    subgraph classification["Classification of data points"]
        direction TB
        CP["🔴 Core Point - |N_ε(x)| ≥ MinPts"]
        BP["🟡 Border Point - lies in N_ε of a Core Point - but is not itself a Core Point"]
        NP["⚫ Noise Point - neither Core nor Border"]
    end
    CP -->|"Directly Reachable"| BP
    CP -->|"Density-Connected"| CP
    NP -->|"belongs to no cluster"| NP
    style classification fill:#282828,stroke:#fabd2f,color:#ebdbb2
    style CP fill:#cc241d,stroke:#fb4934,color:#ebdbb2
    style BP fill:#d79921,stroke:#fabd2f,color:#282828
    style NP fill:#3c3836,stroke:#928374,color:#a89984
```

---

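## 4.3 DBSCAN Algorithm Steps

1. For every point x that has not yet been visited:
   - Find N_ε(x), the set of points within radius ε of x (including x itself)
   - If |N_ε(x)| < MinPts → provisionally mark x as **Noise** (it may later be relabeled as a Border Point of some cluster)
   - If |N_ε(x)| ≥ MinPts → x is a **Core Point** → start a new cluster
2. Expand the cluster by adding every point that is **Density-Reachable** from x
3. Repeat until every point has been visited

**Definition of Density-Reachable:** a point q is density-reachable from p if there is a sequence p₁, p₂, ..., pₙ with p₁=p and pₙ=q such that each p_{i+1} lies in the ε-neighborhood of p_i and each p_i (for i < n) is a Core Point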
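
---

## 4.3 DBSCAN from Scratch: Code Sketch

The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `dbscan` and the toy data are assumptions, and the full pairwise distance matrix makes it suitable only for small inputs.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each row of X with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    # Pairwise Euclidean distances; O(n^2) memory, fine for small n
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i itself
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)          # everything starts as noise
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue                 # already assigned, or cannot seed a cluster
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if is_core[j]:       # only core points extend the cluster
                    frontier.extend(neighbors[j])
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]], dtype=float)
labels = dbscan(X, eps=1.5, min_pts=3)
print(labels)
```

Border points are attached to whichever cluster reaches them first, which is why DBSCAN results can depend slightly on the visiting order.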

---

## 4.4 Worked DBSCAN Example (ε=1.5, MinPts=3)

| Point | x | y | \|N_ε\| | Type |
|-----|---|---|------|--------|
| A | 1 | 1 | 4 | **Core** |
| B | 1.5 | 1.5 | 4 | **Core** |
| C | 2 | 1 | 4 | **Core** |
| D | 8 | 8 | 2 | **Noise** |
| E | 1.2 | 0.8 | 4 | **Core** |
| F | 9 | 8 | 2 | **Noise** |

The neighborhood size counts the point itself.

**Result:**
- **Cluster 1:** {A, B, C, E} (dense region)
- **Noise:** D and F
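
---

## 4.4 DBSCAN Example: Code Sketch

The table can be checked with scikit-learn, whose `min_samples` parameter also counts the point itself. This is a sketch on the six example points; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

names = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1, 1], [1.5, 1.5], [2, 1], [8, 8], [1.2, 0.8], [9, 8]], dtype=float)

db = DBSCAN(eps=1.5, min_samples=3).fit(X)   # min_samples counts the point itself

# Neighborhood sizes |N_eps(x)|, counting the point itself
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
n_eps = (dist <= 1.5).sum(axis=1)

for name, size, label in zip(names, n_eps, db.labels_):
    kind = "Noise" if label == -1 else "Core" if size >= 3 else "Border"
    print(f"{name}: |N_eps|={size}  label={label}  {kind}")
```

D and F are neighbors of each other, but two points fall short of MinPts = 3, so neither is a core point and both are labeled -1 (noise).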

---

## 4.5 Python Code Example: DBSCAN

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# X is the feature matrix created in the K-Means example (make_blobs)
X_scaled = StandardScaler().fit_transform(X)

# Estimate ε with a k-distance graph
k = 5
nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)  # each point's first neighbor is itself
k_distances_sorted = np.sort(distances[:, -1])[::-1]  # descending; look for the elbow

# Build the DBSCAN model
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters: {n_clusters}, Noise: {(labels==-1).sum()}")
```

---

## 4.6 Choosing DBSCAN Parameters

**Finding ε:**
- Use a **k-distance graph** (k = MinPts)
- Sort the distances to each point's k-th nearest neighbor in descending order
- Look for the **"elbow"**, the point where the curve bends sharply; the distance at that point is a suitable ε

**Choosing MinPts:**
- Rule of thumb: MinPts ≥ Dimension + 1
- For 2D data → MinPts ≥ 3
- Large or noisy datasets → increase MinPts
- Suggested value: **MinPts = 2 × Dimension**

---

# Evaluating Clustering Results

---

## 5.1 Types of Evaluation Metrics

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart TD
    EM["Clustering Evaluation Metrics"]:::main
    EM --> INT["Internal Metrics - (no ground truth needed)"]:::cat
    EM --> EXT["External Metrics - (ground truth required)"]:::cat
    EM --> REL["Relative Metrics - compare alternative results"]:::cat
    INT --> SC["Silhouette Score - [-1,1] higher is better"]:::metric
    INT --> DBI["Davies-Bouldin Index - ≥0 lower is better"]:::metric
    INT --> CHI["Calinski-Harabasz Index - ≥0 higher is better"]:::metric
    EXT --> ARI["Adjusted Rand Index - [-1,1] closer to 1 is better"]:::metric
    EXT --> NMI["Normalized Mutual Info - [0,1] higher is better"]:::metric
    REL --> ELB["Elbow Method (WCSS)"]:::metric
    classDef main fill:#282828,stroke:#fabd2f,color:#ebdbb2,font-weight:bold
    classDef cat fill:#458588,stroke:#83a598,color:#ebdbb2
    classDef metric fill:#3c3836,stroke:#928374,color:#ebdbb2
```

---

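## 5.2 Silhouette Score

The **Silhouette Score** measures how well each point fits the cluster it was assigned to.

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>s</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>=</mo>
  <mfrac>
    <mrow><mi>b</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>-</mo><mi>a</mi><mo>(</mo><mi>i</mi><mo>)</mo></mrow>
    <mrow><mo>max</mo><mo>{</mo><mi>a</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>,</mo><mi>b</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>}</mo></mrow>
  </mfrac>
</math>

- **a(i)** = mean distance from point i to the other points in the **same cluster**
- **b(i)** = mean distance from point i to the points of the **nearest cluster that i does not belong to**
- s(i) ≈ **+1**: well matched to its cluster | ≈ **0**: on the boundary between clusters | ≈ **-1**: probably assigned to the wrong cluster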

---

## 5.2 Worked Silhouette Score Example

**Data:** Cluster A: {P1=(0,0), P2=(1,0)}, Cluster B: {P3=(5,0)}

**For P1:**
- a(P1) = d(P1, P2) = √[(1-0)²+(0-0)²] = **1.0** (mean distance within cluster A)
- b(P1) = d(P1, P3) = √[(5-0)²+(0-0)²] = **5.0** (mean distance to cluster B)
- s(P1) = (b − a) / max(a, b) = (5.0 − 1.0) / 5.0 = **0.8**

✅ Very good: P1 sits in an appropriate cluster
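
---

## 5.2 Silhouette Example: Code Sketch

The hand calculation for P1 can be compared with scikit-learn's per-sample silhouette values. This is a sketch on the three example points; the integer labels 0 and 1 stand in for clusters A and B.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])   # P1, P2, P3
labels = np.array([0, 0, 1])                          # cluster A = 0, cluster B = 1

a = np.linalg.norm(X[0] - X[1])        # mean distance to the rest of cluster A (only P2)
b = np.linalg.norm(X[0] - X[2])        # mean distance to the nearest other cluster (only P3)
s_p1 = (b - a) / max(a, b)

s_all = silhouette_samples(X, labels)
print(s_p1, s_all)
```

P3 is alone in its cluster, so a(P3) is undefined; scikit-learn assigns such singleton points a silhouette of 0 by convention.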

---

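## 5.3 Davies-Bouldin Index (DBI)

**DBI** measures the ratio of within-cluster scatter to between-cluster separation, averaged over each cluster's worst case.

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mi>DBI</mi><mo>=</mo>
  <mfrac><mn>1</mn><mi>K</mi></mfrac>
  <munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>K</mi></munderover>
  <munder><mo movablelimits="true">max</mo><mrow><mi>j</mi><mo>≠</mo><mi>i</mi></mrow></munder>
  <mfrac>
    <mrow><msub><mi>s</mi><mi>i</mi></msub><mo>+</mo><msub><mi>s</mi><mi>j</mi></msub></mrow>
    <mrow><mi>d</mi><mo>(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>,</mo><msub><mi>c</mi><mi>j</mi></msub><mo>)</mo></mrow>
  </mfrac>
</math>

- **s_i** = mean distance from the points in cluster i to centroid i (scatter)
- **d(c_i, c_j)** = distance between the centroids of clusters i and j
- **Low DBI = good clustering** (compact, well-separated clusters)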
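
---

## 5.3 Davies-Bouldin Index: Code Sketch

The index can be computed by hand from the definitions of s_i and d(c_i, c_j) and compared with `davies_bouldin_score`. The four toy points are an assumption made for illustration.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])   # hypothetical toy data
labels = np.array([0, 0, 1, 1])

ks = np.unique(labels)
centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
# s_i: mean distance from the points of cluster i to its centroid
scatter = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                    for k, c in zip(ks, centroids)])

ratios = []
for i in range(len(ks)):
    # largest (worst-case) ratio of cluster i against every other cluster
    ratios.append(max((scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                      for j in range(len(ks)) if j != i))
dbi = float(np.mean(ratios))

print(dbi, davies_bouldin_score(X, labels))
```

Here each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.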

---

## 5.4 Python Code Example: Evaluation Metrics

```python
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score,
    calinski_harabasz_score,
    adjusted_rand_score, normalized_mutual_info_score
)

mask = labels != -1  # drop DBSCAN noise points (label -1)

# Score on X_scaled, the same features the clustering was fit on
results = {
    'Silhouette': silhouette_score(X_scaled[mask], labels[mask]),
    'Davies-Bouldin': davies_bouldin_score(X_scaled[mask], labels[mask]),
    'Calinski-Harabasz': calinski_harabasz_score(X_scaled[mask], labels[mask]),
    'ARI': adjusted_rand_score(y_true[mask], labels[mask]),
    'NMI': normalized_mutual_info_score(y_true[mask], labels[mask])
}
```

---

## 5.5 Summary of Evaluation Metrics

| Metric | Needs ground truth | Range | Best value |
|--------|---------------------|---------|---------|
| Silhouette Score | ❌ | [-1, 1] | → 1 |
| Davies-Bouldin | ❌ | [0, ∞) | → 0 |
| Calinski-Harabasz | ❌ | [0, ∞) | → ∞ |
| Adjusted Rand Index | ✅ | [-1, 1] | → 1 |
| NMI | ✅ | [0, 1] | → 1 |

---

# Summary and Comparison

---

## 6.1 Comparing Clustering Algorithms

| Property | K-Means | Hierarchical | DBSCAN |
|-----------|---------|--------------|--------|
| K fixed in advance | ✅ Required | ✅ Required (when cutting the dendrogram) | ❌ Not required |
| Cluster shape | Spherical | Depends on linkage | Arbitrary |
| Outlier detection | ❌ | ❌ | ✅ Automatic |
| Scalability | ✅ High | ❌ Low, O(n²) | Medium |
| Time complexity | O(nKt) | O(n² log n) | O(n log n)* |
| Best suited to | Large datasets, round clusters | Small datasets, hierarchical structure | Noisy data |

\* With a spatial index; O(n²) in the worst case.

---

## 6.2 Choosing a Clustering Algorithm

```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#282828', 'primaryTextColor': '#ebdbb2', 'lineColor': '#fabd2f'}}}%%
flowchart TD
    Q0["🤔 Which clustering algorithm should I choose?"]:::question
    Q0 --> Q2{"Is the number of clusters known?"}:::q
    Q2 -->|"No"| Q3{"Is the data very noisy?"}:::q
    Q2 -->|"Yes, K is known"| Q5{"Do the clusters have complex shapes?"}:::q
    Q3 -->|"Yes"| R1["✅ DBSCAN - detects outliers"]:::result
    Q3 -->|"Not very"| R2["✅ Hierarchical - read K off the dendrogram"]:::result
    Q5 -->|"Complex"| R4["✅ DBSCAN or Spectral Clustering"]:::result
    Q5 -->|"Spherical"| R5["✅ K-Means - fast, simple, scales to large data"]:::result
    classDef question fill:#d79921,stroke:#fabd2f,color:#282828,font-weight:bold
    classDef q fill:#458588,stroke:#83a598,color:#ebdbb2
    classDef result fill:#98971a,stroke:#b8bb26,color:#ebdbb2,font-weight:bold
```

---

## 6.3 Best Practices and Recommendations

**Before clustering:**
- Always apply **feature scaling** (StandardScaler or MinMaxScaler), since all three algorithms rely on distances
- Remove or impute missing values
- Consider **dimensionality reduction** (PCA) when there are many features

**During clustering:**
- Try **several algorithms** and compare the results
- Use the **Elbow Method** together with the **Silhouette Score** to choose K
- For DBSCAN: use a k-distance graph to choose ε

**After clustering:**
- Interpret the clusters: what does each one represent?
- Validate the results with **domain knowledge**
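
---

## 6.3 Choosing K: Code Sketch

One possible way to combine the Elbow Method with the Silhouette Score is to record both for each candidate K. This sketch reuses the `make_blobs` settings from the K-Means example; the candidate range 2 to 8 is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 9):                       # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    scores[k] = (km.inertia_, silhouette_score(X_scaled, km.labels_))
    print(f"k={k}  WCSS={scores[k][0]:.2f}  silhouette={scores[k][1]:.3f}")

best_k = max(scores, key=lambda k: scores[k][1])
print("best k by silhouette:", best_k)
```

WCSS keeps falling as K grows, so it only suggests an elbow, while the silhouette has a clear maximum. When the two disagree, domain knowledge should decide.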

---

## 6.4 Real-World Applications

| Domain | Use Case | Recommended Algorithm |
|--------|----------|-----------------|
| Marketing | Customer Segmentation | K-Means |
| Biology | Gene Expression Analysis | Hierarchical |
| Cybersecurity | Network Anomaly Detection | DBSCAN |
| Image Processing | Image Segmentation | K-Means |
| NLP | Document Clustering | K-Means, Hierarchical |
| Astronomy | Galaxy Classification | DBSCAN |
| Finance | Market Regime Detection | HMM |

---

## 7. References

1. **Bishop, C. M.** (2006). *Pattern Recognition and Machine Learning*. Springer.
2. **Murphy, K. P.** (2022). *Probabilistic Machine Learning: An Introduction*. MIT Press.
3. **Hastie, T., Tibshirani, R., & Friedman, J.** (2009). *The Elements of Statistical Learning* (2nd ed.). Springer.
4. **MacQueen, J.** (1967). Some methods for classification and analysis of multivariate observations. *Proc. 5th Berkeley Symposium*, 1, 281–297.
5. **Ester, M. et al.** (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. *KDD*, 96(34), 226–231.
6. **Ward, J. H.** (1963). Hierarchical grouping to optimize an objective function. *JASA*, 58(301), 236–244.
7. **scikit-learn Documentation** — https://scikit-learn.org/stable/modules/clustering.html

---

# Questions
<img src="/revealjs/pics/Designer.png" width="55%" />