3. Selecting and Filtering Data

At the heart of data analysis is the ability to pull only the data you care about out of a large dataset, much like picking the one tool you need from a crowded toolbox. Pandas offers several flexible ways to select and filter data, each with its own purpose and strengths.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ebdbb2','primaryTextColor':'#3c3836','primaryBorderColor':'#928374','lineColor':'#458588','secondaryColor':'#d5c4a1','tertiaryColor':'#fbf1c7','background':'#fbf1c7','mainBkg':'#ebdbb2','secondBkg':'#d5c4a1','textColor':'#3c3836','fontSize':'16px'}}}%%
graph TB
    A["DataFrame"] --> B["Column Selection"]
    A --> C["Row Selection"]
    A --> D["Filtering"]
    
    B --> B1["df['col']
Single column"]
    B --> B2["df[['col1', 'col2']]
Multiple columns"]
    
    C --> C1[".loc[]
Label-based"]
    C --> C2[".iloc[]
Integer-based"]
    
    D --> D1["Boolean Indexing
True/False masks"]
    D --> D2[".query()
SQL-like condition strings"]
    
    style A fill:#b8bb26,stroke:#3c3836,stroke-width:3px,color:#3c3836
    style B fill:#fabd2f,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style C fill:#fabd2f,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style D fill:#fabd2f,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style B1 fill:#d3869b,stroke:#3c3836,color:#3c3836
    style B2 fill:#d3869b,stroke:#3c3836,color:#3c3836
    style C1 fill:#83a598,stroke:#3c3836,color:#3c3836
    style C2 fill:#83a598,stroke:#3c3836,color:#3c3836
    style D1 fill:#fe8019,stroke:#3c3836,color:#3c3836
    style D2 fill:#fe8019,stroke:#3c3836,color:#3c3836

3.1 Column Selection

A column is the vertical axis of a DataFrame. Each column usually holds values of one kind (for example age, name, or salary). Selecting columns is the most basic operation in Pandas.

3.1.1 Selecting a Single Column

Pandas offers two ways to select a single column. Both return the same Series, but they differ in which column names they can handle:

import pandas as pd

# Build a sample DataFrame
data = {
    'name': ['สมชาย', 'สมหญิง', 'สมศักดิ์', 'สมพร'],
    'age': [25, 30, 35, 28],
    'salary': [30000, 45000, 55000, 38000],
    'department': ['IT', 'HR', 'Finance', 'IT']
}
df = pd.DataFrame(data)

# Approach 1: square brackets, returns a Series
age_series = df['age']
print(type(age_series))  # <class 'pandas.core.series.Series'>

# Approach 2: dot notation, also returns a Series (but has limitations)
age_series_dot = df.age
print(type(age_series_dot))  # <class 'pandas.core.series.Series'>

Key differences:

Method	Summary	Advantages	Limitations
Square brackets `df['col']`	Always works	- Handles names containing spaces - Handles names that clash with a method - Works with a name stored in a variable	More typing
Dot notation `df.col`	Shorter	- Quick to write - Easy to read	- Fails for names containing spaces - Fails for names that clash with a method or attribute - Cannot take a name stored in a variable

Examples of the limitations of dot notation:

# Build a DataFrame with awkward column names
df_special = pd.DataFrame({
    'first name': ['John', 'Jane'],  # contains a space
    'count': [10, 20]  # clashes with the .count() method
})

# ❌ Does not work: the name contains a space
# print(df_special.first name)  # SyntaxError

# ✅ Works
print(df_special['first name'])

# ⚠️ Dangerous: returns the method instead of the column
# print(df_special.count)  # the bound method, not the column

# ✅ Safe
print(df_special['count'])
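
The "works with a variable" point from the table can be checked directly. The following is a minimal sketch; `df_demo` and its values are hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "age": [25, 30],
    "salary": [30000, 45000],
    "count": [10, 20],   # same name as the DataFrame.count() method
})

column_name = "salary"               # column name held in a variable
by_bracket = df_demo[column_name]    # looks up the column called "salary"

# Dot notation would search for an attribute literally named "column_name"
dot_finds_variable = hasattr(df_demo, "column_name")

# With a clashing name, dot notation returns the method, not the data
dot_gives_method = callable(df_demo.count)

print(by_bracket.tolist())
print(dot_finds_variable)
print(dot_gives_method)
```

Because the bracket form takes an ordinary Python string, it is the form to reach for whenever column names come from user input, a loop, or a configuration file.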

3.1.2 Selecting Multiple Columns

To select several columns at once, pass a list of column names inside the indexing brackets, which produces the double brackets [[]]. The result is a new DataFrame containing only the selected columns.

def select_columns_demo():
    """
    Demonstrate selecting multiple columns.
    
    Returns:
        DataFrame containing only the selected columns
    """
    # Build employee data
    employees = pd.DataFrame({
        'emp_id': [101, 102, 103, 104],
        'name': ['อรุณ', 'สุดา', 'วิชัย', 'มานี'],
        'age': [28, 32, 45, 29],
        'department': ['Sales', 'IT', 'HR', 'Finance'],
        'salary': [35000, 48000, 52000, 41000],
        'years_exp': [3, 5, 15, 4]
    })
    
    # Select only name and salary
    name_salary = employees[['name', 'salary']]
    print(f"Type: {type(name_salary)}")  # DataFrame
    print(f"Shape: {name_salary.shape}")  # (4, 2)
    
    # Select columns using a list stored in a variable
    columns_to_select = ['emp_id', 'name', 'department']
    subset = employees[columns_to_select]
    
    return subset

# Run the demo
result = select_columns_demo()
print(result)

Shape of the selected result:

Selecting $k$ columns from a DataFrame with $m$ rows and $n$ columns yields a new DataFrame of shape:

$$\text{shape} = (m, k)$$

where:

$m$ = number of rows (unchanged)
$k$ = number of selected columns (with $k \leq n$ when no column is listed twice)

3.1.3 Reordering Columns

Sometimes you want to change the column order to make the data easier to read:

# Build a DataFrame
df = pd.DataFrame({
    'salary': [30000, 45000],
    'name': ['บุญมี', 'สมศรี'],
    'age': [25, 30],
    'id': [1, 2]
})

# Reorder so that id and name come first
desired_order = ['id', 'name', 'age', 'salary']
df_reordered = df[desired_order]
print(df_reordered)
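
When a DataFrame has many columns, listing every name by hand is tedious. One possible variation, sketched below with made-up data, names only the columns that should come first and keeps the remaining ones in their existing order. The names `df_cols`, `front`, and `rest` are illustrative.

```python
import pandas as pd

# Hypothetical data for illustration
df_cols = pd.DataFrame({
    "salary": [30000, 45000],
    "name": ["A", "B"],
    "age": [25, 30],
    "id": [1, 2],
})

front = ["id", "name"]  # columns to move to the front
# Keep every other column in the order it already has
rest = [col for col in df_cols.columns if col not in front]
df_front = df_cols[front + rest]

print(list(df_front.columns))  # ['id', 'name', 'salary', 'age']
```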

3.2 Selecting Rows and Columns by Label and Position

Pandas has two main selection tools that often confuse beginners: .loc[] and .iloc[]. Once the underlying rule is clear they are straightforward: .loc[] selects by label and .iloc[] selects by integer position.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ebdbb2','primaryTextColor':'#3c3836','primaryBorderColor':'#928374','lineColor':'#458588','secondaryColor':'#d5c4a1','tertiaryColor':'#fbf1c7','background':'#fbf1c7','mainBkg':'#ebdbb2','secondBkg':'#d5c4a1','textColor':'#3c3836','fontSize':'16px'}}}%%
graph LR
    A["Selection Methods"] --> B[".loc[]
Label-based
Uses names/labels"]
    A --> C[".iloc[]
Integer-based
Uses integer positions"]
    
    B --> B1["Uses index labels"]
    B --> B2["Uses column names"]
    B --> B3["Slices include the endpoint"]
    
    C --> C1["Uses positions 0, 1, 2, ..."]
    C --> C2["Works like a Python list"]
    C --> C3["Slices exclude the endpoint"]
    
    style A fill:#b8bb26,stroke:#3c3836,stroke-width:3px,color:#3c3836
    style B fill:#83a598,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style C fill:#d3869b,stroke:#3c3836,stroke-width:2px,color:#3c3836

3.2.1 Using .loc[] - Label-based Selection

.loc[] selects data by the labels of the index and of the columns.

Syntax:

df.loc[row_labels, column_labels]

def loc_selection_demo():
    """
    Demonstrate .loc[] in several forms.
    
    .loc[] is a good fit for:
    - selecting by meaningful index/column names
    - slices that should include the endpoint
    - working with a MultiIndex
    """
    # Build a DataFrame with a custom index
    df = pd.DataFrame({
        'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Mouse'],
        'price': [25000, 15000, 12000, 2500, 800],
        'stock': [15, 50, 30, 100, 200],
        'category': ['Computer', 'Mobile', 'Mobile', 'Accessory', 'Accessory']
    }, index=['P001', 'P002', 'P003', 'P004', 'P005'])
    
    print("=== Example 1: a single row ===")
    # Select the row whose index label is 'P001'
    row = df.loc['P001']
    print(row)
    print(f"Type: {type(row)}\n")  # Series
    
    print("=== Example 2: a single row and several columns ===")
    # Select the product name and price of P002
    item = df.loc['P002', ['product', 'price']]
    print(item)
    print()
    
    print("=== Example 3: a range of rows (slice) ===")
    # Select P001 through P003 (P003 is included!)
    subset = df.loc['P001':'P003']
    print(subset)
    print()
    
    print("=== Example 4: several rows and several columns ===")
    # Select only P001 and P003, showing only product and stock
    selected = df.loc[['P001', 'P003'], ['product', 'stock']]
    print(selected)
    print()
    
    print("=== Example 5: every row, but only some columns ===")
    # Use : to select every row
    all_rows = df.loc[:, ['product', 'price']]
    print(all_rows)
    
    return df

# Try it out
df_result = loc_selection_demo()

Key rules for .loc[]:

Slices always include the endpoint: df.loc['A':'C'] returns A, B, and C
Labels must exist: looking up a single label, or a list containing a label, that is not in the index raises a KeyError
Boolean masks are accepted: df.loc[df['age'] > 25]
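
The three rules above can be verified in a few lines. This is a minimal sketch using a made-up DataFrame (`df_rules`) whose index labels are 'A', 'B', and 'C'.

```python
import pandas as pd

# Hypothetical data for illustration
df_rules = pd.DataFrame({"age": [22, 31, 40]}, index=["A", "B", "C"])

# Rule 1: a label slice includes its endpoint
sliced = df_rules.loc["A":"C"]

# Rule 2: a label that is not in the index raises KeyError
try:
    df_rules.loc["Z"]
    missing_raises = False
except KeyError:
    missing_raises = True

# Rule 3: a boolean mask selects the rows where the mask is True
older = df_rules.loc[df_rules["age"] > 25]

print(list(sliced.index))   # ['A', 'B', 'C']
print(missing_raises)       # True
print(list(older.index))    # ['B', 'C']
```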

3.2.2 Using .iloc[] - Integer-based Selection

.iloc[] selects data by integer position, in the same way you index into a Python list.

Syntax:

df.iloc[row_positions, column_positions]

def iloc_selection_demo():
    """
    Demonstrate .iloc[] in several forms.
    
    .iloc[] is a good fit for:
    - selecting by exact position
    - loops or other programmatic access
    - cases where index/column names do not matter
    """
    # Build a new DataFrame with the default integer index
    df = pd.DataFrame({
        'name': ['สมชาย', 'สมหญิง', 'สมศักดิ์', 'สมพร', 'สมใจ'],
        'score': [85, 92, 78, 88, 95],
        'grade': ['B', 'A', 'C', 'B', 'A'],
        'pass': [True, True, True, True, True]
    })
    
    print("=== Example 1: a single row (the first row) ===")
    # Row at position 0
    first_row = df.iloc[0]
    print(first_row)
    print()
    
    print("=== Example 2: a single row and a single column ===")
    # Row position 1, column position 1 (the score of สมหญิง)
    value = df.iloc[1, 1]
    print(f"Score in the second row: {value}")
    print()
    
    print("=== Example 3: a range of rows (slice) ===")
    # Positions 0 through 2 (3 is excluded!)
    subset = df.iloc[0:3]
    print(subset)
    print()
    
    print("=== Example 4: several rows and several columns ===")
    # Rows 0, 2, 4 and columns 0, 1 (name and score)
    selected = df.iloc[[0, 2, 4], [0, 1]]
    print(selected)
    print()
    
    print("=== Example 5: the last row ===")
    # Use -1, just like a Python list
    last_row = df.iloc[-1]
    print(last_row)
    print()
    
    print("=== Example 6: every row, but only the first 2 columns ===")
    first_two_cols = df.iloc[:, 0:2]
    print(first_two_cols)
    
    return df

# Try it out
df_test = iloc_selection_demo()

Key rules for .iloc[]:

Slices exclude the endpoint: df.iloc[0:3] returns rows 0, 1, 2 (not 3)
Counting starts at 0: the first row is 0 and the first column is 0
Negative indexing is supported: -1 is the last position
Index labels are ignored: only positions matter
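
The last rule matters most when the index labels are themselves integers, because a label and a position can then look identical. The sketch below uses a made-up DataFrame (`df_int`) whose integer labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical data: integer labels that do not match the row positions
df_int = pd.DataFrame({"value": [10, 20, 30]}, index=[2, 0, 1])

by_label = df_int.loc[0, "value"]     # row whose label is 0 (the second row)
by_position = df_int.iloc[0, 0]       # row at position 0 (the first row)

print(by_label)      # 20
print(by_position)   # 10
```

After sorting, filtering, or concatenating, integer labels frequently stop matching positions, which is why it helps to choose .loc[] or .iloc[] explicitly.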

3.2.3 Comparing .loc[] and .iloc[]

Feature	.loc[]	.iloc[]
Refers to	Labels	Positions
Slice endpoint	Included	Excluded
Row example	`df.loc['A']`	`df.iloc[0]`
Slice example	`df.loc['A':'C']` → A,B,C	`df.iloc[0:3]` → 0,1,2
Boolean indexing	Supported (boolean Series or array)	Supported with a boolean array or list only; a boolean Series raises an error
Negative index	Not as a position (a negative number is treated as a label)	Supported: `-1, -2`
Use when	You know the index/column names	You know the exact positions

Direct comparison:

# Build a sample DataFrame
df = pd.DataFrame(
    {'A': [10, 20, 30], 'B': [40, 50, 60]},
    index=['x', 'y', 'z']
)

print("Original DataFrame:")
print(df)
print()

# Compare slicing
print("df.loc['x':'y']:")  # includes 'y'
print(df.loc['x':'y'])
print()

print("df.iloc[0:2]:")  # excludes position 2
print(df.iloc[0:2])
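
Boolean masks work with both indexers, with one difference: .iloc[] expects a plain boolean array rather than a boolean Series. The following sketch shows one way to bridge the two, using `Series.to_numpy()`. The variable names are illustrative.

```python
import pandas as pd

df_mask = pd.DataFrame(
    {"A": [10, 20, 30], "B": [40, 50, 60]},
    index=["x", "y", "z"],
)

mask = df_mask["A"] > 10                  # boolean Series aligned on the index

by_loc = df_mask.loc[mask]                # .loc[] accepts the Series directly
by_iloc = df_mask.iloc[mask.to_numpy()]   # .iloc[] needs a positional boolean array

print(by_loc.equals(by_iloc))   # True
print(list(by_loc.index))       # ['y', 'z']
```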

3.2.4 Mixed Selection - Using Both .loc[] and .iloc[]

Sometimes both tools are needed together:

def mixed_selection_example():
    """
    Example of combining .loc[] and .iloc[].
    """
    # Build sales data
    sales_df = pd.DataFrame({
        'Q1': [100, 150, 200, 180],
        'Q2': [120, 160, 210, 190],
        'Q3': [130, 170, 220, 195],
        'Q4': [140, 180, 230, 200]
    }, index=['North', 'South', 'East', 'West'])
    
    # Approach 1: use .iloc[] for the first row, then .loc[] for the column
    # (not recommended: chaining the two is confusing)
    # first_row_q1 = sales_df.iloc[0].loc['Q1']
    
    # Approach 2: use each one in a clear, separate step
    # Use iloc to take the first 3 rows
    first_three = sales_df.iloc[0:3]
    # Then use loc to pick columns Q1 and Q4
    result = first_three.loc[:, ['Q1', 'Q4']]
    
    print(result)
    
    return result

mixed_selection_example()
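
Another option is to translate the column names into positions first and then make a single .iloc[] call. The sketch below assumes the same sales layout as the example above and uses `Index.get_indexer()` for the translation. `col_positions` is an illustrative name.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Q1": [100, 150, 200, 180],
    "Q2": [120, 160, 210, 190],
    "Q3": [130, 170, 220, 195],
    "Q4": [140, 180, 230, 200],
}, index=["North", "South", "East", "West"])

# Convert the column labels to integer positions
col_positions = sales_df.columns.get_indexer(["Q1", "Q4"])

# First three rows by position, Q1 and Q4 by their looked-up positions
result = sales_df.iloc[0:3, col_positions]

print(result)
```

A single indexing call also avoids assigning into an intermediate object, which matters when the selection is used to modify values.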

3.3 Filtering Data with Conditions (Boolean Indexing)

Boolean indexing is one of the most widely used filtering techniques in Pandas. It uses logical conditions to pick the rows you want, much like a WHERE clause in SQL.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ebdbb2','primaryTextColor':'#3c3836','primaryBorderColor':'#928374','lineColor':'#458588','secondaryColor':'#d5c4a1','tertiaryColor':'#fbf1c7','background':'#fbf1c7','mainBkg':'#ebdbb2','secondBkg':'#d5c4a1','textColor':'#3c3836','fontSize':'14px'}}}%%
flowchart TB
    A["DataFrame
All data"] --> B["Build a condition"]
    B --> C["Get a Boolean Series
True/False per row"]
    C --> D["Use the Boolean Series
to filter the DataFrame"]
    D --> E["Only the rows that
match the condition"]
    
    B --> B1["df['age'] > 25"]
    B --> B2["df['city'] == 'Bangkok'"]
    B --> B3["df['salary'].between(30000, 50000)"]
    
    style A fill:#83a598,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style C fill:#fabd2f,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style E fill:#b8bb26,stroke:#3c3836,stroke-width:2px,color:#3c3836

3.3.1 Single Condition

def single_condition_filtering():
    """
    Filter with a single condition.
    
    Comparison operators:
    - == (equal to)
    - != (not equal to)
    - > (greater than)
    - < (less than)
    - >= (greater than or equal to)
    - <= (less than or equal to)
    """
    # Build employee data
    employees = pd.DataFrame({
        'name': ['สมชาย', 'สมหญิง', 'สมศักดิ์', 'สมพร', 'สมใจ'],
        'age': [25, 32, 28, 45, 29],
        'department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
        'salary': [35000, 48000, 38000, 65000, 42000],
        'experience': [2, 5, 3, 15, 4]
    })
    
    print("=== All data ===")
    print(employees)
    print()
    
    # Example 1: employees older than 30
    print("=== Employees with age > 30 ===")
    condition = employees['age'] > 30
    print(f"Boolean Series:\n{condition}\n")  # shows True/False per row
    filtered = employees[condition]
    print(filtered)
    print()
    
    # Example 2: employees in the IT department (short form)
    print("=== Employees in IT ===")
    it_employees = employees[employees['department'] == 'IT']
    print(it_employees)
    print()
    
    # Example 3: employees earning less than 40,000
    print("=== Salary < 40,000 ===")
    low_salary = employees[employees['salary'] < 40000]
    print(low_salary)
    
    return employees

# Try it out
df_emp = single_condition_filtering()

Boolean indexing as a formula:

Let DataFrame $D$ have $n$ rows and let $C$ be a condition. The result is a new DataFrame whose number of rows $n_{filtered}$ is:

$$n_{filtered} = \sum_{i = 1}^{n} C_{i}$$

where $C_{i} = 1$ if row $i$ satisfies the condition and $0$ otherwise
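
Since True counts as 1 and False as 0, the sum in the formula is simply the sum of the boolean mask. The following sketch checks this using the ages from the employee example above. `ages` and `condition` are illustrative names.

```python
import pandas as pd

ages = pd.DataFrame({"age": [25, 32, 28, 45, 29]})

condition = ages["age"] > 30          # C_i for each row, as True/False
n_filtered = int(condition.sum())     # sum of C_i over all rows
filtered = ages[condition]

print(n_filtered)      # 2
print(len(filtered))   # 2
```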

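3.3.2 Multiple Conditions

To filter on several conditions, combine them with the element-wise logical operators:

& (AND) - every condition must be true
| (OR) - at least one condition must be true
~ (NOT) - inverts True/False

⚠️ Important: wrap each condition in parentheses ( ). The operators & and | bind more tightly than comparisons such as > and ==, so without parentheses Python groups the expression differently from what was intended.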

def multiple_conditions_filtering():
    """
    Filter with multiple conditions.
    
    Operators:
    & = AND
    | = OR
    ~ = NOT
    """
    # Student data
    students = pd.DataFrame({
        'name': ['อรุณ', 'สุดา', 'วิชัย', 'มานี', 'นภา', 'สมพร'],
        'score': [85, 92, 78, 88, 95, 72],
        'attendance': [95, 88, 92, 85, 90, 70],
        'year': [2, 3, 1, 4, 2, 3],
        'scholarship': [False, True, False, True, True, False]
    })
    
    print("=== Example 1: AND - score >= 85 and attendance >= 90 ===")
    excellent = students[
        (students['score'] >= 85) & 
        (students['attendance'] >= 90)
    ]
    print(excellent)
    print()
    
    print("=== Example 2: OR - year 1 or year 4 ===")
    first_or_fourth = students[
        (students['year'] == 1) | 
        (students['year'] == 4)
    ]
    print(first_or_fourth)
    print()
    
    print("=== Example 3: NOT - no scholarship ===")
    no_scholarship = students[~students['scholarship']]
    print(no_scholarship)
    print()
    
    print("=== Example 4: a compound condition ===")
    # (score > 85 and attendance > 85) or has a scholarship
    complex_filter = students[
        ((students['score'] > 85) & (students['attendance'] > 85)) |
        students['scholarship']  # already boolean, no need for == True
    ]
    print(complex_filter)
    
    return students

# Try it out
df_students = multiple_conditions_filtering()

Truth table for the logical operators:

A	B	A & B (AND)	A | B (OR)	~A (NOT)
True	True	True	True	False
True	False	False	True	False
False	True	False	True	True
False	False	False	False	True
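
The truth table can be reproduced by applying the three operators to two boolean Series that cover every combination. A minimal sketch follows. The names `a`, `b`, and `table` are illustrative.

```python
import pandas as pd

# Every combination of True/False for A and B
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

table = pd.DataFrame({
    "A": a,
    "B": b,
    "A & B": a & b,   # element-wise AND
    "A | B": a | b,   # element-wise OR
    "~A": ~a,         # element-wise NOT
})

print(table)
```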

3.3.3 Using .isin() - Testing Membership in a List

To check whether a value belongs to a given set of values, .isin() is more convenient than chaining several | conditions.

def isin_filtering():
    """
    Filter data with .isin().
    Well suited to checking against several values at once.
    """
    # Product data
    products = pd.DataFrame({
        'product_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
        'name': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Mouse', 'Keyboard'],
        'category': ['Computer', 'Mobile', 'Mobile', 'Accessory', 'Accessory', 'Computer'],
        'price': [25000, 15000, 12000, 2500, 800, 1500],
        'brand': ['Dell', 'Samsung', 'Apple', 'Sony', 'Logitech', 'Logitech']
    })
    
    print("=== Example 1: products in the Mobile or Computer category ===")
    # Long form (not recommended)
    # filtered = products[
    #     (products['category'] == 'Mobile') | 
    #     (products['category'] == 'Computer')
    # ]
    
    # Short form with .isin() (recommended)
    categories_wanted = ['Mobile', 'Computer']
    filtered = products[products['category'].isin(categories_wanted)]
    print(filtered)
    print()
    
    print("=== Example 2: products that are NOT from the given brands ===")
    unwanted_brands = ['Sony', 'Logitech']
    # Use ~ to invert the condition
    not_these_brands = products[~products['brand'].isin(unwanted_brands)]
    print(not_these_brands)
    print()
    
    print("=== Example 3: products whose product_id is in a list of interest ===")
    interested_ids = ['P001', 'P003', 'P005']
    selected_products = products[products['product_id'].isin(interested_ids)]
    print(selected_products)
    
    return products

# Try it out
df_products = isin_filtering()

3.3.4 Using .between() - Filtering a Range of Values

Use .between() to find values that lie between a lower and an upper bound.

def between_filtering():
    """
    Filter a range of values with .between().
    
    Parameters of .between():
    - left: lower bound
    - right: upper bound
    - inclusive: 'both' (default), 'neither', 'left', 'right'
    """
    # Salesperson data
    sales_data = pd.DataFrame({
        'salesperson': ['อรุณ', 'สุดา', 'วิชัย', 'มานี', 'นภา'],
        'sales_amount': [150000, 89000, 234000, 45000, 198000],
        'commission_rate': [0.05, 0.03, 0.07, 0.02, 0.06],
        'years_experience': [5, 2, 10, 1, 8]
    })
    
    print("=== Example 1: sales between 100,000 and 200,000 ===")
    # Long form
    # middle_sales = sales_data[
    #     (sales_data['sales_amount'] >= 100000) & 
    #     (sales_data['sales_amount'] <= 200000)
    # ]
    
    # Short form with .between()
    middle_sales = sales_data[
        sales_data['sales_amount'].between(100000, 200000)
    ]
    print(middle_sales)
    print()
    
    print("=== Example 2: people with 3-7 years of experience (7 excluded) ===")
    mid_experience = sales_data[
        sales_data['years_experience'].between(3, 7, inclusive='left')
    ]
    print(mid_experience)
    print()
    
    print("=== Example 3: commission rate from 0.04 to 0.06 ===")
    commission_range = sales_data[
        sales_data['commission_rate'].between(0.04, 0.06)
    ]
    print(commission_range)
    
    return sales_data

# Try it out
df_sales = between_filtering()
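
The effect of the `inclusive` parameter is easiest to see on a tiny Series that contains both bounds. The sketch below assumes a pandas version that accepts the string values listed in the docstring above. The Series `s` is made up for illustration.

```python
import pandas as pd

s = pd.Series([3, 5, 7])  # contains both bounds (3 and 7) and one inner value

both = s.between(3, 7).tolist()                          # [True, True, True]
left = s.between(3, 7, inclusive="left").tolist()        # [True, True, False]
right = s.between(3, 7, inclusive="right").tolist()      # [False, True, True]
neither = s.between(3, 7, inclusive="neither").tolist()  # [False, True, False]

for name, result in [("both", both), ("left", left),
                     ("right", right), ("neither", neither)]:
    print(f"{name:8s} {result}")
```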

3.3.5 Using .str Methods - Filtering String Data

Text columns have their own filtering techniques, available through the .str accessor.

def string_filtering():
    """
    Filter string data with .str methods.
    
    Commonly used methods:
    - .str.contains() - whether the text contains a pattern
    - .str.startswith() - starts with
    - .str.endswith() - ends with
    - .str.len() - length of the text
    """
    # Customer data
    customers = pd.DataFrame({
        'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005'],
        'name': ['บริษัท ABC จำกัด', 'ร้านกาแฟดีดี', 'บริษัท XYZ มหาชน', 'ร้านอาหารสมชาย', 'ABC Trading'],
        'email': ['abc@company.com', 'deedee@gmail.com', 'xyz@corp.co.th', 'somchai@hotmail.com', 'trading@abc.co.th'],
        'phone': ['02-1234567', '081-2345678', '02-7654321', '089-8765432', '02-1111111'],
        'type': ['Corporate', 'SME', 'Corporate', 'SME', 'Corporate']
    })
    
    print("=== Example 1: customers whose name contains 'บริษัท' (Thai for 'company') ===")
    has_company = customers[customers['name'].str.contains('บริษัท')]
    print(has_company)
    print()
    
    print("=== Example 2: emails ending with .co.th ===")
    thai_email = customers[customers['email'].str.endswith('.co.th')]
    print(thai_email)
    print()
    
    print("=== Example 3: mobile numbers (starting with 08 or 09) ===")
    mobile = customers[
        customers['phone'].str.startswith('08') | 
        customers['phone'].str.startswith('09')
    ]
    print(mobile)
    print()
    
    print("=== Example 4: names shorter than 15 characters ===")
    short_name = customers[customers['name'].str.len() < 15]
    print(short_name)
    print()
    
    print("=== Example 5: case-insensitive search ===")
    # case=False ignores upper/lower case
    has_abc = customers[customers['name'].str.contains('abc', case=False)]
    print(has_abc)
    
    return customers

# Try it out
df_customers = string_filtering()
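
Two details of `.str.contains()` tend to matter on real data: missing values and regular-expression metacharacters. The sketch below, on a made-up Series, uses `na=False` so that missing entries count as "no match" and `regex=False` to search for literal text.

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(["ABC Co.", None, "abc (Thailand)", "XYZ"])

# na=False turns the missing entry into False so the mask can filter rows
has_abc = names.str.contains("abc", case=False, na=False)

# regex=False treats the parentheses as literal characters, not a regex group
has_literal = names.str.contains("(Thailand)", regex=False, na=False)

print(has_abc.tolist())      # [True, False, True, False]
print(has_literal.tolist())  # [False, False, True, False]
```

Without `na=False`, the mask would contain a missing value for the second entry, and using it to index a DataFrame would fail.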

3.4 Using .query() - SQL-like Condition Strings

.query() is a convenient and often more readable way to write complex conditions. The condition is written as a string that resembles SQL, instead of the usual boolean indexing expression.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ebdbb2','primaryTextColor':'#3c3836','primaryBorderColor':'#928374','lineColor':'#458588','secondaryColor':'#d5c4a1','tertiaryColor':'#fbf1c7','background':'#fbf1c7','mainBkg':'#ebdbb2','secondBkg':'#d5c4a1','textColor':'#3c3836','fontSize':'14px'}}}%%
graph LR
    A["Boolean Indexing
Standard approach"] --> A1["df[(df['age'] > 25) &
(df['city'] == 'Bangkok')]"]
    B[".query() Method
Alternative approach"] --> B1["df.query('age > 25 and
city == #quot;Bangkok#quot;')"]
    
    A1 --> C["Same result"]
    B1 --> C
    
    style A fill:#d3869b,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style B fill:#b8bb26,stroke:#3c3836,stroke-width:2px,color:#3c3836
    style C fill:#83a598,stroke:#3c3836,stroke-width:2px,color:#3c3836

3.4.1 .query() Basics

def basic_query_demo():
    """
    Demonstrate basic use of .query().
    
    Advantages of .query():
    1. Easier to read, closer to natural language
    2. No nested parentheses around each condition
    3. Shorter to write
    4. Can be faster on very large DataFrames
       (mainly when the optional numexpr package is installed)
    """
    # Build exam score data
    exam_data = pd.DataFrame({
        'student': ['นภา', 'อรุณ', 'สุดา', 'วิชัย', 'มานี', 'สมพร'],
        'math': [85, 72, 90, 65, 88, 78],
        'english': [78, 85, 88, 70, 82, 90],
        'science': [92, 68, 85, 75, 90, 80],
        'grade_level': [10, 11, 10, 12, 11, 12],
        'scholarship': [True, False, True, False, True, False]
    })
    
    print("=== Example 1: a single condition ===")
    # Boolean indexing, the standard way
    result1 = exam_data[exam_data['math'] > 80]
    # .query(), the alternative
    result2 = exam_data.query('math > 80')
    print(result2)
    print()
    
    print("=== Example 2: AND - math > 80 and science > 85 ===")
    # Boolean indexing (longer)
    # result_old = exam_data[
    #     (exam_data['math'] > 80) & (exam_data['science'] > 85)
    # ]
    # .query() (shorter)
    result = exam_data.query('math > 80 and science > 85')
    print(result)
    print()
    
    print("=== Example 3: OR - grade 10 or grade 12 ===")
    result = exam_data.query('grade_level == 10 or grade_level == 12')
    print(result)
    print()
    
    print("=== Example 4: use in to test several values ===")
    # Equivalent to .isin()
    result = exam_data.query('grade_level in [10, 11]')
    print(result)
    
    return exam_data

# Try it out
df_exam = basic_query_demo()
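
Column names that contain spaces cannot be written directly inside a query string. In pandas versions that support it, wrapping the name in backticks is one way around this. The sketch below uses a made-up DataFrame named `scores`.

```python
import pandas as pd

# Hypothetical data with column names that contain spaces
scores = pd.DataFrame({
    "first name": ["John", "Jane"],
    "total score": [70, 90],
})

# Backticks mark `total score` as a single column name inside the query string
high = scores.query("`total score` > 80")

print(high["first name"].tolist())  # ['Jane']
```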

3.4.2 Using External Variables in .query()

Prefix a name with @ to refer to a Python variable defined outside the DataFrame.

def query_with_variables():
    """
    Use external variables in .query().
    Write @ followed by the variable name.
    """
    products = pd.DataFrame({
        'name': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Mouse'],
        'price': [25000, 15000, 12000, 2500, 800],
        'stock': [15, 50, 30, 100, 200],
        'category': ['Computer', 'Mobile', 'Mobile', 'Accessory', 'Accessory']
    })
    
    # Variables that define the conditions
    min_price = 10000
    max_price = 20000
    interested_categories = ['Mobile', 'Computer']
    
    print("=== Example 1: variables that define a price range ===")
    result = products.query('price >= @min_price and price <= @max_price')
    print(result)
    print()
    
    print("=== Example 2: a list variable ===")
    result = products.query('category in @interested_categories')
    print(result)
    print()
    
    print("=== Example 3: arithmetic with variables ===")
    margin_percent = 0.20  # 20% margin
    result = products.query('price > (@min_price * (1 + @margin_percent))')
    print(result)
    
    return products

# Try it out
df_prod = query_with_variables()

3.4.3 Complex Conditions with .query()

def complex_query_demo():
    """
    Examples of complex conditions with .query().
    """
    # Employee data
    employees = pd.DataFrame({
        'name': ['สมชาย', 'สมหญิง', 'สมศักดิ์', 'สมพร', 'สมใจ', 'สมควร'],
        'department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT'],
        'salary': [35000, 48000, 38000, 65000, 42000, 55000],
        'age': [25, 32, 28, 45, 29, 38],
        'years_exp': [2, 5, 3, 15, 4, 10],
        'performance_score': [85, 92, 78, 95, 88, 90]
    })
    
    print("=== Complex condition 1: ===")
    print("IT department and (salary > 40000 or score > 85)")
    result = employees.query(
        'department == "IT" and (salary > 40000 or performance_score > 85)'
    )
    print(result)
    print()
    
    print("=== Complex condition 2: ===")
    print("(age 25-35 and experience > 3) or score > 90")
    result = employees.query(
        '(age >= 25 and age <= 35 and years_exp > 3) or performance_score > 90'
    )
    print(result)
    print()
    
    print("=== Excluding a value with != ===")
    print("Not in the HR department and salary >= 50000")
    result = employees.query(
        'department != "HR" and salary >= 50000'
    )
    print(result)
    
    return employees

# Try it out
df_emp_complex = complex_query_demo()

3.4.4 Comparing Boolean Indexing and .query()

Feature	Boolean Indexing	.query()
Syntax	`df[(df['a'] > 5) & (df['b'] < 10)]`	`df.query('a > 5 and b < 10')`
Length	Longer	Shorter
Readability	Takes some getting used to	Reads close to English
Parentheses	Required around each condition	Only needed for grouping
Operators	`&`, `|`, `~`	`and`, `or`, `not`
External variables	Used directly	Must be prefixed with `@`
Performance	Usually slightly faster on small DataFrames	Can be faster on very large DataFrames when the optional numexpr package is installed
Use when	Conditions are simple or speed on small data matters	Conditions are complex and clarity matters

Direct comparison:

# Build the data
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['x', 'y', 'x', 'z', 'y']
})

# Condition: A > 2 and B < 45 and C == 'x'

# Approach 1: Boolean indexing (longer)
result1 = df[(df['A'] > 2) & (df['B'] < 45) & (df['C'] == 'x')]

# Approach 2: .query() (shorter, easier to read)
result2 = df.query('A > 2 and B < 45 and C == "x"')

print("Both give the same result:")
print(result2)
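
Rather than comparing the two printed results by eye, the claim can be checked programmatically with `DataFrame.equals()`. A minimal sketch on the same data follows. `df_cmp`, `by_mask`, and `by_query` are illustrative names.

```python
import pandas as pd

df_cmp = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 50],
    "C": ["x", "y", "x", "z", "y"],
})

by_mask = df_cmp[(df_cmp["A"] > 2) & (df_cmp["B"] < 45) & (df_cmp["C"] == "x")]
by_query = df_cmp.query('A > 2 and B < 45 and C == "x"')

# equals() compares values, index, and column dtypes
print(by_mask.equals(by_query))  # True
print(list(by_mask.index))       # [2]
```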

3.5 Advanced Techniques and Best Practices

3.5.1 Combining .loc[] with Boolean Indexing

def loc_with_boolean():
    """
    Use .loc[] together with boolean indexing
    to select both the matching rows and the columns you want.
    """
    sales = pd.DataFrame({
        'product': ['A', 'B', 'C', 'D', 'E'],
        'revenue': [10000, 25000, 15000, 30000, 12000],
        'cost': [7000, 18000, 11000, 22000, 9000],
        'profit': [3000, 7000, 4000, 8000, 3000],
        'region': ['North', 'South', 'North', 'East', 'West']
    })
    
    print("=== Only product and profit, for products with profit > 5000 ===")
    # Build a boolean mask
    high_profit = sales['profit'] > 5000
    # Use .loc[] to select rows and columns in one step
    result = sales.loc[high_profit, ['product', 'profit']]
    print(result)
    print()
    
    print("=== Region North, showing only revenue and cost ===")
    result = sales.loc[sales['region'] == 'North', ['revenue', 'cost']]
    print(result)
    
    return sales

loc_with_boolean()

3.5.2 Using .loc[] to Modify Values

def modify_with_loc():
    """
    Use .loc[] to modify values in the rows that match a condition.
    """
    products = pd.DataFrame({
        'name': ['Product A', 'Product B', 'Product C', 'Product D'],
        'price': [100, 200, 150, 300],
        'discount': [0, 0, 0, 0],
        'category': ['Electronics', 'Clothing', 'Electronics', 'Clothing']
    })
    
    print("Initial data:")
    print(products)
    print()
    
    # Give a 10% discount to Electronics products priced above 100
    products.loc[
        (products['category'] == 'Electronics') & (products['price'] > 100),
        'discount'
    ] = 10
    
    print("After applying the discount:")
    print(products)
    print()
    
    # Add a column with the price after the discount
    products.loc[:, 'final_price'] = products['price'] * (1 - products['discount']/100)
    print("With the final price column added:")
    print(products)

modify_with_loc()

3.5.3 Performance Tips

def performance_comparison():
    """
    Compare the speed of different approaches.
    
    Tips:
    1. .query() can be faster on very large DataFrames (roughly > 100,000 rows),
       mainly when the optional numexpr package is installed
    2. Boolean indexing is simpler when the condition is not complex
    3. Avoid looping over rows; use vectorized operations instead
    4. Chain operations rather than naming many temporary DataFrames
    """
    import numpy as np
    import time
    
    # Build a large dataset
    large_df = pd.DataFrame({
        'A': np.random.randint(0, 100, 100000),
        'B': np.random.randint(0, 100, 100000),
        'C': np.random.choice(['X', 'Y', 'Z'], 100000)
    })
    
    print("=== Timing comparison (single runs, so the numbers are rough) ===")
    
    # Approach 1: Boolean indexing
    start = time.perf_counter()
    result1 = large_df[(large_df['A'] > 50) & (large_df['B'] < 30)]
    time1 = time.perf_counter() - start
    print(f"Boolean Indexing: {time1:.4f} s")
    
    # Approach 2: .query()
    start = time.perf_counter()
    result2 = large_df.query('A > 50 and B < 30')
    time2 = time.perf_counter() - start
    print(f".query(): {time2:.4f} s")
    
    # ❌ What not to do: loop over the rows
    # start = time.perf_counter()
    # result3 = []
    # for i in range(len(large_df)):
    #     if large_df.iloc[i]['A'] > 50 and large_df.iloc[i]['B'] < 30:
    #         result3.append(large_df.iloc[i])
    # time3 = time.perf_counter() - start
    # print(f"Loop (not recommended): {time3:.4f} s")

performance_comparison()
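
A single timed run is sensitive to noise such as caching and other processes. One way to get steadier numbers is the standard-library `timeit` module, sketched below under the assumption of a 100,000-row frame of random integers. The helper names `with_mask` and `with_query` are illustrative, and the measured times depend on the machine and on whether numexpr is installed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the data is reproducible
large_df = pd.DataFrame({
    "A": rng.integers(0, 100, 100_000),
    "B": rng.integers(0, 100, 100_000),
})

def with_mask():
    """Filter with boolean indexing."""
    return large_df[(large_df["A"] > 50) & (large_df["B"] < 30)]

def with_query():
    """Filter with .query()."""
    return large_df.query("A > 50 and B < 30")

# Run each approach 5 times per measurement, repeat 3 times, keep the best
mask_time = min(timeit.repeat(with_mask, number=5, repeat=3))
query_time = min(timeit.repeat(with_query, number=5, repeat=3))

print(f"Boolean indexing: {mask_time:.4f} s for 5 runs")
print(f".query():         {query_time:.4f} s for 5 runs")
```

Taking the minimum over several repeats reduces the influence of one-off slowdowns, which makes the comparison between the two approaches more trustworthy.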

3.5.4 Common Pitfalls

def common_mistakes():
    """
    Common mistakes and how to fix them.
    """
    df = pd.DataFrame({
        'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]
    })
    
    print("=== ❌ Mistake 1: forgetting parentheses around conditions ===")
    try:
        # & binds tighter than > and <, so this is not (A > 2) & (B < 45)
        result = df[df['A'] > 2 & df['B'] < 45]
    except ValueError as err:
        print(f"ValueError: {err}")
    
    # ✅ Correct
    result = df[(df['A'] > 2) & (df['B'] < 45)]
    print("Correct version:")
    print(result)
    print()
    
    print("=== ❌ Mistake 2: using and/or instead of &/| ===")
    try:
        # 'and' needs a single True/False, which a Series cannot provide
        result = df[(df['A'] > 2) and (df['B'] < 45)]
    except ValueError as err:
        print(f"ValueError: {err}")
    print()
    
    print("=== ❌ Mistake 3: confusing .loc with .iloc ===")
    df_indexed = pd.DataFrame(
        {'value': [10, 20, 30]},
        index=['a', 'b', 'c']
    )
    
    print("df.loc['a'] returns the row labelled 'a':")
    print(df_indexed.loc['a'])
    print()
    
    print("df.iloc[0] returns the row at position 0:")
    print(df_indexed.iloc[0])
    print()
    
    print("=== ⚠️ Warning 4: SettingWithCopyWarning ===")
    # Modifying a filtered subset may not affect the original DataFrame
    # Be explicit: use .loc[] on the original, or take a .copy()
    subset = df[df['A'] > 2].copy()  # .copy() makes the subset independent
    subset.loc[:, 'B'] = 999
    print("Modifying subset does not affect the original df")

common_mistakes()
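
The two intentions behind the last warning, changing the original versus changing an independent subset, can be made explicit as follows. This is a minimal sketch with made-up data. `df_orig` and `subset` are illustrative names.

```python
import pandas as pd

df_orig = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})

# Intention 1: change the original. Use one .loc[] call with mask and column.
df_orig.loc[df_orig["A"] > 2, "B"] = 999

# Intention 2: change only a subset. Take an explicit copy first.
subset = df_orig[df_orig["A"] <= 2].copy()
subset["B"] = 0

print(df_orig["B"].tolist())  # [10, 20, 999, 999, 999]
print(subset["B"].tolist())   # [0, 0]
```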

Summary

Selecting and filtering data is one of the most fundamental Pandas skills. It covers:

The four main kinds of tools:

Column selection:
- df['col'] - selects a single column (returns a Series)
- df[['col1', 'col2']] - selects multiple columns (returns a DataFrame)
Selection by label or position:
- .loc[] - label-based, slices include the endpoint
- .iloc[] - integer-based, slices exclude the endpoint
Boolean Indexing:
- Single condition: df[df['age'] > 25]
- Multiple conditions: df[(df['age'] > 25) & (df['city'] == 'Bangkok')]
- .isin() - tests membership in a list
- .between() - filters values in a range
- .str methods - filter string data
.query() Method:
- Write conditions as SQL-like strings
- Use @ to reference external variables
- Easier to read, well suited to complex conditions

Best practices for selecting and filtering data:

Use .loc[] when you know the index/column names
Use .iloc[] when you work with exact positions
Use .query() when conditions are complex and clarity matters
When using & and |, wrap each condition in parentheses
Avoid looping over rows; use vectorized operations
Use .copy() when you want to modify a subset without affecting the original

Relationship between the quantities:

$$\text{filtered\_rows} = \text{total\_rows} \times \text{selectivity}$$

where selectivity = the fraction of rows that satisfy the condition (0 ≤ selectivity ≤ 1)
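
Selectivity can be read straight off a boolean mask, because the mean of True/False values is the fraction of True entries. The following is a small sketch with made-up salary figures. `staff` and `mask` are illustrative names.

```python
import pandas as pd

staff = pd.DataFrame({"salary": [35000, 48000, 38000, 65000, 42000]})

mask = staff["salary"] < 40000
total_rows = len(staff)
selectivity = mask.mean()          # fraction of rows where the mask is True
filtered_rows = len(staff[mask])

print(f"selectivity = {selectivity:.2f}")
print(f"filtered_rows = {filtered_rows}")
print(f"total_rows * selectivity = {total_rows * selectivity:.1f}")
```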


Note: this document is part of a complete Pandas guide covering the use of Pandas for data analysis in Python.