Breast Cancer
Team members: 陳致安, 陳昱豪
Topic Introduction
- Use features of the cell nuclei in breast tumor biopsy slides to determine whether a tumor is benign or malignant.

Literature Review
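Gouda I. Salama, M. B. Abdelhalim, and Magdy Abd-elghany Zeid, "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers"
柯建全, 乳房腫瘤切片電腦診斷系統 (a computer-aided diagnosis system for breast tumor biopsy slides)
"The curvature of the nuclear contour, the nuclear area, the nuclear compactness, and the nuclear contrast are used as features for deciding whether a lesion is present."
The accuracy of MLP (neural network), 95.279, is higher than that of J48, 95.1359.
"The sensitivity metric reflects differences in image quality: an overly complex background, overly dense nuclei, or uneven brightness inside the nuclei all lower the sensitivity."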
Attribute Introduction

| Instances | Attributes | Missing values |
|---|---|---|
| 569 | 32 | 0 |
Data Preprocessing
InfoGain Good (Full training set)

InfoGain (Full training set)

InfoGain Good (Cross-Validation; Folds: 10, Seed: 3)

InfoGain (Cross-Validation; Folds: 10, Seed: 3)

GainRatio Good (Full training set)

GainRatio (Full training set)

GainRatio Good (Cross-Validation; Folds: 10, Seed: 3)

GainRatio (Cross-Validation; Folds: 10, Seed: 3)

Deleted columns
- 1 (ID_Number)
- Fractal Dimension: 12 (Mean), 22 (Standard_Error), 32 (Worst)
- 14 (Texture, Standard_Error)
- 17 (Smoothness, Standard_Error)
Best columns
- 25 (WorstPerimeter)
- 23 (WorstRadius), 26 (WorstArea)
- 30 (WorstConcavePoints)
- 10 (MeanConcavePoints)
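
The two ranking measures used above can be sketched in a few lines. This is a minimal illustration of the textbook definitions of information gain and gain ratio for a discretized attribute, not the tool output behind these slides; the toy `attribute` and `diagnosis` lists are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Class entropy minus the expected class entropy after splitting on the attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the entropy of the attribute's own value distribution."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one discretized attribute and a benign (B) / malignant (M) label.
attribute = ["low", "low", "low", "high", "high", "high"]
diagnosis = ["B", "B", "B", "M", "M", "B"]

print(round(info_gain(attribute, diagnosis), 4))
print(round(gain_ratio(attribute, diagnosis), 4))
```

Columns with a score near zero say little about the diagnosis, which is the usual reason for dropping them; columns with high scores under both measures are the natural candidates for the "best columns" list.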
Classification

Data Discretization
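
The slides do not state which discretization method was used. The sketch below assumes equal-width binning into 4 bins, which would be consistent with the "4bins" label and with interval labels such as `(-inf-732.875]` that appear later. The range 143.5 to 2501 for `meanArea` is an assumption chosen so that the first cut point matches 732.875; the function names are illustrative.

```python
def equal_width_cut_points(lo, hi, bins):
    """Return the bins-1 interior cut points that split [lo, hi] into equal-width bins."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def bin_index(value, cut_points):
    """Index of the bin holding value; each bin is closed on the right, e.g. (-inf, 732.875]."""
    for i, cut in enumerate(cut_points):
        if value <= cut:
            return i
    return len(cut_points)

# Assumed range for meanArea; the source reports only the resulting cut point.
cuts = equal_width_cut_points(143.5, 2501.0, 4)
print(cuts)
print(bin_index(700.0, cuts), bin_index(2501.0, cuts))
```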

ID3

1.7575
0.5848
4.3859
ID3 (4 bins)
- Level 1: WorstConcavePoints
- Level 2: RadiusSE, WorstPerimeter, WorstRadius
- Level 3: WorstPerimeter, WorstTexture, MeanSmoothness
J48

0.703
0.8771
J48 Tree

The tree is identical under all three training modes.
Neural Network

0.3515
Suitable algorithm
- Accuracy (before alteration): Neural network > ID3 > J48
- Magnitude of the accuracy change: ID3 > J48 > Neural network
- The 70% training set split performed poorly.
Clustering
K-means
Number of clusters: 2
Seed: 3
Error rate: 6.5026%
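
One common way to get an error rate out of a clustering is to label each cluster with its majority class and count the instances that disagree; 6.5026% of 569 instances corresponds to 37 instances. The sketch below is a plain Lloyd's-algorithm K-means with that evaluation, written under the assumption that the reported figure was computed in this spirit. The toy `points` and `labels` are hypothetical.

```python
import random

def kmeans(points, k, seed, iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, cluster index of each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

def cluster_error_rate(assignment, labels, k):
    """Label each cluster with its majority class and return the share of mismatches."""
    wrong = 0
    for c in range(k):
        member_labels = [y for a, y in zip(assignment, labels) if a == c]
        if member_labels:
            majority = max(set(member_labels), key=member_labels.count)
            wrong += sum(1 for y in member_labels if y != majority)
    return wrong / len(labels)

# Hypothetical toy data: two well-separated groups with benign (B) / malignant (M) labels.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["B", "B", "B", "M", "M", "M"]

centroids, assignment = kmeans(points, k=2, seed=3)
error = cluster_error_rate(assignment, labels, k=2)
print(centroids, error)
```

Because the starting centroids depend on the seed, rerunning with a different `seed` can change the clusters and therefore the error rate on less cleanly separated data.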

Same number of clusters, different seeds

K-means centroids for each attribute

Worst Perimeter (Good)

Worst Radius (Good)

Worst Area (Good)

Mean Symmetry (Bad)


Better columns
- From GainRatio: WorstPerimeter, WorstRadius, WorstArea, WorstConcavePoints, MeanConcavePoints
- From InfoGain: WorstPerimeter, WorstArea, WorstRadius, WorstConcavePoints, MeanConcavePoints
- From K-means: WorstPerimeter, WorstArea, WorstRadius
Association Rules
Apriori
Min Support: 0.8

Min Support: 0.7

Min Support: 0.6

Min Support: 0.7
1. meanArea='(-inf-732.875]' 413 ==> worstArea='(-inf-1202.4]' 412 conf:(1)
When the mean nuclear area is at most 732.875, the worst nuclear area is at most 1202.4.
2. worstArea='(-inf-1202.4]' 439 ==> meanArea='(-inf-732.875]' 412 conf:(0.94)
When the worst nuclear area is at most 1202.4, the mean nuclear area is at most 732.875.
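
The confidence values in these two rules follow directly from the counts printed next to them. A minimal sketch, using the standard definitions of support and confidence and the counts shown above (413, 439, 412, and 569 instances in total); the function names are illustrative.

```python
def confidence(antecedent_count, rule_count):
    """Share of instances matching the left-hand side that also match the right-hand side."""
    return rule_count / antecedent_count

def support(rule_count, total):
    """Share of all instances that match both sides of the rule."""
    return rule_count / total

TOTAL = 569  # number of instances in the dataset

rule1 = confidence(413, 412)        # meanArea ==> worstArea
rule2 = confidence(439, 412)        # worstArea ==> meanArea
rule_support = support(412, TOTAL)  # both rules cover the same 412 instances

print(round(rule1, 2), round(rule2, 2), round(rule_support, 3))
```

Rule 1 fails for only one of its 413 matching instances, so its confidence rounds to 1, and both rules clear the 0.7 minimum support.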
Min Support: 0.6
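1. When the mean nuclear area is at most 732.875 and the mean concavity is at most 0.1067, the worst nuclear area is at most 1202.4. conf:(1)
2. When the diagnosis is benign and the mean nuclear area is at most 732.875, the worst nuclear area is at most 1202.4. conf:(1)
3. When the mean concave points value is at most 0.0503, the worst nuclear area is at most 1202.4. conf:(0.99)
4. When the diagnosis is benign and the worst nuclear area is at most 1202.4, the mean nuclear area is at most 732.875. conf:(0.98)
5. When the diagnosis is benign, the mean nuclear area is at most 732.875 and the worst nuclear area is at most 1202.4. conf:(0.98)
6. When the mean concavity is at most 0.1067 and the worst nuclear area is at most 1202.4, the mean nuclear area is at most 732.875. conf:(0.97)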
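Min Support: 0.6 (continued)
7. When the worst concavity is at most 0.313, the mean concavity is at most 0.1067. conf:(0.96)
8. When the mean concavity is at most 0.1067, the worst nuclear area is at most 1202.4. conf:(0.94)
9. When the mean concavity is at most 0.1067, the mean nuclear area is at most 732.875 and the worst nuclear area is at most 1202.4. conf:(0.91)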
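Conclusion

During data preprocessing, InfoGain and GainRatio were used to identify low-relevance columns, which were then deleted.
The ID3 classification and K-means clustering results were combined with the Apriori association-rule analysis.
This produced rules such as: when the mean nuclear area is at most 732.875 and the mean concavity is at most 0.1067, the worst nuclear area is at most 1202.4.
These results support the view that, when diagnosing a tumor as benign or malignant,
specific nuclear features (e.g., Worst Perimeter, Worst Concavity) are highly correlated with the diagnosis.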
References
1. Gouda I. Salama, M. B. Abdelhalim, and Magdy Abd-elghany Zeid, "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers"
2. 柯建全, 乳房腫瘤切片電腦診斷系統
3. Shih-Chieh Ting and Dr. Duen-Ren Liu, "Applying Data Mining Techniques to the Analysis of Wafer Testing"
4. http://goo.gl/qzBnft (癌邦網, breast cancer information)
5. http://www.phalanx.com.tw/attachment/EDM/201405/TWDM/Report.pdf (華聯生技, cancer stem cells)
6. http://www.cc.ntu.edu.tw/chinese/epaper/0029/20140620_2905.html (NTU Computer and Information Networking Center e-newsletter, "資料探勘 SO EASY")
7. Wikipedia: https://www.wikipedia.org/
8. http://www.nature.com/nrc/journal/v9/n11/full/nrc2757.html (Nature, "Cancer stem cells")
Thank you!
BreastCancer
By jackiechen08
Final report for the Data Mining course