0. Personal Notes
1. Learning Python by Mark Lutz
2. Mastering Python by Rick van Hattem
3. Data Science from Scratch
- First Principles with Python by Joel Grus (self-study notes)
4/5/2018
EE degree back in 1982
Z80 was the most popular CPU
Pascal/Fortran/COBOL were popular languages
Apple ][ + BASIC and CP/M
Intel 80386SX PC motherboard designer
......
Interested in Linux since 2016
Z80 CPU
Intel 80386SX CPU
photo source: wikipedia.org
Apple ][
marconi.jiang@gmail.com
0. System
1. Miscellaneous
4/28/2018
dir() lists the names currently defined in the session
['In', 'Out', '_', '__', '___', '__builtin__',
'__builtins__', '__doc__', '__loader__',
'__name__', '__package__', '__spec__',
'_dh', '_i', '_i1', '_ih', '_ii', '_iii', '_oh',
'_sh', 'exit', 'get_ipython', 'quit']
dir(__builtin__) lists the names defined in the built-in module
['ArithmeticError',
'AssertionError',
'AttributeError', ....]
id(x), id(7)
id() is a built-in function in Python. Syntax: id(object)
The id() function returns the identity of the specified object. Every object in Python has its own id, which is assigned when the object is created.
The function takes a single argument. The identity it returns is guaranteed to be unique and constant for that object during its lifetime.
In CPython the id is the object's memory address, so it generally differs each time you run the program. Small integers from -5 to 256 are cached, so all references to one of those values share a single object and therefore a single id.
Two objects with non-overlapping lifetimes may have the same id() value.
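A minimal sketch of the id() behavior described above. The variable names are only for illustration, and the small-integer check relies on a CPython implementation detail, so other interpreters may print something different for that line.

```python
# Two names bound to the same object share one id
x = [1, 2, 3]
y = x            # y is another name for the same list
z = list(x)      # z is a new list with equal contents

same_object = id(x) == id(y)
different_object = id(x) != id(z)
print(same_object, different_object)

# CPython-specific: small integers (-5 to 256) are cached and shared
small_a = int("256")
small_b = int("256")
print(small_a is small_b)   # expected True on CPython only
```

Because x and z are alive at the same time, their ids must differ even though x == z.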
type(a): shows the type of an object
Mutable
List
Immutable
String, Tuple
0. Miscellaneous
Installing Anaconda
How to change the Jupyter start-up folder
Open cmd (or Anaconda Prompt) and run:
$ jupyter notebook --generate-config
This writes a file to:
~/.jupyter/jupyter_notebook_config.py
Search for the following line:
#c.NotebookApp.notebook_dir = ''
Replace it with your folder and uncomment it:
c.NotebookApp.notebook_dir = '/Volumes/HDD160G/Dropbox/'
Make sure you use forward slashes in your path and use /home/user/ instead of ~/ for your home directory
0. Miscellaneous
About import
Imports that are easy to get wrong
import numpy
>>> a = numpy.array([1, 2,3,4])
import numpy as np
>>> a = np.array([1, 2,3,4])
from numpy import *
>>> a = array([1, 2,3,4])
>>> a
array([1, 2, 3, 4])
>>> a.dtype
dtype('int32')
>>> a = array(1,2,3,4) # WRONG
>>> a = array([1,2,3,4]) # RIGHT
0. Miscellaneous
About Data Types
Python Data Types
Numeric types
String types
Container types
Reference: book, Python 與量化投資
>>> S0 = {}
>>> S1 = set()
>>> type(S0)
<class 'dict'>
>>> type(S1)
<class 'set'>
>>> set('Quant')
{'a', 'Q', 'u', 'n', 't'}
>>> type(set('Quant'))
<class 'set'>
>>> {'Quant'}
{'Quant'}
>>> len({'Quant'})
1
>>> type({'Quant'})
<class 'set'>
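The session above can be condensed into a short script. This is only a sketch of the same point: empty braces make a dict, set() makes an empty set, and set('Quant') iterates over the characters while {'Quant'} holds the whole string.

```python
empty_dict = {}          # empty braces always mean a dict
empty_set = set()        # the only way to write an empty set
letters = set('Quant')   # a string is iterated character by character
one_word = {'Quant'}     # braces with items build a set holding the whole string

print(type(empty_dict).__name__, type(empty_set).__name__)
print(sorted(letters), len(one_word))
```

sorted() is used for printing because a set has no fixed display order.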
Python Data Types – Learn From Basic To Advanced
Reference: 淺談 Python 的屬性
This works since there is no way (1,2,3,4) could be a generator. There is nothing to generate there, you just specified all the elements, not a rule to obtain them.
By contrast, (i for i in sample_list) is a generator expression, not a tuple. Python has no tuple comprehension, because the parenthesized comprehension syntax is reserved for generators. To get a tuple, pass the generator to tuple().
Iterating over the generator expression or the list comprehension will do the same thing. However, the list comprehension will create the entire list in memory first while the generator expression will create the items on the fly, so you are able to use it for very large (and also infinite!) sequences.
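A small sketch comparing a generator expression with a list comprehension. sample_list comes from the text above; the other names and the range size are assumptions chosen for illustration.

```python
import sys

sample_list = [1, 2, 3, 4]

gen = (i for i in sample_list)             # generator expression: lazy
as_list = [i for i in sample_list]         # list comprehension: built at once
as_tuple = tuple(i for i in sample_list)   # explicit conversion gives a tuple

print(type(gen).__name__, type(as_tuple).__name__)

# A generator over a large range stays small; the list holds every item
big_gen = (i for i in range(100000))
big_list = [i for i in range(100000)]
print(sys.getsizeof(big_gen) < sys.getsizeof(big_list))

# A generator can be consumed only once
first_pass = list(gen)
second_pass = list(gen)
print(first_pass, second_pass)
```

The second pass is empty because the generator is already exhausted, which is the price of not keeping the items in memory.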
Difference between list and tuple
Literal
someTuple = (1,2)
someList = [1,2]
Size
a = tuple(range(1000))
b = list(range(1000))
c = (i for i in range(1000)) # a generator saves even more memory
a.__sizeof__() # 8024
b.__sizeof__() # 9088
c.__sizeof__() # 64
(The exact sizes depend on the Python version and platform.)
Because a tuple is smaller, operations on it are slightly faster, but the difference is negligible unless you have a huge number of elements.
Usage
As a list is mutable, it can't be used as a key in a dictionary, whereas a tuple can be used.
a = (1,2)
b = [1,2]
c = {a: 1} # OK
c = {b: 1} # TypeError: unhashable type: 'list'
Permitted operations
b = [1,2]
b[0] = 3 # [3, 2]
a = (1,2)
a[0] = 3 # TypeError: 'tuple' object does not support item assignment
That also means that you can't delete an element from a tuple or sort it in place. You can, however, use += on both a list and a tuple. The difference is that += on a tuple creates a new object, so its id changes, while the list is extended in place and keeps its id.
a = (1,2)
b = [1,2]
id(a) # 140230916716520
id(b) # 748527696
a += (3,) # (1, 2, 3)
b += [3] # [1, 2, 3]
id(a) # 140230916878160
id(b) # 748527696
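The id values above differ from run to run, so here is a sketch that checks object identity directly instead of printing addresses. old_a, old_b, and lookup are names introduced only for this illustration.

```python
a = (1, 2)
b = [1, 2]
old_a, old_b = a, b      # keep references to the original objects

a += (3,)                # builds a new tuple and rebinds the name a
b += [3]                 # extends the existing list in place

print(a is old_a, b is old_b)
print(old_a, old_b)

# Tuples are hashable (when their items are), so they can be dict keys
lookup = {a: 'tuple key works'}
try:
    lookup[b] = 'list key'
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)
print(list_key_error)
```

Note that old_b shows the appended 3 because it is still the same list, while old_a is unchanged.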
Lists and tuples also differ in their methods: for example, a list can be sorted in place with sort(), while a tuple has no sort() method.
List characteristics:
List is a collection which is ordered and changeable. Allows duplicate members.
Tuple characteristics:
Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
That is, a tuple is ordered and unchangeable (immutable).
0. Miscellaneous
About NumPy and arrays
Imports that are easy to get wrong
import numpy
>>> a = numpy.array([1, 2,3,4])
import numpy as np
>>> a = np.array([1, 2,3,4])
from numpy import *
>>> a = array([1, 2,3,4])
>>> a
array([1, 2, 3, 4])
>>> a.dtype
dtype('int32')
>>> a = array(1,2,3,4) # WRONG
>>> a = array([1,2,3,4]) # RIGHT
When you print an array, NumPy displays it much like nested lists, with the following layout:
>>> c = arange(24).reshape(2,3,4) # 3d array
>>> print(c)
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
>>> c
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
(2, 3, 4) = (z, y, x)
z = 0
  x = [0 - 3]
  (y=0) y*4: [a0, +1, +2, +3]
  (y=1) y*4: [a4+0, +1, +2, +3]
  (y=2) y*4: [a8+0, +1, +2, +3]
z = 1 (starting from y * x = 3 * 4 = 12)
  x = [0 - 3]
  (y=0) y*4: [a12, +1, +2, +3]
  (y=1) y*4: [a12+4+0, +1, +2, +3]
  (y=2) y*4: [a12+8+0, +1, +2, +3]
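The offset bookkeeping above can be checked in code. This is a sketch that assumes NumPy is installed and that the array is built with arange, so each element's value equals its flat position; nz, ny, nx are names introduced here for the three axis lengths.

```python
import numpy as np

c = np.arange(24).reshape(2, 3, 4)   # shape is (z, y, x)
nz, ny, nx = c.shape

# Every element should equal its flat offset z*(ny*nx) + y*nx + x
offsets_match = all(
    c[z, y, x] == z * ny * nx + y * nx + x
    for z in range(nz)
    for y in range(ny)
    for x in range(nx)
)
print(offsets_match)
print(c[1, 0, 0], c[1, 2, 3])   # start of the z=1 block, and the last element
```

The z=1 block starts at 12 because one z block holds ny * nx = 3 * 4 elements.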
print() and the interactive prompt display an array differently
Another example
( [ [0, 1],
[2, 3] ] )
In [1]: input_data =np.array([2,3])
In [2]: input_data.reshape(1,2).shape
Out[2]: (1, 2)
In [3]: input_data.reshape(1,2)
Out[3]: array([[2, 3]])
In [4]: input_data.shape
Out[4]: (2,)
In [5]: input_data
Out[5]: array([2, 3])
In [6]: input_data * weights['node_1']
Out[6]: array([-2, 3])
The dtype and itemsize shown here depend on the OS and platform settings; on my machine they are 'int64' and 8 respectively.
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int32'
>>> a.itemsize
4
>>> a.size
15
>>> type(a)
numpy.ndarray
Complex numbers
Other functions: array, zeros, zeros_like, ones, ones_like, empty, empty_like, arange, linspace, rand, randn, fromfunction, fromfile. See: NumPy examples
>>> c = np.array([[1,2], [3,4]], dtype=complex)
>>> c
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]])
>>> np.zeros( (3,4) )
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
>>> np.ones( (2,3,4), dtype=np.int16 ) # dtype can also be specified
array([[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]],
[[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[ 1, 1, 1, 1]]], dtype=int16)
>>> np.empty( (2,3) )
array([[1.39069238e-309, 1.39069238e-309, 1.39069238e-309],
[1.39069238e-309, 1.39069238e-309, 1.39069238e-309]])
>>> np.linspace(0, np.pi, 3)
array([0. , 1.57079633, 3.14159265])
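The function zeros creates an array full of zeros, ones creates an array full of ones, and empty creates an array whose initial contents are arbitrary and depend on the state of memory. By default, the dtype of the created array is float64.
sum(), min(), max()
By specifying the axis parameter, you can apply an operation along the given axis of an array: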
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> a.sum()
66
>>> a.min()
0
>>> a.max()
11
>>> a.sum(axis=0) # sum of each column
array([12, 15, 18, 21])
>>>
>>> a.min(axis=1) # min of each row
array([0, 4, 8])
>>>
>>> a.cumsum(axis=1) # cumulative sum along each row
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38]])
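NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy these are called "universal functions" (ufunc). They operate elementwise on an array and produce an array as output.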
>>> b = np.arange(3)
>>> b
array([0, 1, 2])
>>> np.exp(b)
array([ 1. , 2.71828183, 7.3890561 ])
>>> np.sqrt(b)
array([ 0. , 1. , 1.41421356])
>>> c = np.array([2., -1., 4.])
>>> np.add(b, c)
array([ 2., 0., 6.])
More functions: all, alltrue, any, apply_along_axis, argmax, argmin, argsort, average, bincount, ceil, clip, conj, conjugate, corrcoef, cov, cross, cumprod, cumsum, diff, dot, floor, inner, inv, lexsort, max, maximum, mean, median, min, minimum, nonzero, outer, prod, re, round, sometrue, sort, std, sum, trace, transpose, var, vdot, vectorize, where. See: NumPy examples
When operating on arrays of different types, the type of the resulting array corresponds to the more general or more precise one (a behavior known as upcasting).
So far this covers only about 1/5 of the reference material. To be continued.
TypeError: can't multiply sequence by non-int of type 'list'
input_data = ([3, 5])
weights = {'node_0_0': ([2, 4]),
'node_0_1': ([ 4, -5]),
'node_1_0': ([-1, 2]),
'node_1_1': ([1, 2]),
'output': ([2, 7])}
input_data * weights['node_0_0']
TypeError: can't multiply sequence by non-int of type 'list'
input_data = np.array([3, 5])
weights = {'node_0_0': ([2, 4]),
'node_0_1': ([ 4, -5]),
'node_1_0': ([-1, 2]),
'node_1_1': ([1, 2]),
'output': ([2, 7])}
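A sketch of why the first version fails and the second works, assuming NumPy is installed. plain_input, plain_weights, and product are names introduced for this illustration, and only one of the weight entries is reproduced.

```python
import numpy as np

plain_input = [3, 5]
plain_weights = [2, 4]

# Multiplying a list by a list is not defined, so Python raises TypeError
try:
    plain_input * plain_weights
    list_error = None
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# With an ndarray on the left, the list is converted and multiplied elementwise
input_data = np.array([3, 5])
weights = {'node_0_0': [2, 4]}
product = input_data * weights['node_0_0']
print(product)         # elementwise products 3*2 and 5*4
print(product.sum())   # weighted sum for the node
```

Only one operand needs to be an ndarray; NumPy converts the other side before multiplying.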
pd.get_dummies vs. np_utils.to_categorical: different usage
References:
1. Python Software Foundation random
4. Stackoverflow Differences between numpy.random and random.random in Python
Note the last line: append stores the whole object as a single item
The list.append method appends an object to the end of the list.
my_list.append(object)
Whatever the object is, whether a number, a string, another list, or something else, it gets added onto the end of my_list as a single entry on the list.
>>> my_list
['foo', 'bar']
>>> my_list.append('baz')
>>> my_list
['foo', 'bar', 'baz']
So keep in mind that a list is an object. If you append another list onto a list, the first list will be a single object at the end of the list (which may not be what you want):
>>> another_list = [1, 2, 3]
>>> my_list.append(another_list)
>>> my_list
['foo', 'bar', 'baz', [1, 2, 3]]
#^^^^^^^^^--- single item on end of list.
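bigdata = data1.append(data2, ignore_index=True) # pandas DataFrame.append; removed in pandas 2.0, where pd.concat([data1, data2], ignore_index=True) does the same job
Note the last line: extend splits 'baz' into 3 separate items
=> So far I have only used append (I honestly did not know what extend was for)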
The list.extend method extends a list by appending elements from an iterable:
my_list.extend(iterable)
So with extend, each element of the iterable gets appended onto the list. For example:
>>> my_list
['foo', 'bar']
>>> another_list = [1, 2, 3]
>>> my_list.extend(another_list)
>>> my_list
['foo', 'bar', 1, 2, 3]
Keep in mind that a string is an iterable, so if you extend a list with a string, you'll append each character as you iterate over the string (which may not be what you want):
>>> my_list.extend('baz')
>>> my_list
['foo', 'bar', 1, 2, 3, 'b', 'a', 'z']
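The two methods side by side, as a small sketch. The list names appended and extended are introduced here so that both results can be compared from the same starting point.

```python
appended = ['foo', 'bar']
extended = ['foo', 'bar']
another_list = [1, 2, 3]

appended.append(another_list)   # adds the list itself as one item
extended.extend(another_list)   # adds each element of the list

print(appended, len(appended))
print(extended, len(extended))

# extend() iterates over its argument, so a string is split into characters
extended.extend('baz')
print(extended[-3:])

# Wrap the string in a list to add it as a single item with extend()
extended.extend(['baz'])
print(extended[-1])
```

A practical rule of thumb: use append for one item and extend to merge another iterable into the list.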
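Pandas basics
get_dummies vs to_categorical
Data structures provided by Pandas
1. Series: a one-dimensional array with an index, used for time-series data (such as sensor readings).
2. DataFrame: a two-dimensional data set with a row index and column labels, used for structured (table-like) data such as relational database tables or CSV files.
3. Panel: a three-dimensional data set with an item index, a row index, and column labels. (Panel has been removed from recent pandas releases.)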
1 – Boolean Indexing
2 – Apply Function
3 – Imputing missing files
4 – Pivot Table
5 – Multi-Indexing
6 – Crosstab
7 – Merge DataFrames
8 – Sorting DataFrames
9 – Plotting (Boxplot & Histogram)
10 – Cut function for binning
11 – Coding nominal data
12 – Iterating over rows of a dataframe
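Searching for strings in a DataFrame
df = df.drop(df[df['id'].str.contains('special string')].index, axis=0)
df[df['A'].str.contains("hello")]
When a numeric column in a DataFrame contains NaN, replace it with 0 (or another number, such as the mean)
df['column'] = df['column'].fillna(0) # can be another number instead of 0
# or
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean) # do not assign the result of fillna(..., inplace=True): it returns None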
df['sex']= df['sex'].map({'female':0, 'male': 1}).astype(int)
# or
male = []
for i in df['Sex']:
    if i == 'male':
        male.append(1)
    else:
        male.append(0)
df['sex'] = male
See to_categorical on the next slide
In[1]: df[:2]
Out[1]:
embarked survived pclass sex age sibsp parch fare
S 0 1 1 0 29.0000 0 0 211.3375
S 1 1 1 1 0.9167 1 2 151.5500
In[2]: x_OneHot_df = pd.get_dummies(data=df,columns=["embarked" ])
In[3]: x_OneHot_df[:2]
Out[3]:
embarked_C embarked_Q embarked_S survived pclass sex age sibsp parch fare
0 0 1 0 1 1 0 29.0000 0 0 211.3375
0 0 1 1 1 1 1 0.9167 1 2 151.5500
# or my dummy way
embarked_from_cherbourg = []
embarked_from_queenstown = []
embarked_from_southampton = []
for i in df['Embarked']:
    if i == 'C':
        embarked_from_cherbourg.append(1)
    else:
        embarked_from_cherbourg.append(0)
for i in df['Embarked']:
    if i == 'Q':
        embarked_from_queenstown.append(1)
    else:
        embarked_from_queenstown.append(0)
for i in df['Embarked']:
    if i == 'S':
        embarked_from_southampton.append(1)
    else:
        embarked_from_southampton.append(0)
df['embarked_from_cherbourg'] = embarked_from_cherbourg
df['embarked_from_queenstown'] = embarked_from_queenstown
df['embarked_from_southampton'] = embarked_from_southampton
See get_dummies() on the previous slide
from numpy import array
from numpy import argmax
from keras.utils import to_categorical
# define example
data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
data = array(data)
print(data)
# one hot encode
encoded = to_categorical(data)
print(encoded)
# invert encoding
inverted = argmax(encoded[0])
print(inverted)
# Source: https://itw01.com/GJFRE5J.html
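To see what one-hot encoding and its inversion do without installing Keras, here is a plain-Python sketch. one_hot and invert are hypothetical helpers written for this note, not Keras or pandas functions; they only mimic the idea behind to_categorical and argmax for non-negative integer labels.

```python
def one_hot(labels, num_classes=None):
    """Return a list of one-hot rows for a list of non-negative integer labels."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]


def invert(row):
    """Return the index of the largest entry (what argmax does for one row)."""
    return max(range(len(row)), key=lambda k: row[k])


data = [1, 3, 2, 0, 3, 2, 2, 1, 0, 1]
encoded = one_hot(data)
decoded = [invert(row) for row in encoded]

print(encoded[0])        # the row for label 1
print(decoded == data)   # inverting every row recovers the labels
```

Each row contains exactly one 1, at the position given by the label, which is why taking the index of the maximum recovers the original value.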
axis = 0 refers to the index (rows)
axis = 1 refers to the columns
df.loc['2018-1-1': , ['column1', 'column2']]
df.iloc[0: , [0, 1]] # iloc takes integer positions, not labels; [0, 1] assumes these are the first two columns
holding_stocks.at[i,'price']= float(stock_price['Close'])
df.cumsum(axis = 0)
df['column1']
df.drop('2018-1-1', axis = 0)
df.drop('column1', axis = 1)
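bytes and bytearray are the types used for handling binary (byte) data
bytes is immutable
bytearray is mutable
Both types hold a sequence of unsigned 8-bit integers (bytes) in the range 0 to 255
They provide many methods similar to those of str and also support slicing
Note that indexing a single byte returns an int object, while a slice returns a bytes object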
# byte.py
w=b"abc"
print(w[0])
print(type(w[0]))
print(w[:1])
print(type(w[:1]))
# byte_2.py
w=b"\x74\x61\x69\x70\x65\x69"
print(w)
a=bytes.fromhex("746169706569")
print(a)
print(type(a))
bytearr = bytearray(a)
print(bytearr)
print(type(bytearr))
bytearr.pop()
print(bytearr)
bytearr.pop()
print(bytearr)
bytearr.pop()
print(bytearr)
bytearr.append(110)
print(bytearr)
bytearr.append(97)
print(bytearr)
bytearr.append(ord("n"))
print(bytearr)
>>> y107m01.sort_values(by='mom')
>>> y107m01.sort_values(by='mom').tail(20)
# -*- coding: utf-8 -*-
import csv
with open('example.csv', 'r', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        print(row)
# -*- coding: utf-8 -*-
import csv
with open('example.csv', 'r', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f, ["日期", "成交股數", "成交金額", "成交筆數", "指數", "漲跌點數"]):
        print(row['指數'])
df.to_csv(file_name, encoding='utf-8', index=False)
df = pd.read_csv(file_name, encoding='utf-8', index_col=0)
From Stack Overflow: Print multiple arguments in Python
Here are some common ways of doing it:
1. Pass it as a tuple:
print("Total score for %s is %s" % (name, score))
2. Pass it as a dictionary:
print("Total score for %(n)s is %(s)s" % {'n': name, 's': score})
There's also new-style string formatting, which might be a little easier to read:
3. Use the new-style string formatting:
print("Total score for {} is {}".format(name, score))
4. Use the new-style string formatting with numbers (useful for reordering or printing the same one multiple times):
print("Total score for {0} is {1}".format(name, score))
5. Use the new-style string formatting with explicit names:
print("Total score for {n} is {s}".format(n=name, s=score))
The clearest two, in my opinion:
6. Pass the values as parameters and print will do it:
print("Total score for", name, "is", score)
7. If you don't want spaces to be inserted automatically by print in the above example, change the sep parameter:
print("Total score for ", name, " is ", score, sep='')
If you're using Python 2, you won't be able to use the last two because print isn't a function in Python 2. You can, however, import this behavior from __future__:
from __future__ import print_function
8. Use the new f-string formatting (Python 3.6 and later):
print(f'Total score for {name} is {score}')
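A quick check that the formatting styles listed above produce the same text. The values of name and score are made-up samples, and the f-string line assumes Python 3.6 or later.

```python
name = "Alice"    # hypothetical sample values
score = 95

variants = [
    "Total score for %s is %s" % (name, score),
    "Total score for %(n)s is %(s)s" % {'n': name, 's': score},
    "Total score for {} is {}".format(name, score),
    "Total score for {0} is {1}".format(name, score),
    "Total score for {n} is {s}".format(n=name, s=score),
    f"Total score for {name} is {score}",
]

all_equal = len(set(variants)) == 1
print(variants[0])
print(all_equal)
```

Since the output is identical, the choice between them is mostly about readability and the Python version you need to support.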
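Exception name Description
BaseException Base class for all exceptions
SystemExit The interpreter was asked to exit
KeyboardInterrupt The user interrupted execution (usually by typing ^C)
Exception Base class for ordinary errors
StopIteration The iterator has no more values
GeneratorExit Raised inside a generator to signal that it is being closed
StandardError Base class for all built-in standard exceptions (Python 2 only)
ArithmeticError Base class for all numeric calculation errors
FloatingPointError Floating-point calculation error
OverflowError A numeric operation exceeded the maximum limit
ZeroDivisionError Division (or modulo) by zero (all numeric types)
AssertionError An assert statement failed
AttributeError The object has no such attribute
EOFError No input available; the EOF marker was reached
EnvironmentError Base class for operating-system errors
IOError An input/output operation failed
OSError Operating-system error
WindowsError A system call failed
ImportError Failed to import a module or object
LookupError Base class for invalid lookups
IndexError No such index in the sequence
KeyError No such key in the mapping
MemoryError Out of memory (not fatal to the Python interpreter)
NameError The name has not been declared or initialized
UnboundLocalError Access to an uninitialized local variable
ReferenceError A weak reference tried to access an object that was already garbage collected
RuntimeError General runtime error
NotImplementedError The method has not been implemented yet
SyntaxError Python syntax error
IndentationError Indentation error
TabError Tabs and spaces mixed
SystemError General interpreter system error
TypeError Operation not valid for the type
ValueError An invalid argument value was passed
UnicodeError Unicode-related error
UnicodeDecodeError Error while decoding Unicode
UnicodeEncodeError Error while encoding Unicode
UnicodeTranslateError Error while translating Unicode
Warning Base class for warnings
DeprecationWarning Warning about deprecated features
FutureWarning Warning about constructs whose semantics will change in the future
OverflowWarning Old warning about automatic promotion to long integers
PendingDeprecationWarning Warning about features that will be deprecated
RuntimeWarning Warning about suspicious runtime behavior
SyntaxWarning Warning about suspicious syntax
UserWarning Warning generated by user code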
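The base classes in the table matter because an except clause also catches every subclass. A minimal sketch, where safe_divide and lookup are hypothetical helper names made up for this note:

```python
def safe_divide(a, b):
    """Return a / b, or None when the division is arithmetically invalid."""
    try:
        return a / b
    except ArithmeticError:      # also catches ZeroDivisionError, OverflowError
        return None


def lookup(container, key, default=None):
    """Index a list or dict, returning default for a missing index or key."""
    try:
        return container[key]
    except LookupError:          # base class of IndexError and KeyError
        return default


print(safe_divide(1, 0))
print(lookup([10, 20], 5, 'no index'), lookup({'a': 1}, 'b', 'no key'))
print(issubclass(ZeroDivisionError, ArithmeticError),
      issubclass(KeyError, LookupError))
```

Catching the narrowest class that covers the cases you expect is usually safer than catching Exception.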
list.method(object)
dict2 = dict1.method()
z = x.copy() # copies x into z; it is a method call, so not copy(x) and not x.copy
df.info() # info about df (a DataFrame)
module.method
sys.platform # an attribute, not a method: sys.platform() raises TypeError
df.method(file_name)
l = method(s)
l = len(s) # stores the length of string s in l
x.counter = 1 # x is an instance of MyClass; this sets an attribute, it is not x(counter)=1
list.met
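The call styles noted above, collected into one runnable sketch. MyClass here is an empty stand-in class and obj is a made-up instance name; both exist only to show attribute assignment.

```python
import sys


class MyClass:
    """Hypothetical class used only to illustrate attribute assignment."""
    pass


x = {'a': 1}
z = x.copy()            # method call on the object
z['a'] = 2              # the copy is independent of x

s = 'spam'
l = len(s)              # built-in function applied to the object

platform_is_str = isinstance(sys.platform, str)   # attribute access, no parentheses

obj = MyClass()
obj.counter = 1         # instance attribute created by assignment

print(x, z, l, platform_is_str, obj.counter)
```

Forgetting the parentheses on x.copy gives you the bound method object instead of a copy, which is an easy mistake to miss because it raises no error.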
# coding: utf-8
# In[1]:
import struct
packed = struct.pack('>i4sh', 7, b'spam', 8)
packed
# Out[1]:
b'\x00\x00\x00\x07spam\x00\x08'
# In[2]:
file = open('binary_data.bin', 'wb')
file.write(packed)
file.close()
# In[3]:
data = open('binary_data.bin', 'rb').read()
data
# Out[3]:
b'\x00\x00\x00\x07spam\x00\x08'
# In[4]:
list(data)
# Out[4]:
[0, 0, 0, 7, 115, 112, 97, 109, 0, 8]
# In[5]:
struct.unpack('>i4sh', data)
# Out[5]:
(7, b'spam', 8)
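A short sketch that unpacks the format string '>i4sh' used above: '>' selects big-endian byte order, 'i' is a 4-byte int, '4s' is a 4-byte string, and 'h' is a 2-byte short. The little-endian variant is added here only for comparison.

```python
import struct

fmt = '>i4sh'   # big-endian: 4-byte int, 4-byte string, 2-byte short

size = struct.calcsize(fmt)
packed = struct.pack(fmt, 7, b'spam', 8)
number, text, short = struct.unpack(fmt, packed)

# The same values in little-endian order give different bytes
little = struct.pack('<i4sh', 7, b'spam', 8)

print(size, len(packed))
print(number, text, short)
print(packed)
print(little)
```

The size is 4 + 4 + 2 = 10 bytes, which matches the ten integers shown by list(data) in the session above.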
venv :
distributed with Python since 3.3; simple and straightforward, with no features beyond the bare necessities
virtualenv :
the most significant difference is the wide variety of Python versions that virtualenv supports.
Supports convenient wrappers such as virtualenvwrapper (http://virtualenvwrapper.readthedocs.org/)
# pyvenv test_venv   (the pyvenv script was deprecated in Python 3.6; prefer the python3 -m venv form)
# . ./test_venv/bin/activate
(test_venv) #
# python3 -m venv test_venv
# . ./test_venv/bin/activate
(test_venv) #
Download get-pip.py file: http://bootstrap.pypa.io/get-pip.py
# python get-pip.py
Debian/Ubuntu:
# sudo apt-get install python3-dev
# sudo apt-get build-dep python3
Red Hat/Fedora:
# sudo yum install python3-devel
# sudo yum-builddep python3
import re
my_regex = re.compile("[0-9]+", re.I)
import regex
my_regex = regex.compile("[0-9]+", regex.I)
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
lookup = defaultdict(int)
my_counter = Counter()
from __future__ import division
def double(x):
    return x * 2
def apply_to_one(f):
    return f(1)
my_double = double
x = apply_to_one(my_double) # equals 2
my_double = double
x = apply_to_one(double(1)) # TypeError: double(1) is the int 2, which is not callable
def my_print(message="my default message"):
    print(message)
my_print("hello") # prints 'hello'
my_print() # prints 'my default message'
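A runnable sketch of the two ideas above: functions are ordinary objects that can be passed around, and parameters can have default values. describe and the lambda are additions made for illustration, and describe returns its message instead of printing so the result is easy to inspect.

```python
def double(x):
    return x * 2


def apply_to_one(f):
    """Call f with the argument 1."""
    return f(1)


by_name = apply_to_one(double)              # pass the function itself
by_lambda = apply_to_one(lambda x: x + 4)   # anonymous functions work too

# Passing the result of a call instead of the function fails,
# because an int cannot be called
try:
    apply_to_one(double(1))
    call_error = None
except TypeError as exc:
    call_error = str(exc)


def describe(message="my default message"):
    """Return the message so the default value is easy to inspect."""
    return message


print(by_name, by_lambda, call_error)
print(describe("hello"), '|', describe())
```

The key distinction is between double (the function object) and double(1) (the value it returns).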