Paddle
2017/02/10 翁子皓 何憶琳
Outline
- Installation
- Quick start
- Run on GPU / Cluster
Installation
- cmake >= 3.0 (the docs say >= 2.8?)
- Python 2.7
- protobuf
- numpy
NOTE: I use Anaconda for Python and these packages
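With Anaconda, the Python dependencies can be installed in one shot; a sketch (package names are my assumption):
conda install numpy protobuf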
Installation
git clone https://github.com/baidu/Paddle paddle
cd paddle
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=<path> [other options...]
make -j && make install
export PATH=<path>:$PATH
paddle # run paddle, it will automatically install missing python packages
Quick Start
- There's a quick start tutorial in the official documentation
- It shows how to configure and run Paddle for training and predicting
Steps
- prepare training / testing data
- define input (dataprovider.py)
- define model (trainer_config.py)
- define output
Input File
- Create a file that contains a list of our data filenames
- Paddle will loop through the list and read each file
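For example, train.list could simply contain (file names are hypothetical):
train-part-000.txt
train-part-001.txt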
dataprovider
- Define input type
- Read file and preprocess data
dataprovider
from paddle.trainer.PyDataProvider2 import *

# initializer: define the input types and additional meta-data
# kwargs: passed from trainer_config.py
def initializer(settings, **kwargs):
    settings.input_types = {
        'label': integer_value(10),
        'word': dense_vector(100)
    }

# process: use yield to return one data instance at a time
# file_name: a name from train_list
@provider(init_hook=initializer, cache=CacheType.CACHE_PASS_IN_MEM)
def process(settings, file_name):
    with open(file_name, 'r') as f:
        for line in f:
            label, word = line.split(':')
            label = int(label)
            # dense_vector expects floats, so convert each field
            word = [float(x) for x in word.split(',')]
            yield {'label': label, 'word': word}
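To see the expected file format concretely, here is a minimal sketch (hypothetical script, matching the parser above) that generates a data file:

# make_data.py -- writes "label:v1,v2,..." lines
import random

with open('train-part-000.txt', 'w') as f:
    for _ in range(1000):
        label = random.randint(0, 9)                 # matches integer_value(10)
        vec = [random.random() for _ in range(100)]  # matches dense_vector(100)
        f.write('%d:%s\n' % (label, ','.join('%.4f' % v for v in vec)))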
trainer_config
- Configure parameters
- Define Layers
trainer_config
define_py_data_sources2(
    train_list=trn,
    test_list=tst,
    module="dataprovider",
    obj=process)

settings(
    batch_size=batch_size,
    learning_rate=2e-3,
    learning_method=AdamOptimizer(),
    regularization=L2Regularization(8e-4),
    gradient_clipping_threshold=25)
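Where do the initializer's **kwargs come from? define_py_data_sources2 also takes an args dict that is forwarded to the provider's init_hook; a sketch (word_dict and its key are just examples):

define_py_data_sources2(
    train_list=trn,
    test_list=tst,
    module="dataprovider",
    obj=process,
    args={'dictionary': word_dict})  # arrives as kwargs['dictionary'] in initializer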
trainer_config
data = data_layer(name="word", size=len(word_dict))
# Define a fully connected layer with softmax activation.
output = fc_layer(input=data, size=2, act=SoftmaxActivation())

if not is_predict:
    # For training, we need the label and the cost.
    # The label layer defines the category id of each example;
    # its size is the number of labels.
    label = data_layer(name="label", size=2)
    # Define cross-entropy classification loss and error.
    cls = classification_cost(input=output, label=label)
    outputs(cls)
else:
    # For prediction, no label is needed; output the
    # classification result and the class probabilities.
    maxid = maxid_layer(output)
    outputs([maxid, output])
Run Paddle
cfg=trainer_config.py
paddle train \
--config=$cfg \
--save_dir=./output \
--trainer_count=4 \
--log_period=1000 \
--dot_period=10 \
--num_passes=10 \
--use_gpu=false \
--show_parameter_stats_period=3000 \
2>&1 | tee 'train.log'
CLI arguments
- --config: network architecture path.
- --save_dir: model save directory.
- --log_period: the logging period per batch.
- --num_passes: number of training passes; one pass goes over the whole training dataset once.
- --config_args: other configuration arguments, passed through to the config script (see the example below).
- --init_model_path: the path of the initial model parameters.
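For example, the is_predict switch used in trainer_config above can be fed in through --config_args and read inside the config with get_config_arg, as in the official quick start:

# in trainer_config.py
is_predict = get_config_arg('is_predict', bool, False)
# on the command line:
# paddle train --config=trainer_config.py --config_args=is_predict=1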
GPU / Cluster training
To train on GPU, we need to compile with the GPU option
cmake .. -DWITH_GPU=ON
and run with the GPU flag
paddle train --use_gpu=true
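As far as I can tell, with --use_gpu=true the --trainer_count flag selects how many GPUs to use, e.g.
paddle train --use_gpu=true --trainer_count=2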
GPU / Cluster training
- paddle/scripts/cluster_train
- conf.py
- paddle.py
- run.sh
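conf.py mainly lists the cluster machines; a minimal sketch (host addresses are hypothetical):

HOSTS = [
    "root@192.168.100.17",
    "root@192.168.100.18",
]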
TODO
- Study models (RNN, LSTM, etc.)
- Cluster Training
- Scalability
- PCA