Social and Political Data Science: Introduction

Knowledge Mining

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Machine Learning Software

Popular software for machine learning and deep learning

  • R
    • randomForest
    • caret
    • keras
    • xgboost
    • glmnet
  • Python
    • scikit-learn
    • TensorFlow (Google)
    • Keras
    • XGBoost
    • PyTorch

R: caret

  • Why popular: Provides a consistent interface to a wide range of machine learning algorithms and simplifies building, training, and evaluating models.
  • Important functionalities: Data preprocessing, feature selection, model training, hyperparameter tuning, and performance evaluation.

R: randomForest

  • Why popular: Implements random forests, a widely used ensemble learning method that is easy to use and performs well on a wide range of problems.
  • Important functionalities: Classification, regression, variable importance estimation, and proximity analysis using random forests.
  • Authors: Leo Breiman, Adele Cutler, Andy Liaw, Matthew Wiener
  • Citation: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News, 2(3), 18-22. URL: https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf.
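The randomForest package itself is called from R. As a rough Python analogue of the workflow described for it (classification plus variable importance), the sketch below uses scikit-learn's RandomForestClassifier on the iris data bundled with scikit-learn. The dataset, the seed, and the number of trees are illustrative assumptions, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: the iris dataset bundled with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Fit a forest of 200 trees (an arbitrary choice for this sketch).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Share of correctly classified held-out observations.
rf_accuracy = forest.score(X_test, y_test)

# Impurity-based variable importance: one value per feature, summing to 1.
rf_importance = dict(zip(iris.feature_names, forest.feature_importances_))

print(f"test accuracy: {rf_accuracy:.3f}")
for name, value in sorted(rf_importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Impurity-based importance is only one of several importance measures, so the ranking it gives should be read as a first look rather than a definitive result.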

R: xgboost

  • Why popular: Implements an efficient and powerful gradient boosting algorithm that performs strongly on many machine learning tasks and has been widely adopted in data science competitions.
  • Important functionalities: Gradient boosting for classification and regression, handling of missing data, parallel and distributed computing support.
  • Author: Tianqi Chen
  • Citation: Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. URL: http://dx.doi.org/10.1145/2939672.2939785.

R: glmnet

  • Why popular: Implements generalized linear models with regularization (LASSO, Ridge, and elastic-net penalties), which helps prevent overfitting and improves model generalization.
  • Important functionalities: Linear regression, logistic regression, and Cox regression with LASSO and Ridge regularization, cross-validation for hyperparameter tuning.
  • Authors: Jerome Friedman, Trevor Hastie, Rob Tibshirani
  • Citation: Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL: http://www.jstatsoft.org/v33/i01/.
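The paper cited for glmnet fits these penalized models by coordinate descent. The NumPy sketch below illustrates that idea for the LASSO with a continuous outcome only; it is not glmnet's implementation. The function names, the simulated data, and the penalty values are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    return np.sign(z) * max(abs(z) - gamma, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y - X @ beta
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes its own fit.
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta


# Simulated data: only the first and third predictors matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_beta = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(200)

# A short path of penalty values, from heavy to light shrinkage.
lasso_path = {lam: lasso_coordinate_descent(X, y, lam) for lam in (4.0, 0.5, 0.1)}
for lam, beta in lasso_path.items():
    print(lam, np.round(beta, 3))
```

Larger penalties push more coefficients to exactly zero, which is why the LASSO doubles as a variable selection tool.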

R: keras (Python-based)

  • Why popular: Provides an R interface to the Keras deep learning library, enabling users to define and train deep learning models in R using TensorFlow as the backend.
  • Important functionalities: Support for various types of neural networks (feedforward, convolutional, and recurrent networks), preprocessing, data augmentation, and real-time visualization of training progress.
  • Authors: JJ Allaire, François Chollet
  • Citation: Allaire, J.J., & Chollet, F. (2019). keras: R Interface to 'Keras'. R package version 2.2.4.1. URL: https://CRAN.R-project.org/package=keras.

Python: scikit-learn

  • Why popular: Provides a simple and consistent interface to a wide range of machine learning algorithms, making it easy to build, train, and evaluate models.
  • Important functionalities: Preprocessing, feature selection, dimensionality reduction, model training, hyperparameter tuning, and performance evaluation for various classification, regression, and clustering algorithms.
  • Authors: Scikit-learn developers
  • Citation: Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830. URL: https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html.
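The functionalities listed for scikit-learn can be chained in a single workflow. The sketch below combines preprocessing, feature selection, model training, hyperparameter tuning, and evaluation. The dataset, the choice of logistic regression, the number of selected features, and the grid of C values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: a binary classification dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing, feature selection, and the model share one fit/predict interface.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning: 5-fold cross-validation over the regularization strength C.
search = GridSearchCV(pipeline, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Performance evaluation on data not used for fitting or tuning.
sk_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(sk_accuracy, 3))
```

Wrapping the steps in a Pipeline means the scaler and the feature selector are refit inside each cross-validation fold, so the tuning step does not see the held-out fold.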

Python: TensorFlow

  • Why popular: Developed by Google, TensorFlow is a powerful and flexible open-source library for numerical computation and machine learning, with a particular focus on deep learning.
  • Important functionalities: Support for various neural network architectures, distributed computing, and GPU acceleration, as well as an extensive ecosystem of tools and libraries.
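At its core, TensorFlow performs numerical computation with automatic differentiation. The sketch below fits a straight line by gradient descent using tf.GradientTape. The toy data, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import tensorflow as tf

# Toy data generated from y = 3x + 1 (an assumption for this sketch).
x = tf.constant([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * x + 1.0

# Trainable parameters, both starting at zero.
w = tf.Variable(0.0)
b = tf.Variable(0.0)
learning_rate = 0.05

for _ in range(500):
    # Record the forward computation so gradients can be derived from it.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(w * x + b - y))
    grad_w, grad_b = tape.gradient(loss, [w, b])
    # Plain gradient descent step.
    w.assign_sub(learning_rate * grad_w)
    b.assign_sub(learning_rate * grad_b)

print(float(w.numpy()), float(b.numpy()))
```

The same code runs on a GPU when one is available, since TensorFlow places the operations on the device automatically.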



Python: Keras

  • Why popular: A high-level deep learning library built on top of TensorFlow, Keras provides an easy-to-use interface for defining and training deep learning models.
  • Important functionalities: Support for various types of neural networks (feedforward, convolutional, and recurrent networks), preprocessing, data augmentation, and real-time visualization of training progress.
  • Author: François Chollet
  • Citation: Chollet, F. et al. (2015). Keras. URL: https://keras.io.
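A minimal sketch of the Keras workflow (define, compile, fit, predict) for a small feedforward network, assuming the Keras API that ships with TensorFlow. The simulated data, the layer sizes, and the number of epochs are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras

# Simulated binary outcome that depends on the first two of four inputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# Define a small feedforward network layer by layer.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Compile with an optimizer, a loss, and a metric to track.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train quietly; the returned history holds the per-epoch loss and accuracy.
history = model.fit(X, y, epochs=20, batch_size=32, verbose=0)

# Predicted probabilities for the first five observations.
probs = model.predict(X[:5], verbose=0)
print(probs.shape, history.history["loss"][-1])
```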

Python: XGBoost

  • Why popular: Like its R counterpart, the Python package implements an efficient and powerful gradient boosting algorithm that performs strongly on many machine learning tasks.
  • Important functionalities: Gradient boosting for classification and regression, handling of missing data, parallel and distributed computing support.
  • Author: Tianqi Chen
  • Citation: Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. URL: http://dx.doi.org/10.1145/2939672.2939785.

Python: PyTorch

  • Why popular: Developed by Facebook (now Meta), PyTorch is a popular open-source deep learning library known for its flexibility, ease of use, and dynamic computation graph capabilities.
  • Important functionalities: Support for various neural network architectures, distributed computing, GPU acceleration, and a vast ecosystem of tools and libraries.
  • Authors: PyTorch developers, Facebook AI Research
  • Citation: Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems, 32, 8024-8035. URL: https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
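PyTorch builds its computation graph as the code runs, so a training loop is ordinary Python. The sketch below fits a linear model with autograd. The simulated data, the learning rate, and the number of steps are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Simulated data from a known linear relationship (an assumption for this sketch).
X = torch.randn(128, 3)
true_w = torch.tensor([[1.5], [-2.0], [0.5]])
y = X @ true_w + 0.3

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for step in range(300):
    optimizer.zero_grad()
    # The graph is rebuilt on every forward pass.
    loss = loss_fn(model(X), y)
    # Backpropagate through the graph recorded for this step.
    loss.backward()
    optimizer.step()

print(model.weight.detach(), model.bias.detach())
```

Because the graph is rebuilt on each pass, the forward computation may include ordinary Python control flow such as loops and conditionals.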
