Machine Learning

  • ML library (spark.ml): DataFrame-based API
  • MLLib (spark.mllib): older RDD-based API (in maintenance mode since Spark 2.0, often described as deprecated)

Capabilities

  • Feature Extraction
      • TF (Term Frequency)
      • IDF (Inverse Document Frequency)
      • Together (TF-IDF) useful for search
  • Basic Statistics
      • Chi-squared test
      • Pearson correlation
      • Spearman correlation
      • Min, max, mean, variance
  • Linear regression
      • Fit a line to a set of data to predict a numeric value
  • Logistic regression
      • Fit an S-shaped (sigmoid) curve to a set of data to predict class probabilities
  • Support Vector Machines (SVM)
      • Find the maximum-margin boundary that separates classes
  • Naive Bayes classifier
      • Classification
  • Decision trees
  • K-Means clustering
      • Cluster data based on attributes
      • Unsupervised learning
  • Dimensionality reduction
      • Principal Component Analysis (PCA)
      • Singular Value Decomposition (SVD)
  • Recommendations
      • Alternating Least Squares (ALS)
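
The TF and IDF items above can be illustrated with a minimal pure-Python sketch that has no Spark dependency. The toy corpus and the function names are hypothetical, and the smoothed IDF form log((N + 1) / (df + 1)) is an assumption chosen to match the formula given in Spark's feature-extraction documentation; other libraries use slightly different variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each document is already tokenized.
docs = [
    ["spark", "ml", "uses", "dataframes"],
    ["spark", "mllib", "uses", "rdds"],
    ["search", "engines", "rank", "documents"],
]


def term_frequencies(tokens):
    """TF: raw count of each term within one document."""
    return Counter(tokens)


def inverse_document_frequencies(corpus):
    """Smoothed IDF: log((N + 1) / (df + 1)).

    N is the number of documents, df the number of documents containing
    the term. The formula is an assumption (see the lead-in).
    """
    n_docs = len(corpus)
    # Count each term once per document, regardless of repeats.
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }


def tf_idf(corpus):
    """Return one {term: weight} dict per document."""
    idf = inverse_document_frequencies(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(tokens).items()}
        for tokens in corpus
    ]


weights = tf_idf(docs)

# Terms shared by several documents get a lower weight than rare ones.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.3f}")
```

Terms that appear in many documents ("spark", "uses") end up with lower weights than terms unique to one document ("ml", "dataframes"), which is why TF-IDF helps rank documents for search.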