"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

November 06, 2020

Next Paper Read - Docker, RDBMS to ML

Paper #1 - An introduction to Docker for reproducible research

Key Notes

Docker provides a binary image in which all the software has already been installed, configured and tested

Technical Issues in Software Deployment

  • Software Dependency Hell
  • Imprecise documentation

Docker Features

  • Linux container (LXC) based operating system (OS) level virtualization
  • Portable deployment of containers across platforms, and component reuse
  • Versioning of container images
  • Docker containers share the Linux kernel with the host machine
  • Sharing the Linux kernel makes Docker much more lightweight and higher performing than complete virtual machines

Components

  • Dockerfiles provide a simple script (similar to a Makefile) that defines exactly how to build up the image
  • Docker also supports Automated Builds through the Docker Hub (hub.docker.com).

Paper #2 - The Relational Data Borg is Learning

Key Notes

  • RDBMS in Data Science
  • Widespread need for efficient data processing
  • Process beyond classical database workloads
  • From the survey cited in the paper, 65% of data is relational. Retail has the most structured data :)

Automated Feature Learning Approach

Key Features for Retail Stores

  • Items in stores
  • Store information
  • Demographics for areas around the stores
  • Inventory units for items in stores on particular dates
  • Weather Information

Queries based on Filters

  • Feature extraction query that joins these relations on keys for dates, locations, zipcode, and items
  • LMFAO (Layered Multiple Functional Aggregates Optimisation) 
  • PCA over relational data
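A minimal sketch of such a feature extraction query, using Python's built-in sqlite3 module. The table names, column names and sample rows are made up for illustration and are not taken from the paper's dataset.

```python
import sqlite3

# Hypothetical toy schema mirroring the retail relations listed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items(item INTEGER PRIMARY KEY, price REAL);
CREATE TABLE stores(store INTEGER PRIMARY KEY, zipcode TEXT);
CREATE TABLE demographics(zipcode TEXT PRIMARY KEY, population INTEGER);
CREATE TABLE weather(store INTEGER, date TEXT, rain INTEGER);
CREATE TABLE inventory(store INTEGER, item INTEGER, date TEXT, units INTEGER);

INSERT INTO items VALUES (1, 2.5), (2, 4.0);
INSERT INTO stores VALUES (10, '94105'), (20, '10001');
INSERT INTO demographics VALUES ('94105', 5000), ('10001', 8000);
INSERT INTO weather VALUES (10, '2020-11-01', 0), (20, '2020-11-01', 1);
INSERT INTO inventory VALUES
  (10, 1, '2020-11-01', 30),
  (10, 2, '2020-11-01', 12),
  (20, 1, '2020-11-01', 7);
""")

# Join the relations on their keys (item, store, zipcode, date)
# to produce one training row per inventory record.
FEATURE_QUERY = """
SELECT inv.units, it.price, d.population, w.rain
FROM inventory AS inv
JOIN items AS it ON it.item = inv.item
JOIN stores AS s ON s.store = inv.store
JOIN demographics AS d ON d.zipcode = s.zipcode
JOIN weather AS w ON w.store = inv.store AND w.date = inv.date
ORDER BY inv.store, inv.item
"""
training_rows = conn.execute(FEATURE_QUERY).fetchall()
for row in training_rows:
    print(row)
```

Each output row is (units, price, population, rain): the label followed by features pulled in from the other relations.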

Insights

  • Running aggregates over days, weeks, months; min, max, average, median aggregates, or aggregates over many-to-many relationships and categorical attributes
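As a rough illustration of these aggregates, here is a small sqlite3 sketch that computes min, max and average units per month. The inventory table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory(date TEXT, units INTEGER);
INSERT INTO inventory VALUES
  ('2020-10-30', 4),
  ('2020-11-01', 10),
  ('2020-11-02', 6);
""")

# Group by calendar month and compute several aggregates in one pass.
monthly = conn.execute("""
SELECT strftime('%Y-%m', date) AS month,
       MIN(units), MAX(units), AVG(units)
FROM inventory
GROUP BY month
ORDER BY month
""").fetchall()

for row in monthly:
    print(row)
```

Swapping the strftime format string gives daily or weekly buckets in the same way.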

ML Tasks

  • One-hot encoded categorical attributes
  • New database workload motivated by a machine learning application
  • Similar aggregates are derived for k-means clustering
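One way to see why one-hot encoded categorical attributes fit a database workload: summing a numeric column against a one-hot vector gives the same numbers as a GROUP BY on the category. The sketch below checks this on a made-up sales table; it is a toy illustration, not the paper's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(color TEXT, units INTEGER);
INSERT INTO sales VALUES ('red', 3), ('blue', 5), ('red', 4), ('green', 1);
""")
rows = conn.execute("SELECT color, units FROM sales").fetchall()

# Explicit route: build a one-hot vector per row, then sum units * one_hot.
categories = sorted({color for color, _ in rows})
index = {c: i for i, c in enumerate(categories)}
onehot_units_sum = [0] * len(categories)
for color, units in rows:
    one_hot = [0] * len(categories)
    one_hot[index[color]] = 1
    for i, bit in enumerate(one_hot):
        onehot_units_sum[i] += bit * units

# Database route: the same sums as a single group-by aggregate,
# without ever materialising the one-hot vectors.
grouped = dict(
    conn.execute("SELECT color, SUM(units) FROM sales GROUP BY color").fetchall()
)

print(categories, onehot_units_sum)
print(grouped)
```

The group-by result stays sparse: only categories that actually occur produce a row.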

IFAQ (Iterative Functional Aggregate Queries) Framework

  • IFAQ can automatically synthesise and optimise aggregates from ML+DB workloads


Key Insights / Lessons

  • Turn the learning problem into a database problem.
  • Exploit the problem structure to lower the complexity.
  • Generate optimised code to lower the constant factors.
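A small sketch of the first lesson, assuming a one-feature least-squares model: the database computes a handful of SUM/COUNT aggregates over the join, and the model parameters follow from those aggregates alone. The tables, columns and data are hypothetical, and this plain SQL version only illustrates the idea; it is not how LMFAO or IFAQ are implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items(item INTEGER PRIMARY KEY, price REAL);
CREATE TABLE inventory(item INTEGER, units REAL);
INSERT INTO items VALUES (1, 2.0), (2, 4.0), (3, 5.0);
INSERT INTO inventory VALUES (1, 16.0), (2, 12.0), (3, 10.0), (2, 12.0);
""")

# All the statistics needed for units ~ intercept + slope * price,
# computed as aggregates over the join (x = price, y = units).
n, sx, sy, sxx, sxy = conn.execute("""
SELECT COUNT(*),
       SUM(it.price),
       SUM(inv.units),
       SUM(it.price * it.price),
       SUM(it.price * inv.units)
FROM inventory AS inv
JOIN items AS it ON it.item = inv.item
""").fetchone()

# Closed-form least-squares solution from the aggregates only.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

print(slope, intercept)
```

The training rows never leave the database; only five numbers do, regardless of how large the join is.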

There is no Data Science without Database - RDBMS :) :)

Happy Learning!!!
