Paper #1 - An introduction to Docker for reproducible research
Key Notes
- Docker provides a binary image in which all the software has already been installed, configured and tested
Technical Issues in Software Deployment
- Software Dependency Hell
- Imprecise documentation
Docker Features
- Operating system (OS) level virtualization based on Linux containers (LXC)
- Portable deployment of containers across platforms
- Component reuse
- Versioning of container images
- Docker containers share the Linux kernel with the host machine
- Sharing the Linux kernel makes Docker much more lightweight and higher performing than complete virtual machines
Components
- Dockerfiles provide a simple script (similar to a Makefile) that defines exactly how to build up the image
- Docker also supports Automated Builds through the Docker Hub (hub.docker.com).
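To make the Dockerfile idea concrete, here is a minimal sketch in Python that renders the text of a Dockerfile with a pinned base image and pinned package versions. The helper `render_dockerfile`, the script name `analysis.py`, and the chosen image and package versions are all hypothetical and only for illustration; they are not taken from the paper.

```python
def render_dockerfile(base_image, packages, script):
    """Return Dockerfile text that pins the base image and exact package versions."""
    lines = [
        f"FROM {base_image}",                                    # fixed starting image
        "RUN pip install --no-cache-dir " + " ".join(packages),  # exact versions only
        f"COPY {script} /work/{script}",                         # add the analysis code
        "WORKDIR /work",
        f'CMD ["python", "{script}"]',                           # default command
    ]
    return "\n".join(lines) + "\n"


# Hypothetical inputs: any image tag and version pins would work the same way.
dockerfile = render_dockerfile(
    "python:3.11-slim",
    ["numpy==1.26.4", "pandas==2.2.2"],
    "analysis.py",
)
print(dockerfile)
```

Because every step is written down as an instruction, anyone who builds from the same file should end up with the same software environment, which is the point the paper makes about reproducibility.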
Paper #2 - The Relational Data Borg is Learning
Key Notes
- RDBMS in Data Science
- Widespread need for efficient data processing
- Processing that goes beyond classical database workloads
- According to the survey, 65% of the data is relational. Retail has the most structured data :)
Key Features for Retail Stores
- Items in stores
- Store information
- Demographics for areas around the stores
- Inventory units for items in stores on particular dates
- Weather Information
Queries based on Filters
- Feature extraction query that joins these relations on keys for dates, locations, zipcode, and items
- LMFAO (Layered Multiple Functional Aggregates Optimisation)
- PCA over relational data
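The feature extraction query above can be sketched with Python's built-in `sqlite3` module. The table names, column names, and sample rows below are made up for illustration and are not the schema used in the paper; the point is only that the training dataset is the result of joining the relations on their shared keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical retail schema, one table per relation listed above
CREATE TABLE inventory (store INTEGER, item INTEGER, date TEXT, units INTEGER);
CREATE TABLE items (item INTEGER PRIMARY KEY, price REAL);
CREATE TABLE stores (store INTEGER PRIMARY KEY, zipcode TEXT);
CREATE TABLE demographics (zipcode TEXT PRIMARY KEY, population INTEGER);
CREATE TABLE weather (store INTEGER, date TEXT, rain INTEGER);

INSERT INTO inventory VALUES
  (1, 10, '2024-01-01', 5), (1, 11, '2024-01-01', 2), (2, 10, '2024-01-02', 7);
INSERT INTO items VALUES (10, 2.5), (11, 4.0);
INSERT INTO stores VALUES (1, '10001'), (2, '94105');
INSERT INTO demographics VALUES ('10001', 21000), ('94105', 15000);
INSERT INTO weather VALUES (1, '2024-01-01', 0), (2, '2024-01-02', 1);
""")

# Join all relations on item, store, zipcode, and date keys to get one feature row
# per inventory record.
rows = conn.execute("""
SELECT i.store, i.item, i.date, i.units, it.price, d.population, w.rain
FROM inventory i
JOIN items it ON it.item = i.item
JOIN stores s ON s.store = i.store
JOIN demographics d ON d.zipcode = s.zipcode
JOIN weather w ON w.store = i.store AND w.date = i.date
ORDER BY i.store, i.item, i.date
""").fetchall()

for row in rows:
    print(row)
```

In a real dataset this join result can be much larger than the input tables, which is what motivates evaluating aggregates without materialising it.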
Insights
- Aggregates of interest: running aggregates over days, weeks, and months; min, max, average, and median; and aggregates over many-to-many relationships and categorical attributes
ML Tasks
- One-hot encoded categorical attributes
- New database workload motivated by a machine learning application
- Similar aggregates are derived for k-means clustering
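As a small illustration of why categorical attributes lead to a new kind of aggregate workload, the sketch below compares two ways of computing the same numbers: summing over explicit one-hot columns, and a group-by sum over the category. The rows and the attribute names (`category`, `price`) are hypothetical, and this is a toy comparison rather than the method from the paper.

```python
from collections import defaultdict

# Hypothetical rows of joined data: (item category, price)
rows = [("food", 2.0), ("toys", 5.0), ("food", 3.0), ("toys", 1.0), ("books", 4.0)]

# Dense route: one-hot encode the category, then sum indicator * price per column.
categories = sorted({cat for cat, _ in rows})
dense = {
    c: sum((1.0 if cat == c else 0.0) * price for cat, price in rows)
    for c in categories
}

# Sparse route: the equivalent of
#   SELECT category, SUM(price) FROM rows GROUP BY category
grouped = defaultdict(float)
for cat, price in rows:
    grouped[cat] += price

print(dense)
print(dict(grouped))
```

Both routes give the same per-category sums, but the group-by version never stores the zero entries of the one-hot encoding.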
IFAQ (Iterative Functional Aggregate Queries) Framework
- IFAQ can automatically synthesise and optimise aggregates from ML+DB workloads
Key Insights / Lessons
- Turn the learning problem into a database problem.
- Exploit the problem structure to lower the complexity.
- Generate optimised code to lower the constant factors.
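The first two lessons can be sketched on a toy example, again with `sqlite3`. Assume a hypothetical one-parameter model `units ≈ theta * price`; its least-squares solution only needs two aggregates, `SUM(price * price)` and `SUM(price * units)`. The sketch computes them once over the full join and once by pre-aggregating `inventory` per item before joining. The tables and values are made up, and this is a simplified illustration of the idea, not the LMFAO or IFAQ implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables and values
CREATE TABLE items (item INTEGER PRIMARY KEY, price REAL);
CREATE TABLE inventory (store INTEGER, item INTEGER, units REAL);
INSERT INTO items VALUES (10, 1.0), (11, 2.0);
INSERT INTO inventory VALUES (1, 10, 2.0), (2, 10, 4.0), (1, 11, 5.0), (2, 11, 7.0);
""")

# Learning problem as a database problem: aggregates over the join result.
sxx, sxy = conn.execute("""
SELECT SUM(it.price * it.price), SUM(it.price * inv.units)
FROM inventory inv JOIN items it ON it.item = inv.item
""").fetchone()

# Exploiting structure: aggregate inventory per item first, then join the
# smaller partial result with items.
sxx_pushed, sxy_pushed = conn.execute("""
SELECT SUM(it.price * it.price * p.cnt), SUM(it.price * p.units_sum)
FROM (SELECT item, COUNT(*) AS cnt, SUM(units) AS units_sum
      FROM inventory GROUP BY item) p
JOIN items it ON it.item = p.item
""").fetchone()

# Closed-form least-squares solution for the one-parameter model.
theta = sxy / sxx
print(sxx, sxy, sxx_pushed, sxy_pushed, theta)
```

Both queries return the same aggregates, but the second one never enumerates every joined row, which is where the complexity savings come from on large inputs.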
There is no Data Science without Database - RDBMS :) :)
Happy Learning!!!