Pytube error - RegexMatchError: get_throttling_function_name: could not find match for multiple
June 28, 2023
Better communication and connection
- Move slowly, talk slowly
- Make the other person feel safe
- A slow, low tone works better than a loud one
- Comfort enables more thinking
- Panic restricts access to memory
- A mindset of openness and courage, not manipulation
- More Questions than opinions
- Eyes open for everything invisible
Designing Async / Parallelized tasks
Earlier roles / Feeds Processing
Had some critical tasks getting supplier feeds/data
- File Copy
- File load
- Run Validations
- Load Data
DB supporting it
- Schema
- Jobs / Options / Run status / Retry
Technically from a scale point of view
- File watcher
- File lock
- Process jobs
Validations
- Bunch of procedures
Key design tweaks
- Parallelize operations
- Data copy as objects/temp tables
- Parallel file copy
- Support Multiple threads
- Avoid data blocking/updates
ML Context
- For a Parallel model creation
- Configuration
- Submit Job
- 5 timeseries category datasets / Global models in each category
- 10 jobs, 5 category dataset models
- 10 different models
Prepare data Job
- Fetch initial data
- Process missing variables
- Data imputation
- Save Results
Execute Job
- Read prepared data
- Fetch algo to run
- Train algo
- Put training accuracy
- Save model
Predict Job
- Load saved model
- Run predictions
- Save in DB
Design Ideas
- Atomic functions
- Job monitor / independent execution units
- Horizontal scaling in Kubernetes App
- Common DB and multiple execution parallel functions
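The design ideas above (atomic functions, independent execution units, a common store) can be sketched in plain Python. This is only an illustration: the category and algorithm names, the 5 x 2 split that gives 10 jobs, and the in-memory `db` dict standing in for the common DB are all assumptions, not the real setup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names: placeholders for the real datasets and algorithms.
CATEGORIES = ["cat_a", "cat_b", "cat_c", "cat_d", "cat_e"]
ALGORITHMS = ["algo_1", "algo_2"]

def prepare_job(category):
    """Atomic step: fetch data and impute missing values with the mean."""
    raw = [1.0, None, 3.0]  # placeholder for the fetched data
    known = [v for v in raw if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in raw]

def execute_job(category, algo, data):
    """Atomic step: 'train' a model; here the model is just the mean."""
    return {"category": category, "algo": algo, "model": sum(data) / len(data)}

def run_unit(category, algo):
    """One independent execution unit: prepare the data, then train."""
    return execute_job(category, algo, prepare_job(category))

def run_all(max_workers=4):
    """Run every (category, algorithm) unit in parallel and collect results."""
    units = [(c, a) for c in CATEGORIES for a in ALGORITHMS]  # 5 x 2 = 10 jobs
    db = {}  # stand-in for the common DB
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {(c, a): pool.submit(run_unit, c, a) for c, a in units}
        for key, future in futures.items():
            db[key] = future.result()
    return db

if __name__ == "__main__":
    print(len(run_all()))
```

Because each unit only reads its inputs and returns a result, the same functions could be moved behind a job queue or separate pods without changing their logic.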
Python Options
FastAPI - Uvicorn also has an option to start and run several worker processes.
uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4
Flask API - Link
from flask import Flask
app = Flask(__name__)
if __name__ == '__main__':
    app.run(threaded=True)  # serve each request in its own thread
Or from the command line:
flask run --with-threads
More Reads
System Design — Design a distributed job scheduler (Keep It Simple Stupid Interview series)
Orchestrating a Background Job Workflow in Celery for Python
System Design: Designing a distributed Job Scheduler | Many interesting concepts to learn
Examples
Keep Exploring!!!
June 27, 2023
Data science = Data + Domain + AI + Commonsense
I often re-read the basics. Over the years I have worked on Windows 98 testing, C/C++ adapters, Nestle production support, application support, supply chain QA / performance / OLTP development, SQL development, BI development, setting up teams, warranty, refurbishment, API / supply chain, website A/B testing, and on-call support. Then came retail product team setup / forecasting / scaling, followed by a long 2-year learning curve of paid lectures in back-to-basics mode. More learning started after that. Getting a break needs a lot of freelance / consulting / training / applied learning. The past 3 years have been very focused on learning / projects / production deployments.
Now when I teach the flow/work, there are different areas overall to understand products/domain/use cases
- Stats, probability, A/B tests, LR
- ML world - Decision trees, SVM, Logistic regression, Random forests
- Some variations of it for anomaly detection, decision tree regressors, SVM regressor, loss functions, conditional random fields
- The deep learning side of CNN, RNN, LSTM, Transformers
- NLP side of tokens, embeddings, different architectures up to the latest state-of-the-art BERT, ChatGPT, zero-shot and few-shot approaches
- Forecast track with different models both regression/time series approaches
- Recommendation track with basics to advanced hybrid models, user-user, item-item, hybrid, seasonal, and segment based
- Vision side of custom object, classification, transfer learning, segmentation, applied use cases
- World of genAI for text/vision
- Apart from this the production/deployment architecture
Sometimes I wonder how many things we can teach someone to switch to AI / ML. Always leverage your strengths in domain/data knowledge. The scope of the field is vast and increasing day by day. It is hard to know everything; the end goal is to add value to the business / use it to fix current challenges. Balance both learning and implementation. Otherwise it will be a long journey of just learning forever.
Always blend your ideas in DATA + DOMAIN + AI + Business Value to find the right use cases.
Keep Exploring!!!
10 Reasons why Gen AI will Work vs Fail
Let's list 10 reasons why GenAI will succeed
- Saves time / provides ideas
- Creates powerful ideas with richer inspiration, adding emotion and logic through words/statements
- Can provide draft versions that reduce copywriter / content writer effort
- Summarize the given range with critical points
- Acts like assistant / chatbot
- Visual inspiration with images
- Generate different styles/notes/marketing/promo content
- It can create content based on prompts
- Human-like responses/content with proper grammar/flow
- Shares opinions backed by reasons
Let's list 10 reasons why GenAI will Fail
- Cannot scale in every field. Cannot be considered for all domains
- Cannot be factually right always
- The gap between draft content and final content still has to be filled
- Content generated needs human validation
- Without relevant knowledge, we may not be able to spot issues
- Needs iterations to prepare the high-quality output
- Balance capabilities vs shortcomings based on the use case
- Vision is a long way to go
- Empathy is mostly words drawn from the trained distribution; in essence, it is all human-fed content
- True reasoning / wide subject knowledge is really limited
June 26, 2023
Background removal with Azure API
In Azure Portal
Under cognitive services you have below options
Fetch Endpoint and Keys after deployment
Example code in link
Keep Exploring!!!
June 24, 2023
My Langchain Notes - Day 1
- LLMs are good at conditional generation
- P(next token | prompt)
- LLMs do not store state
- Token size is key for better answers
- Langchain - Build apps with LLM
The different types of prompts - zero-shot with limited prompts, reasoning prompts that preserve state, simplified step-by-step prompts
A prompt template can contain:
- instructions to the language model,
- a set of few shot examples to help the language model generate a better response,
- a question to the language model.
A few shot prompt template can be constructed from either a set of examples, or from an Example Selector object.
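The three parts of a prompt template listed above can be put together with plain string formatting. This is a minimal sketch in plain Python, not the LangChain API; the function name, the `Q:` / `A:` layout, and the antonym examples are assumptions made for illustration.

```python
def build_few_shot_prompt(instructions, examples, question):
    """Assemble instructions, few-shot examples, and the final question."""
    parts = [instructions.strip(), ""]
    for example in examples:
        parts.append(f"Q: {example['question']}")
        parts.append(f"A: {example['answer']}")
        parts.append("")
    parts.append(f"Q: {question}")
    parts.append("A:")  # left open for the model to complete
    return "\n".join(parts)

examples = [
    {"question": "happy", "answer": "sad"},
    {"question": "tall", "answer": "short"},
]
prompt = build_few_shot_prompt("Give the antonym of every input.", examples, "big")
print(prompt)
```

An example selector would simply choose which entries of `examples` to pass in, for instance the ones most similar to the question.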
Few shot examples for chat models
Ref - Link
Keep Exploring!!!
June 22, 2023
Scaling Applications
We have serverless function options such as AWS Lambda and GCP Cloud Run. These autoscale effectively.
For custom apps (REST / Flask / FastAPI endpoints), how do we autoscale?
- Horizontal Pod Autoscaler (HPA): adjusts the number of replicas of an application.
- HPA is a form of autoscaling that increases or decreases the number of pods
Ref - Link
HorizontalPodAutoscaler Walkthrough
Key Notes
The kubectl autoscale subcommand, part of kubectl, helps you do this.
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
# You can use "hpa" or "horizontalpodautoscaler"; either name works OK.
kubectl get hpa
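To see what the autoscaler does with `--cpu-percent=50 --min=1 --max=10`, here is a small sketch of the scaling rule from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), clamped to the min/max. The function name is mine, and the real controller also applies a tolerance band and stabilization windows that are left out here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA rule: scale replicas in proportion to metric / target."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the --min / --max bounds given to kubectl autoscale.
    return max(min_replicas, min(max_replicas, raw))

# 2 pods at 100% average CPU with a 50% target
print(desired_replicas(2, 100, 50))
# 4 pods at 10% average CPU with a 50% target
print(desired_replicas(4, 10, 50))
```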
How to Test Autoscaling in Kubernetes
Keep Exploring!!!
June 20, 2023
June 19, 2023
Virtual Try on - TryOnDiffusion: A Tale of Two UNets
- Transfer clothes between source, target
- Warping, blending
- Occlusion is challenging
- Diffusion models to handle issues
Warping -
Warping involves transforming an image's geometry, usually to correct distortions, align images, or change the perspective.
There are different types of warping, such as:
- Affine warping: This type of warping preserves parallel lines and involves a linear transformation followed by a translation. It can represent transformations like rotation, scaling, and shearing.
- Perspective (projective) warping: This type of warping can represent a more general transformation that includes perspective changes. It can correct distortions caused by the camera's viewpoint or create a "bird's-eye view" of a scene. Perspective warping requires four pairs of corresponding points in the input and output images to calculate the transformation matrix.
- Warping is widely used in various applications, such as image stitching (for creating panoramas), rectifying images for OCR (Optical Character Recognition), and correcting lens distortions in photographs.
In the context of computer vision libraries like OpenCV, warping functions are available to apply these transformations to images, given the appropriate transformation matrix and input/output coordinates.
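As a small illustration of affine warping, the sketch below applies a 2x3 matrix (the same shape of matrix that cv2.warpAffine takes) to point coordinates in plain Python and checks that parallel edges stay parallel. It only maps coordinates; it does not resample pixels the way OpenCV does, and the helper names are assumptions.

```python
import math

def affine_matrix(angle_deg, scale, tx, ty):
    """2x3 matrix: rotation by angle_deg with uniform scale, then translation."""
    a = math.radians(angle_deg)
    c, s = scale * math.cos(a), scale * math.sin(a)
    return [[c, -s, tx],
            [s, c, ty]]

def warp_point(matrix, point):
    """Apply the affine map to one (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])

def direction(p, q):
    return (q[0] - p[0], q[1] - p[1])

def is_parallel(d1, d2, tol=1e-9):
    """Two directions are parallel when their 2D cross product is zero."""
    return abs(d1[0] * d2[1] - d1[1] * d2[0]) < tol

M = affine_matrix(angle_deg=30, scale=2.0, tx=5.0, ty=-1.0)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
warped = [warp_point(M, p) for p in square]
# Opposite edges of the square are still parallel after the warp.
print(is_parallel(direction(warped[0], warped[1]), direction(warped[3], warped[2])))
```

A perspective warp would need a 3x3 matrix and a division by the third homogeneous coordinate, which is why it does not preserve parallel lines in general.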
OpenCV Warping functions
- cv2.warpAffine
- cv2.warpPerspective
- cv2.remap
OpenCV Blending functions
- cv2.addWeighted
- cv2.add
- cv2.subtract
- Note #1 - All segmentation done on low resolution
- Note #2 - Super Resolution is added to cover up low res and give high res outputs
- Note #3 - Running all tasks on high res is even more challenging
Paper - link
Things to note
- Posture detection
- More minimal clothes and superimposition approach
- Full body posture + cloth overlap on it.
Keep Exploring!!!
June 16, 2023
Concept Kullback-Leibler (KL) divergence, chi-square test
Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution.
The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables in a sample.
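Both concepts are short to compute by hand. The sketch below uses the standard definitions, D_KL(P || Q) = sum of p_i * log(p_i / q_i) and the chi-square statistic sum of (observed - expected)^2 / expected over a contingency table. The example numbers are made up, and the p-value lookup against the chi-square distribution is not included.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi_square_statistic(table):
    """Chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric
print(chi_square_statistic([[10, 20], [20, 10]]))
```

The statistic is compared against a chi-square distribution with (rows - 1) * (columns - 1) degrees of freedom to decide significance.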
Keep Exploring!!!
June 15, 2023
June 14, 2023
June 13, 2023
My top picklist from article on Palantir principles
Expertise = Experience = Domain + Data + Tech (Blend of all)
- Do real customer work long enough to have full empathy and inspire.
- Don’t just empathize with the user; be the user.
- Build prototype solutions for the unique problems
- Build features that magnify value over time
- Consider using working products to iterate with users instead of designs and concepts.
Easier said than done; I can relate it to my past journey/projects. Solve for what the customer needs rather than starting from what a technology I know can do.
Ref - Link
Keep Exploring!!!
Thinking Questions
Reduction in Cart Abandonment
Key Notes
- RCA
- External Factors
- Competitor Launches / Products
- Affected Customer Segment
- Macro Economic changes
- Seasonality
- User Journey in App
- Data Captured - Gender / Age
- Campaign related impacts
- No correlation to campaign
- Any product design changes
- Catalog and inventory analysis of the product
- Which product category etc. has the highest dip.
- Any partnership change or any merchant backed.
- Geographical distribution of the influx of users on the site, and regional events such as a flood or internet blackout
- Compare the pricing of the product with the competitors
- Ratings of the products getting moved out of the cart
Ref - Link
Solutions Architect
Key Notes
- Challenges, Business Goals, Tech Goals
- Feature / Product Demo
- Product Integration aspects
- Customer Success Stories
- Next Steps
Ref - Link
MLOps
Key Projects
- Loan scoring, Forecasting
- MLOps Pipeline - Data Collection, Ingestion
- Data cleaning + Feature Engineering
- Different models for different products
- Automate model selection / training steps
- Model Validation / Testing phase
Ref - Link
System Design for Recommendations and Search
Key Notes
- Batched - Store in DB, Precomputed, Refreshed, Key-value pairs
- Real-time - Time-sensitive content
Key Concepts
- Embedding creation of interests
- Features mapping
- Ranking / Retrieval
- Behavior logs - candidate sets - recommendations
- Top N neighbors, KNN, Indexes
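The retrieval step (embeddings, then top N neighbors) can be sketched as a brute-force cosine-similarity search. The item names and vectors are made up; a production system would replace the full scan with an approximate index.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_neighbors(query, item_embeddings, n=2):
    """Score every item against the query and keep the n most similar."""
    scored = [(cosine_similarity(query, vec), item)
              for item, vec in item_embeddings.items()]
    scored.sort(reverse=True)
    return [item for _, item in scored[:n]]

# Hypothetical item embeddings and a user-interest embedding.
items = {
    "shoes": [0.9, 0.1, 0.0],
    "sandals": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}
user_interest = [1.0, 0.0, 0.0]
print(top_n_neighbors(user_interest, items, n=2))
```

In the batched setup, the output of this step per user is what gets precomputed and stored as key-value pairs.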
Ref - link
Model Deployment Architecture
My implementation experience and lessons :)
Product Implementation (2012-2014)
- Integrated in product
- Jobs scheduled for midnight
- Workflow to monitor variations
- Forecast updated every day for store
- Everything custom-coded formula embedded
- Weighted moving average
- Step up / Step down moving average approach
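For reference, a weighted moving average forecast looks roughly like this. The weights and sales numbers are illustrative assumptions, and the step up / step down variant is not shown because its exact rules are not described here.

```python
def weighted_moving_average(history, weights):
    """Forecast the next value from the last len(weights) observations.

    weights[-1] applies to the most recent observation.
    """
    window = history[-len(weights):]
    if len(window) < len(weights):
        raise ValueError("not enough history for the given weights")
    return sum(v * w for v, w in zip(window, weights)) / sum(weights)

daily_sales = [5, 10, 20, 30]
# Recent days count more: weights 1, 2, 3 on the last three days.
print(weighted_moving_average(daily_sales, [1, 2, 3]))
```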
Batched State of Art (2021)
Recommendations AWS
- ETL / Glue jobs to get features
- Full pull/delta pull scripts
- Feature engineering scripts
- Custom segmentation scripts
- Batch jobs to run models
- Large-scale recommendations generation
- Infra kubeflow setup
- Leverage existing Kubeflow monitoring setup
Forecasting State of Art (2021)
Kubeflow + AWS
- ETL / Glue jobs to get features
- Full pull/delta pull scripts
- Feature engineering scripts
- Custom segmentation scripts
- Batch jobs to run models
- Kubeflow pipelines for the forecast
- Results persist in Redshift DB
- Infra kubeflow setup
- Leverage existing kubeflow monitoring setup
Realtime State of Art (2022)
Real-time streaming / Vision Solution
- AWS Lambda-based approach
- Vision + Docker + AWS Lambda
- Request monitoring / logging
Keep Exploring!!!
Vector Databases Reads
Milvus Notes - Index/consistency / availability options
#1. Index type - usecase
- IVF_FLAT - High-speed query
- IVF_PQ - Very high-speed query
- HNSW - High-speed query
Inverted File (IVF): An IVF index divides the vector space into several clusters and holds an inverted file for each cluster, recording which vectors belong to the cluster.
IVF Flat: This is a combination of IVF and flat index. It uses the IVF index to partition the data into clusters and then uses the flat index (brute-force search) within each cluster.
Hierarchical Navigable Small World (HNSW): HNSW builds a multi-layer navigation graph to represent the vector space.
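The IVF Flat idea described above can be sketched in a few lines: assign each vector to its nearest centroid, then at query time scan only the closest clusters. This is a toy version and not Milvus code; the centroids are given by hand (normally they come from k-means), and `nprobe` and `limit` are just parameter names for the number of clusters to scan and results to return.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf_index(vectors, centroids):
    """Inverted file: for each centroid, the ids of the vectors assigned to it."""
    inverted = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_distance(vec, centroids[c]))
        inverted[nearest].append(vec_id)
    return inverted

def ivf_flat_search(query, vectors, centroids, inverted, nprobe=1, limit=2):
    """Brute-force (flat) search restricted to the nprobe closest clusters."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: squared_distance(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in inverted[c]]
    candidates.sort(key=lambda vec_id: squared_distance(query, vectors[vec_id]))
    return candidates[:limit]

vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # assumed given for this sketch
inverted = build_ivf_index(vectors, centroids)
print(ivf_flat_search((4.8, 5.1), vectors, centroids, inverted))
```

Raising `nprobe` scans more clusters, which trades speed for recall.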
#2. Consistency levels - Strong, Bounded, Session or Eventually
- Strong - Most strict
- Bounded staleness - allows data inconsistency during a certain period of time.
- Session - Writes are immediately visible to reads within the same session (read-your-own-writes)
- Eventually - weakest level among the four.
#3. HA - In-memory replicas help Milvus recover faster if a query node crashes.
#4. Vector search & Hybrid Search params offset, limit
offset - Number of results to skip in the returned set
limit - Number of the most similar results to return
How indexing and querying works
- Trees - Annoy (Approximate Nearest Neighbors Oh Yeah)
- Proximity graphs - Hierarchical Navigable Small World (HNSW) graphs
- Clustering - FAISS
- Hashing - Locality-Sensitive Hashing (LSH)
- Vector compression - PQ or ScaNN (Scalable Nearest Neighbors). Product Quantization (PQ): a PQ index compresses vectors into compact codes and is beneficial for large-scale, high-dimensional data.
- Utilizing Few-shot and Zero-shot learning with OpenAI embeddings
- Query Comparison
- Accelerating Similarity Search on Really Big Data with Vector Indexing
Keep Exploring!!!
June 12, 2023
Vision Catalog Creation
Every problem statement needs to have
- Selected products
- Custom backgrounds
- Present / Segmentation/options
- Variations with easy to use approach
1. Define products/layouts
2. Custom layout for each type of product
3. Once the product is positioned, add the custom background
4. Generate a photoshoot
Keep Exploring!!!
June 11, 2023
How to train your own LLM - Copilot type LLMs
Notes
- Scenarios to custom train
- Privacy, IP, Customization
- Smaller and Efficient Models
- Restrict Information shared with LLM models
- Code completion model by Replit
Stack
- Databricks pipeline
- Hugging Face for tokenizers / inference tools for code
- MosaicML - GPU and model training
- Training LLM Architecture
- Extensive code base of Git / Stackoverflow
- Data preprocessing
- All preprocessing in distributed fashion
- Lot of work on notebooks
- Removed auto generated code from training
- Anonymize data remove PII info
- Remove code that does not compile
- Remove Python 2 code and keep to a single Python version
- Maximum line length set
- Custom Vocabulary creation
- Custom tokenizer for domain specific dataset
MosaicML for training
Future
- Optimal / Smaller LLM
- Customized LLMs
- LLM with reasoning
Keep Exploring!!!
June 10, 2023
DBScan vs KMeans Summary
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means are two popular clustering algorithms used for unsupervised learning tasks. They have different approaches to clustering and are suitable for different types of data. Here's a comparison of the two algorithms:
Approach:
DBSCAN: DBSCAN is a density-based clustering algorithm. It groups together points that are closely packed together based on a distance measure (e.g., Euclidean distance) and a density threshold. It can find clusters of arbitrary shapes and is also able to identify noise points that do not belong to any cluster.
K-means: K-means is a centroid-based clustering algorithm. It partitions the data into K clusters by minimizing the sum of squared distances between the data points and their corresponding cluster centroids. K-means assumes that clusters are spherical and have similar sizes.
Number of clusters:
DBSCAN: The number of clusters is determined automatically by the algorithm based on the input parameters (distance threshold and minimum number of points). You don't need to specify the number of clusters beforehand.
K-means: You need to specify the number of clusters (K) beforehand. Choosing the optimal value of K can be challenging and often requires domain knowledge or using techniques like the elbow method or silhouette analysis.
Cluster shapes:
DBSCAN: DBSCAN can find clusters of arbitrary shapes, making it suitable for datasets with complex structures.
K-means: K-means assumes that clusters are spherical and have similar sizes, which may not be suitable for datasets with complex structures or clusters with different shapes and sizes.
Handling noise:
DBSCAN: DBSCAN can identify and separate noise points that do not belong to any cluster.
K-means: K-means is sensitive to noise and outliers, as they can significantly affect the position of the cluster centroids.
Scalability:
DBSCAN: DBSCAN can be slower than K-means for large datasets, especially if the full pairwise distance matrix needs to be computed. However, spatial indexes (e.g., KD-trees or ball trees) speed up the neighbor queries, and related algorithms such as HDBSCAN extend the approach to clusters of varying density.
K-means: K-means is generally faster and more scalable than DBSCAN, especially when using optimized implementations (e.g., MiniBatchKMeans in scikit-learn).
In summary, DBSCAN is more suitable for datasets with complex structures, arbitrary cluster shapes, and noise, while K-means is faster and more scalable but assumes spherical clusters with similar sizes. The choice between DBSCAN and K-means depends on the characteristics of the data and the specific requirements of the clustering task.
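To make the noise-handling difference concrete, here is a minimal DBSCAN sketch in plain Python for 2D points. It follows the textbook algorithm with a brute-force neighbor search; the sample points, `eps`, and `min_points` values are assumptions chosen for the demo, and a library implementation such as scikit-learn's would be used in practice.

```python
def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    px, py = points[idx]
    return [j for j, (x, y) in enumerate(points)
            if (x - px) ** 2 + (y - py) ** 2 <= eps ** 2]

def dbscan(points, eps, min_points):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_points=3))
```

The number of clusters falls out of the data, and the isolated point gets the noise label -1, whereas K-means with K=2 would have to assign it to one of the two centroids.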
Keep Exploring!!!
June 07, 2023
Data science takes time
- The real world is not Kaggle data
- It is very risky to rely more on technology and less on understanding the problem
- Do not jump into solutions without knowing the domain
- The intent should not be to solve fast but to solve with clarity
- Have an open mind about Domain vs Data vs Algo
- Be candid about opinions
- If all problems were like Kaggle, we would have seen a ton of production solutions
- Interview questions may be based on products people spent years building; thought process / clarity is more important than quick working solutions
Keep Thinking!!!
June 05, 2023
Cashflow forecasting
Paper - Empowering cash managers to achieve cost savings by improving predictive accuracy
- Cash management is concerned with optimizing the short-term funding requirements of a company
Time Series Forecasting with Transformer Models and Application to Asset Management
- Sequence prediction - we often predict the next value of the sequence itself
- Sequence generation - convert sequences from one domain into sequences from another domain, such as machine translation, text summarization, chatbots
- Iterated multi-step forecasting
- Direct multi-step forecasting
Self-attention is designed to capture the dependencies in the sequence, such as the relationship between each word and every other word in a sentence.
For a given query, we compare it with all keys K and get different weights for different values.
Self-attention and multi-head attention are permutation-equivariant with respect to their inputs.
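The query/key/value description and the permutation-equivariance claim can be checked with a tiny scaled dot-product self-attention in plain Python. To keep it short this sketch assumes Q = K = V = the input itself (no learned projections, single head), which is a simplification of what a Transformer layer does.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    out = []
    for q in x:
        # Compare the query with every key to get one weight per value.
        weights = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                           for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
out = self_attention(x)
out_perm = self_attention([x[i] for i in perm])
# Permuting the inputs permutes the outputs in the same way.
print(all(abs(a - b) < 1e-12
          for k, i in enumerate(perm)
          for a, b in zip(out_perm[k], out[i])))
```

This is also why Transformers need positional encodings for time series: without them the layer has no notion of order.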
In our experiment, we consider three different portfolio allocation methods:
- Single-period MVO portfolio with monthly rebalancing
- Risk parity portfolio with monthly rebalancing
- Multi-period MVO portfolio with weekly rebalancing as described by Problem
How to Build a Cash Flow Forecast
- Determine Your Forecasting Objective(s)
- Short-term liquidity planning
- Interest and debt reduction
- Liquidity risk management
- Growth planning
Short-period forecasts: Short-term forecasts typically look two to four weeks into the future and contain a daily breakdown of cash payments and receipts.
Medium-period forecasts: The most common medium-term forecast is the rolling 13-week cash flow forecast.
Long-period forecasts: Longer-term forecasts typically look 6–12 months into the future and are often the starting point for annual budgeting processes.
Mixed-period forecasts: Mixed-period forecasts use a mix of the three periods above and are commonly used for liquidity risk management.
- Forecast your income or sales
- Estimate cash inflows
- Estimate cash outflows and expenses
- Review your estimated cash flows against the actual
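The steps above reduce to simple arithmetic: closing balance = opening balance + inflows - outflows, period by period, followed by a comparison with actuals. A minimal sketch with made-up weekly numbers:

```python
def project_cash_balance(opening_balance, inflows, outflows):
    """Running cash balance after each period."""
    balances = []
    balance = opening_balance
    for cash_in, cash_out in zip(inflows, outflows):
        balance += cash_in - cash_out
        balances.append(balance)
    return balances

def forecast_errors(forecast, actual):
    """Actual minus forecast for each period (positive = more cash than planned)."""
    return [a - f for f, a in zip(forecast, actual)]

# Hypothetical three-week example.
forecast = project_cash_balance(100, inflows=[50, 20, 0], outflows=[30, 40, 60])
print(forecast)
print(forecast_errors(forecast, actual=[110, 105, 40]))
```

A rolling 13-week forecast would repeat this each week with 13 periods, dropping the week that just closed and adding a new one at the end.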
Preparing a cash flow forecast: Simple steps for vital insight
- Decide how far out you want to plan for
- List all your income
- List all your outgoings
Empirical analysis of daily cash flow time series and its implications for forecasting
Cash management is concerned with the efficient use of a company’s cash and short-term investments such as marketable securities.
From these and other works, we observe that common assumptions on the statistical properties of cash flow time-series include:
- Normality: cash flows follow a Gaussian distribution with observations symmetrically centered around the mean, and with finite variance.
- Absence of correlation: the occurrence of past cash flows does not affect the probability of occurrence of the next ones.
- Stationarity: the probability distribution of cash flows does not change over time and, consequently, its statistical properties such as the mean and variance remain stable.
- Linearity: cash flows are proportional either to another (external) explanatory variable or to a combination of (external) explanatory variables.
Empowering cash managers to achieve cost savings by improving predictive accuracy
Kurtosis is a measure of the tailedness of a distribution. Tailedness is how often outliers occur.
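A quick way to see tailedness in numbers is excess kurtosis, the fourth standardized moment minus 3 (a Gaussian scores 0). The sketch below uses the simple population formula without bias correction, and the sample data is made up.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0

steady_flows = [-1, 1, -1, 1]        # no outliers at all
spiky_flows = [0] * 9 + [10]         # one large outlier
print(excess_kurtosis(steady_flows), excess_kurtosis(spiky_flows))
```

A strongly positive value on daily cash flows is one sign that the normality assumption listed earlier does not hold.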
Transforming Financial Forecasting with Data Science and Machine Learning at Uber
- Strategic planning
- Operations
- Insights
Modeling strategic investments as an optimization problem
- Minimize spending
- Maximize number of drivers or riders
- Maximize number of first trips or total trips
- Maximize gross bookings
With each optimization problem, we can also specify constraints, such as:
- Maximum budget, overall or specific to certain channels (such as marketing versus rider promotion)
- Minimum number of first trips or trips
- Minimum month-to-month gross booking growth
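An objective plus constraints like the ones listed can be illustrated with a tiny grid search. Everything numeric here is hypothetical: the square-root response curves, the coefficients, and the budget figures are assumptions for the sketch, not Uber's models. A real setup would plug in fitted forecast models and a proper solver.

```python
import math

def expected_trips(marketing_spend, promo_spend):
    # Hypothetical concave response curves (diminishing returns).
    return 40 * math.sqrt(marketing_spend) + 30 * math.sqrt(promo_spend)

def best_allocation(total_budget, max_marketing, step=1):
    """Maximize trips subject to a total budget and a cap on one channel."""
    best = None
    for marketing in range(0, min(total_budget, max_marketing) + 1, step):
        promo = total_budget - marketing  # spend the whole budget
        trips = expected_trips(marketing, promo)
        if best is None or trips > best[0]:
            best = (trips, marketing, promo)
    return best

print(best_allocation(total_budget=100, max_marketing=100))
print(best_allocation(total_budget=100, max_marketing=50))  # channel cap binds
```

When the channel cap is tightened, the best feasible split moves to the boundary of the constraint, which is the typical behavior to look for when adding constraints to a forecast-driven objective.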
Short-term use cases: Short-term use cases for cashflow forecasting include budgeting, forecasting sales, and managing cash flow. It can also be used to identify potential areas of overspending and to plan for future investments
Long-term use cases: Cashflow forecasting can be used to plan for long-term investments, such as capital expenditures and acquisitions. It can also be used to develop strategies for managing cash flow over the long-term, such as budgeting and debt management
- Receivables forecast
- Payable forecast
June 04, 2023
Forecast + Optimization
- Regression to find optimal values of 'X' values
- Add a constraint to make it an optimization problem
- Optimization with minimum expense for each track
June 03, 2023
Promises and Lies of ChatGPT - understanding how it works
- ChatGPT builds on the idea of n-gram models
- Given n-1 words, guess what the nth word is likely to be
- Distribution is learnt from sequence
- People tried in small values of n
- Sample from distribution of words
- More likely words more often
- For any n: given the previous words, predict the next word
- Frequency, Conditional probability
- Generate words if the first word given
- More likely words + Patterns
- Abstract sequences
- Different answers every time
- Every sequence may be different generated distributions but a similar context is possible
- Chatgpt = something well written
- We believe in what seems realistic
- Connect to human experience
- Fact is different from possibility
- Plausible or probable or reasonable answers
- Humans are not always factual
- It can be perception based
- People can be penalized in civil society
- Machines can suggest without knowing the consequences
- Automation still may have a bias
- Being close to the truth we are impressed
- Can create bias in information
- Discriminative learning learns a conditional model
- A classifier that finds dogs vs a model that generates dogs: the two are different
- The prior distribution of reasonable images
- Teacher = Generative model
- Learning generative model is costlier
- chatgpt does something similar
- All learning is compression
- All learning is lossy compression
- jpeg lossy - approximating
- Representation of compressed details
- Significant footprint available to train systems
- Picasso style pics
- Shakespeare style writing
- Racial profiling not required
- Character and form are not connected
- Generalizations help for survival
- AI as creator / editor
- Harder to write original creative ways
- Original vs Derivative thinking
- Bad handwriting vs Good content
- Bad package vs Good product
- We have one scale good or bad
- LLM learns from human language
- Most likely completion given society is
- Social engineering on data
- RLHF
- Show results
- asks someone their likes
- Thumbs up / down to change distribution
- Re-learning it
- Collectively offensive content on web vs making a decent prompt engine
- Align to human values
- Concentration camps, Genocide - Human values
- Retrain for cultural norms
- False positive
- Different narrative, different takers
- Make LLM overwrite conditional network through prompts
- Adversarial learning prompts
- How to add knobs so that it behaves well
- Basically, get people to think about the problem
- With enough eyeballs, every downside can be a shallow bug
- We need more eyeballs to decide
- ChatGPT will not generate grammatically incorrect sentences
- Core problem of intelligent behavior - planning, diagnosis, reasoning
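The n-gram idea at the start of these notes (learn a distribution over the next word, then sample from it) fits in a few lines for n = 2. This is a toy bigram model on a made-up corpus, meant only to show the counting and sampling steps; it is not how ChatGPT is implemented.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, how often each next word follows it."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Conditional probability P(next word | prev) from the counts."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

def generate(counts, first_word, length, seed=None):
    """Sample a sequence: more likely words come up more often."""
    rng = random.Random(seed)
    out = [first_word]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this word
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
print(next_word_distribution(model, "the"))
print(generate(model, "the", 6, seed=0))
```

Changing the seed gives a different but similarly plausible sequence, which matches the note about different answers every time.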
June 02, 2023
AI - Image Generator - Approach
Under the hood, after training on tons of images, we are generating from distributions based on
- Context generator = backgrounds
- Object generator = cars / bikes
- Object + Context = Car on the beach
- Finetune to similar pictures
- Sharpen images / Super resolution
- Fix shapes / corners
Keep Exploring!!!