"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 28, 2023

Pytube error - RegexMatchError: get_throttling_function_name: could not find match for multiple

This error points to the file C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\pytube\cipher.py.


Fix in link


A one-line fix, but a lot of misdirection / useless pointers along the way.

Fixing the issue in the least time / debugging in the right place is what matters.

Keep Exploring!!!

Better communication and connection

  • Move slow, talk slow
  • Make the other person feel safe
  • A slow, low tone works better than a loud note
  • Comfort enables more thinking
  • Panic restricts access to memory
  • A mindset of openness and courage, not manipulation
  • More questions than opinions
  • Eyes open for everything invisible

Keep Thinking!!!

Designing Async / Parallelized Tasks

Earlier roles / Feeds Processing

We had some critical tasks for getting supplier feeds/data:

  • File Copy
  • File load
  • Run Validations
  • Load Data

DB supporting it

  • Schema
  • Jobs / Options / Run status / Retry

Technically from a scale point of view

  • File watcher (a sketch follows this list)
  • File lock
  • Process jobs
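
A minimal sketch of this pattern, assuming a hypothetical incoming-feeds folder and a simple lock file so two workers never process the same feed (standard library only):

import os
import time

FEED_DIR = "incoming_feeds"   # hypothetical drop folder for supplier files
POLL_SECONDS = 30

def try_lock(path):
    # create <file>.lock atomically; fails if another worker already holds it
    try:
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def process_job(path):
    print("processing", path)   # file load / validations / data load would go here

def watch():
    while True:
        for name in os.listdir(FEED_DIR):
            path = os.path.join(FEED_DIR, name)
            if name.endswith(".lock") or not os.path.isfile(path):
                continue
            if try_lock(path):
                try:
                    process_job(path)
                finally:
                    os.remove(path + ".lock")
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    watch()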

Validations

  • Bunch of procedures

Key design tweaks

  • Parallelize operations
  • Data copy as objects / temp tables
  • Parallel file copy (a sketch follows this list)
  • Support multiple threads
  • Avoid data blocking/updates
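
A minimal sketch of the parallel file copy idea, assuming hypothetical source and staging folders; copying is I/O-bound, so a thread pool is enough here:

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

SRC_DIR = "supplier_feeds"    # hypothetical source folder
DST_DIR = "staging"           # hypothetical staging folder

def copy_one(name):
    shutil.copy2(os.path.join(SRC_DIR, name), os.path.join(DST_DIR, name))
    return name

def copy_all(max_workers=8):
    os.makedirs(DST_DIR, exist_ok=True)
    files = [f for f in os.listdir(SRC_DIR) if os.path.isfile(os.path.join(SRC_DIR, f))]
    # copy files concurrently; each thread handles one file at a time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name in pool.map(copy_one, files):
            print("copied", name)

if __name__ == "__main__":
    copy_all()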

ML Context

  • For parallel model creation
  • Configuration
  • Submit job
  • 5 time-series category datasets / global models in each category
  • 10 jobs, 5 category dataset models

10 different models

  • Prepare data Job
  • Fetch initial data
  • Process missing variables
  • Data imputation
  • Save Results

Execute Job

  • Read prepared data
  • Fetch algo to run
  • Train algo
  • Log training accuracy
  • Save model

Predict Job

  • Load saved model
  • Run predictions
  • Save in DB

Design Ideas

  • Atomic functions
  • Job monitor / independent execution units
  • Horizontal scaling in Kubernetes App
  • Common DB and multiple parallel execution functions (a sketch follows this list)
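
A minimal sketch of the atomic-function idea for the ML context above, with hypothetical prepare/execute/predict steps chained per category and run as independent units in a process pool (a real setup would persist each step's status to the common DB for the job monitor):

from concurrent.futures import ProcessPoolExecutor

CATEGORIES = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]   # hypothetical dataset categories

def prepare_data(category):
    # fetch initial data, process missing variables, impute, save results
    return f"{category}_prepared"

def execute_job(prepared):
    # read prepared data, fetch algo, train, record accuracy, save model
    return f"{prepared}_model"

def predict_job(model):
    # load saved model, run predictions, save to DB
    return f"{model}_predictions"

def run_pipeline(category):
    # atomic steps chained into one independent execution unit per category
    return predict_job(execute_job(prepare_data(category)))

if __name__ == "__main__":
    # categories run in parallel as separate processes
    with ProcessPoolExecutor(max_workers=5) as pool:
        for result in pool.map(run_pipeline, CATEGORIES):
            print("done:", result)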

Python Options

FastAPI - Uvicorn also has an option to start and run several worker processes. Link

uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4
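
A minimal sketch of the main:app module the command above expects (the /health endpoint is a hypothetical example; each of the 4 worker processes gets its own copy of the app):

from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # hypothetical endpoint; heavy work would go into background jobs, not here
    return {"status": "ok"}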

Flask API - Link

if __name__ == '__main__':
    app.run(threaded=True)

or from the CLI:

flask run --with-threads

More Reads

System Design — Design a distributed job scheduler (Keep It Simple Stupid Interview series)

Orchestrating a Background Job Workflow in Celery for Python

System Design: Designing a distributed Job Scheduler | Many interesting concepts to learn

Examples


Keep Exploring!!!

#!pip install nest_asyncio
import asyncio
import nest_asyncio

nest_asyncio.apply()

async def loaddata():
    print("loadingdata")
    await asyncio.sleep(5)
    print("data loaded")

async def processdata():
    print("process data")
    await asyncio.sleep(5)
    print("data processed")

async def processfeeds():
    task1 = asyncio.create_task(loaddata())
    print("Data load waiting")
    await task1
    print('Data load done')
    await asyncio.sleep(1)
    task2 = asyncio.create_task(processdata())
    print("Data process waiting")
    await task2
    await asyncio.sleep(5)
    print('Data process done')

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(processfeeds())




June 27, 2023

Data science = Data + Domain + AI + Commonsense

Many times I read up on the basics again and again. Over the years, I started with Windows 98 testing, C/C++ adapters, Nestle production support, application support, supply chain QA / performance / OLTP development, SQL developer, BI developer, setting up teams, warranty, refurbishment, API / supply chain, website A/B testing, on-call support, retail product team setup/forecasting/scaling, and then a long 2-year learning curve / paid lectures / back-to-basics mode. More learning started after that. Getting a break needs a lot of freelance / consulting / training / applied learning. The past 3 years have been very focused on learning, projects, and production deployments.

Now when I teach the flow/work, there are different areas overall to understand across products/domain/use cases:

  1. Stats, probability, A/B tests, LR
  2. ML world - decision trees, SVM, logistic regression, random forests
  3. Some variations of these for anomaly detection, decision tree regressors, SVM regressors, loss functions, conditional random fields
  4. The deep learning side of CNN, RNN, LSTM, Transformers
  5. NLP side of tokens, embeddings, different architectures up to the latest state of the art: BERT, ChatGPT, zero-shot, few-shot approaches
  6. Forecast track with different models, both regression and time-series approaches
  7. Recommendation track from basics to advanced hybrid models: user-user, item-item, hybrid, seasonal, and segment based
  8. Vision side of custom object detection, classification, transfer learning, segmentation, applied use cases
  9. World of GenAI for text/vision
  10. Apart from this, the production/deployment architecture

Sometimes I wonder how many things we can teach someone to switch to AI/ML. Always leverage your strengths in domain/data knowledge. The scope of it is vast and increasing day by day. It is hard to know everything; the end goal is to add value to the business / use it to fix current challenges. Balance both learning and implementation. It would be a long journey to just learn forever.

Always blend your ideas in DATA + DOMAIN + AI + Business Value to find the right use cases.

Keep Exploring!!!

10 Reasons why Gen AI will Work vs Fail

Let's list 10 reasons why GenAI will succeed

  1. Saves time / provides ideas
  2. Creates powerful ideas with richer inspiration, adding emotion and logic with words/statements
  3. Can provide draft versions, cutting copywriter / content writer effort
  4. Summarizes a given passage with the critical points
  5. Acts like an assistant / chatbot
  6. Visual inspiration with images
  7. Generates different styles/notes/marketing/promo content
  8. It can create content based on prompts
  9. Human-like responses/content with proper grammar/flow
  10. Shares opinions with reasons

Let's list 10 reasons why GenAI will fail

  1. Cannot scale in every field; cannot be considered for all domains
  2. Cannot be factually right always
  3. Draft content vs final content - someone has to fill this gap
  4. Content generated needs human validation
  5. Without relevant knowledge, we may not be able to spot issues
  6. Needs iterations to prepare high-quality output
  7. Capabilities vs shortcomings have to be balanced per use case
  8. Vision has a long way to go
  9. Empathy is more a trained distribution of words; ideally, it's all human-fed content
  10. True reasoning / wide subject knowledge is really limited

Keep Thinking!!!


June 26, 2023

Background removal with Azure API

In the Azure Portal


Under Cognitive Services, you have the options below.

Fetch the endpoint and keys after deployment.

Example code in link

#pip install azure-ai-vision
#https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/how-to/background-removal?tabs=python
import azure.ai.vision as sdk

service_options = sdk.VisionServiceOptions("https://NAMEYOURSERVICE.cognitiveservices.azure.com/",
                                           "PROVIDEKEYS")
analysis_options = sdk.ImageAnalysisOptions()
analysis_options.segmentation_mode = sdk.ImageSegmentationMode.BACKGROUND_REMOVAL
vision_source = sdk.VisionSource(
    url="https://learn.microsoft.com/azure/cognitive-services/computer-vision/media/quickstarts/presentation.png")
image_analyzer = sdk.ImageAnalyzer(service_options, vision_source, analysis_options)
result = image_analyzer.analyze()

if result.reason == sdk.ImageAnalysisResultReason.ANALYZED:
    image_buffer = result.segmentation_result.image_buffer
    print(" Segmentation result:")
    print(" Output image buffer size (bytes) = {}".format(len(image_buffer)))
    print(" Output image height = {}".format(result.segmentation_result.image_height))
    print(" Output image width = {}".format(result.segmentation_result.image_width))
    output_image_file = "output.png"
    with open(output_image_file, 'wb') as binary_file:
        binary_file.write(image_buffer)
    print(" File {} written to disk".format(output_image_file))
else:
    error_details = sdk.ImageAnalysisErrorDetails.from_result(result)
    print(" Analysis failed.")
    print(" Error reason: {}".format(error_details.reason))
    print(" Error code: {}".format(error_details.error_code))
    print(" Error message: {}".format(error_details.message))
    print(" Did you set the computer vision endpoint and key?")

import cv2
from google.colab.patches import cv2_imshow

img = cv2.imread('output.png', cv2.IMREAD_UNCHANGED)
cv2_imshow(img)



Keep Exploring!!!

June 24, 2023

My Langchain Notes - Day 1

  • LLMs are good at conditional generation
  • P(next token | prompt)
  • LLMs do not store state
  • Token size is key for better answers
  • Langchain - build apps with LLMs

The different types of prompts: zero-shot with limited prompts, reasoning that preserves state, simplified step-by-step prompts.

Prompt Template

A prompt template can contain:

  • instructions to the language model,
  • a set of few shot examples to help the language model generate a better response,
  • a question to the language model.

FewShotPromptTemplate

A few shot prompt template can be constructed from either a set of examples, or from an Example Selector object.

Few shot examples for chat models






Ref - Link


!pip -q install openai langchain
import os
os.environ['OPENAI_API_KEY'] = 'your_key'

from langchain.llms import OpenAI

llm = OpenAI(model_name='text-davinci-003',
             temperature=0.9,
             max_tokens=256)
text = "Why did the recession come after covid"
print(llm(text))

"""## Prompt Templates"""
from langchain import PromptTemplate

restaurant_template = """
I want you to act as a naming consultant for new restaurants.
Return a list of restaurant names. Each name should be short, catchy and easy to remember. It should relate to the type of restaurant you are naming.
What are some good names for a restaurant that is {restaurant_desription}?
"""

prompt = PromptTemplate(
    input_variables=["restaurant_desription"],
    template=restaurant_template,
)

# An example prompt with one input variable
prompt_template = PromptTemplate(input_variables=["restaurant_desription"], template=restaurant_template)

description = "Fresh South Indian Food with Idly, Sambar"
description_02 = "Hyderabad Biryani and Tandoor items"
description_03 = "Jain food and vegetarian menu"

## to see what the prompt will be like
prompt_template.format(restaurant_desription=description)

## querying the model with the prompt template
from langchain.chains import LLMChain

chain = LLMChain(llm=llm, prompt=prompt_template)
# Run the chain only specifying the input variable.
print(chain.run(description_03))

"""## with Few Shot Learning"""
from langchain import PromptTemplate, FewShotPromptTemplate

# First, create the list of few shot examples.
examples = [
    {"word": "happy", "antonym": "sad"},
    {"word": "tall", "antonym": "short"},
]

# Next, we specify the template to format the examples we have provided.
# We use the `PromptTemplate` class for this.
example_formatter_template = """
Word: {word}
Antonym: {antonym}\n
"""
example_prompt = PromptTemplate(
    input_variables=["word", "antonym"],
    template=example_formatter_template,
)

# Finally, we create the `FewShotPromptTemplate` object.
few_shot_prompt = FewShotPromptTemplate(
    # These are the examples we want to insert into the prompt.
    examples=examples,
    # This is how we want to format the examples when we insert them into the prompt.
    example_prompt=example_prompt,
    # The prefix is some text that goes before the examples in the prompt.
    # Usually, this consists of instructions.
    prefix="Give the antonym of every input",
    # The suffix is some text that goes after the examples in the prompt.
    # Usually, this is where the user input will go
    suffix="Word: {input}\nAntonym:",
    # The input variables are the variables that the overall prompt expects.
    input_variables=["input"],
    # The example_separator is the string we will use to join the prefix, examples, and suffix together with.
    example_separator="\n\n",
)

# We can now generate a prompt using the `format` method.
print(few_shot_prompt.format(input="fast"))

from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=few_shot_prompt)
# Run the chain only specifying the input variable.
print(chain.run("fast"))

Keep Exploring!!!


June 22, 2023

Scaling Applications

We have AWS Lambda and GCP Cloud Run serverless function options. These help autoscale effectively.

For custom apps / REST / Flask / FastAPI endpoints, how do we autoscale?

  • Horizontal Pod Autoscaler (HPA): adjusts the number of replicas of an application.
  • HPA is a form of autoscaling that increases or decreases the number of pods.

Ref - Link

HorizontalPodAutoscaler Walkthrough

Key Notes

The kubectl autoscale subcommand, part of kubectl, helps you do this.

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

# You can use "hpa" or "horizontalpodautoscaler"; either name works OK.

kubectl get hpa

kubernetes-fastapi

How to Test Autoscaling in Kubernetes
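
A minimal sketch of generating load against an autoscaled endpoint (the URL is a hypothetical placeholder; the Kubernetes walkthrough uses a busybox wget loop, this is just a Python equivalent so you can watch kubectl get hpa react):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://<service-ip>/"      # hypothetical service endpoint behind the HPA

def hit(_):
    try:
        return requests.get(URL, timeout=5).status_code
    except requests.RequestException:
        return None

if __name__ == "__main__":
    # hammer the endpoint with concurrent requests and watch the HPA scale up
    with ThreadPoolExecutor(max_workers=50) as pool:
        while True:
            list(pool.map(hit, range(200)))
            time.sleep(1)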


Keep Exploring!!!

Text Detection Models - Vision - GCP - Azure - Tesseract

 GCP - Link


Azure



Tesseract Results

!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract
import shutil
import os
import random

try:
    from PIL import Image
except ImportError:
    import Image

from google.colab import files
uploaded = files.upload()

extractedInformation = pytesseract.image_to_string(Image.open('F1.jpg'))
print(extractedInformation)





Keep Exploring!!!

June 20, 2023

Yolo V8 Examples vs Azure vs Google Vision vs Image2Text vs LogMeal

Azure Vision


GCP Vision

Ref - Link



Yolo V8


Image to Text Model

!pip install ultralytics

import ultralytics
ultralytics.checks()

from google.colab import files
files.upload()

# Run inference on an image with YOLOv8n
!yolo predict model=yolov8n.pt source='5.jpg'

cd runs/detect/predict
ls

import cv2
from google.colab.patches import cv2_imshow

image = cv2.imread('5.jpg')
cv2_imshow(image)



Log Meal - Link

Keep Exploring!!!

June 19, 2023

Virtual Try on - TryOnDiffusion: A Tale of Two UNets

  • Transfer clothes between source, target
  • Warping, blending
  • Occlusion is challenging
  • Diffusion models to handle issues

Warping

Warping involves transforming an image's geometry, usually to correct distortions, align images, or change the perspective. 

There are different types of warping, such as:

  • Affine warping: This type of warping preserves parallel lines and involves a linear transformation followed by a translation. It can represent transformations like rotation, scaling, and shearing.
  • Perspective (projective) warping: This type of warping can represent a more general transformation that includes perspective changes. It can correct distortions caused by the camera's viewpoint or create a "bird's-eye view" of a scene. Perspective warping requires four pairs of corresponding points in the input and output images to calculate the transformation matrix.

Warping is widely used in various applications, such as image stitching (for creating panoramas), rectifying images for OCR (Optical Character Recognition), and correcting lens distortions in photographs.

In the context of computer vision libraries like OpenCV, warping functions are available to apply these transformations to images, given the appropriate transformation matrix and input/output coordinates.

OpenCV Warping functions

  • cv2.warpAffine
  • cv2.warpPerspective
  • cv2.remap

OpenCV Blending functions (a short warp + blend sketch follows this list)

  • cv2.addWeighted
  • cv2.add
  • cv2.subtract
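
A minimal sketch of how a couple of these calls fit together, assuming OpenCV is installed as cv2 and two local images of the same size, person.jpg and garment.jpg (hypothetical file names):

import cv2

person = cv2.imread("person.jpg")        # hypothetical input images of equal size
garment = cv2.imread("garment.jpg")

# affine warp: rotate the garment 10 degrees around its centre
h, w = garment.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)
warped = cv2.warpAffine(garment, M, (w, h))

# simple alpha blend of the warped garment over the person image
blended = cv2.addWeighted(person, 0.6, warped, 0.4, 0)
cv2.imwrite("blended.png", blended)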

  • Note #1 - All segmentation done on low resolution
  • Note #2 - Super Resolution is added to cover up low res and give high res outputs
  • Note #3 - Running all tasks on high res is even more challenging

Paper - link

Walmart Try on


Things to note

  • Posture detection
  • More minimal clothes and a superimposition approach
  • Full body posture + cloth overlay on it

Keep Exploring!!!

June 16, 2023

Deep Network of Humans

 


Keep Exploring!!!

Concept: Kullback-Leibler (KL) divergence, chi-square test

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution: KL(P || Q) = Σ P(i) * log(P(i) / Q(i)).

The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables in a sample.


#Kullback-Leibler (KL) divergence, also known as relative entropy,
#is a measure of how one probability distribution is different from a second, reference probability distribution.
#It is used in various fields such as information theory, machine learning, and statistics.
#The logarithm function in the KL divergence formula is not symmetric with respect to its arguments.
#Specifically, log(P(i) / Q(i)) is not equal to log(Q(i) / P(i)). This asymmetry contributes to the asymmetry of the KL divergence.
import numpy as np

def kl_divergence(p, q):
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

# Example probability distributions
p = np.array([0.4, 0.6])
q = np.array([0.3, 0.7])

# Calculate KL divergence (note: KL(p || q) != KL(q || p))
kl_div = kl_divergence(p, q)
print("KL divergence:", kl_div)
kl_div = kl_divergence(q, p)
print("KL divergence:", kl_div)

import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q):
    return entropy(p, q)

# Example probability distributions
p = np.array([0.4, 0.6])
q = np.array([0.3, 0.7])

# Calculate KL divergence
kl_div = kl_divergence(p, q)
print("KL divergence:", kl_div)
kl_div = kl_divergence(q, p)
print("KL divergence:", kl_div)

#The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables in a sample.
#It is based on comparing the observed frequencies in each category with the frequencies that would be expected under the assumption of independence between the variables.
#Here's an example: Suppose we have data on the hair color and eye color of a group of people, and we want to test if there is an association between these two variables.
#              Brown Eyes  Blue Eyes  Green Eyes  Total
#Black Hair            50         20          30    100
#Blonde Hair           30         40          30    100
#Total                 80         60          60    200
#We can perform a chi-square test using Python and the scipy.stats library:
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies
observed = np.array([
    [50, 20, 30],
    [30, 40, 30]
])

# Perform chi-square test
chi2, p_value, _, _ = chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("P-value:", p_value)

#In this example, the chi-square statistic is approximately 11.67, and the p-value is approximately 0.003.
#If we choose a significance level of 0.05, we can reject the null hypothesis that hair color and eye color are independent, as the p-value is less than 0.05.
#This suggests that there is a significant association between hair color and eye color in this sample.
#Note that the chi-square test has some limitations:
#It requires a sufficiently large sample size to be valid, as it is based on the approximation of the chi-square distribution.
#It assumes that the observations are independent and identically distributed.
#It is sensitive to the choice of categories and may give different results if the categories are combined or split in different ways.

Keep Exploring!!!

June 15, 2023

GCP Vertex Text Examples

#!pip install "shapely<2.0.0"
#!pip install google-cloud-aiplatform
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="cohesive-gadget-166410", location="us-central1")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.8,
    "top_k": 40
}
text = "Create one liner add for below product. Product meta data is - 2 IN 1 USB CARD READER - Whether Trans Flash or SD, enjoy the power to read and write on two cards simultaneously, and avoid constantly plugging in and out. HIGH-SPEED TRANSMISSION SPEED - USB 3.0 gives you access to Super Speed data transfer rates of up to 5Gbps with all enabled cards. CARD COMPATIBILITY - Support SD/SDHC/SDXCMMC/MMC…¡/RS MMC/MMC 4.0/ Ultra …¡ SD/ Extreme SD/ Extreme … SD /T-Flash / MICRO SDHC/MICRO SDXC 35/5000,Through the adapter can also read MINI SD / MMC micro Card. DESIGN WITH OTG - Truly plug & play of this OTG card reader. Easy installation with no device driver necessary, even the elderly don't have to worry about how to use it.Ideal for transferring high-resolution images and video recordings.."
model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(
    text,
    **parameters
)
print(f"Response from Model: {response.text}")
#Response from Model: **2-in-1 USB 3.0 Card Reader: Read and write on two cards simultaneously at blazing fast speeds up to 5Gbps.**

 





Keep Exploring!!!

June 14, 2023

Future of Learning

 


Happy Learning!!!

June 13, 2023

My top picklist from article on Palantir principles

Expertise = Experience =  Domain + Data + Tech (Blend of all)

  • Do real customer work long enough to have full empathy and inspire.
  • Don’t just empathize with the user; be the user.
  • Build prototype solutions for unique problems.
  • Build features that magnify value over time.
  • Consider using working products to iterate with users instead of designs and concepts.

Easier said than done, but I can echo this from my past journey/projects: solve for the things the customer needs vs. "I know a tech, what all can I do with it".

Ref - Link

Keep Exploring!!!

Thinking Questions

Reduction in Cart Abandonment

Key Notes

  • RCA
  • External Factors
  • Competitor Launches / Products
  • Affected Customer Segment
  • Macro Economic changes
  • Seasonality
  • User Journey in App
  • Data Captured - Gender / Age
  • Campaign related impacts
  • No correlation to campaigns
  • Any product design changes
  • Catalog and inventory analysis of the product
  • Which product category etc. has the highest dip
  • Any partnership change or any merchant backing out
  • Geographical distribution of the influx of users on the site; external events like a flood or internet blackout
  • Compare the pricing of the product with competitors
  • Ratings of the products getting moved out of the cart

Ref  - Link

Solutions Architect

Key Notes

  • Challenges, Business Goals, Tech Goals
  • Feature / Product Demo
  • Product Integration aspects
  • Customer Success Stories
  • Next Steps

Ref - Link  

MLOps

Key Projects

  • Loan scoring, Forecasting
  • MLOps Pipeline - Data Collection, Ingestion
  • Data cleaning + Feature Engineering
  • Different models for different products
  • Automate model selection / training steps
  • Model Validation / Testing phase

Ref - Link

System Design for Recommendations and Search

Key Notes

  • Batched - Store in DB, Precomputed, Refreshed, Key-value pairs
  • Real-time - Time-sensitive content

Key Concepts

  • Embedding creation of interests
  • Features mapping
  • Ranking / Retrieval
  • Behavior logs - candidate sets - recommendations
  • Top N neighbors, KNN, indexes (a retrieval sketch follows this list)
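
A minimal sketch of the top-N retrieval step, assuming item and user embeddings are already computed (random vectors stand in for them here); real systems swap brute-force KNN for an ANN index:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
item_embeddings = rng.normal(size=(1000, 64))   # stand-in for learned item embeddings
user_embedding = rng.normal(size=(1, 64))       # stand-in for a user/interest embedding

# brute-force cosine KNN; production systems use ANN indexes (HNSW, IVF, etc.)
knn = NearestNeighbors(n_neighbors=10, metric="cosine")
knn.fit(item_embeddings)
distances, candidate_ids = knn.kneighbors(user_embedding)
print("candidate item ids:", candidate_ids[0])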

Ref - link 

Keep Exploring!!!

Model Deployment Architecture

 My implementation experience and lessons :)

Product Implementation (2012-2014)

  • Integrated in product
  • Jobs scheduled for midnight
  • Workflow to monitor variations
  • Forecast updated every day for store
  • Everything custom-coded formula embedded
  • Weighted moving average 
  • Step up / Step down moving average approach

Batched State of Art (2021)

Recommendations AWS

  • ETL / Glue jobs to get features
  • Full pull/delta pull scripts
  • Feature engineering scripts
  • Custom segmentation scripts
  • Batch jobs to run models
  • Large-scale recommendations generation
  • Infra kubeflow setup 
  • Leverage existing Kubeflow monitoring setup

Forecasting State of Art (2021)

Kubeflow + AWS

  • ETL / Glue jobs to get features
  • Full pull/delta pull scripts
  • Feature engineering scripts
  • Custom segmentation scripts
  • Batch jobs to run models
  • Kubeflow pipelines for the forecast
  • Results persist in Redshift DB
  • Infra kubeflow setup 
  • Leverage existing kubeflow monitoring setup

Realtime State of Art (2022)

Real-time streaming / Vision Solution

  • AWS Lambda-based approach
  • Vision + Docker + AWS Lambda
  • Request monitoring / logging

Keep Exploring!!!

Vector Databases Reads

Milvus Notes - Index/consistency / availability options

#1. Index type - use case

  • IVF_FLAT - High-speed query
  • IVF_PQ - Very high-speed query
  • HNSW - High-speed query

Inverted File (IVF): An IVF index divides the vector space into several clusters and holds an inverted file for each cluster, recording which vectors belong to the cluster.

IVF Flat: This is a combination of IVF and flat index. It uses the IVF index to partition the data into clusters and then uses the flat index (brute-force search) within each cluster.

Hierarchical Navigable Small World (HNSW): HNSW builds a multi-layer navigation graph to represent the vector space.

#2. Consistency levels - Strong, Bounded, Session or Eventually

  • Strong - Most strict
  • Bounded staleness - allows data inconsistency during a certain period of time.
  • Session - a client reads its own writes within a session
  • Eventually - weakest level among the four.

#3. HA - In-memory replicas help Milvus recover faster if a query node crashes.

#4. Vector search & Hybrid Search params offset, limit

offset - Number of results to skip in the returned set

limit - Number of the most similar results to return

How indexing and querying works

  • Trees – ANNOY -  Annoy (Approximate Nearest Neighbors Oh Yeah)
  • Proximity graphs - HNSW Hierarchical Navigable Small World (HNSW) Graphs
  • Clustering - FAISS (an IVF sketch follows this list)
  • Hashing - LSH - Locality-Sensitive Hashing (LSH)
  • Vector compression - PQ or ScaNN (Scalable Nearest Neighbors). Product Quantization (PQ): a PQ index compresses vectors into compact codes and is beneficial for large-scale, high-dimensional data.
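
A minimal sketch of the IVF idea using FAISS (random vectors as placeholder data; assumes faiss-cpu is installed): the index clusters the vectors, then searches only the closest nprobe clusters.

import numpy as np
import faiss

d, nb, nlist = 64, 10000, 100                     # dim, database size, number of clusters
xb = np.random.random((nb, d)).astype("float32")  # placeholder vectors

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                   # learn the cluster centroids
index.add(xb)

index.nprobe = 10                                 # search 10 of the 100 clusters
distances, ids = index.search(xb[:1], 5)          # top-5 neighbours of the first vector
print(ids)
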
More Reads

Keep Exploring!!!

June 12, 2023

Vision Catalog Creation

Every problem statement needs to have

  • Selected products
  • Custom backgrounds
  • Present / Segmentation/options
  • Variations with easy to use approach

1. Define products/layouts


2. Custom layout for each type of product


3. Once the product is positioned, apply a custom background

4. Generate a photoshoot



Keep Exploring!!!

June 11, 2023

How to train your own LLM - Copilot type LLMs

Notes

  • Scenarios to custom train
  • Privacy, IP, Customization
  • Smaller and Efficient Models
  • Restrict Information shared with LLM models

  • Code completion model by Replit

Stack

  • Databricks pipeline
  • Hugging Face for tokenizers / inference tools for code
  • MosaicML - GPU and model training

  • Training LLM Architecture

  • Extensive code base of Git / Stackoverflow

  • Data preprocessing 
  • All preprocessing in distributed fashion
  • Lot of work on notebooks
  • Removed auto generated code from training
  • Anonymize data remove PII info
  • Remove code that does not compile
  • Remove Python2 code and keep it for one version
  • Maximum line length set

  • Custom vocabulary creation
  • Custom tokenizer for domain-specific datasets (a sketch follows this list)
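
A minimal sketch of training a custom code tokenizer with the Hugging Face tokenizers library; the corpus file paths, vocab size, and special tokens are placeholder choices, not what Replit used:

import os
from tokenizers import ByteLevelBPETokenizer

# hypothetical text files containing the preprocessed code corpus
corpus_files = ["code_corpus_part1.txt", "code_corpus_part2.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=32000,                      # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "<eos>"],     # placeholder special tokens
)

os.makedirs("custom_code_tokenizer", exist_ok=True)
tokenizer.save_model("custom_code_tokenizer")

print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)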

MosaicML for training




Future

  • Optimal / Smaller LLM
  • Customized LLMs
  • LLM with reasoning

Keep Exploring!!!

June 10, 2023

DBScan vs KMeans Summary

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means are two popular clustering algorithms used for unsupervised learning tasks. They have different approaches to clustering and are suitable for different types of data. Here's a comparison of the two algorithms:

Approach:

DBSCAN: DBSCAN is a density-based clustering algorithm. It groups together points that are closely packed together based on a distance measure (e.g., Euclidean distance) and a density threshold. It can find clusters of arbitrary shapes and is also able to identify noise points that do not belong to any cluster.

K-means: K-means is a centroid-based clustering algorithm. It partitions the data into K clusters by minimizing the sum of squared distances between the data points and their corresponding cluster centroids. K-means assumes that clusters are spherical and have similar sizes.

Number of clusters:

DBSCAN: The number of clusters is determined automatically by the algorithm based on the input parameters (distance threshold and minimum number of points). You don't need to specify the number of clusters beforehand.

K-means: You need to specify the number of clusters (K) beforehand. Choosing the optimal value of K can be challenging and often requires domain knowledge or using techniques like the elbow method or silhouette analysis.

Cluster shapes:

DBSCAN: DBSCAN can find clusters of arbitrary shapes, making it suitable for datasets with complex structures.

K-means: K-means assumes that clusters are spherical and have similar sizes, which may not be suitable for datasets with complex structures or clusters with different shapes and sizes.

Handling noise:

DBSCAN: DBSCAN can identify and separate noise points that do not belong to any cluster.

K-means: K-means is sensitive to noise and outliers, as they can significantly affect the position of the cluster centroids.

Scalability:

DBSCAN: DBSCAN can be slower than K-means for large datasets, especially if the distance matrix needs to be computed. However, there are optimized versions of DBSCAN (e.g., HDBSCAN) that can handle large datasets more efficiently.

K-means: K-means is generally faster and more scalable than DBSCAN, especially when using optimized implementations (e.g., MiniBatchKMeans in scikit-learn).

In summary, DBSCAN is more suitable for datasets with complex structures, arbitrary cluster shapes, and noise, while K-means is faster and more scalable but assumes spherical clusters with similar sizes. The choice between DBSCAN and K-means depends on the characteristics of the data and the specific requirements of the clustering task.




#Both examples generate a sample dataset with 300 data points and 4 clusters using the make_blobs function. The K-means example requires specifying the number of clusters (4 in this case), while the DBSCAN example requires specifying the distance threshold (eps) and the minimum number of points required to form a dense region (min_samples). The resulting clusters are visualized using a scatter plot.
#kmeans
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Create and fit the K-means model
kmeans = KMeans(n_clusters=4)
kmeans.fit(data)

# Get the cluster assignments
labels = kmeans.labels_

# Plot the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.title('K-means Clustering')
plt.show()

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Create and fit the DBSCAN model
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan.fit(data)

# Get the cluster assignments
labels = dbscan.labels_

# Plot the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Calculate the Within-Cluster-Sum-of-Squares (WCSS) for different values of K
wcss = []
max_k = 10
for k in range(1, max_k + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

# Find the optimal K using the Elbow method
optimal_k = 1
for i in range(1, len(wcss) - 1):
    if (wcss[i - 1] - wcss[i]) / (wcss[i] - wcss[i + 1]) > 2:
        optimal_k = i + 1
        break
print("Optimal number of clusters (K):", optimal_k)

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Calculate the Silhouette scores for different values of K
silhouette_scores = []
max_k = 10
for k in range(2, max_k + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(data, labels))

# Find the optimal K using the Silhouette method
optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
print("Optimal number of clusters (K):", optimal_k)

Keep Exploring!!!

June 07, 2023

Data science takes time

  • The real world is not Kaggle data
  • It is very risky to rely more on technology with less understanding of the problem
  • Do not jump into solutions without knowing the domain
  • The intent should not be to solve fast but to solve with clarity
  • Have an open mind about domain vs data vs algo
  • Be candid about opinions
  • If all problems were like Kaggle, we should have seen a ton of production solutions
  • Interview questions may be products people spent years to build; thought process / clarity is more important than quick working solutions

Keep Thinking!!!

June 05, 2023

Cashflow forecasting

Paper - Empowering cash managers to achieve cost savings by improving predictive accuracy

  • Cash management is concerned with optimizing the short-term funding requirements of a company

Time Series Forecasting with Transformer Models and Application to Asset Management

  • Sequence prediction - we often predict the next value of the sequence itself
  • Sequence generation - convert sequences from one domain into sequences from another domain, such as machine translation, text summarization, chatbots
  • Iterated multi-step forecasting
  • Direct multi-step forecasting



Self-attention is designed to capture the dependencies in the sequence, such as the relationship between each word and every other word in a sentence.

For a given query, we compare it with all keys K and get different weights for different values

Self-attention and multi-head attention are permutation-equivariant with respect to its inputs

In our experiment, we consider three different portfolio allocation methods:

  • Single-period MVO portfolio with monthly rebalancing
  • Risk parity portfolio with monthly rebalancing
  • Multi-period MVO portfolio with weekly rebalancing as described by Problem

How to Build a Cash Flow Forecast

  • Determine Your Forecasting Objective(s)
  • Short-term liquidity planning
  • Interest and debt reduction
  • Liquidity risk management
  • Growth planning

Short-period forecasts: Short-term forecasts typically look two to four weeks into the future and contain a daily breakdown of cash payments and receipts.

The most common medium-term forecast is the rolling 13-week cash flow forecast.

Long-period forecasts: Longer-term forecasts typically look 6–12 months into the future and are often the starting point for annual budgeting processes

Mixed-period forecasts: Mixed-period forecasts use a mix of the three periods above and are commonly used for liquidity risk management.

Cash flow forecasting

  • Forecast your income or sales
  • Estimate cash inflows
  • Estimate cash outflows and expenses
  • Review your estimated cash flows against the actual

Preparing a cash flow forecast: Simple steps for vital insight

  • Decide how far out you want to plan for
  • List all your income
  • List all your outgoings

Empirical analysis of daily cash flow time series and its implications for forecasting

Cash management is concerned with the efficient use of a company’s cash and short-term investments such as marketable securities.

From these and other works, we observe that common assumptions on the statistical properties of cash flow time-series include:

  • Normality: cash flows follow a Gaussian distribution with observations symmetrically centered around the mean, and with finite variance.
  • Absence of correlation: the occurrence of past cash flows does not affect the probability of occurrence of the next ones.
  • Stationarity: the probability distribution of cash flows does not change over time and, consequently, its statistical properties such as the mean and variance remain stable.
  • Linearity: cash flows are proportional either to another (external) explanatory variable or to a combination of (external) explanatory variables.

Empowering cash managers to achieve cost savings by improving predictive accuracy

Kurtosis is a measure of the tailedness of a distribution; tailedness is how often outliers occur.
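
A minimal sketch of checking some of these distributional assumptions (tailedness via kurtosis, normality, stationarity) on a synthetic daily cash flow series; the data here is a random placeholder, not real cash flows:

import numpy as np
from scipy.stats import kurtosis, shapiro
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
cash_flows = rng.normal(loc=10000, scale=2500, size=365)   # placeholder daily net cash flows

# tailedness: Fisher kurtosis is ~0 for a Gaussian; large values suggest heavy tails / outliers
print("excess kurtosis:", kurtosis(cash_flows))

# normality: a small p-value means the Gaussian assumption is doubtful
stat, pval = shapiro(cash_flows)
print("Shapiro-Wilk p-value:", pval)

# stationarity: a small ADF p-value supports rejecting a unit root (series looks stationary)
adf_stat, adf_pvalue = adfuller(cash_flows)[:2]
print("ADF statistic:", adf_stat, "p-value:", adf_pvalue)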

Transforming Financial Forecasting with Data Science and Machine Learning at Uber

  • Strategic planning
  • Operations
  • Insights


Modeling strategic investments as an optimization problem

  • Minimize spending
  • Maximize number of drivers or riders
  • Maximize number of first trips or total trips
  • Maximize gross bookings

With each optimization problem, we can also specify constraints, such as:

  • Maximum budget, overall or specific to certain channels (such as marketing versus rider promotion)
  • Minimum number of first trips or trips
  • Minimum month-to-month gross booking growth

Short-term use cases: Short-term use cases for cashflow forecasting include budgeting, forecasting sales, and managing cash flow. It can also be used to identify potential areas of overspending and to plan for future investments

Long-term use cases: Cashflow forecasting can be used to plan for long-term investments, such as capital expenditures and acquisitions. It can also be used to develop strategies for managing cash flow over the long-term, such as budgeting and debt management




  • Receivables forecast
  • Payables forecast

Ref - Link


Keep Exploring!!!

June 04, 2023

Forecast + Optimization

  • Regression to estimate how the 'X' variables drive the target
  • Add a constraint to make it an optimization problem
  • Optimization with a minimum spend for each track

#https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv
#https://github.com/georgeblu1/Data-Projects/blob/master/Budget%20Optimization.ipynb
import pandas as pd

data = pd.read_csv(r'https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv')
data.head()

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#There should be a linear relationship between target and features. We can use a scatterplot to visualize and validate this.
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size = 4, aspect = 1)
#Little or no multicollinearity between features
sns.pairplot(data[['TV','Radio','Newspaper']])
#Homoscedasticity
#OLS assumes all residuals drawn from the population have constant variance
sns.residplot(x = data['TV'], y = data["Sales"])

feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data[["Sales"]]

# instantiate and fit
SkLearn_model = LinearRegression()
SkLearn_result = SkLearn_model.fit(X, y)

# print the coefficients
print(SkLearn_result.intercept_)
print(SkLearn_result.coef_)

# include Newspaper
X = data[['TV', 'Radio', 'Newspaper']]
y = data.Sales

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Instantiate model
lm2 = LinearRegression()

# Fit Model
lm2.fit(X_train, y_train)

# Predict
y_pred = lm2.predict(X_test)

# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(metrics.r2_score(y_test,y_pred))

coefficient = lm2.coef_
coefficient
inter = lm2.intercept_
inter

!pip install pulp
#Our budget should be less than 1000
from pulp import *

prob = LpProblem("Ads Sales Problem", LpMaximize)
#x - tv
#y - radio
#z - newspaper
x = LpVariable("x", 0, 200)
y = LpVariable("y", 0, 500)
z = LpVariable("z", 0, 500)
prob += x + y + z <= 1000
# objective: predicted sales from the regression coefficients (x = TV, y = Radio, z = Newspaper)
prob += 0.0548*x + 0.1022*y + 0.0007878*z + 4.6338
status = prob.solve()
LpStatus[status]
print(prob)
for v in prob.variables():
    print(v.name, "=", v.varValue)

calculation = 0.0548*200 + 0.1022*500 + 0.0007878*300 + 4.6338
calculation

prob = LpProblem("Ads Sales Problem with minimum in each track", LpMaximize)
x = LpVariable("x", 50, 200)
y = LpVariable("y", 50, 500)
z = LpVariable("z", 50, 500)
prob += x + y + z <= 1000
prob += 0.0548*x + 0.1022*y + 0.0007878*z + 4.6338
status = prob.solve()
LpStatus[status]
print(prob)
for v in prob.variables():
    print(v.name, "=", v.varValue)

calculation = 0.0548*200 + 0.1022*500 + 0.0007878*300 + 4.6338
calculation

Keep Exploring!!!

June 03, 2023

Promises and Lies of ChatGPT - understanding how it works



Key Notes

Basics

  • ChatGPT builds on the idea of n-gram models
  • Given n-1 words, guess what the nth word is likely to be
  • The distribution is learnt from sequences
  • People tried this with small values of n
  • Sample from the distribution of words
  • More likely words come out more often

With large data

  • Any N: given words, predict the next word
  • Frequency, conditional probability (a toy bigram sketch follows this list)
  • Generate words once the first word is given
  • More likely words + patterns
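
A toy sketch of the n-gram idea for n=2 (a bigram model): count next-word frequencies in a tiny corpus, treat them as conditional probabilities, and sample more likely words more often. The corpus here is a placeholder.

import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()   # placeholder corpus

# count how often each word follows each word: P(next | current) ~ counts
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word(current):
    candidates = counts[current]
    if not candidates:                    # dead end: word never seen with a successor
        return None
    # sample in proportion to frequency, so likelier words come out more often
    return random.choices(list(candidates), weights=candidates.values())[0]

word, generated = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    if word is None:
        break
    generated.append(word)
print(" ".join(generated))
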
Large sentences/meanings

  • Abstract sequences
  • Different answers every time
  • Every generated sequence may come from a different distribution, but a similar context is possible
  • ChatGPT = something well written
Why does it work?

  • We believe in what seems realistic
  • It connects to human experience
  • Fact is different from possibility
  • Plausible, probable, or reasonable answers

Similarity to humans

  • Humans are not always factual
  • It can be perception based
  • People can be penalized in civil society
  • Machines can suggest without knowing the consequences
  • Automation may still have a bias
  • Being close to the truth, we are impressed
Predictive modeling - train / predict

Conditional modeling

  • Can create bias in information
  • Discriminative learning learns a conditional model
  • A classifier that finds dogs vs a model that generates dogs - the two are different
Generative distribution - joint distribution

  • The prior distribution of reasonable images
  • Teacher = generative model
  • Learning a generative model is costlier

The human brain works by on-demand stitching

  • ChatGPT does something similar
  • All learning is compression
  • All learning is lossy compression
  • JPEG is lossy - approximating
  • Representation of compressed details
  • A significant footprint of data is available to train systems
Good writing for all

  • Picasso-style pics
  • Shakespeare-style writing
  • Racial profiling not required
  • Character and form are not connected
  • Generalizations help for survival
  • AI as creator / editor

Badly written with original thought is human writing

  • Harder to write in original, creative ways
  • Original vs derivative thinking
  • Bad handwriting vs good content
  • Bad packaging vs good product
  • We have one scale: good or bad
  • LLMs learn from human language
  • The most likely completion, given what society has written
  • Social engineering on data

Is this a good representation of all ethnicities?


How is it fine-tuned?

  • RLHF
  • Show results
  • Ask someone what they like
  • Thumbs up / down to change the distribution
  • Re-learning it
  • Collectively offensive content on the web vs making a decent prompt engine

  • Align to human values
  • Concentration camps, genocide - human values
  • Retrain for cultural norms
  • False positives
  • Different narratives, different takers

  • Prompts can make the LLM override its conditioned network
  • Adversarial learning prompts
  • How to put in knobs so it behaves well

AI systems to work with

  • Basically, put people to think about the problem
  • With enough eyeballs, every downside can be a shallow bug
  • We need more eyeballs to decide
  • ChatGPT will not generate a grammatically incorrect sentence
  • Core problems of intelligent behavior - planning, diagnosis, reasoning


Keep Exploring!!!

June 02, 2023

AI - Image Generator - Approach

Under the hood, training on tons of images generates distributions based on:

  • Context generator = backgrounds
  • Object generator = cars / bikes
  • Object + context = car on a beach
  • Finetune to similar pictures
  • Sharpen images / super resolution
  • Fix shapes / corners

Keep Exploring!!!