"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 28, 2023

Pytube error - RegexMatchError: get_throttling_function_name: could not find match for multiple

This error points to the file C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\pytube\cipher.py.


Fix in link


A one-line fix, but a lot of misdirection / useless pointers along the way.

Fixing the issue in the least time / debugging in the right place is what matters.

Keep Exploring!!!

Better communication and connection

  • Move slow, talk slow
  • Make the other person feel safe
  • A slow, low tone works better than a loud note
  • Comfort enables more thinking
  • Panic restricts access to memory
  • A mindset of openness and courage, not manipulation
  • More questions than opinions
  • Eyes open for everything invisible

Keep Thinking!!!

Designing Async / Parallelized Tasks

Earlier roles / Feeds Processing

We had some critical tasks for getting supplier feeds/data:

  • File Copy
  • File load
  • Run Validations
  • Load Data

DB supporting it

  • Schema
  • Jobs / Options / Run status / Retry

Technically from a scale point of view

  • File watcher (a sketch follows this list)
  • File lock
  • Process jobs
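
A minimal sketch of this pattern, assuming a hypothetical incoming-feeds folder and a simple lock file so two workers never process the same feed (standard library only):

import os
import time

FEED_DIR = "incoming_feeds"   # hypothetical drop folder for supplier files
POLL_SECONDS = 30

def try_lock(path):
    # create <file>.lock atomically; fails if another worker already holds it
    try:
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def process_job(path):
    print("processing", path)   # file load / validations / data load would go here

def watch():
    while True:
        for name in os.listdir(FEED_DIR):
            path = os.path.join(FEED_DIR, name)
            if name.endswith(".lock") or not os.path.isfile(path):
                continue
            if try_lock(path):
                try:
                    process_job(path)
                finally:
                    os.remove(path + ".lock")
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    watch()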

Validations

  • Bunch of procedures

Key design tweaks

  • Parallelize operations
  • Data copy as objects / temp tables
  • Parallel file copy (a sketch follows this list)
  • Support multiple threads
  • Avoid data blocking/updates
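
A minimal sketch of the parallel file copy idea, assuming hypothetical source and staging folders; copying is I/O-bound, so a thread pool is enough here:

import os
import shutil
from concurrent.futures import ThreadPoolExecutor

SRC_DIR = "supplier_feeds"    # hypothetical source folder
DST_DIR = "staging"           # hypothetical staging folder

def copy_one(name):
    shutil.copy2(os.path.join(SRC_DIR, name), os.path.join(DST_DIR, name))
    return name

def copy_all(max_workers=8):
    os.makedirs(DST_DIR, exist_ok=True)
    files = [f for f in os.listdir(SRC_DIR) if os.path.isfile(os.path.join(SRC_DIR, f))]
    # copy files concurrently; each thread handles one file at a time
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name in pool.map(copy_one, files):
            print("copied", name)

if __name__ == "__main__":
    copy_all()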

ML Context

  • For parallel model creation
  • Configuration
  • Submit job
  • 5 time-series category datasets / global models in each category
  • 10 jobs, 5 category dataset models

10 different models

  • Prepare data Job
  • Fetch initial data
  • Process missing variables
  • Data imputation
  • Save Results

Execute Job

  • Read prepared data
  • Fetch algo to run
  • Train algo
  • Log training accuracy
  • Save model

Predict Job

  • Load saved model
  • Run predictions
  • Save in DB

Design Ideas

  • Atomic functions
  • Job monitor / independent execution units
  • Horizontal scaling in Kubernetes App
  • Common DB and multiple parallel execution functions (a sketch follows this list)
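
A minimal sketch of the atomic-function idea for the ML context above, with hypothetical prepare/execute/predict steps chained per category and run as independent units in a process pool (a real setup would persist each step's status to the common DB for the job monitor):

from concurrent.futures import ProcessPoolExecutor

CATEGORIES = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]   # hypothetical dataset categories

def prepare_data(category):
    # fetch initial data, process missing variables, impute, save results
    return f"{category}_prepared"

def execute_job(prepared):
    # read prepared data, fetch algo, train, record accuracy, save model
    return f"{prepared}_model"

def predict_job(model):
    # load saved model, run predictions, save to DB
    return f"{model}_predictions"

def run_pipeline(category):
    # atomic steps chained into one independent execution unit per category
    return predict_job(execute_job(prepare_data(category)))

if __name__ == "__main__":
    # categories run in parallel as separate processes
    with ProcessPoolExecutor(max_workers=5) as pool:
        for result in pool.map(run_pipeline, CATEGORIES):
            print("done:", result)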

Python Options

FastAPI - Uvicorn also has an option to start and run several worker processes. Link

uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4
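
A minimal sketch of the main:app module the command above expects (the /health endpoint is a hypothetical example; each of the 4 worker processes gets its own copy of the app):

from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # hypothetical endpoint; heavy work would go into background jobs, not here
    return {"status": "ok"}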

Flask API - Link

if __name__ == '__main__':
    app.run(threaded=True)

or from the CLI:

flask run --with-threads

More Reads

System Design — Design a distributed job scheduler (Keep It Simple Stupid Interview series)

Orchestrating a Background Job Workflow in Celery for Python

System Design: Designing a distributed Job Scheduler | Many interesting concepts to learn

Examples


Keep Exploring!!!

#!pip install nest_asyncio
import asyncio
import nest_asyncio

nest_asyncio.apply()

async def loaddata():
    print("loadingdata")
    await asyncio.sleep(5)
    print("data loaded")

async def processdata():
    print("process data")
    await asyncio.sleep(5)
    print("data processed")

async def processfeeds():
    task1 = asyncio.create_task(loaddata())
    print("Data load waiting")
    await task1
    print('Data load done')
    await asyncio.sleep(1)
    task2 = asyncio.create_task(processdata())
    print("Data process waiting")
    await task2
    await asyncio.sleep(5)
    print('Data process done')

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(processfeeds())




June 27, 2023

Data science = Data + Domain + AI + Commonsense

Many times I read up on the basics again and again. Over the years, I started with Windows 98 testing, C/C++ adapters, Nestle production support, application support, supply chain QA / performance / OLTP development, SQL developer, BI developer, setting up teams, warranty, refurbishment, API / supply chain, website A/B testing, on-call support, retail product team setup/forecasting/scaling, and then a long 2-year learning curve / paid lectures / back-to-basics mode. More learning started after that. Getting a break needs a lot of freelance / consulting / training / applied learning. The past 3 years have been very focused on learning, projects, and production deployments.

Now when I teach the flow/work, there are different areas overall to understand across products/domain/use cases:

  1. Stats, probability, A/B tests, LR
  2. ML world - decision trees, SVM, logistic regression, random forests
  3. Some variations of these for anomaly detection, decision tree regressors, SVM regressors, loss functions, conditional random fields
  4. The deep learning side of CNN, RNN, LSTM, Transformers
  5. NLP side of tokens, embeddings, different architectures up to the latest state of the art: BERT, ChatGPT, zero-shot, few-shot approaches
  6. Forecast track with different models, both regression and time-series approaches
  7. Recommendation track from basics to advanced hybrid models: user-user, item-item, hybrid, seasonal, and segment based
  8. Vision side of custom object detection, classification, transfer learning, segmentation, applied use cases
  9. World of GenAI for text/vision
  10. Apart from this, the production/deployment architecture

Sometimes I wonder how many things we can teach someone to switch to AI/ML. Always leverage your strengths in domain/data knowledge. The scope of it is vast and increasing day by day. It is hard to know everything; the end goal is to add value to the business / use it to fix current challenges. Balance both learning and implementation. It would be a long journey to just learn forever.

Always blend your ideas in DATA + DOMAIN + AI + Business Value to find the right use cases.

Keep Exploring!!!

10 Reasons why Gen AI will Work vs Fail

Let's list 10 reasons why GenAI will succeed

  1. Saves time / provides ideas
  2. Creates powerful ideas with richer inspiration, adding emotion and logic with words/statements
  3. Can provide draft versions, cutting copywriter / content writer effort
  4. Summarizes a given passage with the critical points
  5. Acts like an assistant / chatbot
  6. Visual inspiration with images
  7. Generates different styles/notes/marketing/promo content
  8. It can create content based on prompts
  9. Human-like responses/content with proper grammar/flow
  10. Shares opinions with reasons

Let's list 10 reasons why GenAI will fail

  1. Cannot scale in every field; cannot be considered for all domains
  2. Cannot be factually right always
  3. Draft content vs final content - someone has to fill this gap
  4. Content generated needs human validation
  5. Without relevant knowledge, we may not be able to spot issues
  6. Needs iterations to prepare high-quality output
  7. Capabilities vs shortcomings have to be balanced per use case
  8. Vision has a long way to go
  9. Empathy is more a trained distribution of words; ideally, it's all human-fed content
  10. True reasoning / wide subject knowledge is really limited

Keep Thinking!!!


June 26, 2023

Background removal with Azure API

In the Azure Portal


Under Cognitive Services, you have the options below.

Fetch the endpoint and keys after deployment.

Example code in link

#pip install azure-ai-vision
#https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/how-to/background-removal?tabs=python
import azure.ai.vision as sdk

service_options = sdk.VisionServiceOptions("https://NAMEYOURSERVICE.cognitiveservices.azure.com/",
                                           "PROVIDEKEYS")
analysis_options = sdk.ImageAnalysisOptions()
analysis_options.segmentation_mode = sdk.ImageSegmentationMode.BACKGROUND_REMOVAL
vision_source = sdk.VisionSource(
    url="https://learn.microsoft.com/azure/cognitive-services/computer-vision/media/quickstarts/presentation.png")
image_analyzer = sdk.ImageAnalyzer(service_options, vision_source, analysis_options)
result = image_analyzer.analyze()

if result.reason == sdk.ImageAnalysisResultReason.ANALYZED:
    image_buffer = result.segmentation_result.image_buffer
    print(" Segmentation result:")
    print(" Output image buffer size (bytes) = {}".format(len(image_buffer)))
    print(" Output image height = {}".format(result.segmentation_result.image_height))
    print(" Output image width = {}".format(result.segmentation_result.image_width))
    output_image_file = "output.png"
    with open(output_image_file, 'wb') as binary_file:
        binary_file.write(image_buffer)
    print(" File {} written to disk".format(output_image_file))
else:
    error_details = sdk.ImageAnalysisErrorDetails.from_result(result)
    print(" Analysis failed.")
    print(" Error reason: {}".format(error_details.reason))
    print(" Error code: {}".format(error_details.error_code))
    print(" Error message: {}".format(error_details.message))
    print(" Did you set the computer vision endpoint and key?")

import cv2
from google.colab.patches import cv2_imshow

img = cv2.imread('output.png', cv2.IMREAD_UNCHANGED)
cv2_imshow(img)



Keep Exploring!!!

June 24, 2023

My Langchain Notes - Day 1

  • LLMs are good at conditional generation
  • P(next token | prompt)
  • LLMs do not store state
  • Token size is key for better answers
  • Langchain - build apps with LLMs

The different types of prompts: zero-shot with limited prompts, reasoning that preserves state, simplified step-by-step prompts.

Prompt Template

A prompt template can contain:

  • instructions to the language model,
  • a set of few shot examples to help the language model generate a better response,
  • a question to the language model.

FewShotPromptTemplate

A few shot prompt template can be constructed from either a set of examples, or from an Example Selector object.

Few shot examples for chat models






Ref - Link


!pip -q install openai langchain
import os
os.environ['OPENAI_API_KEY'] = 'your_key'

from langchain.llms import OpenAI

llm = OpenAI(model_name='text-davinci-003',
             temperature=0.9,
             max_tokens=256)
text = "Why did the recession come after covid"
print(llm(text))

"""## Prompt Templates"""
from langchain import PromptTemplate

restaurant_template = """
I want you to act as a naming consultant for new restaurants.
Return a list of restaurant names. Each name should be short, catchy and easy to remember. It should relate to the type of restaurant you are naming.
What are some good names for a restaurant that is {restaurant_desription}?
"""

prompt = PromptTemplate(
    input_variables=["restaurant_desription"],
    template=restaurant_template,
)

# An example prompt with one input variable
prompt_template = PromptTemplate(input_variables=["restaurant_desription"], template=restaurant_template)

description = "Fresh South Indian Food with Idly, Sambar"
description_02 = "Hyderabad Biryani and Tandoor items"
description_03 = "Jain food and vegetarian menu"

## to see what the prompt will be like
prompt_template.format(restaurant_desription=description)

## querying the model with the prompt template
from langchain.chains import LLMChain

chain = LLMChain(llm=llm, prompt=prompt_template)
# Run the chain only specifying the input variable.
print(chain.run(description_03))

"""## with Few Shot Learning"""
from langchain import PromptTemplate, FewShotPromptTemplate

# First, create the list of few shot examples.
examples = [
    {"word": "happy", "antonym": "sad"},
    {"word": "tall", "antonym": "short"},
]

# Next, we specify the template to format the examples we have provided.
# We use the `PromptTemplate` class for this.
example_formatter_template = """
Word: {word}
Antonym: {antonym}\n
"""
example_prompt = PromptTemplate(
    input_variables=["word", "antonym"],
    template=example_formatter_template,
)

# Finally, we create the `FewShotPromptTemplate` object.
few_shot_prompt = FewShotPromptTemplate(
    # These are the examples we want to insert into the prompt.
    examples=examples,
    # This is how we want to format the examples when we insert them into the prompt.
    example_prompt=example_prompt,
    # The prefix is some text that goes before the examples in the prompt.
    # Usually, this consists of instructions.
    prefix="Give the antonym of every input",
    # The suffix is some text that goes after the examples in the prompt.
    # Usually, this is where the user input will go
    suffix="Word: {input}\nAntonym:",
    # The input variables are the variables that the overall prompt expects.
    input_variables=["input"],
    # The example_separator is the string we will use to join the prefix, examples, and suffix together with.
    example_separator="\n\n",
)

# We can now generate a prompt using the `format` method.
print(few_shot_prompt.format(input="fast"))

from langchain.chains import LLMChain
chain = LLMChain(llm=llm, prompt=few_shot_prompt)
# Run the chain only specifying the input variable.
print(chain.run("fast"))

Keep Exploring!!!


June 22, 2023

Scaling Applications

We have AWS Lambda and GCP Cloud Run serverless function options. These help autoscale effectively.

For custom apps / REST / Flask / FastAPI endpoints, how do we autoscale?

  • Horizontal Pod Autoscaler (HPA): adjusts the number of replicas of an application.
  • HPA is a form of autoscaling that increases or decreases the number of pods.

Ref - Link

HorizontalPodAutoscaler Walkthrough

Key Notes

The kubectl autoscale subcommand, part of kubectl, helps you do this.

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10

# You can use "hpa" or "horizontalpodautoscaler"; either name works OK.

kubectl get hpa

kubernetes-fastapi

How to Test Autoscaling in Kubernetes
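
A minimal sketch of generating load against an autoscaled endpoint (the URL is a hypothetical placeholder; the Kubernetes walkthrough uses a busybox wget loop, this is just a Python equivalent so you can watch kubectl get hpa react):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://<service-ip>/"      # hypothetical service endpoint behind the HPA

def hit(_):
    try:
        return requests.get(URL, timeout=5).status_code
    except requests.RequestException:
        return None

if __name__ == "__main__":
    # hammer the endpoint with concurrent requests and watch the HPA scale up
    with ThreadPoolExecutor(max_workers=50) as pool:
        while True:
            list(pool.map(hit, range(200)))
            time.sleep(1)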


Keep Exploring!!!

Text Detection Models - Vision - GCP - Azure - Tesseract

 GCP - Link


Azure



Tesseract Results

!sudo apt install tesseract-ocr
!pip install pytesseract

import pytesseract
import shutil
import os
import random

try:
    from PIL import Image
except ImportError:
    import Image

from google.colab import files
uploaded = files.upload()

extractedInformation = pytesseract.image_to_string(Image.open('F1.jpg'))
print(extractedInformation)





Keep Exploring!!!

June 20, 2023

Yolo V8 Examples vs Azure vs Google Vision vs Image2Text vs LogMeal

Azure Vision


GCP Vision

Ref - Link



Yolo V8


Image to Text Model

!pip install ultralytics

import ultralytics
ultralytics.checks()

from google.colab import files
files.upload()

# Run inference on an image with YOLOv8n
!yolo predict model=yolov8n.pt source='5.jpg'

cd runs/detect/predict
ls

import cv2
from google.colab.patches import cv2_imshow

image = cv2.imread('5.jpg')
cv2_imshow(image)



Log Meal - Link

Keep Exploring!!!

June 19, 2023

Virtual Try on - TryOnDiffusion: A Tale of Two UNets

  • Transfer clothes between source, target
  • Warping, blending
  • Occlusion is challenging
  • Diffusion models to handle issues

Warping

Warping involves transforming an image's geometry, usually to correct distortions, align images, or change the perspective. 

There are different types of warping, such as:

  • Affine warping: This type of warping preserves parallel lines and involves a linear transformation followed by a translation. It can represent transformations like rotation, scaling, and shearing.
  • Perspective (projective) warping: This type of warping can represent a more general transformation that includes perspective changes. It can correct distortions caused by the camera's viewpoint or create a "bird's-eye view" of a scene. Perspective warping requires four pairs of corresponding points in the input and output images to calculate the transformation matrix.

Warping is widely used in various applications, such as image stitching (for creating panoramas), rectifying images for OCR (Optical Character Recognition), and correcting lens distortions in photographs.

In the context of computer vision libraries like OpenCV, warping functions are available to apply these transformations to images, given the appropriate transformation matrix and input/output coordinates.

OpenCV Warping functions

  • cv2.warpAffine
  • cv2.warpPerspective
  • cv2.remap

OpenCV Blending functions (a short warp + blend sketch follows this list)

  • cv2.addWeighted
  • cv2.add
  • cv2.subtract
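
A minimal sketch of how a couple of these calls fit together, assuming OpenCV is installed as cv2 and two local images of the same size, person.jpg and garment.jpg (hypothetical file names):

import cv2

person = cv2.imread("person.jpg")        # hypothetical input images of equal size
garment = cv2.imread("garment.jpg")

# affine warp: rotate the garment 10 degrees around its centre
h, w = garment.shape[:2]
M = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)
warped = cv2.warpAffine(garment, M, (w, h))

# simple alpha blend of the warped garment over the person image
blended = cv2.addWeighted(person, 0.6, warped, 0.4, 0)
cv2.imwrite("blended.png", blended)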

  • Note #1 - All segmentation done on low resolution
  • Note #2 - Super Resolution is added to cover up low res and give high res outputs
  • Note #3 - Running all tasks on high res is even more challenging

Paper - link

Walmart Try on


Things to note

  • Posture detection
  • More minimal clothes and a superimposition approach
  • Full body posture + cloth overlay on it

Keep Exploring!!!

June 16, 2023

Deep Network of Humans

 


Keep Exploring!!!

Concept: Kullback-Leibler (KL) divergence, chi-square test

Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution: KL(P || Q) = Σ P(i) * log(P(i) / Q(i)).

The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables in a sample.


#Kullback-Leibler (KL) divergence, also known as relative entropy,
#is a measure of how one probability distribution is different from a second, reference probability distribution.
#It is used in various fields such as information theory, machine learning, and statistics.
#The logarithm function in the KL divergence formula is not symmetric with respect to its arguments.
#Specifically, log(P(i) / Q(i)) is not equal to log(Q(i) / P(i)). This asymmetry contributes to the asymmetry of the KL divergence.
import numpy as np

def kl_divergence(p, q):
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

# Example probability distributions
p = np.array([0.4, 0.6])
q = np.array([0.3, 0.7])

# Calculate KL divergence (note: KL(p || q) != KL(q || p))
kl_div = kl_divergence(p, q)
print("KL divergence:", kl_div)
kl_div = kl_divergence(q, p)
print("KL divergence:", kl_div)

import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q):
    return entropy(p, q)

# Example probability distributions
p = np.array([0.4, 0.6])
q = np.array([0.3, 0.7])

# Calculate KL divergence
kl_div = kl_divergence(p, q)
print("KL divergence:", kl_div)
kl_div = kl_divergence(q, p)
print("KL divergence:", kl_div)

#The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables in a sample.
#It is based on comparing the observed frequencies in each category with the frequencies that would be expected under the assumption of independence between the variables.
#Here's an example: Suppose we have data on the hair color and eye color of a group of people, and we want to test if there is an association between these two variables.
#              Brown Eyes  Blue Eyes  Green Eyes  Total
#Black Hair            50         20          30    100
#Blonde Hair           30         40          30    100
#Total                 80         60          60    200
#We can perform a chi-square test using Python and the scipy.stats library:
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies
observed = np.array([
    [50, 20, 30],
    [30, 40, 30]
])

# Perform chi-square test
chi2, p_value, _, _ = chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("P-value:", p_value)

#In this example, the chi-square statistic is approximately 11.67, and the p-value is approximately 0.003.
#If we choose a significance level of 0.05, we can reject the null hypothesis that hair color and eye color are independent, as the p-value is less than 0.05.
#This suggests that there is a significant association between hair color and eye color in this sample.
#Note that the chi-square test has some limitations:
#It requires a sufficiently large sample size to be valid, as it is based on the approximation of the chi-square distribution.
#It assumes that the observations are independent and identically distributed.
#It is sensitive to the choice of categories and may give different results if the categories are combined or split in different ways.

Keep Exploring!!!

June 15, 2023

GCP Vertex Text Examples

#!pip install "shapely<2.0.0"
#!pip install google-cloud-aiplatform
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="cohesive-gadget-166410", location="us-central1")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.8,
    "top_k": 40
}
text = "Create one liner add for below product. Product meta data is - 2 IN 1 USB CARD READER - Whether Trans Flash or SD, enjoy the power to read and write on two cards simultaneously, and avoid constantly plugging in and out. HIGH-SPEED TRANSMISSION SPEED - USB 3.0 gives you access to Super Speed data transfer rates of up to 5Gbps with all enabled cards. CARD COMPATIBILITY - Support SD/SDHC/SDXCMMC/MMC…¡/RS MMC/MMC 4.0/ Ultra …¡ SD/ Extreme SD/ Extreme … SD /T-Flash / MICRO SDHC/MICRO SDXC 35/5000,Through the adapter can also read MINI SD / MMC micro Card. DESIGN WITH OTG - Truly plug & play of this OTG card reader. Easy installation with no device driver necessary, even the elderly don't have to worry about how to use it.Ideal for transferring high-resolution images and video recordings.."
model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(
    text,
    **parameters
)
print(f"Response from Model: {response.text}")
#Response from Model: **2-in-1 USB 3.0 Card Reader: Read and write on two cards simultaneously at blazing fast speeds up to 5Gbps.**

 





Keep Exploring!!!

June 14, 2023

Future of Learning

 


Happy Learning!!!

June 13, 2023

My top picklist from article on Palantir principles

Expertise = Experience =  Domain + Data + Tech (Blend of all)

  • Do real customer work long enough to have full empathy and inspire.
  • Don’t just empathize with the user; be the user.
  • Build prototype solutions for unique problems.
  • Build features that magnify value over time.
  • Consider using working products to iterate with users instead of designs and concepts.

Easier said than done, but I can echo this from my past journey/projects: solve for the things the customer needs vs. "I know a tech, what all can I do with it".

Ref - Link

Keep Exploring!!!

Thinking Questions

Reduction in Cart Abandonment

Key Notes

  • RCA
  • External Factors
  • Competitor Launches / Products
  • Affected Customer Segment
  • Macro Economic changes
  • Seasonality
  • User Journey in App
  • Data Captured - Gender / Age
  • Campaign related impacts
  • No correlation to campaigns
  • Any product design changes
  • Catalog and inventory analysis of the product
  • Which product category etc. has the highest dip
  • Any partnership change or any merchant backing out
  • Geographical distribution of the influx of users on the site; external events like a flood or internet blackout
  • Compare the pricing of the product with competitors
  • Ratings of the products getting moved out of the cart

Ref  - Link

Solutions Architect

Key Notes

  • Challenges, Business Goals, Tech Goals
  • Feature / Product Demo
  • Product Integration aspects
  • Customer Success Stories
  • Next Steps

Ref - Link  

MLOps

Key Projects

  • Loan scoring, Forecasting
  • MLOps Pipeline - Data Collection, Ingestion
  • Data cleaning + Feature Engineering
  • Different models for different products
  • Automate model selection / training steps
  • Model Validation / Testing phase

Ref - Link

System Design for Recommendations and Search

Key Notes

  • Batched - Store in DB, Precomputed, Refreshed, Key-value pairs
  • Real-time - Time-sensitive content

Key Concepts

  • Embedding creation of interests
  • Features mapping
  • Ranking / Retrieval
  • Behavior logs - candidate sets - recommendations
  • Top N neighbors, KNN, indexes (a retrieval sketch follows this list)
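
A minimal sketch of the top-N retrieval step, assuming item and user embeddings are already computed (random vectors stand in for them here); real systems swap brute-force KNN for an ANN index:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
item_embeddings = rng.normal(size=(1000, 64))   # stand-in for learned item embeddings
user_embedding = rng.normal(size=(1, 64))       # stand-in for a user/interest embedding

# brute-force cosine KNN; production systems use ANN indexes (HNSW, IVF, etc.)
knn = NearestNeighbors(n_neighbors=10, metric="cosine")
knn.fit(item_embeddings)
distances, candidate_ids = knn.kneighbors(user_embedding)
print("candidate item ids:", candidate_ids[0])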

Ref - link 

Keep Exploring!!!

Model Deployment Architecture

 My implementation experience and lessons :)

Product Implementation (2012-2014)

  • Integrated in product
  • Jobs scheduled for midnight
  • Workflow to monitor variations
  • Forecast updated every day for store
  • Everything custom-coded formula embedded
  • Weighted moving average 
  • Step up / Step down moving average approach

Batched State of Art (2021)

Recommendations AWS

  • ETL / Glue jobs to get features
  • Full pull/delta pull scripts
  • Feature engineering scripts
  • Custom segmentation scripts
  • Batch jobs to run models
  • Large-scale recommendations generation
  • Infra kubeflow setup 
  • Leverage existing Kubeflow monitoring setup

Forecasting State of Art (2021)

Kubeflow + AWS

  • ETL / Glue jobs to get features
  • Full pull/delta pull scripts
  • Feature engineering scripts
  • Custom segmentation scripts
  • Batch jobs to run models
  • Kubeflow pipelines for the forecast
  • Results persist in Redshift DB
  • Infra kubeflow setup 
  • Leverage existing kubeflow monitoring setup

Realtime State of Art (2022)

Real-time streaming / Vision Solution

  • AWS Lambda-based approach
  • Vision + Docker + AWS Lambda
  • Request monitoring / logging

Keep Exploring!!!

Vector Databases Reads

Milvus Notes - Index/consistency / availability options

#1. Index type - use case

  • IVF_FLAT - High-speed query
  • IVF_PQ - Very high-speed query
  • HNSW - High-speed query

Inverted File (IVF): An IVF index divides the vector space into several clusters and holds an inverted file for each cluster, recording which vectors belong to the cluster.

IVF Flat: This is a combination of IVF and flat index. It uses the IVF index to partition the data into clusters and then uses the flat index (brute-force search) within each cluster.

Hierarchical Navigable Small World (HNSW): HNSW builds a multi-layer navigation graph to represent the vector space.

#2. Consistency levels - Strong, Bounded, Session or Eventually

  • Strong - Most strict
  • Bounded staleness - allows data inconsistency during a certain period of time.
  • Session - a client reads its own writes within a session
  • Eventually - weakest level among the four.

#3. HA - In-memory replicas help Milvus recover faster if a query node crashes.

#4. Vector search & Hybrid Search params offset, limit

offset - Number of results to skip in the returned set

limit - Number of the most similar results to return

How indexing and querying works

  • Trees – ANNOY -  Annoy (Approximate Nearest Neighbors Oh Yeah)
  • Proximity graphs - HNSW Hierarchical Navigable Small World (HNSW) Graphs
  • Clustering - FAISS (an IVF sketch follows this list)
  • Hashing - LSH - Locality-Sensitive Hashing (LSH)
  • Vector compression - PQ or ScaNN (Scalable Nearest Neighbors). Product Quantization (PQ): a PQ index compresses vectors into compact codes and is beneficial for large-scale, high-dimensional data.
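
A minimal sketch of the IVF idea using FAISS (random vectors as placeholder data; assumes faiss-cpu is installed): the index clusters the vectors, then searches only the closest nprobe clusters.

import numpy as np
import faiss

d, nb, nlist = 64, 10000, 100                     # dim, database size, number of clusters
xb = np.random.random((nb, d)).astype("float32")  # placeholder vectors

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                   # learn the cluster centroids
index.add(xb)

index.nprobe = 10                                 # search 10 of the 100 clusters
distances, ids = index.search(xb[:1], 5)          # top-5 neighbours of the first vector
print(ids)
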
More Reads

Keep Exploring!!!

June 12, 2023

Vision Catalog Creation

Every problem statement needs to have

  • Selected products
  • Custom backgrounds
  • Present / Segmentation/options
  • Variations with easy to use approach

1. Define products/layouts


2. Custom layout for each type of product


3. Once the product is positioned, apply a custom background

4. Generate a photoshoot



Keep Exploring!!!

June 11, 2023

How to train your own LLM - Copilot type LLMs

Notes

  • Scenarios to custom train
  • Privacy, IP, Customization
  • Smaller and Efficient Models
  • Restrict Information shared with LLM models

  • Code completion model by Replit

Stack

  • Databricks pipeline
  • Hugging Face for tokenizers / inference tools for code
  • MosaicML - GPU and model training

  • Training LLM Architecture

  • Extensive code base of Git / Stackoverflow

  • Data preprocessing 
  • All preprocessing in distributed fashion
  • Lot of work on notebooks
  • Removed auto generated code from training
  • Anonymize data remove PII info
  • Remove code that does not compile
  • Remove Python2 code and keep it for one version
  • Maximum line length set

  • Custom vocabulary creation
  • Custom tokenizer for domain-specific datasets (a sketch follows this list)
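
A minimal sketch of training a custom code tokenizer with the Hugging Face tokenizers library; the corpus file paths, vocab size, and special tokens are placeholder choices, not what Replit used:

import os
from tokenizers import ByteLevelBPETokenizer

# hypothetical text files containing the preprocessed code corpus
corpus_files = ["code_corpus_part1.txt", "code_corpus_part2.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=32000,                      # placeholder vocabulary size
    min_frequency=2,
    special_tokens=["<pad>", "<eos>"],     # placeholder special tokens
)

os.makedirs("custom_code_tokenizer", exist_ok=True)
tokenizer.save_model("custom_code_tokenizer")

print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)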

MosaicML for training




Future

  • Optimal / Smaller LLM
  • Customized LLMs
  • LLM with reasoning

Keep Exploring!!!

June 10, 2023

DBScan vs KMeans Summary

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-means are two popular clustering algorithms used for unsupervised learning tasks. They have different approaches to clustering and are suitable for different types of data. Here's a comparison of the two algorithms:

Approach:

DBSCAN: DBSCAN is a density-based clustering algorithm. It groups together points that are closely packed together based on a distance measure (e.g., Euclidean distance) and a density threshold. It can find clusters of arbitrary shapes and is also able to identify noise points that do not belong to any cluster.

K-means: K-means is a centroid-based clustering algorithm. It partitions the data into K clusters by minimizing the sum of squared distances between the data points and their corresponding cluster centroids. K-means assumes that clusters are spherical and have similar sizes.

Number of clusters:

DBSCAN: The number of clusters is determined automatically by the algorithm based on the input parameters (distance threshold and minimum number of points). You don't need to specify the number of clusters beforehand.

K-means: You need to specify the number of clusters (K) beforehand. Choosing the optimal value of K can be challenging and often requires domain knowledge or using techniques like the elbow method or silhouette analysis.

Cluster shapes:

DBSCAN: DBSCAN can find clusters of arbitrary shapes, making it suitable for datasets with complex structures.

K-means: K-means assumes that clusters are spherical and have similar sizes, which may not be suitable for datasets with complex structures or clusters with different shapes and sizes.

Handling noise:

DBSCAN: DBSCAN can identify and separate noise points that do not belong to any cluster.

K-means: K-means is sensitive to noise and outliers, as they can significantly affect the position of the cluster centroids.

Scalability:

DBSCAN: DBSCAN can be slower than K-means for large datasets, especially if the distance matrix needs to be computed. However, there are optimized versions of DBSCAN (e.g., HDBSCAN) that can handle large datasets more efficiently.

K-means: K-means is generally faster and more scalable than DBSCAN, especially when using optimized implementations (e.g., MiniBatchKMeans in scikit-learn).

In summary, DBSCAN is more suitable for datasets with complex structures, arbitrary cluster shapes, and noise, while K-means is faster and more scalable but assumes spherical clusters with similar sizes. The choice between DBSCAN and K-means depends on the characteristics of the data and the specific requirements of the clustering task.




#Both examples generate a sample dataset with 300 data points and 4 clusters using the make_blobs function. The K-means example requires specifying the number of clusters (4 in this case), while the DBSCAN example requires specifying the distance threshold (eps) and the minimum number of points required to form a dense region (min_samples). The resulting clusters are visualized using a scatter plot.
#kmeans
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Create and fit the K-means model
kmeans = KMeans(n_clusters=4)
kmeans.fit(data)

# Get the cluster assignments
labels = kmeans.labels_

# Plot the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.title('K-means Clustering')
plt.show()

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Create and fit the DBSCAN model
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan.fit(data)

# Get the cluster assignments
labels = dbscan.labels_

# Plot the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.show()

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Calculate the Within-Cluster-Sum-of-Squares (WCSS) for different values of K
wcss = []
max_k = 10
for k in range(1, max_k + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

# Find the optimal K using the Elbow method
optimal_k = 1
for i in range(1, len(wcss) - 1):
    if (wcss[i - 1] - wcss[i]) / (wcss[i] - wcss[i + 1]) > 2:
        optimal_k = i + 1
        break
print("Optimal number of clusters (K):", optimal_k)

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate sample data
data, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Calculate the Silhouette scores for different values of K
silhouette_scores = []
max_k = 10
for k in range(2, max_k + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(data, labels))

# Find the optimal K using the Silhouette method
optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
print("Optimal number of clusters (K):", optimal_k)

Keep Exploring!!!

June 07, 2023

Data science takes time

  • The real world is not Kaggle data
  • It is very risky to rely more on technology with less understanding of the problem
  • Do not jump into solutions without knowing the domain
  • The intent should not be to solve fast but to solve with clarity
  • Have an open mind about domain vs data vs algo
  • Be candid about opinions
  • If all problems were like Kaggle, we should have seen a ton of production solutions
  • Interview questions may be products people spent years to build; thought process / clarity is more important than quick working solutions

Keep Thinking!!!

June 05, 2023

Cashflow forecasting

Paper - Empowering cash managers to achieve cost savings by improving predictive accuracy

  • Cash management is concerned with optimizing the short-term funding requirements of a company

Time Series Forecasting with Transformer Models and Application to Asset Management

  • Sequence prediction - we often predict the next value of the sequence itself
  • Sequence generation - convert sequences from one domain into sequences from another domain, such as machine translation, text summarization, chatbots
  • Iterated multi-step forecasting
  • Direct multi-step forecasting



Self-attention is designed to capture the dependencies in the sequence, such as the relationship between each word and every other word in a sentence.

For a given query, we compare it with all keys K and get different weights for different values

Self-attention and multi-head attention are permutation-equivariant with respect to its inputs

In our experiment, we consider three different portfolio allocation methods:

  • Single-period MVO portfolio with monthly rebalancing
  • Risk parity portfolio with monthly rebalancing
  • Multi-period MVO portfolio with weekly rebalancing as described by Problem

How to Build a Cash Flow Forecast

  • Determine Your Forecasting Objective(s)
  • Short-term liquidity planning
  • Interest and debt reduction
  • Liquidity risk management
  • Growth planning

Short-period forecasts: Short-term forecasts typically look two to four weeks into the future and contain a daily breakdown of cash payments and receipts.

The most common medium-term forecast is the rolling 13-week cash flow forecast.

Long-period forecasts: Longer-term forecasts typically look 6–12 months into the future and are often the starting point for annual budgeting processes

Mixed-period forecasts: Mixed-period forecasts use a mix of the three periods above and are commonly used for liquidity risk management.

Cash flow forecasting

  • Forecast your income or sales
  • Estimate cash inflows
  • Estimate cash outflows and expenses
  • Review your estimated cash flows against the actual

Preparing a cash flow forecast: Simple steps for vital insight

  • Decide how far out you want to plan for
  • List all your income
  • List all your outgoings

Empirical analysis of daily cash flow time series and its implications for forecasting

Cash management is concerned with the efficient use of a company’s cash and short-term investments such as marketable securities.

From these and other works, we observe that common assumptions on the statistical properties of cash flow time-series include:

  • Normality: cash flows follow a Gaussian distribution with observations symmetrically centered around the mean, and with finite variance.
  • Absence of correlation: the occurrence of past cash flows does not affect the probability of occurrence of the next ones.
  • Stationarity: the probability distribution of cash flows does not change over time and, consequently, its statistical properties such as the mean and variance remain stable.
  • Linearity: cash flows are proportional either to another (external) explanatory variable or to a combination of (external) explanatory variables.

Empowering cash managers to achieve cost savings by improving predictive accuracy

Kurtosis is a measure of the tailedness of a distribution; tailedness is how often outliers occur.
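
A minimal sketch of checking some of these distributional assumptions (tailedness via kurtosis, normality, stationarity) on a synthetic daily cash flow series; the data here is a random placeholder, not real cash flows:

import numpy as np
from scipy.stats import kurtosis, shapiro
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
cash_flows = rng.normal(loc=10000, scale=2500, size=365)   # placeholder daily net cash flows

# tailedness: Fisher kurtosis is ~0 for a Gaussian; large values suggest heavy tails / outliers
print("excess kurtosis:", kurtosis(cash_flows))

# normality: a small p-value means the Gaussian assumption is doubtful
stat, pval = shapiro(cash_flows)
print("Shapiro-Wilk p-value:", pval)

# stationarity: a small ADF p-value supports rejecting a unit root (series looks stationary)
adf_stat, adf_pvalue = adfuller(cash_flows)[:2]
print("ADF statistic:", adf_stat, "p-value:", adf_pvalue)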

Transforming Financial Forecasting with Data Science and Machine Learning at Uber

  • Strategic planning
  • Operations
  • Insights


Modeling strategic investments as an optimization problem

  • Minimize spending
  • Maximize number of drivers or riders
  • Maximize number of first trips or total trips
  • Maximize gross bookings

With each optimization problem, we can also specify constraints, such as:

  • Maximum budget, overall or specific to certain channels (such as marketing versus rider promotion)
  • Minimum number of first trips or trips
  • Minimum month-to-month gross booking growth

Short-term use cases: Short-term use cases for cashflow forecasting include budgeting, forecasting sales, and managing cash flow. It can also be used to identify potential areas of overspending and to plan for future investments

Long-term use cases: Cashflow forecasting can be used to plan for long-term investments, such as capital expenditures and acquisitions. It can also be used to develop strategies for managing cash flow over the long-term, such as budgeting and debt management




  • Receivables forecast
  • Payables forecast

Ref - Link


Keep Exploring!!!

June 04, 2023

Forecast + Optimization

  • Regression to estimate how the 'X' variables drive the target
  • Add a constraint to make it an optimization problem
  • Optimization with a minimum spend for each track

#https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv
#https://github.com/georgeblu1/Data-Projects/blob/master/Budget%20Optimization.ipynb
import pandas as pd

data = pd.read_csv(r'https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv')
data.head()

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#There should be a linear relationship between target and features. We can use a scatterplot to visualize and validate this.
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', size = 4, aspect = 1)
#Little or no multicollinearity between features
sns.pairplot(data[['TV','Radio','Newspaper']])
#Homoscedasticity
#OLS assumes all residuals drawn from the population have constant variance
sns.residplot(x = data['TV'], y = data["Sales"])

feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data[["Sales"]]

# instantiate and fit
SkLearn_model = LinearRegression()
SkLearn_result = SkLearn_model.fit(X, y)

# print the coefficients
print(SkLearn_result.intercept_)
print(SkLearn_result.coef_)

# include Newspaper
X = data[['TV', 'Radio', 'Newspaper']]
y = data.Sales

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Instantiate model
lm2 = LinearRegression()

# Fit Model
lm2.fit(X_train, y_train)

# Predict
y_pred = lm2.predict(X_test)

# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(metrics.r2_score(y_test,y_pred))

coefficient = lm2.coef_
coefficient
inter = lm2.intercept_
inter

!pip install pulp
#Our budget should be less than 1000
from pulp import *

prob = LpProblem("Ads Sales Problem", LpMaximize)
#x - tv
#y - radio
#z - newspaper
x = LpVariable("x", 0, 200)
y = LpVariable("y", 0, 500)
z = LpVariable("z", 0, 500)
prob += x + y + z <= 1000
# objective: predicted sales from the regression coefficients (x = TV, y = Radio, z = Newspaper)
prob += 0.0548*x + 0.1022*y + 0.0007878*z + 4.6338
status = prob.solve()
LpStatus[status]
print(prob)
for v in prob.variables():
    print(v.name, "=", v.varValue)

calculation = 0.0548*200 + 0.1022*500 + 0.0007878*300 + 4.6338
calculation

prob = LpProblem("Ads Sales Problem with minimum in each track", LpMaximize)
x = LpVariable("x", 50, 200)
y = LpVariable("y", 50, 500)
z = LpVariable("z", 50, 500)
prob += x + y + z <= 1000
prob += 0.0548*x + 0.1022*y + 0.0007878*z + 4.6338
status = prob.solve()
LpStatus[status]
print(prob)
for v in prob.variables():
    print(v.name, "=", v.varValue)

calculation = 0.0548*200 + 0.1022*500 + 0.0007878*300 + 4.6338
calculation

Keep Exploring!!!

June 03, 2023

Promises and Lies of ChatGPT - understanding how it works



Key Notes

Basics

  • ChatGPT builds on the idea of n-gram models
  • Given n-1 words, guess what the nth word is likely to be
  • The distribution is learnt from sequences
  • People tried this with small values of n
  • Sample from the distribution of words
  • More likely words come out more often

With large data

  • Any N: given words, predict the next word
  • Frequency, conditional probability (a toy bigram sketch follows this list)
  • Generate words once the first word is given
  • More likely words + patterns
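
A toy sketch of the n-gram idea for n=2 (a bigram model): count next-word frequencies in a tiny corpus, treat them as conditional probabilities, and sample more likely words more often. The corpus here is a placeholder.

import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the cat ate the fish".split()   # placeholder corpus

# count how often each word follows each word: P(next | current) ~ counts
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word(current):
    candidates = counts[current]
    if not candidates:                    # dead end: word never seen with a successor
        return None
    # sample in proportion to frequency, so likelier words come out more often
    return random.choices(list(candidates), weights=candidates.values())[0]

word, generated = "the", ["the"]
for _ in range(6):
    word = next_word(word)
    if word is None:
        break
    generated.append(word)
print(" ".join(generated))
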
Large sentences/meanings

  • Abstract sequences
  • Different answers every time
  • Every generated sequence may come from a different distribution, but a similar context is possible
  • ChatGPT = something well written
Why does it work?

  • We believe in what seems realistic
  • It connects to human experience
  • Fact is different from possibility
  • Plausible, probable, or reasonable answers

Similarity to humans

  • Humans are not always factual
  • It can be perception based
  • People can be penalized in civil society
  • Machines can suggest without knowing the consequences
  • Automation may still have a bias
  • Being close to the truth, we are impressed
Predictive modeling - train / predict

Conditional modeling

  • Can create bias in information
  • Discriminative learning learns a conditional model
  • A classifier that finds dogs vs a model that generates dogs - the two are different
Generative distribution - joint distribution

  • The prior distribution of reasonable images
  • Teacher = generative model
  • Learning a generative model is costlier

The human brain works by on-demand stitching

  • ChatGPT does something similar
  • All learning is compression
  • All learning is lossy compression
  • JPEG is lossy - approximating
  • Representation of compressed details
  • A significant footprint of data is available to train systems
Good writing for all

  • Picasso-style pics
  • Shakespeare-style writing
  • Racial profiling not required
  • Character and form are not connected
  • Generalizations help for survival
  • AI as creator / editor

Badly written with original thought is human writing

  • Harder to write in original, creative ways
  • Original vs derivative thinking
  • Bad handwriting vs good content
  • Bad packaging vs good product
  • We have one scale: good or bad
  • LLMs learn from human language
  • The most likely completion, given what society has written
  • Social engineering on data

Is this a good representation of all ethnicities?


How is it fine-tuned?

  • RLHF
  • Show results
  • Ask someone what they like
  • Thumbs up / down to change the distribution
  • Re-learning it
  • Collectively offensive content on the web vs making a decent prompt engine

  • Align to human values
  • Concentration camps, genocide - human values
  • Retrain for cultural norms
  • False positives
  • Different narratives, different takers

  • Prompts can make the LLM override its conditioned network
  • Adversarial learning prompts
  • How to put in knobs so it behaves well

AI systems to work with

  • Basically, put people to think about the problem
  • With enough eyeballs, every downside can be a shallow bug
  • We need more eyeballs to decide
  • ChatGPT will not generate a grammatically incorrect sentence
  • Core problems of intelligent behavior - planning, diagnosis, reasoning


Keep Exploring!!!

June 02, 2023

AI - Image Generator - Approach

Under the hood, training on tons of images generates distributions based on:

  • Context generator = backgrounds
  • Object generator = cars / bikes
  • Object + context = car on a beach
  • Finetune to similar pictures
  • Sharpen images / super resolution
  • Fix shapes / corners

Keep Exploring!!!