"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

October 30, 2023

Code - Domain - Data Knowledge

#Coding floods the scene, #DomainKnowledge remains elusive. Truly grasping #CustomerNeeds - now that's a rare spirit. As #HelloWorld apps multiply, remember #RealWorld applications need a #panoramic vision for solutions. #Data + #Domain + understanding customer needs require constant experimentation, ideation and iteration. #perspectives. An analogy from Ray Dalio: Shapers get both the big picture and the details right. To me, it seems that Shaper = Visionary + Practical Thinker + Determined.

Keep Exploring!!!

October 29, 2023

Introducing ChatGPT Enterprise

ChatGPT Enterprise

  • You own and control your business data in ChatGPT Enterprise
  • We do not train on your business data or conversations, and our models don’t learn from your usage

ChatGPT Enterprise is available today

Advanced Data Analysis (ChatGPT Enterprise version)

Advanced Data Analysis (ADA) has been upgraded to include three new capabilities aimed at enhancing the analysis of text-rich documents: 

  • Synthesis -  Analyze information from documents to generate new content or insights
  • Transformation - Alter the presentation of information without changing its underlying essence
  • Extraction - Identify and pull out specific pieces of information from a document
  • Supported formats - PDF (.pdf), Text (.txt), PowerPoint (.ppt), Word (.doc), Excel (.xlsx), Comma-separated values (.csv)

Contact OpenAI sales team


Keep Exploring!!!

Google Vision - Experiments - Vertex Vision

  • Cloud Video Intelligence API - Detects objects, explicit content, and scene changes in videos, and also specifies the region where they appear
  • Cloud Vision API - Image Content Analysis

Track objects in a streaming video

Track objects

Shot change detection tutorial

  • SHOT_CHANGE_DETECTION request (a minimal sketch follows this list)
  • Lists all shots that occur within the video
  • For each shot, provides the start and end time
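
A minimal sketch of such a request with the Python client, assuming the video is already in a GCS bucket (the path is a placeholder):

```python
# Sketch: shot change detection with the Video Intelligence API.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.SHOT_CHANGE_DETECTION],
        "input_uri": "gs://your-bucket/your-video.mp4",  # placeholder path
    }
)
result = operation.result(timeout=120)  # long-running operation

# One annotation result per input video; each shot carries start/end offsets.
for shot in result.annotation_results[0].shot_annotations:
    start = shot.start_time_offset.total_seconds()
    end = shot.end_time_offset.total_seconds()
    print(f"Shot: {start:.2f}s -> {end:.2f}s")
```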

Track objects in a local video file

All Video Intelligence code samples

AI-powered video archive for searching family videos

Video intelligence takes to the streets



Hello video data: Train an AutoML video classification model

Live Streaming on Google Cloud with Media CDN and Live Streaming API

Experiments

  • On lower resolution
  • Out-of-box detections
  • Select frames by Shot detection and evaluate
  • Offline mode evaluation (On videos from the bucket)
  • Online mode evaluation (Live Streaming API)
  • To find unknowns/limitations from sample videos 
  • Exploratory analysis/video/image

Architecture options

  • Offline Video evaluation with video in GCP bucket
  • Offline Video evaluation + custom models
  • Offline Video + Shot Detection + Out of box object detection
  • Live Streaming API Evaluation




Register streams - Streams connect your physical devices (like IP cameras)
Create Apps


Keep Exploring!!!

information asymmetry, Common-Knowledge Effect

Information asymmetry is very relatable from project experience and domain understanding.

Some instances:

  • Customers share limited information
  • Features about transactions go unexplained
  • Correlations were removed and the data anonymized

This happens often and results in poor forecasting




When two people do not have the same level of information, perception and understanding vary





Common-Knowledge Effect: A Harmful Bias in Team Decision Making

The common-knowledge effect is a decision-making bias where teams overemphasize the information most team members understand instead of pursuing and incorporating the unique knowledge of team members.

Preference Bias

  • We are more likely to discuss information that aligns with our initial preferences or preconceived notions.
  • Even when all information is shared with the group, we still process that information according to our initial preferences.

Social Comparison

  • We seek social acceptance and avoid conflict with teammates. We tend to adopt the group's prevailing view when evaluating information in unclear situations.
  • Information familiar to multiple team members becomes socially validated and more likely to be repeated and affirmed.

Keep Exploring!!!

70 hours work week VS Thinking perspective VS Consulting vs Domain Knowledge

Numbers do not reflect the quality of an outcome; consistency, ideas, and experimentation matter more

Steve Jobs on Continuous Process Improvement

  • The theory behind the question: why do we do it this way?
  • Question the basics and the established ways of doing things
  • Shift to an optimistic point of view and relook at the options
  • "That's the way it's done" vs finding new ways / opportunities

Steve Jobs on Consulting

  • Owning / working on something for an extended period of time, a few years
  • Seeing it through actions / accumulating scar tissue
  • Learning a fraction vs not owning the results
  • A picture of a banana (2D) vs the experience of eating one (3D)
  • Knowing is a 2D view; doing is a 3D view (hands-on matters)
  • Take responsibility for the work

This was the key lesson behind why I was able to rewrite the warranty engine at Microsoft for 224 million serial numbers

Team - Teamwork



"My model for business is The Beatles. They were four guys who kept each other's kind of negative tendencies in check. They balanced each other, and the total was greater than the sum of the parts. That's how I see business: Great things in business are never done by one person, they're done by a team of people."

Ideas to Reality

The difference lies in process and constant innovation


It is compounding value over years :)



Hire people who are trustworthy and have complementary skills - aligned on vision but diverse in skills

Arrogance - Keep a Tab



Keep Exploring!!!

October 23, 2023

Text to Vision - Image - Survey - Techniques - Lessons

Text to Vision - Image - Survey - Techniques - Lessons

  • multimodal-to-text generation models (e.g. Flamingo)
  • image-text matching models (e.g. CLIP)
  • text-to-image generation models (e.g. Stable Diffusion).

A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions

  • Difficulty generating images with multiple objects
  • Quality improvement of generated images

Concepts

  • To generate images with multiple objects, layout information such as bounding boxes or segmentation maps is added to the model. 
  • Cross-attention maps have been found to play a crucial role in image generation quality
  • Techniques like “SynGen” [55] and “Attend-and-Excite” [9] have been introduced to improve attention maps

  • Mathematically, this process can be modeled as a Markov process.
  • The process of adding noise step by step, from x_0 to x_T, is called the "forward process" or "diffusion process"
  • Conversely, starting from x_T, iteratively removing the noise until a clean image is obtained is called the "reverse process" (a sketch follows this list)
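
A minimal NumPy sketch of the forward process in its closed form (a linear beta schedule is assumed; the numbers are illustrative):

```python
# Sketch of the DDPM forward (diffusion) process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod of alpha_s for s <= t

def q_sample(x0, t, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.zeros((3, 64, 64))   # a dummy "image"
x_T = q_sample(x0, T - 1)    # after T steps it is (almost) pure Gaussian noise
```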

  • Denoising Diffusion Probabilistic Models (DDPM)
  • Basic components of a diffusion model
  • Noise prediction module - U-net / pure transformer structure
  • Condition encoder - conditioned on something such as text; a T5-series encoder or the CLIP text encoder is used in most current works.
  • Super resolution module - DALL·E 2 employs two super-resolution models in its pipeline
  • Dimension reduction module - Text encoder and image encoder of CLIP are components integrated into the DALL·E 2 model
  • Diffusion models can also encounter difficulties in accurately representing positional information
  • SceneComposer [75] and SpaText [1] concentrate on leveraging segmentation maps for image synthesis
  • Subject Driven Generation
  • Concept customization or personalized generation
  • Present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept




  • The advantage of BLIP-Diffusion lies in its ability to perform "zero-shot" generation, as well as "few-shot" generation with minimal fine-tuning

QUALITY IMPROVEMENT OF GENERATED IMAGES

  • Mixture of experts (MOE) [60] is a technique that leverages the strengths of different models, and it has been adapted for use in diffusion models to optimize their performance
  • Employ Gaussian blur on certain areas of the prediction, according to the self-attention map, to extract this condition

Reverse Stable Diffusion: What prompt was used to generate this image?

  • A new task: predicting the text prompt given an image generated by a generative diffusion model
  • DiffusionDB is the first large-scale text-to-image prompt dataset. It contains 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users.
  • Diffusion Explorer

Learning framework for prompt embedding estimation

Reversing the text-to-image diffusion process

  • Predict a sentence embedding of the original prompt used to generate the input image
  • As underlying models, three state-of-the-art architectures are considered that are agnostic to the generative mechanism of Stable Diffusion: ViT, CLIP and Swin Transformer
  • Plus the U-Net model from Stable Diffusion, which operates in the latent space (a hedged training sketch follows this list)
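
A hedged sketch of this setup: a frozen CLIP image encoder plus a small regression head trained to match embeddings of the true prompts. The model name, target dimension, and MSE objective are illustrative assumptions, not the paper's exact recipe:

```python
# Sketch: image -> prompt-embedding regression on top of a frozen CLIP encoder.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
backbone.requires_grad_(False)                      # keep the encoder frozen
head = nn.Linear(backbone.config.hidden_size, 384)  # 384 = assumed target embedding dim

def predict_prompt_embedding(pixel_values):
    feats = backbone(pixel_values=pixel_values).pooler_output
    return head(feats)

# Training step against precomputed prompt embeddings (e.g., from a sentence encoder):
loss_fn = nn.MSELoss()
# loss = loss_fn(predict_prompt_embedding(images), target_prompt_embeddings)
```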

Explain in layman's terms - U-Net model from Stable Diffusion, which operates in the latent space.

The U-Net model from Stable Diffusion is a type of artificial intelligence model used for various computer vision tasks like image segmentation, where it identifies and separates different objects or features within an image.

Imagine you have a picture that appears blurry, full of noise, or unclear. The U-Net model from Stable Diffusion operates like a sophisticated visual detective, which can work back through the noise, step by step, to try and reconstruct the original picture.

To do this, it operates in what we call the 'latent space', which is loosely analogous to the mind’s eye of the AI - it's where the AI forms a sort of abstract, compressed understanding of the different elements present in the image, their shapes, and how they relate to each other. You can think of the latent space as a box where the details of the image are stored in a compact form, almost like the raw components before they've been assembled into the complete picture.

So, the U-Net model from Stable Diffusion first takes a noisy image, maps or translates it into this intermediate latent space - compressing and organizing the information in a way it can handle - before then reconstructing the original, clearer image from that. It's essentially a way of moving from a jumble of details into a structured "blueprint" in the latent space, and then using that blueprint to rebuild a clear and accurate image.

A key aspect of the U-Net model is its structure, which is like a U-shape (thus the name 'U-net'). The first half of the U shape takes the noisy image and condenses it down into the blueprint in the latent space (this is called encoding or downsampling). The second half then expands this blueprint back out into the clear image (known as decoding or upsampling). This U-shape structure, combined with the operation in the latent space, allows the model to effectively manage and recover the important details from the noisy input and improve the generated output's quality significantly.

So in simple terms, the U-Net model from Stable Diffusion operates like a skilled restorer, turning a distorted or noisy picture back into a clear and identifiable image by operating in its “mind’s eye” or latent space, using a special U-shaped structure to carefully manage detail extraction and restoration.
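
To make the U-shape concrete, here is a toy PyTorch U-Net (a sketch only; real diffusion U-Nets add timestep embeddings, attention, and many more channels):

```python
# Toy U-Net: encode/downsample, bottleneck, decode/upsample, with a skip connection.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, ch=4):
        super().__init__()
        self.enc1, self.enc2 = block(ch, 32), block(32, 64)
        self.down = nn.MaxPool2d(2)                        # left side: downsample
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # right side: upsample
        self.dec1 = block(64, 32)                          # 64 = 32 (skip) + 32 (up)
        self.out = nn.Conv2d(32, ch, 1)

    def forward(self, x):
        s1 = self.enc1(x)             # left side of the "U"
        b = self.enc2(self.down(s1))  # bottleneck (the latent "blueprint")
        u = self.up(b)                # right side of the "U"
        return self.out(self.dec1(torch.cat([u, s1], dim=1)))  # skip connection

x = torch.randn(1, 4, 64, 64)   # e.g. a latent-space tensor
print(TinyUNet()(x).shape)      # torch.Size([1, 4, 64, 64])
```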


Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion


VLP: A Survey on Vision-Language Pre-training

Image Feature Extraction

  • Using Faster R-CNN, VLP models obtain OD-based (object detection) region feature embeddings
  • Alternatively, CNNs are trained end-to-end using grid features

Video Feature Extraction

  • VLP models [17, 18] extract the frame features by using the method mentioned above

Text Feature Models

  • For textual features, following pretrained language models such as BERT [2], RoBERTa [24], ALBERT [25], and XLNet [26], VLP models [9, 27, 28] first segment the input sentence into a sequence of subwords (see the tokenizer example below)
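
A quick illustration of that subword segmentation step with the standard Hugging Face BERT tokenizer:

```python
# BERT's WordPiece tokenizer splits rare words into '##'-prefixed subword pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("multimodal pretraining"))
```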



Ideas summary

  • Detection models for Region feature embedding
  • Grid based feature extraction with CNN
  • Super resolution module to the pipeline
  • Subject Driven Generation, Concept customization or personalized generation - Present an image or a set of images that represent a particular concept, and then generate new images based on that specific concept
  • Gaussian blur on certain areas based on attention / relevance
  • Captioning, Category recognition
  • Category Recognition (CR) CR refers to identifying the category and sub-category of a product, such as {HOODIES, SWEATERS}, {TROUSERS, PANTS}
  • Multi-modal Sentiment Analysis (MSA) MSA is aimed to detect sentiments in videos by leveraging multi-modal signals 

Text-to-image Diffusion Models in Generative AI: A Survey

The learning goal of a DM is to reverse a process of perturbing the data with noise, i.e. diffusion, for sample generation

Diffusion Probabilistic Models (DPM), Score-based Generative Models (SGM)

Denoising diffusion probabilistic models (DDPMs) are defined as a parameterized Markov chain

  • Forward pass. In the forward pass, DDPM is a Markov chain where Gaussian noise is added to the data at each step until the image is destroyed
  • Reverse pass. With the forward pass defined above, we can train the transition kernels for a reverse process (one reverse step is sketched below)
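
Complementing the forward-process sketch earlier, one reverse (sampling) step of the standard DDPM sampler might look like this; `eps_model` stands in for a trained noise-prediction network:

```python
# One DDPM reverse step, using the sigma_t^2 = beta_t variance choice.
import numpy as np

def p_sample(eps_model, x_t, t, betas, alpha_bars, rng=np.random.default_rng()):
    alpha_t = 1.0 - betas[t]
    eps_hat = eps_model(x_t, t)  # predicted noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean              # no noise added at the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z
```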

Conditional diffusion model: A conditional diffusion model learns from additional information (e.g., class and text) by taking them as model input.

Guided diffusion model: During the training of a guided diffusion model, the class-induced gradients (e.g. through an auxiliary classifier) are involved in the sampling process.
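
In score terms, classifier guidance shifts the unconditional score by the classifier gradient, scaled by a guidance weight s (the standard formulation; notation mine):

```latex
\nabla_{x_t} \log p(x_t \mid y)
  \approx \nabla_{x_t} \log p_\theta(x_t)
        + s \, \nabla_{x_t} \log p_\phi(y \mid x_t)
```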


Awesome Video Diffusion

Keep Exploring!!!

Interesting Product - sivi.ai

sivi.ai

The concept of blending image/text and providing ad variations is very impressive :)




The next question that comes up: how does it compete against other models and text-to-image generator options?




Current state-of-the-art models struggle with creating the right mix of design with image and text content.

My Understanding

  • Have variations for text
  • Have Variations for image
  • Leverage past data
  • Position according to domain/data
  • Generate variations


  • 30% image variations
  • 30% text variations
  • 40% templates and positioning based on domain / data / templates

Keep Exploring!!!

October 21, 2023

Baidu LLM Use cases

 

From minute 19 to minute 25 of the talk

The use cases implemented are

Use case #1 - Custom background

  • Input - Photo of car
  • Prompt - New energy vehicle; create new backgrounds for it
  • Output - New custom background added

Use Case #2 - Generate a marketing poster. Input is an image plus a prompt about product details

  • Prompt - Take info from site, create poster of it 
  • Output - Creates poster for it with image + text

Use case #3 - Write variations of the poster with higher-quality info. Create five more pieces of advertising copy

Output - 

  • Positive feature notes in different tones
  • Professional write up for marketing
  • Five copies created

Use Case #4 - Video use case: generate an ad

  • Input - Website info, Existing content, Create digital content

Output

  • Video had a person explaining the product
  • Different views of the car
  • Images / references in the video

Keep Exploring!!!

 

October 20, 2023

Google Vertex Vision - Analytics - GenAI - Vertex Matching Engine

Vertex Vision 

Feed real-time streaming video

Pick existing models

Plug custom vision models


Architecture references and GenAI - Vision + Text + Catalog Management

Summary items

  • Fine-tuning with sample images - few-shot learning
  • Step 1 - Image Embedding Extractor - Vertex AI Embedding Extractor to extract an embedding for each image
  • Step 2 - Vertex Matching Engine to fetch the top similar images (a sketch of Steps 1-2 follows the pre-requisites below)
  • Step 3 - Create new copy for the images and new text, and upload to the product database

Pre-requisites - Catalog of images
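
A hedged sketch of Steps 1-2 with the Vertex AI Python SDK, assuming the catalog images are available; project, location, file, and index names are all placeholders:

```python
# Sketch: multimodal embedding extraction + Matching Engine nearest-neighbour query.
from google.cloud import aiplatform
from vertexai.vision_models import Image, MultiModalEmbeddingModel

aiplatform.init(project="your-project", location="us-central1")  # placeholders

# Step 1 - image embedding
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
emb = model.get_embeddings(image=Image.load_from_file("product.jpg"))

# Step 2 - similar products from a deployed Matching Engine index
endpoint = aiplatform.MatchingEngineIndexEndpoint(
    "projects/.../locations/.../indexEndpoints/..."  # placeholder resource name
)
neighbors = endpoint.find_neighbors(
    deployed_index_id="catalog_index",  # placeholder
    queries=[emb.image_embedding],
    num_neighbors=10,
)
```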


Ref - Accelerate product catalog management with generative AI 

Step 1 - Embedding Extract 

Step 2 - Similar Products

Step 3 - Add Descriptions

Step 4 - Prompt based enrichment

Advantage - Language translations supported

Step 5 - Catalog image creation


Ref -  Accelerating product innovation with generative AI

Step 1 - Text data import - reviews, product info

Step 2 - Extract insights from uploaded info

Step 3 - QnA

Step 4 - Product Generation from concept

Summary

  • Concept one-liner (1 word)
  • Features from concept details (Few lines)
  • Prepare description with features (Product V1 Description)
  • Description to create Images (Image template creation)
  • Inspiration with details and images (Draft Product Ready) - a prompt-chaining sketch of this flow follows below
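
A hedged sketch of the concept-to-description chain using the Vertex AI text model of that time; the prompts, concept, and model name are illustrative assumptions:

```python
# Sketch: chain prompts from a one-line concept to a draft product description.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="your-project", location="us-central1")  # placeholders
model = TextGenerationModel.from_pretrained("text-bison@001")

concept = "self-cleaning water bottle"  # concept one-liner
features = model.predict(f"List key product features for: {concept}").text
description = model.predict(
    f"Write a one-paragraph product description using these features:\n{features}"
).text
print(description)  # Product V1 description, ready for image-prompt creation
```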

Keep Exploring!!!

October 17, 2023

Machine Learning Interpretability / Explainability

Key Notes / Ideas 

Key items from blog / Reposted 

  • Create White-Box / Interpretable Models (Intrinsic): e.g., Linear Regression, Decision Trees.
  • Explain Black-Box / Complex Models (Post-Hoc): e.g., LIME, SHAP.
  • Enhance the Fairness of a Model: e.g., Fairness Indicators, Adversarial Debiasing.
  • Test Sensitivity of Predictions: e.g., Perturbation Analysis.

Local vs Global Interpretations:

  • Local: Dive into a single prediction to understand it. e.g., Individual SHAP values.
  • Global: Grasp the overall model behavior. e.g., Feature Importance Rankings. (A small SHAP example showing both views follows this list.)
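
A small SHAP sketch covering both views, assuming a fitted tree model; the dataset and plots are illustrative:

```python
# Sketch: global (summary plot) and local (single-prediction) SHAP explanations.
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)  # global: overall feature importance
# Local: explain one prediction (in a notebook, call shap.initjs() first).
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
```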

Data Types & Applicable Interpretability Methods:

  • Tabular: e.g., Partial Dependence Plots.
  • Text: e.g., Word Embedding Visualizations.
  • Image: e.g., Grad-CAM for CNNs.
  • Graph: e.g., Node Influence Metrics.

Model Specificity:

  • Model Specific: Techniques that apply to a single model or a group of models. e.g., Feature Importances for Trees.
  • Model Agnostic: General methods applicable to any model. e.g., LIME.

Ref - Link

Key points from the AI Ethics Institute (Link)

  • Transparency and explainability gains may be significant
  • Explainable by justification - examples can give a better understanding
  • Explainability through feature importance - understanding the effect of features - SHAP (SHapley Additive exPlanations)
  • Abstracting key patterns identified in deep learning models as actual features
  • The implications that different types of errors have, as well as the right way to evaluate these errors

Keep Exploring!!!

Python + Data Pipelines

Data fetched from multiple sources

Data integrity / purity tests

Lessons

ETL vs ELT

DAG / Prerequisites - remember the order of sequence (a minimal Airflow sketch follows this list)

  • Many orderings are possible
  • CDC (change data capture) will run into issues if the order is not respected
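
A minimal Airflow sketch (Airflow 2.4+ syntax assumed; task logic stubbed) of making the order of sequence explicit in a DAG:

```python
# Sketch: extract -> transform -> load never runs out of order.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("etl_pipeline", start_date=datetime(2023, 10, 1),
         schedule=None, catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    extract >> transform >> load  # explicit ordering / prerequisites
```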



Lesson

The longer bad data lives, the more it costs to fix


Support backward-compatible schemas (a tiny example below)
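
A tiny illustration of the idea: give a newly added field a default, so records written before the schema change still parse:

```python
# Sketch: a new optional field with a default keeps old records loadable.
from dataclasses import dataclass

@dataclass
class OrderRecord:
    order_id: str
    amount: float
    currency: str = "USD"  # added later; the default keeps old records valid

old_row = {"order_id": "A1", "amount": 10.0}  # pre-change record
print(OrderRecord(**old_row))                 # still loads fine
```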


Keep Exploring!!!