Data Science, Database, AI Startups and Domain Learning's (Video-Image-Text-Data-Database): datascieneprojects

August 15, 2022

AI Projects - Ideas / Inspirations

Project #1 - Fashion Clothing Category Classification

Report - Link

Models - Link

A good starting point for working on fashion attributes

Project #2 - Time Series based Wikipedia Traffic prediction to aid Caching algorithms

Key Notes

Caching algorithms such as LRU and LFU are among the most widely used algorithms in the industry, in systems ranging from storage systems and in-memory key-value stores to routers
Unlike many other ML projects, such as image recognition, an ML model for caching is not evaluated against human-annotated ground truth but against the LRU baseline
LSTM and CNN based architectures with custom loss function
Develop a custom loss function by adding a recall-maximizing term to the binary cross-entropy, then tune the new hyper-parameter
Tune the loss function parameter to place a higher weight on positive samples
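
Since the notes above use LRU as the yardstick, a baseline hit rate is needed before any learned policy can be judged. The following is a minimal sketch of such a baseline, not the project's code; the function name `lru_hit_rate` and the toy request trace are assumptions made for illustration.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay a request trace through an LRU cache and return the hit rate."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return hits / len(requests) if requests else 0.0


# Toy trace of page requests (hypothetical data).
trace = ["A", "B", "A", "C", "B", "D", "A"]
baseline_small = lru_hit_rate(trace, capacity=2)
baseline_large = lru_hit_rate(trace, capacity=3)
```

A learned caching policy would then be replayed on the same trace and compared against these numbers.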

Loss functions - Link

From link

A custom loss function can be created by defining a function that takes the true values and predicted values as required parameters. The function should return an array of losses. The function can then be passed at the compile stage.
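
One simple way to place a higher weight on positive samples is to scale the positive term of the binary cross-entropy. The sketch below follows the (true values, predicted values) signature described above and returns one loss per sample. It is an assumption about the general shape of such a loss, not the project's actual formula; the parameter name `pos_weight` is hypothetical.

```python
import math


def weighted_bce(y_true, y_pred, pos_weight=2.0, eps=1e-7):
    """Per-sample binary cross-entropy with extra weight on positive samples.

    pos_weight > 1 penalizes missed positives more, which pushes recall up.
    """
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
        losses.append(loss)
    return losses


# A missed positive costs three times as much as with the plain loss.
plain = weighted_bce([1, 0], [0.5, 0.5], pos_weight=1.0)
weighted = weighted_bce([1, 0], [0.5, 0.5], pos_weight=3.0)
```

In Keras, the same idea would be written with tensor operations and passed at the compile stage, with `pos_weight` tuned like any other hyper-parameter.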

Project #3 - Long Term Stock Prediction Based On Financial Statements

Feature engineering with key indicators, such as Price-to-Book Ratio, Price-to-Earnings Ratio, Debt-to-Equity Ratio
This project focuses on building an end-to-end LSTM model for long term stock prediction based on historical financial statements
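
The key indicators above can be derived from raw statement fields. This is a minimal sketch using common textbook definitions; the function name, argument names, and sample figures are assumptions, and debt-to-equity in particular is defined in more than one way (total liabilities is used here).

```python
def valuation_ratios(price, shares_outstanding, net_income,
                     total_equity, total_liabilities):
    """Compute P/E, P/B and D/E from a share price and statement totals.

    Assumes positive earnings, equity and share count; real data needs
    guards for zero or negative denominators.
    """
    earnings_per_share = net_income / shares_outstanding
    book_value_per_share = total_equity / shares_outstanding
    return {
        "price_to_earnings": price / earnings_per_share,
        "price_to_book": price / book_value_per_share,
        "debt_to_equity": total_liabilities / total_equity,
    }


# Hypothetical company figures, for illustration only.
ratios = valuation_ratios(
    price=50.0,
    shares_outstanding=1_000_000,
    net_income=5_000_000,
    total_equity=20_000_000,
    total_liabilities=10_000_000,
)
```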

Dataset - Link

Balance sheet features, including 30 data fields: cash and cash equivalents, short-term investments, net receivables, inventory, other current assets, total current assets, long-term investments, fixed assets, goodwill, intangible assets, other assets, deferred asset charges, total assets, accounts payable, short-term debt / current portion of long-term debt, other current liabilities, total current liabilities, long-term debt, other liabilities, deferred liability charges, misc. stocks, minority interest, total liabilities, common stocks, capital surplus, retained earnings, treasury stock, other equity, total equity, total liabilities and equity
Income statement features, including 18 data fields: total revenue, cost of revenue, gross profit, research and development, sales general and admin., non-recurring items, other operating items, operating income, add’l income/expense items, earnings before interest and tax, interest expense, earnings before tax, income tax, minority interest, equity earnings/loss unconsolidated subsidiary, net income-cont. operations, net income, net income applicable to common shareholders.
Cash flow statement features, including 18 data fields: net income, depreciation, net income adjustments, accounts receivable, changes in inventories, other operating activities, liabilities, net cash flow-operating, capital expenditures, investments, other investing activities, net cash flows-investing, sale and purchase of stock, net borrowings, other financing activities, net cash flows-financing, effect of exchange rate, net cash flow.

Paper #4 - Stock Market Prediction using CNN and LSTM

Starting with a data set of 130 anonymous intra-day market features and trade returns
This study is based on a financial dataset extracted from the Jane Street Market Prediction competition on Kaggle [16]. The dataset comprises 2,390,491 records, each described by 130 anonymous features measured sequentially over 500 days at different time steps during each day.
Rolling cross validation
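
Rolling cross validation keeps every test window strictly after its training window, so no future data leaks into training. The paper's exact window sizes are not given here, so the sketch below uses a generic fixed-size rolling window; `rolling_splits` and its parameters are hypothetical names.

```python
def rolling_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) for a rolling-window split.

    Each test window starts right after its training window, and the
    whole window slides forward by `step` (default: test_size).
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += step


# Ten time-ordered samples (for example, days), 4 to train and 2 to test.
splits = list(rolling_splits(n_samples=10, train_size=4, test_size=2))
```

With day-level data, the indices would refer to days rather than individual records, so all records from one day stay on the same side of the split.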

Predict bucket move: 5 / 10 / 15 / 20 / break

Project #5 - Film Success Prediction Using NLP Techniques

Our dataset may be separated into two major parts: a set of structured categorical and numerical data retrieved from IMDb, and a set of scripts from which we generate word frequency vectors and scene description vectors
For the categorical structured data, we use a dense layer without bias to serve as a trainable embedding layer which returns a 128 dimensional embedding of the data.
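
A dense layer without bias acts as an embedding because multiplying a one-hot vector by the weight matrix simply selects one row of that matrix. The sketch below demonstrates this in plain Python. It assumes one-hot encoded inputs, and the category count, random weights, and helper names are illustrative rather than taken from the project.

```python
import random


def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dense_no_bias(x, weights):
    """Compute x @ weights for a vector x and a row-major weight matrix."""
    n_out = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(n_out)]


random.seed(0)
n_categories, embed_dim = 5, 128  # 128 matches the embedding size above
weights = [[random.gauss(0.0, 0.05) for _ in range(embed_dim)]
           for _ in range(n_categories)]

# The layer output for category 3 is exactly row 3 of the weight matrix.
embedding = dense_no_bias(one_hot(3, n_categories), weights)
```

Because the weights are trained with the rest of the network, each row becomes a learned 128-dimensional representation of its category.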

Paper #6 - Generating Six-Word Stories

The six-word story is a format of flash storytelling that rose to popularity through the famous tale allegedly written by Ernest Hemingway
PRAW (the Python Reddit API Wrapper)
Query data from the r/sixwordstories subreddit
Use this to generate meaningful tweets
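
Collecting training data with PRAW could look roughly like the sketch below. It assumes you have registered a Reddit API application and hold its credentials; the function names, the choice of the "top" listing, and the six-word filter are assumptions and not the paper's pipeline.

```python
def is_six_words(title):
    """Return True when a title has exactly six whitespace-separated words."""
    return len(title.split()) == 6


def fetch_six_word_titles(client_id, client_secret, user_agent, limit=100):
    """Fetch top submissions from r/sixwordstories and keep six-word titles.

    Requires the third-party `praw` package and valid API credentials.
    """
    import praw  # imported lazily so the filter works without praw installed

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    submissions = reddit.subreddit("sixwordstories").top(
        time_filter="all", limit=limit
    )
    return [s.title for s in submissions if is_six_words(s.title)]


# The filter can be checked offline against the tale attributed to Hemingway.
example_ok = is_six_words("For sale: baby shoes, never worn.")
```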

Project #7 - Changing people’s hair color in images

The training set is split into two sets trainA and trainB.
The images from trainB are presented to the discriminator with their actual hair color.
Images from trainA are given as input to the generator along with a target hair color that is randomly sampled from all the hair colors occurring in trainB.
The discriminator tries to classify images from trainB (labeled with their actual hair color) as 1 and generated images (labeled with the target hair color) as 0, whereas the generator tries to fool the discriminator with realistic generated images.
To succeed at this, the generator has to match the target hair color in the generated image.
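
The labeling scheme above can be sketched as a function that assembles one discriminator batch. This is only an illustration of the data flow: `generate` stands in for the generator network, images are represented by placeholder strings, and sampling target colors from the list of trainB labels (so frequent colors are drawn more often) is an assumption.

```python
import random


def discriminator_batch(train_a, train_b, generate, rng=random):
    """Build (image, hair_color, label) triples for the discriminator.

    train_a: list of images.
    train_b: list of (image, actual_hair_color) pairs.
    generate: callable (image, target_color) -> generated image.
    """
    colors_in_b = [color for _, color in train_b]
    batch = []
    # Real images from trainB keep their actual hair color and get label 1.
    for image, color in train_b:
        batch.append((image, color, 1))
    # Generated images carry the sampled target color and get label 0.
    for image in train_a:
        target = rng.choice(colors_in_b)
        batch.append((generate(image, target), target, 0))
    return batch


# Placeholder data and a stub generator (hypothetical).
train_a = ["a1", "a2", "a3"]
train_b = [("b1", "blond"), ("b2", "black"), ("b3", "brown")]
batch = discriminator_batch(
    train_a, train_b,
    generate=lambda img, color: f"{img}->{color}",
    rng=random.Random(0),
)
```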

Code - Link

Project #8 - Detect Depression

Project #9 - Predict loan default

Project #10 - Link Prediction with Graph Neural Networks and Knowledge Extraction

Graph Neural Network: the number of GNN layers is limited because of Laplacian smoothing (stacking many layers makes the features of connected nodes increasingly similar)
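
The smoothing effect can be seen on a toy graph. The sketch below repeatedly averages each node's feature with its neighbours' features, which is a simplified stand-in for stacked GNN layers (no learned weights or nonlinearities). The graph, feature values, and function names are assumptions chosen for illustration.

```python
def smooth(features, adjacency):
    """One propagation step: mean over each node and its neighbours."""
    out = []
    for node, neighbours in enumerate(adjacency):
        group = [node] + neighbours  # include a self-loop
        out.append(sum(features[i] for i in group) / len(group))
    return out


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Path graph 0 - 1 - 2 - 3 with one scalar feature per node.
adjacency = [[1], [0, 2], [1, 3], [2]]
history = [[0.0, 1.0, 2.0, 3.0]]
for _ in range(20):
    history.append(smooth(history[-1], adjacency))

spreads = [spread(h) for h in history]
```

The spread shrinks with every step, so after many layers the nodes become nearly indistinguishable, which is why shallow GNNs are usually preferred.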

Knowledge Extraction: We use BERN [4] to extract named entities from the abstract of each article.

Project #11 - Finding a hairstyle that fits your facial features

Tried Adam and RMSProp, and ended up using the AdamOptimizer described in the lecture

More Reads

Keep Exploring!!!

August 10, 2022

Day 3 - Projects - Features - Ideas

Project summary list -

2018 List
2017 List
2019 List
2020 List
2018 List

Project #1 - A Machine Learning Approach to Assess Education Policies in Brazil

Report - Link

Regression model to predict current quality index of schools
Clustering model to identify groups of schools with similar profiles
Classification model to predict goal achievement in schools
Key Features
Spending on transportation for students
Spending on food for students and workers
Spending on construction and maintenance of schools
Spending on salaries of school employees
Number of students
Number of professors separated by level of education
Number of laboratories, computers and offices
School performance according to Ideb in 2013, 2015 and 2017
School goal for Ideb in 2013, 2015 and 2017

My Observations - We could apply the same approach to Indian schools: based on publicly available data, we could proactively predict dropouts, target aid, and identify poor performers

Project #2 - Fraud detection using Machine Learning

Dataset

PaySim - a Kaggle dataset for fraud detection
6 million + mobile payment transactions
6 different categories of transactions
8312 fraudulent transactions

Class weight-based approach

In a fraud detection system, it’s more critical to correctly detect fraud transactions, and acceptable to misclassify a certain number of non-fraud transactions.
Penalize misclassification of fraud transactions more than non-fraud transactions
Assign higher weights to fraud class to obtain high recall on that class and counter data imbalance.
Ensure no more than ~1% false positives
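
The two ideas above, weighting the rare class and capping false positives, can be sketched as follows. The weights use the common "balanced" heuristic (total / (2 x class count)), and the threshold search keeps recall as high as possible under the cap. The function names and the toy counts and scores are assumptions, not the project's code or PaySim figures.

```python
def balanced_class_weights(n_negative, n_positive):
    """Weight each class inversely to its frequency."""
    total = n_negative + n_positive
    return {0: total / (2 * n_negative), 1: total / (2 * n_positive)}


def false_positive_rate(y_true, scores, threshold):
    """Share of non-fraud samples scored at or above the threshold."""
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not negatives:
        return 0.0
    return sum(s >= threshold for s in negatives) / len(negatives)


def lowest_threshold_within_fpr(y_true, scores, max_fpr=0.01):
    """Return the lowest score threshold whose FPR stays within max_fpr.

    A lower threshold flags more transactions, so the lowest admissible
    one gives the highest recall under the cap. Returns None if no
    observed score qualifies.
    """
    for threshold in sorted(set(scores)):
        if false_positive_rate(y_true, scores, threshold) <= max_fpr:
            return threshold
    return None


# Toy example: 990 legitimate and 10 fraudulent transactions.
weights = balanced_class_weights(n_negative=990, n_positive=10)

# Toy scores with a loose 25% cap so the effect is visible on six samples.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
threshold = lowest_threshold_within_fpr(y_true, scores, max_fpr=0.25)
```

On a real validation set, `max_fpr=0.01` would correspond to the ~1% false positive target noted above.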

Project #3 - Image Super-Resolution Via a Convolutional Neural Network

Key Notes

SRCNN comprises three convolutional blocks corresponding to patch extraction, non-linear mapping, and reconstruction

SRCNN surpasses non-neural methods for the task of super-resolution
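
To get a feel for how small the three-block network is, the sketch below counts its weights. The 9-1-5 kernel sizes with 64 and 32 filters are the baseline settings from the original SRCNN paper, and the single input channel is an assumption; this project may have used different values.

```python
def conv_params(kernel, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one 2D convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# (stage, kernel size, input channels, output channels), assumed values.
layers = [
    ("patch extraction", 9, 1, 64),
    ("non-linear mapping", 1, 64, 32),
    ("reconstruction", 5, 32, 1),
]

per_layer = {name: conv_params(k, c_in, c_out) for name, k, c_in, c_out in layers}
total_with_bias = sum(per_layer.values())
total_without_bias = sum(
    conv_params(k, c_in, c_out, bias=False) for _, k, c_in, c_out in layers
)
```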

Image Super-Resolution using an Efficient Sub-Pixel CNN

Using The Super Resolution Convolutional Neural Network for Image Restoration

Project #4 - Explore Co-clustering on Job Applications

Kaggle Job Recommendation Challenge

~ 1.6m unique job applications
~ 360k unique jobs
~ 320k unique job applicants

Report - Explore Co-clustering on Job Applications

Project #5 - Fake News Identification

Key Notes

Tokenize the body and headline with the Punkt sentence tokenizer from the NLTK NLP library. This tokenizer uses an unsupervised machine learning algorithm pre-trained on a general English corpus, and can distinguish between sentence punctuation marks and the position of words in a sentence.
Tokenize words with our algorithm, and take care of lemmatization.
Tag each sample with the tokens obtained from entire headline set, and body set.

Poster - Link

Keep Exploring!!!