- Preprocessing (feature engineering) is the longest and most difficult phase
- Translate raw data into features based on domain knowledge
- Create good features, including synthetic features derived from existing ones (see the sketch after this list)
- Form a reasonable hypothesis about which features matter
- Different problems in the same domain may need different features
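A minimal sketch of a synthetic feature, assuming a pandas DataFrame named features with hypothetical price and sqft columns:

import pandas as pd

# Hypothetical housing data; the column names are assumptions for illustration
features = pd.DataFrame({'price': [450000, 320000, 610000],
                         'sqft': [1500, 1100, 2400]})

# Synthetic feature: combine raw columns into a single, more informative signal
features['price_per_sqft'] = features['price'] / features['sqft']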
# Re-scale the feature to the range 0 to 1 (min-max scaling)
# Re-scaling improves the performance of gradient descent
features['price'] = (features['price'] - features['price'].min()) / (features['price'].max() - features['price'].min())
import tensorflow as tf

# Categorical columns
# One-hot encoding via a vocabulary list
city = tf.feature_column.categorical_column_with_vocabulary_list(
    'city', vocabulary_list=['San Diego', 'Los Angeles', 'San Francisco', 'Sacramento'])
# Preprocessing technique
# Cap the feature to a fixed range of values
features['capped_rooms'] = tf.clip_by_value(features['rooms'], clip_value_min=0, clip_value_max=4)
import numpy as np

# Bucketize a numeric column into discrete latitude bins
lat = tf.feature_column.numeric_column('latitude')
dlat = tf.feature_column.bucketized_column(lat, boundaries=np.arange(32, 42, 1).tolist())
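As a minimal sketch, assuming the columns defined above, the feature columns can be handed to a canned TF 1.x estimator; the LinearRegressor here is an illustration, not necessarily the course's model:

# One-hot encode the categorical city column for use in a linear model
city_onehot = tf.feature_column.indicator_column(city)

# Assemble the feature columns and pass them to a canned estimator
feature_cols = [tf.feature_column.numeric_column('price'), city_onehot, dlat]
model = tf.estimator.LinearRegressor(feature_columns=feature_cols)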
BigQuery
- Fully managed data warehouse (DW)
- Compute aggregates
- Compute statistics (see the query sketch below)
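A minimal sketch of computing an aggregate with the BigQuery Python client library; the project, dataset, and table names are hypothetical:

from google.cloud import bigquery

# The client picks up the project and credentials from the environment
client = bigquery.Client()

# Server-side aggregate over a hypothetical table: average price per city
query = """
    SELECT city, AVG(price) AS avg_price
    FROM `my_project.housing.sales`
    GROUP BY city
"""
for row in client.query(query).result():
    print(row.city, row.avg_price)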
Cloud Dataflow
- For streaming data pipelines
- Time-windowed statistics
- Operates on Google Cloud Storage data
- Changes data from one format to another
- Processed data: BigQuery -> Cloud Storage
- Python- and Java-based pipelines (see the pipeline sketch after this list)
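A minimal sketch of such a pipeline with the Apache Beam Python SDK (the SDK that Dataflow executes); the bucket and file paths are hypothetical:

import apache_beam as beam

# Read raw CSV lines from Cloud Storage, change the format, write the result back;
# pass --runner=DataflowRunner (plus project/region options) to execute on Dataflow
with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my_bucket/raw/houses.csv')
     | 'ToTSV' >> beam.Map(lambda line: line.replace(',', '\t'))
     | 'Write' >> beam.io.WriteToText('gs://my_bucket/processed/houses'))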
- For scaling, we use Cloud ML Engine
- Train the model
- Monitor training
- Deploy it as a microservice
- Batching and distributed training
- Host as a REST API (see the prediction sketch after this list)
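A minimal sketch of calling a model hosted by Cloud ML Engine as a REST API, using the Google API Python client; the project and model names, and the input schema, are hypothetical:

from googleapiclient import discovery

# Build a client for the Cloud ML Engine ('ml', v1) REST API
service = discovery.build('ml', 'v1')

# Hypothetical deployed model; instances must match the model's expected inputs
name = 'projects/my-project/models/housing_model'
response = service.projects().predict(
    name=name,
    body={'instances': [{'sqft': 1500, 'city': 'San Diego'}]}
).execute()
print(response['predictions'])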
- Managed through the GCP Console
- Uses the existing Google tools in GCP
- Specify the region and the bucket that hosts the code
- Walkthrough of the commands and project structure used to execute the project on Google ML (see the trainer sketch after this list)
- Copying data to Google Cloud Storage
- Format for specifying training / test data
- Using TensorBoard
- Hosting as a REST API
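A minimal sketch of a trainer entry point in the usual trainer/task.py layout; the flag names, paths, and model are assumptions for illustration, not the course's exact project structure:

import argparse
import tensorflow as tf

# Hypothetical flags for the training/eval data and the output directory;
# Cloud ML Engine passes --job-dir when the training job is submitted
parser = argparse.ArgumentParser()
parser.add_argument('--train_data_path', default='gs://my_bucket/data/train.csv')
parser.add_argument('--eval_data_path', default='gs://my_bucket/data/eval.csv')
parser.add_argument('--job-dir', default='gs://my_bucket/output')
args = parser.parse_args()

# Checkpoints and summaries land under job_dir, so TensorBoard can be pointed at it
feature_cols = [tf.feature_column.numeric_column('sqft')]
model = tf.estimator.LinearRegressor(feature_columns=feature_cols, model_dir=args.job_dir)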
- Federated Learning