"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

August 05, 2021

Research paper reads: connected cars

Data science applications to connected vehicles

Key Notes

  • Data generated by sensors and actuators in connected vehicles are noisy, anomalous, redundant, rapidly changing, correlated, and heterogeneous.

Main findings

  • Multitude of formats and data types 
  • Data in connected vehicles are generated and collected at high speed

Applications

Mobility

  • Understand patterns and trends in mobility data
  • Predicting traffic flow
  • Provide shortest or alternative routes

Safety

  • Driver behaviour and performance analysis
  • Infer real-time environmental conditions
  • Lane-changing assistance
  • Understand interactions between drivers and pedestrians at signalized intersections

Support

  • Guidance to parking spaces
  • Driver behaviour analysis (e.g., in the insurance domain, calculating a safety score for the driver: pay-how-you-drive premiums instead of premiums based on population groups)
  • Vehicle predictive maintenance

Connected vehicle data 

  • 560 GB/day
  • Data generated in CVs exhibit either temporal correlation, spatial correlation or both
  • A stream is a sequence of data elements ordered by time
  • Streams can be discrete signals, event logs, or any combination of time series data
  • Drift is associated more with gradual changes in the target concept
  • Sensory data stream
  • Spatial, temporal, and spatio-temporal attributes
  • Existence of missing data (absent readings). 
  • Real-time data cleaning
  • Knowledge discovery from data streams
  • Data windows are a way of looking at relevant slices of a data stream. 
  • Windowing models: landmark, tilted, sliding, and damped windows

Anomaly detection depends on these properties. For every data type, we may need to look at how the data behave over time: recurring patterns, gradual increases, sudden increases, and lows and highs.

In the stock market this is done with candlestick patterns, looking over durations of 3 months or 6 months to see whether a pattern emerges. What counts as an anomaly depends on the use case, but the properties of the data (sudden, incremental, gradual, recurring) help spot anomalies by comparing historical and current observations.
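As a rough illustration of comparing historical and current observations, the sketch below flags a sudden change with a z-score test. The function name, the threshold of 3 standard deviations, and the sample speed readings are assumptions made for this example, not something taken from the paper, and the test only covers the "sudden" case.

```python
from statistics import mean, stdev

def is_sudden_change(history, current, threshold=3.0):
    """Return True if `current` deviates strongly from past readings.

    `history` needs at least two values; `threshold` (in standard
    deviations) is an assumed, use-case-dependent setting.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past readings were identical: any different value stands out.
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical speed readings (km/h) from a steady stretch of road.
history = [60, 62, 61, 59, 60, 61]
normal_reading = is_sudden_change(history, 61)
spike_reading = is_sudden_change(history, 95)
```

Gradual or recurring changes would need a different comparison, for example against the same hour on previous days.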

Sliding Window. Given a window with width w and current time point t, the interest is in the frequent patterns occurring in the window [t − w + 1, t].
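A minimal sketch of the sliding window, assuming one record arrives per time point and that "frequent patterns" can be approximated by simple item counts. The function name and the event labels are made up for the example.

```python
from collections import Counter, deque

def sliding_window_counts(stream, w):
    """Yield (t, counts) for the items inside the window [t - w + 1, t]."""
    window = deque()
    counts = Counter()
    for t, item in enumerate(stream, start=1):
        window.append(item)
        counts[item] += 1
        if len(window) > w:
            # The oldest record falls out of the window.
            old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        yield t, dict(counts)

# Hypothetical driving events, one per time point.
events = ["brake", "brake", "turn", "brake", "turn", "turn"]
snapshots = list(sliding_window_counts(events, w=3))
```

At t = 6 the window covers time points 4 to 6, so only the last three events contribute to the counts.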

Landmark Window identifies relevant points (the landmark) in the data stream and the aggregate operator uses all records seen so far after the landmark.
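The landmark window can be sketched as a running aggregate that restarts whenever a landmark is seen. Here the aggregate is a mean, the landmark record itself is included in the new window, and a speed of 0 is treated as a landmark; all three choices are assumptions for illustration only.

```python
def landmark_mean(stream, is_landmark):
    """Yield the running mean of all records seen since the latest landmark."""
    count, total = 0, 0.0
    for x in stream:
        if is_landmark(x):
            # A new landmark: forget everything seen before it.
            count, total = 0, 0.0
        count += 1
        total += x
        yield total / count

# Hypothetical speed readings; a reading of 0 marks the start of a new trip.
speeds = [10, 20, 0, 30, 60]
means = list(landmark_mean(speeds, is_landmark=lambda x: x == 0))
```

Unlike the sliding window, the landmark window keeps growing until the next landmark appears.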

Damped Window Model. This model assigns greater weight to more recently arrived transactions.
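One common way to give recent records more weight is exponential decay, where a record that is k steps old gets weight decay^k. The sketch below uses that scheme to compute a damped mean; the decay factor of 0.5 and the function name are assumptions, and the paper may use a different weighting function.

```python
def damped_mean(stream, decay=0.5):
    """Yield a mean in which a record k steps old has weight decay**k."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x in stream:
        # Age every earlier record by one step, then add the new one at weight 1.
        weighted_sum = decay * weighted_sum + x
        weight_total = decay * weight_total + 1.0
        yield weighted_sum / weight_total

readings = [0, 0, 8]
damped = list(damped_mean(readings, decay=0.5))
plain_mean = sum(readings) / len(readings)
```

With these readings the damped mean ends above the plain mean, because the most recent value (8) counts more than the two older zeros.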

Data pre-processing

  • Noise filtration
  • Outlier detection
  • Anomaly detection
  • Feature extraction
  • Sparsity handling

Knowledge management

  • On-device
  • On-edge
  • Remote

Algorithms for clustering data streams

  • STREAM and CluStream algorithms

State-of-the-art on clustering data streams

  • CluStream [1], DenStream [2], StreamKM++ [3], and ClusTree
  • DenStream [2] is an extension of the DBSCAN algorithm
  • StreamKM++ [3] is an extension of k-means++, and StrAP [4] of Affinity Propagation (AP)

In a record-at-a-time processing model, long-running stateful operators process records as they arrive, update their internal state, and send out new records.

The micro-batching processing model runs each streaming computation as a series of deterministic batch computations on small time intervals.
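The two processing models can be contrasted with a running sum. This is a simplified sketch: batches are formed by record count rather than by time interval, and all function names are hypothetical.

```python
from itertools import islice

def record_at_a_time(records):
    """Stateful operator: update state and emit an output for every record."""
    total = 0  # internal state
    for r in records:
        total += r
        yield total

def micro_batches(records, batch_size):
    """Group records into small batches (by count here, for simplicity)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def micro_batch_sums(records, batch_size):
    """Run one deterministic batch computation per micro-batch."""
    total = 0
    for batch in micro_batches(records, batch_size):
        total += sum(batch)
        yield total

readings = [1, 2, 3, 4, 5]
per_record = list(record_at_a_time(readings))
per_batch = list(micro_batch_sums(readings, batch_size=2))
```

Both variants end with the same total; the micro-batch version simply emits fewer intermediate results.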

CluStream: the idea behind the CluStream [1] method is to divide the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics.

StreamKM++ [3] is a two-phase (online-offline) algorithm which maintains a small sketch of the input data using the merge-and-reduce technique.

StrAP [4] is an extension of the Affinity Propagation (AP) [44] algorithm for data streams, which uses a reservoir for saving potential outliers.

DenStream [2] is a density-based data stream clustering algorithm that also uses a feature vector based on the CF vector.
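The CF vector mentioned above can be illustrated with the basic triple (N, LS, SS): the number of points, their linear sum, and their squared sum. The sketch below is one-dimensional for brevity and leaves out the temporal and decay information that stream algorithms such as CluStream and DenStream add on top; the class and method names are assumptions for this example.

```python
import math
from dataclasses import dataclass

@dataclass
class ClusterFeature:
    n: int = 0        # number of points
    ls: float = 0.0   # linear sum of the points
    ss: float = 0.0   # sum of squared points

    def add(self, x):
        """Absorb one point without storing it."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        """CF vectors are additive, so merging is component-wise addition."""
        return ClusterFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the points from the centroid."""
        variance = self.ss / self.n - self.centroid() ** 2
        return math.sqrt(max(variance, 0.0))

a = ClusterFeature()
for x in [1.0, 2.0, 3.0]:
    a.add(x)

b = ClusterFeature()
for x in [4.0, 5.0]:
    b.add(x)

merged = a.merge(b)
```

Because the summary is additive, the centroid and radius of a merged cluster are available without revisiting the raw points, which is what makes this kind of summary practical for streams.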

SOStream [50] is a density-based clustering algorithm inspired by both the principle of the DBSCAN algorithm and self-organizing maps (SOM).


Keep Thinking!!!