"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

August 15, 2021

Research Paper Reads - Logs Monitoring

Ideas need a bird's-eye view of the landscape to understand existing work, and papers are the way to get that. Bookmarking a few notes for future reference.

Paper #1 - Log-based software monitoring: a systematic mapping study

Key Notes

  • The lifecycle of a log

  • Possible components would be Elasticsearch, Logstash, and Kibana
  • Kibana provides an interface for visualization, query, and exploration of log data

  • LOGGING - (1) empirical studies on logging practices, (2) requirements for application logs, and (3) implementation of log statements
  • LOG INFRASTRUCTURE - (1) log parsing, and (2) log storage
  • LOG ANALYSIS - (1) anomaly detection, (2) security and privacy, (3) root cause analysis, (4) failure prediction, (5) quality assurance, (6) model inference and invariant mining, (7) reliability and dependability, and (8) log platforms
  • Log parsing - based on the “textual similarity” between log messages
  • Each log is converted to a binary vector, with each element representing whether the log contains that keyword
  • Transformer - TEMPLATE2VEC (as an alternative to WORD2VEC) to represent templates extracted from logs, and LSTMs to learn common patterns across log sequences
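The binary keyword-vector idea above can be sketched in a few lines; the keyword vocabulary and log line below are hypothetical examples, not from any paper's dataset:

```python
import re

# Hypothetical keyword vocabulary; a real system would derive this
# from the log corpus itself.
KEYWORDS = ["error", "timeout", "connect", "retry", "disk"]

def to_binary_vector(log_line: str) -> list[int]:
    """Map a log line to a 0/1 vector over the keyword vocabulary."""
    tokens = set(re.findall(r"[a-z0-9]+", log_line.lower()))
    return [1 if kw in tokens else 0 for kw in KEYWORDS]

print(to_binary_vector("ERROR: connect timeout on node-3"))
# [1, 1, 1, 0, 0]
```

Each element of the vector simply records whether the log line contains that keyword, which is the representation the paper describes.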

Root Cause Analysis

  • By correlating log messages and resource consumption, their approach builds relationships between changes in resource consumption and application events.
  • They propose a technique based on the correlation of console logs and resource usage information to link jobs with anomalous behavior to erroneous nodes.

Failure Prediction

  • Utilize system logs to predict failures by mining recurring event sequences that are correlated

Paper #2 - Multi-Source Anomaly Detection in Distributed IT Systems

Key Notes

  • Three categories/modalities: metrics, application logs, and distributed traces
  • Word frequencies and metrics derived from the logs (e.g., TF-IDF)
  • Decompose the trace into its building blocks, the events/spans, and predict the next span in the sequence
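A minimal, stdlib-only sketch of TF-IDF weighting over log messages; the toy corpus below is made up, and a real pipeline would typically use a library implementation:

```python
import math
from collections import Counter

# Toy log corpus (hypothetical lines).
docs = [
    "connection timeout on node 3",
    "connection established on node 3",
    "disk full on node 7",
]

def tf_idf(corpus):
    """Return one {term: tf-idf score} dict per document."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    # Document frequency: in how many docs each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append(
            {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        )
    return scores

scores = tf_idf(docs)
# "timeout" occurs in only one doc, so it outweighs "node",
# which occurs everywhere and scores 0.
```

Terms that appear in every log message (like routine tokens) get weight zero, while rare, discriminative tokens get high weight, which is why TF-IDF features are useful for anomaly detection.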

Paper #3 - LogBERT: Log Anomaly Detection via BERT

Key Notes

  • LogBERT leverages the Transformer encoder to model log sequences and is trained by novel self-supervised tasks to capture the patterns of normal sequences.

Baselines

  • Principal Component Analysis (PCA) [19]. PCA builds a counting matrix based on the frequency of log key sequences and then reduces it into a low-dimensional space to detect anomalous sequences.
  • One-Class SVM (OCSVM) [14]. One-Class SVM is a well-known one-class classification model, widely used for log anomaly detection [5,16] by observing only the normal data.
  • IsolationForest (iForest) [7]. Isolation forest is an unsupervised learning algorithm for anomaly detection that represents features as tree structures.
  • LogCluster [6]. LogCluster is a clustering-based approach in which anomalous log sequences are detected by comparing them with the existing clusters.
  • DeepLog [2]. DeepLog is a state-of-the-art log anomaly detection approach that adopts a recurrent neural network to capture patterns of normal log sequences and identifies anomalous sequences based on the performance of log key predictions.
  • LogAnomaly [23]. LogAnomaly is a deep learning-based anomaly detection approach that can detect both sequential and quantitative log anomalies.
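The cluster-comparison idea behind the LogCluster baseline can be sketched roughly as distance-to-centroid checks over count vectors; the centroids and threshold below are made-up illustration values, not the paper's actual procedure:

```python
import math

# Made-up centroids of clusters built from normal log sequences,
# where each vector is a log event count vector.
normal_centroids = [
    [2.0, 1.0, 0.0],   # e.g. read-heavy sequences
    [0.0, 1.0, 2.0],   # e.g. write-heavy sequences
]

def is_anomalous(vec, centroids, threshold=1.5):
    """Flag a sequence whose count vector is far from every cluster."""
    nearest = min(math.dist(vec, c) for c in centroids)
    return nearest > threshold

print(is_anomalous([2, 1, 0], normal_centroids))   # near a centroid -> False
print(is_anomalous([5, 5, 5], normal_centroids))   # far from both -> True
```

The threshold choice is the crux in practice; too tight and routine variation is flagged, too loose and real anomalies blend in.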

Paper #4 - A Survey on Automated Log Analysis for Reliability Engineering

  • Log event sequence: a sequence of log events recording the system's activities
  • Log event count vector: a feature vector recording log event occurrences
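Building a log event count vector as defined above is straightforward; the event vocabulary here is hypothetical:

```python
from collections import Counter

# Hypothetical fixed event vocabulary (log keys / template IDs).
EVENTS = ["E1", "E2", "E3", "E4"]

def count_vector(event_sequence):
    """Turn a log event sequence into its count vector over EVENTS."""
    counts = Counter(event_sequence)
    return [counts[e] for e in EVENTS]

print(count_vector(["E1", "E2", "E1", "E4"]))
# [2, 1, 0, 1]
```

The sequence preserves ordering information; the count vector discards it but gives a fixed-length feature vector that models like PCA or iForest can consume directly.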


Analysis Insights / Thoughts
  • Queries for selected values vs. bulk upload of data
  • Usage patterns segmented by weekday/weekend and trading hours
  • Usage patterns across different time zones
  • Usage patterns across different sections of applications
  • Number of ad-hoc queries
  • Restrict bulk uploads to certain time zones / non-peak hours
  • Two-stage commit - upload now, commit at a later stage
Big picture Notes
  • Limit per-user app access: e.g., 5 calls during peak hours, 10 calls during non-peak hours
  • Serve from replicated data where a stop-gap delay of up to 5 hours is acceptable
  • Pagination of results
  • Cache/reuse of results
  • Identify maximum reported errors
  • Patterns of errors over a weekday 
  • User login activities and queries
  • User value - Application usage vs Revenue
  • User Action predictions 
  • Take the top 100 users, plot their usage sequences, and look for common flows/patterns
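The peak/non-peak call limits brainstormed above could be sketched as a naive per-user counter; the peak window, the limits, and the absence of a window reset are all simplifying assumptions for illustration:

```python
from datetime import datetime

PEAK_HOURS = range(9, 17)            # assumed trading-hours window
LIMITS = {"peak": 5, "off_peak": 10}  # the note's brainstormed numbers
calls: dict[str, int] = {}            # per-user call counter (no reset)

def allow_call(user_id: str, now: datetime) -> bool:
    """Allow a call if the user is under the limit for the current window."""
    limit = LIMITS["peak"] if now.hour in PEAK_HOURS else LIMITS["off_peak"]
    calls[user_id] = calls.get(user_id, 0) + 1
    return calls[user_id] <= limit
```

A production limiter would reset counters per window (or use a token bucket) and store state outside process memory; this only shows the policy shape.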
Diagnosis Perspective
  • What blocking happens between the following query pairs?
  • Page load query vs search query
  • Search query vs data upload query
  • Data upload vs report download query
  • Measure potential data conflicts that cause issues


Execution model
  • Understand problem statement
  • Understand data sources
  • Understand data access / permissions
  • Frame NLP / Data / User level details
  • Initial Analysis Scope
  • Application Understanding
  • Connects / Feedback
Diagnosis
  • User based - Create APIs / Read APIs / Update APIs (simple/bulk) / Delete APIs (single/bulk)
  • Do we track at the level of UserId, NumberOfCalls, AvgTime?
  • Nature of transactions - real-time vs reporting vs bulk inserts vs bulk updates
  • API calls across the day by time
  • % mix of workflows and the common tables mapped/accessed by them, with a time dimension added for patterns
  • e.g., tables A, B, C accessed at time T1
  • A, B at time T2
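The UserId / NumberOfCalls / AvgTime tracking above can be sketched as a simple aggregation over hypothetical (user, duration) call records:

```python
from collections import defaultdict

# Hypothetical API call log: (user_id, duration in ms).
call_log = [
    ("u1", 120), ("u1", 80), ("u2", 300),
]

def per_user_stats(log):
    """Aggregate call count and average duration per user."""
    agg = defaultdict(list)
    for user, ms in log:
        agg[user].append(ms)
    return {
        user: {"NumberOfCalls": len(ts), "AvgTime": sum(ts) / len(ts)}
        for user, ts in agg.items()
    }

stats = per_user_stats(call_log)
# u1 -> 2 calls, avg 100.0 ms
```

The same aggregation keyed by (user, hour-of-day) would also answer the "API calls across the day by time" question above.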

More Reads

Keep Thinking!!!
