Elasticsearch
- Elasticsearch is an analytics and full-text search engine
- Analyze application logs and system metrics: Application Performance Management (APM)
- Forecasting
Elastic Stack
- Kibana: Analytics and visualization platform. It's a dashboard, a web interface to the data in ES. Often used for log analysis
- Logstash: Data processing pipeline. It's a way to feed data into ES
- X-Pack: Security, monitoring, alerting, reporting, machine learning, graph exploration, Elasticsearch SQL
- Beats:
  - Filebeat: ships log files to Logstash
  - Metricbeat: system-level metrics such as CPU and memory
  - Packetbeat: ...
Elasticsearch logical concepts
Document
- A document is a record of data. It's stored in JSON format
- The schema for the document is the definition of what sort of information is stored. It's stored in the index
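
As a minimal sketch, a document for a hypothetical movies index could look like the following. The field names (`title`, `year`, `genres`) are illustrative assumptions and not part of any required schema.

```python
import json

# A hypothetical movie document; the field names are illustrative only.
movie = {
    "title": "Interstellar",
    "year": 2014,
    "genres": ["Sci-Fi", "Adventure"],
}

# Documents are exchanged with Elasticsearch as JSON text.
as_json = json.dumps(movie)
print(as_json)

# Parsing the JSON text gives back the same record.
restored = json.loads(as_json)
print(restored == movie)
```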
Index
- The index is where the documents are stored. It groups similar data, e.g., movies, ratings, users
- The index defines the individual fields and what data types they accept
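
To make the field definitions concrete, here is a sketch of a request body that creates an index with an explicit mapping. It assumes Elasticsearch 7 or later, where mappings no longer include a type name. The index name `movies` and the three fields are hypothetical.

```python
import json

# Hypothetical index name and fields, chosen only for illustration.
index_name = "movies"
create_index_body = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},      # analyzed for full-text search
            "year": {"type": "integer"},    # numeric, supports range queries
            "genres": {"type": "keyword"},  # exact-match values, not analyzed
        }
    }
}

# This JSON would be sent as the body of the index-creation request.
print(f"PUT /{index_name}")
print(json.dumps(create_index_body, indent=2))
```

The `text` versus `keyword` choice is the main design decision here: `text` fields are tokenized into the inverted index, while `keyword` fields are stored as a single exact value.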
Inverted index
- Maps each word to the documents where it appears
- Inverted indexes quickly map search terms to documents
- Example:
  - Document 1: Space: The final frontier. These are the voyages...
  - Document 2: He's bad, he's number one. He's the space cowboy with the laser gun!

| Word | Occurrence |
|---|---|
| space | 1,2 |
| the | 1,2 |
| final | 1 |
| frontier | 1 |
| he | 2 |
| bad | 2 |
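
The table above can be reproduced with a short sketch. The tokenizer here is a deliberate simplification (lowercase, split on anything that is not a letter); real Elasticsearch analyzers are configurable and may produce different terms.

```python
import re
from collections import defaultdict

documents = {
    1: "Space: The final frontier. These are the voyages...",
    2: "He's bad, he's number one. He's the space cowboy with the laser gun!",
}

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted_index = build_inverted_index(documents)
print(inverted_index["space"])  # [1, 2]
print(inverted_index["final"])  # [1]
print(inverted_index["he"])     # [2]
```

Looking up a search term is then a single dictionary access, which is why the structure answers term queries quickly.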
Term Frequency (TF) & Inverse Document Frequency (IDF)
- Term Frequency: How often a term appears in a given document
- Document Frequency: How often a term appears across all documents
- Term Frequency / Document Frequency: Measures the relevance of the term in a document
- TF*IDF: Term Frequency multiplied by Inverse Document Frequency (1 / Document Frequency), which is the same ratio as above
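
A minimal sketch of the ratio described above, using the two example documents from the inverted index section. It assumes document frequency means the number of documents that contain the term. Production scoring functions typically add logarithms and length normalization, which this sketch omits.

```python
import re

documents = {
    1: "Space: The final frontier. These are the voyages...",
    2: "He's bad, he's number one. He's the space cowboy with the laser gun!",
}

def tokenize(text):
    """Lowercase the text and split it into runs of letters (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequency(term, doc_id):
    """Number of times the term occurs in one document."""
    return tokenize(documents[doc_id]).count(term)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for text in documents.values() if term in tokenize(text))

def relevance(term, doc_id):
    """Simplified TF * IDF, computed as TF / DF."""
    df = document_frequency(term)
    return term_frequency(term, doc_id) / df if df else 0.0

print(relevance("space", 1))  # 0.5: one occurrence, term is in both documents
print(relevance("he", 2))     # 3.0: three occurrences, term is in one document
```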
Elasticsearch Scaling
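- An index is split into shards, and each document is assigned to a shard by hashing its routing value (the document ID by default). An index has one or more primary shards and, optionally, replica shards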
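
The routing step can be sketched as a hash followed by a modulo. This is a simplified model: Elasticsearch uses a Murmur3 hash internally, and `zlib.crc32` is only a stand-in here. The function name `shard_for` and the document IDs are hypothetical.

```python
import zlib

def shard_for(routing_value, number_of_primary_shards):
    """Pick a shard number from the routing value (simplified model).

    crc32 is a stand-in for the hash Elasticsearch actually uses.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

doc_ids = [f"doc-{i}" for i in range(100)]

# Placement with 5 primary shards versus 6 primary shards.
placement_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

moved = [d for d in doc_ids if placement_5[d] != placement_6[d]]
print(f"{len(moved)} of {len(doc_ids)} documents would land on a different shard")
```

Because the shard number depends on the primary shard count, changing that count would send lookups for existing documents to the wrong shard, which is why it is fixed unless the data is re-indexed.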
Shards
- A shard is a self-contained instance of Lucene
- Shards can be located across different nodes in a cluster
- Primary shard:
  - Receives read and write requests.
  - Replicates writes to its replica shards.
  - If a primary shard fails, ES promotes one of its replicas to be the new primary shard
  - The number of primary shards cannot be changed later (unless you re-index everything)
- Replica shard:
  - Serves only read requests from clients
  - Adding more replica shards increases the read throughput
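
As a rough sketch, shard counts are set through the index settings `number_of_shards` and `number_of_replicas`. The values below are arbitrary examples. Note that `number_of_replicas` is counted per primary shard.

```python
import json

# Example values only; pick counts that fit the cluster size and data volume.
settings_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards, fixed at index creation
        "number_of_replicas": 2,  # replica copies of each primary shard
    }
}

primaries = settings_body["settings"]["number_of_shards"]
replicas_per_primary = settings_body["settings"]["number_of_replicas"]

# Every primary shard plus all of its replicas.
total_shards = primaries * (1 + replicas_per_primary)

print(json.dumps(settings_body, indent=2))
print(f"Total shards in the cluster for this index: {total_shards}")  # 9
```

Unlike the primary shard count, the replica count can be adjusted on an existing index, so read capacity can be raised later without re-indexing.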