Extracting Data From Elasticsearch With Python (Scan API)

Executive Summary

Sometimes you need an easy way to save the full contents of an index to disk. The Python client's helpers module provides a scan API that makes this easy.

helpers.scan

The code below illustrates how to use this capability. At a high level, the steps are:

* Import the required packages
* Set up the connection and output variables
* Create the scan iterator
* Write all the data from the iterator to disk

## Load in Libraries
from elasticsearch import helpers
from elasticsearch import Elasticsearch
import json

## Set variables
elasticProtocol = 'http'
elastichost     = 'localhost'
elasticPrefix   = 'elasticsearch'
elasticport     = '9200'
elasticUser     = 'user'
elasticPassword = 'password'
elasticIndex    = 'my-index'
actions         = []
fileRecordCount = 160000 
fileCounter     = 0

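## Generate RFC-1738 formatted URL
## Note: a user name or password containing characters such as '@', ':' or '/' must be percent-encoded first
elasticURL = '%s://%s:%s@%s:%s/%s' % (elasticProtocol, elasticUser, elasticPassword, elastichost, elasticport, elasticPrefix)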

## Create Connection to Elasticsearch
es = Elasticsearch([elasticURL],verify_certs=True)

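## Create the scan iterator (documents are fetched lazily, one batch at a time)
## doc_type is omitted: mapping types are deprecated in Elasticsearch 7.x and removed in 8.x
output = helpers.scan(es,
    index=elasticIndex,
    size=1000,                              # documents per scroll batch; can be increased
    query={"query": {"match_all": {}}},
)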

## Write Everything Out to Disk
for record in output:
    actions.append(record['_source'])
    if len(actions) >= fileRecordCount:
        with open(elasticIndex + '-extract-' + str(fileCounter) + '.json', 'w') as f:
            json.dump(actions, f, ensure_ascii=False, indent=4, sort_keys=True)
        actions = []
        print('file ' + str(fileCounter) + ' written')
        fileCounter = fileCounter + 1

if len(actions) > 0:
    with open(elasticIndex + '-extract-' + str(fileCounter) + '.json' , 'w') as f:
        json.dump(actions, f, ensure_ascii=False, indent=4, sort_keys=True)
    print('file ' + str(fileCounter) + ' written')
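
The write loop above repeats the file-writing code for full batches and again for the final partial batch. As a minimal sketch of one way to fold both cases into a single helper, the function below uses itertools.islice to pull a fixed number of records at a time from any iterator. The name write_in_chunks and its parameters (prefix, chunk_size, out_dir) are hypothetical and not part of the Elasticsearch client, and the demo at the bottom uses stand-in data shaped like scan hits so it runs without a cluster.

```python
import json
import os
import tempfile
from itertools import islice


def write_in_chunks(records, prefix, chunk_size, out_dir="."):
    """Write an iterable of dicts to numbered JSON files; return the paths written.

    Hypothetical helper for illustration; not part of the elasticsearch package.
    """
    iterator = iter(records)
    paths = []
    file_counter = 0
    while True:
        # islice pulls at most chunk_size records, leaving the rest unread
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        path = os.path.join(out_dir, "%s-extract-%d.json" % (prefix, file_counter))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(chunk, f, ensure_ascii=False, indent=4, sort_keys=True)
        paths.append(path)
        file_counter += 1
    return paths


if __name__ == "__main__":
    # Stand-in data (an assumption): dicts with a '_source' key, like scan hits
    fake_hits = ({"_source": {"id": i}} for i in range(5))
    sources = (hit["_source"] for hit in fake_hits)
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_in_chunks(sources, "my-index", 2, out_dir=tmp):
            print(os.path.basename(p))
```

With the script above, the equivalent call would be write_in_chunks((record['_source'] for record in output), elasticIndex, fileRecordCount). Only one batch is held in memory at a time, the same as in the original loop.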