Introduction

I have been using AWS Elasticsearch for near-real-time analysis of API server logs. AWS provides a built-in Elasticsearch subscription filter for CloudWatch, so with less than an hour of effort I can spin up an Elasticsearch cluster to visualize and analyze server logs. However, AWS Elasticsearch is not cheap in a production setup, and as logs accumulate it needs some maintenance (lifecycle policy, JVM pressure, etc.). The dev team was also not making full use of the Elastic Stack. I had been meaning to decommission Elasticsearch and find an alternative for several months, and I finally decided it was time to migrate to another solution for the following reason.

Amazon: NOT OK - why we had to change Elastic licensing

After some research, I listed the requirements for the substitute.

CloudWatch Logs is the entry point for all of the logs (nginx, API logs, etc.). There were a couple of alternatives I could think of.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7cfa8b93-d3e2-4395-8ea2-3c2ffa359840/Untitled.png

I was initially inclined to choose something from AWS, since I prefer to have fewer moving pieces and do less coding. However, I am already using BigQuery as a data warehouse. I concluded that if I was going to stash data somewhere, I might as well centralize all of it in a single place, so I chose to go with Lambda + BigQuery. For this post I will use nginx logs as the example, since the format is familiar to most people.
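
To make the Lambda side of this a bit more concrete, here is a minimal sketch of how a function could unpack a CloudWatch Logs subscription event and turn nginx access log lines into row dictionaries. It assumes nginx is writing the default "combined" log format. The function names (decode_cloudwatch_event, parse_nginx_line, rows_from_event) and the field names are hypothetical, chosen only for illustration, and the step that actually sends the rows to BigQuery is left out.

```python
import base64
import gzip
import json
import re

# Assumption: nginx uses its default "combined" access log format.
NGINX_COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def decode_cloudwatch_event(event):
    """Return the raw log messages from a CloudWatch Logs subscription event.

    CloudWatch delivers the payload as base64-encoded, gzip-compressed JSON
    under event["awslogs"]["data"].
    """
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return [log_event["message"] for log_event in payload["logEvents"]]


def parse_nginx_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = NGINX_COMBINED.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # Convert the numeric fields so they can map to integer columns.
    row["status"] = int(row["status"])
    row["body_bytes_sent"] = int(row["body_bytes_sent"])
    return row


def rows_from_event(event):
    """Decode the event and keep only the lines that parse as nginx access logs."""
    parsed = (parse_nginx_line(line) for line in decode_cloudwatch_event(event))
    return [row for row in parsed if row is not None]
```

Lines that do not match the pattern are silently dropped in this sketch; in a real setup you would probably want to count or log them instead.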

Load vs. Stream

Before going into the actual setup process, I must point out a couple of things, such as the distinction between load and streaming. A load is when you load a batch of data, either once or on a recurring schedule. In BigQuery, loading is free and you are only charged for storage. Streaming, however, incurs a charge: streamed data is billed at $0.010 per 200 MB.
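
As a rough back-of-the-envelope check, the streaming rate above can be turned into a small estimate. This is only a sketch: it uses the $0.010 per 200 MB figure quoted in this post, ignores any per-row minimums or other billing rules that may apply, and the function name estimate_streaming_cost and the 500 MB per day volume are made up for illustration.

```python
# Rate quoted in this post; actual pricing may differ or change over time.
STREAMING_PRICE_USD = 0.010
STREAMING_UNIT_MB = 200


def estimate_streaming_cost(volume_mb):
    """Estimate the streaming charge in USD for a given volume in MB."""
    if volume_mb < 0:
        raise ValueError("volume_mb must be non-negative")
    return volume_mb / STREAMING_UNIT_MB * STREAMING_PRICE_USD


if __name__ == "__main__":
    # Hypothetical workload: 500 MB of logs per day over a 30-day month.
    monthly_mb = 500 * 30
    print(f"Estimated streaming cost: ${estimate_streaming_cost(monthly_mb):.2f}")
```

At log volumes like this the streaming charge is small, so the choice between load and streaming is more about how fresh the data needs to be than about cost.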

Prerequisite
