AWS Data Analytics

  • Athena
  • Redshift
    • OLAP (PostgreSQL)
  • OpenSearch
  • EMR (Elastic Map Reduce)
  • QuickSight
  • Glue
  • Lake Formation
  • Kinesis Data Analytics
  • MSK (Managed Streaming for Apache Kafka)
  • Big Data Ingestion Pipeline

# Athena

  • Serverless query service to analyze data stored in Amazon S3
  • uses standard SQL to query files (built on Presto)
  • supports CSV, JSON, ORC, Avro, Parquet
  • $5.00 per TB of data scanned
  • commonly used with Amazon QuickSight for reporting/dashboards
  • use cases
    • business intelligence / analytics / reporting
    • analyze & query VPC flow logs, ELB logs, CloudTrail trails…
  • performance improvements
    • use columnar data for cost saving (less scan)
      • Parquet, ORC
      • use Glue to convert your data
    • compress data for smaller retrievals
    • partition datasets in S3 so queries can filter on virtual columns
      • e.g. s3://athena-examples/flight/parquet/year=1991/month=1/day=1
    • use larger files (> 128 MB) to minimize overhead
  • federated query
    • allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources
    • uses data source connectors that run on AWS Lambda to run federated queries
    • stores the results back in S3
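
The partitioning and pricing bullets above can be sketched in a few lines of Python. This is only an illustration, not an Athena API: `partition_prefix` and `estimate_scan_cost` are hypothetical helpers, the price is the $5.00 per TB quoted above, a TB is assumed to be 1024^4 bytes, and the "pruned" scan size is a made-up figure.

```python
PRICE_PER_TB_USD = 5.00      # figure quoted in these notes
BYTES_PER_TB = 1024 ** 4     # assumption: binary terabyte


def partition_prefix(base: str, **partitions) -> str:
    """Build a Hive-style partition prefix such as .../year=1991/month=1/day=1."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([base.rstrip("/"), *parts])


def estimate_scan_cost(bytes_scanned: int) -> float:
    """Estimate the query cost in USD from the number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


prefix = partition_prefix(
    "s3://athena-examples/flight/parquet", year=1991, month=1, day=1
)
print(prefix)

# Hypothetical comparison: scanning 2 TB of raw data versus roughly 0.1 TB
# after partition pruning and conversion to a columnar format.
full_scan = estimate_scan_cost(2 * BYTES_PER_TB)
pruned_scan = estimate_scan_cost(BYTES_PER_TB // 10)
print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

Because pricing depends on bytes scanned, anything that shrinks the scan (partitions, columnar formats, compression) lowers the bill directly.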

# Redshift

  • Redshift is based on PostgreSQL, but it’s not used for OLTP
  • It’s OLAP – online analytical processing (analytics and data warehousing)
  • Advertised as 10x better performance than other data warehouses; scales to PBs of data
  • Columnar storage of data (instead of row based) & parallel query engine
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as Amazon QuickSight or Tableau integrate with it
  • vs Athena: faster queries / joins / aggregations thanks to sort keys and distribution keys (Redshift has no conventional indexes)
  • Redshift cluster
    • leader node - for query planning, results aggregation
    • compute node - for performing the queries, send results to leader
    • provision the node size in advance
    • use Reserved Instances for cost savings
  • snapshots & DR
    • has “multi-AZ” mode for some clusters
    • snapshots are point-in-time backups of a cluster, stored internally in S3
    • snapshots are incremental
    • capable of restoring a snapshot into a new cluster
    • automated - every 8 hours, every 5 GB, or on a schedule; set retention between 1 and 35 days
    • manual - snapshot is retained until you delete it
    • you can configure Redshift to automatically copy snapshots to another Region
  • loading data into Redshift
    • large batch inserts are much better than many small ones
  • Redshift Spectrum
    • query data that is already in S3 without loading it
    • must have a Redshift cluster available to start the query
    • the query is then submitted to thousands of Redshift Spectrum nodes
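
To illustrate the "large inserts" bullet above, here is a minimal sketch that groups rows into batches and builds one parameterized multi-row INSERT per batch. The helper names, the `events` table, and its columns are hypothetical, and the `%s` placeholder style is an assumption (it matches drivers such as psycopg2); table and column names are interpolated directly, so they must come from trusted code, not user input.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def batched(rows: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def multi_row_insert(
    table: str, columns: Sequence[str], rows: List[Sequence]
) -> Tuple[str, list]:
    """Build one parameterized multi-row INSERT statement for a batch of rows."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join([row_placeholder] * len(rows))
    )
    # Flatten the rows so the parameters line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


rows = [(1, "a"), (2, "b"), (3, "c")]
statements = [
    multi_row_insert("events", ["id", "name"], batch) for batch in batched(rows, 2)
]
for sql, params in statements:
    print(sql, params)
```

Three rows with a batch size of 2 produce two statements instead of three single-row inserts; with realistic batch sizes the reduction in round trips is much larger.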

# OpenSearch

  • With OpenSearch, you can search any field, even with partial matches
  • It’s common to use OpenSearch as a complement to another database
  • Two modes: managed cluster or serverless cluster
  • Does not natively support SQL (can be enabled via a plugin)
  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
  • Security through Cognito & IAM, KMS encryption, TLS
  • Comes with OpenSearch Dashboards (visualization; forked from Kibana)
  • patterns
    • DynamoDB table -> DynamoDB Streams -> Lambda function -> OpenSearch
    • CloudWatch Logs -> subscription filter -> Lambda (real time) / Kinesis Data Firehose (near real time) -> OpenSearch
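
The first pattern above (DynamoDB Streams -> Lambda -> OpenSearch) could look roughly like the sketch below. It only builds the newline-delimited body for the OpenSearch `_bulk` API and does not send it. Assumptions: the stream is configured to include the new item image, attributes are limited to the `S`, `N`, and `BOOL` types, and the `products` index and `id` key are hypothetical names.

```python
import json


def plain_value(attr: dict):
    """Convert a DynamoDB-typed attribute to a plain value (S, N, BOOL only)."""
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        try:
            return int(attr["N"])
        except ValueError:
            return float(attr["N"])
    if "BOOL" in attr:
        return attr["BOOL"]
    raise ValueError(f"unsupported attribute type: {attr}")


def to_bulk_body(event: dict, index: str, key_name: str) -> str:
    """Translate DynamoDB Streams records into an OpenSearch _bulk body (NDJSON)."""
    lines = []
    for record in event["Records"]:
        stream_data = record["dynamodb"]
        doc_id = str(plain_value(stream_data["Keys"][key_name]))
        if record["eventName"] == "REMOVE":
            lines.append(json.dumps({"delete": {"_index": index, "_id": doc_id}}))
        else:  # INSERT or MODIFY: index the new item image
            document = {
                name: plain_value(value)
                for name, value in stream_data["NewImage"].items()
            }
            lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps(document))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"id": {"S": "p1"}},
                "NewImage": {"id": {"S": "p1"}, "price": {"N": "10"}},
            },
        },
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"id": {"S": "p2"}}}},
    ]
}
bulk_body = to_bulk_body(sample_event, index="products", key_name="id")
print(bulk_body)
```

In a real Lambda function, the returned string would be posted to the domain's `_bulk` endpoint with a signed HTTP request; that part is omitted here.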

# EMR

  • EMR stands for “Elastic MapReduce”
  • EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
  • The clusters can be made of hundreds of EC2 instances
  • EMR comes bundled with Apache Spark, HBase, Presto, Flink…
  • EMR takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot instances
  • Use cases: data processing, machine learning, web indexing, big data…
  • node types
    • master node - manages the cluster, coordinates work, and monitors health - long running
    • core node - runs tasks and stores data - long running
    • task node - only runs tasks, usually on Spot instances
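
The three node types above map onto the instance-group structure that EMR accepts when a cluster is created. The sketch below only builds that structure as plain dictionaries and makes no AWS call; `build_instance_groups` is a hypothetical helper, and the `m5.xlarge` instance type and group names are assumptions chosen for illustration.

```python
def build_instance_groups(
    core_count: int, task_count: int, instance_type: str = "m5.xlarge"
) -> list:
    """Describe master, core, and (optionally) Spot task groups for a cluster."""
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",   # manages and coordinates the cluster
            "Market": "ON_DEMAND",      # long running, so not on Spot
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",     # runs tasks and stores data
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "task",
                "InstanceRole": "TASK",  # compute only, safe to lose
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


groups = build_instance_groups(core_count=2, task_count=4)
for group in groups:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Task nodes hold no data, which is why they are the natural place to use Spot capacity: losing one costs compute time but not storage.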

# QuickSight

  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Use cases:
    • Business analytics
    • Building visualizations
    • Perform ad-hoc analysis
    • Get business insights using data
  • Integrated with RDS, Aurora, Athena, Redshift, S3…
  • In-memory computation using SPICE engine if data is imported into QuickSight
  • Enterprise edition: option to set up column-level security (CLS)
  • dashboard & analysis
    • users (standard version) and groups (enterprise version)
      • users & groups only exist within QuickSight
    • dashboard is a read-only snapshot of an analysis, preserves the configuration of the analysis (filtering, parameters, controls, sort)
    • share the analysis or dashboard with users or groups
    • users who can see the dashboard can also see the underlying data

# Glue

  • Managed extract, transform, load (ETL) service
  • fully serverless service
  • use cases
    • convert data into parquet format
    • catalog datasets (Glue Data Catalog)
  • things to know at a high level
    • job bookmarks - prevent re-processing old data
    • elastic views
      • combine and replicate data across multiple data stores using SQL
      • Glue monitors for changes in the source data, serverless
      • leverages a “virtual table”
    • DataBrew - clean and normalize data using pre-built transformations
    • Glue Studio - new GUI to create, run and monitor ETL jobs
    • streaming ETL - compatible with Kinesis Data Streams, Kafka, MSK
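
The idea behind job bookmarks (skip data that an earlier run already processed) can be shown with a small conceptual sketch. This is not how Glue implements bookmarks internally; `run_incremental`, the object keys, and the integer timestamps are all hypothetical and only illustrate the behavior.

```python
from typing import List, Optional, Tuple


def run_incremental(
    objects: List[Tuple[str, int]], bookmark: Optional[int]
) -> Tuple[List[str], Optional[int]]:
    """Return the keys newer than the bookmark, plus the updated bookmark.

    `objects` holds (key, last_modified) pairs; `bookmark` is the newest
    last_modified value seen by a previous run, or None on the first run.
    """
    new_objects = [
        obj for obj in objects if bookmark is None or obj[1] > bookmark
    ]
    new_objects.sort(key=lambda obj: obj[1])
    new_bookmark = max((obj[1] for obj in new_objects), default=bookmark)
    return [key for key, _ in new_objects], new_bookmark


objects = [("a.csv", 1), ("b.csv", 2)]
first_run, bookmark = run_incremental(objects, None)

objects.append(("c.csv", 3))  # new data arrives between runs
second_run, bookmark = run_incremental(objects, bookmark)
third_run, bookmark = run_incremental(objects, bookmark)

print(first_run, second_run, third_run, bookmark)
```

The second run only picks up the file that arrived after the first run, and a third run with no new data processes nothing.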

# Lake Formation

Centralized permission control (row and column level) for a data lake stored in S3

  • Data lake = central place to have all your data for analytics purposes
  • Fully managed service that makes it easy to set up a data lake in days
  • Discover, cleanse, transform, and ingest data into your Data Lake
  • It automates many complex manual steps (collecting, cleansing, moving, cataloging data, …) and de-duplicates data (using ML Transforms)
  • Combine structured and unstructured data in the data lake
  • Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB…
  • Fine-grained Access Control for your applications (row and column-level)
  • Built on top of AWS Glue
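
What "row and column-level" access control means in practice can be illustrated with a short conceptual sketch. It is not the Lake Formation API: `apply_data_filter`, the sample table, and the analyst's permissions are hypothetical and only show the effect of such a filter on query results.

```python
from typing import Callable, Dict, Iterable, List


def apply_data_filter(
    rows: Iterable[Dict],
    allowed_columns: List[str],
    row_predicate: Callable[[Dict], bool],
) -> List[Dict]:
    """Keep only the permitted rows, and within them only the permitted columns."""
    return [
        {column: row[column] for column in allowed_columns if column in row}
        for row in rows
        if row_predicate(row)
    ]


sales = [
    {"region": "EU", "customer": "alice", "card_number": "1111", "amount": 30},
    {"region": "US", "customer": "bob", "card_number": "2222", "amount": 45},
]

# Hypothetical grant: an EU analyst sees EU rows only, without card numbers.
eu_analyst_view = apply_data_filter(
    sales,
    allowed_columns=["region", "customer", "amount"],
    row_predicate=lambda row: row["region"] == "EU",
)
print(eu_analyst_view)
```

Defining the filter once, centrally, is the point: every tool that queries the data lake sees the same restricted view.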

# Kinesis Data Analytics

  • Real-time analytics on Kinesis Data Streams & Firehose using SQL
  • Add reference data from Amazon S3 to enrich streaming data
  • Fully managed, no servers to provision
  • Automatic scaling
  • Pay for actual consumption rate
  • Output:
    • Kinesis Data Streams: create streams out of the real-time analytics queries
    • Kinesis Data Firehose: send analytics query results to destinations
  • Use cases:
    • Time-series analytics
    • Real-time dashboards
    • Real-time metrics
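
A typical real-time metric is a count per fixed time window. The sketch below shows that kind of windowed aggregation in plain Python; it is not the service's SQL syntax, and `tumbling_window_counts`, the 60-second window, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Dict[int, Dict[str, int]]:
    """Count events per key within fixed, non-overlapping time windows.

    Each event is a (timestamp_in_seconds, key) pair; the result maps each
    window start time to the per-key counts for that window.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (125, "b")]
counts = tumbling_window_counts(events, window_seconds=60)
print(counts)
```

Each event lands in exactly one window, so the per-window counts can be emitted as soon as a window closes, which is what makes this shape of query suitable for dashboards.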

# MSK

  • Alternative to Amazon Kinesis
  • Fully managed Apache Kafka on AWS
    • Allow you to create, update, delete clusters
    • MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
    • Deploy the MSK cluster in your VPC across multiple AZs (up to 3 for HA)
    • Automatic recovery from common Apache Kafka failures
    • Data is stored on EBS volumes for as long as you want
  • MSK Serverless
    • Run Apache Kafka on MSK without managing the capacity
    • MSK automatically provisions resources and scales compute & storage

# Big Data Ingestion Pipeline

  • IoT Core allows you to harvest data from IoT devices
  • Kinesis is great for real-time data collection
  • Firehose helps with data delivery to S3 in near real-time (1 minute)
  • Lambda can help Firehose with data transformations
  • Amazon S3 can trigger notifications to SQS
  • Lambda can subscribe to SQS (we could also have connected S3 directly to Lambda)
  • Athena is a serverless SQL service and results are stored in S3
  • The reporting bucket contains analyzed data and can be used by reporting tools such as Amazon QuickSight, Redshift, etc.
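
The S3 -> SQS -> Lambda step above can be sketched as a handler that unwraps the S3 notification carried in each SQS message body. This is a minimal sketch that only extracts bucket and key pairs; the bucket name and object key in the sample event are hypothetical, and the follow-up work (such as starting an Athena query) is left out.

```python
import json
from urllib.parse import unquote_plus


def handler(event, context=None):
    """Lambda entry point for an SQS trigger; returns (bucket, key) pairs."""
    objects = []
    for message in event["Records"]:
        # Each SQS message body is the JSON text of an S3 event notification.
        notification = json.loads(message["body"])
        # Test messages sent by S3 carry no "Records", hence the default.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


s3_notification = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "ingest-bucket"},
                "object": {"key": "raw/2024/file+1.json"},
            }
        }
    ]
}
sample_event = {"Records": [{"body": json.dumps(s3_notification)}]}
result = handler(sample_event)
print(result)
```

Putting SQS between S3 and Lambda adds buffering and retries, at the cost of this extra layer of unwrapping.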
