AWS Database

# Types

  • RDBMS (= SQL / OLTP): RDS, [[#Aurora]] – great for joins
  • NoSQL database – no joins, no SQL: [[#DynamoDB]] (~JSON), ElastiCache (key / value pairs), [[#Neptune]] (graphs), [[#DocumentDB]] (for MongoDB), Keyspaces (for Apache Cassandra)
  • Object Store: [[#S3]] (for big objects) / Glacier (for backups / archives)
  • Data Warehouse (= SQL Analytics / BI): Redshift (OLAP), Athena, EMR
  • Search: OpenSearch (JSON) – free text, unstructured searches
  • Graphs: Amazon Neptune – displays relationships between data
  • Ledger¹: Amazon Quantum Ledger Database
  • Time series: Amazon Timestream

# RDS

  • Managed PostgreSQL / MySQL / Oracle / SQL Server / MariaDB / Custom
  • Provisioned RDS Instance Size and EBS Volume Type & Size
  • Auto-scaling capability for Storage
    • Note: may not be enabled by default (it is turned on by setting a maximum storage threshold)
  • Support for Read Replicas and Multi AZ
  • Security through IAM, Security Groups, KMS, SSL in transit
  • Automated Backup with Point in time restore feature (up to 35 days)
  • Manual DB Snapshot for longer-term recovery
  • Managed and Scheduled maintenance (with downtime)
  • Support for IAM Authentication, integration with Secrets Manager
  • RDS Custom to access and customize the underlying instance (Oracle & SQL Server)
  • Use case: Store relational datasets (RDBMS / OLTP), perform SQL queries, transactions
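The use case above (joins, SQL queries, transactions) can be sketched as follows. This is a minimal illustration only: the in-memory `sqlite3` module stands in for an RDS-managed engine, and the `customers` / `orders` tables are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for an RDS-managed engine.
# The tables and data below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL CHECK (amount > 0))"
)

# Transaction 1: both inserts are committed together.
with conn:
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'alice')")
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 42.0)")

# Transaction 2: the second insert violates the CHECK constraint,
# so the whole transaction (including the first insert) is rolled back.
try:
    with conn:
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 10.0)")
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, -5.0)")
except sqlite3.IntegrityError:
    pass

# Join: number of orders and total amount per customer.
rows = conn.execute(
    "SELECT c.name, COUNT(o.id), SUM(o.amount)"
    " FROM customers c JOIN orders o ON o.customer_id = c.id"
    " GROUP BY c.name"
).fetchall()
print(rows)  # [('alice', 1, 42.0)]
```

Only the committed order shows up in the join, which is the all-or-nothing behavior expected from an OLTP database.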

# Aurora

  • Compatible API for PostgreSQL/MySQL, separation of storage and compute
  • Storage: data is stored in 6 replicas, across 3 AZ – highly available, self-healing, auto-scaling
  • Compute: Cluster of DB Instances across multiple AZ, auto-scaling of Read Replicas
  • Cluster: Custom endpoints for writer and reader DB instances
  • Same security / monitoring / maintenance features as RDS
    • Security through IAM, Security Groups, KMS, SSL in transit
  • Know the backup & restore options for Aurora
  • Aurora Serverless – for unpredictable / intermittent workloads, no capacity planning
  • Aurora Global: up to 16 DB Read Instances in each region, < 1 second storage replication
  • Aurora Machine Learning: perform ML using SageMaker & Comprehend on Aurora
  • Aurora Database Cloning: new cluster from existing one, faster than restoring a snapshot
  • Use case: same as RDS, but with less maintenance / more flexibility / more performance / more features
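  • Endpoints
    • cluster (writer): connects to the current primary instance
    • reader: load-balances connections across the read replicas
    • custom: targets a chosen subset of DB instances
    • instance: connects to one specific DB instance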
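One way an application might use the cluster and reader endpoints is to send read-only statements to the reader and everything else to the writer. The sketch below assumes that split. `AuroraEndpoints`, `pick_endpoint` and the hostnames are hypothetical names for illustration, not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuroraEndpoints:
    cluster: str  # writer endpoint: always resolves to the primary instance
    reader: str   # reader endpoint: spreads connections over the read replicas


# Statements treated as read-only in this sketch (a simplification:
# e.g. "SELECT ... FOR UPDATE" would need the writer instead).
READ_ONLY_KEYWORDS = ("select", "show", "explain")


def pick_endpoint(sql: str, endpoints: AuroraEndpoints) -> str:
    """Return the reader endpoint for read-only SQL, else the cluster endpoint."""
    words = sql.split()
    first_word = words[0].lower() if words else ""
    if first_word in READ_ONLY_KEYWORDS:
        return endpoints.reader
    return endpoints.cluster


# Placeholder hostnames, not real endpoints.
endpoints = AuroraEndpoints(cluster="writer.example.internal",
                            reader="reader.example.internal")
print(pick_endpoint("SELECT * FROM orders", endpoints))
print(pick_endpoint("UPDATE orders SET amount = 1", endpoints))
```

Replicas can lag slightly behind the writer, so reads that must see a just-committed write are usually safer on the cluster endpoint.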

# ElastiCache

  • Managed Redis / Memcached (similar offering as RDS, but for caches)
  • In-memory data store, sub-millisecond latency
  • Select an ElastiCache instance type (e.g., cache.m6g.large)
  • Support for Clustering (Redis) and Multi AZ, Read Replicas (sharding)
  • Security through IAM, Security Groups, KMS, Redis Auth
  • Backup / Snapshot / Point in time restore feature
  • Managed and Scheduled maintenance
  • Requires some application code changes to be leveraged
  • Use Case: Key/Value store, frequent reads, fewer writes, cache results for DB queries, store session data for websites, cannot use SQL.
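The "application code changes" mentioned above usually mean something like the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. The sketch below is an assumption-laden illustration: `TTLCache` is a tiny in-process stand-in for a Redis / Memcached client, and `get_user` and `query_db` are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a cache client offering get / set with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def get_user(user_id: int, cache: TTLCache,
             query_db: Callable[[int], str], ttl_seconds: float = 60.0) -> str:
    """Cache-aside read: try the cache, fall back to the DB, then fill the cache."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached               # cache hit: no DB round trip
    value = query_db(user_id)       # cache miss: query the database
    cache.set(key, value, ttl_seconds)
    return value
```

The TTL bounds how stale a cached entry can get; writes that must be visible immediately would also need to invalidate or update the key.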

# DynamoDB

  • AWS proprietary technology, managed serverless NoSQL database, millisecond latency
  • Capacity modes: provisioned capacity with optional auto-scaling or on-demand capacity
  • Can replace ElastiCache as a key/value store (storing session data for example, using TTL feature)
  • Highly Available, Multi-AZ by default, Reads and Writes are decoupled, transaction capability
  • DAX cluster for read cache, microsecond read latency
  • Security, authentication and authorization are done through IAM
  • Event Processing: DynamoDB Streams to integrate with AWS Lambda, or Kinesis Data Streams
  • Global Table feature: active-active setup
  • Automated backups up to 35 days with PITR (restore to new table), or on-demand backups
  • Export to S3 without using RCU within the PITR window, import from S3 without using WCU
  • Great to rapidly evolve schemas
  • Use Case: Serverless application development (small documents, 100s of KB), distributed serverless cache
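For provisioned capacity mode, the RCU / WCU numbers can be estimated with a little arithmetic. The sketch below assumes the standard rules: 1 RCU covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and 1 WCU covers one write per second of an item up to 1 KB. The function names are hypothetical, and transactional requests are ignored.

```python
import math


def read_capacity_units(reads_per_second: int, item_size_kb: float,
                        strongly_consistent: bool = True) -> int:
    """Estimate RCUs: item size is rounded up to the next 4 KB per read."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half
    return math.ceil(total)


def write_capacity_units(writes_per_second: int, item_size_kb: float) -> int:
    """Estimate WCUs: item size is rounded up to the next 1 KB per write."""
    return writes_per_second * math.ceil(item_size_kb)


print(read_capacity_units(10, 6))                             # 20
print(read_capacity_units(10, 6, strongly_consistent=False))  # 10
print(write_capacity_units(5, 2.5))                           # 15
```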

# S3

  • S3 is a… key / value store for objects
  • Great for bigger objects, not so great for many small objects
  • Serverless, scales infinitely, max object size is 5 TB, versioning capability
  • Tiers: S3 Standard, S3 Infrequent Access, S3 Intelligent-Tiering, S3 Glacier + lifecycle policy
  • Features: Versioning, Encryption, Replication, MFA-Delete, Access Logs… [[aws-big-data#Athena|Athena]]
  • Security: IAM, Bucket Policies, ACL, Access Points, Object Lambda, CORS, Object/Vault Lock
  • Encryption: SSE-S3, SSE-KMS, SSE-C, client-side, TLS in transit, default encryption
  • Batch operations on objects using S3 Batch, listing files using S3 Inventory
  • Performance: Multi-part upload, S3 Transfer Acceleration, S3 Select
  • Automation: S3 Event Notifications (SNS, SQS, Lambda, EventBridge)
  • Use Cases: static files, key value store for big files, website hosting
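To make the multi-part upload item more concrete, here is a rough sketch of how a client could choose a part size. It assumes the multipart limits of at most 10,000 parts, each between 5 MiB and 5 GiB (the last part may be smaller). `plan_parts` and the 8 MiB default are hypothetical choices for illustration, not an AWS API.

```python
import math
from typing import Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART_SIZE = 5 * MIB   # every part except the last must be at least this big
MAX_PART_SIZE = 5 * GIB
MAX_PARTS = 10_000


def plan_parts(object_size: int, preferred_part_size: int = 8 * MIB) -> Tuple[int, int]:
    """Return (part_size, part_count) in bytes for a multipart upload."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size when needed so the upload fits within MAX_PARTS parts.
    part_size = max(preferred_part_size, MIN_PART_SIZE,
                    math.ceil(object_size / MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("object too large for a single multipart upload")
    return part_size, math.ceil(object_size / part_size)


part_size, part_count = plan_parts(100 * MIB)
print(part_size // MIB, part_count)  # 8 13
```

Parts can be uploaded in parallel, which is where the performance gain for large objects comes from.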

# DocumentDB

  • DocumentDB is to MongoDB what Aurora is to PostgreSQL / MySQL: an AWS implementation with a compatible API (MongoDB is a NoSQL database)
  • MongoDB is used to store, query, and index JSON data
  • Similar “deployment concepts” as Aurora
  • Fully Managed, highly available with replication across 3 AZ
  • DocumentDB storage automatically grows in increments of 10GB
  • Automatically scales to workloads with millions of requests per second

# Neptune

  • Fully managed graph database
  • A popular graph dataset would be a social network
    • Users have friends
    • Posts have comments
    • Comments have likes from users
    • Users share and like posts…
  • Highly available across 3 AZ, with up to 15 read replicas
  • Build and run applications working with highly connected datasets – optimized for these complex and hard queries
  • Can store up to billions of relations and query the graph with millisecond latency
  • Highly available with replication across multiple AZ
  • Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking
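The social network example above is the kind of query a graph database is built for. As a rough illustration, a "friends of friends" recommendation is a two-hop traversal. The sketch below uses a plain Python adjacency map as a stand-in for the graph. It does not use Neptune's query languages, and the data is made up.

```python
from typing import Dict, Set


def friends_of_friends(friends: Dict[str, Set[str]], user: str) -> Set[str]:
    """Return people two hops away from `user` who are not already friends."""
    direct = friends.get(user, set())
    two_hops: Set[str] = set()
    for friend in direct:
        two_hops |= friends.get(friend, set())
    # Exclude the user and their existing friends from the suggestions.
    return two_hops - direct - {user}


# Hypothetical undirected friendship graph.
graph = {
    "ann": {"bob", "cid"},
    "bob": {"ann", "dee"},
    "cid": {"ann", "dee", "eve"},
    "dee": {"bob", "cid"},
    "eve": {"cid"},
}
print(sorted(friends_of_friends(graph, "ann")))  # ['dee', 'eve']
```

In a relational database the same question needs a self-join per hop, which is why deeply connected queries are a better fit for a graph engine.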

# Keyspaces

  • A managed Apache Cassandra-compatible database service
  • Serverless, Scalable, highly available, fully managed by AWS
  • Automatically scale tables up/down based on the application’s traffic
  • Tables are replicated 3 times across multiple AZ
  • Using the Cassandra Query Language (CQL)
  • Single-digit millisecond latency at any scale, 1000s of requests per second
  • Capacity: On-demand mode or provisioned mode with auto-scaling
  • Encryption, backup, Point-In-Time Recovery (PITR) up to 35 days
  • Use cases: store IoT devices info, time-series data, …

# QLDB

  • QLDB stands for “Quantum Ledger¹ Database”
  • Fully Managed, Serverless, Highly available, Replication across 3 AZ
  • Used to review history of all the changes made to your application data over time
  • Immutable system: no entry can be removed or modified, cryptographically verifiable
  • 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL
  • Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules
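The "immutable, cryptographically verifiable" property can be illustrated with a simple hash chain: each entry's hash covers the previous hash, so changing any past entry breaks verification. This is a simplified teaching sketch under that assumption, not QLDB's actual journal format. `Journal` and `entry_hash` are hypothetical names.

```python
import hashlib
import json
from typing import List, Tuple


def entry_hash(previous_hash: str, payload: dict) -> str:
    """SHA-256 over the previous entry's hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + body).encode("utf-8")).hexdigest()


class Journal:
    """Append-only log where every entry is chained to the one before it."""

    GENESIS = "0" * 64  # placeholder hash used before the first entry

    def __init__(self) -> None:
        self._entries: List[Tuple[dict, str]] = []

    def append(self, payload: dict) -> str:
        previous = self._entries[-1][1] if self._entries else self.GENESIS
        digest = entry_hash(previous, payload)
        self._entries.append((payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any modified entry makes this return False."""
        previous = self.GENESIS
        for payload, digest in self._entries:
            if entry_hash(previous, payload) != digest:
                return False
            previous = digest
        return True


journal = Journal()
journal.append({"account": "A", "balance": 100})
journal.append({"account": "A", "balance": 70})
print(journal.verify())  # True
```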

# Timestream

  • Fully managed, fast, scalable, serverless time series database
  • Automatically scales up/down to adjust capacity
  • Store and analyze trillions of events per day
  • 1000s of times faster & 1/10th the cost of relational databases
  • Scheduled queries, multi-measure records, SQL compatibility
  • Data storage tiering: recent data kept in memory and historical data kept in a cost-optimized storage
  • Built-in time series analytics functions (helps you identify patterns in your data in near real-time)
  • Encryption in transit and at rest
  • Use cases: IoT apps, operational applications, real-time analytics, …

# Quick Catchup

  • RDS event notifications only cover operational events, such as DB instance events, DB parameter group events, DB security group events, and DB snapshot events. To capture data-modifying events (INSERT, DELETE, UPDATE), use native database functions or stored procedures instead.
  • Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from CloudWatch Logs in a monitoring system of your choice. By default, Enhanced Monitoring metrics are stored in the CloudWatch Logs for 30 days. To modify the amount of time the metrics are stored in the CloudWatch Logs, change the retention for the RDSOSMetrics log group in the CloudWatch console.
  • DynamoDB
    • use partition keys with high-cardinality attributes, which have a large number of distinct values for each item
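Why high-cardinality partition keys matter can be seen with a small simulation. DynamoDB's internal hash function is not public, so the sketch below substitutes SHA-256 and a fixed partition count purely for illustration. `partition_for` and `load_per_partition` are hypothetical helpers.

```python
import hashlib
from collections import Counter
from typing import Iterable, List


def partition_for(key: str, partitions: int) -> int:
    """Map a partition key to a partition (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def load_per_partition(keys: Iterable[str], partitions: int = 4) -> List[int]:
    """Count how many items land on each partition."""
    counts = Counter(partition_for(key, partitions) for key in keys)
    return [counts.get(p, 0) for p in range(partitions)]


# Low cardinality: only two distinct key values, so at most two partitions are used.
low = load_per_partition("active" if i % 10 else "inactive" for i in range(1000))
# High cardinality: one distinct key per item, so items spread across partitions.
high = load_per_partition(f"user-{i}" for i in range(1000))
print("low cardinality :", low)
print("high cardinality:", high)
```

With a status-like key, most requests hit the same partition (a hot partition), while a per-user key spreads the load roughly evenly.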

  1. A ledger is a book recording financial transactions

Licensed under CC BY-NC-SA 4.0