Study Guide for AWS Data Engineer — Associate Exam
Amazon Athena
Introduction to Amazon Athena
- Overview: Introduce Amazon Athena as an interactive query service that lets users analyze data stored in Amazon S3 directly with standard SQL.
Key Features of Amazon Athena
Create Table As Select (CTAS) Queries
- Functionality: Describes how CTAS queries allow the creation of a new table in Athena from the results of a SELECT statement.
- Data Storage: Explains that the data files created by CTAS are stored in a specified S3 location.
- Example:
CREATE TABLE ctas_parquet_unpartitioned
WITH (format = 'PARQUET')
AS SELECT key1, name1, comment1
FROM table1;
Athena Federated Queries
- Purpose: Allows querying of data sources other than Amazon S3, expanding Athena’s applicability across diverse data environments.
Athena Workgroups
- Use Case: Facilitates the isolation of queries for different teams, applications, or workloads.
- Monitoring: Tracks query-related metrics for all workgroups, aiding in performance management and optimization.
Data Compression and Optimization in Athena
Compression Techniques
- Snappy: Focuses on speed rather than reducing file size significantly.
- LZ4: Prioritizes compression speed over compact file size.
- Gzip: The default compression algorithm, known for effective size reduction.
Data Optimization Techniques
- Partitioning: Discusses strategies like partitioning data by attributes such as country, date, or region and using strings as partition keys.
- Columnar Formats: Advocates for the use of compressed columnar formats like Parquet to enhance query performance.
Partition Projection
- Mechanism: Stores partition values and locations as configuration rules in the table properties in AWS Glue, rather than registering each partition individually in the Glue Data Catalog.
- Performance: Athena computes the partitions to read in memory at query time, which significantly improves performance for tables with very large numbers of partitions.
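- Example (a minimal sketch of configuring partition projection via a DDL statement submitted through the Athena API; the database, table, S3 paths, and date range are hypothetical):
import boto3

athena = boto3.client("athena")

# Partition values are projected from the TBLPROPERTIES rules below, so no
# partitions need to be registered in the Glue Data Catalog.
ddl = """
CREATE EXTERNAL TABLE clickstream (user_id string, page string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2023-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-bucket/clickstream/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)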
Advanced Athena Capabilities
Athena Views
- Functionality: Views in Athena let you query subsets of data, combine multiple tables into one, and hide the complexity of base queries and the details of the underlying tables and columns.
Athena MSCK REPAIR TABLE Command
- Purpose: Updates metadata in the catalog after adding Hive-compatible partitions.
- Process: Scans the file system (e.g., S3) for Hive-compatible partitions and updates the catalog accordingly.
Query Result Reuse Feature
- Benefit: Allows users to choose to reuse the last stored query result to speed up subsequent queries.
Further Insight
- Use Athena Federated Query to execute SQL queries that span both Amazon Timestream and Amazon Redshift tables
- Partition Projection supports the following data types
— Enum Type
— Integer Type
— Date Type
— Injected Type
AWS Glue
Introduction to AWS Glue
- Overview: Introduce AWS Glue as a fully managed Extract, Transform, Load (ETL) service.
- Capabilities: Highlight that it is a serverless data integration service offering data discovery, preparation, and ETL capabilities.
AWS Glue Components
AWS Glue Data Catalog
- Function: Serves as a persistent metadata store detailing data location, schema, types, and classification.
- Integration: Essential for creating data warehouses or lakes; metadata is stored in Glue, actual data resides in services like S3.
AWS Glue Database
- Purpose: Organizes metadata to represent a data store (e.g., S3).
- Structure: Consists of a set of associated data catalog table definitions grouped together.
AWS Glue Table
- Description: Represents the schema of data; actual data stored externally (e.g., in S3).
Partitions in Amazon S3
- Efficiency: Storing files under key prefixes such as year/month/day lets queries read only the relevant partitions instead of scanning the full dataset.
Glue Crawler
- Functionality: Connects to a data store, scans the data structure, and populates the AWS Glue Data Catalog with tables.
Glue Connections
- Utility: Contains properties required to connect to data sources, including connection strings with security credentials.
Glue Job
- Process: AWS Glue Jobs handle the ETL (Extract, Transform, Load) process by transforming data between various source and target formats and locations. They facilitate the integration and refinement of large data sets from disparate sources into a structured format suitable for analysis and reporting.
- Features:
- Autogenerates ETL code: Glue Jobs simplify the ETL process by automatically generating the code needed for data transformation, which can be customized as needed.
- DynamicFrames Usage: Utilizes DynamicFrames, which are similar to Spark's DataFrames but with enhancements to handle schema variations dynamically. This is particularly useful for handling semi-structured data or data with evolving schemas (see the job sketch after this list).
- Job Run Monitoring:
- Overview: Provides snapshots of job runs, including status and timestamps, accessible via the AWS Management Console.
- AWS Glue Job Profiler:
- Detailed Metrics: Offers metrics like execution time, data processing rates, and memory usage to identify performance bottlenecks.
- Optimization: These insights help optimize ETL job performance, improving efficiency and resource utilization.
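- Example (a minimal sketch of a Glue ETL script that reads a catalog table into a DynamicFrame, applies a mapping, and writes Parquet back to S3; the database, table, mappings, and output path are hypothetical):
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a crawler registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Rename/retype columns; DynamicFrames tolerate records whose schemas drift.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)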
Glue Triggers
- Types: Includes scheduled triggers (e.g., run a job daily at 8 PM), event triggers (triggered by specific events), and manual triggers.
Glue DataBrew
- Tool: AWS Glue DataBrew is a visual data preparation tool that enables users to clean, normalize, and enrich data without needing to write code. It simplifies the data preparation process by providing an intuitive graphical interface.
- Data Privacy: Glue DataBrew offers data masking features to protect personally identifiable information (PII). This includes methods like probabilistic encryption, nulling out or deleting specific values, and substituting sensitive data with anonymized equivalents.
Glue DataBrew Ruleset for Dataset Quality Check
To ensure the quality of datasets within Glue DataBrew, you can utilize rulesets which are sets of rules that can be applied to datasets to validate data quality.
- Defining Rulesets: In Glue DataBrew, a ruleset consists of one or more rules that you define based on specific data quality requirements. For example, you might check for completeness, consistency, or accuracy of data fields (see the sketch after this list).
- Usage Scenarios: Rulesets are particularly useful in scenarios where data integrity is critical. By defining and applying rulesets, users can automatically check for issues such as missing values, duplicate entries, or invalid data formats.
- Benefits: Using rulesets in Glue DataBrew helps maintain high standards of data quality, which is essential for accurate analytics and machine learning models.
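- Example (a minimal sketch of creating a ruleset with boto3; the dataset ARN, rule name, and check expression are hypothetical, and the exact expression grammar should be verified against the DataBrew documentation):
import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="orders-quality-checks",
    TargetArn="arn:aws:databrew:us-east-1:123456789012:dataset/orders",  # hypothetical dataset
    Rules=[
        {
            "Name": "order_id_not_missing",
            # Illustrative completeness check: no missing values allowed in order_id.
            "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) == :val1",
            "SubstitutionMap": {":val1": "0"},
            "ColumnSelectors": [{"Name": "order_id"}],
        },
    ],
)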
Glue Permissions
- Scope: Permissions can be controlled at the table level using the AWS Glue Data Catalog.
Glue Studio
- Interface: Offers a visual UI for creating and managing ETL jobs.
Glue Workflows
- Complexity Management: Facilitates the creation and visualization of complex ETL activities involving multiple crawlers, jobs, and triggers.
Advanced Features
Glue Pricing
- Model: Crawlers and ETL jobs are billed per second of usage, plus a simplified monthly fee for the Data Catalog.
Data Processing Units (DPUs)
- Performance: Used to run ETL jobs; more DPUs mean faster processing but higher costs.
Glue Schema Registry
- Integration: Helps integrate AWS services to manage and enforce schemas centrally.
Glue PySpark Transforms
- Variety: Includes a range of operations from applying mappings, identifying duplicates, performing joins, to complex data transformations.
Glue Jobs with Pushdown Predicate
- Optimization: Enhances performance by applying filters directly to data partitions, reducing the amount of data processed.
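- Example (a minimal sketch; the database, table, and partition columns are hypothetical):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3, so data
# outside 2024-05 is never loaded by the job.
events = glue_context.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="events",
    push_down_predicate="year = '2024' AND month = '05'",
)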
Glue Flex Execution
- Cost-efficiency: Runs jobs on spare capacity to reduce costs, suitable for jobs where start time flexibility is acceptable.
Glue Data Quality
- Monitoring: Detects data quality issues using ML and enforces data quality checks across the Data Catalog and ETL pipelines.
AWS Lake Formation
Introduction to AWS Lake Formation
- Definition: Introduce AWS Lake Formation as a service that simplifies the creation and management of data lakes.
- Data Lake Concept: Explain a data lake as a centralized repository that allows you to store all your structured and unstructured data at scale.
Key Features of AWS Lake Formation
- Access Control: Discuss the ability to control fine-grained permissions at multiple levels including database, table, column, row, and even cell levels.
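- Example (a minimal sketch of granting column-level SELECT with boto3; the principal ARN, database, table, and column names are hypothetical):
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],  # only these columns become readable
        }
    },
    Permissions=["SELECT"],
)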
Lake Formation Personas
- IAM Admin:
- Role: Acts as the superuser with all permissions except the ability to grant Lake Formation permissions.
- Data Lake Admin:
- Permissions: Can register S3 locations, access the data catalog, create databases, run workflows, and grant Lake Formation permissions.
- Read-only Admin:
- Access Level: Limited to viewing data and metadata; cannot make edits or grant permissions.
- Data Engineer:
- Capabilities: Able to create databases, run crawlers and workflows, and grant permissions on data catalog tables created by the workflows they manage.
- Data Analyst:
- Function: Primarily runs queries against the data lake to derive insights.
- Workflow Persona:
- Operation: Executes workflows on behalf of users, automating data management tasks.
- Database Creator:
- Privileges: Receives all permissions on databases they create, facilitating database management.
- Table Creator:
- Privileges: Holds all permissions on tables they create, which aids in specific table management and operation.
Amazon OpenSearch Service
- What is OpenSearch?
- A fully open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis.
- What is Amazon OpenSearch Service?
- A fully managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud.
- Amazon OpenSearch Service uses dedicated master nodes to increase cluster stability.
- An Amazon OpenSearch Service cluster is called a domain.
- Amazon OpenSearch Zone Awareness promotes fault tolerance by distributing your OpenSearch cluster data nodes across multiple Availability Zones within the same AWS region.
- Amazon OpenSearch Service snapshots are taken to S3.
- Security for Amazon OpenSearch Service can be controlled using
- Resource-based policies
- Identity-based policies
- IP-based policies (Restricting access to Amazon OpenSearch Service domain by IP Address)
- Request signing
- VPC (Restrict access to Amazon OpenSearch Service inside a VPC. Cannot change Amazon OpenSearch Service domain from public to private)
- Cognito - Controlling access to OpenSearch Dashboards
- Enable SAML authentication for Dashboards
- Use fine-grained access control with HTTP basic authentication
- Cognito Authentication
- For public access domains, configure an IP-based access Policy
- For VPC access domains, use an open access policy and a security group to control access.
- Amazon OpenSearch anti-patterns (the ways in which you should not use Amazon OpenSearch)
- Online transaction processing (OLTP) — No support for transactions or data-manipulation processing. If the requirement is a fast transactional system, RDS or DynamoDB is a better fit.
- Ad hoc data querying — If your use case is running ad hoc or one-off queries against your data set, Amazon Athena is the better option.
- Storage tiers in Amazon OpenSearch Service for data nodes
- Hot storage — Used for indexing and updating. It uses instance store or EBS volumes and provides the fastest possible performance at the highest cost.
- UltraWarm storage — Uses Amazon S3 and a sophisticated caching solution to improve performance. It offers slower performance at significantly lower cost per GiB of data. It requires a dedicated master node.
- Cold storage — Also uses Amazon S3 and is even cheaper. It requires dedicated master nodes and UltraWarm to be enabled.
- You can migrate data between the different storage tiers as needed.
- Index State Management
- Automate index management policies
- Example
1. Delete old indices after a period of time
2. Move indices into read only state after a period of time
3. Move indices from hot →UltraWarm → cold storage over time
4. Reduce replica count over time
5. Automate index snapshots
- ISM policies are run every 30–48 minutes
- Can even send notifications when done
- Index rollups
- Amazon OpenSearch Stability
- Three dedicated master nodes is the recommended configuration to avoid “split brain”
- Don’t run out of disk space — use the formula below to estimate the minimum storage requirement (a worked sketch follows this list)
Source data * (1 + number of replicas) * 1.45 = minimum storage requirement
- Choose the number of shards properly
- Choose an appropriate instance type
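- Example (a worked sketch of the formula above, assuming 100 GB of source data and one replica):
def minimum_storage_gb(source_data_gb: float, replica_count: int) -> float:
    """Estimate minimum storage: source * (1 + replicas) * 1.45."""
    return source_data_gb * (1 + replica_count) * 1.45

# 100 GB of source data with 1 replica -> 100 * 2 * 1.45 = 290 GB minimum
print(minimum_storage_gb(100, 1))  # 290.0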
- Amazon OpenSearch performance
- If you are getting JVMMemoryPressure errors, having fewer shards can provide better performance. Deleting old or unused indices is one way to remove shards.
- Amazon OpenSearch Serverless
- An on-demand, auto-scaling configuration of Amazon OpenSearch Service.
- Instead of creating domains, you create collections in Amazon OpenSearch Serverless.
- Always encrypted with your KMS key
- Capacity is measured in OpenSearch Compute Units (OCUs)
Amazon QuickSight
- Cloud-scale business intelligence (BI) service
- QuickSight Data Sources
- Redshift
- Aurora / RDS
- Athena
- OpenSearch
- IoT Analytics
- EC2-hosted databases
- Files (S3 or on-premises)
1. CSV, TSV
2. Excel
3. Common or extended log format
- SPICE (Super-fast, Parallel, In-memory Calculation Engine) is a robust in-memory engine that Amazon QuickSight uses.
- QuickSight Use Cases
- Visualization of data
- Dashboards and KPIs
- QuickSight Anti-Patterns (not made for)
- Highly formatted canned reports — QuickSight is for ad-hoc queries, analysis and visualization
- ETL — Not used for ETL; use AWS Glue instead.
- If you want QuickSight to access a Redshift cluster in another region, you can authorize access from the IP range of the QuickSight servers in the Redshift cluster's security group.
- You can also create QuickSight inside a VPC for security
- QuickSight User Management
- Define users in IAM or email signup
- You can use Active Directory integration in QuickSight Enterprise Edition
Amazon Managed Workflows for Apache Airflow (MWAA)
- Managed Orchestration service for Apache Airflow
- Workflow management platform
- Commonly used for tasks like ETL jobs, ML Pipelines and automating DevOps tasks
- Used to programmatically author, schedule, and monitor sequences of processes and tasks
- Managed Airflow environments are preconfigured with high availability and automatic scaling
- A Directed Acyclic Graph (DAG) is a collection of tasks. DAGs are written in the Python programming language and are stored in Amazon S3
- Architecture
- You can establish an SSH connection using the SSHOperator in a directed acyclic graph (DAG) running in an Amazon MWAA environment.
- Install the apache-airflow-providers-ssh package on the web server via the requirements.txt file.
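- Example (a minimal DAG sketch using the SSHOperator; the connection ID and command are hypothetical placeholders, and the SSH connection itself must be defined separately, e.g. in the Airflow UI or a secrets backend):
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="ssh_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    run_remote_command = SSHOperator(
        task_id="run_remote_command",
        ssh_conn_id="my_ssh_connection",  # hypothetical Airflow connection
        command="echo 'hello from MWAA'",
    )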
Amazon AppFlow
Introduction to Amazon AppFlow
- Overview: Briefly introduce Amazon AppFlow as a fully-managed integration service.
- Purpose: Highlight its main function — to securely transfer data between SaaS applications and AWS services.
Key Features of Amazon AppFlow
- Supported SaaS Applications: List examples such as Salesforce, Marketo, Slack, and ServiceNow.
- AWS Service Integrations: Mention integrations with AWS services like Amazon S3 and Amazon Redshift.
- Security and Compliance: Note the security measures that ensure data protection during transfers.
Flow Triggers in Amazon AppFlow
- Run on Demand:
- Description: Explain that users can manually initiate data flows as needed.
- Use Case: Ideal for ad-hoc data transfer needs or testing.
- Run on Event:
- Description: Detail how AppFlow automatically triggers data flows in response to specific events from a SaaS application.
- Use Case: Perfect for real-time data integration when changes occur in the SaaS application.
- Run on Schedule:
- Description: Discuss the ability to set flows to run on a recurring schedule.
- Use Case: Useful for regular data updates or daily summaries.
Amazon Simple Queue Service (SQS)
Introduction to AWS SQS Queues
- Overview: Briefly introduce AWS SQS (Simple Queue Service) and its role in managing message queues effectively.
Standard Queue vs FIFO Queue
- Throughput:
- Standard Queue: Offers unlimited throughput.
- FIFO Queue: Provides limited throughput to maintain order.
- Delivery Guarantees:
- Standard Queue: Ensures at-least-once delivery, but may deliver messages more than once.
- FIFO Queue: Guarantees exactly-once processing.
- Ordering:
- Standard Queue: Delivers messages in a best-effort order.
- FIFO Queue: Strictly preserves the order of messages.
Events That Can Remove Messages from SQS Queue
- DeleteMessage API Call: Direct method to remove a message from the queue.
- maxReceiveCount: Specifies how many times a message can be received before it is either deleted or moved to a dead-letter queue.
- Queue Purging: Removes all messages in the queue at once without deleting the queue itself.
Understanding SQS Visibility Timeout
- Purpose: Prevents a message from being received by multiple consumers once a consumer has picked it up.
- Mechanism: Sets a timer during which the message is invisible to other consumers.
- Defaults: Starts at 30 seconds, with a maximum limit of 12 hours.
Dead Letter Queues (DLQ)
- Function: Holds messages that have failed to process from other queues.
- Compatibility: A DLQ for FIFO queues must also be a FIFO type to maintain the order of failed messages.
Does ReceiveMessage API Call Delete a Message?
- Explanation: The ReceiveMessage API call does not delete the message from the queue.
- Requirement: Consumers must explicitly use the DeleteMessage API to remove a message after processing.
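- Example (a minimal sketch of the receive/process/delete pattern with boto3; the queue URL and processing logic are placeholders):
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

def process(body: str) -> None:
    print("processing:", body)  # placeholder for real processing logic

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=60,   # hide the message from other consumers while it is processed
    WaitTimeSeconds=10,     # long polling
)

for message in response.get("Messages", []):
    process(message["Body"])
    # ReceiveMessage does not delete the message; delete it explicitly after processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])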
Amazon Managed Streaming for Apache Kafka (MSK)
Introduction to Amazon MSK
- Definition: Introduce Amazon Managed Streaming for Apache Kafka (MSK) as a fully managed service that facilitates building and running applications that utilize Apache Kafka.
- Purpose: Emphasize that MSK manages the underlying Kafka infrastructure and administrative operations, allowing developers to focus on application development.
Key Features of Amazon MSK
- Managed Service: Discuss how MSK simplifies the setup, scaling, and management of Apache Kafka clusters.
- Control-Plane Operations: Highlight capabilities like creating, updating, and deleting Kafka clusters, which are managed by MSK.
Common Kafka Terminology
- Producers and Consumers:
- Producers: Entities that publish data to Kafka brokers.
- Consumers: Entities that subscribe to and consume data from Kafka brokers.
- Brokers:
- Role: Servers in a Kafka cluster that store and manage the distribution of data.
- Replication: Explains automatic replication of messages from one broker to another to ensure data availability and fault tolerance.
- ZooKeeper:
- Function: Manages the state and configuration of the Kafka brokers.
- Setup: Operates typically in a cluster to ensure high availability and failover handling.
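- Example (a minimal producer sketch using the kafka-python library; the broker endpoint and topic are hypothetical, and MSK authentication/TLS settings are omitted for brevity):
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],  # hypothetical broker
    value_serializer=lambda v: v.encode("utf-8"),
)

# Publish a record to a topic; consumers subscribed to the topic read it from the brokers.
producer.send("clickstream-events", value='{"page": "/home", "user": "42"}')
producer.flush()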
MSK Serverless
- Overview: Introduce MSK Serverless as a new offering that allows users to run Kafka without worrying about server provisioning or cluster management.
- Benefits:
- Scalability: Automatically scales the Kafka cluster based on the application’s needs.
- Cost Efficiency: Users pay only for the resources they use, avoiding overprovisioning and reducing costs.
- Ease of Use: Eliminates the complexity associated with Kafka operations, making it more accessible for users with varying levels of expertise.
Use Cases for Amazon MSK
- Real-Time Data Processing: Ideal for applications that require real-time data processing and streaming analytics.
- Decoupling of Systems: Enables decoupled architectures, where producers and consumers operate independently, improving system resilience and scalability.
- Event-Driven Architectures: Facilitates the development of event-driven architectures, which are responsive and scalable to business needs.
Amazon S3
- S3 Glacier Deep Archive is intended for long-term data storage that is accessed once or twice a year, with retrieval times typically ranging from 12 to 48 hours.
- Amazon S3 Glacier Flexible Retrieval, previously known as S3 Glacier, is a cost-effective storage solution for archiving data. You can configure retrieval times from minutes to hours, which is suitable for accessing data for compliance audits within a 10-hour window
- S3 Object Lock legal hold
- prevents an object version from being overwritten or deleted
- use S3 Batch Operations with Object Lock to add legal holds to many Amazon S3 objects at once
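- Example (a minimal sketch of placing a legal hold on one object with boto3; the bucket and key are hypothetical, the bucket must have Object Lock enabled, and for many objects S3 Batch Operations is the better fit as noted above):
import boto3

s3 = boto3.client("s3")

s3.put_object_legal_hold(
    Bucket="my-compliance-bucket",   # hypothetical bucket with Object Lock enabled
    Key="reports/2024/q1.parquet",   # hypothetical object key
    LegalHold={"Status": "ON"},      # set "OFF" later to remove the hold
)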
Amazon DynamoDB
Introduction to Amazon DynamoDB
- Overview: Briefly introduce Amazon DynamoDB as a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
DynamoDB Indexes
- Types of Indexes:
- Global Secondary Index (GSI): Allows queries on any attribute (not just the primary key).
- Local Secondary Index (LSI): Permits alternative sort keys under the same partition key.
- Benefits: Enhance query flexibility and performance by allowing different views of data based on varied query requirements.
DynamoDB PartiQL
- Definition: Introduce PartiQL as a SQL-compatible query language that allows you to perform SQL-like queries on your DynamoDB data.
- Use Cases: Ideal for developers familiar with SQL who want to apply similar syntax and operations in DynamoDB.
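- Example (a minimal PartiQL sketch with boto3; the table name, attribute, and value are hypothetical):
import boto3

dynamodb = boto3.client("dynamodb")

# SQL-like statement executed against a DynamoDB table via PartiQL.
response = dynamodb.execute_statement(
    Statement="SELECT * FROM Orders WHERE CustomerId = ?",
    Parameters=[{"S": "cust-123"}],
)
print(response["Items"])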
DynamoDB Accelerator (DAX)
- Functionality: Explain DAX as an in-memory cache for DynamoDB, designed to reduce response times from milliseconds to microseconds even at millions of requests per second.
- Advantages: Particularly beneficial for read-heavy application workloads where performance and latency are critical.
DynamoDB Streams
- Purpose: Describe how DynamoDB Streams capture time-ordered sequence of item-level modifications in any DynamoDB table.
- Integration: Useful for triggering automated workflows and integrating with other AWS services like AWS Lambda for real-time processing.
Time to Live (TTL) in DynamoDB
- Functionality: TTL lets you define a specific timestamp to delete expired items from your database automatically.
- Benefits: Helps reduce storage and manage data retention without involving manual overhead or custom scripts.
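- Example (a minimal sketch of enabling TTL with boto3; the table and attribute names are hypothetical):
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_time_to_live(
    TableName="SessionData",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expires_at",  # epoch-seconds timestamp stored on each item
    },
)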
Integrating DynamoDB with Amazon S3
- Patterns:
- Data Archiving: Automatically move older, infrequently accessed data to S3 for cost-effective storage.
- Data Lake: Use S3 as a data lake to store massive datasets for analytics and business intelligence operations.
- Tools: Leverage AWS Glue or AWS Data Pipeline for efficient data transfer between DynamoDB and S3.
DynamoDB Partitions
- Overview: Explain how DynamoDB uses partitions to automatically distribute data and traffic for tables over multiple servers.
- Scalability and Performance: Discuss how partitions enhance data retrieval and manage the database’s ability to scale in response to application demands.
DynamoDB Hot Partitioning
- Definition: Introduce the concept of hot partitioning, which occurs when a disproportionate amount of workload or traffic is directed at a single partition, leading to potential throttling and performance bottlenecks.
- Causes:
- Skewed Access Patterns: Commonly happens when a large number of read or write operations are concentrated on a few items or a single partition key.
- Uneven Key Distribution: Occurs when partition key values are not distributed uniformly, which can lead to uneven data distribution across partitions.
- Mitigation Strategies:
- Distributed Access Patterns: Design access patterns to distribute reads and writes evenly across all partition keys.
- Use of Write Sharding: Implement write sharding, where partition key values are appended with a random suffix, or with calculated suffixes based on access patterns, to distribute load more evenly (see the sketch at the end of this section).
- GSI Overloading: Utilize Global Secondary Indexes (GSIs) to offload hot access paths and distribute the load.
- Monitoring and Management:
- CloudWatch Metrics: Use Amazon CloudWatch to monitor and alert on metrics indicative of hot partitions, such as ReadThrottleEvents, WriteThrottleEvents, or ThrottledRequests.
- Adaptive Capacity: DynamoDB can dynamically adjust the partition distribution and resource allocation to accommodate uneven access patterns to some extent, known as adaptive capacity.
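- Example (a minimal write-sharding sketch; the table name, key schema, and shard count are hypothetical, and reads for a given date must then query every shard suffix and merge the results):
import random
import boto3

SHARD_COUNT = 10
table = boto3.resource("dynamodb").Table("Events")  # hypothetical table with partition key "pk"

def put_event(event_date: str, item: dict) -> None:
    # Spread writes for the same logical key (the date) across several partitions
    # by appending a random shard suffix to the partition key value.
    sharded_key = f"{event_date}#{random.randint(0, SHARD_COUNT - 1)}"
    table.put_item(Item={"pk": sharded_key, **item})

put_event("2024-05-01", {"sk": "evt-001", "detail": "example"})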
Amazon Redshift
Introduction to Amazon Redshift
- Definition: Explain what Redshift is — an advanced, petabyte-scale data warehouse service.
- Purpose: Discuss how it stores data from various sources and supports large-scale data management.
Data Management with Redshift
- ETL Processes: Describe how ETL processes are used to move data into Redshift from different sources.
Redshift Configurations: Serverless vs. Provisioned
- Serverless Redshift:
- Auto-provisioning and Scaling: Automatically provisions and scales based on the workload.
- Billing: Payment is based on workload execution.
- Provisioned Redshift:
- Management: Requires self-management.
- Billing: Charges are accrued per second of usage.
Redshift SQL Commands
- Essential Commands:
- COPY: Loads data into a Redshift table from an external data source.
- JOIN: Combines rows from two or more tables.
- UNLOAD: Exports the result of a query into formats like CSV, Parquet, or JSON, typically into S3.
- GRANT/REVOKE: Manages permissions within Redshift.
- CALL: Executes a stored procedure.
- CREATE DATASHARE: Initiates a new Redshift data share.
- VACUUM Operations: Optimizes and maintains database performance by reorganizing the data.
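- Example (a minimal sketch of running COPY through the Redshift Data API; the cluster, database, user, table, S3 path, and IAM role are hypothetical, and Redshift Serverless would use WorkgroupName instead of ClusterIdentifier):
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS PARQUET;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="admin",
    Sql=copy_sql,
)
print(response["Id"])  # statement ID; poll with describe_statement for completion status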
Redshift Materialized Views
- Functionality: Materialized views hold precomputed result sets based on SQL queries, enhancing query efficiency and speed.
- Benefits: Ideal for predictable and repeatedly executed queries.
Redshift Data Warehouse System Architecture
- Clusters and Nodes:
- Role of Leader Node: Manages communication and distributes SQL statements.
- Compute Nodes: Act as workers processing the queries.
- Node Slices: Each compute node is partitioned into slices with allocated memory and disk space.
Advanced Redshift Features
- Federated Queries: Allows querying across multiple data sources like databases, data warehouses, and lakes.
- Redshift Data Sharing: Enables secure access to live data across different Redshift clusters.
- Redshift Spectrum: Offers querying capabilities directly on S3 without loading data into Redshift.
- Sort Keys and Key Distribution:
- Sort Keys: Discuss the initial data load sorting and different types of sort keys (compound and interleaved).
- Key Distribution Strategies: Auto, Even, and Key distribution, including ‘All distribution’ for specific scenarios.
Querying Semistructured Data
Query Execution in Amazon Redshift: Role of the Leader Node
In Amazon Redshift, the leader node has a pivotal role in query execution, which varies depending on the type of tables or views a query references:
- User-Created and System Tables: When a query involves user-created tables or system tables (tables prefixed with STL or STV and system views prefixed with SVL or SVV), the leader node takes on the role of distributing the SQL statements to the compute nodes. This distribution is crucial for leveraging the parallel processing power of Redshift to efficiently handle and execute the query across multiple compute nodes.
- Catalog Tables: Conversely, if a query references only catalog tables (those with a PG prefix, such as PG_TABLE_DEF) or does not reference any tables at all, it is processed entirely by the leader node. These queries, typically metadata operations or administrative checks, do not require the computational resources of the compute nodes and are handled directly by the leader node.
- PartiQL for JSON: Explore how Redshift queries JSON data using PartiQL syntax resembling JavaScript.
Additional Features
- Redshift Staging Tables: Explain their role in preparing data changes.
- Redshift Streaming Ingestion: Detail how it supports rapid data ingestion from streams like KDS or MSK.
- Redshift Dynamic Data Masking: Discuss how it protects sensitive data and controls visibility based on user-defined rules.
Further Insight
- The LOCK command at the beginning of a transaction, before inserting data, prevents read and write access to the database table.
- Amazon Redshift has three lock modes:
- AccessExclusiveLock — Acquired primarily during DDL operations, such as ALTER TABLE, DROP, or TRUNCATE. AccessExclusiveLock blocks all other locking attempts.
- AccessShareLock: Acquired during UNLOAD, SELECT, UPDATE, or DELETE operations. AccessShareLock blocks only AccessExclusiveLock attempts. AccessShareLock doesn’t block other sessions that are trying to read or write on the table.
- ShareRowExclusiveLock: Acquired during COPY, INSERT, UPDATE, or DELETE operations. ShareRowExclusiveLock blocks AccessExclusiveLock and other ShareRowExclusiveLock attempts but doesn’t block AccessShareLock attempts.
AWS Sagemaker
- Overview: Brief introduction to Amazon SageMaker as a comprehensive service to build, train, and deploy machine learning models.
SageMaker Data Wrangler
- Functionality:
- Import, prepare, transform, featurize, and analyze data.
- Conduct exploratory data analysis (EDA) within the platform.
- Limitations:
- Lacks native functionalities to ensure data accuracy, completeness, and trustworthiness.
- Does not provide features to identify or mask personally identifiable information (PII).
SageMaker Feature Store
- Purpose: Acts as a storage and data management layer specifically designed for machine learning.
- Features:
- Allows users to create, store, and share features for machine learning models.
- A feature is simply a variable or attribute in a dataset: a measurable piece of data that can be used for analysis.
- Benefits:
- Enhances model accuracy and efficiency by reusing precomputed features.
- Facilitates consistent feature use across different models and projects.
- Security:
- Encrypted at rest and in transit
- Works with KMS CMK
- Fine-grained access control with IAM
- Can be secured using AWS PrivateLink
SageMaker ML Lineage Tracking
- Functionality:
- Creates and stores information about each step in the machine learning workflow.
- Benefits:
- Supports model governance and sets standards for audits.
- Ensures data accuracy and trustworthiness through detailed lineage tracking.
- Use Case:
- Essential for regulated industries requiring strict compliance and transparency in model development processes.
SageMaker Processing
- Service Overview:
- A managed service designed to run various machine learning operations.
- Capabilities:
- Supports processing workloads, data validation, and model evaluation.
- Advantages:
- Simplifies the execution of complex data processing tasks.
- Ensures scalable and efficient handling of machine learning operations.
SageMaker Canvas
- No-Code Machine Learning: Amazon SageMaker Canvas is a feature designed to democratize machine learning, offering a no-code platform that caters to all technical backgrounds. It provides a user-friendly interface that automates the entire machine learning workflow — including data preparation, model selection, training, and deployment — enabling users to build predictive models effortlessly without coding. This simplifies the process from data cleaning to prediction, making advanced analytics accessible to everyone.
AWS Database Migration Service (DMS)
- AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores.
- You can use AWS DMS to transfer data to Amazon S3 from any database source
- For AWS Database Migration Service tasks that use Amazon S3 as a target, data from full loads and change data capture (CDC) is written in CSV format by default, ensuring easy accessibility and compatibility.
AWS Data Exchange
Introduction to AWS Data Exchange
- Service Overview: AWS Data Exchange is a platform that facilitates the exchange and use of third-party data on AWS. It allows customers to act as data providers or subscribers, streamlining the process of accessing diverse datasets.
Role of Providers and Subscribers
- Providers: Customers can become providers, making their chargeable datasets available to subscribers on the platform.
- Subscribers: Subscribers gain access to these datasets only after submitting a request form to the providers, ensuring controlled distribution.
Access and Charges
- Data Access: Subscribers can access the data only after their request has been approved by the provider, safeguarding data privacy and compliance.
- Billing: Once subscribers begin accessing the data, charges are billed to the subscriber’s account and disbursed to the provider, facilitating an automated payment system.
Benefits of Using AWS Data Exchange
- Efficiency: Data sets can be securely shared with new subscribers with minimal administrative work.
- Scale and Reach: Providers can reach a wider audience of potential subscribers, expanding their market presence without additional marketing efforts.
Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data Streams
- Overview: Introduce Amazon Kinesis Data Streams as a scalable and durable real-time data streaming service that enables developers to continuously capture, store, and process large streams of data records.
Key Features of Amazon Kinesis Data Streams
- Real-Time Data Processing: Explain how Kinesis Data Streams allows for the processing of data in real-time, enabling immediate analysis and response to information as it arrives.
- High Throughput and Scalability: Discuss the ability of Kinesis Data Streams to handle thousands of data sources and scale seamlessly to match the volume of data input and throughput requirements.
How Kinesis Data Streams Work
- Data Producers: Describe the sources that can send data to a Kinesis data stream, such as log files, financial transactions, social media feeds, etc.
- Shards: Introduce the concept of shards, which are the base throughput units of a Kinesis data stream.
- Data Consumers: Talk about how applications can be built to consume and process data directly from Kinesis Data Streams.
Use Cases of Amazon Kinesis Data Streams
- Real-Time Analytics: Illustrate how companies use Kinesis for dashboard updates, real-time analytics, and anomaly detection in financial systems.
- Log and Event Data Collection: Explain how Kinesis is ideal for gathering and processing logs and event data in applications, thereby improving the monitoring and operational performance of systems.
- Internet of Things (IoT): Discuss its application in collecting and analyzing large streams of data from IoT devices and sensors.
Integration with Other AWS Services
- AWS Lambda: Detail how Kinesis Data Streams integrates with AWS Lambda to enable serverless data processing.
- Amazon S3 and Amazon Redshift: Describe the ability to archive data or load it into data warehouses for further analysis.
Best Practices for Using Amazon Kinesis Data Streams
- Stream Monitoring and Optimization: Suggest practices such as monitoring shard-level metrics and using Kinesis Client Library for effective stream management.
- Data Partitioning: Discuss the importance of effective data partitioning to ensure uniform data distribution across shards.
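- Example (a minimal sketch of writing a record with an explicit partition key using boto3; the stream name and payload are hypothetical):
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"device_id": "sensor-17", "temperature": 21.4}

kinesis.put_record(
    StreamName="iot-telemetry",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],  # records with the same key land on the same shard
)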
Further Insight
- Amazon Kinesis Data Streams stores data for 24 hours by default; retention can be extended up to 365 days.