Study Guide for AWS Data Engineer — Associate Exam
Amazon Athena
Introduction to Amazon Athena
- Overview: Introduce Amazon Athena as an interactive query service that lets users analyze data stored in Amazon S3 directly with standard SQL.
Key Features of Amazon Athena
Create Table As Select (CTAS) Queries
- Functionality: Describes how CTAS queries allow the creation of a new table in Athena from the results of a SELECT statement.
- Data Storage: Explains that the data files created by CTAS are stored in a specified S3 location.
- Example:
CREATE TABLE ctas_parquet_unpartitioned
WITH (format = 'PARQUET')
AS SELECT key1, name1, comment1
FROM table1;
Athena Federated Queries
- Purpose: Allows querying of data sources other than Amazon S3, expanding Athena’s applicability across diverse data environments.
Athena Workgroups
- Use Case: Facilitates the isolation of queries for different teams, applications, or workloads.
- Monitoring: Tracks query-related metrics for all workgroups, aiding in performance management and optimization.
Data Compression and Optimization in Athena
Compression Techniques
- Snappy: Focuses on speed rather than reducing file size significantly.
- LZ4: Prioritizes compression speed over compact file size.
- Gzip: The default compression algorithm, known for effective size reduction.
Data Optimization Techniques
- Partitioning: Discusses strategies like partitioning data by attributes such as country, date, or region and using strings as partition keys.
- Columnar Formats: Advocates for the use of compressed columnar formats like Parquet to enhance query performance.
Partition Projection
- Mechanism: Stores partition values and locations as configuration rules in the table properties in AWS Glue, rather than registering each partition individually in the Glue Data Catalog.
- Performance: Athena computes the partitions to read in memory at query time, which significantly improves performance for tables with very large numbers of partitions.
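- Example (a minimal sketch of configuring partition projection via a DDL statement submitted through the Athena API; the database, table, S3 paths, and date range are hypothetical):
import boto3

athena = boto3.client("athena")

# Partition values are projected from the TBLPROPERTIES rules below, so no
# partitions need to be registered in the Glue Data Catalog.
ddl = """
CREATE EXTERNAL TABLE clickstream (user_id string, page string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2023-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-bucket/clickstream/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)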
Advanced Athena Capabilities
Athena Views
- Functionality: Views in Athena let you query subsets of data, combine multiple tables into one, and hide the complexity of base queries and the details of the underlying tables and columns.
Athena MSCK REPAIR TABLE Command
- Purpose: Updates metadata in the catalog after adding Hive-compatible partitions.
- Process: Scans the file system (e.g., S3) for Hive-compatible partitions and updates the catalog accordingly.
Query Result Reuse Feature
- Benefit: Allows users to choose to reuse the last stored query result to speed up subsequent queries.
Further Insight
- Use Athena Federated Query to execute SQL queries that span both Amazon Timestream and Amazon Redshift tables
- Partition Projection supports the following data types
— Enum Type
— Integer Type
— Date Type
— Injected Type
AWS Glue
Introduction to AWS Glue
- Overview: Introduce AWS Glue as a fully managed Extract, Transform, Load (ETL) service.
- Capabilities: Highlight that it is a serverless data integration service offering data discovery, preparation, and ETL capabilities.
AWS Glue Components
AWS Glue Data Catalog
- Function: Serves as a persistent metadata store detailing data location, schema, types, and classification.
- Integration: Essential for creating data warehouses or lakes; metadata is stored in Glue, actual data resides in services like S3.
AWS Glue Database
- Purpose: Organizes metadata to represent a data store (e.g., S3).
- Structure: Consists of a set of associated data catalog table definitions grouped together.
AWS Glue Table
- Description: Represents the schema of data; actual data stored externally (e.g., in S3).
Partitions in Amazon S3
- Efficiency: Storing files under key prefixes such as year/month/day lets queries read only the relevant partitions instead of scanning the full dataset.
Glue Crawler
- Functionality: Connects to a data store, scans the data structure, and populates the AWS Glue Data Catalog with tables.
Glue Connections
- Utility: Contains properties required to connect to data sources, including connection strings with security credentials.
Glue Job
- Process: AWS Glue Jobs handle the ETL (Extract, Transform, Load) process by transforming data between various source and target formats and locations. They facilitate the integration and refinement of large data sets from disparate sources into a structured format suitable for analysis and reporting.
- Features:
- Autogenerates ETL code: Glue Jobs simplify the ETL process by automatically generating the code needed for data transformation, which can be customized as needed.
- DynamicFrames Usage: Utilizes DynamicFrames, which are similar to Spark's DataFrames but with enhancements to handle schema variations dynamically. This is particularly useful for handling semi-structured data or data with evolving schemas (see the job sketch after this list).
- Job Run Monitoring:
- Overview: Provides snapshots of job runs, including status and timestamps, accessible via the AWS Management Console.
- AWS Glue Job Profiler:
- Detailed Metrics: Offers metrics like execution time, data processing rates, and memory usage to identify performance bottlenecks.
- Optimization: These insights help optimize ETL job performance, improving efficiency and resource utilization.
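- Example (a minimal sketch of a Glue ETL script that reads a catalog table into a DynamicFrame, applies a mapping, and writes Parquet back to S3; the database, table, mappings, and output path are hypothetical):
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table that a crawler registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Rename/retype columns; DynamicFrames tolerate records whose schemas drift.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the transformed data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)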
Glue Triggers
- Types: Includes scheduled triggers (e.g., run a job daily at 8 PM), event triggers (triggered by specific events), and manual triggers.
Glue DataBrew
- Tool: AWS Glue DataBrew is a visual data preparation tool that enables users to clean, normalize, and enrich data without needing to write code. It simplifies the data preparation process by providing an intuitive graphical interface.
- Data Privacy: Glue DataBrew offers data masking features to protect personally identifiable information (PII). This includes methods like probabilistic encryption, nulling out or deleting specific values, and substituting sensitive data with anonymized equivalents.
Glue DataBrew Ruleset for Dataset Quality Check
To ensure the quality of datasets within Glue DataBrew, you can utilize rulesets which are sets of rules that can be applied to datasets to validate data quality.
- Defining Rulesets: In Glue DataBrew, a ruleset consists of one or more rules that you define based on specific data quality requirements. For example, you might check for completeness, consistency, or accuracy of data fields (see the sketch after this list).
- Usage Scenarios: Rulesets are particularly useful in scenarios where data integrity is critical. By defining and applying rulesets, users can automatically check for issues such as missing values, duplicate entries, or invalid data formats.
- Benefits: Using rulesets in Glue DataBrew helps maintain high standards of data quality, which is essential for accurate analytics and machine learning models.
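- Example (a minimal sketch of creating a ruleset with boto3; the dataset ARN, rule name, and check expression are hypothetical, and the exact expression grammar should be verified against the DataBrew documentation):
import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="orders-quality-checks",
    TargetArn="arn:aws:databrew:us-east-1:123456789012:dataset/orders",  # hypothetical dataset
    Rules=[
        {
            "Name": "order_id_not_missing",
            # Illustrative completeness check: no missing values allowed in order_id.
            "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) == :val1",
            "SubstitutionMap": {":val1": "0"},
            "ColumnSelectors": [{"Name": "order_id"}],
        },
    ],
)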
Glue Permissions
- Scope: Permissions can be controlled at the table level using the AWS Glue Data Catalog.
Glue Studio
- Interface: Offers a visual UI for creating and managing ETL jobs.
Glue Workflows
- Complexity Management: Facilitates the creation and visualization of complex ETL activities involving multiple crawlers, jobs, and triggers.
Advanced Features
Glue Pricing
- Model: Crawlers and ETL jobs are billed per second of usage, plus a simplified monthly fee for the Data Catalog.
Data Processing Units (DPUs)
- Performance: Used to run ETL jobs; more DPUs mean faster processing but higher costs.
Glue Schema Registry
- Integration: Helps integrate AWS services to manage and enforce schemas centrally.
Glue PySpark Transforms
- Variety: Includes a range of operations from applying mappings, identifying duplicates, performing joins, to complex data transformations.
Glue Jobs with Pushdown Predicate
- Optimization: Enhances performance by applying filters directly to data partitions, reducing the amount of data processed.
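- Example (a minimal sketch; the database, table, and partition columns are hypothetical):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are listed and read from S3, so data
# outside 2024-05 is never loaded by the job.
events = glue_context.create_dynamic_frame.from_catalog(
    database="analytics",
    table_name="events",
    push_down_predicate="year = '2024' AND month = '05'",
)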
Glue Flex Execution
- Cost-efficiency: Runs jobs on spare capacity to reduce costs, suitable for jobs where start time flexibility is acceptable.
Glue Data Quality
- Monitoring: Detects data quality issues using ML and enforces data quality checks across the Data Catalog and ETL pipelines.
AWS Lake Formation
Introduction to AWS Lake Formation
- Definition: Introduce AWS Lake Formation as a service that simplifies the creation and management of data lakes.
- Data Lake Concept: Explain a data lake as a centralized repository that allows you to store all your structured and unstructured data at scale.
Key Features of AWS Lake Formation
- Access Control: Discuss the ability to control fine-grained permissions at multiple levels including database, table, column, row, and even cell levels.
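- Example (a minimal sketch of granting column-level SELECT with boto3; the principal ARN, database, table, and column names are hypothetical):
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],  # only these columns become readable
        }
    },
    Permissions=["SELECT"],
)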
Lake Formation Personas
- IAM Admin:
- Role: Acts as the superuser with all permissions except the ability to grant Lake Formation permissions.
- Data Lake Admin:
- Permissions: Can register S3 locations, access the data catalog, create databases, run workflows, and grant Lake Formation permissions.
- Read-only Admin:
- Access Level: Limited to viewing data and metadata; cannot make edits or grant permissions.
- Data Engineer:
- Capabilities: Able to create databases, run crawlers and workflows, and grant permissions on data catalog tables created by the workflows they manage.
- Data Analyst:
- Function: Primarily runs queries against the data lake to derive insights.
- Workflow Persona:
- Operation: Executes workflows on behalf of users, automating data management tasks.
- Database Creator:
- Privileges: Receives all permissions on databases they create, facilitating database management.
- Table Creator:
- Privileges: Holds all permissions on tables they create, which aids in specific table management and operation.
Amazon OpenSearch Service
- What is OpenSearch?
- A fully open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis.
- What is Amazon OpenSearch Service?
- A fully managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud.
- Amazon OpenSearch Service uses dedicated master nodes to increase cluster stability.
- An Amazon OpenSearch Service cluster is called a domain.
- Amazon OpenSearch Zone Awareness promotes fault tolerance by distributing your OpenSearch cluster data nodes across multiple Availability Zones within the same AWS region.
- Amazon OpenSearch Service snapshots are taken to S3.
- Security for Amazon OpenSearch Service can be controlled using
- Resource-based policies
- Identity-based policies
- IP-based policies (Restricting access to Amazon OpenSearch Service domain by IP Address)
- Request signing
- VPC (Restrict access to Amazon OpenSearch Service inside a VPC. Cannot change Amazon OpenSearch Service domain from public to private)
- Cognito - Controlling access to OpenSearch Dashboards
- Enable SAML authentication for Dashboards
- Use fine-grained access control with HTTP basic authentication
- Cognito Authentication
- For public access domains, configure an IP-based access Policy
- For VPC access domains, use an open access policy and a security group to control access.
- Amazon OpenSearch anti-patterns (the ways in which you should not use Amazon OpenSearch)
- Online transaction processing (OLTP) — No support for transactions or data-manipulation processing. If the requirement is a fast transactional system, RDS or DynamoDB is a better fit.
- Ad hoc data querying — If your use case is running ad hoc or one-off queries against your data set, Amazon Athena is the better option.
- Storage tiers in Amazon OpenSearch Service for data nodes
- Hot storage — Used for indexing and updating. It uses instance store or EBS volumes and provides the fastest possible performance at the highest cost.
- UltraWarm storage — Uses Amazon S3 and a sophisticated caching solution to improve performance. It offers slower performance at significantly lower cost per GiB of data. It requires a dedicated master node.
- Cold storage — Also uses Amazon S3 and is even cheaper. It requires dedicated master nodes and UltraWarm to be enabled.
- You can migrate data between the different storage tiers as needed.
- Index State Management
- Automate index management policies
- Example
1. Delete old indices after a period of time
2. Move indices into read only state after a period of time
3. Move indices from hot →UltraWarm → cold storage over time
4. Reduce replica count over time
5. Automate index snapshots
- ISM policies are run every 30–48 minutes
- Can even send notifications when done
- Index rollups
- Amazon OpenSearch Stability
- Three dedicated master nodes is the recommended configuration to avoid “split brain”
- Don’t run out of disk space — use the formula below to estimate the minimum storage requirement (a worked sketch follows this list)
Source data * (1 + number of replicas) * 1.45 = minimum storage requirement
- Choose the number of shards properly
- Choose an appropriate instance type
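- Example (a worked sketch of the formula above, assuming 100 GB of source data and one replica):
def minimum_storage_gb(source_data_gb: float, replica_count: int) -> float:
    """Estimate minimum storage: source * (1 + replicas) * 1.45."""
    return source_data_gb * (1 + replica_count) * 1.45

# 100 GB of source data with 1 replica -> 100 * 2 * 1.45 = 290 GB minimum
print(minimum_storage_gb(100, 1))  # 290.0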
- Amazon OpenSearch performance
- If you are getting JVMMemoryPressure errors, having fewer shards can provide better performance. Deleting old or unused indices is one way to remove shards.
- Amazon OpenSearch Serverless
- An on-demand, auto-scaling configuration of Amazon OpenSearch Service.
- Instead of creating domains, you create collections in Amazon OpenSearch Serverless.
- Always encrypted with your KMS key
- Capacity is measured in OpenSearch Compute Units (OCUs)
Amazon QuickSight
- Cloud-scale business intelligence (BI) service
- QuickSight Data Sources
- Redshift
- Aurora / RDS
- Athena
- OpenSearch
- IoT Analytics
- EC2-hosted databases
- Files (S3 or on-premises)
1. CSV, TSV
2. Excel
3. Common or extended log format
- SPICE (Super-fast, Parallel, In-memory Calculation Engine) is a robust in-memory engine that Amazon QuickSight uses.
- QuickSight Use Cases
- Visualization of data
- Dashboards and KPIs
- QuickSight Anti-Patterns (not made for)
- Highly formatted canned reports — QuickSight is for ad-hoc queries, analysis and visualization
- ETL — Not used for ETL; use AWS Glue instead.
- If you want QuickSight to access a Redshift cluster in another region, you can authorize access from the IP range of the QuickSight servers in the Redshift cluster's security group.
- You can also create QuickSight inside a VPC for security
- QuickSight User Management
- Define users in IAM or email signup
- You can use Active Directory integration in QuickSight Enterprise Edition
Amazon Managed Workflows for Apache Airflow (MWAA)
- Managed Orchestration service for Apache Airflow
- Workflow management platform
- Commonly used for tasks like ETL jobs, ML Pipelines and automating DevOps tasks
- Used to programmatically author, schedule, and monitor sequences of processes and tasks
- Managed Airflow environments are preconfigured with high availability and automatic scaling
- A Directed Acyclic Graph (DAG) is a collection of tasks. DAGs are written in the Python programming language and are stored in Amazon S3
- Architecture
- You can establish an SSH connection using the SSHOperator in a directed acyclic graph (DAG) running in an Amazon MWAA environment.
- Install the apache-airflow-providers-ssh package on the web server via the requirements.txt file.
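- Example (a minimal DAG sketch using the SSHOperator; the connection ID and command are hypothetical placeholders, and the SSH connection itself must be defined separately, e.g. in the Airflow UI or a secrets backend):
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="ssh_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    run_remote_command = SSHOperator(
        task_id="run_remote_command",
        ssh_conn_id="my_ssh_connection",  # hypothetical Airflow connection
        command="echo 'hello from MWAA'",
    )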
Amazon AppFlow
Introduction to Amazon AppFlow
- Overview: Briefly introduce Amazon AppFlow as a fully-managed integration service.
- Purpose: Highlight its main function — to securely transfer data between SaaS applications and AWS services.
Key Features of Amazon AppFlow
- Supported SaaS Applications: List examples such as Salesforce, Marketo, Slack, and ServiceNow.
- AWS Service Integrations: Mention integrations with AWS services like Amazon S3 and Amazon Redshift.
- Security and Compliance: Note the security measures that ensure data protection during transfers.
Flow Triggers in Amazon AppFlow
- Run on Demand:
- Description: Explain that users can manually initiate data flows as needed.
- Use Case: Ideal for ad-hoc data transfer needs or testing.
- Run on Event:
- Description: Detail how AppFlow automatically triggers data flows in response to specific events from a SaaS application.
- Use Case: Perfect for real-time data integration when changes occur in the SaaS application.
- Run on Schedule:
- Description: Discuss the ability to set flows to run on a recurring schedule.
- Use Case: Useful for regular data updates or daily summaries.
Amazon Simple Queue Service (SQS)
Introduction to AWS SQS Queues
- Overview: Briefly introduce AWS SQS (Simple Queue Service) and its role in managing message queues effectively.
Standard Queue vs FIFO Queue
- Throughput:
- Standard Queue: Offers unlimited throughput.
- FIFO Queue: Provides limited throughput to maintain order.
- Delivery Guarantees:
- Standard Queue: Ensures at-least-once delivery, but may deliver messages more than once.
- FIFO Queue: Guarantees exactly-once processing.
- Ordering:
- Standard Queue: Delivers messages in a best-effort order.
- FIFO Queue: Strictly preserves the order of messages.
Events That Can Remove Messages from SQS Queue
- DeleteMessage API Call: Direct method to remove a message from the queue.
- maxReceiveCount: Specifies how many times a message can be received before it is either deleted or moved to a dead-letter queue.
- Queue Purging: Removes all messages in the queue at once without deleting the queue itself.
Understanding SQS Visibility Timeout
- Purpose: Prevents a message from being received by multiple consumers once a consumer has picked it up.
- Mechanism: Sets a timer during which the message is invisible to other consumers.
- Defaults: Starts at 30 seconds, with a maximum limit of 12 hours.
Dead Letter Queues (DLQ)
- Function: Holds messages that have failed to process from other queues.
- Compatibility: A DLQ for FIFO queues must also be a FIFO type to maintain the order of failed messages.
Does ReceiveMessage API Call Delete a Message?
- Explanation: The ReceiveMessage API call does not delete the message from the queue.
- Requirement: Consumers must explicitly use the DeleteMessage API to remove a message after processing.
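- Example (a minimal sketch of the receive/process/delete pattern with boto3; the queue URL and processing logic are placeholders):
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

def process(body: str) -> None:
    print("processing:", body)  # placeholder for real processing logic

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    VisibilityTimeout=60,   # hide the message from other consumers while it is processed
    WaitTimeSeconds=10,     # long polling
)

for message in response.get("Messages", []):
    process(message["Body"])
    # ReceiveMessage does not delete the message; delete it explicitly after processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])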
Amazon Managed Streaming for Apache Kafka (MSK)
Introduction to Amazon MSK
- Definition: Introduce Amazon Managed Streaming for Apache Kafka (MSK) as a fully managed service that facilitates building and running applications that utilize Apache Kafka.
- Purpose: Emphasize that MSK manages the underlying Kafka infrastructure and administrative operations, allowing developers to focus on application development.
Key Features of Amazon MSK
- Managed Service: Discuss how MSK simplifies the setup, scaling, and management of Apache Kafka clusters.
- Control-Plane Operations: Highlight capabilities like creating, updating, and deleting Kafka clusters, which are managed by MSK.
Common Kafka Terminology
- Producers and Consumers:
- Producers: Entities that publish data to Kafka brokers.
- Consumers: Entities that subscribe to and consume data from Kafka brokers.
- Brokers:
- Role: Servers in a Kafka cluster that store and manage the distribution of data.
- Replication: Explains automatic replication of messages from one broker to another to ensure data availability and fault tolerance.
- ZooKeeper:
- Function: Manages the state and configuration of the Kafka brokers.
- Setup: Operates typically in a cluster to ensure high availability and failover handling.
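- Example (a minimal producer sketch using the kafka-python library; the broker endpoint and topic are hypothetical, and MSK authentication/TLS settings are omitted for brevity):
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"],  # hypothetical broker
    value_serializer=lambda v: v.encode("utf-8"),
)

# Publish a record to a topic; consumers subscribed to the topic read it from the brokers.
producer.send("clickstream-events", value='{"page": "/home", "user": "42"}')
producer.flush()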
MSK Serverless
- Overview: Introduce MSK Serverless as a new offering that allows users to run Kafka without worrying about server provisioning or cluster management.
- Benefits:
- Scalability: Automatically scales the Kafka cluster based on the application’s needs.
- Cost Efficiency: Users pay only for the resources they use, avoiding overprovisioning and reducing costs.
- Ease of Use: Eliminates the complexity associated with Kafka operations, making it more accessible for users with varying levels of expertise.
Use Cases for Amazon MSK
- Real-Time Data Processing: Ideal for applications that require real-time data processing and streaming analytics.
- Decoupling of Systems: Enables decoupled architectures, where producers and consumers operate independently, improving system resilience and scalability.
- Event-Driven Architectures: Facilitates the development of event-driven architectures, which are responsive and scalable to business needs.
Amazon S3
- S3 Glacier Deep Archive is intended for long-term data storage that is accessed once or twice a year, with retrieval times typically ranging from 12 to 48 hours.
- Amazon S3 Glacier Flexible Retrieval, previously known as S3 Glacier, is a cost-effective storage solution for archiving data. You can configure retrieval times from minutes to hours, which is suitable for accessing data for compliance audits within a 10-hour window
- S3 Object Lock legal hold
- prevents an object version from being overwritten or deleted
- use S3 Batch Operations with Object Lock to add legal holds to many Amazon S3 objects at once
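- Example (a minimal sketch of placing a legal hold on one object with boto3; the bucket and key are hypothetical, the bucket must have Object Lock enabled, and for many objects S3 Batch Operations is the better fit as noted above):
import boto3

s3 = boto3.client("s3")

s3.put_object_legal_hold(
    Bucket="my-compliance-bucket",   # hypothetical bucket with Object Lock enabled
    Key="reports/2024/q1.parquet",   # hypothetical object key
    LegalHold={"Status": "ON"},      # set "OFF" later to remove the hold
)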
Amazon DynamoDB
Introduction to Amazon DynamoDB
- Overview: Briefly introduce Amazon DynamoDB as a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
DynamoDB Indexes
- Types of Indexes:
- Global Secondary Index (GSI): Allows queries on any attribute (not just the primary key).
- Local Secondary Index (LSI): Permits alternative sort keys under the same partition key.
- Benefits: Enhance query flexibility and performance by allowing different views of data based on varied query requirements.
DynamoDB PartiQL
- Definition: Introduce PartiQL as a SQL-compatible query language that allows you to perform SQL-like queries on your DynamoDB data.
- Use Cases: Ideal for developers familiar with SQL who want to apply similar syntax and operations in DynamoDB.
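- Example (a minimal PartiQL sketch with boto3; the table name, attribute, and value are hypothetical):
import boto3

dynamodb = boto3.client("dynamodb")

# SQL-like statement executed against a DynamoDB table via PartiQL.
response = dynamodb.execute_statement(
    Statement="SELECT * FROM Orders WHERE CustomerId = ?",
    Parameters=[{"S": "cust-123"}],
)
print(response["Items"])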
DynamoDB Accelerator (DAX)
- Functionality: Explain DAX as an in-memory cache for DynamoDB, designed to reduce response times from milliseconds to microseconds even at millions of requests per second.
- Advantages: Particularly beneficial for read-heavy application workloads where performance and latency are critical.
DynamoDB Streams
- Purpose: Describe how DynamoDB Streams capture time-ordered sequence of item-level modifications in any DynamoDB table.
- Integration: Useful for triggering automated workflows and integrating with other AWS services like AWS Lambda for real-time processing.
Time to Live (TTL) in DynamoDB
- Functionality: TTL lets you define a specific timestamp to delete expired items from your database automatically.
- Benefits: Helps reduce storage and manage data retention without involving manual overhead or custom scripts.
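- Example (a minimal sketch of enabling TTL with boto3; the table and attribute names are hypothetical):
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_time_to_live(
    TableName="SessionData",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expires_at",  # epoch-seconds timestamp stored on each item
    },
)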
Integrating DynamoDB with Amazon S3
- Patterns:
- Data Archiving: Automatically move older, infrequently accessed data to S3 for cost-effective storage.
- Data Lake: Use S3 as a data lake to store massive datasets for analytics and business intelligence operations.
- Tools: Leverage AWS Glue or AWS Data Pipeline for efficient data transfer between DynamoDB and S3.
DynamoDB Partitions
- Overview: Explain how DynamoDB uses partitions to automatically distribute data and traffic for tables over multiple servers.
- Scalability and Performance: Discuss how partitions enhance data retrieval and manage the database’s ability to scale in response to application demands.
DynamoDB Hot Partitioning
- Definition: Introduce the concept of hot partitioning, which occurs when a disproportionate amount of workload or traffic is directed at a single partition, leading to potential throttling and performance bottlenecks.
- Causes:
- Skewed Access Patterns: Commonly happens when a large number of read or write operations are concentrated on a few items or a single partition key.
- Uneven Key Distribution: Occurs when partition key values are not distributed uniformly, which can lead to uneven data distribution across partitions.
- Mitigation Strategies:
- Distributed Access Patterns: Design access patterns to distribute reads and writes evenly across all partition keys.
- Use of Write Sharding: Implement write sharding, where partition key values are appended with a random suffix, or with calculated suffixes based on access patterns, to distribute load more evenly (see the sketch at the end of this section).
- GSI Overloading: Utilize Global Secondary Indexes (GSIs) to offload hot access paths and distribute the load.
- Monitoring and Management:
- CloudWatch Metrics: Use Amazon CloudWatch to monitor and alert on metrics indicative of hot partitions, such as ReadThrottleEvents, WriteThrottleEvents, or ThrottledRequests.
- Adaptive Capacity: DynamoDB can dynamically adjust the partition distribution and resource allocation to accommodate uneven access patterns to some extent, known as adaptive capacity.
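- Example (a minimal write-sharding sketch; the table name, key schema, and shard count are hypothetical, and reads for a given date must then query every shard suffix and merge the results):
import random
import boto3

SHARD_COUNT = 10
table = boto3.resource("dynamodb").Table("Events")  # hypothetical table with partition key "pk"

def put_event(event_date: str, item: dict) -> None:
    # Spread writes for the same logical key (the date) across several partitions
    # by appending a random shard suffix to the partition key value.
    sharded_key = f"{event_date}#{random.randint(0, SHARD_COUNT - 1)}"
    table.put_item(Item={"pk": sharded_key, **item})

put_event("2024-05-01", {"sk": "evt-001", "detail": "example"})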
Amazon Redshift
Introduction to Amazon Redshift
- Definition: Explain what Redshift is — an advanced, petabyte-scale data warehouse service.
- Purpose: Discuss how it stores data from various sources and supports large-scale data management.
Data Management with Redshift
- ETL Processes: Describe how ETL processes are used to move data into Redshift from different sources.
Redshift Configurations: Serverless vs. Provisioned
- Serverless Redshift:
- Auto-provisioning and Scaling: Automatically provisions and scales based on the workload.
- Billing: Payment is based on workload execution.
- Provisioned Redshift:
- Management: Requires self-management.
- Billing: Charges are accrued per second of usage.
Redshift SQL Commands
- Essential Commands:
- COPY: Loads data into a Redshift table from an external data source.
- JOIN: Combines rows from two or more tables.
- UNLOAD: Exports the result of a query into formats like CSV, Parquet, or JSON, typically into S3.
- GRANT/REVOKE: Manages permissions within Redshift.
- CALL: Executes a stored procedure.
- CREATE DATASHARE: Initiates a new Redshift data share.
- VACUUM Operations: Optimizes and maintains database performance by reorganizing the data.
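- Example (a minimal sketch of running COPY through the Redshift Data API; the cluster, database, user, table, S3 path, and IAM role are hypothetical, and Redshift Serverless would use WorkgroupName instead of ClusterIdentifier):
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS PARQUET;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="admin",
    Sql=copy_sql,
)
print(response["Id"])  # statement ID; poll with describe_statement for completion status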
Redshift Materialized Views
- Functionality: Materialized views hold precomputed result sets based on SQL queries, enhancing query efficiency and speed.
- Benefits: Ideal for predictable and repeatedly executed queries.
Redshift Data Warehouse System Architecture
- Clusters and Nodes:
- Role of Leader Node: Manages communication and distributes SQL statements.
- Compute Nodes: Act as workers processing the queries.
- Node Slices: Each compute node is partitioned into slices with allocated memory and disk space.
Advanced Redshift Features
- Federated Queries: Allows querying across multiple data sources like databases, data warehouses, and lakes.
- Redshift Data Sharing: Enables secure access to live data across different Redshift clusters.
- Redshift Spectrum: Offers querying capabilities directly on S3 without loading data into Redshift.
- Sort Keys and Key Distribution:
- Sort Keys: Discuss the initial data load sorting and different types of sort keys (compound and interleaved).
- Key Distribution Strategies: Auto, Even, and Key distribution, including ‘All distribution’ for specific scenarios.
Querying Semistructured Data
Query Execution in Amazon Redshift: Role of the Leader Node
In Amazon Redshift, the leader node has a pivotal role in query execution, which varies depending on the type of tables or views a query references:
- User-Created and System Tables: When a query involves user-created tables or system tables (tables prefixed with STL or STV and system views prefixed with SVL or SVV), the leader node takes on the role of distributing the SQL statements to the compute nodes. This distribution is crucial for leveraging the parallel processing power of Redshift to efficiently handle and execute the query across multiple compute nodes.
- Catalog Tables: Conversely, if a query references only catalog tables (those with a PG prefix, such as PG_TABLE_DEF) or does not reference any tables at all, it is processed entirely by the leader node. These queries, typically metadata operations or administrative checks, do not require the computational resources of the compute nodes and are handled directly by the leader node.
- PartiQL for JSON: Explore how Redshift queries JSON data using PartiQL syntax resembling JavaScript.
Additional Features
- Redshift Staging Tables: Explain their role in preparing data changes.
- Redshift Streaming Ingestion: Detail how it supports rapid data ingestion from streams like KDS or MSK.
- Redshift Dynamic Data Masking: Discuss how it protects sensitive data and controls visibility based on user-defined rules.
Further Insight
- The LOCK command at the beginning of a transaction, before inserting data, prevents read and write access to the database table.
- Amazon Redshift has three lock modes:
- AccessExclusiveLock — Acquired primarily during DDL operations, such as ALTER TABLE, DROP, or TRUNCATE. AccessExclusiveLock blocks all other locking attempts.
- AccessShareLock: Acquired during UNLOAD, SELECT, UPDATE, or DELETE operations. AccessShareLock blocks only AccessExclusiveLock attempts. AccessShareLock doesn’t block other sessions that are trying to read or write on the table.
- ShareRowExclusiveLock: Acquired during COPY, INSERT, UPDATE, or DELETE operations. ShareRowExclusiveLock blocks AccessExclusiveLock and other ShareRowExclusiveLock attempts but doesn’t block AccessShareLock attempts.
AWS Sagemaker
- Overview: Brief introduction to Amazon SageMaker as a comprehensive service to build, train, and deploy machine learning models.
SageMaker Data Wrangler
- Functionality:
- Import, prepare, transform, featurize, and analyze data.
- Conduct exploratory data analysis (EDA) within the platform.
- Limitations:
- Lacks native functionalities to ensure data accuracy, completeness, and trustworthiness.
- Does not provide features to identify or mask personally identifiable information (PII).
SageMaker Feature Store
- Purpose: Acts as a storage and data management layer specifically designed for machine learning.
- Features:
- Allows users to create, store, and share features for machine learning models.
- A feature is simply a variable or attribute in a dataset: a measurable piece of data that can be used for analysis.
- Benefits:
- Enhances model accuracy and efficiency by reusing precomputed features.
- Facilitates consistent feature use across different models and projects.
- Security:
- Encrypted at rest and in transit
- Works with KMS CMK
- Fine-grained access control with IAM
- Can be secured using AWS PrivateLink
SageMaker ML Lineage Tracking
- Functionality:
- Creates and stores information about each step in the machine learning workflow.
- Benefits:
- Supports model governance and sets standards for audits.
- Ensures data accuracy and trustworthiness through detailed lineage tracking.
- Use Case:
- Essential for regulated industries requiring strict compliance and transparency in model development processes.
SageMaker Processing
- Service Overview:
- A managed service designed to run various machine learning operations.
- Capabilities:
- Supports processing workloads, data validation, and model evaluation.
- Advantages:
- Simplifies the execution of complex data processing tasks.
- Ensures scalable and efficient handling of machine learning operations.
SageMaker Canvas
- No-Code Machine Learning: Amazon SageMaker Canvas is a feature designed to democratize machine learning, offering a no-code platform that caters to all technical backgrounds. It provides a user-friendly interface that automates the entire machine learning workflow — including data preparation, model selection, training, and deployment — enabling users to build predictive models effortlessly without coding. This simplifies the process from data cleaning to prediction, making advanced analytics accessible to everyone.
AWS Database Migration Service (DMS)
- AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores.
- You can use AWS DMS to transfer data to Amazon S3 from any database source
- For AWS Database Migration Service tasks that use Amazon S3 as a target, data from full loads and change data capture (CDC) is written in CSV format by default, ensuring easy accessibility and compatibility.
AWS Data Exchange
Introduction to AWS Data Exchange
- Service Overview: AWS Data Exchange is a platform that facilitates the exchange and use of third-party data on AWS. It allows customers to act as data providers or subscribers, streamlining the process of accessing diverse datasets.
Role of Providers and Subscribers
- Providers: Customers can become providers, making their chargeable datasets available to subscribers on the platform.
- Subscribers: Subscribers gain access to these datasets only after submitting a request form to the providers, ensuring controlled distribution.
Access and Charges
- Data Access: Subscribers can access the data only after their request has been approved by the provider, safeguarding data privacy and compliance.
- Billing: Once subscribers begin accessing the data, charges are billed to the subscriber’s account and disbursed to the provider, facilitating an automated payment system.
Benefits of Using AWS Data Exchange
- Efficiency: Data sets can be securely shared with new subscribers with minimal administrative work.
- Scale and Reach: Providers can reach a wider audience of potential subscribers, expanding their market presence without additional marketing efforts.
Amazon Kinesis Data Streams
Introduction to Amazon Kinesis Data Streams
- Overview: Introduce Amazon Kinesis Data Streams as a scalable and durable real-time data streaming service that enables developers to continuously capture, store, and process large streams of data records.
Key Features of Amazon Kinesis Data Streams
- Real-Time Data Processing: Explain how Kinesis Data Streams allows for the processing of data in real-time, enabling immediate analysis and response to information as it arrives.
- High Throughput and Scalability: Discuss the ability of Kinesis Data Streams to handle thousands of data sources and scale seamlessly to match the volume of data input and throughput requirements.
How Kinesis Data Streams Work
- Data Producers: Describe the sources that can send data to a Kinesis data stream, such as log files, financial transactions, social media feeds, etc.
- Shards: Introduce the concept of shards, which are the base throughput units of a Kinesis data stream.
- Data Consumers: Talk about how applications can be built to consume and process data directly from Kinesis Data Streams.
Use Cases of Amazon Kinesis Data Streams
- Real-Time Analytics: Illustrate how companies use Kinesis for dashboard updates, real-time analytics, and anomaly detection in financial systems.
- Log and Event Data Collection: Explain how Kinesis is ideal for gathering and processing logs and event data in applications, thereby improving the monitoring and operational performance of systems.
- Internet of Things (IoT): Discuss its application in collecting and analyzing large streams of data from IoT devices and sensors.
Integration with Other AWS Services
- AWS Lambda: Detail how Kinesis Data Streams integrates with AWS Lambda to enable serverless data processing.
- Amazon S3 and Amazon Redshift: Describe the ability to archive data or load it into data warehouses for further analysis.
Best Practices for Using Amazon Kinesis Data Streams
- Stream Monitoring and Optimization: Suggest practices such as monitoring shard-level metrics and using Kinesis Client Library for effective stream management.
- Data Partitioning: Discuss the importance of effective data partitioning to ensure uniform data distribution across shards.
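- Example (a minimal sketch of writing a record with an explicit partition key using boto3; the stream name and payload are hypothetical):
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"device_id": "sensor-17", "temperature": 21.4}

kinesis.put_record(
    StreamName="iot-telemetry",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],  # records with the same key land on the same shard
)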
Further Insight
- Amazon Kinesis Data Streams stores data for 24 hours by default; retention can be extended up to 365 days.