How to use Amazon KCL and set up Amazon EMR

Last updated on Dec 10 2021
Padmanabham Suresh

Amazon – Kinesis

Amazon Kinesis is a managed, scalable, cloud-based service for processing large volumes of streaming data in real time. It is designed for real-time applications: developers can take in any amount of data from several sources, the service scales up and down with the load, and the processing applications can run on EC2 instances.

It is used to capture, store, and process data from large, distributed streams such as event logs and social media feeds. After processing the data, Kinesis distributes it to multiple consumers simultaneously.

It suits situations where data moves rapidly and must be processed continuously. Amazon Kinesis can be used in the following situations −

  • Data log and data feed intake − There is no need to wait and batch up the data; we can push data to an Amazon Kinesis stream as soon as it is produced. This also protects against data loss if the data producer fails. For example, system and application logs can be added to a stream continuously and be available within seconds.
  • Real-time graphs − We can extract graphs and metrics from an Amazon Kinesis stream to build reports, without waiting for data batches.
  • Real-time data analytics − We can run analytics on streaming data in real time by using Amazon Kinesis.
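As a rough illustration of the log-intake case, the sketch below pushes each log line to a stream as soon as it is produced instead of batching it. It assumes a boto3-style Kinesis client is passed in by the caller; the helper names, the JSON layout of the event and the stream name used in the note below are hypothetical.

```python
import json
import time


def build_log_record(host, message, now=None):
    """Serialize one log event as UTF-8 JSON bytes for a Kinesis record."""
    event = {
        "host": host,
        "message": message,
        "timestamp": time.time() if now is None else now,
    }
    return json.dumps(event, sort_keys=True).encode("utf-8")


def send_log_line(client, stream_name, host, message, now=None):
    """Push a single log line to the stream immediately (no batching).

    `client` is assumed to expose put_record(StreamName, Data, PartitionKey),
    as the boto3 Kinesis client does. The host is used as the partition key
    so that events from one host go to the same shard.
    """
    return client.put_record(
        StreamName=stream_name,
        Data=build_log_record(host, message, now),
        PartitionKey=host,
    )
```

With boto3 installed and credentials configured, the client could be obtained with `boto3.client("kinesis")` and the call would look like `send_log_line(client, "app-logs", "web-1", "service started")`.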

Limits of Amazon Kinesis

Following are certain limits that should be kept in mind while using Amazon Kinesis Streams −

  • Records of a stream are accessible for up to 24 hours by default. This can be extended up to 7 days by enabling extended data retention, and up to 365 days with long-term data retention.
  • The maximum size of a data blob (the data payload before Base64-encoding) in one record is 1 megabyte (MB).
  • One shard supports up to 1000 PUT records per second, up to a maximum total write rate of 1 MB per second.
  • For more information related to limits, visit the following link − https://docs.aws.amazon.com/kinesis/latest/dev/service-sizes-and-limits.html
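These limits translate into simple arithmetic when sizing a stream. The sketch below is only an estimate under stated assumptions: it uses the 1 MB record limit and the 1000 records per second per shard from the list above, plus a per-shard write rate of 1 MB per second as documented by AWS. The function names are hypothetical.

```python
import math

MAX_RECORD_BYTES = 1024 * 1024            # 1 MB data blob limit per record
MAX_RECORDS_PER_SHARD = 1000              # PUT records per second per shard
MAX_WRITE_BYTES_PER_SHARD = 1024 * 1024   # assumed 1 MB/s write rate per shard


def fits_in_one_record(payload: bytes) -> bool:
    """Return True if the payload is within the per-record size limit."""
    return len(payload) <= MAX_RECORD_BYTES


def shards_needed(records_per_second, avg_record_bytes):
    """Estimate the shard count needed for a given write load.

    The stream must satisfy both the record-count limit and the byte-rate
    limit, so the larger of the two requirements wins (minimum one shard).
    """
    by_count = math.ceil(records_per_second / MAX_RECORDS_PER_SHARD)
    by_bytes = math.ceil(
        records_per_second * avg_record_bytes / MAX_WRITE_BYTES_PER_SHARD
    )
    return max(1, by_count, by_bytes)
```

For example, 2500 small records per second are bound by the record-count limit, while 500 records of about 10 KB each are bound by the byte rate.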

How to Use Amazon Kinesis?

Following are the steps to use Amazon Kinesis −

Step 1 − Set up a Kinesis stream using the following steps −

  • Sign in to your AWS account. Select Amazon Kinesis from the AWS Management Console.
  • Click Create stream and fill in the required fields such as the stream name and the number of shards. Click the Create button.

  • The Stream will now be visible in the Stream List.
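The same stream can also be created programmatically. The following is a minimal sketch, assuming a boto3-style Kinesis client that provides `create_stream` and `describe_stream_summary`; the function name, polling interval and retry count are hypothetical choices.

```python
import time


def create_stream_and_wait(client, stream_name, shard_count,
                           poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Create a stream, then poll until its status becomes ACTIVE."""
    client.create_stream(StreamName=stream_name, ShardCount=shard_count)
    for _ in range(max_polls):
        summary = client.describe_stream_summary(StreamName=stream_name)
        status = summary["StreamDescriptionSummary"]["StreamStatus"]
        if status == "ACTIVE":
            return status
        sleep(poll_seconds)  # a new stream reports CREATING before ACTIVE
    raise TimeoutError(f"stream {stream_name!r} did not become ACTIVE")
```

The `sleep` parameter is injectable only so that the function can be exercised without real waiting. With boto3, the client would come from `boto3.client("kinesis")`.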

Step 2 − Set up users on the Kinesis stream. Create new users and assign a policy to each user. (The procedure for creating users and assigning policies to them has been discussed previously.)
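To make the policy step more concrete, here is a sketch of a policy document that could be attached to a user who only writes to one stream. The choice of actions is an assumption about what a producer needs, and the region, account ID and stream name in the note below are placeholders.

```python
import json


def producer_policy(region, account_id, stream_name):
    """Build an IAM policy document that allows writing to a single stream."""
    stream_arn = f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed minimum for a producer: look up the stream, write to it.
                "Action": [
                    "kinesis:DescribeStreamSummary",
                    "kinesis:PutRecord",
                    "kinesis:PutRecords",
                ],
                "Resource": stream_arn,
            }
        ],
    }


def policy_json(region, account_id, stream_name):
    """Render the policy as the JSON text that the IAM console expects."""
    return json.dumps(producer_policy(region, account_id, stream_name), indent=2)
```

Calling `policy_json("us-east-1", "123456789012", "app-logs")` produces text that can be pasted into the JSON editor when creating the policy.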

Step 3 − Connect your application to Amazon Kinesis. Here we connect Zoomdata to Amazon Kinesis, using the following steps.

  • Log in to Zoomdata as Administrator and click Sources in the menu.

  • Select the Kinesis icon and fill in the required details. Click the Next button.

  • Select the desired Stream on the Stream tab.
  • On the Fields tab, create unique label names as required and click the Next button.
  • On the Charts tab, enable the charts for the data. Customize the settings as required, then click the Finish button to save them.

Features of Amazon Kinesis

  • Real-time processing − It allows us to collect and analyze information, such as stock trade prices, in real time rather than waiting for a report to be produced afterwards.
  • Easy to use − Using Amazon Kinesis, we can create a new stream, set its requirements, and start streaming data quickly.
  • High throughput, elastic − A stream's capacity is set by its number of shards, so throughput can be scaled up or down to match the rate and volume of the incoming data.
  • Integrate with other Amazon services − It can be integrated with Amazon Redshift, Amazon S3 and Amazon DynamoDB.
  • Build Kinesis applications − Amazon Kinesis provides developers with client libraries that enable the design and operation of real-time data processing applications. Add the Amazon Kinesis Client Library (KCL) to a Java application and it will notify the application when new data is available for processing.
  • Cost-efficient − Amazon Kinesis is cost-efficient for workloads of any scale. We pay as we go for the resources used and pay hourly for the throughput required.
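To show what "new data is available for processing" means at the API level, the sketch below reads one batch from a single shard using the low-level `get_shard_iterator` and `get_records` calls of a boto3-style client. This is not the KCL: the KCL additionally handles shard discovery, checkpointing and balancing work across workers. The function name and the callback signature are hypothetical.

```python
def read_shard_once(client, stream_name, shard_id, handle, limit=100):
    """Read one batch of records from a shard and pass each one to `handle`.

    Reading starts at TRIM_HORIZON, i.e. the oldest record still retained.
    Returns the iterator to use for the next batch (None if the shard is closed).
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    response = client.get_records(ShardIterator=iterator, Limit=limit)
    for record in response["Records"]:
        # Data arrives as raw bytes; decoding is left to the handler.
        handle(record["PartitionKey"], record["Data"])
    return response.get("NextShardIterator")
```

A real consumer would keep calling `get_records` with the returned iterator and remember how far it has read, which is the bookkeeping the KCL takes over.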

Amazon – Elastic MapReduce (EMR)

Amazon Elastic MapReduce (EMR) is a web service that provides a managed environment for running data processing frameworks such as Apache Hadoop, Apache Spark, and Presto in an easy, cost-effective, and secure manner.

It is used for data analysis, web indexing, data warehousing, financial analysis, scientific simulation, etc.

How to Set Up Amazon EMR?

Follow these steps to set up Amazon EMR −

Step 1 − Sign in to your AWS account and select Amazon EMR on the management console.
Step 2 − Create an Amazon S3 bucket for cluster logs and output data. (The procedure is explained in detail in the Amazon S3 section.)
Step 3 − Launch the Amazon EMR cluster.
Following are the steps to create a cluster and launch it on EMR.

  • Leave the Tags section options as default and proceed.
  • On the Software configuration section, leave the options as default.

  • On the File System Configuration section, leave the options for EMRFS as set by default. EMRFS is an implementation of the Hadoop file system interface that allows Amazon EMR clusters to store data on Amazon S3.

  • On the Hardware Configuration section, select m3.xlarge in the EC2 instance type field and leave the other settings as default. Click the Next button.

  • On the Security and Access section, select your key pair from the list in the EC2 key pair field and leave the other settings as default.
  • On the Bootstrap Actions section, leave the fields as set by default and click the Add button. Bootstrap actions are scripts that run on every cluster node during setup, before Hadoop starts.
  • On the Steps section, leave the settings as default and proceed.
  • Click the Create Cluster button and the Cluster Details page opens. From here we run the Hive script as a cluster step and use the Hue web interface to query the data.
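The console choices above map onto a single API request. The sketch below builds that request for a boto3-style EMR client's `run_job_flow` call. It is an approximation under stated assumptions: the default EMR roles (`EMR_DefaultRole`, `EMR_EC2_DefaultRole`) already exist in the account, the cluster has three nodes, and logs go under a `logs/` prefix. The function names are hypothetical, and the release label is left to the caller because it must support the chosen instance type.

```python
def build_cluster_request(name, log_bucket, key_pair, release_label,
                          instance_type="m3.xlarge", instance_count=3):
    """Build keyword arguments for run_job_flow, mirroring the console steps."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/logs/",   # assumed log prefix in the bucket
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Hue"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,    # assumed cluster size
            "Ec2KeyName": key_pair,
            # Keep the cluster running so that steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch_cluster(client, *args, **kwargs):
    """Launch the cluster and return its ID (the JobFlowId)."""
    return client.run_job_flow(**build_cluster_request(*args, **kwargs))["JobFlowId"]
```

With boto3, `client` would be `boto3.client("emr")`. The returned cluster ID is what later calls use to refer to the cluster.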

Step 4 − Run the Hive script using the following steps.

  • Open the Amazon EMR console and select the desired cluster.
  • Move to the Steps section and expand it. Then click the Add step button.
  • The Add Step dialog box opens. Fill in the required fields, then click the Add button.

  • To view the output of Hive script, use the following steps −
    • Open the Amazon S3 console and select the S3 bucket used for the output data.
    • Select the output folder.
    • The query writes the results into a separate folder. Select os_requests.
    • The output is stored in a text file. This file can be downloaded.
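Step 4 can be sketched in code as well. The example below assumes boto3-style EMR and S3 clients, a Hive script already uploaded to S3, and a script that reads its locations from `INPUT` and `OUTPUT` variables. The function names and all S3 paths in the note below are hypothetical, and the step uses `command-runner.jar` with the `hive-script` command available on EMR release 4.x and later.

```python
def build_hive_step(script_s3_uri, input_s3_uri, output_s3_uri, name="Hive script"):
    """Describe a cluster step that runs a Hive script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_s3_uri,
                "-d", f"INPUT={input_s3_uri}",
                "-d", f"OUTPUT={output_s3_uri}",
            ],
        },
    }


def add_hive_step(emr_client, cluster_id, step):
    """Submit the step to a running cluster and return the new step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


def list_output_keys(s3_client, bucket, prefix):
    """List the result files under the output folder (first page of results only)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

Once the step has completed, something like `list_output_keys(s3_client, "my-emr-bucket", "output/os_requests/")` would show the text file mentioned above, provided the bucket follows that layout.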

Benefits of Amazon EMR

Following are the benefits of Amazon EMR −

  • Easy to use − Amazon EMR is easy to use: it takes care of cluster setup, Hadoop configuration, node provisioning, etc.
  • Reliable − It is reliable in the sense that it retries failed tasks and automatically replaces poorly performing instances.
  • Elastic − Amazon EMR lets us provision as many instances as needed to process data at any scale, and the number of instances can easily be increased or decreased.
  • Secure − It automatically configures Amazon EC2 firewall settings, controls network access to instances, can launch clusters in an Amazon VPC, etc.
  • Flexible − It allows complete control over the clusters and root access to every instance. It also allows installing additional applications and customizing the cluster as required.
  • Cost-efficient − Its pricing is easy to estimate. It charges a per-instance rate for as long as each instance runs (billed per second, with a one-minute minimum).

 

So, this brings us to the end of the blog. This Tecklearn post, ‘How to use Amazon KCL and set up Amazon EMR’, should help with commonly asked questions if you are looking for a job in AWS and cloud computing. If you wish to learn AWS and build a career in the cloud computing domain, check out our interactive AWS Solutions Architect Training, which comes with 24*7 support to guide you throughout your learning period. Course details are available at the following link:

https://www.tecklearn.com/course/aws-solutions-architect-certification-training/
