I am glad I attended the AWS event on Big Data and data analytics; it was definitely an interesting day.
I found the session useful because it mixed architectural approaches, demos and best practices for working with Big Data on Amazon Web Services (AWS).
A picture taken during the talk shows the AWS cloud services that can be used across the typical workflow:
Collect ==> Store ==> Analyze
In data analytics we usually have to deal with unstructured data, which must be turned into structured data before it can be analysed.
The next picture shows how those services can be combined to process data depending on its nature.
Particular focus was given to the following services:
My thoughts about Amazon Kinesis
Amazon Kinesis is a set of services for processing streaming data in the cloud.
The first service launched was Kinesis Streams, where the capacity of the data stream (its number of shards) must be chosen at creation time and is charged accordingly.
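To get a feel for that sizing decision, here is a rough sketch in Python of how the shard count could be estimated before creating a stream. It assumes the per-shard limits AWS documents for Kinesis Streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads); the function name and the workload figures are made up for illustration, so check the current limits before relying on it.

```python
import math

# Per-shard limits as documented by AWS (assumption: still current).
WRITE_MB_PER_SEC = 1.0        # ingest bandwidth per shard
WRITE_RECORDS_PER_SEC = 1000  # ingest record rate per shard
READ_MB_PER_SEC = 2.0         # read bandwidth per shard, shared by all consumers


def estimate_shards(avg_record_kb, records_per_sec, consumers=1):
    """Smallest shard count that satisfies all three per-shard limits."""
    ingress_mb = avg_record_kb * records_per_sec / 1024
    egress_mb = ingress_mb * consumers  # every consumer reads every record
    return max(
        1,
        math.ceil(ingress_mb / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(egress_mb / READ_MB_PER_SEC),
    )


# Hypothetical workload: 2 KB records at 1,500 records/s, one consumer.
print(estimate_shards(avg_record_kb=2, records_per_sec=1500))
```

The result would then be passed as ShardCount when creating the stream, for example through the CreateStream API.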
For further reading, the following source covers all the key concepts and includes the following diagram, which gives a visual representation of what Kinesis is:
Once the records are sent to the stream from the producers, the Kinesis applications (consumers) can consume them.
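The producer and consumer sides can be sketched in a few lines of Python. This is only a minimal illustration: it assumes a boto3 Kinesis client is passed in as `client` (for example `boto3.client("kinesis")`), JSON-encoded events, and a single shard read from its oldest record; the function names are my own.

```python
import json


def send_event(client, stream_name, event, partition_key):
    """Producer side: serialise one event and put it on the stream."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )


def read_events(client, stream_name, shard_id, limit=100):
    """Consumer side: read one batch from a single shard, oldest record first."""
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    response = client.get_records(ShardIterator=iterator, Limit=limit)
    return [json.loads(record["Data"]) for record in response["Records"]]
```

A real consumer would keep polling with the NextShardIterator returned by each call, or use the Kinesis Client Library, which takes care of that loop.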
No autoscaling in Kinesis Streams yet
Surprisingly, the capacity of a stream does not scale up or down automatically: a manual resize must be carried out to increase or decrease the number of shards whenever that is needed. This is a straightforward operation from the AWS console, but I think automatic scaling would be useful.
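Until the service scales by itself, something similar could be approximated with a small scheduled job. The sketch below is only an idea, not a tested autoscaler: it assumes a boto3 Kinesis client passed in as `client`, the UpdateShardCount API being available in your account, and a utilisation figure between 0 and 1 that you would have to derive yourself (for instance from CloudWatch metrics). The thresholds are arbitrary.

```python
def next_shard_count(current, utilisation, low=0.25, high=0.75):
    """Double the shards when busy, halve them when idle, otherwise keep them."""
    if utilisation > high:
        return current * 2
    if utilisation < low and current > 1:
        return max(1, current // 2)
    return current


def rescale(client, stream_name, utilisation):
    """Read the open shard count and request a new one only if it changes."""
    summary = client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    target = next_shard_count(current, utilisation)
    if target != current:
        client.update_shard_count(
            StreamName=stream_name,
            TargetShardCount=target,
            ScalingType="UNIFORM_SCALING",
        )
    return target
```

Doubling and halving keeps each request within the documented bounds of a single resize, which cannot go beyond twice or below half of the current shard count.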
Partition key as mandatory input
The decision to make the partition key a mandatory input when putting records on the stream is debatable. It is clear that the partition key provides the record grouping feature, but in my opinion it should be optional, since grouping or ordering is not a requirement for every use case.
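For context, Kinesis hashes the partition key with MD5 into a 128-bit integer and sends the record to the shard whose hash key range contains that value. The sketch below mimics this under the simplifying assumption that the hash space is split evenly across the shards (real streams expose the actual ranges per shard), and shows the usual workaround when grouping is not needed: a random key per record. Both function names are hypothetical.

```python
import hashlib
import uuid

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Index of the shard owning the key, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) * shard_count // HASH_SPACE


def random_partition_key():
    """A throwaway key that spreads records across shards when grouping is irrelevant."""
    return str(uuid.uuid4())


# The same key always maps to the same shard, which is what gives grouping.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))
```

So the mandatory key costs one extra line when no grouping is wanted, but it still has to be supplied on every put.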
Leveraging Kinesis Streams, AWS has built Kinesis Firehose, which allows users to easily create a stream that delivers data directly to Amazon Redshift or Amazon S3.
With Firehose, in my opinion, AWS has taken a great step forward. When creating a delivery stream there is no shard sizing to specify and no partition key to provide as an input parameter, since the service handles both transparently.
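The difference is visible on the producer side. Here is a minimal sketch, assuming a boto3 Firehose client passed in as `client` (for example `boto3.client("firehose")`) and JSON events; the function name is mine. The trailing newline is my own addition, since Firehose concatenates records as they are and a delimiter makes the delivered objects easier to load.

```python
import json


def deliver(client, delivery_stream, event):
    """Send one event to a Firehose delivery stream: no shard, no partition key."""
    payload = json.dumps(event) + "\n"  # newline-delimited JSON
    return client.put_record(
        DeliveryStreamName=delivery_stream,
        Record={"Data": payload.encode("utf-8")},
    )
```

Compared with the Kinesis Streams call, the only inputs left are the delivery stream name and the data itself.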
Why a temporary S3 bucket to deliver to Redshift?
My only remark about Kinesis Firehose concerns the temporary S3 bucket that must be specified when configuring a Redshift target. In my opinion, the intermediate S3 bucket should be transparent to the user. Why should users need an additional S3 bucket in their account just because Firehose has been implemented with an intermediate store in mind?
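The same requirement shows up when the delivery stream is created through the API: as far as I understand, Firehose first writes the data to S3 and then issues a Redshift COPY from that bucket, so the S3 settings sit inside the Redshift destination. The sketch below only builds that configuration as a plain dictionary; the helper name, the prefix, the buffering values and the COPY options are illustrative assumptions, not recommended settings.

```python
def redshift_destination(role_arn, jdbc_url, table, user, password, staging_bucket_arn):
    """Build a Redshift destination configuration for a Firehose delivery stream."""
    return {
        "RoleARN": role_arn,
        "ClusterJDBCURL": jdbc_url,
        "CopyCommand": {
            "DataTableName": table,
            "CopyOptions": "json 'auto'",  # assumes JSON records
        },
        "Username": user,
        "Password": password,
        # The intermediate bucket that the user has to provide and own.
        "S3Configuration": {
            "RoleARN": role_arn,
            "BucketARN": staging_bucket_arn,
            "Prefix": "firehose-staging/",
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
            "CompressionFormat": "UNCOMPRESSED",
        },
    }
```

The dictionary would be passed as RedshiftDestinationConfiguration to the Firehose CreateDeliveryStream call, together with a DeliveryStreamName.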
The following screenshot shows the relevant configuration: