VPC Flow Logs

Prashant Lakhera
9 min read · Nov 22, 2022


📚 To read the complete blog https://www.101daysofdevops.com/courses/100-days-of-aws/lessons/day-45/

📖 To view the complete course https://lnkd.in/gjeGAPd2

➡️ You can contact me via https://lnkd.in/dePjvNDw

Virtual Private Cloud (VPC) Flow Logs allow you to capture information about IP traffic sessions processed by elastic network interfaces in your VPC.

VPC Flow Logs may be defined at one of three different scopes:

  • VPC level: This will monitor logs for every subnet and every network interface within that VPC.
  • Subnet level: This will monitor every interface within that subnet.
  • Elastic Network Interface (ENI) level: This will monitor only that specific interface, including ENIs created to support AWS service objects connected to your VPC.

Flow logs defined at the VPC scope apply to every subnet and every ENI within that VPC. Similarly, if flow logs are defined at the subnet level, all ENIs associated with that subnet will record and submit data as defined by the flow log. Flow logs defined for a single elastic network interface apply only to that ENI.

NOTE: Elastic network interfaces (ENIs) that fall under the scope of more than one flow log definition will collect and submit their traffic data separately, according to the settings of each applicable definition.

Flow log data can be sent to the following destinations:

  • CloudWatch log group: Data from each applicable ENI will be sent to CloudWatch in its own separate log stream (one log stream per ENI).
  • S3 bucket: All available log data from all applicable ENIs will be periodically collected into a single log file, which is compressed and then sent (every 5 minutes or 75 MB) to the S3 bucket (one log file object per publication). By default, files are delivered to the following location “bucket-and-optional-prefix/AWSLogs/account_id/vpcflowlogs/region/year/month/day/” and the log file name is “aws_account_id_vpcflowlogs_region_flow_log_id_YYYYMMDDTHHmmZ_hash.log.gz”.
  • Kinesis Data Firehose: You can publish flow log data directly to Kinesis Data Firehose.

Limitations

  • Only captures metadata: Flow logs don’t capture packet payloads, only metadata, so you can’t do packet analysis even if the traffic is unencrypted. To capture packet content, you need to install a packet sniffer.
  • No real-time data: Data reporting is not real-time; each ENI aggregates data over 1- or 10-minute intervals (as configured in the flow log definition). On top of that, there is an additional 5–10 minute delay before the data is published to its destination. You can’t rely on flow logs to provide real-time telemetry on network packet flow.
  • Only captures some of your IP traffic: There are some forms of IP traffic within a VPC that VPC flow logs ignore, e.g. DNS queries, instance metadata, Amazon Time Sync Service, etc. Most importantly, no application data is captured within a session, because flow logs operate at Layers 3 and 4; the closest they get is capturing the port used during the communication session. Check this doc for more info https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-logs-limitations
  • You can’t modify the VPC flow log definition once it’s created.
  • Flow logs may not capture the original/correct IP address. In version 2, the source address (srcaddr) and destination address (dstaddr) fields give you the source and destination IP of link-local traffic, i.e., they only tell you about the two interfaces that directly communicate with each other. So if your EC2 instance is handling traffic forwarded by a load balancer, the log will show only the IP address of the load balancer. This seems like a big issue, but TADA: with a custom log format you can use fields like pkt-srcaddr and pkt-dstaddr to display the original IP.

Creating Flow Logs

  • Give your flow log a meaningful name, choose which type of traffic you want to capture (accept, reject, or all), the aggregation interval (1 or 10 minutes), and the destination for these logs (CloudWatch Logs, an S3 bucket, Kinesis Data Firehose, etc.).
  • OR you can choose an Amazon S3 bucket and specify the bucket ARN.
  • Specify the destination log group (if you want to send it to a log group) and associate an IAM role (check this doc for more info https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-cwl.html), and then the log record format, which can be either the AWS default format or a custom format. If you choose an S3 bucket as the destination, the automatically created bucket policy will apply the necessary privileges. Click on Create flow log. The same steps can also be scripted; see the boto3 sketch right after this list.
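If you prefer to create the flow log from code instead of the console, here is a minimal boto3 sketch of the same steps. The VPC ID, log group name, and role ARN below are placeholders you would replace with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a flow log at the VPC scope, capturing ALL traffic (accept and reject),
# delivered to a CloudWatch Logs group, aggregated over 10-minute (600 s) intervals.
response = ec2.create_flow_logs(
    ResourceType="VPC",                      # or "Subnet" / "NetworkInterface"
    ResourceIds=["vpc-0123456789abcdef0"],   # placeholder VPC ID
    TrafficType="ALL",                       # or "ACCEPT" / "REJECT"
    LogDestinationType="cloud-watch-logs",   # or "s3" / "kinesis-data-firehose"
    LogGroupName="my-vpc-flow-logs",         # placeholder log group name
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",  # placeholder role
    MaxAggregationInterval=600,              # 60 or 600 seconds
)
print(response["FlowLogIds"])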

NOTE: Please ensure the trust policy allows the flow logs service to assume the role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "vpc-flow-logs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
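Alongside the trust policy, the role also needs a permissions policy that lets the flow logs service write to CloudWatch Logs (see the doc linked above). A minimal boto3 sketch, assuming a pre-existing role; the role and policy names are placeholders:

import json
import boto3

iam = boto3.client("iam")

# Permissions the flow logs role needs to deliver records to CloudWatch Logs.
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
        ],
        "Resource": "*",
    }],
}

# Attach the inline policy to the existing role (placeholder names).
iam.put_role_policy(
    RoleName="flow-logs-role",
    PolicyName="flow-logs-cloudwatch-permissions",
    PolicyDocument=json.dumps(permissions),
)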
# Default (version 2) log record fields
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
# Accepted SSH traffic on port 22
2 123456789010 eni-1235b8ca123456789 172.16.0.20 172.16.0.100 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK
# Rejected SSH traffic on port 22
2 123456789010 eni-1235b8ca123456789 172.16.0.36 172.16.0.120 49761 22 6 20 4249 1418530010 1418530070 REJECT OK
  • Additional fields are available in a custom format to give you further information. As noted in the limitations above, in version 2 the srcaddr and dstaddr fields only tell you about the two interfaces that directly communicate with each other (so traffic forwarded by a load balancer shows only the load balancer’s IP), while the custom fields pkt-srcaddr and pkt-dstaddr display the original IP. If you go to the current highest version (version 5), it will also give you the AWS service name via fields like pkt-src-aws-service and pkt-dst-aws-service, as well as the flow direction (ingress/egress). Also, traffic going to and from the VPC only shows the primary private IP address (so even if a secondary IP address is assigned, it will not show up in VPC flow logs). Check this doc for more info https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-logs-custom.

NOTE: The order in which you select these fields determines the order in which they will later be displayed in the log records.
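As an illustration, here is a hedged boto3 sketch of a flow log that uses a custom format including pkt-srcaddr and pkt-dstaddr and publishes to S3. The subnet ID and bucket ARN are placeholders, and the ${field} order in LogFormat is exactly the column order the records will be written in.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Custom log format: the field order chosen here becomes the column order in the records.
custom_format = (
    "${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} "
    "${pkt-srcaddr} ${pkt-dstaddr} ${srcport} ${dstport} ${protocol} "
    "${action} ${flow-direction} ${log-status}"
)

ec2.create_flow_logs(
    ResourceType="Subnet",
    ResourceIds=["subnet-0123456789abcdef0"],          # placeholder subnet ID
    TrafficType="REJECT",                              # capture only rejected traffic
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket",  # placeholder bucket ARN
    LogFormat=custom_format,
)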

  • You can also configure flow logs at the subnet level
  • OR at the ENI level.
  • Now that you have logs in the CloudWatch log group or S3 bucket, what can we do with them? Let’s start with the CloudWatch log group. If you click on the configured CloudWatch log group, you will see the list of individual log streams. Each ENI sends its data to its own log stream.
  • Click on a log stream, and you will see the logs. Here you can apply a CloudWatch Logs filter or a simple text match, e.g., to get all rejected packets, you can search for the REJECT keyword (a boto3 sketch of the same filter follows this list).
  • You can export these results to a different file format, ASCII or CSV.
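The same REJECT filter can be applied programmatically. A minimal sketch, assuming boto3 and the placeholder log group name my-vpc-flow-logs:

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Pull recent flow log events whose message contains "REJECT".
response = logs.filter_log_events(
    logGroupName="my-vpc-flow-logs",   # placeholder log group name
    filterPattern="REJECT",            # simple text match, same as the console search
    limit=50,
)

for event in response["events"]:
    print(event["message"])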

CloudWatch Logs Insights: This helps you perform some basic analysis. CloudWatch Logs Insights can recognize data fields within the log streams. Click on Logs Insights, select the CloudWatch log group, and run the default query. The default query displays the text message sent in the log stream, sorted by date and time, and is limited to returning the first twenty records.

  • If you click on the right, AWS provides you with some sample queries, with an Apply button at the bottom.
  • Let’s look at the top 20 source IP addresses with the highest number of rejected requests and click the Apply button at the bottom.
filter action="REJECT"
| stats count(*) as numRejections by srcAddr
| sort numRejections desc
| limit 20
  • If you click on the Visualization tab, you can create a chart based on the results of your queries.

NOTE: Visualization requires an aggregate function to apply to your flow logs, such as counting logs that match certain criteria.

NOTE: As you have probably realized, the query syntax is not SQL.
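You can also run the same Logs Insights query from code. A minimal sketch, assuming boto3 and the placeholder log group name my-vpc-flow-logs:

import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Top 20 source IPs with the most rejected requests over the last 24 hours.
query = """
filter action="REJECT"
| stats count(*) as numRejections by srcAddr
| sort numRejections desc
| limit 20
"""

start = logs.start_query(
    logGroupName="my-vpc-flow-logs",        # placeholder log group name
    startTime=int(time.time()) - 86400,     # 24 hours ago (epoch seconds)
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print each result row.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})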

S3

Let’s explore S3 and see what we can do with our flow logs. As mentioned earlier, all available log data from all applicable ENIs will be periodically collected into a single log file, which is compressed and then sent (every 5 minutes or 75 MB) to the S3 bucket (one log file object per publication).

  • You can’t view these compressed logs directly; you either need to download and decompress them locally or use the methods described below.
  • You can query these logs using S3 Select.
  • Keep all the fields at their defaults except the CSV delimiter: choose Custom and enter a space as the custom CSV delimiter, since the VPC flow log fields are space-delimited.
  • Then run the default query.
  • One gotcha with S3 Select: it doesn’t understand the field header, so if you try to run a query against a specific field name, e.g., srcaddr, it will fail.
  • So you need to specify the positional identifier for your field to make it work (_4 for srcaddr), as in the query and boto3 sketch below.
SELECT _4 FROM s3object s LIMIT 5
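The same S3 Select query can be issued with boto3. A minimal sketch; the bucket name and object key are placeholders, and the input serialization mirrors the console settings above (space delimiter, gzip compression):

import boto3

s3 = boto3.client("s3")

# Run S3 Select against one compressed flow log object; _4 is the srcaddr column.
response = s3.select_object_content(
    Bucket="my-flow-log-bucket",  # placeholder bucket
    Key="AWSLogs/123456789012/vpcflowlogs/us-east-1/2022/11/22/example.log.gz",  # placeholder key
    ExpressionType="SQL",
    Expression="SELECT _4 FROM s3object s LIMIT 5",
    InputSerialization={
        "CSV": {"FieldDelimiter": " ", "FileHeaderInfo": "IGNORE"},
        "CompressionType": "GZIP",
    },
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; the Records events carry the query output.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())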

Athena

To dive deeper into your log analysis, you can use Athena. Using Athena, you can query flow logs with SQL. There are multiple ways to connect your VPC flow logs to Athena; I will create a table with a SQL statement in Athena and then run the query. Check this doc for more info https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html#create-vpc-logs-table.

CREATE EXTERNAL TABLE IF NOT EXISTS `vpc_flow_logs` (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string
)

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://<bucket name>/AWSLogs/<account id>/vpcflowlogs/<region name>/'
TBLPROPERTIES ("skip.header.line.count"="1");

NOTE: The default template includes all the fields from all flow log versions. Make sure your Athena table definition only contains the fields mentioned in your flow log definition, e.g., in my case:

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
  • To view the records in your table, run the query below (or run it programmatically, as in the sketch that follows):
SELECT * FROM "default"."vpc_flow_logs" limit 10;
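If you want to run the same Athena query from code, here is a minimal boto3 sketch; the results output location is a placeholder S3 path that you must own.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off the query against the vpc_flow_logs table in the default database.
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "default"."vpc_flow_logs" LIMIT 10;',
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish, then print the returned rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])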

NOTE: All flow log activities will incur charges 💰💰. Gathering the data into a CloudWatch log group incurs charges, storing the data in S3 incurs charges, and every query you run via S3 Select or Athena incurs charges.

Other Alternative Solutions

  • Amazon Redshift: If you are dealing with a large dataset, you can use Redshift, which can query data in S3.
  • Kinesis: This will help you move the data to other storage and analysis tools like Elasticsearch or Splunk.
