My road to the Gremlin Chaos Engineering Practitioner Certificate
Chaos Engineering is a field that has always drawn my attention. I first came across it when I heard about the Netflix Simian Army toolkit https://github.com/Netflix/SimianArmy . At first glance, it’s hard to believe that anyone would run a chaos tool in production that can randomly shut down any production server (Chaos Monkey). Later on, I watched a Tammy Bryant Butow video on YouTube and came to know about Gremlin. Gremlin provides a hosted service that lets you run chaos experiments easily. Finally, after one week of study, I am now a certified Gremlin Chaos Engineering Practitioner.
Exam Resources
I followed only the first two resources below to prepare for the exam
- Gremlin Tutorial: https://www.gremlin.com/community/tutorials/?ref=nav
- Gremlin YouTube Channel: https://www.youtube.com/channel/UC6PAoCqf2LSw6Hth-4M4yEQ
- If you need more practice and hands-on experience, you can attend a Gremlin bootcamp https://www.gremlin.com/bootcamps/?ref=nav
Exam Format
- Number of Questions: 20
- Question Type: Single and Multiple Choice, Drag and Drop
- If you still have any doubts about the exam format, please watch this video https://www.youtube.com/watch?v=TL1j2MJBE0A&t=1248s
NOTE: The exam is free of cost; you only need to register via this link https://gremlin.coassemble.com/unlock/7Jan8Su
Exam Preparation
- To prepare for the exam, the first thing you can do is create a free account on the Gremlin website https://app.gremlin.com/?ref=nav
1. Get familiar with how to install the Gremlin agent
- To attack a host, the Gremlin agent needs to be installed on that host. Gremlin supports various operating systems (Ubuntu, CentOS, RHEL, Windows); you can also download the Docker image https://hub.docker.com/r/gremlin/gremlin or use the Helm repo
helm repo add gremlin https://helm.gremlin.com
- This is what the architecture looks like
- On Ubuntu, these are the steps you need to follow, as shown in the diagram above
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys XXXX
sudo apt-get update && sudo apt-get install -y gremlin gremlind
- Once these steps are done, you need to register the installed Gremlin agent with the Gremlin Control Plane using your Team ID and Secret Key. To find them, go to the Team Settings page and make a note of the Team ID and Secret Key (if you don’t know the Secret Key, click the Reset button)
- Run the gremlin init command and enter the Team ID and Secret Key you copied in the previous step
$ gremlin init
Metadata set for [ gremlin-client-version: 2.20.0 ]
Metadata set for [ os-type: Linux ]
Metadata set for [ os-name: Ubuntu ]
AWS metadata may be present
Metadata set for [ instance-id: i-0550fdb260931639b ]
Metadata set for [ local-hostname: ip-172-31-28-103.ec2.internal ]
Metadata set for [ local-ip: 172.31.28.103 ]
Metadata set for [ public-hostname: ec2-184-73-139-79.compute-1.amazonaws.com ]
Metadata set for [ public-ip: 184.73.139.79 ]
Metadata set for [ azid: use1-az4 ]
Metadata set for [ cloud: AWS ]
Metadata set for [ image-id: ami-09e67e426f25ce0d7 ]
Metadata set for [ instance-type: t2.micro ]
Metadata set for [ region: us-east-1 ]
Metadata set for [ zone: us-east-1c ]
Unable to describe AWS tags. The error message is: No such file or directory (os error 2)
Azure metadata may be present
Please input your Team ID: <--------XXXXXXXX
Please input your Team Secret: <--------
Using XXXXXX for Team Id
Using 172.31.28.103 for Gremlin identifier
- Go to the Gremlin dashboard and you will see your newly added host.
- You are now all set to perform various attacks by just clicking the Attack button.
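Before launching the first attack, it can help to confirm from the host itself that the agent is in place. The following is a minimal sketch, not an official Gremlin procedure: check_binary and check_process are hypothetical helper names, and the sketch assumes the packages put a gremlin client on the PATH and run a process named gremlind.

```shell
#!/bin/sh
# Hypothetical helpers for a post-install sanity check (not part of Gremlin).

# Print whether a command is available on the PATH.
check_binary() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Print whether a process with exactly this name is running.
check_process() {
  if pgrep -x "$1" >/dev/null 2>&1; then
    echo "running: $1"
  else
    echo "not running: $1"
  fi
}

check_binary gremlin     # client used for `gremlin init` and attacks
check_process gremlind   # agent daemon (assumed process name)
```

If either line reports missing or not running, the host will most likely not show up in the dashboard, so it is worth fixing that before moving on.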
2. Get familiar with the various types of attacks you can perform via Gremlin
Using Gremlin, you can trigger various attacks depending on the infrastructure you target (Hosts, Containers, or Kubernetes)
For Hosts
Resource: Test against sudden changes in consumption of computing resources.
- CPU: Test that your application behaves as expected even when CPU capacity is limited or exhausted
- Disk: Test system and application behavior when storage space is limited or unavailable, and validate dynamic storage provisioning systems
- IO: Test against heavy IO operations to understand their effect on your applications
- Memory: Test your systems against memory consumption to ensure they can tolerate and perform given a sudden increase in usage
State: Test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes.
- Process Killer: Test against application crashes and similar events by terminating specific sets of processes
- Shutdown: Test resilience to host failures by rebooting or shutting down targeted host operating systems
- Time Travel: Test for scenarios such as Daylight Saving Time (DST), clock drift between hosts, and expiring SSL/TLS certificates
Network: Test against unreliable network conditions.
- Blackhole: Test against unreachable dependencies by dropping network traffic between services
- DNS: Test against DNS outages, and validate both fallback DNS servers and DNS resolver configurations
- Latency: Test your system’s responsiveness under varying network conditions by injecting a controlled delay into outbound network traffic
- Packet Loss: Test your system’s end user experience when a percentage of outbound network packets are dropped or corrupted
Try to run some of these attacks before the exam. For example, to test a shutdown, go to State and click Shutdown; you have the option to introduce a delay and to reboot the host after shutdown.
- You can go to the host and see which command Gremlin is executing
$ ps aux|grep -i gremlin
gremlin 2142 0.0 0.9 23420 9328 ? Ssl 04:42 0:00 /usr/sbin/gremlind
gremlin 2362 0.0 0.8 23612 8516 ? Sl 05:07 0:00 gremlin attack shutdown -d 1 -r
- Gremlin also provides a nice UI, where you can view this
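To pick out only the running attack from the process list, a small filter can help. This is a sketch under assumptions: list_gremlin_attacks is a hypothetical helper name, and the sample lines are copied from the ps aux output shown above. The -d 1 -r flags in that output appear to correspond to the delay and reboot options chosen in the UI.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of `ps` output that belong to a
# running Gremlin attack, skipping the long-lived gremlind daemon.
list_gremlin_attacks() {
  # The [g] bracket keeps this grep from matching its own entry in `ps`.
  grep -E '[g]remlin attack' || true
}

# Sample lines taken from the `ps aux` output shown above.
printf '%s\n' \
  'gremlin 2142 0.0 0.9 23420 9328 ? Ssl 04:42 0:00 /usr/sbin/gremlind' \
  'gremlin 2362 0.0 0.8 23612 8516 ? Sl 05:07 0:00 gremlin attack shutdown -d 1 -r' |
  list_gremlin_attacks
```

With the sample input, only the gremlin attack shutdown line is printed. On a live host, the same filter could be fed with ps aux | list_gremlin_attacks.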
Similarly, you can perform other kinds of attacks, such as CPU attacks. In the scenario below, we run the attack for 60 seconds at 50% CPU utilization on all cores.
- You can go back to the host and check the CPU utilization using the top command
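Since top is interactive, a scripted check can be handy when you want to record the effect of a CPU attack. The sketch below is an assumption-laden illustration, not something Gremlin provides: cpu_busy_percent is a hypothetical helper that compares two samples of the aggregate cpu line from /proc/stat (Linux only) and prints the busy percentage between them. The two sample lines are made-up numbers chosen to mimic a 50% attack.

```shell
#!/bin/sh
# Hypothetical helper: busy CPU percentage between two samples of the
# aggregate "cpu" line of /proc/stat (Linux field layout assumed).
cpu_busy_percent() {
  # $1 and $2 are two "cpu ..." lines taken some seconds apart.
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      # Fields 2..9: user nice system idle iowait irq softirq steal
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      t[NR] = total
      d[NR] = $5 + $6          # idle + iowait count as not busy
    }
    END {
      dt = t[2] - t[1]
      di = d[2] - d[1]
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (dt - di) * 100 / dt
    }'
}

# Made-up samples: 500 ticks elapsed, 250 of them idle.
before='cpu 100 0 100 800 0 0 0 0 0 0'
after='cpu 300 0 150 1050 0 0 0 0 0 0'
cpu_busy_percent "$before" "$after"   # prints 50
```

On a Linux host, two real samples could be taken with grep '^cpu ' /proc/stat a few seconds apart while the attack runs, and passed to the function in the same way.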
- To use Gremlin with EKS, please check this blog https://www.gremlin.com/community/tutorials/how-to-install-and-use-gremlin-with-eks/
- To use Gremlin with RDS, see https://www.gremlin.com/community/tutorials/how-to-use-gremlin-with-amazon-rds/
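For the Kubernetes route, the Helm repo mentioned in the installation section is the starting point. The sketch below only prints the commands rather than running them. Treat it as an assumption to verify against the Gremlin Helm chart documentation: the gremlin.secret.* value names are recalled from that documentation and may differ between chart versions, print_helm_install is a hypothetical helper, and the team ID, cluster ID, and secret shown are placeholders.

```shell
#!/bin/sh
# Sketch only: prints the Helm commands instead of executing them.
# The chart value names (gremlin.secret.*) are assumptions to verify
# against the chart's own documentation before use.
print_helm_install() {
  team_id=$1
  cluster_id=$2
  team_secret=$3
  cat <<EOF
helm repo add gremlin https://helm.gremlin.com
helm install gremlin gremlin/gremlin \\
  --namespace gremlin --create-namespace \\
  --set gremlin.secret.managed=true \\
  --set gremlin.secret.type=secret \\
  --set gremlin.secret.teamID=${team_id} \\
  --set gremlin.secret.clusterID=${cluster_id} \\
  --set gremlin.secret.teamSecret=${team_secret}
EOF
}

# Placeholder values; use the Team ID and Secret Key from Team Settings.
print_helm_install 'TEAM_ID_HERE' 'my-cluster' 'TEAM_SECRET_HERE'
```

Printing first makes it easy to review the command before pasting it into a shell that has access to the cluster.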
3. Get familiar with the Gremlin command line
$ gremlin -h
gremlin
USAGE:
    gremlin <SUBCOMMAND>
FLAGS:
    -h, --help    Prints help information
SUBCOMMANDS:
    attack                Run a new gremlin attack against this host
    attack-container      Run a new gremlin attack against the specified container
    check                 Show runtime troubleshooting data
    help                  Prints this message or the help of the given subcommand(s)
    init                  Initialize a new client session with the Gremlin service
    logout                Remove this client from the Gremlin service
    measure               Measure then report dynamic system data
    rollback              Interrupt an active attack, or revert the last impact
    rollback-container    Interrupt an active attack against a Docker container
    status                Show the status of all gremlins or a specific attack
    syscheck              System check was a feature in Gremlin 2.8.x and is no longer supported
    validate              Validate a gremlin
    version               Show version information for the gremlin binary
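A few of these subcommands are useful to practice together: check, status, and rollback. The sketch below wraps them in a hypothetical run helper with a dry-run switch, so the sequence can be reviewed without touching a host. The subcommand names come from the help output above; the DRY_RUN variable and the helper itself are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 on a host with the agent installed to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run gremlin check      # runtime troubleshooting data
run gremlin status     # status of all gremlins or a specific attack
run gremlin rollback   # interrupt an active attack, or revert the last impact
```

Knowing where rollback sits in this sequence is worthwhile, since it is the command-line way to interrupt an attack that is misbehaving.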
In the end, I would say this exam is straightforward: go through the Gremlin docs and YouTube channel (bonus: attend their bootcamp if you can) and you should be good to go.
The best way to connect with me is via any of the below mediums
- Website: https://101daysofdevops.com/
- Linkedin: https://www.linkedin.com/in/prashant-lakhera-696119b/
- Twitter: @100daysofdevops OR @lakhera2015
- Facebook: https://www.facebook.com/groups/795382630808645/
- Medium: https://medium.com/@devopslearning
- GitHub: https://github.com/100daysofdevops/100daysofdevops
- YouTube Channel: https://www.youtube.com/user/laprashant/videos
- Slack: https://join.slack.com/t/100daysofdevops/shared_invite/zt-au03logz-YfDUp_FJF4rAUeDEbgWmsg
- Reddit: r/101DaysofDevops
- Meetup: https://www.meetup.com/100daysofdevops/