My Road to the Gremlin Chaos Engineering Practitioner Certificate

Chaos Engineering is a field that has always drawn my attention. I first came across it when I heard about the Netflix Simian Army toolkit https://github.com/Netflix/SimianArmy . At first glance, it is hard to believe that anyone would run a chaos tool in production that can randomly shut down any production server (Chaos Monkey). Later I watched a video by Tammy Bryant Butow on YouTube and learned about Gremlin. Gremlin provides a hosted service that lets you run chaos experiments easily. Finally, after one week of study, I now hold the Gremlin Chaos Engineering Practitioner certificate.

Exam Resources

I followed only the two resources below to prepare for the exam

Exam Format

NOTE: The exam is free of charge; you only need to register via this link: https://gremlin.coassemble.com/unlock/7Jan8Su

Exam Preparation

1. Get familiar with how to install the Gremlin agent

  • To attack a host, the Gremlin agent needs to be installed on that host. Gremlin supports various operating systems (Ubuntu, CentOS, RHEL, Windows); you can also download the Docker image https://hub.docker.com/r/gremlin/gremlin or use the Helm repo:
helm repo add gremlin https://helm.gremlin.com
  • This is what the architecture looks like
  • On Ubuntu, these are the steps you need to follow, as shown in the diagram above:
echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys XXXX
sudo apt-get update && sudo apt-get install -y gremlin gremlind
  • Once these steps are done, register the installed Gremlin agent with the Gremlin Control Plane using your Team ID and Secret Key. To find them, go to the Team Settings page and make a note of the Team ID and Secret Key (if you don't know the Secret Key, click the Reset button).
  • Run the gremlin init command and enter the Team ID and Secret Key you copied in the previous step:
$ gremlin init
Metadata set for [ gremlin-client-version: 2.20.0 ]
Metadata set for [ os-type: Linux ]
Metadata set for [ os-name: Ubuntu ]
AWS metadata may be present
Metadata set for [ instance-id: i-0550fdb260931639b ]
Metadata set for [ local-hostname: ip-172-31-28-103.ec2.internal ]
Metadata set for [ local-ip: 172.31.28.103 ]
Metadata set for [ public-hostname: ec2-184-73-139-79.compute-1.amazonaws.com ]
Metadata set for [ public-ip: 184.73.139.79 ]
Metadata set for [ azid: use1-az4 ]
Metadata set for [ cloud: AWS ]
Metadata set for [ image-id: ami-09e67e426f25ce0d7 ]
Metadata set for [ instance-type: t2.micro ]
Metadata set for [ region: us-east-1 ]
Metadata set for [ zone: us-east-1c ]
Unable to describe AWS tags.  The error message is: No such file or directory (os error 2)
Azure metadata may be present
Please input your Team ID: <--------XXXXXXXX
Please input your Team Secret: <--------
Using XXXXXX for Team Id
Using 172.31.28.103 for Gremlin identifier
  • Go to the Gremlin dashboard and you will see your newly added host.
  • You are now all set to perform various attacks by clicking the attack button.
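If you want to double-check which host details were registered, the "Metadata set for" lines printed by gremlin init can be turned into simple key=value pairs. The following is only a rough sketch: the function name parse_gremlin_metadata is my own (hypothetical, not part of Gremlin), and it assumes the output format stays exactly as shown in the listing above.

```shell
# Extract "key=value" pairs from the "Metadata set for [ key: value ]" lines
# that `gremlin init` prints. Reads the init output on stdin.
# NOTE: parse_gremlin_metadata is a hypothetical helper, not a Gremlin command.
parse_gremlin_metadata() {
  sed -n 's/^Metadata set for \[ \([^:]*\): \(.*\) ]$/\1=\2/p'
}

# A few sample lines copied from the init output shown earlier.
sample_output='Metadata set for [ os-name: Ubuntu ]
AWS metadata may be present
Metadata set for [ local-ip: 172.31.28.103 ]
Metadata set for [ region: us-east-1 ]'

# Lines that are not metadata (like "AWS metadata may be present") are skipped.
printf '%s\n' "$sample_output" | parse_gremlin_metadata
```

On a real host you could save the init output to a file and feed that file to the function instead of the sample text.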

2. Get familiar with the various types of attacks you can perform with Gremlin

Using Gremlin, you can trigger various attacks depending on the infrastructure you target (Hosts, Containers, or Kubernetes).

For Hosts

Resource: Test against sudden changes in consumption of computing resources.

  • CPU: Test that your application behaves as expected even when CPU capacity is limited or exhausted
  • Disk: Test system and application behavior when storage space is limited or unavailable, and validate dynamic storage provisioning systems
  • IO: Test against heavy IO operations to understand their effect on your applications
  • Memory: Test your systems against memory consumption to ensure they can tolerate and perform given a sudden increase in usage

State: Test against unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes.

  • Process Killer: Test against application crashes and similar events by terminating specific sets of processes
  • Shutdown: Test resilience to host failures by rebooting or shutting down targeted host operating systems
  • Time Travel: Test for scenarios such as Daylight Saving Time (DST), clock drift between hosts, and expiring SSL/TLS certificates

Network: Test against unreliable network conditions.

  • Blackhole: Test against unreachable dependencies by dropping network traffic between services
  • DNS: Test against DNS outages, and validate both fallback DNS servers and DNS resolver configurations
  • Latency: Test your system’s responsiveness under varying network conditions by injecting a controlled delay into outbound network traffic
  • Packet Loss: Test your system’s end user experience when a percentage of outbound network packets are dropped or corrupted

Try to perform some of these attacks before the exam. For example, to test a shutdown, go to State and click Shutdown; you have the option to add a delay and to reboot the host after the shutdown.

  • You can go to the host and see what command it is executing:
$ ps aux|grep -i gremlin
gremlin     2142  0.0  0.9  23420  9328 ?        Ssl  04:42   0:00 /usr/sbin/gremlind
gremlin     2362  0.0  0.8  23612  8516 ?        Sl   05:07   0:00 gremlin attack shutdown -d 1 -r
  • Gremlin also provides a nice UI where you can view this.
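The process listing above can also be filtered so that only the running attacks and their arguments are shown. This is a minimal sketch, assuming the usual ps aux layout where the command starts in the 11th column; list_gremlin_attacks is a name I made up for illustration.

```shell
# Print the arguments of every running "gremlin attack ..." process.
# Reads `ps aux` style output on stdin; assumes the command starts at field 11.
# NOTE: list_gremlin_attacks is a hypothetical helper, not a Gremlin command.
list_gremlin_attacks() {
  awk '$11 == "gremlin" && $12 == "attack" {
         out = $13
         for (i = 14; i <= NF; i++) out = out " " $i
         print out
       }'
}

# The two lines from the listing above: the daemon and one active attack.
sample_ps='gremlin     2142  0.0  0.9  23420  9328 ?        Ssl  04:42   0:00 /usr/sbin/gremlind
gremlin     2362  0.0  0.8  23612  8516 ?        Sl   05:07   0:00 gremlin attack shutdown -d 1 -r'

# Only the attack line matches; the gremlind daemon line is ignored.
printf '%s\n' "$sample_ps" | list_gremlin_attacks
```

On the attacked host itself, the same function can be used as ps aux | list_gremlin_attacks.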

Similarly, you can perform other kinds of attacks, such as CPU attacks. In the scenario below, we run the attack for 60 seconds at 50% CPU utilization on all cores.

  • You can go back to the host and check the CPU utilization using the top command.
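If you prefer a number over watching top, the overall CPU utilization can be estimated from two samples of the first line of /proc/stat. The sketch below assumes a Linux host and is not something Gremlin provides; cpu_busy_pct is a hypothetical helper name. It treats idle plus iowait as idle time and everything else (up to the steal field) as busy time.

```shell
# Compute the busy CPU percentage between two aggregate "cpu ..." lines
# taken from /proc/stat (Linux only).
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
# NOTE: cpu_busy_pct is a hypothetical helper written for this post.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= NF && i <= 9; i++) total += $i   # user .. steal
      t[NR] = total
      idle[NR] = $5 + $6                                # idle + iowait
    }
    END {
      dt = t[2] - t[1]
      di = idle[2] - idle[1]
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (100 * (dt - di)) / dt
    }'
}

# Fixed sample values: 1000 ticks elapsed, 500 of them idle.
before='cpu 100 0 100 800 0 0 0 0 0 0'
after='cpu 350 0 350 1300 0 0 0 0 0 0'
cpu_busy_pct "$before" "$after"

# On a live host, take two real samples a second apart instead:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

During a 50% CPU attack on all cores, the live value should sit somewhere around 50 plus whatever the host was already using.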

3. Get familiar with the gremlin command line

$ gremlin -h
gremlin
USAGE:
    gremlin <SUBCOMMAND>
FLAGS:
    -h, --help    Prints help information
SUBCOMMANDS:
    attack                Run a new gremlin attack against this host
    attack-container      Run a new gremlin attack against the specified container
    check                 Show runtime troubleshooting data
    help                  Prints this message or the help of the given subcommand(s)
    init                  Initialize a new client session with the Gremlin service
    logout                Remove this client from the Gremlin service
    measure               Measure then report dynamic system data
    rollback              Interrupt an active attack, or revert the last impact
    rollback-container    Interrupt an active attack against a Docker container
    status                Show the status of all gremlins or a specific attack
    syscheck              System check was a feature in Gremlin 2.8.x and is no longer supported
    validate              Validate a gremlin
    version               Show version information for the gremlin binary
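A few of the read-only subcommands from the help output above (version, check and status) make a handy quick look at the agent before you start an attack. Here is a small sketch of how I would chain them; the run and preflight functions and the DRY_RUN variable are my own hypothetical helpers, and by default the script only prints the commands so it can be tried on a machine without Gremlin installed.

```shell
# Print each command instead of executing it unless DRY_RUN=0 is set.
# NOTE: DRY_RUN, run and preflight are hypothetical helpers for this post;
# only the gremlin subcommands themselves come from the help output above.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

preflight() {
  run gremlin version   # show version information for the gremlin binary
  run gremlin check     # show runtime troubleshooting data
  run gremlin status    # show the status of all gremlins
}

preflight
```

Running it with DRY_RUN=0 on a host where the agent is installed executes the three commands for real.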

In the end, I would say this exam is straightforward: go through the Gremlin docs and YouTube videos (bonus: attend their bootcamp if you can) and you should be good to go.

The best way to connect with me is via any of the channels below

AWS Community Builder, Ex-Redhat, Author, Blogger, YouTuber, RHCA, RHCDS, RHCE, Docker Certified, 4XAWS, CCNA, MCP, Certified Jenkins, Terraform Certified, 1XGCP