<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-GB"><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="https://jonathanbull.co.uk/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jonathanbull.co.uk/" rel="alternate" type="text/html" hreflang="en-GB" /><updated>2022-06-14T09:32:59+01:00</updated><id>https://jonathanbull.co.uk/blog/feed.xml</id><title type="html">Jonathan Bull</title><subtitle>Personal site for Jonathan Bull, an entrepreneur based in London.</subtitle><author><name>Jonathan Bull</name></author><entry><title type="html">The AWS Health Dashboard can’t be trusted</title><link href="https://jonathanbull.co.uk/blog/aws-health-dashboard-cannot-be-trusted/" rel="alternate" type="text/html" title="The AWS Health Dashboard can’t be trusted" /><published>2022-06-09T19:00:00+01:00</published><updated>2022-06-09T19:00:00+01:00</updated><id>https://jonathanbull.co.uk/blog/aws-health-dashboard-cannot-be-trusted</id><content type="html" xml:base="https://jonathanbull.co.uk/blog/aws-health-dashboard-cannot-be-trusted/">&lt;p&gt;I’m the co-founder of &lt;a href=&quot;https://emailoctopus.com&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;EmailOctopus&lt;/a&gt;, an email marketing platform with a slight difference. Users can connect our platform to their AWS account and send emails through Amazon’s Simple Email Service (SES). Amazon are &lt;em&gt;really&lt;/em&gt; good at delivering emails reliably and cheaply, which means our average user pays 20% of the price of our biggest competitor, Mailchimp.&lt;/p&gt;

&lt;p&gt;Amazon SES also provides metrics on how emails perform. It’s this data that powers our reporting features:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/report-good.png&quot; alt=&quot;A screenshot of a normal EmailOctopus campaign report&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Thousands of our users rely on these reports every day. So when they stopped working last Thursday, our support channels started getting busy:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/ticket-1.png&quot; alt=&quot;A screenshot of a support ticket which asks why the report is showing no data&quot; /&gt;
&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/ticket-2.png&quot; alt=&quot;A screenshot of a support ticket which asks why the report is showing no data&quot; /&gt;
&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/ticket-3.png&quot; alt=&quot;A screenshot of a support ticket which asks why the report is showing no data&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We quickly determined that we were receiving fewer click and open events from SES than usual. In fact, we were receiving none at all in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;us-east-1&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;us-west-2&lt;/code&gt; regions. Our users’ reports were a tumbleweed of activity:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/report-bad.png&quot; alt=&quot;A screenshot of an EmailOctopus campaign report with zero clicks and opens&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And our internal metrics weren’t looking much better – events dropped off a cliff around 12pm UTC:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/metrics-bad.png&quot; alt=&quot;A chart showing a big dip&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We were still receiving other event types, but opens and clicks make up the bulk of our activity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where things got frustrating. We fired off a message to Amazon’s premium support team on the off chance this could be an SES issue, but with ten hours having passed and nothing on the AWS Health Dashboard, we assumed it had to be an issue with our systems. All of our developers were online trying to figure this out when, all of a sudden, events started coming through again:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/metrics-good.png&quot; alt=&quot;A chart showing a big dip, then a trend back up&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And while events continued to come in correctly, we kept debugging and monitoring things on our end, just in case it was our fault. That went on for almost &lt;em&gt;four days&lt;/em&gt;, until Amazon confirmed they had an outage totalling over 12 hours across two regions:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Between 10:00 AM and 10:55 PM [UTC] in the [us-east-1] region, and between 12:50 PM and 10:45 PM [UTC] in the [us-west-2] region we experienced an issue processing Opens &amp;amp; Clicks data. The notification events could not be saved during this period and events could not be reprocessed. The issue has been resolved and the service is operating normally.&lt;/p&gt;

&lt;/blockquote&gt;

&lt;p&gt;To be clear, every single click and open event that occurred in those regions over the course of 12 hours had been permanently lost. This has had a huge impact on our customers – and the fact we couldn’t tell them what had happened for almost four days didn’t make things any easier. We asked Amazon why this wasn’t noted in the Health Dashboard:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This class of issues are identified to go to customers Personal Health Dashboard, However due to issues with our instrumentation this wasn’t possible.&lt;/p&gt;

&lt;/blockquote&gt;

&lt;p&gt;This feels strikingly similar to the &lt;a href=&quot;https://www.theregister.com/2017/03/01/aws_s3_outage&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;S3 outage in 2017&lt;/a&gt;, when the dashboard embarrassingly showed green ticks for the entirety of the two-hour outage because the dashboard relied on… S3.&lt;/p&gt;

&lt;p&gt;More than five years on, the Health Dashboard still can’t be trusted to accurately report on the status of a service. And at the time of this blog post, seven days after this 12-hour outage, this update still hasn’t been posted to the AWS Health Dashboard. Nothing to see but green ticks:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/aws-health-dashboard-cannot-be-trusted/green-ticks.png&quot; alt=&quot;A screenshot of the AWS Health Dashboard showing no issues for Amazon SES on 2 Jun&quot; /&gt;&lt;/p&gt;</content><author><name>Jonathan Bull</name></author><category term="aws" /><summary type="html">I’m the co-founder of EmailOctopus, an email marketing platform with a slight difference. Users can connect our platform to their AWS account and send emails through Amazon’s Simple Email Service (SES). Amazon are really good at delivering emails reliably and cheaply, which means our average user pays 20% of the price of our biggest competitor, Mailchimp.</summary></entry></feed>