ALARM: "prod-286-QUERY-queue-oldest-message-exceed-time"

Description

You are receiving this email because your Amazon CloudWatch Alarm "prod-286-QUERY-queue-oldest-message-exceed-time" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 datapoint [68.0 (14/11/19 21:12:00)] was greater than or equal to the threshold (30.0)." at "Thursday 14 November, 2019 21:17:45 UTC".

Environment

None

Activity

John Hill
November 15, 2019, 9:22 PM

The top ten slowest queries were all run against the same view as above:

RUNTIME_MS    QUERY
41841553      SELECT * FROM syn9630847
41800411      SELECT * FROM syn9630847
39011004      SELECT * FROM syn9630847
38968715      SELECT * FROM syn9630847
12724478      SELECT * FROM syn9630847
12686868      SELECT * FROM syn9630847
7897951       SELECT * FROM syn9630847
7874808       SELECT * FROM syn9630847
7017594       SELECT * FROM syn9630847
6946215       SELECT * FROM syn9630847
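
For scale, the runtimes listed above can be converted from milliseconds to hours. The snippet below is only a convenience sketch; the list name and helper function are illustrative and not part of any Synapse tooling. The longest runtime works out to about 11.6 hours.

```python
# Runtimes (in milliseconds) copied from the slow-query list above.
RUNTIMES_MS = [
    41841553, 41800411, 39011004, 38968715, 12724478,
    12686868, 7897951, 7874808, 7017594, 6946215,
]

MS_PER_HOUR = 60 * 60 * 1000


def ms_to_hours(runtime_ms: int) -> float:
    """Convert a runtime in milliseconds to hours."""
    return runtime_ms / MS_PER_HOUR


if __name__ == "__main__":
    for runtime_ms in RUNTIMES_MS:
        print(f"{runtime_ms:>10} ms = {ms_to_hours(runtime_ms):.2f} h")
```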

John Hill
November 19, 2019, 9:02 PM

If we address PLFM-5966, the query alarms will no longer be triggered by that case and we should be able to resolve this issue.

John Hill
November 20, 2019, 1:43 AM

The number of files updated during the above event:

count

223731

John Hill
January 16, 2020, 11:39 PM
Edited

Prior to fixing PLFM-5954, a query against a view would first check if the view was up-to-date. If the view was found to be out-of-date, a rebuild of the view would be triggered, and the query message would be returned to the query queue. For cases where users are actively changing the files of a view, the process repeats until users stop making changes to the files. In some cases, file changes would occur for twelve hours or more. In such cases the view query message age would reflect the age of the file activity and trigger this alarm.

With PLFM-5954, we changed the behavior of view queries. We no longer check if a view is out-of-date before running the view query. Instead, the view query is allowed to run against the view in its current state. However, the view query does trigger an asynchronous process that will attempt to bring the view up-to-date while keeping the view available for query.

With these changes the age of a view query message should now reflect the time spent executing the query, regardless of any activity on the files in the view. If this alarm triggers after the changes to PLFM-5954, we will need to investigate the source of the issue.
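
The difference between the two behaviors described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Synapse worker code: the `View` and `QueryMessage` classes, both handler functions, and the rebuild counter are hypothetical names invented for the example, and the asynchronous rebuild is modeled as a simple counter increment.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class View:
    """Hypothetical stand-in for a view and its freshness state."""
    up_to_date: bool = False
    rebuilds_triggered: int = 0


@dataclass
class QueryMessage:
    """Hypothetical stand-in for a message on the query queue."""
    sql: str
    attempts: int = 0


def handle_query_old(view: View, msg: QueryMessage,
                     queue: Deque[QueryMessage]) -> Optional[str]:
    """Old behavior: an out-of-date view triggers a rebuild and the
    message goes back on the queue, so its age keeps growing."""
    msg.attempts += 1
    if not view.up_to_date:
        view.rebuilds_triggered += 1
        queue.append(msg)  # message is returned to the query queue
        return None
    return f"results of: {msg.sql}"


def handle_query_new(view: View, msg: QueryMessage,
                     queue: Deque[QueryMessage]) -> Optional[str]:
    """New behavior: run against the view in its current state and
    request an update in the background; the message is never requeued."""
    msg.attempts += 1
    if not view.up_to_date:
        view.rebuilds_triggered += 1  # stands in for the asynchronous update
    return f"results of: {msg.sql}"


if __name__ == "__main__":
    stale_view = View(up_to_date=False)
    queue: Deque[QueryMessage] = deque()
    msg = QueryMessage("SELECT * FROM syn9630847")
    print(handle_query_old(stale_view, msg, queue), len(queue))
    print(handle_query_new(stale_view, msg, deque()))
```

In this sketch, the old handler leaves the message on the queue for as long as the view stays out-of-date, which is why message age tracked file activity; the new handler always completes the message in one pass.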

Bruce Hoff
January 31, 2020, 1:30 AM

> If this alarm triggers after the changes to PLFM-5954, we will need to investigate the source of the issue.
The alarm was seen twice in stack 294 while it was the production stack on Jan. 28:
https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarmsV2:alarm/prod-294-QUERY-queue-oldest-message-exceed-time

I will close this issue but open a new one to investigate the alarms.

Fixed

Assignee

John Hill

Reporter

John Hill

Labels

None

Validator

Bruce Hoff

Development Area

Data Curation / Metadata

Release Version History

None

Components

Sprint

None

Fix versions

Priority

Major