Amazon interview question

This system is failing intermittently; troubleshoot why/how.

Interview answer

Anonymous

Jan 18, 2018

First, I would clarify what ‘this system’ is. In any system, a major failure mode is losing access to, say, a machine or VM, which most likely points to a network issue. On the other hand, the system itself might become unresponsive while the network is fine. Let’s say ‘this system’ is a Linux machine that becomes unresponsive intermittently. A likely cause is resource exhaustion, so I would start by checking per-process utilization; maybe some rogue app is hogging the CPU. Use the ‘top’ command to see CPU and memory utilization, and look for any process using an unusually large share of either. If you find one, you can stop it with the ‘kill’ command. If not, check whether a disk is full. Use the ‘df’ command to check each volume, and if one is full, free space by removing large files from it. Use the ‘find’ command to locate large files; just make sure you’re not deleting an essential file. If none of that turns up the cause, check the system log, ‘/var/log/messages’ (on Debian and Ubuntu the equivalent is typically ‘/var/log/syslog’). The kernel and many services log there through syslog, though some applications keep their own log files elsewhere under ‘/var/log’, so look for messages that point to a problem with the kernel or an application.
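The disk part of that checklist can be sketched in a few lines of Python, using only the standard library. This is a rough illustration, not a replacement for ‘df’ and ‘find’: the function names `disk_usage_percent` and `largest_files`, the 90% threshold, and the ‘/var’ starting directory are all assumptions made for the example, and the percentage can differ slightly from what ‘df’ prints because ‘df’ accounts for reserved blocks.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    # onerror swallows unreadable directories instead of aborting the walk.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.islink(full):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        print("load average (1, 5, 15 min):", os.getloadavg())

    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full")

    # Hypothetical threshold and directory, chosen only for illustration.
    if percent > 90.0:
        for size, path in largest_files("/var", limit=10):
            print(f"{size:>12d}  {path}")
```

The script only reports; it never deletes anything, so deciding whether a large file is essential stays a manual step.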