The first step in any investigation is finding the job's ID. If you started your job with neuro run, the job's ID was printed in the output.
However, if you can't find the initial terminal output, you can use one of these commands to find a specific job:
neuro ps prints only running jobs.
neuro ps -a prints all jobs.
neuro ps -s failed prints all jobs with the Failed status.
neuro-flow ps prints the list of all jobs.
When you run neuro-flow build IMAGE_NAME, neuro-flow uploads the build context to the platform and creates a platform job that uses Kaniko to build a Docker image and push it to the platform registry.
If the build fails, you can check the job's status and logs to get more information.
To check a job's status, run:
$ neuro status <job-ID>
The Status transitions section in the output can help you learn at which step the job failed.
To check the builder logs, run:
$ neuro logs <job-ID>
There are a few main reasons your job may fail. Here are some of the most common:
The image may not be found. This can happen if you have a typo in the image name or if the specified image was not built before running the job. To list all images, run neuro image ls. To list the tags of a particular image, run neuro image tags <IMAGE_URI>.
You might have an invalid volume mounted to the job. For example, you've mounted a volume to the /my-project folder, but your code expects /my_project. You can double-check this in the logs.
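A path mismatch like this is easier to spot when the script checks its expected mount point up front and fails with a clear message. The following is a minimal Python sketch under that assumption; require_dir is a hypothetical helper written for illustration, not part of any Neu.ro tooling, and /my_project is just the example path from above.

```python
import os


def require_dir(path):
    """Return path if it is an existing directory, else raise with a hint."""
    if os.path.isdir(path):
        return path
    parent = os.path.dirname(os.path.abspath(path))
    # List what actually exists next to the expected folder, so a typo
    # such as /my-project vs /my_project is visible in the job logs.
    try:
        siblings = sorted(os.listdir(parent))
    except OSError:
        siblings = []
    raise FileNotFoundError(
        f"Expected a mounted directory at {path!r}, "
        f"but {parent!r} contains only: {siblings}"
    )
```

Calling require_dir("/my_project") at the very start of the script should put the mismatch near the top of the logs instead of surfacing later as an unrelated error.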
If you see a Cluster Scale Up Failed error in the status, it usually means you’ve requested resources that are not available in the cluster at the moment. For example, all GPUs are busy, so your job can’t be scheduled.
You may have an error in your Python script that prevents the job from running properly.
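One way to make such errors easier to find is to wrap the script's entry point so that any exception is printed in full and the process exits with a non-zero code. This is only a sketch: run_main is a hypothetical helper, and it assumes that the job's logs capture the standard error stream.

```python
import sys
import traceback


def run_main(main):
    """Run main() and return a process exit code, logging any exception."""
    try:
        main()
    except Exception:
        # Print the full traceback to stderr and flush it so the error
        # is not lost in a buffer when the job terminates.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        return 1
    return 0


# Typical use at the bottom of a script (train is a placeholder name):
# sys.exit(run_main(train))
```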
There are a few steps to troubleshooting such issues.
The first point of interest is whether you have an open HTTP port for your job. To check this, you can:
The next step is to make sure your web app listens on 0.0.0.0, not on 127.0.0.1 or localhost; otherwise it won't be able to accept incoming requests from outside the container.
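To illustrate the difference, here is a minimal sketch of a web app that binds to all interfaces, using only the Python standard library. The handler, the make_server helper, and port 8080 are assumptions chosen for the example; your framework will have its own way of setting the host.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reply to every GET request with a short plain-text body.
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host="0.0.0.0", port=8080):
    # "0.0.0.0" binds to all interfaces; "127.0.0.1" would only accept
    # connections that originate inside the container itself.
    return HTTPServer((host, port), HelloHandler)


# To serve requests (this call blocks): make_server().serve_forever()
```

Most web frameworks expose an equivalent host setting, so the same change usually amounts to passing 0.0.0.0 instead of the default loopback address.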
Finally, if you can access your job in a browser but curl and similar tools don’t work, you most likely haven't disabled HTTP authentication. The Neu.ro platform puts an HTTP authentication layer in front of your app by default for security reasons.
You can disable this behaviour manually when running jobs, for example with the http_auth: False option.
Just like with Docker, you can get a shell in a running job to check its state. To do this, run:
$ neuro exec JOB_ID /bin/sh
Note: In Docker you would typically add the
-it parameters to the command, but they’re not necessary for