Inspired by my SO, I decided to write this post, which aims to tackle the notorious memory-related problems of Apache Spark when handling big data. The most important messages of the error are:
16/09/01 18:07:54 WARN TaskSetManager: Lost task 113.0 in stage 0.0 (TID 461, gsta32512.foo.com): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Container marked as failed: container_e05_1472185459203_255575_01_000183 on host: gsta32512.foo.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/09/01 18:11:39 WARN TaskSetManager: Lost task 503.0 in stage 0.0 (TID 739, gsta31371.foo.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at ...
16/09/01 18:11:39 WARN TaskSetManager: Lost task 512.0 in stage 0.0 (TID 748, gsta31371.foo.com): java.lang.OutOfMemoryError: Java heap space
    at ...
16/09/01 18:11:39 WARN TaskSetManager: Lost task 241.0 in stage 0.0 (TID 573, gsta31371.foo.com): java.io.IOException: Filesystem closed
    at ...
16/09/01 18:11:41 ERROR YarnScheduler: Lost executor 1 on gsta31371.foo.com: Container marked as failed: container_e05_1472185459203_255575_01_000004 on host: gsta31371.foo.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
In order to tackle memory issues in Spark, you first have to understand what happens under the hood. I won't expand as much as in memoryOverhead issue in Spark, but I would like you to keep this in mind: Cores, Memory and MemoryOverhead are the three things you can tune to give your job a chance to succeed.
memoryOverhead is simply the extra, off-heap headroom granted to your container (the driver or executor(s)): the container is allowed to run until its memory footprint reaches its limit, and once it exceeds it, it is doomed to be assassinated by YARN. Here are two relevant flags:
spark.yarn.executor.memoryOverhead 4096
spark.yarn.driver.memoryOverhead 8192
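To make the accounting concrete, here is a minimal sketch in plain Python (the numbers are the illustrative ones above, not a real API): YARN sizes the container as heap plus overhead, and the `Exit status: 143` in the log above is what you see when the footprint crosses that line.

```python
# Illustrative figures; YARN's real accounting is done in MB.
executor_memory_mb = 4 * 1024   # spark.executor.memory 4G (JVM heap)
memory_overhead_mb = 4096       # spark.yarn.executor.memoryOverhead 4096

# YARN allocates the container as heap + overhead.
container_limit_mb = executor_memory_mb + memory_overhead_mb

def yarn_would_kill(footprint_mb: int) -> bool:
    """YARN kills the container once its physical memory footprint
    exceeds the allocated limit (the 'Exit status: 143' in the logs)."""
    return footprint_mb > container_limit_mb

print(container_limit_mb)      # 8192
print(yarn_would_kill(8000))   # False: still under the limit
print(yarn_would_kill(9000))   # True: container gets killed
```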
Memory is important too. You see, the cluster I am currently using has machines with up to 8 cores and 12G of memory (that is heap memory). I am running Python with Spark (PySpark), so all of my code runs off the heap. For that reason I have to request "not much" heap memory, since whatever I request is cut from the total memory I am allowed to use: if that total is 20G and I request 12G of heap, then only 8G is left for my Python application to use; while if I request 4G, then 16G is left for my Python application, since the memory you are requesting here is heap memory. Set it with these flags:
spark.executor.memory 4G
spark.driver.memory 4G
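The heap-versus-Python split described above is just subtraction; a quick sketch with the hypothetical 20G figure from the paragraph:

```python
# Illustrative numbers from the paragraph above.
total_allowed_gb = 20   # total memory the container may use
heap_request_gb = 12    # spark.executor.memory (JVM heap)

# Whatever the JVM heap does not claim is what the off-heap
# Python workers (PySpark) get to use.
print(total_allowed_gb - heap_request_gb)   # 8

heap_request_gb = 4
print(total_allowed_gb - heap_request_gb)   # 16
```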
The number of Cores is also very important. The number of cores you configure (4 vs 8) determines the number of concurrent tasks you can run: with 4 cores you can run 4 tasks in parallel, and this affects the amount of execution memory each task gets, since the executor memory is shared between all of its tasks. So with 12G of heap running 8 tasks, each task gets about 1.5G; with 12G of heap running 4 tasks, each gets 3G. This is obviously just a rough approximation. Set it with these flags:
spark.executor.cores 4
spark.driver.cores 4
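The rough per-task arithmetic from the paragraph above can be sketched like this (a back-of-the-envelope approximation, not how Spark actually partitions its unified memory):

```python
def memory_per_task_gb(executor_heap_gb: float, executor_cores: int) -> float:
    """Rough approximation: the executor heap is shared evenly
    between the tasks running concurrently (one per core)."""
    return executor_heap_gb / executor_cores

print(memory_per_task_gb(12, 8))   # 1.5
print(memory_per_task_gb(12, 4))   # 3.0
```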
This page was written while I was working on that case:
About the 'java.lang.OutOfMemoryError: GC overhead limit exceeded', I would suggest this tutorial: A Beginner's Guide on Troubleshooting Spark Applications, which points to Tuning Java Garbage Collection for Apache Spark Applications. If the tasks are still GC'ing, you should reconfigure the memory: for instance, leave the number of cores the same (say 4), but increase the memory, `spark.executor.memory=8G`.
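If you want to dig into the GC behaviour itself, one common starting point (these are standard Spark/JVM options, but treat the exact values as a sketch to tune for your own cluster) is to switch the executors to the G1 collector and turn on GC logging, in the same flag style as above:

```
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
```

The GC log lines then show up in the executor logs, so you can see whether the tasks are genuinely thrashing the heap before you start resizing anything.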
Have questions? Comments? Did you find a bug? Let me know!😀
Page created by G. (George) Samaras (DIT)