I have the following docker containers that I have set up to test my web application:
- Jenkins
- Apache 1 (serving a laravel app)
- Apache 2 (serving a legacy codeigniter app)
- MySQL (accessed by both Apache 1 and Apache 2)
- Selenium HUB
- Selenium Node — ChromeDriver
The jenkins job runs a behat
command on Apache 1
which in turn connects to Selenium Hub, which has a ChromeDriver node to actually hit the two apps: Apache 1
and Apache 2
.
The whole system is running on an EC2 m2.small instance (1 core, 2GB RAM) with AWS linux.
The problem
The issue I am having is that if I run the pipeline multiple times, the first few times it runs just fine (the behat stage takes about 20s), but on the third and consecutive runs, the behat stage starts slowing down (taking 1m30s) and then failing after 3m or 10m or whenever I lose patience.
If I restart the docker containers, it works again, but only for another 2-4 runs.
Clues
Monitoring docker stats
each time I run the jenkins pipeline, I noticed that the Block I/O, and specifically the 'I' was growing exponentially after the first few runs.
For example, after run 1
After run 2
After run 3
After run 4
The Block I/O for the chromedriver container is 21GB and the driver hangs. While I might expect the Block I/O to grow, I wouldn't expect it to grow exponentially as it seems to be doing. It's like something is... exploding.
The same docker configuration (using
docker-compose
) runs flawlessly every time on my personal MacBook Pro. Block I/O does not 'explode'. I constrain Docker to only use 1 core and 2GB of RAM.
What I've tried
This situation has sent me down the path of learning a lot more about docker, filesystems and memory management, but I'm still not resolving the issue. Some of the things I have tried:
Memory
I set mem_limit
options on all containers and tuned them so that during any given run, the memory would not reach 100%. Memory usage now seems fairly stable, and never 'blows up'.
Storage Driver
The default for AWS Linux Docker is devicemapper in loop-lvm
mode. After reading this doc
I switched to the suggested direct-lvm
mode.
docker-compose restart
This does indeed 'reset' the issue, allowing me to get a few more runs in, but it doesn't last. After 2-4 runs, things seize up and the tests start failing.
My Question...
- What is happening that causes the block i/o to grow exponentially? I'm not clear if it's docker, jenkins, selenium or chromedriver that are causing the problem. My first guess is chromedriver.
- What is a good approach to tuning a system like this with multiple moving parts?
Additonal Info
My chromedriver container has the following environment set in docker-compose:
- SE_OPTS=-maxSession 6 -browser browserName=chrome,maxInstances=3
docker info
:
$ docker info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
Pool Name: docker-thinpool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 4.862 GB
Data Space Total: 20.4 GB
Data Space Available: 15.53 GB
Metadata Space Used: 2.54 MB
Metadata Space Total: 213.9 MB
Metadata Space Available: 211.4 MB
Thin Pool Minimum Free Space: 2.039 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null host bridge
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.51-40.60.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.03
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.956 GiB
ID: 7XEV:LEON:H5R6:NDTT:C6PQ:I6UV:4ETW:WB2V:CBDD:DBRL:ONZU:SKTG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
via meriial