Bug: server gets into weird state when doing massive parallel version loads
I've set up an instance of Terrareg with ~15 different modules backed by Git repositories, using a MySQL DB backend and EFS for the module data directory.
When loading versions, one at a time per module but in parallel across all modules, the server eventually gets into a state where it cannot do any version loads and throws the following error for each attempt:
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2548, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2528, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1822, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask/views.py", line 107, in view
    return current_app.ensure_sync(self.dispatch_request)(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/app/terrareg/auth_wrapper.py", line 32, in wrapper
    return func(*args, **kwargs)
  File "/app/terrareg/server/error_catching_resource.py", line 41, in post
    return self._post(*args, **kwargs)
  File "/app/terrareg/server/api/module_version_create.py", line 27, in _post
    previous_version_published = module_version.prepare_module()
  File "/app/terrareg/models.py", line 2990, in prepare_module
    self.create_data_directory()
  File "/app/terrareg/models.py", line 2874, in create_data_directory
    if not os.path.isdir(self._module_provider.base_directory):
  File "/app/terrareg/models.py", line 1582, in base_directory
    return safe_join_paths(self._module.base_directory, self._name)
  File "/app/terrareg/models.py", line 891, in base_directory
    return safe_join_paths(self._namespace.base_directory, self._name)
  File "/app/terrareg/models.py", line 756, in base_directory
    return safe_join_paths(terrareg.config.Config().DATA_DIRECTORY, 'modules', self._name)
  File "/app/terrareg/config.py", line 82, in DATA_DIRECTORY
    return os.path.join(os.environ.get('DATA_DIRECTORY', os.getcwd()), 'data')
FileNotFoundError: [Errno 2] No such file or directory
Looking at an strace, it seems like os.getcwd is getting called on a directory which no longer exists. The issue seems to go away with a server re-deploy. I'm still digging into the issue, but wanted to file this.
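A minimal sketch of the suspected failure mode, outside Terrareg entirely: on Linux, a process whose working directory has been removed gets this same FileNotFoundError from os.getcwd(). The temporary directory and the helper name getcwd_after_removal are illustrative only. What actually removes the working directory on the server is not established here.

```python
import os
import tempfile


def getcwd_after_removal():
    """Return the exception os.getcwd() raises once the cwd has been deleted."""
    original = os.getcwd()
    doomed = tempfile.mkdtemp()  # stand-in for whatever directory the server was started in
    try:
        os.chdir(doomed)
        os.rmdir(doomed)  # Linux allows removing a directory that is some process's cwd
        try:
            os.getcwd()
        except FileNotFoundError as exc:
            return exc
        return None
    finally:
        # Restore a valid working directory so the rest of the process is unaffected.
        os.chdir(original)


error = getcwd_after_removal()
print(type(error).__name__, error)
```

Because the working directory is per-process state, this would also be consistent with every request in the affected process failing until it is restarted.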
The loads do work for some time, until this error happens; then they all fail.
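One detail in the last traceback frame may be relevant: in os.environ.get('DATA_DIRECTORY', os.getcwd()), Python evaluates os.getcwd() before calling get(), so the call happens even when DATA_DIRECTORY is set. The sketch below shows a lazy fallback instead. resolve_data_directory, broken_getcwd and the path /srv/terrareg are hypothetical names for illustration, not Terrareg code, and this would only sidestep the error when the variable is set; it does not explain why the working directory disappears.

```python
import os


def resolve_data_directory(environ=None, getcwd=os.getcwd):
    """Hypothetical helper: consult the cwd only when DATA_DIRECTORY is unset."""
    environ = os.environ if environ is None else environ
    base = environ.get('DATA_DIRECTORY')
    if base is None:
        # Fallback path: this is the only place the working directory is read.
        base = getcwd()
    return os.path.join(base, 'data')


def broken_getcwd():
    """Simulates os.getcwd() in a process whose cwd has been removed."""
    raise FileNotFoundError(2, 'No such file or directory')


# With the variable set (hypothetical path), the broken cwd is never consulted.
print(resolve_data_directory({'DATA_DIRECTORY': '/srv/terrareg'}, broken_getcwd))
```

With the variable unset, the helper still raises FileNotFoundError, which matches the behaviour in the traceback.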
Downstream report: https://github.com/MatthewJohn/terrareg/issues/9