{"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112878955", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112878955, "node_id": "IC_kwDOBm6k_c5CVS9r", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:02:40Z", "updated_at": "2022-04-29T05:02:40Z", "author_association": "OWNER", "body": "Here's a very useful (recent) article about how the GIL works and how to think about it: https://pythonspeed.com/articles/python-gil/ - via https://lobste.rs/s/9hj80j/when_python_can_t_thread_deep_dive_into_gil\r\n\r\nFrom that article:\r\n\r\n> For example, let's consider an extension module written in C or Rust that lets you talk to a PostgreSQL database server.\r\n> \r\n> Conceptually, handling a SQL query with this library will go through three steps:\r\n> \r\n> 1. Deserialize from Python to the internal library representation. Since this will be reading Python objects, it needs to hold the GIL.\r\n> 2. Send the query to the database server, and wait for a response. This doesn't need the GIL.\r\n> 3. Convert the response into Python objects. This needs the GIL again.\r\n> \r\n> As you can see, how much parallelism you can get depends on how much time is spent in each step. If the bulk of time is spent in step 2, you'll get parallelism there. But if, for example, you run a `SELECT` and get a large number of rows back, the library will need to create many Python objects, and step 3 will have to hold GIL for a while.\r\n\r\nThat explains what I'm seeing here. 
I'm pretty convinced now that the reason I'm not getting a performance boost from parallel queries is that there's more time spent in Python code assembling the results than in SQLite C code executing the query.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112879463", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112879463, "node_id": "IC_kwDOBm6k_c5CVTFn", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:03:58Z", "updated_at": "2022-04-29T05:03:58Z", "author_association": "OWNER", "body": "It would be _really_ fun to try running this with the in-development `nogil` Python from https://github.com/colesbury/nogil\r\n\r\nThere's a Docker container for it: https://hub.docker.com/r/nogil/python\r\n\r\nIt suggests you can run something like this:\r\n\r\n docker run -it --rm --name my-running-script -v \"$PWD\":/usr/src/myapp \\\r\n -w /usr/src/myapp nogil/python python your-daemon-or-script.py", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112889800", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112889800, "node_id": "IC_kwDOBm6k_c5CVVnI", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:29:38Z", "updated_at": "2022-04-29T05:29:38Z", "author_association": "OWNER", "body": "OK, I just got the most incredible result with 
that!\r\n\r\nI started up a container running `bash` like this, from my `datasette` checkout. I'm mapping port 8005 on my laptop to port 8001 inside the container because laptop port 8001 was already doing something else:\r\n```\r\ndocker run -it --rm --name my-running-script -p 8005:8001 -v \"$PWD\":/usr/src/myapp \\\r\n    -w /usr/src/myapp nogil/python bash\r\n```\r\nThen in `bash` I ran the following commands to install Datasette and its dependencies:\r\n```\r\npip install -e '.[test]'\r\npip install datasette-pretty-traces # For debug tracing\r\n```\r\nThen I started Datasette against my `github.db` database (from github-to-sqlite.dogsheep.net/github.db) like this:\r\n\r\n```\r\ndatasette github.db -h 0.0.0.0 --setting trace_debug 1\r\n```\r\nI hit the following two URLs to compare the parallel vs. non-parallel implementations:\r\n\r\n- `http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10`\r\n- `http://127.0.0.1:8005/github/issues?_facet=milestone&_facet=repo&_trace=1&_size=10&_noparallel=1`\r\n\r\nAnd... the parallel one beat the non-parallel one decisively, on multiple page refreshes!\r\n\r\nNot parallel: 77ms\r\n\r\nParallel: 47ms\r\n\r\nSo yeah, I'm very confident this is a problem with the GIL. And I am absolutely **stunned** that @colesbury's fork ran Datasette (which has some reasonably tricky threading and async stuff going on) out of the box!", "reactions": "{\"total_count\": 2, \"+1\": 2, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null}