{"html_url": "https://github.com/simonw/sqlite-utils/issues/207#issuecomment-743701422", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/207", "id": 743701422, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwMTQyMg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T04:37:14Z", "updated_at": "2020-12-12T04:38:25Z", "author_association": "OWNER", "body": "Prototype:\r\n```python\r\nfrom collections import namedtuple\r\n\r\nColumnDetails = namedtuple(\"ColumnDetails\", (\"column\", \"num_null\", \"num_blank\", \"num_distinct\", \"most_common\", \"least_common\"))\r\n\r\ndef analyze_column(db, table, column, values=10):\r\n    num_null = db.execute(\"select count(*) from [{}] where [{}] is null\".format(table, column)).fetchone()[0]\r\n    num_blank = db.execute(\"select count(*) from [{}] where [{}] = ''\".format(table, column)).fetchone()[0]\r\n    num_distinct = db.execute(\"select count(distinct [{}]) from [{}]\".format(column, table)).fetchone()[0]\r\n    most_common = None\r\n    least_common = None\r\n    if num_distinct != 1:\r\n        most_common = [(r[0], r[1]) for r in db.execute(\r\n            \"select [{}], count(*) from [{}] group by [{}] order by count(*) desc limit {}\".format(column, table, column, values)\r\n        ).fetchall()]\r\n        if num_distinct <= values:\r\n            # No need to run the query if it will just return the results in reverse order\r\n            least_common = most_common[::-1]\r\n        else:\r\n            least_common = [(r[0], r[1]) for r in db.execute(\r\n                \"select [{}], count(*) from [{}] group by [{}] order by count(*) limit {}\".format(column, table, column, values)\r\n            ).fetchall()]\r\n    return ColumnDetails(column, num_null, num_blank, num_distinct, most_common, least_common)\r\n\r\n\r\ndef analyze_table(db, table):\r\n    for column in db[table].columns:\r\n        details = analyze_column(db, table, column.name)\r\n        print(details)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, 
\"eyes\": 0}", "issue": {"value": 763283616, "label": "sqlite-utils analyze-tables command"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/207#issuecomment-743701599", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/207", "id": 743701599, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwMTU5OQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T04:38:52Z", "updated_at": "2020-12-12T04:39:07Z", "author_association": "OWNER", "body": "I'll add a `table.analyze_column(column)` method which is used by the CLI tool - with a note that this is an unstable interface which may change in the future.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763283616, "label": "sqlite-utils analyze-tables command"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/207#issuecomment-743701697", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/207", "id": 743701697, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwMTY5Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T04:39:51Z", "updated_at": "2020-12-12T04:39:51Z", "author_association": "OWNER", "body": "CLI could be:\r\n\r\n sqlite-utils analyze-tables\r\n\r\nTo analyze all tables or:\r\n\r\n sqlite-utils analyze-tables table1 table2\r\n\r\nTo analyze specific tables.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763283616, "label": "sqlite-utils analyze-tables command"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/208#issuecomment-743707969", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/208", "id": 743707969, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwNzk2OQ==", 
"user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T05:42:26Z", "updated_at": "2020-12-12T05:43:06Z", "author_association": "OWNER", "body": "Should truncate values in the least/most common JSON array to a sensible length, otherwise you end up with stuff like this:\r\n```json\r\n[\r\n [\r\n \"b'\\\\x00\\\\x05barry\\\\x03\\\\x01\\\\x02\\\\x00\\\\x00\\\\x03cat\\\\x03\\\\x01\\\\x03\\\\x00\\\\x00\\\\x03dog\\\\x08\\\\x01\\\\x01\\\\x01\\\\x03\\\\x00\\\\x01\\\\x03\\\\x00\\\\x00\\\\x07panther\\\\x05\\\\x01\\\\x01\\\\x02\\\\x02\\\\x00\\\\x01\\\\x03uma\\\\x05\\\\x02\\\\x01\\\\x02\\\\x02\\\\x00\\\\x00\\\\x04sara\\\\x05\\\\x02\\\\x01\\\\x01\\\\x02\\\\x00\\\\x00\\\\x05terry\\\\x08\\\\x01\\\\x01\\\\x01\\\\x02\\\\x00\\\\x01\\\\x02\\\\x00\\\\x00\\\\x06weasel\\\\x05\\\\x02\\\\x01\\\\x01\\\\x03\\\\x00'\",\r\n 1\r\n ]\r\n]\r\n```\r\nThis example also shows that binary values (like those in `_fts` tables) look a bit weird, but I think I'm OK with that since binary data can't be represented neatly in JSON anyway.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763320133, "label": "sqlite-utils analyze-tables command and table.analyze_column() method"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/208#issuecomment-743708080", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/208", "id": 743708080, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwODA4MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T05:43:45Z", "updated_at": "2020-12-12T05:43:45Z", "author_association": "OWNER", "body": "CLI output looks like this at the moment, which is bad:\r\n```\r\n % sqlite-utils analyze-tables ../datasette/fixtures.db facetable\r\n1/10: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, 
least_common=None)\r\n2/10: ColumnDetails(table='facetable', column='created', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[('2019-01-17 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-14 08:00:00', 4), ('2019-01-16 08:00:00', 3)], least_common=[('2019-01-16 08:00:00', 3), ('2019-01-14 08:00:00', 4), ('2019-01-15 08:00:00', 4), ('2019-01-17 08:00:00', 4)])\r\n3/10: ColumnDetails(table='facetable', column='planet_int', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (2, 1)], least_common=[(2, 1), (1, 14)])\r\n4/10: ColumnDetails(table='facetable', column='on_earth', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[(1, 14), (0, 1)], least_common=[(0, 1), (1, 14)])\r\n5/10: ColumnDetails(table='facetable', column='state', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('CA', 10), ('MI', 4), ('MC', 1)], least_common=[('MC', 1), ('MI', 4), ('CA', 10)])\r\n6/10: ColumnDetails(table='facetable', column='city_id', total_rows=15, num_null=0, num_blank=0, num_distinct=4, most_common=[(1, 6), (3, 4), (2, 4), (4, 1)], least_common=[(4, 1), (2, 4), (3, 4), (1, 6)])\r\n7/10: ColumnDetails(table='facetable', column='neighborhood', total_rows=15, num_null=0, num_blank=0, num_distinct=14, most_common=[('Downtown', 2), ('Tenderloin', 1), ('SOMA', 1), ('Mission', 1), ('Mexicantown', 1), ('Los Feliz', 1), ('Koreatown', 1), ('Hollywood', 1), ('Hayes Valley', 1), ('Greektown', 1)], least_common=[('Arcadia Planitia', 1), ('Bernal Heights', 1), ('Corktown', 1), ('Dogpatch', 1), ('Greektown', 1), ('Hayes Valley', 1), ('Hollywood', 1), ('Koreatown', 1), ('Los Feliz', 1), ('Mexicantown', 1)])\r\n8/10: ColumnDetails(table='facetable', column='tags', total_rows=15, num_null=0, num_blank=0, num_distinct=3, most_common=[('[]', 13), ('[\"tag1\", \"tag3\"]', 1), ('[\"tag1\", \"tag2\"]', 1)], least_common=[('[\"tag1\", \"tag2\"]', 1), ('[\"tag1\", \"tag3\"]', 1), ('[]', 13)])\r\n9/10: 
ColumnDetails(table='facetable', column='complex_array', total_rows=15, num_null=0, num_blank=0, num_distinct=2, most_common=[('[]', 14), ('[{\"foo\": \"bar\"}]', 1)], least_common=[('[{\"foo\": \"bar\"}]', 1), ('[]', 14)])\r\n10/10: ColumnDetails(table='facetable', column='distinct_some_null', total_rows=15, num_null=13, num_blank=0, num_distinct=2, most_common=[(None, 13), ('two', 1), ('one', 1)], least_common=[('one', 1), ('two', 1), (None, 13)])\r\n(sqlite-utils) sqlite-utils % \r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763320133, "label": "sqlite-utils analyze-tables command and table.analyze_column() method"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/208#issuecomment-743708169", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/208", "id": 743708169, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwODE2OQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T05:44:46Z", "updated_at": "2020-12-12T05:44:46Z", "author_association": "OWNER", "body": "If there are less than ten values is it worth outputting them twice, once in `most_common` and then in reverse in `least_common`? 
Feels redundant - I think I should leave `least_common` empty in that case.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763320133, "label": "sqlite-utils analyze-tables command and table.analyze_column() method"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/208#issuecomment-743708325", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/208", "id": 743708325, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwODMyNQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T05:46:27Z", "updated_at": "2020-12-12T05:46:27Z", "author_association": "OWNER", "body": "It would be neat if you could optionally specify a subset of columns to analyze, using `-c` or `--column`.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763320133, "label": "sqlite-utils analyze-tables command and table.analyze_column() method"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/208#issuecomment-743708524", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/208", "id": 743708524, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzcwODUyNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T05:48:20Z", "updated_at": "2020-12-12T05:48:32Z", "author_association": "OWNER", "body": "```\r\n% sqlite-utils analyze-tables ../datasette/fixtures.db facetable --column pk\r\n1/1: ColumnDetails(table='facetable', column='pk', total_rows=15, num_null=0, num_blank=0, num_distinct=15, most_common=None, least_common=None)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763320133, "label": "sqlite-utils 
analyze-tables command and table.analyze_column() method"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1142#issuecomment-743732440", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1142", "id": 743732440, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzczMjQ0MA==", "user": {"value": 6622733, "label": "nitinpaultifr"}, "created_at": "2020-12-12T09:56:40Z", "updated_at": "2020-12-12T09:56:40Z", "author_association": "NONE", "body": "'Include all rows' seem like a fairly obvious alternative", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763361458, "label": "\"Stream all rows\" is not at all obvious"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1142#issuecomment-743912875", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1142", "id": 743912875, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzkxMjg3NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T22:16:38Z", "updated_at": "2020-12-12T22:16:38Z", "author_association": "OWNER", "body": "Yeah, maybe with the number of rows to make it completely clear. 
`Include all 2,455 rows` perhaps.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763361458, "label": "\"Stream all rows\" is not at all obvious"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1142#issuecomment-743913004", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1142", "id": 743913004, "node_id": "MDEyOklzc3VlQ29tbWVudDc0MzkxMzAwNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-12-12T22:17:46Z", "updated_at": "2020-12-12T22:17:46Z", "author_association": "OWNER", "body": "You're actually choosing between two options here: the 100 rows you can see on the screen, or the x,000 rows that match the current query.\r\n\r\nMaybe a radio box would be more obvious?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 763361458, "label": "\"Stream all rows\" is not at all obvious"}, "performed_via_github_app": null}