issue_comments: 1008232075

This data as json

html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	issue	performed_via_github_app
https://github.com/simonw/sqlite-utils/issues/369#issuecomment-1008232075	https://api.github.com/repos/simonw/sqlite-utils/issues/369	1008232075	IC_kwDOCGYnMM48GGaL	9599	2022-01-09T05:13:15Z	2022-01-09T05:13:56Z	OWNER	I think the query that will help solve this is: `explain query plan select * from ny_times_us_counties where state = 1 and county = 2` In this case, the query planner needs to decide if it should use the index for the `state` column or the index for the `county` column. That's where the statistics come into play. In particular: \| tbl \| idx \| stat \| \|----------------------\|---------------------------------\|---------------\| \| ny_times_us_counties \| idx_ny_times_us_counties_date \| 2092871 2915 \| \| ny_times_us_counties \| idx_ny_times_us_counties_fips \| 2092871 651 \| \| ny_times_us_counties \| idx_ny_times_us_counties_county \| 2092871 1085 \| \| ny_times_us_counties \| idx_ny_times_us_counties_state \| 2092871 37373 \| Those numbers are explained by this comment in the SQLite C code: https://github.com/sqlite/sqlite/blob/5622c7f97106314719740098cf0854e7eaa81802/src/analyze.c#L41-L55 ``` There is normally one row per index, with the index identified by the name in the idx column. The tbl column is the name of the table to which the index belongs. In each such row, the stat column will be a string consisting of a list of integers. The first integer in this list is the number of rows in the index. (This is the same as the number of rows in the table, except for partial indices.) The second integer is the average number of rows in the index that have the same value in the first column of the index. ``` So that table is telling us that using a value in the `county` column will filter down to an average of 1,085 rows, whereas filtering on the `state` column will filter down to an average of 37,373 - so clearly the `county` index is the better index to use here! Just one catch: against both my` covid.db` and my `covid-analyzed.db` databases the `county` index is picked for both of them - so SQLite is somehow guessing that `county` is a better index even though it doesn't have statistics for that.	{"total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0}	1097091527