{"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790312268, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T05:48:16Z", "updated_at": "2021-03-04T05:48:16Z", "author_association": "MEMBER", "body": "Wow, my mbox is a 10.35 GB download!", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790369076, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T06:54:46Z", "updated_at": "2021-03-04T06:54:46Z", "author_association": "MEMBER", "body": "The Rich-powered progress bar is pretty:\r\n\r\n![rich](https://user-images.githubusercontent.com/9599/109923307-71f69200-7c73-11eb-9ee2-8f0a240f3994.gif)\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790370485, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T06:57:25Z", "updated_at": "2021-03-04T06:57:48Z", "author_association": "MEMBER", "body": "The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:\r\n\r\nhttps://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67\r\n\r\nI'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790372621, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:01:18Z", "updated_at": "2021-03-04T07:01:18Z", "author_association": "MEMBER", "body": "I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in `healthkit-to-sqlite` - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file.\r\n\r\nhttps://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the `progress_callback()` bit) is where that happens.\r\n\r\nIt can be a bit of a convoluted pattern, and I'm not at all sure it would work for `mbox` files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790373024, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:01:58Z", "updated_at": "2021-03-04T07:04:06Z", "author_association": "MEMBER", "body": "I got 9 warnings that look like this:\r\n```\r\nErrors: 1\r\nTraceback (most recent call last):\r\n File \"/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py\", line 103, in get_mbox\r\n message[\"date\"] = get_message_date(email.get(\"Date\"), email.get_from())\r\n File \"/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py\", line 167, in get_message_date\r\n datetime_tuple = email.utils.parsedate_tz(mail_date)\r\n File \"/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py\", line 50, in parsedate_tz\r\n res = _parsedate_tz(data)\r\n File \"/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py\", line 69, in _parsedate_tz\r\n data = data.split()\r\nAttributeError: 'Header' object has no attribute 'split'\r\n```\r\nIt would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the `mbox` and see what was going on.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790378658, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:12:48Z", "updated_at": "2021-03-04T07:12:48Z", "author_association": "MEMBER", "body": "It looks like the `body` is being loaded into a BLOB column - so in Datasette default it looks like this:\r\n\r\n\"mbox__mbox_emails__753_446_rows\"\r\n\r\nIf I `datasette install datasette-render-binary` and then try again I get this:\r\n\r\n\"mbox__mbox_emails__753_446_rows\"\r\n\r\nIt would be great if we could store the `body` as unicode text instead. May have to do something clever to decode it based on some kind of charset header?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790379629, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:14:41Z", "updated_at": "2021-03-04T07:14:41Z", "author_association": "MEMBER", "body": "Confirmed: removing the `len()` call does not speed things up, so it's reading through the entire file for some other purpose too.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790380839, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:17:05Z", "updated_at": "2021-03-04T07:17:05Z", "author_association": "MEMBER", "body": "Looks like you're doing this:\r\n```python\r\n elif message.get_content_type() == \"text/plain\":\r\n body = message.get_payload(decode=True)\r\n```\r\nSo presumably that decodes to a unicode string?\r\n\r\nI imagine the reason the column is a `BLOB` for me is that `sqlite-utils` determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/issues/6#issuecomment-790384087", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/6", "id": 790384087, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDM4NDA4Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T07:22:51Z", "updated_at": "2021-03-04T07:22:51Z", "author_association": "MEMBER", "body": "#3 also mentions the conflicting version with other tools.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 821841046, "label": "Upgrade to latest sqlite-utils"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790668263, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T14:43:58Z", "updated_at": "2021-03-04T14:43:58Z", "author_association": "MEMBER", "body": "I added this code to output a message ID on errors:\r\n```diff\r\n print(\"Errors: {}\".format(num_errors))\r\n print(traceback.format_exc())\r\n+ print(\"Message-Id: {}\".format(email.get(\"Message-Id\", \"None\")))\r\n continue\r\n```\r\nHaving found a message ID that had an error, I ran this command to see the context:\r\n\r\n rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox\r\n\r\nThis was for the following error:\r\n```\r\n File \"/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py\", line 102, in get_mbox\r\n message[\"date\"] = get_message_date(email.get(\"Date\"), email.get_from())\r\n File \"/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py\", line 178, in get_message_date\r\n datetime_tuple = email.utils.parsedate_tz(mail_date)\r\n File \"/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py\", line 50, in parsedate_tz\r\n res = _parsedate_tz(data)\r\n File \"/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py\", line 69, in _parsedate_tz\r\n data = data.split()\r\nAttributeError: 'Header' object has no attribute 'split'\r\n```\r\nHere's what I spotted in the `ripgrep` output:\r\n```\r\n177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI>\r\n177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop\ufffdische Sommerzeit)\r\n177133572-X-Mailer: IncrediMail (5002253)\r\n```\r\nSo it could it be that `_parsedate_tz` is having trouble with that `Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop\ufffdische Sommerzeit)` string.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790669767, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T14:46:06Z", "updated_at": "2021-03-04T14:46:06Z", "author_association": "MEMBER", "body": "Solution could be to pre-process that string by splitting on `(` and dropping everything afterwards, assuming that the `(...)` bit isn't necessary for correctly parsing the date.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790693674, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T15:18:36Z", "updated_at": "2021-03-04T15:18:36Z", "author_association": "MEMBER", "body": "I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 790695126, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-03-04T15:20:42Z", "updated_at": "2021-03-04T15:20:42Z", "author_association": "MEMBER", "body": "I'm not sure why but my most recent import, when displayed in Datasette, looks like this:\r\n\r\n\"mbox__mbox_emails__753_446_rows\"\r\n\r\nSorting by `id` in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null}