html_url,issue_url,id,node_id,user,user_label,created_at,updated_at,author_association,body,reactions,issue,issue_label,performed_via_github_app https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790312268,MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA==,9599,simonw,2021-03-04T05:48:16Z,2021-03-04T05:48:16Z,MEMBER,"Wow, my mbox is a 10.35 GB download!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790369076,MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng==,9599,simonw,2021-03-04T06:54:46Z,2021-03-04T06:54:46Z,MEMBER,"The Rich-powered progress bar is pretty: ![rich](https://user-images.githubusercontent.com/9599/109923307-71f69200-7c73-11eb-9ee2-8f0a240f3994.gif) ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790370485,MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ==,9599,simonw,2021-03-04T06:57:25Z,2021-03-04T06:57:48Z,MEMBER,"The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count: https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67 I'm fine with waiting though. 
It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790372621,MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ==,9599,simonw,2021-03-04T07:01:18Z,2021-03-04T07:01:18Z,MEMBER,"I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in `healthkit-to-sqlite` - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file. https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the `progress_callback()` bit) is where that happens. It can be a bit of a convoluted pattern, and I'm not at all sure it would work for `mbox` files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. 
So I imagine this would not work here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790373024,MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA==,9599,simonw,2021-03-04T07:01:58Z,2021-03-04T07:04:06Z,MEMBER,"I got 9 warnings that look like this: ``` Errors: 1 Traceback (most recent call last): File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 103, in get_mbox message[""date""] = get_message_date(email.get(""Date""), email.get_from()) File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 167, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 50, in parsedate_tz res = _parsedate_tz(data) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the `mbox` and see what was going on. 
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790378658,MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA==,9599,simonw,2021-03-04T07:12:48Z,2021-03-04T07:12:48Z,MEMBER,"It looks like the `body` is being loaded into a BLOB column - so in Datasette default it looks like this: If I `datasette install datasette-render-binary` and then try again I get this: It would be great if we could store the `body` as unicode text instead. May have to do something clever to decode it based on some kind of charset header?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790379629,MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ==,9599,simonw,2021-03-04T07:14:41Z,2021-03-04T07:14:41Z,MEMBER,"Confirmed: removing the `len()` call does not speed things up, so it's reading through the entire file for some other purpose too.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790380839,MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ==,9599,simonw,2021-03-04T07:17:05Z,2021-03-04T07:17:05Z,MEMBER,"Looks like you're doing this: ```python elif message.get_content_type() == ""text/plain"": body = message.get_payload(decode=True) ``` So presumably that decodes to a unicode string? 
I imagine the reason the column is a `BLOB` for me is that `sqlite-utils` determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/issues/6#issuecomment-790384087,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/6,790384087,MDEyOklzc3VlQ29tbWVudDc5MDM4NDA4Nw==,9599,simonw,2021-03-04T07:22:51Z,2021-03-04T07:22:51Z,MEMBER,#3 also mentions the conflicting version with other tools.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",821841046,Upgrade to latest sqlite-utils, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790668263,MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw==,9599,simonw,2021-03-04T14:43:58Z,2021-03-04T14:43:58Z,MEMBER,"I added this code to output a message ID on errors: ```diff print(""Errors: {}"".format(num_errors)) print(traceback.format_exc()) + print(""Message-Id: {}"".format(email.get(""Message-Id"", ""None""))) continue ``` Having found a message ID that had an error, I ran this command to see the context: rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox This was for the following error: ``` File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 102, in get_mbox message[""date""] = get_message_date(email.get(""Date""), email.get_from()) File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", 
line 178, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 50, in parsedate_tz res = _parsedate_tz(data) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` Here's what I spotted in the `ripgrep` output: ``` 177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI> 177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) 177133572-X-Mailer: IncrediMail (5002253) ``` So it could be that `_parsedate_tz` is having trouble with that `Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)` string.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790669767,MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw==,9599,simonw,2021-03-04T14:46:06Z,2021-03-04T14:46:06Z,MEMBER,"Solution could be to pre-process that string by splitting on `(` and dropping everything afterwards, assuming that the `(...)` bit isn't necessary for correctly parsing the date.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790693674,MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA==,9599,simonw,2021-03-04T15:18:36Z,2021-03-04T15:18:36Z,MEMBER,"I imported my 10GB mbox with 750,000 emails in it, ran 
this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790695126,MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg==,9599,simonw,2021-03-04T15:20:42Z,2021-03-04T15:20:42Z,MEMBER,"I'm not sure why but my most recent import, when displayed in Datasette, looks like this: Sorting by `id` in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,WIP: Add Gmail takeout mbox import,