12 Mar, ’21

Hangouts JSON to Text

by Cal

I exported my Hangouts log from Google today, only to discover that it arrives as a single JSON file containing one giant JSON object. Which, unless you never really used Hangouts, means it's going to be A Very Big File.

Obviously my first attempt, just loading the whole thing into a Python object and pulling out what I needed, didn't work; I ran out of memory pretty fast. So I pulled open the JSON, screamed incoherently at their data structure, and put together the script below.
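For context, the version that ran out of memory was essentially just this (a sketch of the naive approach, not the exact code I ran):

import json

# parses the entire export into memory in one go, which is exactly
# what blows up on a big Hangouts history
with open("Hangouts.json", encoding='utf-8') as file:
  data = json.load(file)
conversations = data['conversations']

The script below streams the file with ijson instead, so only one conversation or event has to be in memory at a time.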

import os
from datetime import datetime
from operator import itemgetter

import ijson  # streaming JSON parser, pip install ijson

convos = {}    # conversation id -> list of parsed messages
people = {}    # gaia id -> display name
chatters = {}  # conversation id -> list of participant names

print("parsing conversation and participant info")

with open("Hangouts.json",'rb') as file:
  data = ijson.kvitems(file, 'conversations.item.conversation')
  conversations = (v for k, v in data if k == 'conversation')
  for conversation in conversations:
    id = conversation['id']['id']
    if id not in convos.keys():
      convos[id] = []
    if id not in chatters.keys():
      chatters[id] = []
    for participant in conversation['participant_data']:
      p_id = participant['id']['gaia_id']
      if p_id not in people.keys():
        try:
          people[p_id] = participant['fallback_name']
        except:
          people[p_id] = 'N/A'
      if people[p_id] not in chatters[id]:
        chatters[id].append(people[p_id])

print("conversation info collected, continuing to events")

with open("Hangouts.json",'rb') as file:
  events = ijson.items(file, 'conversations.item.events.item')
  for event in events:
    log_item = {}
    log_item['sender'] = people[event['sender_id']['gaia_id']]
    log_item['timestamp'] = float(event['timestamp'])/1000000
    log_line = []
    if 'chat_message' in event.keys():
      if 'segment' in event['chat_message']['message_content'].keys():
        for span in event['chat_message']['message_content']['segment']:
          if span['type'] == 'TEXT' or span['type'] == 'LINK':
            text = span['text']
            if 'formatting' in span.keys():
              try:
                if span['formatting']['bold']:
                  text = f"*{text}*"
                if span['formatting']['italics']:
                  text = f"_{text}_"
                if span['formatting']['strikethrough']:
                  text = f"~~{text}~~"
                if span['formatting']['underline']:
                  text = f"__{text}__"
              except:
                pass
            log_line.append(text)
          elif span['type'] == 'LINE_BREAK':
            log_line.append("\n")
          else:
            break
        log_item['text'] = ' '.join(log_line)
        convos[event['conversation_id']['id']].append(log_item)

print("file parsed successfully, writing to files")

os.makedirs('logs', exist_ok=True)  # open() won't create the output directory itself

for id, convo in convos.items():
  sorted_convo = sorted(convo, key=itemgetter('timestamp'))
  fname = 'logs/' + id + '.log'
  with open(fname, 'w', encoding='utf-8') as file:
    # first line of each log is the list of people in the conversation
    file.write(', '.join(chatters[id]))
    file.write("\n")
    for line in sorted_convo:
      time = datetime.fromtimestamp(line['timestamp'])
      timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
      file.write(f"[{timestamp}] {line['sender']}: {line['text']}\n")

print("writing to files complete")

There's a lot of QOL stuff that could still be done if I wanted to make this a nice tool for The Public, like having configuration options for the timestamp format and the file name, but ehhh. I have usable logs now, so I'm not motivated to keep working on it. Feel free to take it if you want.
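If anyone does want to bolt those options on, a minimal sketch with argparse might look something like this (the flag names and defaults are my own guesses, not something the script above actually supports):

import argparse

parser = argparse.ArgumentParser(description="Convert a Hangouts JSON export to plain-text logs")
parser.add_argument('--input', default='Hangouts.json', help="path to the Takeout export")
parser.add_argument('--out-dir', default='logs', help="directory to write the .log files into")
parser.add_argument('--time-format', default='%Y-%m-%d %H:%M:%S', help="strftime format for timestamps")
args = parser.parse_args()

# the script would then use args.input instead of "Hangouts.json", args.out_dir
# instead of 'logs', and args.time_format in strftime()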