Parse Reddit Corpus in zip format

A few people are seeding archives of the entirety of Reddit, but the [Reddit Corpus project](https://convokit.cornell.edu/documentation/subreddit.html) provides archives of individual subreddits, which makes it possible to train on a particular domain. Here is a small example: [dadjokes2.corpus.zip](https://zissou.infosci.cornell.edu/convokit/datasets/subreddit-corpus/corpus-zipped/daddly~-~dankmemes/dadjokes2.corpus.zip)

The only problem is that these archives are not in the format that your reddit_parse.py expects. Each one is a .zip bundle of five files (four JSON and one JSON Lines):

- users.json              
- conversations.json      
- corpus.json             
- index.json              
- utterances.jsonl 

What is the shortest path for converting this to usable training data?

Wes
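
One possible starting point, offered as a rough sketch rather than a tested converter: read `utterances.jsonl` straight out of the archive with the standard library and pair each reply with its parent. The function names `iter_utterances` and `reply_pairs` are hypothetical, and the key names `id`, `reply_to`, and `text` are assumptions about the corpus layout that should be checked against a real archive (they are exposed as parameters for that reason). Nothing here is known about what reddit_parse.py expects, so the output would still need to be mapped to that format.

```python
import io
import json
import zipfile


def iter_utterances(zip_path, member_suffix="utterances.jsonl"):
    """Yield one dict per non-empty line of utterances.jsonl in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Match by suffix because the files may sit inside a top-level folder.
        name = next(n for n in archive.namelist() if n.endswith(member_suffix))
        with archive.open(name) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


def reply_pairs(utterances, id_key="id", parent_key="reply_to", text_key="text"):
    """Yield (parent_text, reply_text) tuples.

    The key names are assumptions; pass different ones if the archive
    uses other field names. All utterances are held in memory.
    """
    rows = list(utterances)
    text_by_id = {u[id_key]: u.get(text_key, "") for u in rows}
    for u in rows:
        parent_text = text_by_id.get(u.get(parent_key))
        # Skip top-level posts, orphaned replies, and empty texts.
        if parent_text and u.get(text_key):
            yield parent_text, u[text_key]
```

Usage would look like `pairs = list(reply_pairs(iter_utterances("dadjokes2.corpus.zip")))`. For a large subreddit the in-memory index may be too big, in which case the same pairing could be done per conversation instead.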

Parse Reddit Corpus in zip format #68