Parsing Reddit API JSON - Documentation

The documentation on the Reddit API JSON is very confusing to non-developers.

I wrote this guide to help you make sense of Reddit’s API JSON response.

We will also parse the response to extract interesting data, such as a post’s title, URL, score and number of comments.


    The guide is in Python, so if you don’t know how to use Python, you can read my complete guide on Python for SEO, or just follow the steps with your favourite tool.

    View Reddit API’s JSON Response

    The Reddit API’s JSON response is the structured format in which Reddit returns the data behind its product, such as posts, comments and subreddits.

    The simplest way to view the JSON response from the Reddit API is to open the API URL in your browser.

    https://www.reddit.com/r/python/top.json?limit=100&t=month

    An API, or application programming interface, gives you access to a service’s data in a structured format, in this case JSON.
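
The example URL above is made of a few interchangeable parts: the subreddit, the listing, and the `limit` and `t` query parameters. As a minimal sketch, the URL can be assembled from those parts with Python's standard library. The helper name `build_listing_url` is hypothetical and only used for illustration.

```python
from urllib.parse import urlencode

def build_listing_url(subreddit, listing="top", limit=100, timeframe="month"):
    """Assemble a Reddit listing URL from its parts (hypothetical helper)."""
    # urlencode turns the dictionary into "limit=100&t=month"
    query = urlencode({"limit": limit, "t": timeframe})
    return f"https://www.reddit.com/r/{subreddit}/{listing}.json?{query}"

print(build_listing_url("python"))
# https://www.reddit.com/r/python/top.json?limit=100&t=month
```

Changing any argument gives you the URL of another subreddit, listing or timeframe that you can open in your browser in the same way.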

    Parsing Reddit API’s JSON file

    There are 2 main ways to parse the JSON file returned from the Reddit API: using a JSON parser, or using your favourite programming language.

    The simplest way to parse any JSON file is to use an online JSON parser. Here, you can copy and paste the response from the Reddit API endpoint into a free online JSON parser and parse the JSON object into its individual components.

    Later, we will learn how to parse the JSON response with Python.
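
As a small preview, Python's built-in `json` module can do roughly what an online JSON parser does: turn the raw text into nested objects and print them in a readable layout. The `raw_text` below is a heavily trimmed, made-up stand-in for the real response, not actual Reddit data.

```python
import json

# Made-up, trimmed stand-in for the text returned by the endpoint
raw_text = (
    '{"kind": "Listing", "data": {"dist": 1, "children": '
    '[{"kind": "t3", "data": {"title": "Example post"}}]}}'
)

parsed = json.loads(raw_text)        # JSON text -> Python dictionary
print(json.dumps(parsed, indent=2))  # pretty-print with indentation

top_level_keys = list(parsed.keys())
print(top_level_keys)
# ['kind', 'data']
```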

    Where is the Subreddit’s Data in the JSON file?

    All the Subreddit’s data from the Reddit API JSON response is nested inside the ‘children’ list of the top-level ‘data’ object.

    Now that we have a general sense of the structure of the Reddit API JSON file, we will learn how to process the API using Python.

    How to use Python Requests to get the API’s JSON

    With Python, we can use the requests library to perform an HTTP request to the API endpoint and extract the JSON data from the Reddit API. Here, we will fetch the top 100 posts of the month in the r/python subreddit.

    The Reddit API returns at most 100 items per request. If you want to get more rows of data, you will have to make multiple requests.

    If you don’t know how to do that, just read my post on using Reddit API without credentials.

    Fetching Reddit API with Python

    import requests

    subreddit = 'python'
    limit = 100
    timeframe = 'month'  # hour, day, week, month, year, all
    listing = 'top'  # controversial, best, hot, new, random, rising, top

    def get_reddit(subreddit, listing, limit, timeframe):
        """Fetch a subreddit listing and return the parsed JSON, or None on failure."""
        url = f'https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}&t={timeframe}'
        try:
            response = requests.get(url, headers={'User-agent': 'yourbot'}, timeout=10)
            response.raise_for_status()  # raise an error on 4xx/5xx responses
        except requests.RequestException as error:
            print(f'An error occurred: {error}')
            return None
        return response.json()

    r = get_reddit(subreddit, listing, limit, timeframe)
    

    Overview of the JSON

    Looking at the response r, you get an object with the following structure.

    {
        "kind": "string",
        "data": {
            "modhash": "string",
            "dist": int,
            "children": [{
                "kind": "string",
                "data": {
                    "approved_at_utc": "string",
                    "subreddit": "string",
                    "selftext": "string",
                    ...,
                    "is_video": "boolean"
                }
            }],
            "after": "",
            "before": ""
        }
    }
    

    As we will see in detail later, all the post data is under r['data']['children'][i]['data'], where i is the position of the post that you want to select (from 0 to 99 in our case).

    You can dig inside the response, object by object, by looking at the keys of each dictionary.

    print(r.keys())
    # dict_keys(['kind', 'data'])
    

    How to Access a Subreddit’s API Data With Python

    To access the data returned from the subreddit’s API call, we simply need to select the ‘children’ element that is nested inside the ‘data’ object of Reddit API’s response object.

    r['data']['children']
    

    As you can see, we have 100 posts in the JSON object.

    len(r['data']['children'])
    # 100
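
Since ‘children’ is a list, you can loop over it to reach every post instead of selecting them one at a time. Here is a minimal sketch of that idea. The `sample` dictionary is made up and only mimics the shape of the real response `r`, and `iter_posts` is a hypothetical helper name.

```python
# Made-up sample with the same shape as the real response `r`
sample = {
    "kind": "Listing",
    "data": {
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 7}},
        ],
        "after": None,
        "before": None,
    },
}

def iter_posts(response):
    """Yield the inner 'data' dictionary of every post in a listing."""
    for child in response["data"]["children"]:
        yield child["data"]

titles = [post["title"] for post in iter_posts(sample)]
print(titles)
# ['First post', 'Second post']
```

With the real response, `iter_posts(r)` would walk through all 100 posts in the same way.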
    

    What can you Extract From a Post on Reddit?

    Now, we need to select the first post of the 100 posts using r['data']['children'][0].

    print(r['data']['children'][0].keys())
    # dict_keys(['kind', 'data'])
    

    List of the Reddit’s JSON Keys

    To access a JSON object within a JSON array in Python, you need to select the ‘data’ of an item in the ‘children’ list: r['data']['children'][0]['data'].

    Next, we will look at the keys of that dictionary to see what we can extract.

    for k in r['data']['children'][0]['data'].keys():
        print(k)
    
    approved_at_utc
    subreddit
    selftext
    author_fullname
    saved
    mod_reason_title
    gilded
    clicked
    title
    link_flair_richtext
    subreddit_name_prefixed
    hidden
    pwls
    link_flair_css_class
    downs
    thumbnail_height
    top_awarded_type
    hide_score
    name
    quarantine
    link_flair_text_color
    upvote_ratio
    author_flair_background_color
    subreddit_type
    ups
    total_awards_received
    media_embed
    thumbnail_width
    author_flair_template_id
    is_original_content
    user_reports
    secure_media
    is_reddit_media_domain
    is_meta
    category
    secure_media_embed
    link_flair_text
    can_mod_post
    score
    approved_by
    author_premium
    thumbnail
    edited
    author_flair_css_class
    author_flair_richtext
    gildings
    content_categories
    is_self
    mod_note
    created
    link_flair_type
    wls
    removed_by_category
    banned_by
    author_flair_type
    domain
    allow_live_comments
    selftext_html
    likes
    suggested_sort
    banned_at_utc
    view_count
    archived
    no_follow
    is_crosspostable
    pinned
    over_18
    all_awardings
    awarders
    media_only
    link_flair_template_id
    can_gild
    spoiler
    locked
    author_flair_text
    treatment_tags
    visited
    removed_by
    num_reports
    distinguished
    subreddit_id
    mod_reason_by
    removal_reason
    link_flair_background_color
    id
    is_robot_indexable
    report_reasons
    author
    discussion_type
    num_comments
    send_replies
    whitelist_status
    contest_mode
    mod_reports
    author_patreon_flair
    author_flair_text_color
    permalink
    parent_whitelist_status
    stickied
    url
    subreddit_subscribers
    created_utc
    num_crossposts
    media
    is_video
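
Some of these keys need a small conversion before they are readable. For example, `created_utc` holds a Unix timestamp (seconds since 1970-01-01 UTC). The sketch below turns such a value into a date. The timestamp used here is a made-up example, not a value taken from the response, and `to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

example_timestamp = 1609459200.0  # made-up value for illustration
print(to_datetime(example_timestamp).isoformat())
# 2021-01-01T00:00:00+00:00
```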
    

    Extract Interesting Data from Reddit

    Lastly, all you have to do is select the keys that you want from the list.

    to_extract = ['title','url','score','num_comments','view_count','ups','downs','selftext']
    
    for e in to_extract:
        print(f"{e}: {r['data']['children'][0]['data'][e]}")
    
    title: Spent 9hrs finding a bug yesterday, took 15mins to figure it out today
    url: https://www.reddit.com/r/Python/comments/koat5n/spent_9hrs_finding_a_bug_yesterday_took_15mins_to/
    score: 2204
    num_comments: 180
    view_count: None
    ups: 2204
    downs: 0
    selftext: I spent the whole day finding a bug yesterday, couldn't find it at the end of the day and got a headache due to stress. Woke up today and found the bug 15 mins after.
    
    Worrying about the delay in the project fogged my mind and I couldn't think logically, blind to different possibilities.
    
    Taking a break and having a clear mind is very important. This has happened to me a couple of times so decided to post this here today to remember not to repeat this ever lol.
    
    
    Edit: Thanks for the award kind stranger. I thought this was more of a personal problem, reading all the comments I'm happy to know I'm not alone. I feel more normal now.
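
The loop above prints the fields of the first post only. One possible way to do the same for every post is to build one row per post and write the rows out as CSV, sketched below. The `sample` data is made up, `extract_rows` is a hypothetical helper, and `.get()` is used so that a missing key gives None instead of raising an error.

```python
import csv
import io

fields = ["title", "url", "score", "num_comments"]

# Made-up sample with the same shape as the real response `r`
sample = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "url": "https://example.com/1",
                                    "score": 10, "num_comments": 3}},
            {"kind": "t3", "data": {"title": "Second post", "url": "https://example.com/2",
                                    "score": 7}},  # no 'num_comments' on purpose
        ]
    }
}

def extract_rows(response, fields):
    """Return one dictionary per post, keeping only the requested fields."""
    rows = []
    for child in response["data"]["children"]:
        post = child["data"]
        rows.append({field: post.get(field) for field in fields})
    return rows

rows = extract_rows(sample, fields)

# Write the rows to an in-memory CSV; swap the buffer for a file to save it
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```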
    

    Understand the Other Objects of Reddit’s JSON

    We have now covered the most important aspect of the Reddit JSON.

    Let’s look at the other objects.

    • Response’s Kinds
    • Other objects inside the data

    Reddit Response’s Kind

    Kind returns a string that identifies the type of the object. It only labels the object; the actual content lives under the ‘data’ key.

    print(r['kind'])
    # Listing
    

    In this case, ‘Listing’ represents a list of things. ‘Listing’ is used to paginate results when they are too long to display all at once.

    If we look at the ‘kind’ of an actual post, you will see a different string identifier.

    r['data']['children'][0]['kind']
    # 't3'
    

    Here are the meanings of the type identifiers.

    • t1: Comment
    • t2: Account
    • t3: Link
    • t4: Message
    • t5: Subreddit
    • t6: Award
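
These prefixes also appear at the start of Reddit's "fullnames", such as `t3_kkwabd`, where the part after the underscore is the item's id. A small sketch of how the list above could be used to decode a fullname follows. `KIND_NAMES` and `split_fullname` are hypothetical names chosen for this example.

```python
# The type prefixes from the list above
KIND_NAMES = {
    "t1": "Comment",
    "t2": "Account",
    "t3": "Link",
    "t4": "Message",
    "t5": "Subreddit",
    "t6": "Award",
}

def split_fullname(fullname):
    """Split a fullname like 't3_kkwabd' into (type name, id)."""
    prefix, _, item_id = fullname.partition("_")
    return KIND_NAMES.get(prefix, "Unknown"), item_id

print(split_fullname("t3_kkwabd"))
# ('Link', 'kkwabd')
```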

    Other Objects Inside the Data

    Inside the data, we had 5 elements: modhash, dist, children, after, before.

    r['data'].keys()
    # dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
    

    We have already covered the ‘children’ element, so let’s look at the others.

    print(f"Modhash: {r['data']['modhash']}")
    print(f"dist: {r['data']['dist']}")
    print(f"after: {r['data']['after']}")
    print(f"before: {r['data']['before']}")
    # Modhash: 
    # dist: 100
    # after: t3_kkwabd
    # before: None
    
    • Modhash: A token used to prevent CSRF. Since we did not log in to make that request, the modhash is empty.
    • Dist: The number of items returned in this response.
    • After: The fullname of the item to use as the anchor for the next page. None if there is no page after (i.e. you extracted the last result).
    • Before: The fullname of the item to use as the anchor for the previous page. None here, since we requested the first page.
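
The ‘after’ value is what lets you go past the 100-item limit: you request a page, read its ‘after’, and ask for the next page until ‘after’ is None. The sketch below shows that loop under a few assumptions. `collect_posts` is a hypothetical helper, and `fake_fetch_page` returns made-up pages so the example runs without a network call. In a real script, the fetching function would instead request the listing URL with the `after` value added as a query parameter.

```python
def collect_posts(fetch_page, max_pages=5):
    """Follow the 'after' cursor from page to page and collect every post."""
    posts = []
    after = None  # None means "start from the first page"
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(child["data"] for child in page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no page after this one
            break
    return posts

# Made-up pages keyed by the 'after' value that leads to them
FAKE_PAGES = {
    None: {"data": {"children": [{"kind": "t3", "data": {"name": "t3_aaa"}},
                                 {"kind": "t3", "data": {"name": "t3_bbb"}}],
                    "after": "t3_bbb", "before": None}},
    "t3_bbb": {"data": {"children": [{"kind": "t3", "data": {"name": "t3_ccc"}}],
                        "after": None, "before": None}},
}

def fake_fetch_page(after):
    """Stand-in for a real HTTP request; returns a canned page."""
    return FAKE_PAGES[after]

all_posts = collect_posts(fake_fetch_page)
print([post["name"] for post in all_posts])
# ['t3_aaa', 't3_bbb', 't3_ccc']
```

The `max_pages` cap is a design choice that keeps the loop from running indefinitely if something goes wrong.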

    Conclusion

    There it is. Hopefully, Reddit’s API JSON response makes a little more sense to you now. If you want to dive deeper, you can always go to the dev documentation or to this outdated repository on the subject.
