
Forum findings: How language learners discuss progress on Duolingo


For this project, I coded a 440-line program in Python analyzing how Duolingo users discuss learning a language in the app’s Discussions Forum. Working with a corpus of 319k words, my goal was to identify natural language patterns among distinct sets of users, exploring insights that could help learners stay motivated and continue their practice.

Duolingo, a language learning app, offers 81 courses to more than 300 million users.


Duolingo is the world’s most popular way to learn languages. Offered on mobile and desktop, its mission is to make language education free and accessible to all.

The company’s 81 language courses are known for delivering a playful, gamified learning experience. One of its most prominent gaming features is the “streak,” or the number of days in a row that users meet their daily goal. Features like the streak have been touted for retaining users and motivating them to continue learning languages every day.

In 2017, the company reported that it was seeking more ways to increase its daily active users (currently 25M of its 300M users) by focusing on those who have not used the app for more than 30 days. There are many ways the company could try to ‘resurrect’ inactive users. With so many possibilities, this raises the question: What can we learn from existing Duolingo users about what keeps them coming back?

Research Questions

One rich source of user discourse is the Duolingo Discussions forum. By analyzing language patterns in the forum, I aim to discover:

  • What are most Duolingo users talking about in the forum?

  • How do users discuss their language practice and progress?

  • How do certain segments of users discuss their practice differently?


Programming in Python, I scraped the forum for a 62-day period (10/9-12/9/2018), extracting 318,555 words in 2911 posts across six forum channels. With this corpus, I sorted posts by title content, post content, streak day number, and number of comments.
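As a rough illustration of the kind of record this corpus implies, the sketch below stores each post with the four attributes named above and sorts by comment count. The `Post` class, its field names, and the toy posts are assumptions made for illustration; they are not taken from the original program.

```python
from dataclasses import dataclass


@dataclass
class Post:
    """One forum post. All field names here are hypothetical."""
    channel: str   # forum channel the post was scraped from
    title: str     # post title
    body: str      # post content
    streak: int    # author's streak in days at extraction time
    comments: int  # number of comments on the post


def word_count(post: Post) -> int:
    """Count whitespace-separated tokens in the title and body."""
    return len(post.title.split()) + len(post.body.split())


# Toy records, invented for illustration only.
posts = [
    Post("Spanish", "Ser vs estar?", "When do I use each one?", 12, 3),
    Post("French", "500 day streak!", "Finally reached my goal today.", 500, 41),
    Post("German", "Tips for cases", "Any help with the dative?", 130, 7),
]

# Total corpus size in words, and posts ordered from most to least commented.
corpus_size = sum(word_count(p) for p in posts)
by_comments = sorted(posts, key=lambda p: p.comments, reverse=True)
```

Splitting on whitespace is the simplest possible tokenizer; a real word count would depend on how punctuation and non-English text are handled.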

I also analyzed users by segment based on their streak level:

  • Level 1 = Streak 0-24 days

  • Level 2 = Streak 25-124 days

  • Level 3 = Streak 125-224 days, and

  • Level 4 = Streak 225+ days
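The segment boundaries above can be expressed as a small lookup function. This is a minimal sketch; the function name `streak_level` is my own label and not necessarily what the original program used.

```python
def streak_level(streak_days: int) -> int:
    """Map a streak length in days to the segment levels 1-4 listed above."""
    if streak_days < 0:
        raise ValueError("streak_days must be non-negative")
    if streak_days <= 24:
        return 1  # Level 1: 0-24 days
    if streak_days <= 124:
        return 2  # Level 2: 25-124 days
    if streak_days <= 224:
        return 3  # Level 3: 125-224 days
    return 4      # Level 4: 225+ days
```

For example, `streak_level(130)` falls in the 125-224 range and returns 3.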


[Activity] Certain users are less active in the forum.

  • Over the 62-day period, most user segments posted to the forum at roughly equal rates.

  • The exception: users with streaks of 125-224 days (Level 3) tend to post about a third as often as other segments.

Posts by Streak Level

Percentage of 2911 forum posts in a 62-day period
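A percentage breakdown like this one can be computed by counting posts per streak level. The sketch below assumes each post has already been assigned a level from 1 to 4; the function name and the toy list of levels are invented for illustration and do not reproduce the study's numbers.

```python
from collections import Counter


def post_share_by_level(levels):
    """Return {level: percentage of posts} for a list of per-post levels."""
    counts = Counter(levels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: 100 * n / total for level, n in sorted(counts.items())}


# Toy data: one entry per post, holding the author's streak level.
levels = [1, 1, 1, 2, 2, 2, 3, 4, 4, 4]
shares = post_share_by_level(levels)
```

With this toy input, Levels 1, 2, and 4 each account for 30% of posts and Level 3 for 10%.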

[Discussion] All Duolingo users seek help; user segments discuss distinct topics.

  • All user segments seek help. As expected, the top words for Level 1-3 users were help-related (‘help’, ‘vs’, ‘tips’).

  • When seeking help, users commonly compare differences in words, tenses, etc.

  • Users discuss app features differently: Duolingo Stories are most discussed by Level 3 users, while Level 4 users most commonly discuss their lesson completion and streak progress (‘tree’, ‘streak’).

  • Level 3 users are the most likely to mention ‘quit’ (0.07% of all words in their posts, compared to a 0.04% average for the forum).
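A figure such as "0.07% of all words" is a relative word frequency: occurrences of a word divided by the total number of words in a segment's posts. The sketch below shows one way to compute it. The tokenizer, the function name, and the sample posts are assumptions; the original program may have counted words differently (for example, by also matching ‘quitting’).

```python
import re
from collections import Counter

# Simple tokenizer: runs of lowercase letters and straight apostrophes.
TOKEN = re.compile(r"[a-z']+")


def relative_frequency(texts, word):
    """Percentage of all tokens in `texts` that exactly match `word`."""
    tokens = [t for text in texts for t in TOKEN.findall(text.lower())]
    if not tokens:
        return 0.0
    return 100 * Counter(tokens)[word.lower()] / len(tokens)


# Toy posts, invented for illustration only.
level_3_posts = ["I almost quit last week", "Stories help me practice"]
quit_share = relative_frequency(level_3_posts, "quit")
```

Here ‘quit’ is one of nine tokens, so `quit_share` is about 11.1%. Running the same function over each segment's posts and over the whole forum gives the comparison reported in the bullet above.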

[Engagement] While all users tend to ask questions, posts receive more comments when they relate to reaching goals and achievements.

  • Most posts (58.1%) receive fewer than ten comments.

  • Users tend to ask the most questions in their titles (i.e., as topics).

  • Aside from Level 4 users, most users ask a similar percentage of questions in their posts.

  • Mid-commented posts tend to discuss practice.

  • Generally, posts with a higher percentage of questions tend to yield fewer comments.

  • Users comment most on posts about goal completion: streaks and reaching the final tree level (25).

Posts by Number of Comments

Percentage of 2911 forum posts in 62-day period
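The engagement findings rest on two measurements: how much of a post consists of questions, and which comment-count group a post falls into. The sketch below is one hedged way to compute both. The sentence splitter is deliberately naive, and the bucket names and the cutoff of 50 comments are my assumptions; only the "fewer than ten comments" boundary comes from the findings above.

```python
import re


def question_share(text):
    """Percentage of sentences in `text` that end with a question mark."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    questions = sum(s.endswith("?") for s in sentences)
    return 100 * questions / len(sentences)


def comment_bucket(n_comments):
    """Group a post by comment count (the 50-comment cutoff is assumed)."""
    if n_comments < 10:
        return "low"
    if n_comments < 50:
        return "mid"
    return "high"
```

For example, `question_share("Is this right? I think so. Why?")` treats two of the three sentences as questions, so it returns roughly 66.7.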


In my findings, I see early evidence suggesting opportunities where Duolingo could better support its growing community of learners:

  1. Since users frequently discuss comparisons (‘vs’), it could be valuable to design features teaching the differences between certain uses of vocabulary, verb tenses, etc.

  2. Users tend to engage most with achievements (streak number, tree completion). Could there be a way to gamify and thereby enhance the celebration process, allowing users to spend currency on unique gifts or ways to celebrate?

  3. Level 3 users have language patterns unique to their segment: They tend to post up to three times less than other segments, are the only segment to frequently discuss Duolingo ‘Stories’, and are more likely to discuss ‘quitting’. Why is this subset unique, and does it tell us something about their progress? I would recommend analyzing whether there is a relationship between posting less and using Stories more: Does the use of Stories better meet users’ need to learn on their own? Or does posting less limit how users discuss their practice? Why are they more likely to discuss quitting, and at this critical stage, how can Duolingo incentivize them to continue and reach ‘super user’ status?

  4. Level 1 users tend to ask more questions and seek help, but posts that focus on help or comparisons tend to get less comment engagement than those that discuss streaks. Is there a way to incentivize moderators or super users to engage more with beginner users?


I’ve discovered some limitations to the study and offer suggestions for improving the corpus for future exploration:

  • Non-English texts: This study did not analyze non-English text in the forum.

  • Streak ‘volatility’: The streak number was recorded at the time of extraction (12/9/18), not at the time of posting.

  • Duolingo moderators: The corpus did not exclude posts by company moderators.

  • Language channels: The corpus draws on a selected set of language channels, whose courses differ in product features (not all courses offer the same lesson features).

  • Comparison metrics: This corpus was not compared to forum texts from other gaming or language learning products.

  • Selection: This report assumes that users in all segments are equally capable of learning a language. It also only analyzes users who post to the forum, who are presumably a distinct subset of the larger population. To better support causal claims from these findings, it could be valuable to use propensity score matching (PSM) so that comparisons are made between similar users.


By analyzing the Discussions forum, we delve into the language patterns of users to better understand their expressed motivations, needs, and struggles when learning a language. Hopefully, this analysis offers lessons on how not only to support current users but also to reach inactive and new users, improving the language learning experience for millions across the world.

To explore the full analysis, view my report (PDF).