I just finished the Google Summer of Code Program, wherein I worked on the Python machine learning package scikit-learn. Since I began working with the project in November 2015, I've occasionally received emails asking how one should get started contributing. In this blog post, I'll describe my journey in open source and give some tips for getting started.

In the beginning

I've always been interested in machine learning and the prospect of drawing reliable conclusions from incomplete data. Broadly, machine learning is the study of creating programs and applications that learn from experience and data.

With this interest in mind, I wanted to learn as much as I could. Early in my learning, I stumbled upon an elementary tutorial on Kaggle using scikit-learn to predict the survival outcome of Titanic passengers.

Despite having no prior knowledge of Python, I was surprised by, and fell in love with, how easy the package made the process. All I had to do was create the estimator, call fit() on my data, and call predict() on new samples! I dove into the API and documentation, and gradually began to learn about the complex mathematics and statistics underlying the deceptively simple API for the various machine learning algorithms implemented. At the same time, I learned more about Python by reading the scikit-learn examples and writing machine learning applications on my own (though I did eventually comb through a book for a more formal approach). It would not be an overstatement to say that I learned Python through scikit-learn (interestingly, I met several others at SciPy 2016 who also had this experience).
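To give a feel for that three-step workflow, here is a minimal sketch. It is not the code from the Kaggle tutorial: the handful of passengers and their numbers are made up for illustration, and the choice of DecisionTreeClassifier is my own assumption (any scikit-learn classifier follows the same create, fit(), predict() pattern).

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (NOT the real Titanic dataset): each row is [age, fare paid].
X_train = [
    [22, 7.25],
    [38, 71.28],
    [26, 7.93],
    [35, 53.10],
    [54, 51.86],
    [2, 21.08],
]
# Made-up labels: 1 = survived, 0 = did not survive.
y_train = [0, 1, 1, 1, 0, 0]

# Step 1: create the estimator (random_state fixed so runs are repeatable).
clf = DecisionTreeClassifier(random_state=0)

# Step 2: call fit() on the training data.
clf.fit(X_train, y_train)

# Step 3: call predict() on new, unseen samples.
X_new = [[30, 8.05], [40, 60.00]]
predictions = clf.predict(X_new)
print(predictions)
```

Swapping in a different model only means changing the line that creates the estimator; the fit() and predict() calls stay the same.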

Finding slight errors

A few years later, while running one of the documented scikit-learn examples for visualizing the stock market structure (in release 0.16), I noticed that the code failed to run on my computer. Digging deeper, I saw that the problem was due to a deprecated matplotlib function that was no longer present in the more recent version I had installed. I patched up the example myself, and saw that it ran satisfactorily on my machine.

The First Pull Request

At this point in time, I realized that I could contribute my correction back to the library. I had never been involved in an open source community before, so I was naturally quite apprehensive. I painstakingly followed scikit-learn's contributing guide to set up a proper development environment, and made my first pull request to the project. I was quite unfamiliar with git at the time, and messed this up somehow; I promptly opened another one with the correct contents.

Despite all my pre-reading and preparation, Gael Varoquaux and I worked together to make several changes to my simple pull request before it was ready for merging.

Through just my first pull request, I learned an immense amount; my code was initially rejected because it broke compatibility with past versions of matplotlib, which is something I had never even considered! Additionally, I had no experience with continuous integration systems and limited git knowledge. After taking a while to figure out how to "squash my commits" properly, the pull request was merged.
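For readers wondering what keeping compatibility with older versions of a dependency can look like in practice, here is a minimal sketch of one common pattern: look for the new name first, and fall back to the old one if it is missing. This is not the actual patch from my pull request. The helper import_first_available and the name "square_root" are hypothetical, and I use the standard library's math module as a stand-in so the snippet runs without matplotlib installed.

```python
import importlib


def import_first_available(module_name, candidate_names):
    """Return the first attribute of `module_name` found in `candidate_names`.

    Candidates are tried in order, so list the newest name first and the
    older, deprecated names after it.
    """
    module = importlib.import_module(module_name)
    for name in candidate_names:
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(
        "none of %r found in module %r" % (candidate_names, module_name)
    )


# Illustration with the standard library only: "square_root" is a made-up
# name standing in for a function that exists only in some releases, so the
# helper falls back to the name that is actually present, math.sqrt.
sqrt = import_first_available("math", ["square_root", "sqrt"])
print(sqrt(16.0))
```

The same idea is often written inline as a try/except ImportError around two import statements; either way, code written against one version of a library keeps working for users on another.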

In hindsight, I'm extremely grateful to Gael and Tom for their support and patience; their kindness was a huge factor in making me feel welcomed in the project. I was happy that my work had been accepted by the project reviewers, and that I had the ability to contribute back to a project that I used so much. I wanted to get more involved and was curious about what else I could help out with. I began to follow the issue tracker and contributed back to the library whenever I could pitch in.

Participating in Google Summer of Code

In February 2016, I found out about a program called Google Summer of Code while browsing the scikit-learn project wiki and seeing the past proposals that had been submitted for consideration to the project. I thought the program would be a perfect opportunity to work more closely with community members as mentors and contribute a solid body of work.

I expressed my interest in participating on the mailing list, and I was contacted by Raghav a few weeks before the program opened. He proposed a project involving working with the tree module, and I drafted a proposal and was eventually accepted to the program.

I was fortunate to have great mentors in Raghav and Jacob and was able to learn a lot from them through our frequent communication on Google Hangouts and in person (Jacob and I are both at the University of Washington). For technical details about my Google Summer of Code project, check out my previous blog posts on it.

Post GSoC

Since the end of the program, I've been working on reviewing more pull requests. scikit-learn is mostly limited by reviewer bandwidth, as there are far more pull requests than contributors have time to critique. I hope that, with time, I can develop this skill and contribute to the project in this facet as well (while also contributing code, of course!).

Getting started with open source

Starting open source contributions can be daunting; there's so much out there that it's difficult to find where you can pitch in. I'll try to provide a short guide on how to get started contributing below.

Finding something to work on

This is often the hardest part for new contributors. The best way to get started is to simply jump in! There are a myriad of ways to contribute to an open source project. Obviously, writing code to fix bugs, add new features, or enhance existing ones is useful. However, you don't have to write code to help out! Documentation is a critical part of any open source project, and there's always something to help out with in this department.

If you find an issue that you want to tackle, it's generally good practice to leave a comment asking whether you can work on it, or saying that you will take it. This reduces the probability of duplicate patches.

While working on your contribution

While you're preparing your patch, make sure to read the contributing guidelines of the project. Often, this is a file in the root directory of the project. Alternatively, it could be in the documentation. Following the protocol of the project will reduce the effort that both you and the reviewers need to spend on future corrections.

After you've submitted a pull request

Great, so you solved the issue and opened a pull request on the project. At this point, you should wait for reviewers to comment and suggest improvements. Don't be afraid to talk to them and ask them questions; you both have the same goal of improving the project. Address reviewer concerns in a timely manner, and your code will eventually be merged if the improvement is deemed necessary.

Be a good community member

I highly recommend that prospective contributors "Watch" the projects that they are interested in on GitHub. This allows you to easily keep up with the various conversations happening throughout the project and see how you can help out. Don't be afraid to join these conversations and make your voice heard; user input is generally quite welcome in discussions regarding new enhancements, APIs, and future releases. Of course, all rules of normal conversation apply to online forums as well: treat others as you would like to be treated, and be a good person.

To conclude, I'd like to emphasize that everyone at one time was a beginner. Don't be afraid to ask "stupid" questions, because there's a natural learning curve involved in open source contribution; the most important thing by far is the willingness to try and learn.

Thanks to YenChen Lin and Victor Chen for reading and providing feedback on early drafts of this post.

Thanks for reading! Feel free to follow me on GitHub or subscribe to blog updates via email.