Sharke App 09/04/2024

In 2018, I set out to build my own data science community at the University of Waterloo (UW) and joined a few communities that drew me in: Python for Data Science, Hadoop + Apache Spark, SQL + Machine Learning. I wrote about them here, but I'll repeat it now: while I got a lot out of some and moved on from others, I decided that starting with just myself would be better in the long term. So I took over as lead developer/owner of The Data Science Community.

I tried to do this for two years without much success, mostly because I wasn't able to connect with people from other backgrounds in tech. If I can get things going on my own anyway, then so be it; maybe you can too. That is why I'm writing this post now: to explain how to start your own Python 3 or R project, to show which tools you need to learn data science, and to point out where to go if you want help choosing libraries. That last point is one of the most important for getting good results with PySpark or Jupyter. Both will show you exactly how they work once you start a real program, and both give you an overview of why you should use them and what you will run into along the way.

All data science projects follow these basic rules:

You choose a problem in the first week (or two). Then you spend a few weeks figuring out which libraries make the most sense and deciding what kind of system to set up to run them: which tool(s) are needed, and what packages are available for each one. When you don't know everything, you look for something you think will be enough for your project; use GitHub for that part, and I wrote more about it here. Eventually you decide, write down a tutorial, and then implement everything on GitHub. Make sure your code builds cleanly when you push a branch (I always make sure mine does), but also remember to merge back any PRs that show up afterwards, so someone else can see what changes you made. The end result will look like this.

You can do it all with GitHub or GitLab in a public repository. But I have found that it sometimes takes more than one day of hard coding and digging to come up with all the pieces, and I'm tired of typing commands for every loop. It was only in 2019, when we started using PySpark (alongside Jupyter) for our Python programming, that I realized it was a lot easier than dealing with missing pieces of code and figuring out what to do next. After running several of the tutorial examples and writing comments for the ones I worked through, I realized it wasn't that bad after all.

The code I'm sharing for a project like this doesn't start out much different from what you'd see in any GitHub repository. There is still a little extra boilerplate, but that's just what happens once you find what you're looking for, and it makes the code cleaner and simpler.

This goes all the way up to which libraries to install when you start the project. I am using PySpark because it has many useful features. Libraries like numpy (which is necessary for statistics, linear algebra, etc.) are very handy. Others aren't essential, but you'll probably want pandas, matplotlib, scipy, or seaborn to make your model easier to read and understand. Some libraries are more optional: you might want to install one before plotting a histogram, or another after creating a variable. We have a list here for finding any needed libraries. Once you know what you're working with, put everything under a single package name with no deeper folder structure: add everything under "a", as in "a_package", and don't forget anything else under that. You might as well keep it all under source control. Do that for everything.
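As a rough sketch of what the top of such a project might pull in (the exact import list is my own choice, not prescribed by the post), the essential and nice-to-have libraries look something like this:

```python
# Typical imports for a small data science project; the selection
# here is illustrative, not the post's definitive list.
import numpy as np   # statistics, linear algebra
import pandas as pd  # tabular data handling

# Quick sanity check that the two work together.
arr = np.arange(6).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]
df = pd.DataFrame(arr, columns=["a", "b", "c"])
print(int(df["b"].sum()))  # 1 + 4 = 5
```

matplotlib, scipy, and seaborn would be added the same way once plotting or modelling actually needs them.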

Project Folder Structure

Here is what you have to add to your new folder if you want to keep all the files under some structure. Note that there are a couple more folders here too, so that you have a separate folder for each of the different models you'll work on: training, validation, test, and so on. These are all very small chunks of code for the different models to be prepared for what is called feature engineering.
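A minimal sketch of setting up such a layout (the folder names are my guesses based on the description above, not the author's actual tree):

```python
from pathlib import Path

# Hypothetical layout: one folder per model stage, mirroring the
# training/validation/test split described in the post.
folders = [
    "data/train",
    "data/validation",
    "data/test",
    "notebooks",
    "src",
]

root = Path("my_project")
for f in folders:
    (root / f).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in (root / "data").iterdir()))
# ['test', 'train', 'validation']
```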

The reason I created this is to show how easy it is to get your initial dataset, clean it, split it, and prepare some input files for training the model. It also reminds me of the problems I would face before deploying any machine learning software: my background doesn't match machine learning, so I had to deal with missing features when setting up the deployment infrastructure. I always go through the steps of what I'm doing by hand and copy that information into the source files, but thanks to the internet I can simply click and save.
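The get-clean-split steps might look roughly like this in pandas (the column names and the split ratio are invented for illustration):

```python
import pandas as pd

# Toy dataset standing in for the real one; the columns are invented.
df = pd.DataFrame({
    "feature": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "target":  [0, 1, 0, 1, 0, 1],
})

# Clean: drop rows with missing values.
clean = df.dropna()

# Split: an 80/20 train/test split by position.
cut = int(len(clean) * 0.8)
train, test = clean.iloc[:cut], clean.iloc[cut:]

print(len(clean), len(train), len(test))  # 5 4 1
```

The split pieces would then be written out as the input files for training.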

My file structure for a model I've started building is pretty simple:

Train Folder

I am making three datasets: train1, train2, and train3. Train1 has already been cleaned to remove unnecessary columns; train2 is left untouched, since I didn't do time series analysis (it would take too long to explain why); and the third is the testing dataset, which will be split in two: testing (where we do feature engineering) and unseen (we'll talk about the latter later on). The point of having these different datasets is that you need two datasets to feed into a pipeline with the same rows on both, otherwise the model won't scale right and you end up with mismatched datasets. My original plan was to add some features to the training dataset, so I could load it again (to fit a larger sample of train2) or even create multiple instances of the data (to fit a smaller subset of train2) and add those rows to the train1 and train3 datasets. A quick Google search showed me how hard it is to do such analyses without actually going through the code, so I'll skip that. Fortunately some libraries (like statsmodels) let us combine data frames, and that is basically what I used to accomplish this. If I had known there wouldn't be any preprocessing (removing missing values, grouping data by category, and so on) that would allow this analysis, I wouldn't have bothered.
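Combining data frames as described above can be done with pandas (statsmodels works on top of frames like these); the two frames below are invented stand-ins for the post's train1 and train2:

```python
import pandas as pd

# Stand-ins for the post's train1/train2 frames (contents invented).
train1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
train2 = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# Stack the rows of both frames. Both must share the same columns,
# otherwise the result ends up with mismatched NaN-filled columns --
# the "mismatched datasets" problem mentioned above.
combined = pd.concat([train1, train2], ignore_index=True)
print(combined.shape)  # (4, 2)
```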

If I hadn't done all of this during the second half of 2020, and again during the spring break semester I spent at UW, I wouldn't have been able to get it completed at all. At the least, I would not have had all of the dependencies installed on the local machine I'm renting until late June, and would have had issues installing them. Those were tough lessons, especially when you learn something new and it is not compatible with something you did before.

A screenshot from one of my notebooks

That will get you started if you have access to a Unix terminal and know how to type things there. However, if you are new to Linux, macOS, or another operating system, you do not need to worry about connecting directly to an OS: you can install Git, fork the repository, clone it into a new directory, and start working on a new notebook.

First few lines from an initial version of a notebook (they might change)
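In place of the screenshot, here is a guess at what the first cells of such a notebook might contain (every name here is illustrative; the real notebook in the linked repo may differ):

```python
# First-cell sketch for a notebook like the one described above.
# The commented-out file name "data.csv" is a placeholder.
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 50)  # easier inspection in Jupyter

# df = pd.read_csv("data.csv")  # the real data would be loaded here
df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
print(df.head())
```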

Here are the instructions for getting it ready for publishing: https://github.com/pierrondott/py_sphark_notebook

It gets more straightforward from there. Yes, this first step is not as tricky as it seems, but sometimes what you can achieve with these tools is still outside your grasp. As a beginner it feels like you can write whatever you want, but as your knowledge grows you might get discouraged and think it isn't worth figuring it all out yourself. That is just the sad truth, especially when you can rely on help from Stack Overflow or Quora to pull you through.

After you have uploaded a notebook, leave it alone. Someone else can check and review it and see whether it's good or not. Keep the link in the gist, but don't edit it, change its visibility, or do anything else that might confuse anyone who sees it later when you upload new material. When I see something in a notebook I try to comment on it on the web, but only after someone notices the link (I don't really like that, but at least I'm keeping a copy of the whole thing for reference). I tried to make a note to include the exact same thing I did before (there is no documentation on GitHub for those), but I forgot about it and didn't see it through.

There are several ways to reach out if you need help building a project like mine. One option is obviously Medium or Twitter. I usually take a random walk to my spot and blog about it on a daily basis. I'd love to hear from anyone who uses this approach too.
