HDBox v3.0.7 App

In 2018, I set out to build my own data science community at the University of Waterloo (UW) and joined a few communities that drew me in: Python for Data Science, Hadoop + Apache Spark, SQL + Machine Learning. I've written about them before, but I'll repeat it now: while I got a lot out of some and moved on from others, I decided that starting with just myself would be better in the long term. So I took over as lead developer/owner of The Data Science Community.

I tried to do this for two years without much success, mostly because I couldn't connect with people from other backgrounds in tech. If I can get things going on my own anyway, then so be it. Maybe you can too! That is why I'm writing this post now: to show how to start your own Python 3 or R project, which tools you need to learn data science, and where to go for help choosing libraries. That last point is one of the most important for getting good results with PySpark or Jupyter. Both will show you exactly how things work once you start a real program, and both give you an overview of why you should use them and what you will run into along the way.

All data science projects follow these basic rules:

You choose a problem for the first week (or two). Then you spend a few weeks figuring out which libraries make the most sense, and what kind of system to set up to run them. You decide which tool(s) you need and what packages are available for each one. When you don't know everything, you look for something you think will be enough for your project; use GitHub for that part. I've written more about this elsewhere. Eventually you decide, write a tutorial, and implement everything on GitHub. Make sure your code builds cleanly when you push a branch (I always make sure mine does), and also remember to merge back any open PRs that show up afterwards, so someone else can see what changes you made. The end result will look like this:

You can do it all with GitHub, GitLab, or a public GitHub repository. But I have found that it sometimes takes more than a day of hard coding and digging to come up with all the pieces, and I'm tired of typing commands for every loop. It was only in 2019, when we started using PySpark (alongside Jupyter) for our Python programming, that I realized it was much easier than dealing with missing pieces of code and then figuring out what to do next. After running several of the tutorial examples and writing comments on the ones I worked through, I realized it wasn't that bad after all.

The code I'm sharing for a project like this doesn't look much different from what you'd find in a typical GitHub repository. There is still a little extra boilerplate, but that's just what happens once you find what you're looking for, and it makes the code cleaner and simpler.

This goes all the way up to which libraries to install when you start the project. I am using PySpark because it has many useful features. Libraries like NumPy (necessary for statistics, linear algebra, etc.) are very handy. Others aren't strictly essential, but you'll probably want pandas, matplotlib, SciPy, or seaborn to make your model easier to read and understand. Some libraries are more situational: you might only need one before plotting a histogram, or only after creating a certain variable. We have a list here for finding any needed libraries. Once you know what you're working with, put everything under its own name rather than a deep folder structure: just add everything under "a", as in "a_package", and keep everything else under that. You might as well keep it all under source control. Do that for everything.
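As a hedged sketch, the core import stack described above usually boils down to a few lines; the exact set will depend on your project:

```python
# Core numerics -- NumPy is the base layer for statistics and linear algebra
import numpy as np

# Convenience layers -- not strictly essential, but they make models
# easier to inspect and plot
import pandas as pd
import matplotlib.pyplot as plt

# Quick sanity check that the stack is wired together:
# a small array, summarized through pandas
arr = np.arange(10)
summary = pd.Series(arr).describe()
print(summary["mean"])  # 4.5
```

Install whatever subset you need with `pip install numpy pandas matplotlib scipy seaborn` and add the rest only when a step actually calls for it.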

Project Folder Structure

Here is what to add to your new folder if you want to organize all the files under some structure. Note that there are a couple more folders here too, so that you have a separate folder for each of the different model stages you'll work on: training, validation, test, etc. These are all small chunks of code for preparing different models for what is called feature engineering.
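As a sketch of the kind of layout I mean, here is one way to bootstrap it with the standard library. The folder names are illustrative, not prescriptive:

```python
from pathlib import Path

# Illustrative project layout: one folder per model stage,
# matching the training/validation/test folders described above
root = Path("my_project")
for sub in ["data/raw", "data/clean", "models/training",
            "models/validation", "models/test", "notebooks"]:
    (root / sub).mkdir(parents=True, exist_ok=True)

# List what was created, relative to the project root
print(sorted(p.relative_to(root).as_posix()
             for p in root.rglob("*") if p.is_dir()))
```

Because `exist_ok=True` is set, the script is safe to re-run on an existing project.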

The reason I created this is to show how easy it is to get your initial dataset, clean it, split it, and prepare input files for training the model. It also reminds me of the problems I would face before deploying any machine learning software: my background isn't in machine learning, so I had to deal with missing features while setting up the deployment infrastructure. I used to go through each step by hand and copy that information over into the source files, but thanks to the internet I can simply click and save.

My file structure for a model I've started building is pretty simple:

There are several ways to reach out if you need help building a project similar to mine. One option is obviously Medium or Twitter; I blog there on a daily basis. I'd love to hear from someone who uses this approach too.

How to Download the Application

Open File Explorer and find the zipped folder. To unzip the entire folder, right-click it, select Extract All, and then follow the instructions. To unzip a single file or app, double-click the zipped folder to open it, then drag or copy the item out of the zipped folder.

If the application does not work for you, turn on a VPN and try again.

Download