Skip to content

The overall objective of this toolkit is to provide and offer a free collection of data analysis and machine learning that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run this collections either in Jupyter notebook, python alone or the html version.

License

Notifications You must be signed in to change notification settings

syedDS/Complete-Data-Science-Toolkits

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Complete-Data-Science-Toolkits

The overall objective of this toolkit is to provide and offer a free collection of data analysis and machine learning that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run this collections either in Jupyter notebook, python alone or the html version.

Features

Machine Learning

  • Cross-Validation
  • Evaluating Classification Metrics
  • Evaluating Clustering Metrics
  • Evaluating Regression Metrics
  • Grid Search
  • Preprocessing Encoding Categorical Features
  • Preprocessing Binarization
  • Preprocessing Imputing Missing Values
  • Preprocessing Normalization
  • Preprocessing StandardScaler
  • Randomized Parameter Optimization

Numpy

  • Adding, Removing, and Splitting Arrays
  • Sorting arrays
  • Matrix object
  • Statistics Vector Math
  • Structured Arrays
  • Import, Export, Slicing, Indexing
  • Data to from string

Pandas

  • Complete pandas
  • Groupby in Pandas
  • Mapping
  • Filtering
  • Applying

Visualization

  • BarPlots
  • Customization Matplotlib
  • Working with Image
  • Working with text

Naming Conventions

  • The naming convections I followed is:
  • [yyyy-mm-dd-in-project-name-library].extention
  • yyyy = stands for year
  • mm = stands for month
  • dd = stands for day
  • in = my initial, for example: Saleban Olow = so
  • library = numpy, pandas, sklearn, matplotlib
  • project-name = each project name
  • extention = .ipynb, .py, .html
  • Example: 2017-25-11-so-cross-validation-sklearn.ipynb

Code Samples:

Cross Validation

from sklearn.model_selection import cross_val_score
model = SVC(kernel='linear', C=1)
# let's try it using cv
scores = cross_val_score(model, X, y, cv=5)

Grid Search

from sklearn.grid_search import GridSearchCV
params = {"n_neighbors": np.arange(1,5), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score)
print(grid.best_estimator_.n_neighbors)

Preprocessing Imputing Missing Values

from sklearn.preprocessing import Imputer
impute = Imputer(missing_values = 0, strategy='mean', axis=0)
impute.fit_transform(X_train)

Randomized Parameter Optimization

from sklearn.grid_search import RandomizedSearchCV
params = {"n_neighbors" : range(1,5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

Model fitting supervised and unsupervised learning

#supervised learning
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
#unsupervised learning
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
pca_model = pca.fit_transform(X_train)

Working with numpy arrays

import numpy as np 
#appends values to end of arr
np.append(arr, values)
#inserts values into arr before index 2
np.insert(arr, 2, values)

Indexing and Slicing arrays

import numpy as np 
#return the element at index 5
arr = np.array([[1,2,3,4,5,6,7]])
arr[5]
#returns the 2D array element on index 
arr[2,5]
#assign array element on index 1 the value 4
arr[1] = 4
#assign array element on index [1][3] the value 10
arr[1,3] = 10

Creating DataFrame

import pandas as pd 
#specify values for each rows and columns
df = pd.DataFrame(
	[[4,7,10],
	 [5,8,11],
	 [6,9,12]],
	 index=[1,2,3],
	 columns=['a','b','c'])

groupby pandas

import pandas as pd 
import pandas as pd 
#return a groupby object, grouped by values in column named 'cities'
df.groupby(by="Cities")

handling missing values

import pandas as pd 
#drop rows with any column having NA/null data.
df.dropna()
#replace all NA/null data with value
df.fillna(value)

Melt function

import pandas as pd 
#most pandas methods return a DataFrame so that
#this improves readability of code
df = (pd.melt(df)
	  .rename(columns={'old_name':'new_name', 'old_name':'new_name'})
	  .query('new_name >= 200')
)

Save plot

mport matplotlib.pyplot as plt 
#saves plot/figure to image
plt.savefig('pic_name.png')

Marker, lines

import matplotlib.pyplot as plt 
#add * for every data point
plt.plot(x,y, marker='*')
#adds dot for every data point
plt.plot(x,y, marker='.')

Figures, Axis

import matplotlib.pyplot as plt 
#a container that contains all plot elements
fig = plt.figures()
#Initializes subplot
fig.add_axes()
#A subplot is an axes on a grid system, rows-cols num
a = fig.add_subplot(222)
#adds subplot
fig, b = plt.subplots(nrows=3, ncols=2)
#creates subplot
ax = plt.subplots(2,2)

Working with text plot

import matplotlib.pyplot as plt 
#places text at coordinates 1/1
plt.text(1,1, 'Example text', style='italic')
#annotate the point with coordinates xy with text 
ax.annotate('some annotation', xy=(10,10))
#just put math formula
plt.title(r'$delta_i=20$',fontsize=10)

About

The overall objective of this toolkit is to provide and offer a free collection of data analysis and machine learning that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run this collections either in Jupyter notebook, python alone or the html version.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 65.8%
  • Jupyter Notebook 33.7%
  • Python 0.5%