Why Statistics for data science?

Sarabu Sai Prashanth
May 18, 2021


Statistics is the most important area to learn when solving business problems using data, including problems in machine learning, deep learning, and natural language processing. A good knowledge of statistics helps us understand the data and get an overview of it in practice, rather than knowing the concepts only in theory. Statistics is needed to make sense of structured, unstructured, and semi-structured data.

Statistics is all about analyzing, visualizing, and summarizing data, and then drawing conclusions from it. For example, it can show whether the values in a dataset are positive or negative when we check the data.

Statistical understanding helps us find the key insights in a dataset before applying machine learning algorithms to it. It is equally important for interpreting the results that the algorithms produce afterwards.

It is also very important to visualize the data when solving a problem for a given business goal. Understanding the necessary statistics makes the data in the dataset much easier to interpret.

There are 2 types of statistics.

Inferential Statistics

Descriptive Statistics

Descriptive statistics summarizes a dataset using measures such as the mean, median, mode, and standard deviation, which describe the center and the spread of the data. The description of a dataset depends on the central tendency of the data.

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution.

Descriptive statistics is all about summarizing the whole dataset, using graphs and visualizations to get an overview of the data and find the insights in it.

The mean, median, and mode describe the center of each variable in the dataset, while the standard deviation describes its spread. Together they can help us judge the importance of each variable.

In statistics, the three most common measures of central tendency are the mean, median, and mode.

Mean: The mean (average) of a dataset is found by adding all the numbers in the dataset and then dividing by the number of values in it.

Median: The median is the middle value when a dataset is ordered from least to greatest. If the dataset has an even number of values, the median is the average of the two middle values.

Mode: The mode is the value that occurs most often in a dataset.

Standard deviation: The standard deviation is a summary measure of how far the observations are from the mean. It is the square root of the average squared difference between each observation and the mean.
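The four measures above can be computed with Python's built-in statistics module. This is a minimal sketch; the list of values is made up purely for illustration and does not come from any real dataset.

```python
import statistics

# Hypothetical data, chosen only to illustrate the definitions above.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)        # sum of the values divided by their count
median = statistics.median(values)    # middle value of the sorted data
mode = statistics.mode(values)        # value that occurs most often
std_dev = statistics.pstdev(values)   # population standard deviation

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("standard deviation:", round(std_dev, 3))
```

Here pstdev treats the list as the whole population. When the values are only a sample, statistics.stdev is the usual choice, since it divides by n - 1 instead of n.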

Inferential statistics is about analyzing a sample drawn from a population. Inferences about the whole population are made from the sample, because the entire population is usually unavailable to us or too large to draw conclusions from directly.

Sample Vs Population

Population

A population includes all the elements of the dataset. A measurable characteristic of a population, such as its mean or standard deviation, is known as a parameter. For example, all the people living in India make up the population of India.

Sample

A sample includes one or more observations drawn from the population. A measurable characteristic of a sample is known as a statistic. Sampling is the process of selecting the sample from the population. For example, some of the people living in India form a sample of that population.
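The difference between a parameter and a statistic can be sketched in a few lines of Python. The population here is synthetic (10,000 randomly generated heights in centimeters), and the sample size of 500 is an arbitrary choice; both are assumptions made only for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 synthetic heights in cm (not real data).
population = [random.gauss(165, 10) for _ in range(10_000)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_std = statistics.pstdev(population)

# Sampling: select 500 observations from the population without replacement.
sample = random.sample(population, 500)

# Statistics describe the sample and serve as estimates of the parameters.
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
print(f"population std (parameter):  {population_std:.2f}")
print(f"sample std (statistic):      {sample_std:.2f}")
```

The sample values should land close to the population values without matching them exactly, which is the idea behind making inferences from a sample.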

There are 2 types of data to remember before working with datasets.

Categorical data

Numerical data

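Categorical variables represent types of data that can be divided into groups.

Examples: age group, educational level, race, sex, etc.

There are 2 main types of categorical data: nominal data and ordinal data. Nominal categories have no natural order (for example, sex), while ordinal categories do (for example, educational level).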

Numerical data is a type of data that is expressed in terms of numbers rather than natural language descriptions.

There are 2 main types of numerical data: continuous data and discrete data.

Examples: weight, height, IQ, etc.

Categorical and numerical data types are important to keep in mind before applying machine learning algorithms to a dataset. A categorical target variable makes the task a classification problem, whereas a numerical target variable makes it a regression problem.
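The rule above can be sketched as a small helper. The function name problem_type and the two example columns are hypothetical, and the check is a simplifying heuristic: class labels stored as integers (such as 0 and 1) would look numerical to it, so real projects need to inspect the meaning of the target and not only its type.

```python
def problem_type(target_values):
    """Guess the problem type from a target column (a simplifying heuristic)."""
    # Treat the target as numerical only if every value is an int or float.
    # Booleans are excluded because they act as category labels.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    )
    return "regression" if is_numeric else "classification"


# Hypothetical target columns, made up for illustration.
house_prices = [250000.0, 310500.0, 199900.0]        # numerical target
loan_status = ["approved", "rejected", "approved"]   # categorical target

print(problem_type(house_prices))
print(problem_type(loan_status))
```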

I hope you find this article useful.

This is the first in a series of articles on this topic. In my next article, I will cover statistics for data science in more depth, with examples.

Thank you :)
