1. What are sets in Python? Describe a few of the characteristics of sets.
A set is an unordered collection of unique, hashable Python objects. Two common uses for sets are membership tests (checking whether an object belongs to the set) and storing a collection of distinct items. Sets are defined by enclosing a comma-separated sequence of values in curly braces ({ and }).
The following are some essential Python set properties:
- You cannot index or slice a set the way you can a list or tuple, because sets have no defined order.
- Sets only permit unique items, so adding a duplicate object to a set simply has no effect.
- Sets are mutable, so you can add or remove objects from them using the add and remove methods.
- Because sets do not support indexing or slicing, you cannot use an index to retrieve a specific element from a set.
- Sets are not hashable: because sets are mutable, they cannot be elements of other sets or keys in dictionaries. If you need a hashable, set-like object to use as a key or element, use a frozenset, which is an immutable variant of a set (see the sketch after this list).
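A short, self-contained sketch of these characteristics (the variable names are purely illustrative):

```python
# Sets ignore duplicates and have no defined order
numbers = {3, 1, 2, 3, 2}
print(numbers)            # {1, 2, 3} -- the duplicates were dropped

# Sets are mutable: add and remove elements in place
numbers.add(4)
numbers.remove(1)
print(numbers)            # {2, 3, 4}

# Membership tests are a common use case
print(3 in numbers)       # True

# Sets are unhashable, but frozensets can be dictionary keys
try:
    {numbers: "value"}    # raises TypeError: unhashable type: 'set'
except TypeError as exc:
    print(exc)
print({frozenset(numbers): "value"})  # works fine
```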
2. Explain the logical operations in Python.
- Python’s logical operators and, or, and not let you perform boolean operations on truth values.
- If both operands are True, the and-operator returns True; otherwise, it returns False.
- If one of the operands is True, the or operator returns True; if both operands are False, it returns False.
- The not operator flips the operand’s boolean value: not returns False if the operand is True, and True if the operand is False. The sketch below shows all three operators.
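A minimal sketch of the three operators in action:

```python
a, b = True, False

print(a and b)   # False -- both operands must be True
print(a or b)    # True  -- at least one operand is True
print(not a)     # False -- not flips the boolean value
print(not b)     # True

# Note: and/or also short-circuit, returning one of the operands
# rather than always producing a strict True/False.
```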
3. What are data types that are mutable and immutable?
In Python, an immutable object is one whose state cannot change after it is created: once an immutable object is built, its value cannot be modified. Immutable objects in Python include strings, tuples, and numbers (integers, floats, and complex numbers).
Conversely, a mutable object is one whose state can change after creation: once a mutable object is formed, its value can still be altered. Lists and dictionaries are examples of mutable objects in Python.
Knowing the distinction between mutable and immutable Python objects is crucial because it influences how you use and work with data in your code. For example, you can sort a list of integers in place using the built-in sort() method. However, because tuples are immutable, you cannot call sort() on a tuple of numbers; instead, you would build a new sorted tuple from the original.
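A short illustration of the difference, using a list and a tuple:

```python
nums_list = [3, 1, 2]
nums_list.sort()          # lists are mutable, so in-place sorting works
print(nums_list)          # [1, 2, 3]

nums_tuple = (3, 1, 2)
try:
    nums_tuple.sort()     # tuples are immutable and have no sort() method
except AttributeError as exc:
    print(exc)

# Instead, build a new sorted tuple from the original
sorted_tuple = tuple(sorted(nums_tuple))
print(sorted_tuple)       # (1, 2, 3)
```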
4. What is the purpose of Python’s try and except block?
Python uses the try and except blocks to deal with exceptions. Any time something goes wrong while a program executes, it throws an exception.
The try block contains code that could potentially throw an exception. If an error occurs while the try block is running, the code in the except block will run.
By using a try-except block, we can prevent the program from crashing on an error and instead run the code in the except block to produce the desired output or message.
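A minimal sketch of a try-except block handling a division error:

```python
def safe_divide(a, b):
    try:
        return a / b                        # code that may raise an exception
    except ZeroDivisionError:
        print("Cannot divide by zero")      # runs only if the error occurs
        return None

print(safe_divide(10, 2))   # 5.0
print(safe_divide(10, 0))   # prints the message, returns None
```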
5. Describe the concepts of dict and list comprehension.
List comprehension and dict comprehension are concise ways to generate new lists or dictionaries from existing iterables.
List comprehension makes building a list straightforward. It consists of an expression, a for clause, and optionally one or more if clauses; the result is a new list produced by evaluating the expression for each item that passes the if conditions.
Dict comprehension makes building a dictionary equally straightforward. It consists of a key-value pair expression inside curly braces, a for clause, and optionally one or more if clauses; a fresh dictionary is built by evaluating the key-value pair for each item that passes the if conditions.
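Short examples of both forms:

```python
numbers = [1, 2, 3, 4, 5]

# List comprehension: squares of the even numbers only
even_squares = [n ** 2 for n in numbers if n % 2 == 0]
print(even_squares)          # [4, 16]

# Dict comprehension: map each number to its square
squares = {n: n ** 2 for n in numbers}
print(squares)               # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
```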
6. What distinguishes yield keywords from return keywords?
The return keyword terminates a function and sends a value back to the caller. When the function reaches a return statement, it ends immediately and gives the caller the result of the expression that follows it.
On the other hand, yield defines a generator function. A generator function is a special function that produces a series of values one at a time rather than returning a single value. When a generator function reaches a yield statement, it produces a value, pauses execution, and saves its state for later use.
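A small sketch contrasting the two keywords (the function names are just illustrative):

```python
def first_n_return(n):
    # return builds the whole list and hands it back in one go
    return [i for i in range(n)]

def first_n_yield(n):
    # yield produces one value at a time, pausing between values
    for i in range(n):
        yield i

print(first_n_return(3))   # [0, 1, 2]
gen = first_n_yield(3)
print(next(gen))           # 0 -- execution pauses after each yield
print(list(gen))           # [1, 2] -- the remaining values
```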
7. What does Python’s “assert” keyword mean?
To test a condition in Python, you use the assert statement. If the condition is True, the program continues to run; if the condition is False, the program raises an AssertionError exception.
The assert statement is a common tool for assessing a program’s internal consistency. For instance, an assert statement could be used to verify that a list is sorted before running a binary search on it.
It is crucial to remember that the assert statement is designed for debugging, not for handling runtime errors. In production code, try and except blocks should be used to handle exceptions that may be raised at runtime.
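A hedged sketch of the binary-search example mentioned above (the function is illustrative, not a library routine):

```python
def binary_search(sorted_items, target):
    # Internal consistency check: the input must already be sorted
    assert sorted_items == sorted(sorted_items), "input list must be sorted"
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7], 5))   # 2
# binary_search([3, 1, 2], 2) would raise AssertionError
```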
8. How are categorical and numerical variables analyzed using univariate analysis?
One statistical method for examining and characterizing a single variable is univariate analysis. It is a helpful tool for finding patterns and connections in the data and comprehending a variable’s distribution, central tendency, and dispersion. To conduct univariate analysis for numerical and categorical variables, follow these steps:
For numerical variables:
- Calculate descriptive statistics such as the mean, median, mode, and standard deviation to better understand the data distribution.
- Use plots like density plots, boxplots, or histograms to visualize the data distribution.
- Examine the data for abnormalities and outliers.
- Use statistical tests or visualizations such as a Q-Q plot to check whether the data is normally distributed.
For categorical variables:
- Determine the number or frequency of occurrences of each category in the data.
- Determine the proportion or percentage for each data category.
- Use plots, such as pie charts or bar plots, to visualize the data distribution.
- Examine the data distribution for irregularities or imbalances.
Keep in mind that the precise procedures for carrying out univariate analysis may change based on the particular requirements and objectives of the study. Accurately and successfully describing and comprehending the data requires careful planning and execution of the analysis.
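As a hedged sketch, the steps above map onto a few common pandas calls (the DataFrame and column names here are made up purely for illustration):

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "age": [23, 31, 27, 45, 31, 52, 29],
    "segment": ["A", "B", "A", "C", "B", "A", "A"],
})

# Numerical variable: descriptive statistics and a histogram
print(df["age"].describe())        # mean, std, quartiles, etc.
print(df["age"].mode())
df["age"].plot(kind="hist")        # or kind="box" for a boxplot

# Categorical variable: counts, proportions, and a bar plot
print(df["segment"].value_counts())
print(df["segment"].value_counts(normalize=True))
df["segment"].value_counts().plot(kind="bar")
```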
9. What are the categories of skewness in statistics?
A distribution’s skewness measures its asymmetry. A bell-shaped distribution is symmetrical when its values are evenly balanced around the mean; a skewed distribution is one in which more of the values are concentrated on one side of the mean than the other.
Positive and negative skewness are the two categories of skewness.
Positive skewness: A distribution is positively skewed when it has a long right-hand tail, with most data points clustered on the left side of the mean. Positive skew occurs when a number of unusually large values pull the mean to the right.
Negative skewness: A distribution is negatively skewed when it has a long left-hand tail, with most data points clustered on the right side of the mean. Negative skew occurs when a small number of unusually small values pull the mean to the left.
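A quick way to check the direction of skewness numerically, using pandas (the data here is made up for illustration):

```python
import pandas as pd

right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])       # long right tail
left_skewed = pd.Series([1, 18, 19, 19, 20, 20, 20, 21])  # long left tail

print(right_skewed.skew())   # positive value -> positive (right) skew
print(left_skewed.skew())    # negative value -> negative (left) skew
```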
10. Describe the central limit theorem.
The Central Limit Theorem is a fundamental statistical principle which states that the distribution of the sample mean converges to a normal distribution as the sample size increases, regardless of the underlying distribution of the population. This means we can still use normal-distribution-based methods to draw conclusions about the population even when the individual data points are not normally distributed, provided we average over enough data points.
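A small NumPy simulation illustrates the idea: even though individual draws come from a strongly skewed (exponential) distribution, the means of many samples look approximately normal. The sample size and number of samples are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population is exponential, which is strongly right-skewed
population = rng.exponential(scale=2.0, size=100_000)

# Take 5,000 samples of size 50 and record each sample mean
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(5_000)
])

# The distribution of sample means is approximately normal,
# centred on the population mean with a much smaller spread
print(population.mean(), sample_means.mean())
print(population.std(), sample_means.std())   # roughly population std / sqrt(50)
```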
11. How do Type I and Type II errors differ?
Type I and Type II errors are the two categories of mistakes that can occur during hypothesis testing.
A Type I error, commonly referred to as a “false positive,” is the rejection of the null hypothesis when it is actually true. This type of error is denoted by the Greek letter alpha (α), and the standard threshold is 0.05, meaning the accepted likelihood of a Type I error (false positive) is 5%.
A Type II error, often referred to as a “false negative,” happens when the null hypothesis is false but is not rejected. This kind of error is denoted by the Greek letter beta (β), and the power of the test is written as 1 − β. The power is the probability of correctly rejecting the null hypothesis when it is false.
When performing hypothesis testing, it is imperative to make an effort to reduce the likelihood of both kinds of errors.
12. Which kinds of sampling strategies do data analysts employ?
Many sampling techniques are available to data analysts, but these are a few of the most commonly used ones:
Simple random sampling: a basic technique in which every member of the population has an equal chance of being chosen for the sample.
Stratified random sampling: the population is sorted into subgroups (strata) according to particular traits, and a random sample is then drawn from each stratum.
Cluster sampling: the population is broken up into smaller groups (clusters), and a random sample of whole clusters is then selected.
Systematic sampling: every kth member of the population is selected for the sample.
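Hedged sketches of three of these techniques with pandas (the DataFrame, column names, and sample sizes are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 101),
    "region": ["north", "south", "east", "west"] * 25,
})

# Simple random sampling: each row has an equal chance of selection
simple_sample = df.sample(n=10, random_state=42)

# Stratified random sampling: sample within each stratum (here, region)
stratified_sample = (
    df.groupby("region", group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

# Systematic sampling: take every kth row
k = 10
systematic_sample = df.iloc[::k]
```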
13. What distinguishes feature extraction from feature selection?
Feature selection is the process of choosing which features to feed into the model. We keep the features most relevant to the task and reject those that are clearly irrelevant to the model’s prediction.
Feature extraction, on the other hand, is the technique of deriving new features from raw data. It entails transforming unprocessed data into a set of features that can be used to train an ML model.
Both are crucial because they filter the features supplied to our machine learning model, which in turn helps improve its accuracy.
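A hedged sketch using scikit-learn to contrast the two ideas, with SelectKBest standing in for selection and PCA for extraction (the dataset and the choice of k/components are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep the 2 original features most related to the target
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print(X_selected.shape)   # (150, 2) -- still original features, just fewer

# Feature extraction: build 2 brand-new features (principal components)
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_extracted.shape)  # (150, 2) -- new features derived from all originals
```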
14. What are the five assumptions of linear regression?
The five assumptions of linear regression are as follows:
- Linearity: The relationship between the independent variables and the dependent variable is linear, i.e., it can be described by a straight line.
- Error independence: The errors (residuals) are independent of one another.
- Homoscedasticity: The errors have constant variance across all predicted values.
- Normality: The errors are normally distributed.
- Predictor independence: There is no correlation between the independent variables.
15. What are some methods for preventing underfitting?
Several methods can keep a model from underfitting:
- Feature selection: Selecting appropriate features is crucial for training a model, because choosing the wrong features may lead to underfitting.
- Increasing the number of features used by the model.
- Using a more complex machine learning model.
- Tuning the model’s hyperparameters.
- Reducing noise in the data so the model can better identify the dataset’s underlying complexity.
16. Multicollinearity: What is it?
Multicollinearity is a significant correlation between two or more predictor variables in a multiple regression model. It can result in unstable and unreliable coefficient estimates, which makes it challenging to interpret the model’s findings.
To put it another way, multicollinearity is a high degree of correlation between two or more predictor variables. The estimate of each predictor’s coefficient is affected by the other correlated variables, making it difficult to identify each predictor variable’s distinct contribution to the response variable.
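One common way to detect multicollinearity is the variance inflation factor (VIF). Here is a hedged sketch using statsmodels with made-up data; the rule of thumb in the comment is a convention, not a strict cutoff:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                        # independent of the others

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
X_const = sm.add_constant(X)

# A VIF well above roughly 5-10 usually signals problematic multicollinearity
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_const.values, i))
```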
17. How do Sigmoid and Softmax vary from one another?
The sigmoid function is the right choice for the output layer when the output is a single probability between 0 and 1. Deep learning models use the sigmoid function in their output layer to produce probability-based predictions.
The softmax function is another activation function that neural networks use to convert a real-valued vector into a probability distribution.
Its primary application is in multi-class models, where it gives the probability of each class, with the target class receiving the highest probability. The main distinction between the sigmoid and softmax activation functions is that softmax is used for multi-class classification, whereas sigmoid is used for binary classification.
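A minimal NumPy sketch of both functions (the input scores are arbitrary examples):

```python
import numpy as np

def sigmoid(z):
    # Maps a single score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Maps a vector of scores to a probability distribution that sums to 1
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

print(sigmoid(0.5))                         # single probability -- binary case
print(softmax(np.array([2.0, 1.0, 0.1])))   # one probability per class
```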
18. When is it inappropriate to reduce dimensionality using PCA?
In the following situations, Principal Component Analysis (PCA) might not be the ideal choice for dimensionality reduction:
- The data is not linearly separable: since PCA is a linear technique, it may not reduce the dimensionality effectively when the structure in the data is non-linear.
- The data contains categorical features: PCA is designed to work with continuous numerical data, so it may be unable to reduce the dimensionality of data with categorical features.
- The data contains many missing values: PCA is sensitive to missing values and may not perform well on data sets with a large proportion of them.
- Interpretability of the original features is the goal: PCA produces new features that are combinations of the original characteristics, so the original features lose their individual meaning.
- The data is highly imbalanced: PCA is sensitive to class imbalances and may not yield satisfactory results on extremely unbalanced data sets.
19. Gradient descent: What is it?
Gradient descent is an optimization technique in machine learning that finds the model’s parameters (bias and coefficients) that minimize the cost function. It is a first-order iterative optimization method that moves toward a minimum by following the negative gradient of the cost function (the global minimum when the cost function is convex).
Gradient descent starts with random values for the model’s parameters and iteratively updates them in the direction opposite to the gradient of the cost function with respect to those parameters. The learning rate defines the size of each update and controls how the algorithm converges toward the minimum.
As the algorithm adjusts the parameters, the cost function decreases and the model performs better.
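A minimal NumPy sketch of gradient descent fitting a simple linear regression; the data, learning rate, and number of iterations are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 5.0 + rng.normal(scale=1.0, size=100)   # true slope 3, bias 5

w, b = 0.0, 0.0           # starting parameter values
learning_rate = 0.01

for _ in range(1000):
    y_pred = w * X + b
    error = y_pred - y
    # Gradients of the mean squared error cost with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Move in the direction opposite to the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)               # close to the true slope and bias
```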
20. How do you determine how big your test and validation sets should be?
Here are some factors that help determine the size of your test and validation sets (a small splitting sketch follows the list):
- Dataset size: Generally, larger datasets allow for more extensive test and validation sets. This is so that the test and validation sets can be more representative of the entire dataset because more data is available.
- Model complexity: A straightforward model might need less data for testing and validation. However, if the model is really complicated, it could need additional data to make sure it is reliable and performs well when applied to new data.
- Uncertainty level: The validation and test sets may be smaller if the model is anticipated to perform exceptionally well on the job. However, larger validation and test sets might be helpful to provide a more accurate evaluation of the model’s performance if the task is difficult or the model’s performance is unclear.
- Resources available: The computing resources at hand may also restrict the size of the test and validation sets. If training and evaluating the model takes a long time, using extensive validation and test sets might not be feasible.
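In practice the split is often expressed as fractions. A hedged sketch with scikit-learn’s train_test_split; the 70/15/15 ratio is just one common choice, not a rule:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out a training set (70%), then split the remainder evenly
# into validation (15%) and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))   # 105 22 23 for 150 samples
```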