Collecting and Preparing Textual Data for Classification

Lesson 1: Introduction to Textual Data Collection in NLP
Blast off into the world of Textual Data Collection! 🚀 You’re about to unlock the secrets of data science in NLP. Let’s decode the mysteries together!
Introduction to NumPy Arrays
Let’s delve into Python’s NumPy library and focus on the centerpiece of NumPy - arrays. NumPy, an acronym for ‘Numerical Python’, specializes in efficient computations on arrays. Arrays in NumPy are more efficient than typical Python data structures.
Meet NumPy
The power of NumPy lies in its fast computations on large data arrays, making it crucial in data analysis. Before we start, let’s import it:
```python
# Import NumPy as 'np' in Python
import numpy as np
```
np is the commonly used alias for numpy.
Understanding NumPy Arrays
NumPy arrays, like a sorted shopping list, allow for swift computations. Arrays offer quick access to elements. Let’s create a simple one-dimensional NumPy array:
```python
# Creating a one-dimensional (1D) numpy array
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)  # prints: [1 2 3 4 5]
```
This code creates a five-element array.
Creating Multi-Dimensional Arrays
We can create multi-dimensional arrays, much like a multi-day shopping list. Here, each sublist [] forms a row in the final array:
```python
# Two-dimensional (2D) numpy array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)
'''prints:
[[1 2 3]
 [4 5 6]]
'''
```
Each row in the output corresponds to a sublist in the input list.
We can apply the same principle to create a three-dimensional array:
```python
# Three-dimensional (3D) numpy array
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(array_3d)
'''prints:
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
'''
```
Array Properties: Size
NumPy arrays come with a series of built-in properties that give helpful information about the structure and type of data they hold. These are accessible via the size, shape, and dtype attributes, respectively.
Let’s start with size. This property indicates the total number of elements in the array. Elements can be numbers, strings, etc. This becomes especially useful when working with multi-dimensional arrays where manually counting elements can be tedious.
```python
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print("Size:", array_3d.size)  # Size: 12
```
The array above is a 3D array that contains two 2D arrays. Each of the 2D arrays has two arrays, and each of those has three elements. Therefore, the total number of elements is 2 * 2 * 3 = 12.
Array Properties: Shape
Next, we have shape, which gives us the array’s dimensions. The shape property returns a tuple where the number of items is the dimension of the array, and the value of each item is the size of each dimension.
```python
print("Shape:", array_3d.shape)  # Shape: (2, 2, 3)
```
In the example above, the shape (2, 2, 3) is a tuple of three values, which indicates that our array is 3D and contains two arrays, each of which includes two more arrays, each of which holds three elements.
Array Properties: Type
Lastly is dtype, which stands for the data type. This property tells us about the elements stored in the array, whether they’re integers, floats, strings, etc.
```python
print("Data type:", array_3d.dtype)  # Data type: int64
```
For our example, the data type is int64 because our array only contains integers. If it had held floating point numbers, the dtype would have reflected that.
Understanding these properties is vital for effectively working with NumPy arrays, as they provide information about the array’s structure and content.
Lesson Summary
Great job! We have learned how to create basic and multi-dimensional NumPy arrays and examined the properties of arrays. Now, let’s move on to some exercises to practice these concepts and prepare for future data analysis challenges.
Explore More of the 20 Newsgroups Dataset
Great job, Star Seeker! Now, let's challenge you a bit more. We have the grid_power array, representing the power generated by a 1D line of solar panels in a space station. Your task is to modify the code to transform grid_power from a 1D numpy array to a 2D numpy array with 2 rows. Keep up the good work and let's conquer the cosmos of knowledge!

```python
import numpy as np

# Assume we have four solar panels
grid_power = np.array([500, 700, 700, 900])

# Print the array
print("\nSpaceship Grid Power:", grid_power)

# Display the size, shape, and data type of our array
print("Size of Spaceship Grid Power Array:", grid_power.size)
print("Shape of Spaceship Grid Power Array:", grid_power.shape)
print("Data type of Spaceship Grid Power Array:", grid_power.dtype)
```
Uncover the End of 20 Newsgroups Dataset
Great job, Star Seeker! Now, let’s challenge you a bit more. We have the grid_power array, representing the power generated by a 1D line of solar panels in a space station.
Your task is to modify the code to transform grid_power from a 1D numpy array to a 2D numpy array with 2 rows.
Keep up the good work and let’s conquer the cosmos of knowledge!
Solution
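The solution code itself is not shown here; the following is a minimal sketch of what it could look like, based on the breakdown below (the reshape call and the printed properties follow that description):

```python
import numpy as np

# Original 1D array of solar panel power readings
grid_power = np.array([500, 700, 700, 900])

# Reshape into a 2D array with 2 rows; -1 lets NumPy infer the number of columns
grid_power = grid_power.reshape(2, -1)

print("\nSpaceship Grid Power:\n", grid_power)
print("Size of Spaceship Grid Power Array:", grid_power.size)        # 4
print("Shape of Spaceship Grid Power Array:", grid_power.shape)      # (2, 2)
print("Data type of Spaceship Grid Power Array:", grid_power.dtype)  # int64
```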
Code Explanation
The given code uses the NumPy library to transform a 1-dimensional (1D) numpy array, grid_power, into a 2-dimensional (2D) numpy array with 2 rows. The original grid_power array represents the power generated by a 1D line of solar panels in a space station.
Here’s a breakdown of the code:
- The NumPy library is imported as
np. - A 1D numpy array
grid_poweris created with four elements:[500, 700, 700, 900]. - The
reshapefunction is used to transform the 1D array into a 2D array with 2 rows. The-1argument is used to automatically calculate the number of columns based on the size of the original array. - The transformed 2D array is printed using the
printfunction. - The size, shape, and data type of the transformed 2D array are displayed using the
size,shape, anddtypeattributes of the numpy array.
The output of the code shows that the transformed 2D array is [[500, 700], [700, 900]], which means that the power generated by the first two solar panels is [500, 700] and the power generated by the last two solar panels is [700, 900]. The size of the transformed 2D array is 4, the shape is (2, 2), and the data type is int64.
Fetch Specific Categories from Dataset
Now, we need to make some modifications. In the code snippet provided, please add the necessary code to create a NumPy array that represents the power levels of spaceship engines. Feel free to select power levels that you deem appropriate. In addition, please insert the appropriate print statements to display the size, shape, and data type of this array.
May the force be with you!
Solution
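The solution code is not preserved here; a minimal sketch consistent with the explanation below:

```python
import numpy as np

# Power levels of the spaceship engines (values taken from the explanation below)
engine_power = np.array([100, 200, 150, 300])

print("Spaceship Engine Power:", engine_power)
print("Size of Spaceship Engine Power Array:", engine_power.size)        # 4
print("Shape of Spaceship Engine Power Array:", engine_power.shape)      # (4,)
print("Data type of Spaceship Engine Power Array:", engine_power.dtype)  # int64
```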
Explanation

The provided code is written in Python and uses the NumPy library to create and manipulate arrays. Here's a step-by-step explanation of the code and the solution:
- The NumPy library is imported using the import numpy as np statement.
- A NumPy array named engine_power is declared to represent the power levels of spaceship engines. The power levels are chosen as [100, 200, 150, 300].
- The created array is printed using the print function with a descriptive message.
- The size, shape, and data type of the created array are displayed using the size, shape, and dtype attributes of the NumPy array.
The output of the code will be:
Spaceship Engine Power: [100 200 150 300]
Size of Spaceship Engine Power Array: 4
Shape of Spaceship Engine Power Array: (4,)
Data type of Spaceship Engine Power Array: int64
This output indicates that the engine_power array has a size of 4, a shape of (4,), and a data type of int64. This means that the array is a one-dimensional array with 4 elements, and each element is an integer.
Fetching the Third Article from Dataset
Excellent work so far! Now, let’s venture into the cosmos of data transformation.
Suppose you are given an array of temperatures in degrees Celsius. Your mission, should you choose to accept it, is to write a custom function to convert these temperature values from Celsius to Fahrenheit and then apply this function to the existing array.
Remember, the formula to convert Celsius to Fahrenheit is F = C × 1.8 + 32.
Prepare for liftoff!
Let’s proceed with the task of converting an array of temperatures from Celsius to Fahrenheit using a custom function. Here’s a step-by-step explanation of the process, followed by the actual implementation in Python:
Define the Conversion Function:
- The conversion formula from Celsius to Fahrenheit is F = C × 1.8 + 32.
- We will define a function in Python to perform this calculation.
Apply the Function to the Array:
- Use NumPy to handle the array of temperatures efficiently.
- Apply the conversion function to each element in the array using vectorized operations.
Here’s the Python code to accomplish this:
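A minimal sketch of such a conversion; the temperature values in temperatures_c are placeholders chosen for illustration:

```python
import numpy as np

# Temperatures in Celsius (example values)
temperatures_c = np.array([0, 15, 25, 30, 100])

# Custom conversion function: F = C * 1.8 + 32
def celsius_to_fahrenheit(c):
    return c * 1.8 + 32

# The function only uses arithmetic, so calling it on the whole array
# applies the conversion element-wise
temperatures_f = celsius_to_fahrenheit(temperatures_c)

print("Celsius:   ", temperatures_c)
print("Fahrenheit:", temperatures_f)
```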
Explanation of the Code:
- Importing NumPy: NumPy is imported for efficient array handling.
- Creating the Array: temperatures_c is defined as a NumPy array containing temperatures in Celsius.
- Defining the Conversion Function: The function celsius_to_fahrenheit takes a temperature in Celsius and converts it to Fahrenheit using the formula.
- Applying the Function: The function is applied to the entire array using NumPy's vectorized operations, which allows for efficient computation.
- Printing Results: Finally, the original Celsius temperatures and the converted Fahrenheit temperatures are printed.
You can run this code in a Python environment to see the conversion results.
Exploring Text Length in Newsgroups Dataset
Great work, space explorer! You’ve traveled across galaxies of code and honed your skills in applying functions to arrays!
We’ve established a connection with a mysterious alien civilization. Interestingly, they value the squares of numbers and square roots of numbers more than the numbers itself! They provided us with a number sequence from 1 to 10 and seek the square and the square root of each number.
In this case, there's no need to vectorize a custom function. We can use the np.sqrt function for square roots and simply square the whole array with ** 2!
Lesson 2: Mastering Text Cleaning for NLP: Techniques and Applications
Buckle up for today’s journey in the world of NumPy arrays! We’ll be exploring new routes in data selection. Isn’t that exciting? Ready to embark? 🚀
Lesson Overview and Plan
Hello, Data Whizz! Today’s lesson will focus on “selecting data” in NumPy arrays—our goal is to master integer and Boolean indexing and slicing. So, let’s jump in!
Understanding Indexing in 1D Arrays
In this section, we’ll discuss “Indexing”. It’s akin to item numbering in lists, and Python implements it with a zero-based index system. Let’s see this principle in action.
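For illustration, a small example of zero-based indexing on a 1D array:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[0])   # 10 - index 0 is the first element
print(arr[3])   # 40
print(arr[-1])  # 50 - negative indices count from the end
```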
Essentially, indexing is a numeric system for items in an array—relatively straightforward!
Understanding Indexing in Multi-dimensional Arrays
Are you mindful of higher dimensions? NumPy arrays can range from 1D to N-dimensions. To access specific elements, we use the index pair (i,j) for a 2D array, (i,j,k) for a 3D array, and so forth.
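A small example of indexing a 2D array with an (i, j) pair:

```python
import numpy as np

arr_2d = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Second row (index 1), first column (index 0)
print(arr_2d[1, 0])  # 4
```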
In this example, we selected the first element of the second row in a 2D-array—it’s pretty simple!
Understanding Boolean Indexing
Are you ready for some magic? Enter “Boolean Indexing”, which functions like a ‘Yes/No’ filter, with ‘Yes’ representing True and ‘No’ for False.
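A short sketch of Boolean indexing with an explicit mask (the values are arbitrary examples):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# The mask acts like a 'Yes/No' filter for each element
mask = arr > 2
print(mask)       # [False False  True  True  True]
print(arr[mask])  # [3 4 5]
```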
Or we can put the condition directly into [] brackets:
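The same selection with the condition written inline:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr[arr > 2])  # [3 4 5]
```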
Voila! Now, we can filter data based on custom conditions.
Understanding Complex Conditions in Boolean Indexing
Now that we’ve mastered the simple ‘Yes/No’ binary filter system, let’s up the ante with “Complex Conditions in Boolean Indexing”. This method refines our filtering process further, allowing us to set more detailed restrictions.
Imagine, for instance, that we want to create an index for even numbers greater than five. We’ll merge two conditions to yield this result:
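For example (the numbers array here is an arbitrary illustration):

```python
import numpy as np

numbers = np.array([2, 4, 5, 6, 8, 9, 10])

# Even AND greater than five: combine the conditions with & (element-wise AND)
print(numbers[(numbers % 2 == 0) & (numbers > 5)])  # [ 6  8 10]
```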
In this query, we used the ampersand (&) to signify intersection - i.e., we're selecting numbers that are both even AND larger than five. Note that the plain Python and operator won't work here.
Similarly, we can use the pipe operator (|) to signify union - i.e., selecting numbers that are either even OR larger than five:
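Using the same example array with an OR condition:

```python
import numpy as np

numbers = np.array([2, 4, 5, 6, 8, 9, 10])

# Even OR greater than five: combine the conditions with | (element-wise OR)
print(numbers[(numbers % 2 == 0) | (numbers > 5)])  # [ 2  4  6  8  9 10]
```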
Awesome, right? This additional filtering layer empowers us to be more specific and intentional about the data we select.
A Look at Slicing in NumPy Arrays
NumPy arrays can be sliced in the same manner as regular Python lists. Let's do a quick recap. The syntax is start:stop:step, where start is the first index to choose (inclusive), stop is the last index to choose (exclusive), and step defines the step of the selection. For example, if step=1, every element is selected, and if step=2, every other element is skipped.
Let’s take a look at simple examples:
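A basic slice (example values):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# start is inclusive, stop is exclusive
print(arr[1:4])  # [20 30 40]
```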
Note that slicing is inclusive on the left and exclusive on the right.
Another example with a step parameter:
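A sketch with a step of 2 (example values):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

# Every second element, starting from index 1
print(arr[1::2])  # [20 40 60]
```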
In this case, we choose every second element, by starting with 1 and using step=2.
Lesson Summary
Congratulations! We’ve traversed the landscape of NumPy arrays, delving into indexing, Boolean indexing, and slicing. Now, we’ll dive into some hands-on exercises. After all, practice makes perfect. Let’s go forth, data explorers!
Update String and Clean Text
Are you ready to don your Data Analyst cape, Space Voyager? I have a task that will bring us down to Earth for a short while. Let’s envisage that you are a data analyst at an HR company, and your task is to filter out employees who earn more than an average wage of $30/hr.
This seems like quite a conundrum, doesn’t it? Don’t worry; we already have a solution.
Simply click Run to see how this task can be accomplished using NumPy's data selection techniques.
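The starter code is not shown here; a minimal sketch whose wage values are chosen so the result matches the output quoted below:

```python
import numpy as np

# Hourly wages of a group of employees
wages_per_hour = np.array([25, 28, 30, 35, 38, 40, 45, 50])

# Boolean indexing: keep only wages above $30/hr
high_earners = wages_per_hour[wages_per_hour > 30]
print(high_earners)  # [35 38 40 45 50]
```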
Here is a step-by-step explanation of the code:

- The NumPy library is imported with the import numpy as np statement.
- A NumPy array named wages_per_hour is created to store the hourly wages of a group of employees.
- To filter out employees earning more than $30/hr, Boolean indexing is used. The expression wages_per_hour > 30 returns a Boolean array in which True marks the positions that satisfy the condition (i.e., wages above 30).
- The filtered array is then printed with the print function.

The output of the code will be:

[35 38 40 45 50]

This output shows that the employees earning $35/hr, $38/hr, $40/hr, $45/hr, and $50/hr earn more than the $30/hr average.
Filling in Python Functions and Regex Patterns
Well done, Space Voyager! You’ve harnessed the Power Cosmic to select adults from a group. Now, let’s make a slight alteration to the course.
Suppose you’re the coach of a team, which consists of both adults (aged 18 and above) and minors. The challenge lies in selecting only those adults who are under 30 to form the Junior Adults Team.
Trust your knowledge of Python and apply it to the existing code. Are you ready? Set? Punch it to warp speed!
Here is a step-by-step explanation of the code:

- The NumPy library is imported with the import numpy as np statement.
- A NumPy array named ages is created to store the ages of a group of individuals.
- To select the adults (aged 18 and above), Boolean indexing is used. The expression ages > 18 returns a Boolean array in which True marks the positions that satisfy the condition.
- The resulting array of adults is stored in a variable named adults.
- To keep only the adults who are under 30, another round of Boolean indexing is applied to the adults array. The expression adults < 30 returns a Boolean array in which True marks the ages below 30.
- The filtered array of adults under 30 is then printed with the print function.

The output of the code will be:

Adults: [19 36 34 27 21 23 25]
Junior Adults: [19 27 21 23 25]

This output shows that the adults in the group are aged 19, 36, 34, 27, 21, 23, and 25. Among them, those under 30 are aged 19, 27, 21, 23, and 25.
Of course, I’d be happy to help with that! To modify the code to select adults under 30, you can add an additional condition to the Boolean indexing expression. Here’s a hint on how to proceed:
- You already have the Boolean indexing expression ages > 18 to select adults.
- To limit the age to under 30, you can add another condition using the less-than operator <.
- Use the & operator to combine the two conditions.
- Apply the combined condition to the ages array to select adults under 30.
- Print the filtered array to verify that it contains the correct results.
Here’s the modified code:
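A minimal sketch of that modification; the ages array is an assumption chosen to match the outputs quoted above:

```python
import numpy as np

# Ages of the team members
ages = np.array([15, 19, 36, 34, 27, 21, 23, 25, 17])

# Adults (18 and above) who are also under 30, combined with &
junior_adults = ages[(ages >= 18) & (ages < 30)]
print("Junior Adults Team:", junior_adults)  # [19 27 21 23 25]
```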
This code will select adults (ages 18 and above) who are under 30 years old and print the filtered array.
Mastering Text Cleaning with Python Regex
Good try but that’s not exactly right.
Your solution currently selects adults over 18 years but does not limit the age to under 30.
Do you need a hint on how to modify it to meet the requirement? 🤔
Sure, let’s modify the existing code to select only those adults who are under 30 years of age. Here is the updated code:
In this code, we use a boolean condition to select ages that are both 18 and above but less than 30. Running this will produce the output:
This way, only the adults who are under 30 are selected for the Junior Adults Team.
Implement Text Cleaning on Dataset
Great job, Space Explorer! We’re planning to embark on a more advanced extraction mission. Imagine we are choosing a crew of space companions – cool space dogs, just like me! Within the field of weight analysis, envision a scenario where we’re tasked to identify dogs whose weight exceeds 20 kg or is under 10 kg. The information will enable us to concentrate our study on this group and assess their potential health risks.
For Boolean selection with an AND condition, we used the & operator. For an OR condition, use the vertical line |!
Are you ready to code? Let’s get started!
Explanation
Absolutely! Let’s identify the space dogs whose weight exceeds 20 kg or is under 10 kg using the | operator for the OR condition. Here is the updated code:
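A minimal sketch; the dog weights are example values chosen for illustration:

```python
import numpy as np

# Weights of the space dogs in kg
dog_weights = np.array([5, 8, 12, 18, 22, 25, 9, 15])

# Heavier than 20 kg OR lighter than 10 kg: use | for the OR condition
selected = dog_weights[(dog_weights > 20) | (dog_weights < 10)]
print(selected)  # [ 5  8 22 25  9]
```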
Running this code will produce the output:
This code identifies and prints the weights of the space dogs that are either over 20 kg or under 10 kg, helping you to focus your study on this group for potential health risk assessments.
Mastering Text Cleaning with Python Regex on a Dataset
Bravo, fellow space explorer! We’ve unraveled the mysteries of data selection in NumPy arrays. Now, in this quadrant’s final test, let’s imagine that you are a space health officer.
You have an array of crew members' badge numbers and need to choose a subgroup of people for a health check-up. Write code that selects crew members with even badge numbers lower than 33. After this, select the first three crew members from the obtained list using array slicing. Remember, the syntax is [start:stop:step], just like with Python lists!
Let’s rev up our coding hyperdrive!
解释 Let’s embark on this final mission. We’ll filter the crew members with even badge numbers lower than 33 and then select the first three members from the resulting list using array slicing. Here is the code to achieve this:
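A minimal sketch; the badge numbers are example values chosen for illustration:

```python
import numpy as np

# Crew badge numbers
badge_numbers = np.array([12, 47, 22, 8, 33, 30, 26, 41, 16])

# Even badge numbers lower than 33
selected = badge_numbers[(badge_numbers % 2 == 0) & (badge_numbers < 33)]
print("Selected for check-up:", selected)  # [12 22  8 30 26 16]

# First three crew members from the selection, using slicing
print("First three:", selected[:3])        # [12 22  8]
```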
Running this code will produce the following output:
This code successfully filters the badge numbers to find those that are even and less than 33, and then selects the first three members from this filtered list for the health check-up.
Lesson 3: Removing Stop Words and Stemming in Text Preprocessing
Introduction and Lesson Overview
Welcome! In today's lesson, we'll learn how to use functions with NumPy arrays, a crucial concept for managing large amounts of data. This ability facilitates efficient data processing: a teacher calculating class averages or a business analyst summarising company sales illustrates the practical use of these skills. Are you ready? Let's get started!
Arithmetic Operations with NumPy Arrays
Two arrays of the same shape can undergo basic arithmetic operations. The operations are performed element-wise, meaning they’re applied to each pair of corresponding elements. Suppose we have two grade arrays of students from two subjects. By adding these arrays, we can calculate the total grades:
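For example (the grade values are placeholders):

```python
import numpy as np

# Grades of the same three students in two subjects
math_grades = np.array([80, 90, 75])
english_grades = np.array([85, 70, 95])

# Element-wise addition gives the total grade per student
total_grades = math_grades + english_grades
print(total_grades)  # [165 160 170]
```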
The two arrays are added element-wise in this code to calculate the total grades.
Introduction to Universal Functions (ufuncs)
NumPy’s Universal Functions (also called ufuncs) perform element-wise operations on arrays, including mathematical functions like sin, cos, log, and sqrt. Let’s look at a use case:
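A small sketch (the angle values are arbitrary):

```python
import numpy as np

# Angles in degrees
angles_deg = np.array([0, 30, 90, 180])

# np.sin expects radians, so convert first; both ufuncs work element-wise
angles_rad = np.radians(angles_deg)
print(np.sin(angles_rad))  # approximately [0.   0.5  1.   0. ]
```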
This code applies np.sin universal function to each array element. As np.sin expects its input in radians, we first transform degrees to radians with np.radians, applying it to each array element similarly.
Extending NumPy with Custom Functions
NumPy allows the application of a custom function to each element of the array separately by transforming the target function with np.vectorize. Let’s create a function to check the parity of a number:
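A sketch of the parity check applied element-wise with np.vectorize:

```python
import numpy as np

# Custom parity check for a single number
def is_even(n):
    return n % 2 == 0

# np.vectorize lets the function run element-wise over an array
vectorized_is_even = np.vectorize(is_even)

numbers = np.array([1, 2, 3, 4, 5, 6])
print(vectorized_is_even(numbers))  # [False  True False  True False  True]
```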
The vectorized_is_even function returns an array indicating whether each value in numbers is even.
Lesson Summary
Well done! You’ve learned how to apply functions to NumPy arrays, perform arithmetic operations, apply statistical functions, use ufuncs, and extend NumPy with custom functions. Up next are practice exercises for further learning. Let’s proceed!
「Practice」Switch from LancasterStemmer to PorterStemmer
Hello there, space explorer! Are you ready to venture further into the mysterious universe of data analysis with Python and NumPy? Let’s do this!
Imagine that you’ve just hosted a space-themed quiz in your galactic neighborhood. Now, you have the raw scores of the participants. However, the participants are eager to know their percentage scores.
Luckily, all the work is done! Simply press the Run button to see how we’ve calculated the percentage scores using a function applied to a NumPy array.
Solution
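The solution code is not preserved here; a sketch that follows the breakdown below:

```python
import numpy as np

# Convert a raw mark (out of an assumed total of 75) into a percentage score
def percentage_calc(percentage_marks):
    return (percentage_marks * 100) / 75

# Raw scores of the participants
raw_marks = np.array([35, 48, 60, 72, 75])

# Vectorize the function so it applies to each element of the array
percentage_calc = np.vectorize(percentage_calc)
percentage_marks = percentage_calc(raw_marks)

print("Raw Marks: ", raw_marks)
print("Percentage Marks: ", percentage_marks)
```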
This code snippet demonstrates a simple yet effective way to calculate percentage scores from raw marks using NumPy. Here’s a breakdown:
- Importing NumPy: import numpy as np imports the NumPy library, essential for array operations.
- Defining the percentage_calc function: This function takes a single argument, percentage_marks, representing the raw mark, and calculates the corresponding percentage score. The formula used is (percentage_marks * 100) / 75, where 75 is assumed to be the total marks for the quiz.
- Creating the raw_marks array: raw_marks = np.array([35, 48, 60, 72, 75]) defines a NumPy array containing the raw scores of the participants.
- Vectorizing the percentage_calc function: percentage_calc = np.vectorize(percentage_calc) converts the percentage_calc function into a vectorized function. This allows it to operate on each element of the raw_marks array individually.
- Calculating percentage marks: percentage_marks = percentage_calc(raw_marks) applies the vectorized percentage_calc function to the raw_marks array, resulting in a new array percentage_marks containing the calculated percentage scores.
- Printing the results: Finally, print("Raw Marks: ", raw_marks) and print("Percentage Marks: ", percentage_marks) display the original raw marks and the calculated percentage scores.
This code snippet showcases the power of NumPy for efficiently performing calculations on arrays. The vectorization technique allows for concise and efficient code, making it ideal for data analysis tasks.
「Practice」Removing Stop Words and Punctuation from Text
Congratulations on successfully running your first code! But, what else can you accomplish? Let’s investigate!
The provided starter code applies a 20% discount to all prices. However, what if the store policy were to change, offering a 30% discount for prices greater than 350? For other prices, discount stays the same – 20%. Could you modify our code to reflect this new discount rate in our discounted prices?
It’s time to ignite those rockets!
Solution
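The solution code is not preserved here; a sketch that follows the breakdown below (the prices array is an example):

```python
import numpy as np

# 30% discount for prices above 350, 20% otherwise
def discount_price(price):
    if price > 350:
        return price - (price * 0.30)
    return price - (price * 0.20)

# Example prices
prices = np.array([100, 250, 350, 400, 500])

# Vectorize so the conditional logic runs element-wise
discounted_prices = np.vectorize(discount_price)(prices)

print("Original prices:  ", prices)
print("Discounted prices:", discounted_prices)  # [ 80. 200. 280. 280. 350.]
```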
Here’s how the code has been modified to implement the new discount policy:
Conditional Discount in the discount_price Function:

- Inside the discount_price function, an if statement is added to check if the price is greater than 350.
- If price > 350, a 30% discount is applied: return price - (price * 0.30).
- Otherwise (if price <= 350), the original 20% discount is applied: return price - (price * 0.20).

Rest of the Code Remains the Same:

- The code for creating the prices array, vectorizing the discount_price function, and calculating discounted_prices remains unchanged.
Now, the code correctly applies a 30% discount for prices above 350 and a 20% discount for prices below or equal to 350. This demonstrates the flexibility of using conditional logic within functions to handle varying discount scenarios.
「Practice」Stemming Words with PorterStemmer
「Practice」Implementing Stopword Removal and Stemming Function
Excellent work so far! Now, let’s venture into the cosmos of data transformation.
Suppose you are given an array of temperatures in degrees Celsius. Your mission, should you choose to accept it, is to write a custom function to convert these temperature values from Celsius to Fahrenheit and then apply this function to the existing array.
Remember, the formula to convert Celsius to Fahrenheit is F = C × 1.8 + 32.
Prepare for lift off!
Let’s proceed with the task of converting an array of temperatures from Celsius to Fahrenheit using a custom function. Here’s a step-by-step explanation of the process, followed by the actual implementation in Python:
Define the Conversion Function:
- The conversion formula from Celsius to Fahrenheit is F = C × 1.8 + 32.
- We will define a function in Python to perform this calculation.
Apply the Function to the Array:
- Use NumPy to handle the array of temperatures efficiently.
- Apply the conversion function to each element in the array using vectorized operations.
Here’s the Python code to accomplish this:
Explanation of the Code:
- Importing NumPy: NumPy is imported for efficient array handling.
- Creating the Array: temperatures_c is defined as a NumPy array containing temperatures in Celsius.
- Defining the Conversion Function: The function celsius_to_fahrenheit takes a temperature in Celsius and converts it to Fahrenheit using the formula.
- Applying the Function: The function is applied to the entire array using NumPy's vectorized operations, which allows for efficient computation.
- Printing Results: Finally, the original Celsius temperatures and the converted Fahrenheit temperatures are printed.
You can run this code in a Python environment to see the conversion results.
「Practice」Cleaning and Processing the First Newsgroup Article
Great work, space explorer! You’ve traveled across galaxies of code and honed your skills in applying functions to arrays!
We’ve established a connection with a mysterious alien civilization. Interestingly, they value the squares of numbers and square roots of numbers more than the numbers itself! They provided us with a number sequence from 1 to 10 and seek the square and the square root of each number.
In this case, there's no need to vectorize a custom function. We can use the np.sqrt function for square roots and simply square the whole array with ** 2!
Sure! Let’s proceed with finding the squares and square roots of the numbers from 1 to 10. We’ll use NumPy’s sqrt function for the square roots and the ** 2 operator for squaring the numbers. Here’s the complete Python code:
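A minimal sketch of that computation:

```python
import numpy as np

# Number sequence from 1 to 10
numbers = np.arange(1, 11)

# Squares: simply square the whole array
squares = numbers ** 2

# Square roots: use the np.sqrt ufunc
square_roots = np.sqrt(numbers)

print("Numbers:     ", numbers)
print("Squares:     ", squares)
print("Square roots:", square_roots)
```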
Explanation:
- Array of Numbers: The array numbers contains the sequence from 1 to 10.
- Squares: Using the ** 2 operator, we compute the square of each number in the array.
- Square Roots: Using the np.sqrt function, we compute the square root of each number in the array.
- Printing Results: Finally, the original numbers, their squares, and their square roots are printed.
You can run this code in a Python environment to see the results. Here’s what the output should look like:
Lesson 4: Unleashing the Power of n-grams in Text Classification
Lesson Overview
Hello, students! Today, we’re exploring the pandas DataFrame, a powerhouse structure in data analysis with Python. We’ll contrast it with NumPy arrays and teach you how to build a DataFrame. Additionally, we’ll delve into its integral parts and data types.
Introduction to the Pandas Library
The pandas library is Python’s solution for tabular data operations, packing more punch for data analysis than NumPy, which is skewed towards numerical computations. Pandas houses two fundamental data structures: the Series (1D) and the DataFrame (2D). Often, the DataFrame is the go-to choice. Let’s start by importing pandas:
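For reference, the import looks like this:

```python
# Import pandas under its conventional alias
import pandas as pd
```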
Here, pd serves as a standard alias for pandas.
Creating a DataFrame
Building a DataFrame in pandas is straightforward. It can be created from a dictionary, list, or NumPy array. Here’s an example of creating a DataFrame from a dictionary:
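A small sketch of such a dictionary-based DataFrame; the student names and ages are placeholder values:

```python
import pandas as pd

# Each (key, value) pair becomes a column
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [20, 22, 21]
}
student_df = pd.DataFrame(student_data)
print(student_df)
```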
In the student_data dictionary, each (key, value) pair becomes a DataFrame column. The DataFrame automatically assigns an index (0-2) to each row, but we can also specify our own if we choose to do so.
Adding a Column to the DataFrame
Adding a new column to an existing DataFrame is straightforward: just like with a dictionary, simply refer to a new key and assign an array to it. Let's add information about students' grades to our data:
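Continuing with the student_df sketched above (the grades themselves are placeholder values):

```python
# Assigning to a new key adds a column, just like with a dictionary
student_df['Grade'] = ['A', 'B', 'A']
print(student_df)
```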
The 'Grade' column has been successfully added to the DataFrame.
Classification of DataFrame Parts
A DataFrame consists of three components: data, index, and columns. Pandas neatly organize data into rows and columns — each row is a record, and each column is a feature (such as Name, Age, or Grade). The index, much like the index of a book, aids in data location. Columns serve as labels for the features in our data. Let’s analyze these components with our student data:
Our dataset contains row labels (index) from 0 to 2 and column labels (columns) of ‘Name’, ‘Age’, and ‘Grade’.
Accessing DataFrame Columns
One of the useful features of pandas DataFrame is its flexibility in accessing individual columns of data. You can select a single column, much like how you access a dictionary item by its key. Let’s use our student_data DataFrame again:
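Continuing with the same hypothetical student_df:

```python
# Selecting a single column returns a pandas Series
print(student_df['Name'])
```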
As you can see, we now only have the ‘Name’ column content. Notice that pandas kept the index intact to retain information about the original rows.
To access multiple columns at once, pass a list with column names:
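For example, still using the hypothetical student_df:

```python
# Passing a list of column names returns a new DataFrame
print(student_df[['Name', 'Grade']])
```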
Adding a list of columns as input, we obtain a new DataFrame that only includes the students’ ‘Name’ and ‘Grade’.
Remember: when handling a single column, pandas will return a Series. When dealing with multiple columns, pandas will return a DataFrame. Any new DataFrame or Series changes will not affect the original data.
Understanding Data Types
The strength of a DataFrame lies in its ability to accommodate multiple data types. Using the DataFrame.dtypes property, we can ascertain the data type of each column. Let's look at the data types of our DataFrame:
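Still with the hypothetical student_df:

```python
# dtypes reports the data type of each column
print(student_df.dtypes)
# Name     object
# Age       int64
# Grade    object
```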
Note, that strings are represented as the object data type.
Unlike NumPy arrays, which harbor single-type data, pandas DataFrames can handle and manipulate different data types.
Lesson Summary
And there we have it! We’ve traversed the fascinating terrain of the pandas DataFrame. We’ve compared it with NumPy arrays, learned how to build a DataFrame, and explored its structure and diverse data types. Now, it’s time to shift gears to some hands-on practice. You’ll master the art of creating pandas’ DataFrame, solidifying your understanding and bolstering your confidence.
「Practice」Generating Bigrams and Trigrams with NLP
Astronauts, let’s beam the classroom data into our trusty spaceship, the pandas DataFrame! This handy tool will help us organize and analyze the students’ test scores for different subjects. Click on the Run button to see just how easy it is!
「Practice」Generating Bigrams and Trigrams from Text Data
Absolutely! Let’s organize and analyze the classroom data using a pandas DataFrame. Here’s the complete Python code for creating the DataFrame, adding a new column for participation scores, and displaying the updated DataFrame:
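The code block itself is not preserved here; a sketch reconstructed from the step-by-step breakdown that appears later in this document:

```python
import pandas as pd

# Classroom data: student names, subjects, and test scores
classroom_data = {
    'StudentName': ['Amy', 'Bob', 'Eva', 'Jake'],
    'Subject': ['Math', 'English', 'Science', 'Arts'],
    'TestScore': [89, 93, 91, 97]
}

# Create the DataFrame and print it
classroom_df = pd.DataFrame(classroom_data)
print("\nClassroom DataFrame:\n", classroom_df)

# Add a new column with participation scores
classroom_df['ParticipationScore'] = [9, 10, 8, 10]
print("\nUpdated DataFrame with 'ParticipationScore':\n", classroom_df)
```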
Explanation:
- Import pandas: First, we import the pandas library.
- Define Classroom Data: The classroom data is defined as a dictionary containing student names, subjects, and test scores.
- Create DataFrame: We create a DataFrame from the classroom data using pd.DataFrame.
- Print Initial DataFrame: The initial DataFrame is printed to show the classroom data.
- Add Participation Score: A new column ParticipationScore is added to the DataFrame.
- Print Updated DataFrame: The updated DataFrame is printed to show the classroom data along with the participation scores.
Running this code will produce the following output:
This demonstrates how easy it is to organize and analyze data using pandas DataFrame.
「Practice」Generating Bigrams and Trigrams from Two Texts
Astronauts, let’s beam the classroom data into our trusty spaceship, the pandas DataFrame! This handy tool will help us organize and analyze the students’ test scores for different subjects. Click on the Run button to see just how easy it is!
Explanation
You’re right! This code snippet is a great introduction to using pandas DataFrames for organizing and analyzing data, especially in a space exploration context. Let’s break it down step by step:
1. Importing pandas:
import pandas as pd
- This line imports the pandas library, which is a powerful tool for data manipulation and analysis in Python. We use the as pd part to give the library a shorter alias, making it easier to use later.
2. Creating the Data Dictionary:
classroom_data = { 'StudentName': ['Amy', 'Bob', 'Eva', 'Jake'], 'Subject': ['Math', 'English', 'Science', 'Arts'], 'TestScore': [89, 93, 91, 97] }
- This creates a dictionary called classroom_data to store the information about the students.
- The dictionary has three keys: StudentName (a list of student names), Subject (a list of subjects), and TestScore (a list of test scores).
3. Creating the DataFrame:
classroom_df = pd.DataFrame(classroom_data)
- This line creates a pandas DataFrame called classroom_df from the classroom_data dictionary.
- DataFrames are like tables in spreadsheets, providing a structured way to organize and work with data.
4. Printing the DataFrame:
print("\nClassroom DataFrame:\n", classroom_df)
- This line prints the classroom_df DataFrame to the console. You'll see a neatly formatted table with the student names, subjects, and test scores.
5. Adding a New Column:
classroom_df['ParticipationScore'] = [9, 10, 8, 10]
- This line adds a new column named ParticipationScore to the classroom_df DataFrame.
- It assigns the values [9, 10, 8, 10] to the new column, representing the participation scores for each student.
6. Printing the Updated DataFrame:
print("\n Updated DataFrame with 'ParticipationScore':\n", classroom_df)
- This line prints the updated classroom_df DataFrame, now including the ParticipationScore column.
In Summary:
This code demonstrates how to:
- Import the pandas library.
- Create a dictionary to hold data.
- Convert the dictionary into a pandas DataFrame.
- Print the DataFrame.
- Add a new column to the DataFrame.
This is a basic example of how pandas can be used to work with data in a structured and efficient way. You can use pandas to perform various operations on DataFrames, such as:
- Filtering: Selecting specific rows based on conditions.
- Sorting: Arranging data in ascending or descending order.
- Aggregation: Calculating summary statistics (like mean, median, etc.).
- Joining: Combining data from multiple DataFrames.
pandas is a powerful tool for data analysis in Python, and it’s widely used in various fields, including data science, machine learning, and finance.
「Practice」Creating Bigrams from Preprocessed Text Data
Well done, Voyager! Now, let’s tweak the code a little. Your task is to modify it to add a new column, "Experience", to the teachers_df DataFrame. For 'Experience', use the list [12, 6, 15]. Reach for the stars!
「Practice」Unigrams and Bigrams from Clean 20 Newsgroups Dataset
Lesson 5: Understanding Named Entity Recognition in NLP
Adding a Row to a DataFrame
Multiple scenarios might necessitate adding new entries to our DataFrame. Let’s explore how to accomplish that:
In modern pandas, we use the pd.concat() function to incorporate new rows. If you forgot to add 'Pears' to your grocery list, here's how to do it:
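A minimal sketch; the existing items in grocery_df are placeholder values:

```python
import pandas as pd

# Existing grocery list
grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Oranges', 'Bananas']})

# New row to add
pears = pd.DataFrame({'Grocery Item': ['Pears']})

# Concatenate and reset the index so rows get fresh default integers
grocery_df = pd.concat([grocery_df, pears]).reset_index(drop=True)
print(grocery_df)
```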
Setting reset_index(drop=True) resets the index to default integers. Without this step, pandas will save the original dataframes’ indices, resulting in both 'Pears' and 'Apples' sharing the same index 0.
Adding Multiple Rows to a DataFrame
For multiple rows, you can concatenate them by creating a DataFrame and adding it to the original one:
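A sketch of adding several rows at once (the item names are placeholders):

```python
import pandas as pd

# Original grocery list
grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Oranges', 'Pears']})

# Several new rows at once
new_rows = pd.DataFrame({'Grocery Item': ['Milk', 'Eggs', 'Bread']})

grocery_df = pd.concat([grocery_df, new_rows]).reset_index(drop=True)
print(grocery_df)
```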
You may wonder why we don't simply include these rows in the original dataframe. Well, that isn't always possible. Imagine we have two separate grocery lists coming from different sources, for instance, from separate files. In this case, the only way to combine them into one is to use pd.concat().
Removing Rows from a DataFrame
Frequently, we must delete rows from a DataFrame. To facilitate this, Pandas provides the drop() function. Suppose you want to remove 'Grapes' or both 'Apples' and 'Oranges' from your list. Here’s how:
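A sketch; the list contents and the index being dropped are assumptions for illustration:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Oranges', 'Grapes']})

# drop() removes the row with index label 2 ('Grapes' here) and returns a
# new DataFrame; the original grocery_df is left unchanged
updated_df = grocery_df.drop(2)
print(updated_df)
```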
Note that the .drop() method returns a new updated DataFrame instead of changing the original one. It allows you to modify the data while keeping its original state to return to it if necessary.
Removing Multiple Rows
There will be times when you will have to remove multiple rows in one go. For example, let’s say you were informed that 'Apples' and 'Oranges' are out of stock, so you need to remove them from your grocery list. The drop() function allows you to do this too.
When removing multiple rows, we utilize the .isin() function, which checks if a value exists in a particular DataFrame column. You provide it with the values you want to remove, and it outputs the indices of those rows. Let’s see it in action:
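A sketch of the same pattern:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Oranges', 'Grapes']})

# .isin() marks the rows whose 'Grocery Item' is 'Apples' or 'Oranges';
# .index then gives the index labels of those rows
indices_to_delete = grocery_df[grocery_df['Grocery Item'].isin(['Apples', 'Oranges'])].index

# drop() returns a new DataFrame without those rows
print(grocery_df.drop(indices_to_delete))
```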
In this block of code, the variable indices_to_delete holds the indices of the rows where the ‘Grocery Item’ is either 'Apples' or 'Oranges'. We then pass indices_to_delete to the drop() function, which removes the corresponding rows from the DataFrame.
Keep in mind, just as with removing a single row, the drop() function here doesn’t change the original DataFrame. Instead, it returns a new DataFrame with the specified rows removed. This way, you can always revert back to the original data if needed.
Recap and Practice Announcement
Congratulations! You’ve now mastered adding and removing rows in a DataFrame, a crucial element in data manipulation. We discussed rows and their indexing and learned to add rows using pd.concat() and to remove them with drop(). Now, let’s put this into practice! The upcoming exercises will enhance your data manipulation skills, enabling you to handle more complex operations on a DataFrame. Are you ready to give them a try?
Well done! Now let’s experiment a bit more with our grocery list. We’re planning to make honey toast for breakfast, so we need to add ‘Honey’ to our grocery list and remove ‘Cheese’ because it’s not needed at the moment. Examine what I’ve done, then click “Run” to execute the code.
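The original snippet is not preserved here; a sketch of the idea, with placeholder starting items and an assumed 'Grocery Item' column:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Grocery Item': ['Bread', 'Butter', 'Cheese']})

# Add 'Honey' for the honey toast
honey = pd.DataFrame({'Grocery Item': ['Honey']})
grocery_df = pd.concat([grocery_df, honey]).reset_index(drop=True)

# Remove 'Cheese', which we no longer need
cheese_idx = grocery_df[grocery_df['Grocery Item'] == 'Cheese'].index
grocery_df = grocery_df.drop(cheese_idx).reset_index(drop=True)

print(grocery_df)
```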
Congratulations! You’ve now mastered adding and removing rows in a DataFrame, a crucial element in data manipulation. We discussed rows and their indexing and learned to add rows using pd.concat() and to remove them with drop(). Now, let’s put this into practice! The upcoming exercises will enhance your data manipulation skills, enabling you to handle more complex operations on a DataFrame. Are you ready to give them a try?
Well done! Now let’s experiment a bit more with our grocery list. We’re planning to make honey toast for breakfast, so we need to add ‘Honey’ to our grocery list and remove ‘Cheese’ because it’s not needed at the moment. Examine what I’ve done, then click “Run” to execute the code.
做得好!现在让我们对购物清单进行更多实验。我们计划制作蜂蜜吐司作为早餐,因此我们需要将“蜂蜜”添加到我们的购物清单中并删除“奶酪”,因为目前不需要它。检查我所做的事情,然后单击“运行”来执行代码。
Changing the Sentence for Named Entity Recognition
Implementing Tokenization and POS Tagging
Applying Named Entity Recognition to a Sentence
Implementing a Named Entity Extraction Function
Applying NER and POS Tagging to Dataset
Feature Engineering for Text Classification
Dive deeper into the transformation of raw text data into features that machine learning models can understand. Through a practical, hands-on approach, you’ll learn everything from tokenization, generating Bag-of-Words and TF-IDF representations, to handling sparse features and applying Dimensionality Reduction techniques.
Introduction and Text Data Collection
Welcome to today’s lesson! As data science and machine learning professionals, particularly in the Natural Language Processing (NLP) field, we often deal with textual data. Today, we dive into the ‘Introduction to Textual Data Collection’. Specifically, we’ll explore how to collect, understand and analyze text data using Python.
Textual data is usually unstructured, being much harder to analyze than structured data. It can take many forms, such as emails, social media posts, books, or transcripts of conversations. Understanding how to handle such data is a critical part of building effective machine learning models, especially for text classification tasks where we ‘classify’ or categorize texts. The quality of the data we use for these tasks is of utmost importance. Better, well-structured data leads to models that perform better.
The 20 Newsgroups Dataset
The dataset we’ll be working with in today’s lesson is the 20 Newsgroups dataset. For some historical background, newsgroups were the precursors to modern internet forums, where people gathered to discuss specific topics. In our case, the dataset consists of approximately 20,000 documents from newsgroup discussions. These texts were originally exchanged through Usenet, a global discussion system that predates many modern Internet forums.
The dataset is divided nearly evenly across 20 different newsgroups, each corresponding to a separate topic - this segmentation is one of the main reasons why it is especially useful for text classification tasks. The separation of data makes it excellent for training models to distinguish between different classes, or in our case, newsgroup topics.
From science and religion to politics and sports, the topics covered provide a diversified range of discussions. This diversity adds another layer of complexity and richness, similar to what we might experience with real-world data.
Fetching and Understanding the Data Structure
To load this dataset, we use the fetch_20newsgroups() function from the sklearn.datasets module in Python. This function retrieves the 20 newsgroup dataset in a format that’s useful for machine learning purposes. Let’s fetch and examine the dataset.
First, let’s import the necessary libraries and fetch the data:
「PRACTICE」
Excellent start, Space Voyager! Now, here’s a curveball. Can you fill in the missing code to create the DataFrame?
May the cosmic rays guide you!
「practice」
Time to shine, stars! Our classroom data has somewhat disappeared. Can you reconstruct it from scratch, including both the classroom data dictionary and the DataFrame? Furthermore, please add a column today, AttendancePercentage, to the DataFrame. I know you’ve got this, champ!
Good try, but that’s not quite right. It seems like your code is focused on school members and teachers, but we need to modify the grocery list DataFrame instead.
Remember, we need to:
- Add ‘Spinach’ with quantity
5. - Remove ‘Milk’.
Do you need help with how to start? 🤔
Sure, I can help with that. Here’s how you can modify your code to add ‘Spinach’ with quantity 5 and remove ‘Milk’:
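A sketch of that modification; the starting rows and the 'Quantity' column name are assumptions:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Item': ['Milk', 'Bread', 'Eggs'],
                           'Quantity': [1, 2, 12]})

# Add 'Spinach' with quantity 5
spinach = pd.DataFrame({'Item': ['Spinach'], 'Quantity': [5]})
grocery_df = pd.concat([grocery_df, spinach]).reset_index(drop=True)

# Remove the row where the 'Item' is 'Milk'
grocery_df = grocery_df.drop(grocery_df[grocery_df['Item'] == 'Milk'].index)

print(grocery_df)
```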
This code first creates a new DataFrame for ‘Spinach’ and adds it to grocery_df. It then removes the row where the ‘Item’ is ‘Milk’.
「Practice」
Fantastic work, Voyager!
However, it appears we have a small bug in our shopping list code. The item “Cheese” was supposed to be removed from the list, but it remains. Can you remedy this? Remember, it’s not necessary for you to write any code from scratch; you only need to correct one or two lines of the existing code. Good luck!
It looks like the drop method is not being assigned back to grocery_df. Here’s the corrected code:
This code correctly removes ‘Cheese’ from the list and assigns the updated DataFrame back to grocery_df.
「Practice」
Fantastic job, top-notch analyst!
This time around, you will be adding two new items, 'Cucumber' and 'Radish', priced at $0.50 and $0.30 respectively, to our grocery list. It’s very important to update our DataFrame once the items have been added. Ready to go the extra mile? Let’s do it!
Sure, here’s how you can add ‘Cucumber’ and ‘Radish’ to your grocery list and update the DataFrame:
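A sketch; the existing items and the column names are assumptions:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Bread'],
                           'Price': [1.20, 2.50]})

# New items to add with their prices
new_items = pd.DataFrame({'Grocery Item': ['Cucumber', 'Radish'],
                          'Price': [0.50, 0.30]})

# Concatenate and update the DataFrame
grocery_df = pd.concat([grocery_df, new_items]).reset_index(drop=True)
print(grocery_df)
```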
This code creates a new DataFrame for ‘Cucumber’ and ‘Radish’ and concatenates it with the existing grocery_df. The updated DataFrame is then printed.
「Practice」
Are you ready for your final challenge, Voyager?
This time, your task is to remove two specific items, 'Apples' and 'Oranges' from our grocery list. As always, do not forget to update our DataFrame post the removal of these items. Are you up for the challenge? Let’s get started!
Absolutely! Here’s how you can remove ‘Apples’ and ‘Oranges’ from the grocery list and update the DataFrame:
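A sketch; the starting list is a placeholder:

```python
import pandas as pd

grocery_df = pd.DataFrame({'Grocery Item': ['Apples', 'Oranges', 'Bananas', 'Pears']})

# Find the indices of 'Apples' and 'Oranges' and drop those rows
indices_to_delete = grocery_df[grocery_df['Grocery Item'].isin(['Apples', 'Oranges'])].index
grocery_df = grocery_df.drop(indices_to_delete)

print(grocery_df)
```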
This code finds the indices of ‘Apples’ and ‘Oranges’ in the DataFrame and then removes these items using the drop() function, updating the DataFrame accordingly.
Lesson 6: Navigating DataFrames with Index Column and Data Locating in Pandas
Brace for liftoff! 🚀 We’re exploring the uncharted galaxies of Index Columns and Locating Elements in a Pandas DataFrame today. This is a crucial waypoint in your data analysis journey. Exciting adventures lie ahead!
Introduction and Lesson Overviews
Welcome, future data analyzers! Today, we’re tackling Index Columns and Locating Elements in a Pandas DataFrame. We’ll learn how to handle index columns, locate specific data, and strengthen our understanding of DataFrames. Ready, set, code!
Understanding the Index Column in a Pandas DataFrame
In a Pandas DataFrame, an index is assigned to each row, much like the numbers on books in a library. When a DataFrame is created, Pandas establishes a default index. Let’s refer to an example:
The numbers on the left are the default index.
Setting and Modifying the Index Column
Occasionally, we might need to establish a custom index. The Pandas’ set_index() function allows us to set a custom index. To reset the index to its default state, we use reset_index().
To better understand these functions, let’s consider an example in which we create an index using unique IDs:
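A sketch with a hypothetical ID column:

```python
import pandas as pd

df = pd.DataFrame({'ID': [101, 102, 103],
                   'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 30, 28]})

# Use the ID column as the index
df.set_index('ID', inplace=True)
print(df)
```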
In this example, ID column is displayed as an index. Let’s reset the index to return to the original state:
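Continuing with the df sketched above:

```python
# Reset back to the default integer index, modifying df in place
df.reset_index(inplace=True)
print(df)
```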
By setting inplace parameter to True, we ask pandas to reset the index in the original df dataframe. Otherwise, pandas will create a copy of the data frame with a reset index, leaving the original df untouched.
Locating Elements in a DataFrame
Let’s consider a dataframe with a custom index. If you want to select a specific row based on its index value (for example, ID = 102), you can do this:
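A sketch, again using a hypothetical ID-indexed DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'ID': [101, 102, 103],
                   'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [24, 30, 28]}).set_index('ID')

# loc selects rows by index label
print(df.loc[102])
```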
Selecting Multiple Rows with loc
For multiple rows, simply use a list of IDs:
As you can see, the output of the .loc operation is some subset of the original dataframe.
Selecting Multiple Columns with loc
To select specific multiple columns for these rows, you can provide the column labels as well:
You can also select all rows for specific columns by providing : as the set of index labels:
Using iloc for Location by Index Position
The iloc function enables us to select elements in a DataFrame based on their index positions. iloc works like loc, but it expects the integer positions of the rows. For example, we can select the 3rd row:
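For example, with a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
                   'Age': [24, 30, 28, 35]})

# iloc selects by integer position; position 2 is the third row
print(df.iloc[2])
```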
You can also use slicing here:
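Continuing with the same hypothetical df:

```python
# Rows at positions 1 and 2 (the stop position is exclusive)
print(df.iloc[1:3])
```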
Lesson Summary and Next Steps
That’s it! We’ve covered the index column, how to set it, and how to locate data in a DataFrame. Exciting exercises are up next. Let’s practice and strengthen the skills you’ve learned today. Let the fun begin!
「practice」
Hey, remember the people dataframe? How about exploring it further? In this task, you’ll see how we can select only the 'Name' and 'Age' columns for all rows. No coding is needed here, ace, just hit the 'Run' button!
「Practice」
Fantastic job so far! Now, we’re going to use your people database. This time, could you alter the starter code to select just the last two rows using index positions? To accomplish this, use the iloc function. Have fun!
Here’s the modified code to select the last two rows using the iloc function:
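A sketch; the people DataFrame itself is a hypothetical stand-in since its original definition is not shown here:

```python
import pandas as pd

people_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
                          'Age': [24, 30, 28, 35],
                          'City': ['Paris', 'Lima', 'Oslo', 'Cairo']})

# iloc with a negative slice selects the last two rows by position
print(people_df.iloc[-2:])
```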
This code correctly selects and prints the last two rows of the DataFrame.
「Practice」
Fantastic job so far! Now, we’re going to use your people database. This time, could you alter the starter code to select just the last two rows using index positions? To accomplish this, use the iloc function. Have fun!
Explanation
To select the last two rows using the iloc function, you need to adjust the indexing appropriately. Here is the modified code:
This code will correctly select and display the last two rows of the DataFrame.
「practice」
Bravo, explorer! Here is a tricky one for you.
The same dataframe was intended to select all rows for the 'Name' and 'City' columns, but it threw an error. Could you debug it?
Explanation
The error occurs because you're using integer-based column indexing (which iloc supports) with the loc function, which expects column labels. To select all rows for the 'Name' and 'City' columns using their labels, you should modify the code as follows:
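A sketch of the fix, again with a hypothetical people DataFrame:

```python
import pandas as pd

people_df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                          'Age': [24, 30, 28],
                          'City': ['Paris', 'Lima', 'Oslo']})

# loc expects labels: select all rows for the 'Name' and 'City' columns
print(people_df.loc[:, ['Name', 'City']])
```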
This code will correctly select and display all rows for the 'Name' and 'City' columns.
Lessons and practices
Introduction to Textual Data Collection in NLP
Introduction and Text Data Collection
Welcome to today’s lesson! As data science and machine learning professionals, particularly in the Natural Language Processing (NLP) field, we often deal with textual data. Today, we dive into the ‘Introduction to Textual Data Collection’. Specifically, we’ll explore how to collect, understand and analyze text data using Python.
Textual data is usually unstructured, being much harder to analyze than structured data. It can take many forms, such as emails, social media posts, books, or transcripts of conversations. Understanding how to handle such data is a critical part of building effective machine learning models, especially for text classification tasks where we ‘classify’ or categorize texts. The quality of the data we use for these tasks is of utmost importance. Better, well-structured data leads to models that perform better.
The 20 Newsgroups Dataset
The dataset we’ll be working with in today’s lesson is the 20 Newsgroups dataset. For some historical background, newsgroups were the precursors to modern internet forums, where people gathered to discuss specific topics. In our case, the dataset consists of approximately 20,000 documents from newsgroup discussions. These texts were originally exchanged through Usenet, a global discussion system that predates many modern Internet forums.
The dataset is divided nearly evenly across 20 different newsgroups, each corresponding to a separate topic - this segmentation is one of the main reasons why it is especially useful for text classification tasks. The separation of data makes it excellent for training models to distinguish between different classes, or in our case, newsgroup topics.
该数据集几乎均匀地分布在 20 个不同的新闻组中,每个新闻组对应一个单独的主题 - 这种分割是它对于文本分类任务特别有用的主要原因之一。数据的分离使得训练模型能够很好地区分不同的类别,或者在我们的例子中区分新闻组主题。
From science and religion to politics and sports, the topics covered provide a diversified range of discussions. This diversity adds another layer of complexity and richness, similar to what we might experience with real-world data.
从科学和宗教到政治和体育,所涵盖的主题提供了多元化的讨论。这种多样性增加了另一层复杂性和丰富性,类似于我们在现实世界数据中可能遇到的情况。
| |
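A minimal sketch of the fetch-and-inspect step, using scikit-learn's fetch_20newsgroups as the lesson describes:

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all')
print(type(newsgroups.data))    # <class 'list'>
print(type(newsgroups.target))  # <class 'numpy.ndarray'>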
We are fetching the data and observing the type of the data and target. The type of data tells us what kind of data structure is used to store the text data, while the type of target tells us what type of structure is used to store the labels. Here is what the output looks like:
我们正在获取数据并观察 data 和 target 的类型。 type of data 告诉我们使用什么类型的数据结构来存储文本数据,而 type of target 告诉我们使用什么类型的结构来存储标签。输出如下所示:
| |
As printed out, the data is stored as a list, and target as a numpy array.
打印出来时, data 存储为列表, target 存储为 numpy 数组。
Diving Into Data Exploration
深入数据探索
Now, let’s explore the data points, target variables and the potential classes in the dataset:
现在,让我们探索数据集中的数据点、目标变量和潜在类别:
| |
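Continuing with the newsgroups object fetched earlier, the exploration described here could look roughly like this:

print("Number of data points:", len(newsgroups.data))
print("Number of target labels:", len(newsgroups.target))
print("Newsgroup classes:", newsgroups.target_names)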
We get the length of the data list to fetch the number of data points. Also, we get the length of the target array. Lastly, we fetch the possible classes or newsgroups in the dataset. Here is what we get:
我们获取 data 列表的长度来获取数据点的数量。此外,我们还获得了 target 数组的长度。最后,我们获取数据集中可能的类或新闻组。这是我们得到的:
| |
| |
The Article fetched is the 10th article in the dataset and Corresponding Topic is the actual topic that the article belongs to. Here’s the output:
获取的 Article 是数据集中的第 10 篇文章, Corresponding Topic 是该文章所属的实际主题。这是输出:
| |
| |
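A sketch of the snippet that fetches that article and topic; it assumes "the 10th article" means index 9 in the data list:

article = newsgroups.data[9]
topic = newsgroups.target_names[newsgroups.target[9]]
print("Article:", article)
print("Corresponding Topic:", topic)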
Lesson Summary 课程总结
Nice work! Through today’s lesson, you’ve learned to fetch and analyze text data for text classification. If you’ve followed along, you should now understand the structure of text data and how to fetch and analyze it using Python.
干得好!通过今天的课程,您已经学会了如何获取和分析文本数据以进行文本分类。如果您已经跟进,现在应该了解文本数据的结构以及如何使用 Python 获取和分析它。
But our journey to text classification is just starting. In upcoming lessons, we’ll dive deeper into related topics such as cleaning textual data, handling missing values, and restructuring textual data for analysis. Each step forward improves your expertise in text classification. Keep going!
但我们的文本分类之旅才刚刚开始。在接下来的课程中,我们将更深入地探讨相关主题,例如清理文本数据、处理缺失值以及重构文本数据以进行分析。每前进一步都会提高您在文本分类方面的专业知识。继续前进!
Sample Data Preview 样本数据预览
Lastly, let’s fetch and understand what a sample data point and its corresponding label looks like:
最后,让我们获取并了解示例数据点及其相应标签的样子:
Fetching and Understanding the Data Structure
获取和理解数据结构
To load this dataset, we use the fetch_20newsgroups() function from the sklearn.datasets module in Python. This function retrieves the 20 newsgroup dataset in a format that’s useful for machine learning purposes. Let’s fetch and examine the dataset.
为了加载此数据集,我们使用 Python 中 sklearn.datasets 模块中的 fetch_20newsgroups() 函数。此函数以对机器学习有用的格式检索 20 个新闻组数据集。让我们获取并检查数据集。
First, let’s import the necessary libraries and fetch the data:
首先,让我们导入必要的库并获取数据:
Python

# Importing necessary libraries
from sklearn.datasets import fetch_20newsgroups

# Fetch data
newsgroups = fetch_20newsgroups(subset='all')
The datasets fetched from sklearn typically have three attributes—data, target, and target_names. data refers to the actual content, target refers to the labels for the texts, and target_names provides names for the target labels.
从 sklearn 获取的数据集通常具有三个属性 - data 、 target 和 target_names 。 data 指实际内容, target 指文本标签, target_names 提供目标标签的名称。
Next, let’s understand the structure of the fetched data:
接下来我们来了解一下获取到的数据的结构:
「Practice1」Explore More of the 20 Newsgroups Dataset
探索 20 个新闻组数据集的更多内容
Excellent job, Space Voyager! Now, make a small alteration to the starter code: change it to print out the first 150 characters of the 500th article from our 20 Newsgroups dataset, and also display its corresponding topic.
| |
“Here is the updated code to print out the first 150 characters of the 500th article from the 20 Newsgroups dataset and display its corresponding topic.”
| |
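One possible version of the updated code, assuming the 500th article corresponds to index 499 of the fetched newsgroups object:

print(newsgroups.data[499][:150])                       # first 150 characters of the 500th article
print(newsgroups.target_names[newsgroups.target[499]])  # its corresponding topic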
输出
| |
「Practice2」Uncover the End of 20 Newsgroups Dataset
Celestial Traveler, your journey continues! Fill in the blanks (____) to import and explore our dataset. We aim to extract and display the last three articles and their corresponding topics. Can you reveal what’s at the end of our dataset?
| |
“Here is the completed code to import and explore the dataset, extracting and displaying the last three articles and their corresponding topics.”
| |
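A sketch of the completed code, using negative indices to reach the end of the dataset (the exact print formatting in the starter code is assumed):

for i in range(-3, 0):   # the last three articles
    print(newsgroups.data[i])
    print(newsgroups.target_names[newsgroups.target[i]])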
输出
| |
「Practice 3」 Fetch Specific Categories from Dataset
从数据集中获取特定类别
Celestial Traveler, let’s narrow down our data collection. Modify the provided code to fetch only the 'alt.atheism' and 'talk.religion.misc' categories from our dataset. Then, display the first two articles from these categories along with their corresponding labels.
| |
解题:
“Here is the modified code to fetch only the 'alt.atheism' and 'talk.religion.misc' categories from the dataset and display the first two articles along with their corresponding labels.”
| |
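A minimal sketch of that modification; fetch_20newsgroups accepts a categories argument that restricts the download to the listed newsgroups:

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc']
subset = fetch_20newsgroups(subset='all', categories=categories)

for i in range(2):   # first two articles with their labels
    print(subset.data[i])
    print(subset.target_names[subset.target[i]])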
输出
| |
「Practice 4」Fetching the Third Article from Dataset
从数据集中获取第三篇文章
Well done, Stellar Navigator! Next, fill in the missing line in the code below to fetch and display the third article from the 20 Newsgroups dataset with its corresponding topic. Prepare your spacecraft for another adventure in data exploration!
干得好,恒星导航员!接下来,填写下面代码中缺少的行,以从 20 个新闻组数据集中获取并显示第三篇文章及其相应的主题。准备好您的航天器,进行另一次数据探索冒险!
| |
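Roughly, the missing line amounts to indexing position 2 (the third article), assuming the newsgroups object fetched earlier:

print(newsgroups.data[2])                             # the third article
print(newsgroups.target_names[newsgroups.target[2]])  # its corresponding topic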
输出
| |
Exploring Text Length in Newsgroups Dataset
探索新闻组数据集中的文本长度
Great job, Space Voyager! Now, as a final task, write a Python script that calculates and displays the lengths of the first five articles (in terms of the number of characters) from the 20 Newsgroups dataset.
干得好,太空航行者!现在,作为最后一项任务,编写一个 Python 脚本,计算并显示 20 个新闻组数据集中的前 5 篇文章的长度(以字符数计)。
| |
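One way such a script could look, again assuming the newsgroups object from earlier in the lesson:

for i in range(5):
    print(f"Article {i + 1} length: {len(newsgroups.data[i])} characters")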
解释
| |
输出
| |
Lesson 2 第2课
Mastering Text Cleaning for NLP: Techniques and Applications
掌握 NLP 文本清理:技术与应用
Lesson 2
Mastering Text Cleaning for NLP: Techniques and Applications
Introduction 介绍
Welcome to today’s lesson on Text Cleaning Techniques! In any Natural Language Processing (NLP) project, the quality of your results depends heavily on the quality of your input. Hence, cleaning our textual data becomes critical for the accuracy of our project. Our main objective for today is to delve into how to clean textual data using Python. By the end of this session, you will be comfortable with creating and applying a simple text cleaning pipeline in Python.
欢迎来到今天的文本清理技术课程!在任何自然语言处理 (NLP) 项目中,结果的质量在很大程度上取决于输入的质量。因此,清理文本数据对于我们项目的准确性至关重要。我们今天的主要目标是深入研究如何使用 Python 清理文本数据。在本课程结束时,您将能够轻松地在 Python 中创建和应用简单的文本清理管道。
Introduction
Welcome to today’s lesson on Text Cleaning Techniques! In any Natural Language Processing (NLP) project, the quality of your results depends heavily on the quality of your input. Hence, cleaning our textual data becomes critical for the accuracy of our project. Our main objective for today is to delve into how to clean textual data using Python. By the end of this session, you will be comfortable with creating and applying a simple text cleaning pipeline in Python.
Understanding Text Cleaning
Text cleaning is essential in NLP, involving the preparation of text data for analysis. Why is it necessary? Imagine trying to perform text classification on social media posts; people often use colloquial language, abbreviations, and emojis. In many cases, posts might also be in different languages. These variations make it challenging for machines to understand context without undergoing preprocessing.
We get rid of superfluous variations and distractions to make the text understandable for algorithms, thereby increasing accuracy. These distractions could range from punctuation, special symbols, numbers, to even common words that do not carry significant meaning (commonly referred to as “stop words”).
Python’s Regex (Regular Expression) library, re, is an ideal tool for such text cleaning tasks, as it is specifically designed to work with string patterns. Within this library, we will be using re.sub, a method employed to replace parts of a string. This method operates by accepting three arguments: re.sub(pattern, repl, string). Here, pattern is the character pattern we’re looking to replace, repl is the replacement string, and string is the text being processed. In essence, any part of the string argument that matches the pattern argument gets replaced by the repl argument.
As we proceed, a clearer understanding of the functionality and application of re.sub will be provided. Now, let’s delve into it!
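As a quick, invented illustration of the re.sub(pattern, repl, string) signature described above:

import re

# Replace every digit in the string with a '#'
print(re.sub(r'\d', '#', 'Call me at 555-0199'))  # Call me at ###-####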
Text Cleaning Process
The text cleaning process comprises multiple steps where each step aims to reduce the complexity of the text. Let’s take you through the process using a Python function, clean_text.
| |
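A sketch of what such a clean_text function can look like, assembled from the steps listed below (the same regex patterns also appear in this lesson's fill-in exercises); the trailing strip() is a small assumed convenience:

import re

def clean_text(text):
    text = text.lower()                          # lowercase everything
    text = re.sub(r'\S*@\S*\s?', '', text)       # remove email addresses
    text = re.sub(r'http\S+', '', text)          # remove URLs
    text = re.sub(r'\W', ' ', text)              # replace non-word characters with spaces
    text = re.sub(r'\d', ' ', text)              # replace digits with spaces
    text = re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace
    return text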
In the function above we can see how each line corresponds to a step in the cleaning process:
- Lowercase: We convert all text to lower case, so every word looks the same unless it carries a different meaning. This way, words like ‘The’ and ’the’ are no longer seen as different.
- Email addresses: Email addresses don’t usually provide useful information unless we’re specifically looking for them. This line of code removes any email addresses found.
- URLs: Similarly, URLs are removed as they are typically not useful in text classification tasks.
- Special Characters: We remove any non-word characters (\W) and replace them with spaces using regex. This includes special characters and punctuation.
- Numbers: We're dealing with text data, so numbers are also considered distractions unless they carry significant meaning.
- Extra spaces: Any resulting extra spaces from the previous steps are removed.
Let’s go ahead and run this function on some demo input to see it in action!
| |
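For example, with an invented demo sentence and the clean_text sketch above:

demo = "Visit https://example.com or email me at jane@example.com, room 42!"
print(clean_text(demo))
# visit or email me at room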
The output of the above code will be:
| |
Implementing Text Cleaning Function
Now that you are familiar with the workings of the function let’s implement it in the 20 Newsgroups dataset.
To apply our cleaning function on the dataset, we will make use of the DataFrame data structure from Pandas, another powerful data manipulation tool in Python.
| |
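A minimal sketch of that step; the column names 'text' and 'target' are assumptions, not fixed by the lesson:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='train')
df = pd.DataFrame({'text': newsgroups.data, 'target': newsgroups.target})

# apply() passes every value of the 'text' column to clean_text, one by one
df['text'] = df['text'].apply(clean_text)
print(df['text'].head())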
The output of the above code will be:
| |
In this code, we’re applying the clean_text function to each ’text’ in our DataFrame using the apply function. The apply function passes every value of the DataFrame column to the clean_text function one by one.
Understanding Effectiveness of Text Cleaning Function
We want to understand the impact of our text cleaning function. We can achieve this by looking at our text before and after cleaning. Let’s use some new examples:
| |
The output of the above code will be:
| |
In the example above, you will see that our function successfully transforms all text to lower case and removes punctuation, digits, and email addresses!
Lesson Summary and Practice Exercises
Today we delved into the text cleaning process in Natural Language Processing. We shared why it is necessary and how to implement it in Python. We then applied our text cleaning function on a textual dataset.
We have a few exercises lined up based on what we learned today. Keep swimming ahead, and remember, you learn the most by doing. Happy cleaning!
「practice」
Well done, Space Voyager! Now, to further explore the workings of our text cleaning function, let’s use a different sentence. Replace the first sentence in the test_texts list with the phrase “I love learning at CodeSignal; it’s so interactive and fun!”. Then run the clean_text function with the updated list.
| |
Certainly! Here is the updated code with the new sentence in the test_texts list and the clean_text function:
| |
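A sketch of the updated snippet; only the new sentence is shown, since the rest of the starter list is not reproduced here:

test_texts = [
    "I love learning at CodeSignal; it's so interactive and fun!",
    # ...the remaining sentences from the starter code stay unchanged
]

for text in test_texts:
    print(clean_text(text))
# i love learning at codesignal it s so interactive and fun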
Running this updated code will process the new sentence along with the other test cases. Here is the expected output:
| |
This demonstrates how the clean_text function processes and cleans the text by converting to lowercase, removing punctuation, digits, extra spaces, and other unwanted characters.
「Practice」
Superb job! Now, let’s ensure you have grasped the core Python functions and regex patterns used in text cleaning.
Fill in the blanks (____) to transform all text to lowercase and remove extra spaces.
| |
Here is the Python code with the blanks filled in to transform all text to lowercase and remove extra spaces:
| |
Explanation of the substitutions:
- text.lower(): Converts the text to lowercase.
- re.sub(r'\S*@\S*\s?', '', text): Removes email addresses by matching patterns of the form 'something@something'.
- re.sub(r'http\S+', '', text): Removes URLs by matching patterns starting with 'http'.
- re.sub(r'\W', ' ', text): Removes punctuation and special characters by replacing non-word characters with a space.
- re.sub(r'\d', ' ', text): Removes digits by replacing them with a space.
- re.sub(r'\s+', ' ', text): Replaces multiple spaces with a single space.
「Practice3」
Charming maneuvering, Stellar Navigator!
Now, I want to see more of your Python prowess. Replace the missing parts (____), ensuring that they perform the right operations. We aim to transform all text to lowercase and eliminate email addresses, URLs, special characters, digits, and extra spaces.
| |
Absolutely, let’s fill in the missing parts to ensure the clean_text function works as intended. The goal is to convert all text to lowercase and eliminate email addresses, URLs, special characters, digits, and extra spaces. Here’s the complete function:
| |
Explanation:
- Convert text to lower case: text = text.lower()
- Remove email addresses: text = re.sub(r'\S*@\S*\s?', '', text)
- Remove URLs: text = re.sub(r'http\S+', '', text)
- Remove punctuation and special characters: text = re.sub(r'\W', ' ', text)
- Remove digits: text = re.sub(r'\d', '', text)
- Normalize whitespace to single spaces: text = re.sub(r'\s+', ' ', text)
When you run this script, it will clean each text string according to the rules specified and print the original and cleaned text for each example. This ensures that the function performs the desired operations correctly.
output:
| |
「Practice4」
Greetings, data pioneers! Let’s take our next step.
You have the people dataframe, wherein unique IDs are set as the index. However, it appears that we’re missing the details of the person with ID 503. Could you complete the code to display the specifics of the individual with ID 503, utilizing the lessons we have learned so far?
| |
Your goal is to retrieve and display the details of the individual with ID 503 from the people dataframe. The dataframe currently contains columns for “Name”, “Age”, and “City” with unique IDs as the index. We need to write the appropriate code to locate and print the details for the person with ID 503.
Here’s how we can do it:
- Retrieve the specific row using the ID.
- Print the details of the person with ID 503.
Here is the complete code to accomplish this:
| |
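A minimal sketch, assuming a people DataFrame indexed by unique IDs (the names and values below are invented):

import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [29, 34], "City": ["Austin", "Boston"]},
    index=[501, 503],          # hypothetical IDs
)

print(people.loc[503])         # .loc selects by index label, here the unique ID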
When you run this code, it will output the details of the person with ID 503:
| |
This approach ensures that we accurately locate and display the details of the specific individual identified by the unique ID 503.
输出
| |
Lesson 2: Implementing Bag-of-Words Representation
You’re doing an amazing job learning all about descriptive statistics and Python! I’m super excited for this lesson on central tendency. Let’s dive in and explore the mysteries of mean, median and mode! 🌟
Introduction to Descriptive Statistics and Python
Greetings, data enthusiast! Today, we are diving into descriptive statistics using Python. We’ll be exploring measures of centrality — mean, median, and mode — using Python libraries numpy and pandas.
Understanding Central Tendency
A central tendency finds a 'typical' value in a dataset. Our three components — the mean (average), median (mid-point), and mode (most frequently appearing) — each offer a unique perspective on centrality. The mean indicates average performance when examining students' scores, while the median represents the middle student's performance, and the mode highlights the most common score.
Visualizing Central Tendency
This plot represents a given dataset’s mean or centered location, also considered the ‘average’. Imagine a seesaw balancing at its center - the mean of a dataset is where it balances out. It is a crucial statistical concept and visually helps identify where most of our data is centered around or leaning toward. 
Setting up the Dataset
Our dataset is a list of individuals’ ages: [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]. Remember, understanding your data upfront is key to conducting a meaningful analysis.
Computing Mean using Python
Calculating the mean involves adding all numbers together and then dividing by the count. Here’s how you compute it in Python:
| |
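A short sketch using the ages list defined above:

import numpy as np

ages = [23, 22, 22, 23, 24, 24, 23, 22, 21, 24, 23]
mean_age = np.mean(ages)
print("Mean age:", mean_age)   # roughly 22.82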
Computing Median using Python
The median is the ‘middle’ value in an ordered dataset. This is how it is computed in Python:
| |
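Continuing with the same ages list:

median_age = np.median(ages)
print("Median age:", median_age)   # 23.0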
Computing Mode using Python
The mode represents the most frequently occurring number(s) in a dataset. To compute it, we use the mode function from the scipy library:
| |
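Continuing with the same ages list; note that depending on your SciPy version, .mode may be a scalar or a one-element array:

from scipy import stats

mode_age = stats.mode(ages)
print("Mode age:", mode_age.mode)   # 23 occurs most often (four times)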
Note that the calculated mode_age is an object. To retrieve the actual value from it, we use the .mode attribute of this object, so the resulting line is mode_age.mode.
NumPy doesn't have a function for calculating the mode, so we are using the SciPy module here. We will talk more about this module and its capabilities in future lessons.
Handling Ties in Mode with scipy
Great job so far! Now let’s explore an interesting concept: how the mode function from scipy.stats handles ties or duplicate modes.
So, what’s a tie in mode? Imagine we have two or more different numbers appearing the same number of times in our dataset. For instance, consider this dataset: [20, 21, 21, 23, 23, 24]. Here, 21 and 23 both appear twice and are therefore modes.
Let’s calculate the mode using scipy.stats:
| |
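A small sketch of the tie example described above:

from scipy import stats

data = [20, 21, 21, 23, 23, 24]
result = stats.mode(data)
print(result.mode)   # 21, the smaller of the two tied modes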
Although 21 and 23 are both modes, our calculation only returned 21. Why is that?
In cases of ties, scipy.stats.mode() returns the smallest value amongst the tied modes. So in this case, it picked 21 over 23 because 21 is the smaller value.
Choice of Measure of Central Tendency
Your choice of measure of central tendency depends on the nature of your data. For numerical data, the mean is susceptible to outliers, i.e., extreme values, making the median a preferable measure. The mode is undefined when no particular value repeats, or all values repeat with equal frequency. For categorical data, the mode is the only meaningful measure.
Wrapping Up
Kudos! You have mastered the measures of central tendency and have learned how to compute them using Python! Stay tuned for some hands-on exercises for deeper reinforcement. Onward!
「Practice 1」Analyzing Age Group Dataset with Measures of Centrality
Greetings, fellow explorer!
In our Age Group Analysis, we’ve calculated measures of centrality for a group of individuals’ ages. Let’s review!
Click the Run button to display the mean, median, and mode for the dataset.
| |
解释 Let's correct the code by accessing the mode directly without indexing. Here is the updated version:
| |
Let’s run this corrected version to display the mean, median, and mode of the dataset.
知识点拆解
好的,让我们具体解释一下这个代码示例中的主要知识点。
数据集定义
| |
这是一个包含多个年龄值的数组。我们将使用这个数据集来计算平均值、中位数和众数。
计算平均值(Mean)
| |
- 平均值是所有数据点的和除以数据点的数量。
- 在这个例子中,
np.mean函数计算了age_group数组的平均值。
计算中位数(Median)
| |
- 中位数是数据集中的中间值。当数据点数为奇数时,中位数是排序后中间的那个值;当数据点数为偶数时,中位数是排序后中间两个值的平均值。
- 在这个例子中,
np.median函数计算了age_group数组的中位数。
计算众数(Mode)
| |
- 众数是数据集中出现次数最多的值。
- 在这个例子中,
stats.mode函数计算了age_group数组的众数。
输出结果
| |
- 这些语句将计算的平均值、中位数和众数输出到控制台。
运行代码的预期结果
假设我们运行代码,上述语句将输出如下结果:
- Mean age: 9.545454545454545
- Median age: 10.0
- Mode age: [10]
这意味着:
- 数据集的平均年龄约为9.55岁。
- 数据集的中位数年龄是10岁。
- 最常见的年龄(众数)是10岁。
通过这些统计量,我们可以更好地理解数据集的分布和中心趋势。这在数据分析中非常重要,因为它可以帮助我们概括和解释数据的总体特征。
「Practice2」
Hello there, Data Explorer! Now we’re going to observe how altering data can affect our measures of centrality.
Please adjust the age values in the existing code. You might want to consider older or younger ages. Let’s then examine how the mean, median, and mode are subject to change.
| |
解释 Let’s modify the ages in the dataset to observe how changes affect the measures of centrality: mean, median, and mode. We will adjust the ages to include younger and older ages, and then recalculate the mean, median, and mode.
Here is the updated code with adjusted ages and the calculations for mean, median, and mode:
| |
Observation:
- The mean age has increased significantly due to the inclusion of much older ages, which has a large impact on the average.
- The median age is now 26.0, reflecting the middle value in the ordered list of ages.
- The mode age is 15, as it is the most frequently occurring value in the dataset.
By including more extreme values (both younger and older ages), we can see how the mean is more sensitive to these changes compared to the median and mode. The median provides a better central tendency measure when dealing with outliers or skewed data, while the mode reflects the most common age in the dataset.
「Practice3」
Wonderful work, intrepid data navigator! Now, let the illumination of knowledge guide you through this challenge. Complete the missing parts in the code to find the mean, median, and mode of the second age group.
| |
解释
To complete the missing parts of the code to find the mean, median, and mode of the second age group, we need to use the appropriate functions from the numpy and scipy libraries. Below is the complete code:
| |
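One possible completed version; the values in ages_group_2 are invented here, since the starter data is not shown:

import numpy as np
from scipy import stats

ages_group_2 = [34, 36, 36, 38, 40, 41, 36]   # hypothetical second age group

mean_2 = np.mean(ages_group_2)
median_2 = np.median(ages_group_2)
mode_2 = stats.mode(ages_group_2)

print("Mean:", mean_2)
print("Median:", median_2)
print("Mode:", mode_2.mode)   # older SciPy versions may require mode_2.mode[0]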
Explanation:
np.mean(ages_group_2)computes the mean of the age group.np.median(ages_group_2)computes the median of the age group.stats.mode(ages_group_2)computes the mode of the age group. Note thatmode_2.modereturns an array, somode_2.mode[0]is used to extract the mode value.
This will output the mean, median, and mode for the second age group when executed.
「Practice4」
Bravo, Future Analyst! You’ve reached the final stage, where we’re analyzing the ages of kids at a space camp.
Your task is to use your skills to compute the mean, median, and mode. Are you ready to determine the average age, middle age, and most common age of these little astronauts?
Proceed when you’re ready!
| |
解释 Certainly! Let’s complete the code to compute the mean, median, and mode for the ages of the kids at the space camp. Below is the full implementation:
| |
Explanation:
np.mean(kids_ages)computes the average age (mean) of the kids.np.median(kids_ages)computes the middle age (median) of the kids.stats.mode(kids_ages)computes the most common age (mode) of the kids. Note thatmode_age.modereturns an array, somode_age.mode[0]is used to extract the mode value.
This will output the mean, median, and mode for the kids’ ages at the space camp when executed.
lesson2
Introduction and Overview
Welcome back! Our journey into Descriptive Statistics continues with Measures of Dispersion. These measures, including range, variance and standard deviation, inform us about the extent to which our data is spread out. We’ll use Python’s numpy and pandas libraries to paint a comprehensive picture of our data’s dispersion. Let’s dive right in!
Understanding Measures of Dispersion
理解离散程度的度量方法
Measures of Dispersion capture the spread within a dataset. For example, apart from knowing the average test scores (a Measure of Centrality), understanding the ways in which the scores vary from the average provides a fuller picture. This enhanced comprehension is vital in everyday data analysis.
离散程度度量用于捕捉数据集内的离散程度。例如,除了了解平均考试分数(集中趋势度量)外,了解分数相对于平均值的离散方式可以提供更全面的信息。这种增强的理解在日常数据分析中至关重要。
Visualizing Measures of Dispersion
可视化离散程度的度量方法
This graph illustrates two normal distributions with varying standard deviations. Standard deviation measures how much each data point deviates from the average. Notice the curve’s width under each distribution: a smaller spread (blue curve) reflects a smaller standard deviation, where most of the data points are closer to the mean. In contrast, a wider spread (green curve) signifies a greater standard deviation and that data points vary more widely around the mean. 这张图表展示了两个具有不同标准差的正态分布。标准差衡量每个数据点偏离平均值的程度。请注意每个分布曲线下的宽度:较小的分布范围(蓝色曲线)反映较小的标准差,其中大多数数据点更接近平均值。相反,较大的分布范围(绿色曲线)表示较大的标准差,并且数据点在平均值周围的变化更大。 
Calculating Range in Python
在 Python 中计算范围
The Range, simply the difference between the highest and lowest values, illustrates the spread between the extremes of our dataset. Python’s numpy library has a function, ptp() (peak to peak), to calculate the range. Here are the test scores of five students:
极差,即最高值和最低值之间的差,说明了我们数据集中极端值之间的差距。Python 的 numpy 库有一个函数 ptp() (峰峰值)来计算极差。以下是五名学生的考试成绩:
| |
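A sketch of the range calculation; only the minimum (72) and maximum (96) scores are stated in the lesson, so the middle values below are assumed:

import numpy as np

scores = np.array([72, 81, 88, 93, 96])   # five test scores
range_scores = np.ptp(scores)
print("Range of scores:", range_scores)   # Range of scores: 24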
The result “Range of scores: 24”, derived from 96 - 72, tells us how widely the extreme scores are spread out.
从 96 - 72 中得出的结果“分数范围:24”告诉我们极端分数的分布范围。
Calculating Variance in Python
在 Python 中计算方差
Variance, another Measure of Dispersion, quantifies the degree to which data values differ from the mean. High variance signifies that data points are spread out; conversely, low variance indicates closeness. We calculate the variance using numpy‘s var() function:
方差是另一种衡量离散程度的指标,它量化了数据值与平均值的差异程度。高方差表示数据点分散;相反,低方差表示数据点接近。我们使用 numpy 的 var() 函数计算方差:
| |
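Continuing with the scores array from the sketch above:

variance_scores = np.var(scores)
print("Variance of scores:", variance_scores)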
Our output demonstrates the level of variability from the average.
我们的输出结果展示了与平均值的差异程度。
Calculating Standard Deviation in Python
在 Python 中计算标准差
Standard Deviation is rooted in Variance as it is simply the square root of Variance. It is essentially a measure of how much each data point differs from the mean or average. We can compute it through the std() function available in numpy.
标准差植根于方差,因为它仅仅是方差的平方根。它本质上是衡量每个数据点与平均值的差异程度。我们可以通过 std() 中提供的 numpy 函数来计算它。
| |
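And the corresponding standard deviation, again using the same scores array:

std_scores = np.std(scores)
print("Standard deviation of scores:", std_scores)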
Why is standard deviation important when we already have variance? Compared to variance, standard deviation is expressed in the same units as the data, making it easier to interpret. Additionally, standard deviation is frequently used in statistical analysis because, for approximately normally distributed data, values within one standard deviation of the mean account for roughly 68% of the set, while values within two standard deviations cover around 95%. These percentages aid in understanding data dispersion in a probability distribution. Therefore, while variance provides numerical insight into data spread, standard deviation conveys these insights in a more comprehensible and applicable manner.
为什么在已经有了方差的情况下,标准差仍然很重要?与方差相比,标准差与数据的单位相同,更易于解释。此外,标准差在统计分析中经常被使用,因为对于近似正态分布的数据,距离均值在一个标准差内的数据约占数据集的 68%,而在两个标准差内的数据约占 95%。这些百分比有助于理解概率分布中的数据离散程度。因此,虽然方差提供了对数据分散程度的数值洞察,但标准差以一种更易理解和应用的方式传达了这些洞察。
Conclusion 结论
Great job! You’ve just delved into Measures of Dispersion! These skills will assist you in better interpreting and visualizing data. Remember, hands-on practice solidifies learning. Stay tuned for some practice exercises. Now, let’s dive further into exploring our data!
干得好!你刚刚深入学习了离散程度的度量!这些技能将帮助你更好地解释和可视化数据。记住,动手练习才能巩固学习。请继续关注一些练习。现在,让我们进一步探索数据吧!
「Practice」
Let’s examine this, Galactic Pioneer! We have a class of math students, and we’re interested in how their scores are distributed.
The provided code demonstrates the calculations for the range, the variance, and the standard deviation of their scores. Click on the Run button to see how these measures can deliver a wealth of information about their performance!
| |
Let’s analyze the given Python code for calculating the range, variance, and standard deviation of math scores of students. I’ll also provide the equivalent Wolfram Language code to achieve the same calculations.
Python Code Explanation
The provided Python code uses the numpy library to perform the calculations on a list of math scores.
- Range: The range is the difference between the maximum and minimum values in the dataset.
  range_scores = np.ptp(math_scores)
  print(f"Range of math scores: {range_scores}")
- Variance: Variance measures the spread of the numbers in the dataset.
  variance_scores = np.var(math_scores)
  print(f"Variance of math scores: {variance_scores}")
- Standard Deviation: Standard deviation is the square root of the variance, representing the dispersion of the dataset.
  std_scores = np.std(math_scores)
  print(f"Standard deviation of math scores: {std_scores}")
Equivalent Wolfram Language Code
Now, let's translate this to Wolfram Language to perform the same calculations.
Explanation of the Wolfram Language Code
- Range: Calculated using Max[] - Min[].
- Variance: Calculated using Variance[].
- Standard Deviation: Calculated using StandardDeviation[].
Here’s the Wolfram Language code for performing these calculations:
| |
「Practice」
Great job! Now, we’re going to stir things up a bit. Your assignment is to revise the existing code so that it calculates and prints out the range, variance, and standard deviation for English marks instead of Math marks. Let’s assume that there is a student with low performance in english: make one of the marks equal to 45. Examine how the range, variance and std change because of this tweak.
Let’s get started!
| |
解释 To revise the code for calculating and printing the range, variance, and standard deviation for English marks with one mark set to 45, we will modify the array to include English marks instead of math marks. Here is the revised Python code:
Revised Python Code for English Marks
| |
Equivalent Wolfram Language Code for English Marks
Now, let’s translate this revised code to Wolfram Language. We will calculate and print the range, variance, and standard deviation for the new English marks, including the low mark of 45.
Here’s the Wolfram Language code for performing these calculations:
| |
Here are the results for the revised calculations with the English marks, including the low mark of 45:
Range: $$ \text{Range} = 47 $$
Variance: $$ \text{Variance} = \frac{1835}{7} \approx 262.14 $$
Standard Deviation: $$ \text{Standard Deviation} = \sqrt{\frac{1835}{7}} \approx 16.19 $$
Interpretation
- Range: The range has increased to 47, indicating a greater spread between the highest and lowest scores.
- Variance: The variance is approximately 262.14, which is higher than it would be without the low mark of 45. This indicates a higher spread of scores around the mean.
- Standard Deviation: The standard deviation is approximately 16.19, showing increased dispersion in the scores.
These measures indicate that the presence of a low score (45) has significantly increased the variability and spread of the scores in the dataset.
「Practice」
Excellent work, Space Voyager! It’s now time to put our debugging skills to the test. It appears there’s an issue with the console output in the code provided below — it’s not displaying all the results we expect to see. Could you identify and rectify the problem? Don’t stop exploring!
| |
Let’s review the provided code and identify the issue causing the unexpected console output. The code aims to calculate and print the range, variance, and standard deviation of an array of math scores.
Here’s the code:
| |
The problem lies in the calculation of the standard deviation. The correct method to calculate the standard deviation is to take the square root of the variance, not to square it. The incorrect calculation is:
| |
It should be corrected to:
| |
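A minimal sketch of the bug and the fix described above; the actual starter array is not shown, so the scores below are invented:

import numpy as np

math_scores = np.array([72, 85, 88, 91, 96])   # hypothetical scores

variance_scores = np.var(math_scores)

# Buggy line: squaring the variance instead of taking its square root
# std_scores = variance_scores ** 2

# Corrected line: the standard deviation is the square root of the variance
std_scores = np.sqrt(variance_scores)           # equivalent to np.std(math_scores)

print("Range of math scores:", np.ptp(math_scores))
print("Variance of math scores:", variance_scores)
print("Standard deviation of math scores:", std_scores)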
Here’s the corrected code:
| |
Now, the output should display the correct range, variance, and standard deviation of the scores.
「Practice」
Alright, Ace, you’re doing great! Now, how about we stretch your skills a bit more? Please finish the code snippet below to analyze the education performance scores.
Blast off!
| |
解释 Sure, let’s finish the code snippet to analyze the educational performance scores. We’ll add calculations for the mean, median, and standard deviation, along with the given range and variance.
Here’s the complete code:
| |
This script will calculate and print the range, variance, mean, median, and standard deviation of the given educational performance scores.
「Practice」
It's time to wrap this up, my friend! Combine your knowledge of range, variance, and standard deviation calculations to perform an analysis on educational performance, as we have previously discussed. Apply these measures to a set of students' mathematics scores and get ready to shine!
import numpy as np
| |
To analyze the mathematics scores using range, variance, and standard deviation, you can utilize the capabilities of NumPy, a powerful library in Python. Here’s how to compute each metric:
- Range: The range provides a measure of the spread between the smallest and largest value in the dataset.
- Variance: Variance measures the average of the squared differences from the Mean. It gives a sense of how much the scores deviate from the average.
- Standard Deviation: This is the square root of the variance and provides a measure of the average distance from the mean. It’s in the same units as the data, making it more interpretable than variance.
Here’s how you can calculate and print each of these statistics for the provided math scores:
| |
- The np.ptp() function is used to calculate the range.
- The np.var() function is used to calculate the variance.
- The np.std() function is used to calculate the standard deviation.
Running this script will provide you with the range, variance, and standard deviation of the math scores, thus giving a comprehensive statistical analysis of the dataset.
尝试
| |
「Practice」
Hello, Space Voyager! Let’s examine an array of student grades to understand the data’s spread. The provided code calculates the first quartile (Q1), second quartile (Q2), and third quartile (Q3) of the grades. It also computes the Interquartile Range (IQR), which is the difference between Q3 and Q1. Press Run to see it in action!
| |
解释 Let’s examine the provided Python code that calculates the first quartile (Q1), second quartile (Q2), third quartile (Q3), and the interquartile range (IQR) of student grades. Here’s a step-by-step explanation of the code and how it calculates these statistics:
Importing NumPy:
import numpy as np
This line imports the NumPy library, which provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
Defining the array of grades:
grades = np.array([73, 85, 77, 85, 91, 72, 69, 77, 83, 76, 78, 81])
Here, we define an array of student grades.
Calculating Quartiles:
Q1 = np.percentile(grades, 25)
Q2 = np.percentile(grades, 50)
Q3 = np.percentile(grades, 75)
- Q1 is the first quartile, which is the value below which 25% of the data fall.
- Q2 is the second quartile or median, which is the value below which 50% of the data fall.
- Q3 is the third quartile, which is the value below which 75% of the data fall.
The np.percentile function in NumPy is used to compute these quartiles.
Calculating the Interquartile Range (IQR):
IQR = Q3 - Q1
The interquartile range is the difference between the third quartile and the first quartile. It measures the spread of the middle 50% of the data.
Printing the IQR:
print("The Interquartile Range of the student grades is: ", IQR)
Here is the code execution in a Wolfram Language context for comparison. This will demonstrate the calculation of the quartiles and IQR using Wolfram Language:
| |
「Practice」
Hello, Star Analyst!
Below, you will find an array of student scores. The goal is to calculate the interquartile range (IQR), which identifies the middle half of the scores. However, a bug in the code is preventing it from executing correctly. Could you debug it, Space Voyager?
| |
The issue in the provided code lies in the incorrect order of the percentiles for Q1 and Q3. According to the definition of quartiles, Q1 is the 25th percentile, and Q3 is the 75th percentile. However, the code is calculating Q1 as the 75th percentile and Q3 as the 25th percentile. This needs to be corrected.
Here is the corrected code:
| |
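A sketch of the corrected code; the exercise's actual scores array is not shown, so the values below are invented:

import numpy as np

scores = np.array([65, 70, 72, 75, 78, 80, 82, 85, 88, 90])

Q1 = np.percentile(scores, 25)   # first quartile = 25th percentile
Q3 = np.percentile(scores, 75)   # third quartile = 75th percentile
IQR = Q3 - Q1
print("The Interquartile Range of the student scores is:", IQR)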
Explanation:
- Import the NumPy library: This library is used for numerical operations in Python.
- Define the array of student scores: These are the scores that need to be analyzed.
- Calculate Q1 and Q3:
- Q1 is the 25th percentile.
- Q3 is the 75th percentile.
- Calculate the Interquartile Range (IQR): This is the difference between Q3 and Q1.
- Print the IQR: Display the result.
This should correctly compute the interquartile range of the given student scores.
「Practice」
To complete your code, you’ll need to implement the clean_text function as discussed previously, create a pandas DataFrame from the fetched dataset, apply the cleaning functions to the DataFrame, and finally inspect the cleaned text. Here’s how you can do it:
| |
This script will:
- Define the necessary text cleaning functions.
- Implement a clean_text function that applies all the cleaning steps to a given text.
- Fetch the 'train' subset of the 20 newsgroups dataset.
- Create a pandas DataFrame containing this data.
- Apply the clean_text function to each piece of text in the DataFrame.
- Print the first few rows of the DataFrame to inspect the cleaned text.
「Practice」
It’s time to put your statistical and programming skills to the test! You have a list of students’ scores named grades. Calculate and print the median score of the students.
Also, calculate the Interquartile Range (IQR), which illustrates the span encompassing half of the scores. Give it a shot, Voyager!
| |
Here’s the solution to calculate the median, first and third quartiles, and interquartile range (IQR) for the given list of grades:
| |
Output:
| |
Explanation:
- We use np.median() to calculate the median score from the grades array. The median is the middle value when the scores are sorted.
- To calculate the first and third quartiles, we use np.percentile(). The first quartile (Q1) is the 25th percentile, and the third quartile (Q3) is the 75th percentile. These quartiles represent the values below which 25% and 75% of the scores fall, respectively.
- The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). It represents the range of the middle 50% of the scores.
By using NumPy’s built-in functions, we can easily calculate these statistical measures for the given list of grades.
「Practice」
Are you ready to understand the city’s weather patterns, Voyager? We’ve gathered a year’s worth of temperature data! Using Scipy, a Python library, we can analyze this data!
We’ll compute two important statistical measures: skewness, which measures the symmetry of the data, and kurtosis, which assesses the probability of outliers. Let’s hit Run to determine the skewness and kurtosis of our weather data!
| |
「Practice」
Wow! We've gathered a year's worth of temperature data for a hypothetical city, and we're eager to analyze it using the advanced Scipy functionality. Can you complete a program to generate the temperature data? Temperatures should be normally distributed with loc=20 and scale=5.
| |
解释 Certainly! I’d be happy to help you complete the program to generate temperature data and analyze it using SciPy. Here’s the complete program with the data generation part added:
| |
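A sketch of such a program, assuming a fixed random seed for reproducibility (the exact numbers in the sample output below will therefore differ from any particular run):

import numpy as np
from scipy import stats

# Generate a year's worth of daily temperatures from a normal distribution
np.random.seed(42)
temperature_data = np.random.normal(loc=20, scale=5, size=365)

print("Skewness of the temperature data:", stats.skew(temperature_data))
print("Kurtosis of the temperature data:", stats.kurtosis(temperature_data))
print(f"Mean temperature: {np.mean(temperature_data):.2f}")
print(f"Median temperature: {np.median(temperature_data):.2f}")
print(f"Standard deviation of temperature: {np.std(temperature_data):.2f}")
print(f"Minimum temperature: {np.min(temperature_data):.2f}")
print(f"Maximum temperature: {np.max(temperature_data):.2f}")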
This program does the following:
1. Imports the necessary modules: stats from SciPy and numpy.
2. Generates a year's worth of temperature data (365 days) using a normal distribution with a mean of 20 and a standard deviation of 5.
3. Calculates and prints the skewness of the data.
4. Calculates and prints the kurtosis of the data.
5. Calculates and prints additional descriptive statistics.
When you run this program, you'll get output similar to this:
Skewness of the temperature data: 0.06763677184841275
Kurtosis of the temperature data: -0.14922666231694518
Mean temperature: 20.02
Median temperature: 19.97
Standard deviation of temperature: 4.96
Minimum temperature: 6.51
Maximum temperature: 34.74
Interpretation of the results:
1. Skewness: A value close to 0 indicates that the data is approximately symmetrically distributed around the mean.
2. Kurtosis: A negative value indicates that the distribution has lighter tails than a normal distribution, meaning there might be fewer extreme outliers.
3. The mean and median are very close, which is expected for a normal distribution.
4. The standard deviation is close to 5, which matches our input parameters.
5. The minimum and maximum temperatures give us an idea of the range of temperatures in this simulated data.
This program provides a good starting point for analyzing your temperature data. You can further expand on this by creating visualizations (like histograms or box plots) or performing more advanced statistical tests as needed.
Topic Overview and Actualization
主题概述与现实化
Today, we target duplicates and outliers to clean our data for more accurate analysis.
今天,我们将针对重复数据和异常值进行清理,以便进行更准确的分析。
Understanding Duplicates in Data
理解数据中的重复项
Let’s consider a dataset from a school containing students’ details. If a student’s information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.
考虑一个包含学生详细信息的学校数据集。如果一个学生的资讯出现多次,则被视为重复数据。重复数据会扭曲数据,导致统计数据不准确。
Python Tools for Handling Duplicates
用于处理重复数据的 Python 工具
pandas library provides efficient and easy-to-use functions for dealing with duplicates.
pandas 库提供了高效且易用的函数来处理重复数据。
| |
The duplicated() function flags duplicate rows:
duplicated() 函数标记重复行:
| |
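A small sketch of duplicated() in action; the student records below are invented, with the third row repeating the first:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Cara'],
    'Age': [10, 11, 10, 10],
})
print(df.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# dtype: bool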
A True in the output denotes a row in the DataFrame that repeats. Note that one of the repeating rows is marked as False, so that one copy is kept if we decide to drop all the duplicates.
输出结果中的 True 表示 DataFrame 中存在重复行。请注意,其中一行重复行被标记为 False,以便在决定删除所有重复项时保留一行。
The drop_duplicates() function helps to discard these duplicates:
drop_duplicates() 函数有助于去除这些重复项:
| |
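Continuing with the df sketched above:

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
#    Name  Age
# 0   Ann   10
# 1   Bob   11
# 3  Cara   10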
There are no more duplicates, cool!
没有重复了,太棒了
Understanding Outliers in Data
数据中的异常值解析
An outlier is a data point significantly different from others. In our dataset of primary school students’ ages, we might find an age like 98 — this would be an outlier.
离群值是指与其他数据点显著不同的数据点。在我们的小学生年龄数据集里,我们可能会发现像 98 岁这样的年龄,这就是一个离群值。
Identifying Outliers 识别异常值
Outliers can be detected visually using tools like box plots, scatter plots, or statistical methods such as Z-score or IQR. Let’s consider a data point that’s significantly different from the rest. We’ll use the IQR method for identifying outliers.
离群值可以使用箱线图、散点图等工具进行可视化检测,也可以使用 Z 分数或 IQR 等统计方法进行检测。让我们考虑一个与其他数据点显著不同的数据点。我们将使用 IQR 方法来识别离群值。
As a short reminder, we consider a value an outlier if it is either at least 1.5 * IQR less than Q1 (first quartile) or at least 1.5 * IQR greater than Q3 (third quartile).
简而言之,如果一个值小于 Q1(第一四分位数)至少 1.5 * IQR 或大于 Q3(第三四分位数)至少 1.5 * IQR,则我们将其视为异常值。
Python Tools for Handling Outliers
用于处理异常值的 Python 工具
Here’s how you can utilize the IQR method with pandas. Let’s start with defining the dataset of students’ scores:
以下是使用 pandas 应用 IQR 方法的方法。首先定义学生分数数据集:
| |
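A hypothetical scores DataFrame to work with; the lesson's actual values are not shown, so one obvious outlier (15) is included for illustration:

import pandas as pd

df = pd.DataFrame({'scores': [72, 75, 78, 80, 81, 83, 85, 88, 90, 15]})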
Now, compute Q1, Q3, and IQR:
现在,计算 Q1、Q3 和 IQR:
| |
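Continuing with the df above, one way to get the quartiles is the pandas quantile() method:

Q1 = df['scores'].quantile(0.25)
Q3 = df['scores'].quantile(0.75)
IQR = Q3 - Q1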
After that, we can define the lower and upper bounds and find outliers:
之后,我们可以定义上下界并找到异常值:
| |
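And the bounds plus the boolean selection, still using the df, Q1, Q3, and IQR from above:

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['scores'] < lower_bound) | (df['scores'] > upper_bound)]
print(outliers)   # with the hypothetical data, only the score 15 falls outside the bounds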
Handling Outliers: Removal
处理异常值:移除
Typically, there are two common strategies for dealing with outliers: remove them or replace them with a median value.
处理异常值通常有两种常见策略:移除或用中位数替换。
Removing outliers is the easiest method. However, it is usually not the best option, since you essentially throw away data. To apply it, let's reverse the condition to choose everything except outliers.
去除异常值是最简单的方法。但它通常不是最佳选择,因为这样实际上是把数据丢弃了。要应用它,让我们反转条件来选择除异常值之外的所有内容。
| |
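A sketch of the reversed condition, using the bounds computed above:

# Keep every row that is NOT an outlier
df_clean = df[(df['scores'] >= lower_bound) & (df['scores'] <= upper_bound)]
print(df_clean)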
Handling Outliers: Replacement
处理异常值:替换
The second strategy is replacing outliers with median values - they are less susceptible to outliers, so we can use them for replacement.
第二种策略是用中值替换异常值——中值不易受异常值的影响,因此我们可以用它们进行替换。
The easiest way to apply this replacement is to first replace outliers with np.nan and then use the fill method. It could lead to problems, as there could already be some missing values in the dataframe, which will also be filled.
最简单的应用这种替换方法是先将异常值替换为 np.nan,然后使用填充方法。这可能会导致问题,因为数据框中可能已经存在一些缺失值,这些缺失值也会被填充。
Instead, we could use the np.where function:
我们可以使用 np.where 函数:
| |
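A sketch of the replacement approach with np.where, continuing from the df and bounds above:

import numpy as np

median_score = df['scores'].median()
df['scores'] = np.where(
    (df['scores'] < lower_bound) | (df['scores'] > upper_bound),   # True where the value is an outlier
    median_score,    # used for outliers
    df['scores'],    # kept for everything else
)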
It works by choosing elements from df['scores'] if the condition is not met (i.e., the value is not an outlier) and from median otherwise. In other words, whenever this function meets an outlier, it will ignore it and use the median instead.
它通过以下方式工作:如果条件不满足(即,值不是异常值),则从 df['scores'] 中选择元素;否则,从中位数中选择元素。换句话说,每当此函数遇到异常值时,它将忽略该异常值,并使用中位数代替它。
Summary 摘要
We’ve covered what duplicates and outliers are, their impact on data analysis, and how to manage them. A clean dataset is a prerequisite for accurate data analysis. Now, it’s time to apply your skills to real-world data. Let’s dive into some practical exercises!
我们已经介绍了重复值和异常值的定义、它们对数据分析的影响以及如何处理它们。干净的数据集是准确数据分析的先决条件。现在,是时候将你的技能应用于真实世界的数据了。让我们开始一些实践练习吧!
Customizing Bag-of-Words Representation
Applying CountVectorizer on Sentences
Bag-of-Words Transformation on IMDB Reviews Dataset
Creating Bag-of-Words Representation Yourself
Turn Rich Text into Bag-of-Words Representation
Lesson 3: Implementing TF-IDF for Feature Engineering in Text Classification
Change TF-IDF Vector for Different Sentence
Implementing TF-IDF Vectorizer on Provided Text
Understanding Sparse Matrix Components
Applying TF-IDF Vectorizer On Reviews Dataset
Implementing TF-IDF Vectorizer from Scratch
Lesson 4: Efficient Text Data Representation with Sparse Matrices
Switching from CSC to CSR Representation
Creating a Coordinate Format Matrix with Duplicates
Performing Vectorized Operations on Sparse Matrices
Creating CSR Matrix from Larger Array
Performing Subtraction Operation on Sparse Matrix
Lesson 5: Applying TruncatedSVD for Dimensionality Reduction in NLP
Change TruncatedSVD Components Number
Implement Dimensionality Reduction with TruncatedSVD
Applying TruncatedSVD on Bag-of-Words Matrix
Implement TruncatedSVD on Bag-of-Words Matrix
Implementing TruncatedSVD on IMDB Movie Reviews Dataset

Lessons and practices
Lesson 1: Preprocessing Text Data: Train-Test Split and Stratified Cross-Validation
Implement Stratified Cross-Validation in Train-Test Split
Analyzing Spam and Ham Distribution in Train-Test Split
Exploring the Spam Dataset
Stratified Train-Test Split for Text Data
Stratified Train-Test Split and Class Distribution Analysis
Lesson 2: Mastering Text Classification with Naive Bayes in Python
Tuning Alpha Parameter in Naive Bayes Model
Fill in the Blanks: Building Naive Bayes Model
Fill in the Blanks: Predicting Using Naive Bayes Model
Visualize Naive Bayes Model Predictions
Evaluate Naive Bayes Model with Confusion Matrix
Lesson 3: Mastering Support Vector Machines for Effective Text Classification
Switching SVM Kernel to Polynomial
Building and Training a Linear SVM Classifier
Predicting and Evaluating with SVM Model
Training and Predicting with SVM Model
Complete SVM Text Classification Model from Scratch
Lesson 4: Decision Trees in NLP: Mastering Text Classification
Adjust Max Depth of Decision Tree Classifier
Implementing Decision Tree Classifier
Generate the Classification Report
Implementing and Visualizing Decision Tree Classifier
Building and Evaluating a Decision Tree Model
Lesson 5: Mastering Random Forest for Text Classification
Adjusting Parameters of RandomForest Classifier
Fill the Blanks in the RandomForestClassifier Script
Insert Code to Evaluate RandomForest Classifier
Creating and Training RandomForest Classifier
Train and Evaluate RandomForest Classifier
Lesson 6: Mastering Logistic Regression for Text Classification
Adjusting Regularization in Logistic Regression Model
Initialize and Train Logistic Regression Model
Prediction and Evaluation of Logistic Regression Model
Improving Logistic Regression Model with Regularization
Implementing Logistic Regression on Text Data

Lessons and practices
Lesson 1: Ensemble Methods in NLP: Mastering Bagging for Text Classification (https://learn.codesignal.com/preview/lessons/1794)
Exploring the Last Documents and Categories
Finding Documents with Specific Category Count
Implement Bagging Classifier and Evaluate Model Performance
Bagging Classifier with Different Parameters Evaluation
Text Classification Using Bagging Classifier
Lesson 2: Ensemble Methods in NLP: Mastering the Voting Classifier
Switch to Soft Voting in Classifier Ensemble
Implementing and Training a Voting Classifier
Incorporating Soft Voting in Ensemble Classifier Model
Creating the Voting Classifier Model
Lesson 3: Boosting Text Classification Power with Gradient Boosting Classifier
Tuning Learning Rate for Gradient Boosting Classifier
Implementing and Training a Gradient Boosting Classifier
Setting Learning Rate and Making Predictions with GradientBoostingClassifier
Building a Gradient Boosting Classifier Model
Implementation of Gradient Boosting Classifier
Lesson 4: Text Preprocessing for Deep Learning with TensorFlow
Adjusting Tokenizer Parameters
Tokenizer Text Processing Practice
Filling the Gaps in Text Preprocessing Code
Initiating the Tokenizer Process
Tokenizing Text Data with TensorFlow
Lesson 5: Understanding and Building Neural Networks for Text Classification
Improve Neural Network Performance with Additional Layer
Inserting the Missing Model Layer
Preparing the Tokenizer, Data, and Model
Extend the Neural Network Model
Creating and Training a Neural Network Model
Lesson 6: Mastering Text Classification with Simple RNNs in TensorFlow
Changing Activation Function in Dense Layer
Configuring SimpleRNN and Dense Layers in Model
Fill in the blanks: Building a Simple RNN with TensorFlow
Adding Layers to the RNN Model
Implement Simple RNN for Text Classification
Introduction 介绍
Today, we’re approaching data analysis from a new angle by applying filtering to grouped DataFrames. We will review DataFrame grouping and introduce filtering, illustrating these concepts with examples. By the end of this lesson, you will be equipped with the necessary skills to effectively group and filter data.
今天,我们通过对分组数据帧应用过滤从一个新的角度进行数据分析。我们将回顾 DataFrame 分组并介绍过滤,并通过示例说明这些概念。在本课程结束时,您将具备有效分组和过滤数据的必要技能。
Recap of Grouping in Pandas
Pandas 分组回顾
As a quick recap, pandas is a highly influential Python module for data analysis, with powerful classes such as DataFrames at its core. DataFrames are data tables, and you can group the data within them using the groupby() function. Here is an example of grouping data within a DataFrame by 'Product':
快速回顾一下, pandas是一个非常有影响力的 Python 数据分析模块,其核心有DataFrames等强大的类。 DataFrame 是数据表,您可以使用groupby()函数对其中的数据进行分组。以下是按'Product'对 DataFrame 中的数据进行分组的示例:
| |
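A small sketch of grouping by 'Product'; the table below is invented, apart from Pears appearing with quantities 40 and 70, which the later filtering example relies on:

import pandas as pd

df = pd.DataFrame({
    'Product': ['Apples', 'Pears', 'Apples', 'Pears'],
    'Quantity': [30, 40, 45, 70],
})

grouped = df.groupby('Product')
print(grouped['Quantity'].sum())
# Product
# Apples     75
# Pears     110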
Recap of Lambda Functions
Lambda 函数回顾
To filter grouped data, we will need functions. Let’s recall how to easily create and use them.
为了过滤分组数据,我们需要函数。让我们回顾一下如何轻松创建和使用它们。
In Python, lambda functions are small anonymous functions. They can take any number of arguments but only have one expression.
在 Python 中, lambda函数是小型匿名函数。它们可以接受任意数量的参数,但只有一个表达式。
Consider a situation where we use a function to calculate the total price after adding the sales tax. In a place where the sales tax is 10%, the function to calculate the total cost could look like:
考虑这样一种情况,我们使用函数来计算添加销售税后的总价。在销售税为 10% 的地方,计算总成本的函数可能如下所示:
Regular Function 常规功能
| |
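A regular function for the 10% sales-tax example might look like this:

def add_sales_tax(price):
    # add 10% sales tax to the given price
    return price * 1.10

print(add_sales_tax(100))   # the price with tax included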
Replacing the function with a compact lambda function is handy when it is simple and not used repeatedly. The syntax for lambda is lambda var: expression, where var is the function’s input variable and expression is what this function returns.
当函数简单且不重复使用时,用紧凑的 lambda 函数替换该函数会很方便。 lambda 的语法为lambda var: expression ,其中var是函数的输入变量,而expression是该函数的返回值。
The above add_sales_tax function can be replaced with a lambda function as follows:
上面的add_sales_tax函数可以用 lambda 函数替换,如下所示:
Lambda Function 拉姆达函数
| |
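The equivalent one-line lambda:

add_sales_tax = lambda price: price * 1.10
print(add_sales_tax(100))   # same result as the regular function above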
Lambda functions are handy when used inside other functions or as arguments in functions like filter(), map() etc.
Lambda 函数在其他函数内部使用或作为filter() 、 map()等函数中的参数时非常方便。
Example of a Boolean Lambda Function
布尔 Lambda 函数示例
A Boolean Lambda function always returns either True or False. Let’s imagine a case where we want to know whether a number is even or odd. We can easily accomplish this using a Boolean Lambda function.
布尔 Lambda 函数始终返回True或False 。让我们想象一个情况,我们想知道一个数字是偶数还是奇数。我们可以使用布尔 Lambda 函数轻松完成此任务。
Here’s how we can define such a function:
下面是我们如何定义这样一个函数:
| |
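A sketch of that Boolean lambda:

is_even = lambda num: num % 2 == 0
print(is_even(10))   # True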
The preceding two lines of code create a Boolean Lambda function named is_even. This function takes a number (named num) as an argument, divides it by 2, and then checks if the remainder is 0. It returns the condition’s value, either True or False.
前面两行代码创建一个名为is_even的布尔 Lambda 函数。该函数接受一个数字(名为num )作为参数,将其除以2 ,然后检查余数是否为0 。它返回条件的值, True或False 。
Boolean lambda functions are fantastic tools for quickly evaluating a condition. Their applications are broad, especially when you’re manipulating data with pandas. They can be used in various ways, including sorting, filtering, and mapping.
布尔 lambda 函数是快速评估条件的绝佳工具。它们的应用非常广泛,尤其是当您使用 pandas 操作数据时。它们可以以多种方式使用,包括排序、过滤和映射。
Filtering a Grouped DataFrame
过滤分组数据框
Boolean selection does not apply to grouped dataframes. Instead, we use the filter() function, which takes a boolean function as an argument. For instance, let’s keep products with a summary quantity greater than 90.
布尔选择不适用于分组数据框。相反,我们使用filter()函数,该函数采用布尔函数作为参数。例如,让我们保留汇总数量大于90产品。
| |
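Using the hypothetical df from the grouping sketch above, the filter step could look like this:

# Keep only the products whose total Quantity exceeds 90
filtered = df.groupby('Product').filter(lambda g: g['Quantity'].sum() > 90)
print(filtered)
#   Product  Quantity
# 1   Pears        40
# 3   Pears        70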
This command yields the rows from the grouped data where the sum of Quantity exceeds 90. Pears are included, as their summary quantity is 40 + 70 = 110.
此命令从分组数据中生成Quantity总和超过90行。梨也包含在内,因为其总计数量为40 + 70 = 110 。
Lesson Summary 课程总结
In summary, we have explored DataFrame grouping and data filtering, and how to apply these techniques in data analysis. Practice exercises will solidify this knowledge and enhance your confidence. So, let’s dive into some hands-on learning!
总而言之,我们探索了DataFrame 分组和数据过滤,以及如何在数据分析中应用这些技术。练习将巩固这些知识并增强您的信心。那么,让我们深入一些实践学习吧!
「Practice」
Hey, Space Voyager! I’ve prepared an exercise for you to analyze our supermarket sales data. 嘿,太空航行者!我为您准备了一个练习来分析我们的超市销售数据。 See how we group the data by Product before filtering to retain only the products with total sales exceeding 100 units across all supermarkets. 了解我们在过滤之前如何按Product对数据进行分组,以仅保留所有超市中总销售额超过100件的产品。 Press that Run button to get started!
| |
输出
| |
Certainly! I'd be happy to analyze the supermarket sales data for you. Let's go through the code and understand what it does step by step.
1. First, we import the pandas library, which is essential for data manipulation and analysis in Python.
2. We create a DataFrame called supermarket_sales with sample data. It has three columns:
- Product: Cereal, Milk, and Eggs
- Supermarket: Super A and Super B
- Sales: The number of units sold for each product in each supermarket
3. We use the groupby('Product') function to group the data by the 'Product' column. This creates a GroupBy object.
4. We then use the filter() function on the grouped data. The lambda function inside the filter checks if the sum of 'Sales' for each product across all supermarkets is greater than 100.
5. Finally, we print the filtered DataFrame.
Now, let's analyze the expected output:
- Cereal: Total sales = 25 + 60 = 85 (less than 100, will be filtered out)
- Milk: Total sales = 35 + 75 = 110 (greater than 100, will be included)
- Eggs: Total sales = 45 + 90 = 135 (greater than 100, will be included)
So, the output should show only the rows for Milk and Eggs, as they are the only products with total sales exceeding 100 units across both supermarkets. When you run the code, you should see something like this (filter keeps the rows in their original order):
  Product Supermarket  Sales
1    Milk     Super A     35
2    Eggs     Super A     45
4    Milk     Super B     75
5    Eggs     Super B     90
This output shows that the code successfully filtered out the Cereal product and retained only Milk and Eggs, which had total sales greater than 100 units across both supermarkets. This exercise demonstrates how to use grouping and filtering in pandas to analyze sales data and focus on products that meet specific criteria, in this case, those with high total sales across multiple locations.
好的,让我们用一个简单有趣的故事来解释这个概念,适合6岁小朋友理解: 想象一下,我们有一个神奇的玩具商店。这个商店里有三种玩具:谷物玩具、牛奶玩具和鸡蛋玩具。商店有两个分店,叫做"超级A"和"超级B"。 现在,我们想知道哪些玩具特别受欢迎。我们决定,如果一种玩具在两个分店总共卖出超过100个,我们就认为它很受欢迎。 我们有一个魔法机器人帮我们数一数: 1. 谷物玩具: - 在超级A卖了25个 - 在超级B卖了60个 - 总共卖了85个 (没有超过100个) 2. 牛奶玩具: - 在超级A卖了35个 - 在超级B卖了75个 - 总共卖了110个 (超过了100个!) 3. 鸡蛋玩具: - 在超级A卖了45个 - 在超级B卖了90个 - 总共卖了135个 (也超过了100个!) 机器人数完后告诉我们:“牛奶玩具和鸡蛋玩具很受欢迎,因为它们都卖出了超过100个!” 所以,我们知道了牛奶玩具和鸡蛋玩具是最受欢迎的玩具。谷物玩具虽然也卖得不错,但还没有超级受欢迎。 这就是我们如何找出最受欢迎的玩具的方法。我们把所有分店的销售数字加在一起,然后看哪些玩具总数超过了100个。这样我们就知道哪些玩具最受小朋友们的喜爱啦!
Lesson Introduction
Welcome to our exciting lesson! We shall embark on learning and mastering Hypothesis Testing using Python. It might sound complicated, but it’s like deciding if a toy is worth buying based on its reviews. We’ll focus on the T-test, a way to tell if two groups are different.
Python has useful tools, Scipy and Statsmodels, which help us do these tests quickly and accurately. By the end, you’ll understand Hypothesis Testing, know what a T-test is, and be able to do a T-test using Python. So, let’s start!
What is Hypothesis Testing?
A hypothesis is a guess about a group. For example, “adult males in the U.S. average 180 lbs.” In Hypothesis Testing, we try to prove or disprove these guesses using collected data.
Null Hypothesis (H0): The default claim we put on trial. For example, "adult males in the U.S. average 180 lbs."
Alternative Hypothesis (HA): The claim we're trying to support with evidence (e.g., "adult males in the U.S. do not average 180 lbs.").
Think of it like a courtroom trial. The null hypothesis is on trial, and the alternative hypothesis offers the evidence.
How Does a T-test Work?
Let’s understand the T-test better. It checks if the two groups’ mean values are truly different. It’s like testing if two pots of coffee are different temperatures due to one being under an AC vent or just by chance.
There are three main types of T-tests:
- One-sample T-test: “Does this coffee look like it came from a pot that averages 70 degrees?”
- Two-sample T-test: “Are men’s and women’s average weights different?”
- Paired-sample T-test: “Did people’s average stress levels change after using a meditation app for a month?”
A T-test gives us two values: the t-statistic and the p-value. The t-statistic represents the size of the difference relative to the variation in your sample data; put simply, the bigger the t-statistic, the greater the difference between the groups' mean values. The p-value is the probability of seeing a difference at least this large purely by chance if the null hypothesis were true. If the p-value is less than 0.05, we usually conclude that the difference is statistically significant and not due to randomness.
Performing T-tests in Python
Python has powerful tools, Scipy and Statsmodels, for Hypothesis Testing.
For example, to do a one-sample T-test in Python, we can use Scipy's ttest_1samp() function.
Let’s begin by assuming that the null hypothesis is that the mean age of users (provided as the ages array) equals 30. Therefore, the alternative hypothesis states that the mean age is not 30. Let’s illustrate how we can test this:
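A minimal sketch, assuming a small ages sample whose mean sits close to 30 (the course's actual array may differ):

Python
import numpy as np
from scipy import stats

# Hypothetical user ages with a mean very close to 30
ages = np.array([29, 31, 30, 28, 33, 32, 27, 30, 31, 29])

# H0: the mean age equals 30; HA: the mean age is not 30
t_stat, p_value = stats.ttest_1samp(ages, popmean=30)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # well above 0.05 for this sample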
In this case, we fail to reject the null hypothesis because the p-value is greater than 0.05 (the conventional cutoff). It means that we don’t have enough statistical evidence to claim that the mean age of users is different from 30.
Now let’s modify our numpy array to contain a normally distributed sample with a mean that differs from 30:
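A sketch with assumed parameters for np.random.normal; any mean clearly away from 30 should drive the p-value well below 0.05:

Python
import numpy as np
from scipy import stats

np.random.seed(42)  # for reproducibility
ages = np.random.normal(loc=35, scale=3, size=50)  # mean shifted away from 30

t_stat, p_value = stats.ttest_1samp(ages, popmean=30)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # very likely below 0.05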
We might reject the null hypothesis in this case as the p_value is less than 0.05. It suggests strong evidence against the null hypothesis, implying that the average age of users is significantly different from 30.
Two-Sample T-test
Imagine you want to test if two teams in your office work the same hours. After collecting data, you can use a two-sample T-test to find out.
- The null hypothesis is that the mean working hours of Team A is equal to the mean working hours of Team B.
- The alternative hypothesis is that the mean working hours of Team A is different from the mean working hours of Team B.
We will use the stats.ttest_ind function for the two-sample T-test. Here’s an example:
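A sketch with made-up working-hours samples for Team A and Team B (the real lesson data may differ); with means this far apart, the p-value comes out below 0.05:

Python
import numpy as np
from scipy import stats

np.random.seed(0)
team_a_hours = np.random.normal(loc=40, scale=2, size=40)  # Team A, roughly 40 hours
team_b_hours = np.random.normal(loc=45, scale=2, size=40)  # Team B, roughly 45 hours

t_stat, p_value = stats.ttest_ind(team_a_hours, team_b_hours)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # far below 0.05 for these samples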
The p-value is less than 0.05, so we can reject the null hypothesis, meaning we have sufficient statistical evidence to say that there’s a significant difference between the mean working hours of Teams A and B.
Summary
Well done! We’ve learned Hypothesis Testing, understood T-tests, and done a T-test in Python. T-tests are a helpful way to make decisions with data.
Now it’s time for you to practice. The more you practice, the better you’ll understand. Let’s dive into some data with Python!
「Practice」
Welcome back, Stellar Navigator! It appears as though your company has implemented a new project planning system.
Now, you’re looking to see if it has had any impact on the meeting hours of different teams. The provided code performs a T-test on the meeting hours for the management and developer teams to evaluate this.
No alterations are necessary. Just hit Run!
「Practice」
In this space mission, you’ll adjust the input data to see how it affects statistical evidence. Modify the parameters of np.random.normal to create a sample with a mean age significantly higher than 30. This will change the p-value and show if there’s a different conclusion in hypothesis testing.
lesson4
Hello, and welcome to today's lesson on n-grams! If you've ever wondered how language models or text classifiers can understand the context or sequence in text, it's usually thanks to today's hero: n-grams. In this lesson, we'll delve into the magic of n-grams and how essential they are in processing textual data. Specifically, we'll learn how to create n-grams from text data using Python, covering unigrams and bigrams.
What are n-grams?
In Natural Language Processing, when we analyze text, it’s often beneficial to consider not only individual words but sequences of words. This approach helps to grasp the context better. Here is where n-grams come in handy.
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The ’n’ stands for the number of words in the sequence. For instance, in “I love dogs,” a 1-gram (or unigram) is just one word, like “love.” A 2-gram (or bigram) would be a sequence of 2 words, like “I love” or “love dogs”.
N-grams help preserve the sequential information or context in text data, contributing significantly to many language models or text classifiers.
Preparing Data for n-Gram Creation
Before we can create n-grams, we need clean, structured text data. The text needs to be cleaned and preprocessed into a desirable format, after which it can be used for feature extraction or modeling.
Here's some already familiar code in which we clean our text, removing stop words and stemming the remaining words. The steps include lower-casing words, removing punctuation, dropping useless words (stopwords), and reducing all words to their base, or stemmed, form.
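A minimal cleaning sketch. The use of the 20 Newsgroups dataset and the cap of 100 documents follow the output discussed later in this lesson; the specific cleaning steps here (a regex, NLTK stopwords, Porter stemming) are one reasonable way to implement what's described above and may differ from the course's original code:

Python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.datasets import fetch_20newsgroups

nltk.download('stopwords', quiet=True)

# Fetch the dataset and keep only the first 100 records
newsgroups = fetch_20newsgroups(subset='train')
raw_texts = newsgroups.data[:100]

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    text = text.lower()                    # lower-case everything
    text = re.sub(r'[^a-z\s]', ' ', text)  # remove punctuation and digits
    words = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    return ' '.join(words)

cleaned_texts = [clean_text(doc) for doc in raw_texts]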
Creating n-grams with Python: Setting up the Vectorizer
Python’s sklearn library provides an accessible way to generate n-grams. The CountVectorizer class in the sklearn.feature_extraction.text module can convert a given text into its matrix representation and allows us to specify the type of n-grams we want.
Let’s set up our vectorizer as a preliminary step towards creating n-grams:
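A minimal setup sketch (the variable name vectorizer is just a convention reused in the rest of this lesson's examples):

Python
from sklearn.feature_extraction.text import CountVectorizer

# Generate both unigrams (n=1) and bigrams (n=2)
vectorizer = CountVectorizer(ngram_range=(1, 2))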
The ngram_range=(1, 2) parameter instructs our vectorizer to generate n-grams where n ranges from 1 to 2. So, the CountVectorizer will generate both unigrams and bigrams. If we wanted unigrams, bigrams, and trigrams, we could use ngram_range=(1, 3).
Creating n-grams with Python: Applying the Vectorizer
Now that we've set up our n-gram generating machine, let's use it on the real-world data we cleaned above.
Applying the vectorizer to our cleaned text data will create the n-grams:
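A sketch that reuses the names from the sketches above (cleaned_texts and vectorizer); depending on your scikit-learn version, the feature names come from get_feature_names_out() or the older get_feature_names():

Python
# Build the document-term matrix from the cleaned documents
X = vectorizer.fit_transform(cleaned_texts)
features = list(vectorizer.get_feature_names_out())

print(X.shape)            # (number of documents, number of unique n-grams)
print(features[100:111])  # a small slice of the generated n-grams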
The output of the above code will look something like this (the exact numbers depend on the preprocessing used):
(100, 16246)
['accid figur', 'accid worri', 'accomod', ...]
The shape of X is (100, 16246), indicating we have a high-dimensional feature space. The first number, 100, represents the number of documents or records in the dataset (here it's 100 because we limited our fetch to the first 100 records), whereas 16246 represents the number of unique n-grams, or features, created from those 100 documents.
By printing features[100:111] we get a glance into our features where each string represents an n-gram from our cleaned text data. The returned n-grams ['accid figur', 'accid worri', 'accomod', ...] include both unigrams (single words like accomod, account) and bigrams (two-word phrases like accid figur, accid worri).
As you can see, generating n-grams adds a new level of complexity to our analysis, as we now have multiple types of features or tokens - unigrams and bigrams. You can experiment with the ngram_range parameter in CountVectorizer to include trigrams or higher-level n-grams, depending on your specific context and requirements. Remember, each choice will have implications for the complexity and interpretability of your models, and it’s always a balance between the two.
Lesson Summary
Congratulations, you’ve finished today’s lesson on n-grams! We’ve explored what n-grams are and their importance in text classification. We then moved on to preparing data for creating n-grams before we dived into generating them using Python’s CountVectorizer class in the sklearn library.
Now, it’s time to get hands-on. Try generating trigrams or 4-grams from the same cleaned newsgroups data and notice the differences. Practicing these skills will not only reinforce the concepts learned in this lesson but also enable you to understand when and how much context is needed for certain tasks.
As always, happy learning!