Introduction to PySpark Part 2 – Selecting, Filtering and Sorting Data
This is the second part in a series of blog posts as an introduction to PySpark.
The other parts of this blog post series can be found here:
- Part 1 – Creating Data Frames and Reading Data from Files
- Part 2 – Selecting, Filtering and Sorting Data
- Part 3 – Adding, Updating and Removing Columns
- Part 4 – Summarising Data
- Part 5 – Aggregating Data
In this part I’ll be covering:
- Selecting Columns
- Filtering Rows
- Chaining Conditions
- String Matching
- Matching against a list
- Not (~)
- Filtering on Nulls
- Dropping Duplicates
- Sorting Data
Selecting Columns
We can use the select
method of our DataFrame
to select on the columns we’re interested in.
If we were to load the pokemon
parquet file we looked at in the previous section, we can then select only the columns we’re interested in and because the commands are evaluated lazily, we’ll only end up loading in the columns we’re interested in when we load from the parquet file.
pokemon = spark.read.parquet('/mnt/tmp/pokemon.parquet')
pokemon = pokemon.select('#', 'Name', 'Type1', 'Type2', 'Generation')
display(pokemon.limit(10))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
1 | Bulbasaur | Grass | Poison | 1 |
2 | Ivysaur | Grass | Poison | 1 |
3 | Venusaur | Grass | Poison | 1 |
3 | VenusaurMega Venusaur | Grass | Poison | 1 |
4 | Charmander | Fire | null | 1 |
5 | Charmeleon | Fire | null | 1 |
6 | Charizard | Fire | Flying | 1 |
6 | CharizardMega Charizard X | Fire | Dragon | 1 |
6 | CharizardMega Charizard Y | Fire | Flying | 1 |
7 | Squirtle | Water | null | 1 |
We can also match columns using the colRegex
method of a DataFrame, which will give us the columns that match a regular expression, let’s try it out:
display(
pokemon.select(
pokemon.colRegex('`^T[a-z]*[1-9]$`')
).limit(5)
)
Type1 | Type2 |
---|---|
Grass | Poison |
Grass | Poison |
Grass | Poison |
Grass | Poison |
Fire | null |
Filtering Rows
We can filter rows in a similar manner to how we do in pandas, if we want only the second generation pokemon for example:
second_gen_pokemon = pokemon[pokemon['Generation'] == 2]
display(second_gen_pokemon.limit(10))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
152 | Chikorita | Grass | null | 2 |
153 | Bayleef | Grass | null | 2 |
154 | Meganium | Grass | null | 2 |
155 | Cyndaquil | Fire | null | 2 |
156 | Quilava | Fire | null | 2 |
157 | Typhlosion | Fire | null | 2 |
158 | Totodile | Water | null | 2 |
159 | Croconaw | Water | null | 2 |
160 | Feraligatr | Water | null | 2 |
161 | Sentret | Normal | null | 2 |
Chaining Conditions
We can chain conditions using &
or |
, so if we wanted to select all the "Ground"
type pokemon:
ground_type_pokemon = pokemon[
(pokemon['Type1'] == 'Ground')
| (pokemon['Type2'] == 'Ground')
]
display(ground_type_pokemon.limit(5))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
27 | Sandshrew | Ground | null | 1 |
28 | Sandslash | Ground | null | 1 |
31 | Nidoqueen | Poison | Ground | 1 |
34 | Nidoking | Poison | Ground | 1 |
50 | Diglett | Ground | null | 1 |
String Matching
We can use startswith
, endswith
, or contains
to filter on strings.
display(pokemon[pokemon['Name'].startswith('Nido')])
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
29 | Nidoran♀ | Poison | null | 1 |
30 | Nidorina | Poison | null | 1 |
31 | Nidoqueen | Poison | Ground | 1 |
32 | Nidoran♂ | Poison | null | 1 |
33 | Nidorino | Poison | null | 1 |
34 | Nidoking | Poison | Ground | 1 |
We can also use like
and rlike
, where rlike
is used to match on a regular expression – let’s look for Pokemon whose names begin with “R” and end with “n”:
display(
pokemon[
pokemon['Name'].rlike('^[Rr].+n$')
]
)
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
111 | Rhyhorn | Ground | Rock | 1 |
112 | Rhydon | Ground | Rock | 1 |
Matching against a list
We can filter based on a list of values using isin
display(
pokemon[
pokemon['#'].isin([121, 131, 141, 151])
]
)
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
121 | Starmie | Water | Psychic | 1 |
131 | Lapras | Water | Ice | 1 |
141 | Kabutops | Rock | Water | 1 |
151 | Mew | Psychic | null | 1 |
NOT (~)
We can get the opposite of any of the filters above by using the a tilde ~
symbol.
display(
pokemon[
~(pokemon['Type1'].isin(['Grass', 'Fire', 'Water', 'Bug']))
& ~(pokemon['Generation'] == 1)
].limit(10)
)
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
161 | Sentret | Normal | null | 2 |
162 | Furret | Normal | null | 2 |
163 | Hoothoot | Normal | Flying | 2 |
164 | Noctowl | Normal | Flying | 2 |
169 | Crobat | Poison | Flying | 2 |
172 | Pichu | Electric | null | 2 |
173 | Cleffa | Fairy | null | 2 |
174 | Igglybuff | Normal | Fairy | 2 |
175 | Togepi | Fairy | null | 2 |
176 | Togetic | Fairy | Flying | 2 |
Filtering on Nulls
We can filter on null values using isNull
so if we only want to look for Pokemon with a single type, we can do:
single_type_pokemon = pokemon[pokemon['Type2'].isNull()]
display(single_type_pokemon.limit(5))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
4 | Charmander | Fire | null | 1 |
5 | Charmeleon | Fire | null | 1 |
7 | Squirtle | Water | null | 1 |
8 | Wartortle | Water | null | 1 |
9 | Blastoise | Water | null | 1 |
Or conversely if we only wanted Pokemon with a Type1 and Type2, we just use our `~“:
dual_type_pokemon = pokemon[~pokemon['Type2'].isNull()]
display(dual_type_pokemon.limit(5))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
1 | Bulbasaur | Grass | Poison | 1 |
2 | Ivysaur | Grass | Poison | 1 |
3 | Venusaur | Grass | Poison | 1 |
3 | VenusaurMega Venusaur | Grass | Poison | 1 |
6 | Charizard | Fire | Flying | 1 |
Dropping Duplicate Rows
You may have noticed that sometimes we have Pokemon number (#
) duplicated because we have the regular version of that Pokemon and the “Mega” version, we can drop the duplicate numbers using the dropDuplicates
method of a DataFrame. If we wanted to just drop any duplicated rows, we don’t need to provide a subset of columns, but by providing a subset of columns we can remove rows that have duplicates in any single/combination of columns.
display(pokemon.dropDuplicates(subset=['#']).limit(10))
# | Name | Type1 | Type2 | Generation |
---|---|---|---|---|
1 | Bulbasaur | Grass | Poison | 1 |
2 | Ivysaur | Grass | Poison | 1 |
3 | Venusaur | Grass | Poison | 1 |
4 | Charmander | Fire | null | 1 |
5 | Charmeleon | Fire | null | 1 |
6 | Charizard | Fire | Flying | 1 |
7 | Squirtle | Water | null | 1 |
8 | Wartortle | Water | null | 1 |
9 | Blastoise | Water | null | 1 |
10 | Caterpie | Bug | null | 1 |
Note that now we no longer have the MegaVenusaur or MegaCharizard in our data.
Sorting Data
At the moment our DataFrame is sorted by the Pokemon’s number (#
), if we load in the data again, we can choose to sort by some other attribute using sort
or orderBy
:
pokemon = spark.read.parquet('/mnt/tmp/pokemon.parquet')
pokemon = pokemon.sort('HP', ascending=False)
display(pokemon.limit(10))
# | Name | Type1 | Type2 | HP | Attack | Defense | Sp.Atk | Sp.Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|
242 | Blissey | Normal | null | 255 | 10 | 10 | 75 | 135.0 | 55 | 2 | false |
113 | Chansey | Normal | null | 250 | 5 | 5 | 35 | 105.0 | 50 | 1 | false |
202 | Wobbuffet | Psychic | null | 190 | 33 | 58 | 33 | 58.0 | 33 | 2 | false |
321 | Wailord | Water | null | 170 | 90 | 45 | 90 | 45.0 | 60 | 3 | false |
594 | Alomomola | Water | null | 165 | 75 | 80 | 40 | 45.0 | 65 | 5 | false |
143 | Snorlax | Normal | null | 160 | 110 | 65 | 65 | 110.0 | 30 | 1 | false |
289 | Slaking | Normal | null | 150 | 160 | 100 | 95 | 65.0 | 100 | 3 | false |
426 | Drifblim | Ghost | Flying | 150 | 80 | 44 | 90 | 54.0 | 80 | 4 | false |
487 | GiratinaAltered Forme | Ghost | Dragon | 150 | 100 | 120 | 100 | 120.0 | 90 | 4 | true |
487 | GiratinaOrigin Forme | Ghost | Dragon | 150 | 120 | 100 | 120 | 100.0 | 90 | 4 | true |
There are multiple pokemon that have 150 HP, if we wanted to order by "HP"
first and "Attack"
second but order by Attack in ascending order, we can do that as follows:
pokemon = pokemon.sort(['HP', 'Attack'], ascending=[0, 1])
display(pokemon.limit(10))
# | Name | Type1 | Type2 | HP | Attack | Defense | Sp.Atk | Sp.Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|
242 | Blissey | Normal | null | 255 | 10 | 10 | 75 | 135.0 | 55 | 2 | false |
113 | Chansey | Normal | null | 250 | 5 | 5 | 35 | 105.0 | 50 | 1 | false |
202 | Wobbuffet | Psychic | null | 190 | 33 | 58 | 33 | 58.0 | 33 | 2 | false |
321 | Wailord | Water | null | 170 | 90 | 45 | 90 | 45.0 | 60 | 3 | false |
594 | Alomomola | Water | null | 165 | 75 | 80 | 40 | 45.0 | 65 | 5 | false |
143 | Snorlax | Normal | null | 160 | 110 | 65 | 65 | 110.0 | 30 | 1 | false |
426 | Drifblim | Ghost | Flying | 150 | 80 | 44 | 90 | 54.0 | 80 | 4 | false |
487 | GiratinaAltered Forme | Ghost | Dragon | 150 | 100 | 120 | 100 | 120.0 | 90 | 4 | true |
487 | GiratinaOrigin Forme | Ghost | Dragon | 150 | 120 | 100 | 120 | 100.0 | 90 | 4 | true |
289 | Slaking | Normal | null | 150 | 160 | 100 | 95 | 65.0 | 100 | 3 | false |