Introduction to PySpark Part 2 – Selecting, Filtering and Sorting Data

This is the second part in a series of blog posts as an introduction to PySpark.

The other parts of this blog post series can be found here:

In this part I’ll be covering:

Selecting Columns
Filtering Rows
- Chaining Conditions
- String Matching
- Matching against a list
- Not (~)
- Filtering on Nulls
- Dropping Duplicates
Sorting Data

Selecting Columns

We can use the select method of our DataFrame to select on the columns we’re interested in.

If we were to load the pokemon parquet file we looked at in the previous section, we can then select only the columns we’re interested in and because the commands are evaluated lazily, we’ll only end up loading in the columns we’re interested in when we load from the parquet file.

pokemon = spark.read.parquet('/mnt/tmp/pokemon.parquet')
pokemon = pokemon.select('#', 'Name', 'Type1', 'Type2', 'Generation')
display(pokemon.limit(10))

#	Name	Type1	Type2	Generation
1	Bulbasaur	Grass	Poison	1
2	Ivysaur	Grass	Poison	1
3	Venusaur	Grass	Poison	1
3	VenusaurMega Venusaur	Grass	Poison	1
4	Charmander	Fire	null	1
5	Charmeleon	Fire	null	1
6	Charizard	Fire	Flying	1
6	CharizardMega Charizard X	Fire	Dragon	1
6	CharizardMega Charizard Y	Fire	Flying	1
7	Squirtle	Water	null	1

We can also match columns using the colRegex method of a DataFrame, which will give us the columns that match a regular expression, let’s try it out:

display(
    pokemon.select(
        pokemon.colRegex('`^T[a-z]*[1-9]$`')
    ).limit(5)
)

Type1	Type2
Grass	Poison
Grass	Poison
Grass	Poison
Grass	Poison
Fire	null

Filtering Rows

We can filter rows in a similar manner to how we do in pandas, if we want only the second generation pokemon for example:

second_gen_pokemon = pokemon[pokemon['Generation'] == 2]

display(second_gen_pokemon.limit(10))

#	Name	Type1	Type2	Generation
152	Chikorita	Grass	null	2
153	Bayleef	Grass	null	2
154	Meganium	Grass	null	2
155	Cyndaquil	Fire	null	2
156	Quilava	Fire	null	2
157	Typhlosion	Fire	null	2
158	Totodile	Water	null	2
159	Croconaw	Water	null	2
160	Feraligatr	Water	null	2
161	Sentret	Normal	null	2

Chaining Conditions

We can chain conditions using & or |, so if we wanted to select all the "Ground" type pokemon:

ground_type_pokemon =  pokemon[
    (pokemon['Type1'] == 'Ground')
    | (pokemon['Type2'] == 'Ground')
]

display(ground_type_pokemon.limit(5))

#	Name	Type1	Type2	Generation
27	Sandshrew	Ground	null	1
28	Sandslash	Ground	null	1
31	Nidoqueen	Poison	Ground	1
34	Nidoking	Poison	Ground	1
50	Diglett	Ground	null	1

String Matching

We can use startswith, endswith, or contains to filter on strings.

display(pokemon[pokemon['Name'].startswith('Nido')])

#	Name	Type1	Type2	Generation
29	Nidoran♀	Poison	null	1
30	Nidorina	Poison	null	1
31	Nidoqueen	Poison	Ground	1
32	Nidoran♂	Poison	null	1
33	Nidorino	Poison	null	1
34	Nidoking	Poison	Ground	1

We can also use like and rlike, where rlike is used to match on a regular expression – let’s look for Pokemon whose names begin with “R” and end with “n”:

display(
    pokemon[
        pokemon['Name'].rlike('^[Rr].+n$')
    ]
)

#	Name	Type1	Type2	Generation
111	Rhyhorn	Ground	Rock	1
112	Rhydon	Ground	Rock	1

Matching against a list

We can filter based on a list of values using isin

display(
    pokemon[
        pokemon['#'].isin([121, 131, 141, 151])
    ]
)

#	Name	Type1	Type2	Generation
121	Starmie	Water	Psychic	1
131	Lapras	Water	Ice	1
141	Kabutops	Rock	Water	1
151	Mew	Psychic	null	1

NOT (~)

We can get the opposite of any of the filters above by using the a tilde ~ symbol.

display(
    pokemon[
        ~(pokemon['Type1'].isin(['Grass', 'Fire', 'Water', 'Bug']))
        & ~(pokemon['Generation'] == 1)
    ].limit(10)
)

#	Name	Type1	Type2	Generation
161	Sentret	Normal	null	2
162	Furret	Normal	null	2
163	Hoothoot	Normal	Flying	2
164	Noctowl	Normal	Flying	2
169	Crobat	Poison	Flying	2
172	Pichu	Electric	null	2
173	Cleffa	Fairy	null	2
174	Igglybuff	Normal	Fairy	2
175	Togepi	Fairy	null	2
176	Togetic	Fairy	Flying	2

Filtering on Nulls

We can filter on null values using isNull so if we only want to look for Pokemon with a single type, we can do:

single_type_pokemon = pokemon[pokemon['Type2'].isNull()]

display(single_type_pokemon.limit(5))

#	Name	Type1	Type2	Generation
4	Charmander	Fire	null	1
5	Charmeleon	Fire	null	1
7	Squirtle	Water	null	1
8	Wartortle	Water	null	1
9	Blastoise	Water	null	1

Or conversely if we only wanted Pokemon with a Type1 and Type2, we just use our `~“:

dual_type_pokemon = pokemon[~pokemon['Type2'].isNull()]

display(dual_type_pokemon.limit(5))

#	Name	Type1	Type2	Generation
1	Bulbasaur	Grass	Poison	1
2	Ivysaur	Grass	Poison	1
3	Venusaur	Grass	Poison	1
3	VenusaurMega Venusaur	Grass	Poison	1
6	Charizard	Fire	Flying	1

Dropping Duplicate Rows

You may have noticed that sometimes we have Pokemon number (#) duplicated because we have the regular version of that Pokemon and the “Mega” version, we can drop the duplicate numbers using the dropDuplicates method of a DataFrame. If we wanted to just drop any duplicated rows, we don’t need to provide a subset of columns, but by providing a subset of columns we can remove rows that have duplicates in any single/combination of columns.

display(pokemon.dropDuplicates(subset=['#']).limit(10))

#	Name	Type1	Type2	Generation
1	Bulbasaur	Grass	Poison	1
2	Ivysaur	Grass	Poison	1
3	Venusaur	Grass	Poison	1
4	Charmander	Fire	null	1
5	Charmeleon	Fire	null	1
6	Charizard	Fire	Flying	1
7	Squirtle	Water	null	1
8	Wartortle	Water	null	1
9	Blastoise	Water	null	1
10	Caterpie	Bug	null	1

Note that now we no longer have the MegaVenusaur or MegaCharizard in our data.

Sorting Data

At the moment our DataFrame is sorted by the Pokemon’s number (#), if we load in the data again, we can choose to sort by some other attribute using sort or orderBy:

pokemon = spark.read.parquet('/mnt/tmp/pokemon.parquet')
pokemon = pokemon.sort('HP', ascending=False)
display(pokemon.limit(10))

#	Name	Type1	Type2	HP	Attack	Defense	Sp.Atk	Sp.Def	Speed	Generation	Legendary
242	Blissey	Normal	null	255	10	10	75	135.0	55	2	false
113	Chansey	Normal	null	250	5	5	35	105.0	50	1	false
202	Wobbuffet	Psychic	null	190	33	58	33	58.0	33	2	false
321	Wailord	Water	null	170	90	45	90	45.0	60	3	false
594	Alomomola	Water	null	165	75	80	40	45.0	65	5	false
143	Snorlax	Normal	null	160	110	65	65	110.0	30	1	false
289	Slaking	Normal	null	150	160	100	95	65.0	100	3	false
426	Drifblim	Ghost	Flying	150	80	44	90	54.0	80	4	false
487	GiratinaAltered Forme	Ghost	Dragon	150	100	120	100	120.0	90	4	true
487	GiratinaOrigin Forme	Ghost	Dragon	150	120	100	120	100.0	90	4	true

There are multiple pokemon that have 150 HP, if we wanted to order by "HP" first and "Attack" second but order by Attack in ascending order, we can do that as follows:

pokemon = pokemon.sort(['HP', 'Attack'], ascending=[0, 1])
display(pokemon.limit(10))

#	Name	Type1	Type2	HP	Attack	Defense	Sp.Atk	Sp.Def	Speed	Generation	Legendary
242	Blissey	Normal	null	255	10	10	75	135.0	55	2	false
113	Chansey	Normal	null	250	5	5	35	105.0	50	1	false
202	Wobbuffet	Psychic	null	190	33	58	33	58.0	33	2	false
321	Wailord	Water	null	170	90	45	90	45.0	60	3	false
594	Alomomola	Water	null	165	75	80	40	45.0	65	5	false
143	Snorlax	Normal	null	160	110	65	65	110.0	30	1	false
426	Drifblim	Ghost	Flying	150	80	44	90	54.0	80	4	false
487	GiratinaAltered Forme	Ghost	Dragon	150	100	120	100	120.0	90	4	true
487	GiratinaOrigin Forme	Ghost	Dragon	150	120	100	120	100.0	90	4	true
289	Slaking	Normal	null	150	160	100	95	65.0	100	3	false