Data Wrangling Course

James Howison's Data Wrangling course from the Information School at the University of Texas at Austin.

Parameterized queries allow us to insert variables into our SQL queries. This is analogous to strings and .format.

Parameterized queries have two important functions:

  1. They “sanitize” inputs from users to protect against malicious queries reaching the database.
  2. They simplify re-using queries inside loops.

The relevant screencast is here MySQL and placeholders/parameterized queries:

The full code used in the screencast is at the bottom of the page, but below I build it up step by step.

Using placeholders in queries.

Sometimes we have values in strings and ints in python that we want to use within queries. We might have gotten input from the terminal, or a web-form. Very often we’re working with data we’ve read from a CSV file and are inserting into the database.

In these cases we can’t just work with hard-coded SQL queries (such as SELECT * FROM venues) but we have to put data into the queries. This is very similar to interpolating variables into strings, as we did with the Michael Finegan poem, but there are a few differences. Whether the data are typed in by someone or come from a CSV file, we have to treat them as potentially dangerous, because they could have special characters that could do mischief in the database (see below).

For interpolating strings we’ve used + and we’ve used {} and .format. For sql neither are sufficient, since if we’re inserting a number we don’t need quotes but if we’re inserting a string we do. Possibly more importantly, we also have to make sure we’re not inserting strings that might hold SQL that tries to cause problems. One of the most common security flaws is called an “SQL Injection Attack” and it can be avoided through parameterized queries. See XKCD’s excellent comic which explains this.

In the first example we are working with just one input to interpolate (aka stick into the query). We set up a dictionary (called param_dict) with keys, then use those keys in the placeholder and pass both the placeholder and the param_dict to cursor.execute(). This is very similar to using the **row in .format, although there is a slight difference in the syntax.

# after connecting to SQL etc.

min_capacity = input("What is the minimum capacity needed? ")

param_dict = { "min_cap": min_capacity }

# The placeholder format is %(keyname)s where keyname is the
# appropriate key in the param_dict. This differs from .format
# where the placeholder would have been {keyname}
sql_placeholder = "SELECT * FROM venues WHERE capacity >= %(min_cap)s"

# Then we pass the param_dict along with the query
# string to cursor.execute
cursor.execute(sql_placeholder, param_dict)

results = cursor.fetchall()

pprint.pprint(results)

We can also use this with multiple keys and it is convenient to INSERT data.

# Let's say we want to insert a new venue ("Rocking new venue" with capacity 10,000)
new_venue_params = {"new_name": "Rocking new venue",
                    "new_capacity": 10000 }

new_venue_sql = "INSERT INTO venues(name,capacity) VALUE (%(new_name)s, %(new_capacity)s)"

cursor.execute(new_venue_sql, new_venue_params)


#See any changes:
get_venues_sql = "SELECT * FROM venues"
cursor.execute(get_venues_sql)
results = cursor.fetchall()
print(results)

Where this becomes really useful is substituting data drawn from different rows of a CSV file. Each row from the CSV will come in as a dictionary, so we use the names of the keys as placeholders in our SQL query.

In the example below we’re working with list of dictionaries, using a single parameterized SQL query to insert each of the rows into the database. The id is assigned by AUTO-INCREMENT.

venues = [{'capacity': 700, 'name': 'AMD'},
{'capacity': 2000, 'name': 'Honda'},
{'capacity': 2300, 'name': 'Austin Kiddie Limits'},
{'capacity': 2000, 'name': 'Austin Ventures'}]

sql_placeholder = "INSERT INTO venues(names, capacity) VALUE (%(name)s, %(capacity)s)"

for row in venues:
    cursor.execute(sql_placeholder, row)

We define the string once, but execute using it one time per row.

The rest of this is just for interest, don’t bother with it unless you are curious

While I think the param_dict style of parameterization (also called “pyformat” parameters) is easiest to use consistently there are two other forms you might encounter. The first uses %s on its own for each parameter. Somewhat confusingly it is %s for each field, regardless of whether it is a string, an int, or another datatype. This style of parameter is called “format” sql = “UPDATE performances SET venue_id = %s WHERE id = %s”

You then pass the variables in the right order to the second part of the cursor.execute function. Somewhat confusingly you have to surround the list of variables in a set of round parentheses. If you are interested this is another data type called a “tuple”. cursor.execute(sql, ( new_venue_id, 40 ) )

The other style is just to manually create the string with .format string interpolation. In addition to being insecure, this is difficult because you have to remember when to add quotes around strings or not around ints etc. sql = “UPDATE performances SET venue_id = {} WHERE id = {}”.format(new_venue_id, 40) cursor.execute(sql)

If you have security concerns (e.g., especially if the data you are inserting comes from a web form), and for some reason you don’t want to use the %s or %(dict_key)s (format or pyformat) parameterizations, then you can “de-fang” the content of strings using connection.escape. See http://stackoverflow.com/questions/11363335/how-can-i-escape-the-input-to-a-mysql-db-in-python3solved new_venue_id_escaped = connection.escape(new_venue_id) perf_id_escaped = connection.escape(perf_id) sql = “UPDATE performances SET venue_id = {} WHERE id = {}”.format(new_venue_id_escaped, perf_id_escaped) cursor.execute(sql) ```