How to Prepare for a Technical Data Engineer Interview
Sample SQL, Python and Data System Design Questions
I recently went through a number of interviews for Data Engineering roles. Most of them at the senior, individual contributor level. This is an aggregation of some of the questions I was asked throughout the technical rounds.
The core focus was of course SQL and Python with some data infrastructure and system design questions. Below is a breakdown of the type of technical questions asked as well sample questions and prompts.
- SQL: 45%
- Python: 35%
- Data Systems Design: 20%
SQL
Verbal:
- Explain the difference between a left join and an inner join
- What's the difference between Union vs. Union All?
- Explain what indexing is and it's purpose in a RDBMS
- When analyzing a slow query, what steps would you take to diagnose the issue?
Written:
Given an order and customer table, answer the following questions:
- Take both order and customer tables and produce the
order_amount
for allcustomer_ids
- Take the order table and sum the
order_amount
for each month and display byorder_type
. Eachorder_type
should be in its own column. - Take the order table and generate an
average_order_type_amount
, byorder_type
for each row.
Order Table:
customer_id | order_type | date_of_order | order_amount |
1 | A | 01/02/2021 | 250 |
2 | B | 02/12/2021 | 500 |
3 | C | 04/08/2021 | 600 |
4 | B | 03/05/2021 | 150 |
5 | A | 01/27/2021 | 230 |
6 | A | 03/14/2021 | 250 |
1 | B | 04/13/2021 | 405 |
Customer Table:
customer_id | name |
1 | Jasper |
2 | Will |
3 | Ben |
Python
Verbal:
- What's the difference between a list and a tuple?
- Explain what a dictionary is
- In pseudocode, explain a how to encode a string input to
$
if the character repeats or to^
if the character does not repeat.- For instance,
temperature
would encode to$$^^$$^$^$$
t
repeats twice and thus encoded to$
- For instance,
- What is the output of this function?
mydata = {'foo': 'foo', 'bar': 'bar'} def transform(data): local_data = data local_data['foo'] = 'baz' transform(mydata) print(mydata['foo'])
Written:
- Given a string as an input, write a function to replace all
#
hashtags with a-
and replace all&
ampersands with a+
- For example,
###&##
should return---+--
- For example,
Find the longest consecutive recurrence of an element in a list. Given this list:
[A ,C ,D, D, D, B, C, C, A, F, A, D, A, A,]
Print dictionary in alphabetic order by first name. Expected output
Anthony is in IT, Ben is in HR...
mydictionary = {'Jess': 'IT', 'Will': 'Sales', 'Anthony': 'IT', 'Ben': 'HR' }
Data Systems Design
Prompt Example:
Be prepared to talk about the architectural design of data infrastructure you have worked on, preferably acted as the lead engineer. Prepare architecture diagrams in advance, or you can draw one live as you explain the design.
- Explain the infrastructure and the principles that led to this architecture
- Explain the reasoning behind different design decisions and alternative solutions
- Explain further extensions of the project for future requests
- What was the overall impact the project had?
Questions:
- How would you design a scalable pipeline to ingest flat files?
- How would you design an API client to interact with a microservice?