spirosgyros.net

Mastering Python Code for Efficient Data Engineering Practices

Introduction to Effective Python Coding

If you're engaging with this guide, it's likely that you aspire to enhance your coding skills in data engineering. Here, I will outline my method for coding as a problem-solving tool.

A Bit About Me

I have over five years of experience as a Senior Data Engineer at Headspace.

Understanding Key Concepts

In this discussion, I will distinguish between declarative and non-declarative (imperative) code:

  • Imperative Code: This style spells out how to perform a task step by step, with each step depending on the state left by the previous one.
  • Declarative Code: In contrast, this style states what the desired outcome is and leaves the how to the compiler or runtime, which is then free to skip, merge, or reorder steps.
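As a rough illustration in Python, the same sum of squares can be written in both styles. The function names are made up for this sketch and do not come from any library:

```python
def sum_of_squares_imperative(numbers):
    """Imperative style: spell out each step and mutate an accumulator."""
    total = 0
    for n in numbers:
        total += n * n
    return total


def sum_of_squares_declarative(numbers):
    """Declarative style: state the result and let sum() drive the iteration."""
    return sum(n * n for n in numbers)


print(sum_of_squares_imperative([1, 2, 3]))   # 14
print(sum_of_squares_declarative([1, 2, 3]))  # 14
```

Both functions return the same value. The second one simply leaves the looping to a built-in.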

As David Farley elegantly put it, high-performance software is about accomplishing the maximum amount of work in the fewest steps. Modern compilers use various techniques to enhance code execution speed, so the more predictable your instructions are, the better the performance.

Below is a diagram illustrating how a compiler interacts with processing hardware. Remember, allowing the compiler to handle simpler, more predictable instructions can significantly enhance performance.

[Figure: Compiler interaction with processing hardware]

In summary, declarative code leaves room for the compiler or runtime to optimize, which can lead to improved performance. In Python, this mostly means handing loops to built-ins that are implemented in C.

Let's dive into some coding examples!

Identifying the First Non-Null Value

Imagine you have a list that could be empty or filled with millions of items, and you need to find the first non-null value:

# A small example list
small_list = [0, 1]

# A large list
large_list = list(range(1_000_000))

# A list containing nulls
list_with_nulls = [None, "", object()]

Expected Output: The function should return the first value that is not None, without treating 0 or empty strings as null. In the best case (the first item is non-null) it should run in O(1) time, and in the worst case (every item is null, or only the last one is non-null) in O(k), where k is the size of the list.
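The "no bias against 0 or empty strings" requirement rules out a plain truthiness check. A minimal sketch of the difference, using an illustrative list that is not part of the examples above:

```python
values = [None, 0, "", 3]

# Truthiness check: skips 0 and "" as well as None.
first_truthy = next((x for x in values if x), None)

# Identity check: skips only None.
first_non_null = next((x for x in values if x is not None), None)

print(first_truthy)    # 3
print(first_non_null)  # 0
```

This is why every solution that follows compares against None explicitly instead of relying on `if x`.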

Here's how it should work:

>> get_first_non_null([1, 2, 3])
1

>> get_first_non_null([None, 2, 3])
2

>> get_first_non_null([None, 0, 3])
0

>> get_first_non_null([None, None, None])
None

>> get_first_non_null([])
None

>> get_first_non_null([None, "", 1])
""

Solution Approaches

#### Solution 1: Using List Comprehension

This is a straightforward approach that many data engineers might use. It involves iterating through the list to find non-null values:

def get_first_non_null_list_comp(my_vals: list, default=None):
    """
    Get first non-null value using list comprehension.
    """
    filtered_vals = [x for x in my_vals if x is not None]
    if len(filtered_vals) > 0:
        return filtered_vals[0]
    else:
        return default

However, this method can be inefficient, especially for large lists, as it processes every element and consumes additional memory.
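The memory cost can be seen with a rough measurement. This is only a sketch: `sys.getsizeof` reports the size of the container object itself, not of the items it refers to, and exact numbers vary by Python version and platform.

```python
import sys

n = 1_000_000

# The list comprehension builds the whole filtered list up front.
as_list = [x for x in range(n) if x is not None]

# The generator expression only stores its own state until it is consumed.
as_generator = (x for x in range(n) if x is not None)

print(sys.getsizeof(as_list))       # grows with n
print(sys.getsizeof(as_generator))  # small and independent of n
```

The list grows with the input, while the generator object stays the same size no matter how long the input is.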

#### Solution 2: Loop Evaluation

We can modify the function to directly iterate through the list:

def get_first_non_null_loop(my_vals: list, default=None):
    """
    Get first non-null value using a loop.
    """
    for x in my_vals:
        if x is not None:
            return x
    return default

This method is more efficient because it returns as soon as it finds a non-null value, but it still has its drawbacks, including multiple return statements.

#### Solution 3: Utilizing a Generator

For a more advanced approach, we can use Python's built-in filter function, which returns a lazy iterator that behaves like a generator:

from operator import is_not
from functools import partial

def get_first_non_null_generator(my_vals: list, default=None):
    """
    Get first non-null value using a generator (via filter).
    """
    filtered_vals = filter(partial(is_not, None), my_vals)
    return next(filtered_vals, default)

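This method evaluates values lazily: filter yields items one at a time, and next stops at the first match, so no intermediate list is built and the rest of the input is never touched. In CPython, filter and is_not are also implemented in C, so the per-item loop runs outside Python bytecode.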
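One way to check the lazy behavior is to wrap the input in a small counting generator. The `counting` helper and the `counter` dictionary below are hypothetical, written only for this sketch:

```python
from functools import partial
from operator import is_not


def counting(values, counter):
    """Yield items unchanged while recording how many were pulled."""
    for v in values:
        counter["seen"] += 1
        yield v


counter = {"seen": 0}
data = [None, None, 7, 8, 9]

first = next(filter(partial(is_not, None), counting(data, counter)), None)

print(first)            # 7
print(counter["seen"])  # 3
```

Only the first three items are pulled from the input. The values 8 and 9 are never visited.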

For a deeper dive into vectorization and its benefits, consider this resource: [Vectorization: A Key Tool To Improve Performance On Modern CPUs](#).

Extending the Solution

To find the first non-null value in a dictionary across a specific set of keys, you can reuse the generator function and feed it a lazy map over the keys:

my_doc = {
    "field_1": "one",
    "field_2": "two"
}

# Fetching the first non-null value
result = get_first_non_null_generator(map(my_doc.get, ("field_1", "field_2")))

assert result == "one"
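Because dict.get returns None for a missing key, the same pattern also falls through absent fields. Here is a small sketch with a hypothetical `sparse_doc`. The function is repeated in compact form so the snippet runs on its own:

```python
from functools import partial
from operator import is_not


def get_first_non_null_generator(my_vals, default=None):
    """Compact copy of the generator-based function, repeated for self-containment."""
    return next(filter(partial(is_not, None), my_vals), default)


# Hypothetical document: "field_1" is absent and "field_2" is explicitly null.
sparse_doc = {"field_2": None, "field_3": "three"}
keys = ("field_1", "field_2", "field_3")

result = get_first_non_null_generator(map(sparse_doc.get, keys))
missing = get_first_non_null_generator(
    map(sparse_doc.get, ("field_1",)), default="n/a"
)

print(result)   # three
print(missing)  # n/a
```

A missing key and an explicit None are treated the same way here. If the distinction matters, it needs a different lookup.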

Advanced Use Case

For more complex scenarios, such as accessing class attributes dynamically, consider the following example:

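from typing import Any, Optional

class FieldAttributes:
    """
    Field attributes structure.
    """
    example: Any
    default: Any
    description: Optional[str]

    def __init__(self, example=None, default=None, description=None):
        self.example = example
        self.default = default
        self.description = description

class Field:
    """Named field. Assumed helper: the original does not show its definition."""

    def __init__(self, name: str, **attrs):
        self.name = name
        self.attrs = FieldAttributes(**attrs)

class UserData:
    """Class representing user data."""
    name = Field("user_name", example="Willy Wonka")
    country = Field("country", default="USA", example="Wonka-land")
    n_wonka_bars = Field("n_wonka_bars", default=0, example=11)
    has_golden_ticket = Field("has_golden_ticket", default=False)
    is_an_oompa_loompa = Field("is_an_oompa_loompa", description="Is this person an Oompa Loompa?")

    # Assumed: the original does not show how `fields` is populated.
    fields = (name, country, n_wonka_bars, has_golden_ticket, is_an_oompa_loompa)

# Create an example record: prefer `example`, then fall back to `default`
example_record = {
    k.name: get_first_non_null_generator(
        map(k.attrs.__getattribute__, ("example", "default"))
    )
    for k in UserData.fields
}

# Expected values, worked out from the field definitions above
expected_result = {
    "user_name": "Willy Wonka",
    "country": "Wonka-land",
    "n_wonka_bars": 11,
    "has_golden_ticket": False,
    "is_an_oompa_loompa": None,
}

assert example_record == expected_result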
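The attrgetter import suggests an alternative to calling `__getattribute__` directly. The following is a sketch of that variant, not the author's code. `SimpleNamespace` stands in for a field's attributes object, and `first_non_null` is a compact helper written for this example:

```python
from operator import attrgetter
from types import SimpleNamespace


def first_non_null(values, default=None):
    """Return the first item that is not None, or `default`."""
    return next((v for v in values if v is not None), default)


# Hypothetical stand-in for a field's attributes object.
attrs = SimpleNamespace(example=None, default=0, description=None)

# With several names, attrgetter returns a tuple of the values, in order.
get_candidates = attrgetter("example", "default")
candidates = get_candidates(attrs)

value = first_non_null(candidates)

print(candidates)  # (None, 0)
print(value)       # 0
```

Unlike the map-based version, attrgetter reads every listed attribute up front. That is fine for two cheap attributes, but it gives up the lazy evaluation.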

Conclusion

In summary, seasoned data engineers strive to write code that is not only efficient but also easy to read and maintain. The get_first_non_null_generator() function is a good example of this balance: it is concise, readable, and stops as soon as it finds a result.

In the next section, we'll explore how to apply Python fundamentals in data engineering contexts.
