AI-Based Data Transformation: Llama-3-70B’s Take on PySpark Code

Welcome back to our comparative series, where we dive into the performance of various large language models (LLMs) in generating PySpark code for data transformations. After reviewing GPT-4's capabilities in our last post, we now turn our attention to Llama-3-70B, a model developed by Meta. Instead of deploying the model on our own servers, we are using a pre-deployed model on the Groq LPU Inference Engine.

Introduction

For this installment, we focus on Llama-3-70B. Notably, since Llama-3 would otherwise need to be deployed on our own server, and Meta's AI chat has a training-corpus cutoff of December 2022, we used Llama-3-70B running on the Groq LPU Inference Engine to ensure top performance and responsiveness.

Preselected Data and Use Case

As with our previous analysis, we continue to work with ERCOT's 2-Day Ancillary Services Reports. This dataset includes:

  • NP3-959-EX: 48-hour Aggregate AS Offers

  • NP3-960-EX: 48-hour Self-Arranged AS

  • NP3-961-EX: 48-hour Cleared DAM AS

Our task involves processing these data files and loading them into an MS SQL database on Azure.

Prompt Used

We have a folder with Ercot's "2-Day Ancillary Services Reports". This report contains all 48 Hour disclosure data related to DAM. The following individual files are included in the report: NP3-959-EX 48-hour Aggregate AS Offers; NP3-960-EX 48-hour Self-Arranged AS; NP3-961-EX 48-hour Cleared DAM AS (previously named 48 Hour Ancillary Services Reports).

We gathered daily data as a zipped file containing multiple CSVs inside the Zipped file for the month of April. I have added all the zipped files in a master folder. So now there is a Master folder with a 30 days zipped file. Each zipped file has multiple CSV with data listed above.

Can you give me step by step procedure to do the following:

1) Ingest the CSV file within zipped folders using PySpark

2) filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS

3) Ingest the file in the SQL database that I recently set up on my Azure.

Llama-3-70B's PySpark Answer

Here is a step-by-step procedure to achieve the tasks you mentioned:

Step 1: Ingest the CSV files using PySpark

  1. Install PySpark and its dependencies using pip:
pip install pyspark
  2. Then, follow these steps:
from pyspark.sql import SparkSession
import zipfile

# Create a SparkSession
spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()

# Define the master folder path
master_folder_path = "/path/to/master/folder"

# Iterate over each zipped file in the master folder
for file in os.listdir(master_folder_path):
    if file.endswith(".zip"):
        zip_file_path = os.path.join(master_folder_path, file)

        # Extract the zipped file
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            zip_ref.extractall(path=master_folder_path)

        # Iterate over each CSV file extracted from the zipped file
        for csv_file in os.listdir(master_folder_path):
            if csv_file.endswith(".csv"):
                csv_file_path = os.path.join(master_folder_path, csv_file)

                # Read the CSV file using PySpark
                df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

Step 2: Filter out NP3-960-EX 48-hour Self-Arranged AS and NP3-961-EX 48-hour Cleared DAM AS

  1. Assuming the CSV files have a column named "File_Name" or "File_Type" that contains the file names, you can filter out the unwanted files using PySpark's filter method:
filtered_df = df.filter((df.File_Name != 'NP3-960-EX 48-hour Self-Arranged AS') & (df.File_Name != 'NP3-961-EX 48-hour Cleared DAM AS'))

Step 3: Ingest the file in the SQL database on Azure

  1. To ingest the filtered data into an Azure SQL database, you'll need to use the spark-sql package and the jdbc connector. First, add the necessary dependencies to your PySpark script:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Create a SparkSession
spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()

# Define the Azure SQL database connection properties
jdbc_url = "jdbc:sqlserver://your_server_name.database.windows.net:1433;database=your_database_name"
username = "your_username"
password = "your_password"
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

# Write the filtered data to the Azure SQL database
filtered_df.write.format("jdbc") \
    .option("url", jdbc_url) \
    .option("query", "your_table_name") \
    .option("user", username) \
    .option("password", password) \
    .option("driver", driver) \
    .save()

Replace the placeholders with your actual Azure SQL database credentials and table name.

Note: Make sure to install the spark-sql package and the mssql-jdbc driver in your PySpark environment. You can do this using pip install spark-sql mssql-jdbc or by adding the dependencies to your spark-defaults.conf file.

Implementation Results

  • I got the following error in the "Ingest the CSV files using PySpark" step:

    • NameError: name 'os' is not defined

      • The issue was a missing import of the os library; I fixed it by adding import os.
    • Another issue in the ingestion step was the use of multiple nested loops, which caused the ingestion code to take a very long time to run (a flatter alternative is sketched at the end of this section).

  • After completing the ingestion step, I got the following error in the data filtering step:

    • AttributeError: 'DataFrame' object has no attribute 'File_Name'

      • I was able to fix this by asking the Llama model for a solution, which eventually produced the code below. However, a couple of the model's other suggestions also raised errors; I only arrived at this working solution after three rounds of prompting.

        •     from pyspark.sql.types import StructType, StructField, StringType
          
              schema = StructType([
                  StructField("File_Name", StringType(), True),
                  StructField("Delivery Date", StringType(), True),
                  StructField("Hour Ending", StringType(), True),
                  StructField("MW Offered", StringType(), True),
                  StructField("ECRSM Offer Price", StringType(), True)
                  # Add more columns as needed
              ])
          
              df = spark.read.csv(csv_file_path, header=True, schema=schema)
          
    • I was able to load the data into the SQL Server database without any issues.
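
For reference, the nested-loop slowdown described above can be avoided by extracting the archives once and then letting Spark read every CSV in a single call. The sketch below is our own illustration rather than code produced by Llama-3-70B; it assumes the same master folder layout as the generated script, and the extracted subfolder is a name introduced purely for this example:

import glob
import os
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ercot Report Ingestion").getOrCreate()

master_folder_path = "/path/to/master/folder"
extract_path = os.path.join(master_folder_path, "extracted")

# Extract every daily zip up front (plain Python, runs once on the driver)
for zip_file_path in glob.glob(os.path.join(master_folder_path, "*.zip")):
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(path=extract_path)

# Read all extracted CSVs in a single call; Spark parallelizes the file scan
df = spark.read.csv(os.path.join(extract_path, "*.csv"), header=True, inferSchema=True)

Because Spark handles the directory listing and the parallel read itself, there is no per-file Python loop around spark.read.csv.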

Conclusion on Llama-3-70B's Performance

Llama-3-70B generated code faster. However, it fell short of GPT-4 when it came to the file ingestion script: GPT-4 was able to treat the zipped files as binary files, which made processing very fast.

Llama-3-70B, by contrast, generated code with multiple nested loops, which is not good practice when ingesting large files in big data projects. As a result, its ingestion took almost 10 minutes to run, compared to GPT-4's ingestion, which completed seamlessly within a minute.
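
To make the "zipped files as binary files" idea concrete, here is a minimal sketch in that spirit. It is not GPT-4's actual code from the previous post, and it assumes Spark 3.0+ for the binaryFile data source:

import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ercot Zip Ingestion").getOrCreate()

# Each daily zip becomes one row holding its raw bytes; nothing is extracted to disk
zips_df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.zip")
    .load("/path/to/master/folder")
)

def unzip_csvs(row):
    """Yield (file_name, csv_text) pairs for every CSV inside one archive."""
    with zipfile.ZipFile(io.BytesIO(row.content)) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                yield name, archive.read(name).decode("utf-8")

# Decompression runs on the executors instead of in a driver-side Python loop
csv_rdd = zips_df.select("content").rdd.flatMap(unzip_csvs)
csv_df = csv_rdd.toDF(["file_name", "csv_text"])

From here the CSV text still has to be parsed into columns, but the expensive part, reading and decompressing 30 archives, is distributed across the cluster rather than run as a serial loop on the driver.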

Moreover, GPT-4 did not run into any issues in the file filtering step, whereas Llama-3-70B's code raised an AttributeError that I could only resolve after multiple intermediate steps. Those intermediate steps introduced additional issues, which I finally resolved with the schema-based code shown in the Implementation Results section above.

Next in this series

We will explore Claude's performance on this data transformation project.