Passing Pandas DataFrames to Airflow Tasks
Apache Airflow is a popular platform for orchestrating data pipelines, and Pandas is a widely used Python library for data manipulation. Integrating Pandas with Airflow enables powerful data processing and analysis within workflows.
Methods for Passing Pandas DataFrames
1. Using XComs
XComs (short for "cross-communication") are Airflow's built-in mechanism for sharing small pieces of data between tasks. Pushing a DataFrame directly is the simplest approach, but keep in mind that the default XCom backend in Airflow 2.x serializes values to JSON, so this example relies on XCom pickling being enabled (or on a custom XCom backend):
Code Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
import pandas as pd

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    # Pushing a raw DataFrame relies on enable_xcom_pickling = True (or a
    # custom XCom backend); the default JSON serializer cannot handle it.
    kwargs['ti'].xcom_push(key='dataframe', value=df)

def process_dataframe(**kwargs):
    df = kwargs['ti'].xcom_pull(task_ids='generate_dataframe', key='dataframe')
    print(df)

with DAG(
    dag_id='dataframe_xcom_dag',
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    generate_task = PythonOperator(
        task_id='generate_dataframe',
        python_callable=generate_dataframe,
    )
    process_task = PythonOperator(
        task_id='process_dataframe',
        python_callable=process_dataframe,
    )

    generate_task >> process_task
Output
   col1  col2
0     1     4
1     2     5
2     3     6
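On Airflow 2.x the same hand-off can be written more concisely with the TaskFlow API, where a task's return value is pushed to XCom automatically. The sketch below is illustrative only and carries the same serialization caveat as the explicit xcom_push above:

from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
import pandas as pd

@dag(dag_id='dataframe_taskflow_dag', start_date=days_ago(1), schedule_interval=None)
def dataframe_taskflow():

    @task
    def generate_dataframe():
        # The return value is pushed to XCom automatically (key 'return_value').
        return pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

    @task
    def process_dataframe(df):
        # Pulled from XCom and handed in as a plain argument.
        print(df)

    process_dataframe(generate_dataframe())

dag = dataframe_taskflow()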
2. Using Pickle
Pickle is Python's standard module for serializing and deserializing objects. You can pickle a DataFrame and push the resulting bytes through XCom:
Code Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
import pandas as pd
import pickle

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    # pickle.dumps returns bytes; storing raw bytes in XCom also relies on
    # enable_xcom_pickling = True with the default XCom backend.
    pickled_df = pickle.dumps(df)
    kwargs['ti'].xcom_push(key='dataframe', value=pickled_df)

def process_dataframe(**kwargs):
    pickled_df = kwargs['ti'].xcom_pull(task_ids='generate_dataframe', key='dataframe')
    df = pickle.loads(pickled_df)
    print(df)

with DAG(
    dag_id='dataframe_pickle_dag',
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    generate_task = PythonOperator(
        task_id='generate_dataframe',
        python_callable=generate_dataframe,
    )
    process_task = PythonOperator(
        task_id='process_dataframe',
        python_callable=process_dataframe,
    )

    generate_task >> process_task
Output
   col1  col2
0     1     4
1     2     5
2     3     6
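Because the default XCom backend in Airflow 2.x serializes values to JSON, raw pickle bytes may be rejected unless XCom pickling is enabled. A common workaround, sketched below, is to base64-encode the pickled bytes into a plain string before pushing them (only the two callables are shown; the DAG wiring is the same as above):

import base64
import pickle
import pandas as pd

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    # Encode the pickled bytes as an ASCII string so the default JSON-based
    # XCom backend can store it without enable_xcom_pickling.
    encoded = base64.b64encode(pickle.dumps(df)).decode('ascii')
    kwargs['ti'].xcom_push(key='dataframe', value=encoded)

def process_dataframe(**kwargs):
    encoded = kwargs['ti'].xcom_pull(task_ids='generate_dataframe', key='dataframe')
    df = pickle.loads(base64.b64decode(encoded))
    print(df)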
3. Using a File System
You can save the DataFrame to a file in one task and read it back in a subsequent task, provided both tasks can reach the same file system (for example, the same worker or shared storage):
Code Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
import pandas as pd

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    # Both tasks must be able to see this path (same worker or shared storage).
    file_path = '/tmp/dataframe.csv'  # Adjust as needed
    df.to_csv(file_path, index=False)

def process_dataframe(**kwargs):
    file_path = '/tmp/dataframe.csv'  # Adjust as needed
    df = pd.read_csv(file_path)
    print(df)

with DAG(
    dag_id='dataframe_file_dag',
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    generate_task = PythonOperator(
        task_id='generate_dataframe',
        python_callable=generate_dataframe,
    )
    process_task = PythonOperator(
        task_id='process_dataframe',
        python_callable=process_dataframe,
    )

    generate_task >> process_task
Output
   col1  col2
0     1     4
1     2     5
2     3     6
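A variation on this approach is to write a Parquet file and pass the file path itself through XCom instead of hardcoding it in both tasks. The sketch below assumes pyarrow (or fastparquet) is installed and uses a hypothetical /tmp path; only the callables are shown:

import pandas as pd

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    file_path = '/tmp/dataframe.parquet'  # hypothetical location; adjust as needed
    df.to_parquet(file_path, index=False)  # needs pyarrow or fastparquet
    # Push only the small path string through XCom, not the data itself.
    kwargs['ti'].xcom_push(key='dataframe_path', value=file_path)

def process_dataframe(**kwargs):
    file_path = kwargs['ti'].xcom_pull(task_ids='generate_dataframe',
                                       key='dataframe_path')
    df = pd.read_parquet(file_path)
    print(df)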
4. Using a Database
For larger DataFrames or more robust data storage, you can store the DataFrame in a database:
Code Example
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago
import pandas as pd
import sqlite3

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    # The path is resolved relative to the worker's working directory;
    # an absolute path is safer in practice.
    conn = sqlite3.connect('dataframe.db')
    df.to_sql('dataframe_table', conn, if_exists='replace', index=False)
    conn.close()

def process_dataframe(**kwargs):
    conn = sqlite3.connect('dataframe.db')
    df = pd.read_sql_query('SELECT * FROM dataframe_table', conn)
    conn.close()
    print(df)

with DAG(
    dag_id='dataframe_db_dag',
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    generate_task = PythonOperator(
        task_id='generate_dataframe',
        python_callable=generate_dataframe,
    )
    process_task = PythonOperator(
        task_id='process_dataframe',
        python_callable=process_dataframe,
    )

    generate_task >> process_task
Output
   col1  col2
0     1     4
1     2     5
2     3     6
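The SQLite file above lives on a single machine's disk, so it only works when both tasks run on the same worker. In a multi-worker deployment a shared database server is more realistic; the sketch below uses a hypothetical PostgreSQL connection string via SQLAlchemy (in practice it would come from an Airflow Connection, for example through PostgresHook), and again shows only the callables:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; in a real DAG this would come from an
# Airflow Connection rather than being hardcoded.
ENGINE_URL = 'postgresql+psycopg2://user:password@shared-db:5432/analytics'

def generate_dataframe(**kwargs):
    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    engine = create_engine(ENGINE_URL)
    df.to_sql('dataframe_table', engine, if_exists='replace', index=False)

def process_dataframe(**kwargs):
    engine = create_engine(ENGINE_URL)
    df = pd.read_sql_query('SELECT * FROM dataframe_table', engine)
    print(df)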
Choosing the Right Method
The best method depends on the size of the DataFrame, the complexity of the workflow, and the desired level of persistence. Here’s a brief guide:
- XComs: Suitable for small DataFrames and simple workflows; XCom values are stored in Airflow's metadata database, so keep payloads small.
- Pickle: Convenient for arbitrary Python objects, but the pickled payload still travels through XCom, so it inherits the same size limits and is not suitable for large datasets.
- File System: Offers flexibility and can handle larger datasets, but requires careful file management.
- Database: Provides robust and persistent storage, ideal for complex workflows and large volumes of data.
Conclusion
Passing Pandas DataFrames between Airflow tasks is a fundamental aspect of data processing workflows. Understanding the different methods and their strengths helps you choose the most efficient approach based on your specific requirements. By leveraging these techniques, you can effectively integrate Pandas into your Airflow pipelines, enabling complex data manipulation and analysis within your workflows.