Mahender Sarangam
2018-11-17 11:23:49 UTC
Hi,
We have a daily data pull that brings in almost 50 GB of data from an upstream system. We process the 50 GB with Spark SQL and insert the result into a Hive target table. We then copy the whole Hive target table to SQL Server, specifically into a SQL staging table, merge the staging table against the final SQL target table, and insert only the modified or new records into the SQL target table. This process is time-consuming because most of the time is spent copying data from Blob to SQL. Instead of copying the whole data set from the cluster to SQL Server and implementing the merge logic in SQL, we would like to implement the merge logic in Spark SQL, move only the delta to SQL, and merge that against the final SQL target table. This would reduce network and I/O cost. Has anyone implemented a delta difference in Spark / Spark SQL?
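To make the idea concrete, here is a minimal sketch of the delta logic we have in mind, written in plain Python rather than Spark so it runs on its own. The key column `id`, the sample rows, and the function names `row_hash` and `compute_delta` are all hypothetical, and it assumes every row carries a unique key. A row counts as part of the delta if its key is new or if a hash of its non-key columns differs from the previous load.

```python
import hashlib


def row_hash(row, key):
    """Hash all non-key columns so a changed row can be detected by one comparison."""
    payload = "|".join(f"{col}={row[col]}" for col in sorted(row) if col != key)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compute_delta(previous, current, key="id"):
    """Return rows of `current` that are new or modified relative to `previous`."""
    # Index the previous load by key (assumed unique) with one hash per row.
    prev_hashes = {row[key]: row_hash(row, key) for row in previous}
    delta = []
    for row in current:
        old_hash = prev_hashes.get(row[key])
        # New key, or same key with different column values.
        if old_hash is None or old_hash != row_hash(row, key):
            delta.append(row)
    return delta


# Hypothetical sample data standing in for yesterday's and today's loads.
previous = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
]
current = [
    {"id": 1, "name": "a", "amount": 10},  # unchanged
    {"id": 2, "name": "b", "amount": 25},  # modified
    {"id": 3, "name": "c", "amount": 30},  # new
]

delta = compute_delta(previous, current)
print([row["id"] for row in delta])
```

In Spark SQL the same comparison would presumably become a join between the new load and the previous snapshot on the key, keeping rows with no match or a different hash, so that only those rows are written to the SQL staging table. This sketch does not handle rows deleted upstream.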