Spark, one of the most successful projects in the Apache Software Foundation, is evolving day by day as a market leader for Big Data processing. It is an open-source cluster computing framework designed for fast computation. Its efficiency and cool features make it a preferred choice among data scientists and tech giants like Amazon, eBay, and Yahoo. But along with the many advantages and features that make it a powerful Big Data tool, some essential features are missing from Spark. Hold on! Do we know what ACID transactions mean? We will see what ACID means for Spark and whether Spark really accommodates it or not.

Atomicity & Consistency

Atomicity states that the Spark DataFrame writer should write either the full data or nothing at all to the data source. Consistency, on the other hand, ensures that the data is always in a valid state.

As is evident from the Spark documentation, when a DataFrame is saved to a data source in overwrite mode, the existing data is deleted before the new data is written. So in case of a job failure, the original data is lost or corrupted and no new data is written.

Consider a job, Job-1, that creates 100 records and writes them into the source. When we execute it, we get records consisting of the integers from 0 to 99 in the FILE_PATH directory. We can count the records: spark.read().csv(FILE_PATH).count()

Now, let's create another job, Job-2, to do the same task but with an exception: Job-2 issues the same .write().mode("overwrite").csv(FILE_PATH) call, but a throw new RuntimeException("Oops! Atomicity failed") inside the job makes it fail in the middle of the write operation. Job-2 should overwrite the set of 100 records created by Job-1. When we execute Job-2, the RuntimeException is thrown as expected. Now let's go to the FILE_PATH directory and look for the records created by Job-1: not only did Job-2 fail, it deleted the records created by Job-1 as well.

The overwrite operation that writes a DataFrame to a source is two-fold: first it deletes the existing data from the data source, then it writes the new data into it. As seen above, the existing data was successfully deleted, but no new data was written. This leaves our system in an inconsistent state and violates the Consistency property of transactions. If the operation had been atomic, the state would have been rolled back to the original data in the source, but that is not what happened. Also, when we count the records, we do not get the expected output. Thus, the Atomicity principle of the ACID properties is violated too.

We know that while a transaction is in progress and not yet committed, it must remain isolated from any other transaction.
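The original Spark snippets did not survive, but the failure mode itself needs no cluster. Below is a hedged, Spark-free sketch in plain Java of the same two-fold overwrite — delete the existing data, then write the new data — where an exception in the middle of the write loses the old records and leaves the new write incomplete. All names (`NonAtomicOverwrite`, `failAfter`, the temp file) and the simulated failure point are illustrative assumptions, not code from the original post:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.IntStream;

public class NonAtomicOverwrite {
    // Simulates the two-fold overwrite: delete the existing data, then write the new data.
    // If failAfter records have been written, a mid-write crash is simulated.
    static void overwrite(Path file, List<String> records, int failAfter) throws IOException {
        Files.deleteIfExists(file);                  // step 1: the old data is already gone
        try (var out = Files.newBufferedWriter(file)) {
            int written = 0;
            for (String r : records) {
                if (written == failAfter) {          // simulated crash in the middle of step 2
                    throw new RuntimeException("Oops! Atomicity failed");
                }
                out.write(r);
                out.newLine();
                written++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".csv");

        // Job-1: writes 100 records (0..99) successfully.
        overwrite(file, IntStream.range(0, 100).mapToObj(Integer::toString).toList(),
                  Integer.MAX_VALUE);
        System.out.println("after Job-1: " + Files.readAllLines(file).size());

        // Job-2: the same overwrite, but it fails part-way through the write.
        try {
            overwrite(file, IntStream.range(100, 200).mapToObj(Integer::toString).toList(), 50);
        } catch (RuntimeException e) {
            System.out.println("Job-2 failed: " + e.getMessage());
        }
        // Job-1's 100 records are gone and Job-2's write is incomplete:
        System.out.println("after Job-2: " + Files.readAllLines(file).size());
    }
}
```

Counting the records after Job-2 reports 50, not the 100 that Job-1 wrote — the same wrong count the article observes with spark.read().csv(FILE_PATH).count().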
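For contrast, here is one common way atomicity is achieved outside Spark: write the new data to a temporary file first, then replace the target in a single atomic rename, so a failure mid-write leaves the original data untouched. This is a sketch of the general technique under stated assumptions (same-filesystem rename, Java NIO's `ATOMIC_MOVE`), not anything the post's Spark jobs actually do; all names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class AtomicOverwrite {
    // Write the new data to a temp file first; the target is only touched on success.
    static void overwriteAtomically(Path file, List<String> records) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, records);      // a crash here leaves `file` fully intact
        Files.move(tmp, file,           // single atomic replace of old data with new
                   StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".csv");
        Files.write(file, List.of("old-1", "old-2"));
        // Unlike the delete-then-write overwrite, a failure inside
        // overwriteAtomically before the move would leave old-1/old-2 in place.
        overwriteAtomically(file, List.of("new-1", "new-2", "new-3"));
        System.out.println("after overwrite: " + Files.readAllLines(file).size());
    }
}
```

Readers of the target file see either all of the old records or all of the new ones, never the empty in-between state — which is exactly the rollback behavior the article finds missing in Spark's overwrite mode.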