Corrections - Google Professional Data Engineer Practice Questions and Answers - Part 2

Q 9

The best solution to minimize the storage cost of the migration of your company's on-premises Apache Hadoop servers to Google Cloud Dataproc is

✅ Cost Efficiency: Google Cloud Storage (GCS) is cheaper compared to Google Persistent Disk, which can significantly reduce storage costs.
✅ Compatibility with Hadoop: GCS is compatible with Hadoop. It provides a connector that allows Hadoop jobs to use GCS for input and output.
✅ Durability and Availability: GCS provides high data durability and availability, making it a suitable storage option for large amounts of data.
✅ Scalability: GCS is highly scalable and can handle large amounts of data, making it suitable for migrating a Hadoop cluster.

❌ B. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster: Preemptible VMs are cheaper than regular instances, but they are not related to storage cost savings. Moreover, they are short-lived, which may not be suitable for long-running jobs.
❌ C. Tune the Cloud Dataproc cluster so that there is just enough disk for all data: This option might not significantly reduce the storage cost as the data size is large (50TB per node). Moreover, it can lead to storage capacity issues in the future.
❌ D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk: This option could still result in high costs because Persistent Disk is more expensive than GCS. Also, it may lead to complexity in managing data across two storage systems.

Thanks

By Deepak Dubey