The best method to address the overfitting problem in this scenario is:
β "C. Dropout Methods"
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Threading: Threading is a method used for allowing multiple tasks to run concurrently in the same program. It doesn't directly help in improving the performance of a neural network model on new data.
β B. Serialization: Serialization is a process of converting an object into a format that can be stored or transmitted and then recreating the object from this format. It's not a technique for improving the performance of a model on new data.
β D. Dimensionality Reduction: While dimensionality reduction can sometimes help improve the performance of machine learning models by removing irrelevant features, in this case, it doesn't directly address the overfitting problem which is most likely caused by the complexity of the neural network itself.
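To make option C concrete, here is a minimal Keras sketch, assuming a small fully connected regression network with hypothetical layer sizes and input width:

```python
import tensorflow as tf

# Hypothetical architecture: dropout layers interleaved with dense layers.
# During training each Dropout layer randomly zeroes 50% of its inputs,
# which discourages co-adaptation and reduces overfitting; dropout is
# automatically disabled at inference time.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```

The dropout rate (0.5 here) is a tunable hyperparameter; lower values regularize less aggressively.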
The best method to use the new data in training the model is:
β "B. Continuously retrain the model on a combination of existing data and the new data."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Continuously retrain the model on just the new data: This approach would cause the model to lose all the information from past data, and it might not be effective if new data is limited or not diverse enough.
β C. Train on the existing data while using the new data as your test set: This approach does not use the new data for training and therefore the model won't learn from recent changes in user preferences.
β D. Train on the new data while using the existing data as your test set: This approach might lead to a model that's overfitted to the new data and not generalized well, since it doesn't learn from the full range of past preferences.
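A minimal scikit-learn sketch of option B, using randomly generated stand-ins for the existing and new datasets:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical stand-ins for the historical training set and the newly
# collected data; in practice these would be loaded from storage.
rng = np.random.default_rng(0)
X_existing, y_existing = rng.random((1000, 8)), rng.random(1000)
X_new, y_new = rng.random((100, 8)), rng.random(100)

# Option B: each retraining cycle fits on the union of old and new data,
# so the model adapts to recent preferences without forgetting past ones.
X_combined = np.vstack([X_existing, X_new])
y_combined = np.concatenate([y_existing, y_new])

model = SGDRegressor(max_iter=1000, tol=1e-3)
model.fit(X_combined, y_combined)
```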
The best method to adjust the database design is:
β "C. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Add capacity (memory and disk space) to the database server by the order of 200: While adding capacity can help to some extent, it doesn't address the underlying design issue and is not a scalable solution.
β B. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges: Sharding can help manage large datasets but it adds complexity and can lead to issues when needing to query across multiple shards.
β D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports: Partitioning could help improve performance, but it does not address the fundamental issue of data redundancy and may lead to difficulty in maintaining data consistency.
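A minimal sketch of the normalized design, using SQLite and hypothetical, simplified column names, just to illustrate the patient/visits split:

```python
import sqlite3

# Hypothetical, simplified columns: the master patient-record table is split
# into a patients table and a visits table, with visits referencing patients
# by foreign key, so consolidated reports join two narrow tables instead of
# self-joining one huge one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL,
    birth_date TEXT
);

CREATE TABLE visits (
    visit_id   INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
    clinic_id  INTEGER NOT NULL,
    visit_date TEXT NOT NULL,
    notes      TEXT
);
""")
conn.close()
```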
The best method to ensure the freshest data is shown in Google Data Studio 360 is:
β "A. Disable caching by editing the report settings."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β B. Disable caching in BigQuery by editing table details: This option would not be effective as the caching issue is with Google Data Studio, not BigQuery.
β C. Refresh your browser tab showing the visualizations: While this may load recent data, it does not solve the underlying issue of caching in Google Data Studio.
β D. Clear your browser history for the past hour then reload the tab showing the virtualizations: This option would have no impact on Google Data Studio's internal caching system.
The best method to build a pipeline that handles potentially corrupted or incorrectly formatted data is:
"D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis."
Here's why: A Dataflow pipeline can validate each record as it is processed, load the well-formed rows into BigQuery, and route anything that fails parsing to a separate dead-letter table so bad records can be inspected and reprocessed later. A minimal sketch follows the option analysis below.
Now, let's examine why the other options are not the best choice:
A. Use federated data sources, and check data in the SQL query: Federated queries can read data from an external source, but SQL alone is not designed to detect and quarantine corrupted or incorrectly formatted records.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert: This would notify you of load errors, but it does not actually handle the corrupted or incorrectly formatted data.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0: This would make the load job fail at the first corrupted or incorrectly formatted record, which does not solve the problem at hand.
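A minimal Apache Beam sketch of option D, with hypothetical bucket, project, dataset, and table names; rows that fail JSON parsing are routed to a dead-letter table instead of breaking the load:

```python
import json

import apache_beam as beam
from apache_beam import pvalue


class ParseJsonLine(beam.DoFn):
    """Emit parsed rows on the main output; route failures to a dead-letter tag."""

    def process(self, line):
        try:
            yield json.loads(line)
        except Exception as err:  # corrupted or incorrectly formatted record
            yield pvalue.TaggedOutput("dead_letter",
                                      {"raw_line": line, "error": str(err)})


with beam.Pipeline() as pipeline:
    results = (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/uploads/*.json")
        | "Parse" >> beam.ParDo(ParseJsonLine()).with_outputs(
            "dead_letter", main="valid"))

    # Well-formed rows go to the main table (assumed to already exist).
    results.valid | "WriteRows" >> beam.io.WriteToBigQuery(
        "example-project:analytics.uploads",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

    # Bad records land in a dead-letter table for later analysis.
    results.dead_letter | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
        "example-project:analytics.uploads_dead_letter",
        schema="raw_line:STRING,error:STRING",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```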
The best method to design the frontend to respond to a database failure is:
β "B. Retry the query with exponential backoff, up to a cap of 15 minutes."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Issue a command to restart the database servers: The frontend application shouldn't have the responsibility or the access rights to restart database servers.
β C. Retry the query every second until it comes back online to minimize staleness of data: Constant retries every second could overload the database server, making recovery slower.
β D. Reduce the query frequency to once every hour until the database comes back online: Reducing the query frequency to once every hour might make the app data stale and not responsive to database recovery.
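A minimal sketch of option B in plain Python, where run_query stands in for whatever call the frontend actually makes to the database:

```python
import random
import time


def query_with_backoff(run_query, max_delay_seconds=15 * 60):
    """Retry run_query with exponential backoff, capped at 15 minutes.

    run_query is a hypothetical callable that executes the frontend's
    database query and raises an exception while the database is down.
    """
    delay = 1
    while True:
        try:
            return run_query()
        except Exception:
            # Wait, then double the delay (with a little jitter) up to the cap,
            # so a recovering database is not flooded with retries.
            time.sleep(delay + random.random())
            delay = min(delay * 2, max_delay_seconds)
```

The jitter term spreads retries from many clients over time, which further reduces load spikes when the database comes back.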
The most appropriate learning algorithm to use for predicting housing prices on a resource-constrained machine is:
β "A. Linear regression"
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β B. Logistic classification: Logistic classification is typically used for binary classification problems, not for regression problems like predicting housing prices.
β C. Recurrent neural network (RNN): RNNs are typically used for sequence prediction problems and are computationally expensive, which is not suitable for a resource-constrained machine.
β D. Feedforward neural network: While feedforward neural networks can be used for regression problems, they are typically more computationally intensive than linear regression, making them less suitable for a resource-constrained machine.
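A minimal scikit-learn sketch of option A, using a handful of made-up houses (square footage and bedroom count) as features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features (square footage, bedrooms) and observed sale prices.
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]])
y = np.array([245_000, 312_000, 279_000, 308_000, 419_000])

# Training and prediction are cheap in both CPU and memory, which suits a
# resource-constrained machine, and the target (price) is continuous.
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[2000, 4]])))  # predicted price for an unseen house
```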
The most effective query type to ensure that duplicates are not included while interactively querying data is:
β "D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Include ORDER BY DESK on timestamp column and LIMIT to 1: This option only returns a single row, the most recent one, and does not handle the case of multiple rows with unique IDs.
β B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values: This would aggregate your data based on the unique ID and timestamp, but it would not ensure the elimination of duplicate entries. Moreover, SUM operation might not make sense for all data types or scenarios.
β C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL: This query type would return all but the first row of each partition. If there are duplicates, it does not guarantee their removal and might exclude legitimate entries.
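A minimal sketch of option D using the BigQuery Python client, with hypothetical project, dataset, table, and column names:

```python
from google.cloud import bigquery

# ROW_NUMBER() ranks the rows within each unique_id partition; keeping only
# row 1 returns exactly one (the most recent) record per ID, so duplicates
# are filtered out at query time without touching the stored data.
sql = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY unique_id
                       ORDER BY event_timestamp DESC) AS row_num
  FROM `example-project.analytics.events`
)
WHERE row_num = 1
"""

client = bigquery.Client()
for row in client.query(sql).result():
    print(dict(row))
```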
The table name that makes the SQL statement work correctly is:
"D. `bigquery-public-data.noaa_gsod.gsod*`"
Here's why: A wildcard table query in BigQuery requires the full table path, including the trailing wildcard, to be wrapped in backticks. A minimal sketch follows the option analysis below.
Now, let's examine why the other options are not the best choice:
A. 'bigquery-public-data.noaa_gsod.gsod': This option uses quotes instead of the required backtick (`) notation for the table name and also lacks the wildcard character needed for a wildcard table query.
B. bigquery-public-data.noaa_gsod.gsod*: This option is missing the backticks (`) that BigQuery requires around a table name, especially when a wildcard character is present.
C. 'bigquery-public-data.noaa_gsod.gsod'*: This option uses quotes instead of backticks and places the wildcard character outside them; the wildcard must appear inside the backticks.
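A minimal sketch of a working wildcard query using the BigQuery Python client; the table path is real (the NOAA GSOD public dataset), while the aggregation and suffix filter are just illustrative:

```python
from google.cloud import bigquery

# Backticks wrap the entire table path, including the trailing wildcard, so
# one query scans every matching gsod* table; the optional _TABLE_SUFFIX
# filter limits which yearly tables are read.
sql = """
SELECT MAX(temp) AS max_mean_temperature
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '1929' AND '1935'
"""

client = bigquery.Client()
print(list(client.query(sql).result()))
```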
The best three approaches to enforce minimum information access requirements with Google BigQuery are:
β "B. Restrict access to tables by role."
β "D. Restrict BigQuery API access to approved users."
β "F. Use Google Stackdriver Audit Logging to determine policy violations."
β Here's why:
π΄ Now, let's examine why the other options are not the best choice:
β A. Disable writes to certain tables: Disabling writes might not necessarily limit information access. It can prevent users from altering the data, but they can still read it.
β C. Ensure that the data is encrypted at all times: While data encryption is important for security, it doesn't necessarily limit data access to only those who need it to perform their jobs.
β E. Segregate data across multiple tables or databases: While this can help organize data, it doesn't directly limit access. Users might still be able to access data they shouldn't, unless appropriate access controls are also put in place.
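A minimal sketch of option B using the BigQuery Python client, with hypothetical project, dataset, and group names; it grants read access on one dataset to the single group that needs it (table-level IAM is also available for finer-grained control):

```python
from google.cloud import bigquery

# Grant the BigQuery READER role on one dataset to one group, so members of
# that group can read only the tables they need for their jobs.
client = bigquery.Client()
dataset = client.get_dataset("example-project.clinical_reports")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="reporting-team@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```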