Code Highlights Course: a slide-by-slide lecture presentation on how to transform your skills as an Oracle professional (developer, DBA, analyst) into those of a data engineer in 8 easy steps. To learn more about Code Highlights, visit [www.code-hl.com](http://www.code-hl.com/)

---

## Slide 1: Introduction

- Title: Introduction
- Image: An image of a team of data engineers collaborating on a project
- Content: Welcome to the course on "Transforming Your Skills as an Oracle Professional to Become a Data Engineer in 8 Easy Steps". In this course, you will learn the skills and knowledge needed to become a data engineer, including how a data engineer differs from other Oracle professions, the core skills required, and the steps to take to make the transition.

## Slide 2: Understanding the Role of a Data Engineer

- Title: Understanding the Role of a Data Engineer
- Image: An image of a data engineer working with big data tools
- Content: In this section, you will learn the key responsibilities of a data engineer and how they differ from those of other Oracle professionals. You will also learn about the technologies and tools that data engineers use to build and maintain data pipelines, and the infrastructure required to manage data at scale.

### Oracle DBA to Data Platform Administrator/Engineer

| Oracle DBA | Data Platform Administrator |
| ----------------------------------------- | ---------------------------------------------------------------- |
| Manages Oracle databases | Manages the entire data platform |
| Ensures databases function properly | Manages hardware, software, and data |
| Optimizes databases for performance | Troubleshoots issues that arise |
| Ensures databases are secure and reliable | Ensures the platform is secure, reliable, and available |
| Focused solely on Oracle databases | Works with multiple databases, including non-Oracle |
| May not have knowledge of the entire platform | Must have a deep understanding of hardware and software components |
| | Oracle DBA | Data Platform Administrator |
| --------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| Main Focus | Management, maintenance, and security of Oracle databases | Administration and management of large and complex data platforms, including design, deployment, and maintenance of data solutions |
| Skills Required | Oracle Database, SQL, PL/SQL, Oracle Enterprise Manager, Data Security | Big Data platforms, SQL/NoSQL databases, Cloud Computing, ETL, Data Integration, Data Warehousing, Business Intelligence, Analytics |
| Tools & Technologies | Oracle Database, Oracle Enterprise Manager, SQL Developer | Hadoop, Spark, SQL/NoSQL databases, Cloud Computing, ETL tools, Business Intelligence software, Analytics tools |
| Key Responsibilities | Database installation, configuration, and upgrade; performance tuning and optimization; data backup and recovery | Design, deploy, and maintain data platforms; data modeling and architecture; capacity planning; data integration and ETL |
| Career Path | Oracle DBA, Senior DBA, Database Manager | Data Platform Administrator, Data Architect, Big Data Administrator, Data Warehouse Architect, Business Intelligence Analyst |
| Salary Range | $73,000 - $140,000+ annually (Glassdoor) | $80,000 - $180,000+ annually, depending on skills, experience, and location (Indeed) |
| Job Outlook | Steady demand, but slower than average job growth | Rapidly growing field with strong job demand and high salary potential |
| Industry Applications | All industries that use Oracle databases | All industries that rely on large data platforms for decision-making and analysis, including finance, healthcare, and technology |

### Oracle Developer to Data Engineer (Pipeline)

| | Oracle Developer | Data Engineer |
| --------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Main Focus | Application development using Oracle technologies | Data storage, management, and processing, including analysis and visualization |
| Skills Required | SQL, PL/SQL, Java, Oracle Forms/Reports, APEX, XML, JavaScript | SQL, NoSQL, ETL, Big Data platforms, Hadoop, Spark, Python, Data Warehousing, Business Intelligence, Machine Learning, Visualization |
| Tools & Technologies | Oracle Database, Oracle Middleware, Oracle APEX | Big Data tools (Hadoop, Spark), SQL/NoSQL databases, ETL tools, Business Intelligence software |
| Key Responsibilities | Design, develop, test, and maintain software applications | Manage and analyze large and complex data sets; design and implement data storage and processing systems |
| Career Path | Oracle Developer, Database Developer, Software Engineer | Data Engineer, Data Architect, Business Intelligence Analyst, Data Scientist |
| Salary Range | $70,000 - $140,000+ annually | $70,000 - $150,000+ annually, depending on skills, experience, and location |
| Job Outlook | Steady demand, but slower than average job growth | Rapidly growing field with strong job demand and high salary potential |
| Industry Applications | All industries that use Oracle technologies | All industries that rely on data for decision-making and analysis, including finance, healthcare, and technology |

## Slide 3: Core Skills Required for Data Engineering

- Title: Core Skills Required for Data Engineering
- Image: An image of a data engineer working on a data pipeline
- Content: In this section, you will learn about the core skills required to become a data engineer. These include expertise in programming languages such as Python and SQL, knowledge of distributed systems and cloud computing, and proficiency in data modeling and database design.
## Slide 4: Advance Your SQL Skills

Here are some advanced SQL features, with links to their documentation for Oracle and Spark SQL:

1. **Window Functions**: These functions let you perform calculations across a set of related rows and return the results alongside the original data. Examples include ROW_NUMBER, RANK, DENSE_RANK, and LEAD/LAG.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Window-Functions.html)
   - (https://spark.apache.org/docs/latest/sql-ref-functions-window.html)
2. **Common Table Expressions (CTEs)**: CTEs let you define a temporary, named result set that can be referenced in subsequent queries. They are useful for breaking complex queries into smaller, more manageable pieces.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Common-Table-Expressions.html)
   - (https://spark.apache.org/docs/latest/sql-ref-syntax-cte.html)
3. **Analytic Functions**: Analytic functions apply aggregate-style calculations such as SUM, AVG, MIN, MAX, and COUNT over a window of rows while still returning a value for every row.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Analytic-Functions.html)
   - (https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#window-aggregate-functions)
4. **Aggregate Functions**: Aggregate functions such as SUM, AVG, MIN, MAX, and COUNT collapse a group of rows into a single value.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Aggregate-Functions.html)
   - (https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions)
5. **PIVOT and UNPIVOT**: These operators transform data from rows to columns, or from columns to rows, respectively.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pivot-and-Unpivot-Operators.html)
   - (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html)
6. **Recursive CTEs**: Recursive CTEs perform recursive operations on data and are useful for tasks such as traversing hierarchical data structures.
   - (https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Recursive-Subquery-Factoring.html)
   - (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-recursive.html)

## Slide 5: Step 1 - Learn Programming Languages

- Title: Step 1 - Learn Programming Languages
- Image: An image of a person learning to code
- Content: In this section, you will learn about the importance of programming languages in data engineering and the most commonly used languages. You will also learn about resources and tools that can help you learn these languages, such as online courses, tutorials, and code libraries.

```sql
-- Expose a CSV file to Oracle as an external table
CREATE OR REPLACE DIRECTORY my_dir AS '/path/to/directory';

CREATE TABLE emp_ext (
  ...
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY my_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
    ( ... )
  )
  LOCATION ('emp.csv')
)
...
```

```python
import pandas as pd

# Load the emp and salgrade tables from CSV files into pandas dataframes
emp_df = pd.read_csv('emp.csv')
salgrade_df = pd.read_csv('salgrade.csv')

# Join emp to salgrade on the salary ranges: a cross join followed by a
# range filter (an equi-join on sal == losal would miss most salaries)
result_df = emp_df.merge(salgrade_df, how='cross')
result_df = result_df[(result_df['sal'] >= result_df['losal']) &
                      (result_df['sal'] <= result_df['hisal'])]

# Select only the required columns (ename, sal, grade)
result_df = result_df[['ename', 'sal', 'grade']]

# Print the result
print(result_df)
```

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder().appName("JoinExample").getOrCreate()

// Load the emp and salgrade tables from CSV files into Spark dataframes
val empDF = spark.read.format("csv").option("header", "true").load("emp.csv")
val salgradeDF = spark.read.format("csv").option("header", "true").load("salgrade.csv")

// Join the emp and salgrade dataframes based on the salary ranges
val resultDF = empDF.join(salgradeDF, col("sal").between(col("losal"), col("hisal")))

// Select only the required columns (ename, sal, grade)
val selectedDF = resultDF.select("ename", "sal", "grade")

// Print the result
selectedDF.show()
```

```r
# Read in the 'emp' and 'salgrade' CSV files as data frames
emp <- read.csv("emp.csv")
salgrade <- read.csv("salgrade.csv")

# Cross join the two data frames (by = NULL), then keep only the rows
# where the salary falls within the grade's range (losal..hisal)
result <- merge(emp, salgrade, by = NULL)
result <- subset(result, sal >= losal & sal <= hisal)

# Subset the resulting data frame to include only the 'ename', 'sal', and 'grade' columns
result <- subset(result, select = c("ename", "sal", "grade"))
```

| ename | sal | grade |
|-------|--------|-------|
| SMITH | 800.0 | 1 |
| ALLEN | 1600.0 | 2 |
| WARD | 1250.0 | 2 |

## Slide 6: Step 2 - Learn Distributed Systems and Cloud Computing

- Title: Step 2 - Learn Distributed Systems and Cloud Computing
- Image: An image of a cloud
infrastructure
- Content: In this section, you will learn about distributed systems and cloud computing and their importance in data engineering. You will also learn about the most commonly used cloud platforms, such as Amazon Web Services (AWS) and Microsoft Azure, and the tools and services they offer for data engineering.

- **In-memory computing**: Apache Spark keeps data in memory as much as possible, reducing expensive disk I/O and enabling faster data processing and analysis.
- **Cluster computing**: Spark runs on a cluster of machines, allowing it to scale horizontally as data volumes and processing requirements grow.
- **Distributed data processing**: Spark distributes data processing across multiple machines, making it possible to handle large datasets and complex computations.
- **Fault tolerance**: Spark is designed to be fault-tolerant: if one machine fails, the data and processing move to another machine in the cluster without interrupting the job.
- **Distributed machine learning**: Spark includes a powerful machine learning library, MLlib, which provides scalable implementations of common machine learning algorithms and techniques.
- **Streaming data processing**: Spark Streaming enables real-time processing and analysis of streaming data sources such as log files, sensor data, and social media feeds.
- **Graph processing**: Spark GraphX is a distributed graph processing library for large-scale graph analysis, such as social network analysis and recommendation systems.
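To make the cluster-computing idea concrete, here is a small, hypothetical pure-Python sketch of the partition / map / reduce pattern that engines like Spark apply across machines. Here the "workers" are just threads in a single process, and all names (`partition`, `count_words`, `merge_counts`) are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal chunks, like partitions on a cluster."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def count_words(chunk):
    """Map step: count words within one partition only."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(partials):
    """Reduce step: combine the independent per-partition results."""
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

lines = [
    "spark keeps data in memory",
    "spark runs on a cluster of machines",
    "data is processed across machines",
]

# Threads stand in for cluster nodes in this single-process sketch
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, partition(lines, 3)))

word_counts = merge_counts(partials)
print(word_counts)
```

Because each map step touches only its own partition, the work can be moved to any node, and a failed partition can simply be recomputed elsewhere, which is the essence of Spark's scaling and fault-tolerance story.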
| Cloud Technology | Oracle Cloud | AWS | GCP | Azure |
|------------------|--------------|-----|-----|-------|
| Compute | Oracle Compute VMs, Kubernetes, Bare Metal | Amazon EC2, EC2 Container Service, Elastic Kubernetes Service | Google Compute Engine, Google Kubernetes Engine, App Engine | Azure Virtual Machines, Azure Kubernetes Service, Azure Container Instances |
| Networking | Oracle Networking, FastConnect, VPN | Amazon VPC, Direct Connect, Transit Gateway | Google VPC, Cloud Interconnect, VPN | Azure Virtual Network, ExpressRoute, VPN Gateway |
| Storage | Oracle Cloud Storage, Block Volumes, Object Storage | Amazon S3, EBS, EFS | Google Cloud Storage, Persistent Disk, Cloud Filestore | Azure Blob Storage, Disk Storage, File Storage |
| Filesystem | Oracle Cloud File Storage, NFS, SMB | Amazon EFS, FSx, Storage Gateway | Google Cloud Filestore, Persistent Disk, Cloud Storage | Azure Files, NetApp Files, Disk Storage |
| Databases | Oracle Database Cloud Service, Autonomous Database, MySQL | Amazon RDS, DynamoDB, Aurora | Google Cloud SQL, Cloud Spanner, Cloud Bigtable | Azure Database, Cosmos DB, SQL Server |
| Data Pipelines | Oracle Cloud Infrastructure Registry, Oracle Data Integration, Oracle Integration | Amazon CodePipeline, AWS Glue, AWS Data Pipeline | Google Cloud Build, Cloud Dataflow, Cloud Pub/Sub | Azure DevOps, Azure Data Factory, Azure Event Grid |
| Job Scheduling | Oracle Cloud Infrastructure Compute, Oracle Functions, Oracle Container Engine for Kubernetes | Amazon EC2 Scheduler, AWS Lambda, AWS Batch | Google Cloud Scheduler, Cloud Functions, Cloud Run | Azure Scheduler, Azure Functions, Azure Logic Apps |

## Slide 7: Step 3 - Learn Data Modeling and Database Design

- Title: Step 3 - Learn Data Modeling and Database Design
- Image: An image of a database schema
- Content: In this section, you will learn about data modeling and database design and their importance in data engineering.
You will learn about different database management systems, such as MySQL and MongoDB, and how to create and maintain data models and database schemas.

Common data modeling techniques used by data engineers:

1. **Relational Data Modeling**: Modeling data as tables with columns, with defined relationships between the tables. This technique is widely used for structured data and is effective for complex queries and transactions.
   - [Relational Data Modeling Overview by Oracle](https://docs.oracle.com/en/database/oracle/oracle-database/19/ladrm/index.html)
2. **Dimensional Data Modeling**: Modeling data as fact tables and dimension tables with hierarchies between them. This technique is used for data warehouses and is effective for fast querying and reporting.
   - [Dimensional Data Modeling Overview by Oracle](https://docs.oracle.com/en/business-intelligence/enterprise-performance-management-system/epm-data-management-guide/18.1.0/dmcon/dimensional_modeling.html)
3. **NoSQL Document Data Modeling**: Modeling data as JSON-like documents with nested key-value pairs, sometimes with relationships between documents. This technique is used by NoSQL document databases such as MongoDB and is effective for unstructured and semi-structured data, as well as for horizontal scaling and high availability.
   - [NoSQL Document Data Modeling by MongoDB](https://www.mongodb.com/nosql-explained/nosql-data-modeling)
4. **Graph Data Modeling**: Modeling data as a graph of nodes and edges with attributes and relationships between them. This technique is used for highly connected, complex data and is effective for graph-based queries.
   - [Graph Data Modeling by Neo4j](https://neo4j.com/developer/graph-database-modeling/)
5. **Data Vault Modeling**: A hybrid of relational and dimensional modeling, in which data is organized into a standardized set of tables that store historical data and link back to source systems.
This technique is used for large-scale data warehouses and is effective for data integration and auditing.
   - [Data Vault Modeling by Kent Graziano](https://kentgraziano.com/what-is-data-vault-and-why-do-i-care/)

### The Data Vault Technique

Here is a brief description of the Hub, Link, and Satellite components of the Data Vault method, with an example table name for each in an inventory system:

1. **Hub**: A Hub is the central point of integration for a business concept, and it contains a unique identifier for that concept. Hubs store the business keys and their corresponding attributes; the attributes stored in the Hub are considered to be "true" for the business key.
   - `ProductHub`: contains the unique identifier for each product, along with the attributes that are true for each product, such as product name, description, and price.
2. **Link**: A Link represents a relationship between business concepts, and it contains the unique identifiers of the Hubs it connects. Links store the relationships between business keys.
   - `ProductVendorLink`: contains the unique identifiers of the product and vendor Hubs, plus attributes of the relationship between them, such as the date the vendor started supplying the product.
3. **Satellite**: A Satellite contains contextual information about the data, such as timestamps, sources, and other metadata. Satellites store the context and metadata of the business keys.
   - `ProductSatellite`: contains the attributes of the product that may change over time, such as price history, product category, and other related data.

In an inventory system, the Data Vault model can capture the relationships between products, vendors, sales, and other data, making it a powerful tool for data integration, scalability, and flexibility. By using Hubs, Links, and Satellites, the inventory system can be extended and modified easily as business requirements change over time.

![[Pasted image 20230216135556.png]]

ProductHub Table:
![[Pasted image 20230216140530.png]]

ProductVendorLink Table:
![[Pasted image 20230216140545.png]]

ProductSatellite Table:
![[Pasted image 20230216140603.png]]

## Slide 8: Step 4 - Pick a Data Platform

A data platform provides a centralized, integrated environment for managing, storing, processing, analyzing, and visualizing data. It enables organizations to collect and aggregate data from various sources, including transactional systems, applications, databases, and external data sources. Here are some of the main benefits of a data platform:

1. **Data Integration**: A data platform provides tools and frameworks for integrating data from different sources, transforming it, and loading it into a central repository. This makes it easier to manage and analyze data across the organization.
2. **Data Storage**: A data platform provides scalable and secure storage for managing data, ensuring that data is available and accessible whenever needed and protected from unauthorized access and other security threats.
3. **Data Processing**: A data platform provides tools and frameworks for processing and analyzing large volumes of data, using techniques such as machine learning, data mining, and natural language processing. This enables organizations to gain insights into their data, make data-driven decisions, and identify trends and patterns.
4. **Data Visualization**: A data platform provides tools for visualizing and presenting data in a meaningful, understandable way. This helps users interpret and analyze data more easily and spot trends and patterns that might not be visible in raw data.

### Incorta Data Platform

Incorta is a cloud-based data analytics platform that provides fast, comprehensive insights from complex data. It combines data ingestion, storage, and analytics in a single platform, allowing organizations to analyze and visualize large volumes of data quickly, without extensive data modeling or ETL processes.

![[Incorta-Data-Platform-Overview.png]]
![[Pasted image 20230216142522.png]]

## Slide 9: Step 4 - Learn Big Data Technologies

- Title: Step 4 - Learn Big Data Technologies
- Image: An image of a person working with Hadoop
- Content: In this section, you will learn about big data technologies and their importance in data engineering. You will learn about tools and frameworks such as Hadoop, Spark, and Kafka, and how they are used to build data pipelines and process large amounts of data.

Here are a few big data technologies that are relatively easy to learn and provide a good foundation for further study:

1. **Apache Hadoop**: Apache Hadoop is a widely used big data technology that provides a framework for distributed storage and processing of large data sets. Built in Java, it offers a set of tools and frameworks for managing and analyzing data, and it is a good starting point for understanding distributed computing and data processing.
2. **Apache Spark**: Apache Spark is a fast, versatile big data processing engine that provides a variety of APIs for processing and analyzing large data sets. It is built in Scala and provides a simple, intuitive API that is easy to learn.
Spark is designed to be easy to use and is a good starting point for learning distributed computing and big data processing.
3. **Apache Hive**: Apache Hive is a data warehouse system for Hadoop that provides a SQL-like interface for querying data stored in Hadoop. It is a natural fit for users who are already comfortable with SQL.
4. **Apache Pig**: Apache Pig is a high-level scripting language for Hadoop that offers a simple, intuitive interface for processing and analyzing large data sets. It requires minimal programming knowledge, making it a good entry point into big data processing.
5. **Apache NiFi**: Apache NiFi is a data integration and processing tool that provides a visual interface for building complex data processing pipelines.

## Slide 10: Step 5 - Learn Data Visualization and Reporting

- Title: Step 5 - Learn Data Visualization and Reporting
- Image: An image of a dashboard showing data visualizations
- Content: In this section, you will learn about data visualization and reporting and their importance in data engineering. You will learn about tools such as Tableau and Power BI and how to use them to create compelling visualizations and reports that communicate insights and findings from your data pipelines.

1. **Tableau**: Tableau is a popular data visualization tool that lets you create interactive dashboards and reports from a variety of sources, including big data platforms like Apache Hadoop and Apache Spark. Its drag-and-drop interface and intuitive design make it easy to use for both technical and non-technical users. Get started with Tableau: [https://www.tableau.com/learn/get-started](https://www.tableau.com/learn/get-started)
2. **Apache Superset**: Apache Superset is an open-source data exploration and visualization platform with an intuitive, interactive interface for creating charts, dashboards, and reports. Superset supports a variety of data sources, including big data platforms like Apache Hadoop and Apache Spark. Get started with Apache Superset: [https://superset.apache.org/docs/intro](https://superset.apache.org/docs/intro)
3. **Power BI**: Power BI is a business intelligence tool from Microsoft that lets you create interactive dashboards and reports from a variety of sources, including big data platforms. Its drag-and-drop interface and intuitive design make it easy to use for both technical and non-technical users. Get started with Power BI: [https://powerbi.microsoft.com/en-us/get-started/](https://powerbi.microsoft.com/en-us/get-started/)
4. **QlikView**: QlikView is a data discovery and visualization tool for building interactive reports and dashboards from a variety of sources, including big data platforms. Its associative engine lets you explore and analyze data in real time, making it well suited to big data applications. Get started with QlikView: [https://www.qlik.com/us/try-or-buy/download-qlikview](https://www.qlik.com/us/try-or-buy/download-qlikview)
5. **Looker**: Looker is a data analytics and business intelligence tool for building interactive reports and dashboards from a variety of sources, including big data platforms. Its intuitive design and flexible data modeling capabilities make it easy to use for both technical and non-technical users. Get started with Looker: [https://looker.com/product/resources/getting-started](https://looker.com/product/resources/getting-started)

These are just a few of the many tools available for big data visualization and reporting.
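Under the hood, every dashboard in these tools performs the same two moves: aggregate the data, then encode the numbers visually. As a minimal, tool-free sketch of that idea, here is a hypothetical text bar chart in plain Python (the `bar_chart` function and the sales figures are invented for illustration):

```python
def bar_chart(values, width=40):
    """Render a mapping of label -> number as a text bar chart.

    The longest bar is `width` characters; the others are scaled to it.
    """
    peak = max(values.values())
    rows = []
    for label, value in values.items():
        bar = "#" * max(1, round(value / peak * width))
        rows.append(f"{label:<12} {bar} {value}")
    return "\n".join(rows)

# Illustrative revenue-by-industry aggregates
sales = {"finance": 120, "healthcare": 90, "technology": 240}
print(bar_chart(sales))
```

A real reporting tool adds interactivity, drill-downs, and live data sources on top, but the aggregate-then-encode core is the same.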
By using these tools, you can gain insights from your big data sets and communicate your findings in a clear, actionable way.

## Slide 11: Step 6 - Practice with Real-World Projects

- Title: Step 6 - Practice with Real-World Projects
- Image: An image of a person working on a real-world data engineering project
- Content: In this section, you will learn why practical experience matters in data engineering and how to gain it by working on real-world projects. You will learn the steps involved in a data engineering project, such as identifying business needs, designing data models, building data pipelines, and creating visualizations and reports.

## Slide 12: Step 7 - Join Data Engineering Communities and Networks

- Title: Step 7 - Join Data Engineering Communities and Networks
- Image: An image of a person networking with other data engineers
- Content: In this section, you will learn why joining data engineering communities and networks is important and how to find and participate in them. You will learn about online communities, such as GitHub and Stack Overflow, as well as in-person communities, such as meetups and conferences.

## Slide 13: Step 8 - Keep Learning and Updating Your Skills

- Title: Step 8 - Keep Learning and Updating Your Skills
- Image: An image of a person reading a book about data engineering
- Content: In this section, you will learn about the importance of continuous learning and keeping your skills current as a data engineer. You will learn about resources and tools that can help you stay up to date on the latest technologies and best practices, such as online courses, conferences, and industry publications.

## Slide 14: Conclusion

- Title: Conclusion
- Image: An image of a person successfully transforming their skills to become a data engineer
- Content: Congratulations, you have completed the course on "Transforming Your Skills as an Oracle Professional to Become a Data Engineer in 8 Easy Steps".
By following the steps outlined in this course, you now have the skills and knowledge to make the transition from Oracle professional to data engineer. Keep learning and practicing, and don't be afraid to ask the data engineering community for help.

## Types of Data Engineers

In the data engineering industry, there are typically three types of professionals: data platform administrators, data pipeline engineers, and data analysts or data scientists.

Data platform administrators are responsible for managing the infrastructure and data platforms used by organizations. They ensure that the platforms are up to date, secure, and available for use. This role is also known as Data Operations Engineer or Data Infrastructure Engineer.

Data pipeline engineers are responsible for designing and building data pipelines that extract, transform, and load data from various sources into data warehouses or data lakes. They use a variety of technologies and tools, such as Apache Kafka, Apache Spark, and AWS Glue, to ensure data is accurate, timely, and available for analysis. This role is also known as Data Integration Engineer or ETL Developer.

Data analysts and data scientists are responsible for analyzing and making sense of data to derive insights and inform business decisions. They use statistical analysis, machine learning, and data visualization tools to analyze data and present their findings.

It's important to note that these roles can overlap, and some professionals perform duties from more than one role. However, a clear understanding of these roles can help you identify which skills to develop to pursue a career in data engineering.
Data Platform Administrators (Data Operations Engineers, Data Infrastructure Engineers)
   - Manage infrastructure and data platforms
   - Ensure platforms are up to date, secure, and available for use
2. Data Pipeline Engineers (Data Integration Engineers, ETL Developers)
   - Design and build data pipelines
   - Extract, transform, and load data from various sources into data warehouses or data lakes
3. Data Analysts or Data Scientists
   - Analyze and make sense of data
   - Derive insights and inform business decisions using statistical analysis, machine learning, and data visualization tools

Here's a table that maps each Oracle professional type (DBA, Developer, Analyst) to its data engineering counterpart, along with the corresponding tools:

| Oracle Professional Type | Data Engineer Type | Oracle Tools Used | Data Engineer Tools Used |
|--------------------------|--------------------|-------------------|--------------------------|
| DBA | Data Platform Administrator | Oracle Database, Oracle Cloud, Oracle Enterprise Manager | Amazon Web Services, Microsoft Azure, Google Cloud Platform, Kubernetes, Docker |
| Developer | Data Pipeline Engineer | Oracle SQL, PL/SQL, Oracle Data Integrator | Apache Kafka, Apache NiFi, Apache Spark, AWS Glue, Talend |
| Analyst | Data Analyst / Scientist | Oracle BI, Oracle Analytics Cloud, Oracle R Enterprise, Oracle Data Mining | Python, R, Tableau, Power BI, Apache Hadoop, Apache Spark |

An expanded version of the same mapping, with a broader tool list:

| Oracle Professional Type | Data Engineer Type | Oracle Tools Used | Data Engineer Tools Used |
|--------------------------|--------------------|-------------------|--------------------------|
| DBA | Data Platform Administrator | Oracle Database, Oracle Cloud, Oracle Enterprise Manager, SQL*Loader | AWS, GCP, Azure, Oracle Cloud, Hadoop, Kubernetes, Docker, SQL*Loader |
| Developer | Data Pipeline Engineer | Oracle SQL, PL/SQL, ODI | Apache Spark, Kafka, NiFi, AWS Glue, Talend, Parquet, Python, Pandas, Scala |
| Analyst | Data Analyst / Scientist | Oracle BI, Oracle Analytics Cloud, Oracle R Enterprise, Oracle Data Mining, OBIEE | Python, R, Tableau, Power BI, OBIEE, Postgres, Parquet, Spark SQL, AWS, GCP, Azure, Oracle Cloud, Hadoop, Data Warehouse, Data Lake |

Note: This table is not an exhaustive list of all the tools used by Oracle professionals and data engineers, but rather a representative sample of some commonly used tools.

| Function | Oracle Tools | Data Engineer Tools |
|----------|--------------|---------------------|
| ETL (Extract, Transform, Load) | Oracle Data Integrator, SQL*Loader, Oracle Warehouse Builder (OWB) | Apache Spark, Apache NiFi, AWS Glue, Talend, Parquet, Python, Pandas, Scala |
| Administration | Oracle Enterprise Manager, SQL Developer, Oracle Cloud Control | AWS, GCP, Azure, Oracle Cloud, Kubernetes, Docker, Shell Scripting, Job Scheduling |
| SQL (Structured Query Language) | Oracle SQL, PL/SQL, SQL*Plus, Oracle SQL Developer | Apache Spark SQL, Apache Hive, Presto, AWS Redshift, GCP BigQuery, Snowflake, SQL Server, Postgres, MySQL, MariaDB, Oracle SQL Developer |
| DML (Data Manipulation Language) | Oracle PL/SQL, SQL*Plus | Apache Spark, Apache Flink, Apache NiFi, AWS Glue, Python, Pandas, Scala |
| Storage | Oracle Database, Oracle Exadata, Oracle ASM | Hadoop, AWS S3, GCP Storage, Azure Blob Storage, Cassandra, MongoDB, Couchbase, Elasticsearch, Redis |
| Programming | Java, Oracle Forms, Oracle Reports, Oracle Application Express (APEX) | Python, Java, Scala, R, Go, C++, .NET, Shell Scripting |
| Reporting/Analytics | Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle Analytics Cloud, Oracle BI | Tableau, Power BI, Looker, QlikView, MicroStrategy, Google Data Studio, Apache Superset, Kibana |
| Streaming | Oracle Stream Analytics, Kafka | Apache Kafka, Apache Flink, Apache NiFi, AWS Kinesis, Azure Stream Analytics, Google Cloud Dataflow, Confluent Platform, StreamSets |
| Data Pipeline Batch | Oracle Data Integrator, Oracle Warehouse Builder (OWB) | Apache Spark, Apache Airflow, Apache Beam, Apache NiFi, Talend, AWS Glue, GCP Dataflow, SQL Server Integration Services (SSIS) |
| Data Modeling | Oracle Designer, Oracle SQL Developer Data Modeler | Erwin, Oracle SQL Developer Data Modeler, PowerDesigner, ER/Studio Data Architect, MySQL Workbench, PostgreSQL pgModeler |

Data Engineer:
- Specializes in building and maintaining data pipelines.
- Focuses on designing and implementing ETL processes.
- Skilled in big data technologies such as Hadoop and Spark.
- Designs data validation and testing procedures.
- Works closely with Data Scientists and Data Analysts.
- Has experience working with cloud-based platforms.
- Develops and deploys automated data pipelines.

Oracle Developer:
- Specializes in developing and maintaining Oracle database applications.
- Focuses on writing SQL and PL/SQL code.
- Skilled in Oracle-specific technologies such as Oracle Forms and Oracle Reports.
- Designs and implements data validation and testing procedures.
- Works closely with stakeholders.
- Has experience working with on-premises database systems.
- Develops and deploys batch processes.

A Database Developer focuses on designing, developing, and maintaining databases and the applications that interact with them, whereas a Data Engineer focuses on designing and building data pipelines and ETL processes to move and process large volumes of data. While both roles work with data, their tasks and responsibilities are distinct: a Database Developer typically works at the application layer, while a Data Engineer works at the data layer.

- Data Layer: deals with the storage and retrieval of data.
- Application Layer: deals with the business logic and functionality of an application.
- Presentation Layer: deals with the user interface of an application.
- Integration Layer: deals with integrating various systems and applications.
- Infrastructure Layer: deals with the underlying hardware and software infrastructure.
- Security Layer: deals with the security of the application and the data it processes.

I. Introduction
   A. CEO welcomes a candidate for a data engineering position
   B. Candidate explains their qualifications and plans for the position
II. Overview of Data Engineering
   A. Growth of data engineering positions and pay
   B. Importance of data engineering in relation to data science and analytics
III. Challenges in Data Engineering
   A. Data engineering's role in a company
      1. Supports the core business but does not directly generate revenue
      2. Focus on technology over business outcomes
   B. Dependency issues
      1. Data engineers act as a bridging function between producers and consumers
      2. Difficulties arise when data sources standardize data differently or miss data points
      3. Constant communication required between different stakeholders
   C. Rapidly evolving technologies
      1. Data engineers must stay updated with new technologies and adapt to new tools
      2. Outdated technology can affect a project's success and reflect poorly on a data engineer
IV. Conclusion
   A. Data engineering can be a rewarding career but requires a strong dedication to technology and communication
   B. Sponsor message: Project Pro can help build end-to-end projects and improve data proficiency
   C. Final thoughts and questions for the reader

Data Engineering Overview:
- Data engineering is the practice of designing, building, and maintaining the infrastructure and systems required to collect, store, process, and analyze large volumes of data.
- It involves developing and maintaining data pipelines that extract, transform, and load (ETL) data from various sources into data warehouses or data lakes.
- Data engineers are responsible for ensuring the reliability, scalability, and security of these systems, and for making data accessible to other members of the organization who need it for analysis or decision-making.

Title: Challenges of Being a Data Engineer

Slide 1: Why Be a Data Engineer?
- The data engineering job profile and its pay have been growing drastically
- Data engineers are often treated as second-class citizens within a company
- The importance of data engineering as a foundation for data science and analytics

Slide 2:
- First challenge: Supporting the core business
  - A back-office position that is not revenue generating
  - Acts as a bridge between multiple departments
  - Often interfacing with multiple business members
  - Gathering data
  - Identifying requirements
  - Fixing issues

Slide 3:
- Second challenge: Dependencies
- Bridging function between producers and consumers
  - ERP, CRM, and manufacturing (MFG) systems
  - Marketing data, sales and website data
  - Sensor data and IoT
  - Third-party data sources, like consumer data and competitor pricing
- Issues with standardizing data
  - Missing fields and missing data that need enrichment
  - Dirty data that needs cleansing
  - Misshaped data that needs transformation and reshaping
- Troubleshooting when things go wrong
  - Multiple layers, multiple stacks
  - Different technologies
  - A blend of legacy and new technologies

Slide 4:
- Third challenge: Technologies
- Data engineers deal with multiple evolving technologies
  - A few years ago, Hadoop and MapReduce were all the rage
  - Now Apache Spark and the Apache ecosystem are in favor
  - Also commercial software like Snowflake and Databricks
- End-to-end technologies
  - Source systems
  - Data pipeline systems
  - Data visualization and reporting systems
  - SALT: Snowflake, Alteryx, Tableau
- The need to stay up to date
  - Know some of the limitations of Parquet and the advantage of Delta Lake for Slowly Changing Dimensions
  - No-code solutions can accelerate delivery

Slide 5:
- Being a data engineer is not a walk in the park
- The need to communicate well and solve problems
- The importance of staying updated with new technologies
- The job of a data engineer is not just about coding all day

References:
- "Challenges of Being a Data Engineer" by Josh Starmer. YouTube, 18 Aug. 2021, [https://www.youtube.com/watch?v=k3pqfMfV7Ic](https://www.youtube.com/watch?v=k3pqfMfV7Ic).
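The extract, transform, load pattern discussed throughout this section can be sketched in plain Python. This is a minimal illustration, not the API of any specific tool: the order records, field names, and the in-memory SQLite "warehouse" are hypothetical stand-ins for a real source system and target platform.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (an in-memory CSV standing in
# for an export from an ERP/CRM system).
raw = io.StringIO(
    "order_id,amount,region\n"
    "1,100.50,us-east\n"
    "2,,us-west\n"          # missing amount -> fails validation
    "3,75.25,US-East\n"     # inconsistent casing -> needs standardizing
)
rows = list(csv.DictReader(raw))

# Transform: cleanse dirty data, standardize values, and reject rows
# that fail validation -- the "bridging" work described in the slides.
def transform(row):
    if not row["amount"]:
        return None                       # reject incomplete records
    return {
        "order_id": int(row["order_id"]),
        "amount": float(row["amount"]),
        "region": row["region"].lower(),  # standardize casing
    }

clean = [t for r in rows if (t := transform(r)) is not None]

# Load: write the validated rows into a warehouse table
# (SQLite as a stand-in for a real data warehouse).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)")
con.executemany("INSERT INTO orders VALUES (:order_id, :amount, :region)", clean)
print(con.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
# -> (2, 175.75)
```

In a production pipeline the same three stages would typically run in Spark, NiFi, or AWS Glue, with the rejection and standardization rules driven by a data quality framework rather than hard-coded.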