Big Data/PySpark Engineer in McLean at Vaco

Date Posted: 11/15/2019

Job Description

Notes from the call with the manager:

  • Risk Management Group
  • This is a Data Analytics team
  • The way the data is structured is a bit different than what the team is equipped to handle
  • Millions and millions of records, thousands of columns wide
  • Big Data Functions

Day to Day:

  • Interact with large data sets
  • Perform in-memory computation
  • Establish the different systems the team needs
  • Review other people's code and optimize it

Top skills will be:

  • PySpark
  • AWS - EMR clusters, EC2

Job Description:

Basic Qualifications:

  • At least 1 year of experience with Apache Spark coding and a good understanding of optimizing Spark for memory and performance
  • At least 1 year of experience with PySpark
  • At least 1 year of experience setting up EMR clusters on AWS
  • At least 2 years of professional experience with data engineering and tools like Hadoop, HDFS, Hive, etc.
  • At least 2 years of experience writing good-quality software in languages like Java, Python, Scala, etc.

Preferred Qualifications:

  • Expert knowledge of Apache Spark internals
  • Experience with AWS Services like S3, Lambda and EMR
  • 2+ years of experience in Python, Java, or Scala
  • 2+ years of experience with Unix/Linux systems, with scripting experience in Shell, Perl, or Python
  • 2+ years of experience building data pipelines
  • 2+ years of automated deployment and CI/CD experience with tools like Jenkins
  • At least 1 year of Cloud (AWS, Azure, Google) development experience
  • Experience with Streaming and/or NoSQL implementation (Mongo, Cassandra, etc.) a plus

Job Requirements

PySpark, AWS EMR clusters, EC2, Big Data