SRE/DevOps Engineer in Nashville, TN at Vaco

Date Posted: 2/3/2021

Job Description

Ultimately, the reason we are here is to provide software solutions to our nonprofit customers so that they can be successful in their missions.

In this role, you'll work collaboratively with software engineering to ensure the performance, security, and reliability of our systems. You'll maintain tools for deployment, monitoring and operations, and you'll troubleshoot and resolve issues in our development, test, and production environments. You'll work on automation and leveling up our systems.


1. Operations / System Administration

  • Daily monitoring of systems, identifying issues related to the performance and scalability of software processes, and either resolving issues or creating tasks for other team members to debug/resolve
  • Maintaining and spinning up application servers in a cloud environment (GCP)
  • Monitoring SQL queries for unacceptable slowness, testing index changes
  • Monitoring for and identifying potential security vulnerabilities
  • Ensuring our servers have valid TLS certificates
  • Creating, documenting, and testing failover plans and emergency readiness scenarios
  • Maintaining a log of system outages, reasons, and resolutions (incident reports)
  • Ongoing support of development environments orchestrated with Kubernetes (GKE)

2. Automation & Upgrading

  • Automating repetitive tasks related to server health and performance, first and third-party integrations, our testing environments, and the production environment
  • Upgrading, patching, and failing over database servers
  • Upgrading system-level dependencies to support application upgrades (e.g. Ruby versions in server environments)

3. Deploying Software

  • Identify and escalate/communicate areas of risk before deployments
  • Perform software deployments in collaboration with other team members
  • Improve, maintain, and document the process and tools for deploying software

4. Availability

We have very few incidents and very high uptime, nevertheless:

  • From time to time, your role may require you to be available and/or to perform work outside of "typical" business hours.
  • You will be expected to participate in an "on-call" schedule to receive & address alerts.

5. Knowledge & Documentation

  • Having a full, working understanding of our architecture so that you will be able to address unexpected issues with the production environment relatively quickly
  • Creating & maintaining documentation relating to maintaining & supporting production and testing environments, especially as it relates to the architectural components, 3rd party and other dependencies, and accessing various components/services

6. Communication & Collaboration

  • You'll be regularly communicating your plans and the current status of projects and various aspects of our production system to management and the software development team.
  • If incidents arise, you'll be expected to communicate with key stakeholders in an informative, clear, and timely manner.
  • Active collaboration is highly valued here. You will collaborate closely with other Site Reliability Engineers on a daily basis, and with our CTO and other members of Product Development on a weekly basis.
  • You'll monitor key metrics and report on them on a regular basis. We currently use the OKR framework to focus, guide, and measure success.

What a Typical Week Looks Like:

Every day brings something new. The work environment in the offices is dynamic, collaborative, communicative, and challenging, yet supportive. We leverage each other's diverse perspectives and experiences for success. We employ an agile-flavored software development process. In addition to conversations, you'll use Jira, GitHub, and Slack to communicate with team members. You'll communicate the status of systems and projects frequently - no less than once a day.

We typically perform between 1 and 3 deployments in a week, starting around 7:30 or 8 AM. You may spend several hours preparing for and deploying software in a given week.

Initially, more time will be spent monitoring and addressing production issues as they arise. However, this will represent a smaller fraction of your week as we continue to add automation.

We typically perform at least 1 maintenance release per week. We also typically plan our new development roughly in quarter-long cycles, running 2-3 larger new development efforts nearly continuously. Accordingly, we release new features or functionality about every 1-2 weeks.

Skills and Experience

We are building complex software, which will be used in different environments, by many users with different goals, with different data. In order to build valuable, quality software, a thoughtful and methodical approach is best. The following attributes and skills will serve an operations professional well:

  • Excellent diagnostic/troubleshooting and problem-solving skills
  • The ability to think critically, creatively, and logically
  • Curiosity, patience, determination, and focus
  • The ability to collaborate well with other engineers, developers, quality assurance and management, and to communicate diplomatically & effectively
  • An ability to perform at a high level under intense pressure
  • Effective time management, organization skills, and the ability to prioritize and balance multiple competing projects and tasks
  • An inclination to drive your own success; you are a self-starter who takes pride in your work and is motivated to "always be learning"
  • A love for Linux and scripting languages, especially Ruby
  • An understanding of issues related to data privacy and security

We are looking for someone with the knowledge and confidence that comes from experiencing many different situations designing, implementing, and improving complex systems, from core application logic to supporting infrastructure resources. The number of years is not as important to us as knowing that you've successfully performed the described responsibilities in the past; however, at least 5 years' experience is a good guide.

Other preferred experience:

  • Experience with Kubernetes and cloud computing
  • Engineering experience, preferably with Ruby / Ruby on Rails
  • Experience with our tech and tools: Redis, Sidekiq, Google Cloud Platform, Ruby, Elixir, PostgreSQL, LMDB, NSQ, Docker, New Relic, Honeybadger, git, and Jira

Growth potential

Because we are a growth-stage startup, our teams and processes are constantly evolving. To be successful, every team member should take pride in their work, be self-driven, and be comfortable with a changing environment.

This is a full-time position. We are headquartered in Brentwood, Tennessee (a suburb of Nashville, Tennessee).