
Site Reliability Engineer - Monitoring/Container Management Experience


Lake Mary, FL, US

Req ID:  1926
Level of experience:  Mid-Career
Remote:  No
Travel Required:  0%-25%

At Ellucian, we’re motivated by a mission. Higher education is facing profound change. Shifting demographics and cultural perceptions, combined with declining support and rising expectations, are forcing colleges to do more with less. That’s where we come in. As true believers in the power of education to transform lives, we’re dedicated to helping all our customers thrive—not just survive—in these challenging times by transforming their institutions from the traditional paper-based colleges of yesterday to the agile, connected campuses of today. From cloud solutions built on world-class infrastructure to powerful analytics that drive successful planning, we lead the industry in building enterprise-class solutions tailored to institutions around the world.

Our passion for and commitment to learning and continuous improvement drive us internally, too. From professional development to flexibility and work-life balance, we give our global employees the tools they need to succeed so we can all grow together.







We are building an SRE team from the ground up to create container environments that scale and to accelerate an automation mindset that enables the best customer experience. This role and team will be highly visible within the organization and offer a real opportunity to help create and build something new.






Site Reliability Engineers are responsible for keeping all production systems at Ellucian running smoothly. SREs are expected to apply sound engineering principles and operational discipline to develop and deliver automation into our environments. They create monitoring and telemetry that give insight into the patterns that govern our success and allow Ellucian to deliver on our uptime commitments to our customers. They also build out our CD pipeline to deliver successful builds to production.



  • Identify, create, and maintain SLOs
  • Design, build, and support automation and monitoring that improve system reliability
  • Partner with R&D and Operations teams to enhance telemetry and reliability
  • Create monitoring to detect symptoms and preempt outages
  • Debug production issues across services and levels of the stack
  • Partner with R&D teams to advance efforts towards containerization and Kubernetes
  • Cover an assigned on-call rotation, working to identify systemic problems and their solutions
  • Rapidly troubleshoot incidents:
    • By leveraging service-restoration actions
    • By understanding what causes most issues and the actions to mitigate those on your assigned technology stack
    • By understanding what actions will systematically eliminate the causes of common issues
    • Without doing more harm to currently impacted systems 
  • Report problems and participate in related root cause analysis or incident postmortems
    • Install, set up, and configure third-party tools for collecting and analyzing database performance and stability issues
    • Provide analysis of poor performance and instabilities identified in systems
    • Participate in postmortem meetings and discussions, providing and documenting details from the incident or problem
    • Provide technical detail for diagnosing and fixing known bugs and problems. Assist with creating runbooks, automation, or other process improvements to address future occurrences of an issue and to automate common tasks; clearly document the steps to execute and resolve
  • Design, document, develop, test, and deploy automation
  • Act as a mentor for Cloud Engineering colleagues by:
    • Providing leadership and expertise in the adoption of new standards and best practices
    • Participating as a subject matter expert on process improvement, training, and tool development
    • Serving as the subject matter expert for assigned specializations, providing technical leadership and engineering expertise



  • Tenacity to get to the root of an issue quickly, understand why it happened, and prevent it in the future
  • Proven experience with monitoring development and administration: Datadog, New Relic, AppDynamics, SignalFx. Experience with synthetic monitoring, RUM, APM, system/host monitoring tools, ELK, and log aggregation
  • 3+ years of experience in a senior Linux, Windows, or engineering role
  • 3+ years of experience in a multi-region, multi-tenant SaaS or PaaS environment
  • 3+ years of experience with the AWS platform, preferably including but not limited to: AMI, EC2, EBS, ELB, IAM, KMS, RDS, S3, SNS, VPC, Route 53, CloudWatch, Lambda
  • 3+ years of experience with enterprise-scale Linux and/or Windows administration, including AD and ADFS
  • B.S. (Computer Science/Engineering) and/or one of the following:
    • AWS certification(s): Developer, DevOps, SysOps, or Solutions Architect preferred



  • Containerization experience is highly desired: Docker, ECR, ECS, EKS/kops/Kubernetes, Helm, YAML, Go
  • CD experience is highly desired: Jenkins (Blue Ocean, Jenkins X), Harness, TeamCity, GitLab, Bitbucket Pipelines, etc.
  • 3+ years of experience in developing and deploying automation, CLI and API scripting using multiple tools
    • Preferred: Ansible, Bash, Jenkins, Git, Python, PowerShell, Terraform, Java, and Go
    • Equivalent experience accepted on: Puppet, Chef, Docker, Gradle, JavaScript, Packer, Perl, and PHP
  • 3+ years of experience in web-based application deployment and administration using Python, Apache, Tomcat, IIS, and Nginx.
  • 3+ years of experience in network administration including DNS, IPAM, VPN, SSL certificates, and firewalls
  • 3+ years of experience troubleshooting Oracle database performance using AWR and ADDM reports is a plus
  • Proven experience collaborating with cross-functional, global, and remote teams with diverse backgrounds
  • Experience writing process requirements, technical design documents, and standard operating procedures
  • Strong verbal and written skills, excellent customer service, as well as high attention to detail







Ellucian provides equal employment opportunities to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, or genetics. In addition to federal law requirements, Ellucian complies with all laws governing nondiscrimination in employment in every location in which the company has facilities. This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation and training.  Ellucian expressly prohibits any form of workplace harassment based on race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status. Improper interference with the ability of Ellucian employees to perform their job duties may result in discipline up to and including discharge.

Nearest Major Market: Orlando
