You have a life. We like that about you.

At OCLC, we believe you'll do the best work of your life when you're living the best life possible.

We work hard to build the technology that connects thousands of today's libraries. But we also work hard to make a job at OCLC a meaningful part of a balanced life, not a substitute for one.

Technology with a Purpose. OCLC supports thousands of libraries in making information more accessible and more useful to people around the world. OCLC provides shared technology services, original research and community programs that help libraries meet the ever-evolving needs of their users, institutions, and communities. With office locations around the globe, OCLC employees are dedicated to offering premier services and software to help libraries.

The Job Details are as follows:

As a Lead Site Reliability Engineer on our Data Platform Reliability team, you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that OCLC's internally critical systems have reliability and uptime appropriate to users' needs while continuously monitoring capacity and performance. SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation.

You'll join a team of problem solvers with diverse perspectives, focused on ensuring a consistent environment and supporting the day-to-day operations of a global search platform.

Responsibilities:

Design, code, test and deliver software to automate manual operational work.
Develop and support Java applications, streaming applications (Spark Streaming and Kafka), Solr clusters and collections, and SQL and NoSQL databases (specifically HBase).
Migrate applications and systems to internal and external clouds.
Troubleshoot priority incidents, facilitate post-mortems, and ensure permanent closure of incidents.
Engage with other development teams throughout the life cycle to help develop software for reliability and scale, ensuring minimal refactoring or changes.
Analyze system performance and identify areas for improvement, working with development teams to optimize code and configurations.
Identify application patterns and analytics in support of better service level objectives.
Design automated software and product upgrades, change management, and release management solutions.
Create and maintain comprehensive documentation for systems, processes, and procedures.
Participate in 24x7 support coverage as needed.

Qualifications:

Bachelor's degree in computer science or a related discipline is required.
Knowledge of DevOps practices.
Strong scripting skills (e.g., Python, Bash, Go) for automation and tooling.
Experience monitoring and supporting a production application stack.
Experience setting up system monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, New Relic, ELK stack) is required.
Experience with the Hadoop ecosystem and associated components (MapReduce, Spark, HBase, ZooKeeper, etc.) is a plus.
Dedication to supporting full-stack solutions, including applications, servers, networks, data pipelines and data platforms.
Strong problem-solving skills, attention to detail, and the ability to work in a fast-paced, collaborative environment.

Working Conditions: Normal office environment.

ADA/EAA: The above statements cover what are generally believed to be principal and essential functions of this job. Specific circumstances may allow or require some people assigned to the job to perform a somewhat different combination of duties.