Senior Service Engineer - CTJ - Poly - Job at Microsoft in Redmond, WA


Redmond, WA

Posted 13 days ago


Job Description

Overview

Microsoft has an exciting opportunity to join the Silver Infrastructure and Operations team in supporting our Secure Work Area operations. Our team manages the infrastructure and day-to-day operations that enable Azure engineers to work in isolated and highly regulated environments. Do you enjoy solving complex issues, can you calmly triage multiple critical events, and do you communicate in an articulate and professional manner? Then we welcome you to learn more about this opportunity and share how you can contribute to the successful delivery of our mission-critical services. We are looking for a Senior Service Engineer who understands the systems and processes used by Windows, Azure, Linux, and Apple operating systems and applications. We look for ways to automate processes and create tools that allow our team to scale in support of our growing facilities. We are responsible for meeting security compliance requirements, meeting service level agreements for escalations, and partnering with other engineering groups to architect solutions that keep our mission-critical services highly available. This role also requires close interaction with other engineering teams, program managers, and contractors in supporting the operations of our clouds.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

Technical Knowledge and Expertise

Develops end-to-end expertise in service and/or system design, interactions between technology layers and components, functions of infrastructure, and dependencies at scale. Takes ownership of service design by driving efforts within the organization to identify, define, recommend, and build optimal configurations of technology solutions with consideration for cost management. Independently adjusts configurations and defines infrastructures to improve the availability, reliability, efficiency, observability, and/or performance of supported products and services. Drives reviews with the engineering teams that develop and/or manage services, identifying opportunities for operational efficiencies and sharing learnings and recommendations across engineering teams working on related services within the organization. Stays current as the technology landscape evolves, maintaining awareness of industry norms, and uses that knowledge to drive the adoption of new solutions across engineering teams working on related products within the organization. Provides guidance to others through sharing, coaching, conferences, and other means to drive improvements across teams.

Operational Excellence

Independently implements reliable, scalable, and high-performance solutions across teams. Contributes to design documents. Owns implementation and rollback plans. Maintains quality checklists and related documentation. Creates, monitors, and acts on telemetry data, and influences telemetry analytics to better identify patterns that reveal errors and unexpected problems affecting the system's availability, reliability, performance, and/or efficiency. Develops scripts and/or automation, and leverages an understanding of solutions to define, develop, measure, track, change, and improve the quality of telemetry pipelines that support automated monitoring and incident response.
Responds to incidents during regular on-call rotations, including complex issues with major customer or business impact. This involves identifying the level of impact, troubleshooting, contributing to difficult decisions based on business impact, deploying appropriate fixes to resolve root causes, and implementing automation to prevent recurring issues, while coordinating the resources required for incident resolution, which may include product teams, owners, leadership, other engineering teams, and/or subject matter experts. Escalates highly complex, ambiguous, and impactful issues for resolution as needed. Contributes to postmortems and shares details about incidents and their resolution through postmortem reports and regular review meetings. Provides expert incident response assistance to other Service Engineers as needed, and develops incident response and resolution guidance. Adheres to prescriptive guidance for security, privacy, and compliance standards in alignment with direction from the business and technical experts. Works with security, privacy, and compliance teams to identify and address issues relevant to their services. Identifies patterns of violations and implements automation to prevent them. Provides assistance to other Service Engineers as needed.

Collaboration and Knowledge Sharing

Collaborates within and across teams by proactively and systematically sharing information with an appropriate level of detail for the audience. Overcomes obstacles by resolving conflicts and issues across interdependent teams, and engages with partners and stakeholders so that issues are resolved and mutual objectives are met. Shares insights and best practices that can be applied to improve development and operations across related systems, platforms, and/or products.
Continues to develop their understanding of insights and best practices through interactions with members of product engineering teams and other resources (e.g., conferences, brown bags, wikis, documentation). Mentors and coaches other engineers to help them identify and propose relevant solutions.

Specialty Responsibilities

Leverages advanced technical expertise, judgment, and decision-making to coordinate multiple workstreams and resources in crisis situations, driving the mitigation plan and resolving the crisis by engaging the necessary teams and escalating to the appropriate stakeholders. Applies diagnostic expertise and provides guidance to other engineers working to mitigate and resolve issues. Communicates customer impact and other relevant information to key stakeholders, leadership, and customers. Develops and drives projects and programs that improve crisis response by creating standard practices for consistent response across engineering teams. Fosters increased stability and reduces noise by adjusting telemetry and alarms. Influences key engineering stakeholders to adopt new standards and practices that broadly improve crisis and problem management. Monitors and maintains security by addressing security vulnerabilities through patches, reconfigurations, and/or settings updates. Identifies, prioritizes, and targets solutions to complex security issues that may impact customers and partners, and drives action to promote the adoption of relevant mitigations. Drives mitigation programs and processes (e.g., automation), troubleshoots system issues, and partners closely with internal customers and engineering teams to conduct root cause analyses, share end-to-end expertise in services, and mitigate and resolve issues. Communicates and drives adherence to security policies and procedures. Embodies our culture and values.

Job Summary

Company

Microsoft

Start Date

As soon as possible

Employment Term and Type

Regular, Full Time

Required Experience

Open