CrowdStrike

August 5, 2024 David Teather CrowdStrike

When you’re processing trillions of security events daily for Fortune 500 companies, every millisecond of latency and every error matters. After my successful internship at CrowdStrike, I returned as a full-time Software Engineer on the Threat Detection and Incident Response (TDIR) team, where I’ve built production-ready microservices, developed customer-facing APIs, created organizational tooling, and contributed to threat detection systems that security teams depend on to protect their organizations from cyber threats. This role is an incredible opportunity for growth, ownership, and impact at one of the world’s leading cybersecurity companies.

What TDIR Does

At a very high level, the Threat Detection and Incident Response team processes the data coming in from all of our customers, detecting anomalies and threats in real time to protect organizations worldwide. Even though we operate at a scale of trillions of events daily, we still prioritize detecting anomalies as quickly as possible to lower our mean time to detection and response.

Our team operates in a critical infrastructure environment where reliability is non-negotiable. When we’re processing trillions of events daily, even a 0.01% error rate (100 million events per trillion) can mean millions of missed threats or false positives that could overwhelm security teams.
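To put that error rate in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes a round one trillion events per day purely for illustration; that number and the `affected_events` helper are not real figures or code from our systems.

```python
def affected_events(events_per_day: int, error_rate_percent: float) -> int:
    """Return how many events per day a given error rate (in percent) touches."""
    return round(events_per_day * error_rate_percent / 100)


# Assumption: a round one trillion events per day, for illustration only.
ONE_TRILLION = 10**12

if __name__ == "__main__":
    print(f"{affected_events(ONE_TRILLION, 0.01):,} events affected per day")
```

Even an error rate that looks negligible as a percentage turns into a very large absolute number at this volume, which is why small regressions get taken seriously.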

My Role and Impact

As a Software Engineer on TDIR, I’ve worked across multiple areas of our threat detection platform, from building customer-facing APIs and internal tooling to developing monitoring infrastructure and contributing to core detection features like correlation rules that allow users to orchestrate scheduled batch queries to detect anomalies.

Customer-Facing API Development

One of my most significant contributions was spearheading the end-to-end development of a customer-facing API microservice for correlation rules. This was more than just writing code—we needed to build something that let customers use automated processes like API clients, scripts, and more to interact with our platform.

In addition, we needed to build something that was easy to work with and understand, as the existing API was complex and not exposed externally. We also needed it to be rock solid, so we built in high unit test coverage, integration tests, error rate monitoring, latency requirements, extensive validations, and more.

Building this service taught me that customer-facing APIs require a different level of rigor. Every error message, every response time, every edge case matters when security teams are depending on your service to protect their organizations.
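As a rough illustration of what I mean by extensive validations and useful error messages, here is a minimal Python sketch. It is not the actual service code: the field names (`name`, `query`, `interval_minutes`), the limits, and the `validate_rule` helper are all hypothetical, chosen only to show the idea of collecting every problem in one pass so a client can fix its request in a single round trip.

```python
from dataclasses import dataclass


@dataclass
class ValidationError:
    field: str    # which request field is wrong
    message: str  # human-readable explanation a client can act on


def validate_rule(payload: dict) -> list[ValidationError]:
    """Check a hypothetical correlation-rule payload and return every problem found."""
    errors: list[ValidationError] = []

    name = payload.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append(ValidationError("name", "must be a non-empty string"))

    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append(ValidationError("query", "must be a non-empty string"))

    interval = payload.get("interval_minutes")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int):
        errors.append(ValidationError("interval_minutes", "must be an integer number of minutes"))
    elif not 1 <= interval <= 1440:  # assumed bounds: one minute to one day
        errors.append(ValidationError("interval_minutes", "must be between 1 and 1440"))

    return errors


if __name__ == "__main__":
    for err in validate_rule({"name": "", "query": "event_type=login", "interval_minutes": 0}):
        print(f"{err.field}: {err.message}")
```

Returning the full list of errors, each tied to a field, is one way to make an API friendlier to scripts and API clients than failing on the first problem with a generic message.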

Engineering Excellence and Reliability

Ensuring reliability at CrowdStrike’s scale requires more than just good code. Across all my projects, I implemented comprehensive engineering practices that became the standard for our team:

90%+ unit test coverage - Every critical path tested and verified
Comprehensive end-to-end integration tests - Ensuring the entire system works together, which has caught potential issues before they reached production
Automated alerts and monitoring - Proactive issue detection before customers notice
Documented runbook procedures - Clear escalation paths for any issues
Rapid rollback capability - Ability to quickly revert changes if problems arise

The monitoring infrastructure I built included Grafana dashboards for API metrics, OpenSearch query performance, and Kafka pipeline health. This observability wasn’t just nice-to-have—it was essential for maintaining our SLOs and quickly diagnosing issues in a complex distributed system.
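To make the SLO idea concrete, here is a small Python sketch of an error-budget calculation, the kind of number a dashboard panel or alert might be built around. The 99.9% target, the request counts, and the `error_budget_remaining` helper are assumptions for illustration, not our actual SLOs or tooling.

```python
def error_budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Return the fraction of the error budget left for a window.

    1.0 means no budget used, 0.0 means exactly exhausted,
    and a negative value means the SLO has been violated.
    """
    if total_requests == 0:
        return 1.0  # no traffic, nothing consumed
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        # A 100% target leaves no room for any failure.
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: 1,000,000 requests, 250 failures, 99.9% availability target.
    remaining = error_budget_remaining(1_000_000, 250, 0.999)
    print(f"Error budget remaining: {remaining:.1%}")
```

In this example the target allows about 1,000 failures, so 250 failures use roughly a quarter of the budget. Alerting on how fast the budget is being spent, rather than on individual errors, is one common way to catch problems before customers notice them.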

Beyond the API work, I also contributed to core platform improvements, including abstracting shared components that other teams can use, and developing comprehensive integration tests that have prevented issues from reaching production.

One of my most impactful non-coding achievements was creating comprehensive documentation that codified tribal knowledge about building and deploying reliable microservices. This documentation has been widely adopted across the organization and has significantly accelerated service rollout.

The documentation covers everything from initial service setup and testing strategies to monitoring, alerting, and incident response procedures. It represents months of learning from building production services and distills that knowledge into actionable guidance for other engineers.

I also created several other documentation resources that have had significant organizational impact:

Team onboarding resources that have helped new hires get up to speed faster
Internal process documentation that has streamlined workflows across teams
Technical guides that have enabled other teams to implement similar solutions
Cross-team collaboration resources that have improved knowledge sharing

Mentorship and Leadership

At CrowdStrike, I’ve had the opportunity to mentor multiple engineers, both through formal intern mentoring programs and through informal guidance. Helping more junior engineers navigate the complexities of working in a large-scale distributed system has been incredibly rewarding.

One of my mentees worked on delivering an additional endpoint for correlation rules. Guiding him through the intricacies of working with the API, complex downstream service behavior, internal processes, and feature flags was a rewarding experience for both of us.

Incident Response and Reliability

Working on a critical infrastructure platform means being prepared to deal with production incidents. I’ve been involved in incident response for the features our team owns, where I’ve diagnosed root causes, written comprehensive incident retrospectives, and implemented mitigations to prevent similar issues from recurring. This includes ensuring proper recovery procedures and validating fixes across affected environments.

One particularly complex incident involved multiple interconnected systems with various contributing factors. I wrote a comprehensive retrospective that was praised by another team’s engineering manager as “way above standard for my level, especially with many moving parts.” The retrospective included both immediate mitigations and long-term fixes to prevent similar issues.

Personal Growth and Reflections

Working at CrowdStrike is incredibly rewarding. The scale of the problems we solve—processing trillions of events daily—has taught me to think differently about system design, reliability, and performance. Every decision has to consider the impact on millions of endpoints and thousands of customers.

The culture of ownership at CrowdStrike is particularly impactful. I’m learning that being a great engineer isn’t just about writing code—it’s about taking ownership of the entire lifecycle of your services, from initial design through production monitoring and incident response.

One of the most valuable lessons I’m learning is the importance of documentation and knowledge sharing. In a fast-moving environment with complex systems, capturing and sharing knowledge isn’t just helpful—it’s essential for team velocity and system reliability.
