Hello! I’m Nat, a 3rd-year computer science undergrad at UC Berkeley. I work on research at the Center for AI Safety.

Outside of school, I enjoy making music, climbing rocks, learning about elections, and eating watermelon!

:envelope: Email / :link: LinkedIn / :computer: GitHub / :mortar_board: Google Scholar

Research

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
arXiv Preprint
HarmBench is an evaluation framework for automated language model red teaming. We conduct a large-scale comparison of 18 red teaming methods and 33 target LLMs and defenses, and propose a highly efficient adversarial training method that greatly enhances LLM robustness.
paper / website / code

Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan*, Sarah Chen*, James Campbell*, Phillip Guo*, Richard Ren*, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
arXiv Preprint
Representation engineering (RepE) enhances LLM transparency by monitoring and manipulating high-level cognitive phenomena. RepE is effective in mitigating dishonesty, hallucination, and other unsafe behaviors.
paper / website / code

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan*, Chan Jun Shern*, Andy Zou*, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
ICML 2023 Oral
MACHIAVELLI is a benchmark of 134 text-based choose-your-own-adventure games with annotations of safety-relevant concepts such as deception, physical harm, and power-seeking, guiding development towards safe yet capable language agents.
paper / website / code