INDEX
Explanations
phrases related to potentially difficult or dangerous scenarios
references to various situations that arise in different contexts
Negative Logits
roe       -0.83
rotein    -0.72
rik       -0.70
uster     -0.69
rib       -0.68
sub       -0.66
rica      -0.65
rium      -0.64
Bones     -0.63
head      -0.63
Positive Logits
situations       1.32
uations          1.10
scenarios        1.06
Situation        0.95
circumstances    0.94
predic           0.89
afety            0.85
situation        0.85
involving        0.82
contexts         0.78
Activations Density 0.010%