INDEX
Explanations
phrases related to inappropriate behavior and consequences
Negative Logits
OLOGY      -0.73
Trojan     -0.68
Waterloo   -0.67
Roose      -0.66
enegger    -0.66
eways      -0.63
WM         -0.63
esan       -0.62
Printing   -0.61
Kev        -0.61
Positive Logits
inent      0.92
abst       0.90
ain        0.90
ainer      0.87
inance     0.86
ention     0.85
rences     0.85
atory      0.83
ained      0.82
aining     0.81
Activations Density: 0.027%
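
The density figure is usually read as the share of tokens on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the page, which does not say how its figure was computed.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (above-threshold) activation. The source does not define it.
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


if __name__ == "__main__":
    # Hypothetical data: the feature fires on 1 token out of 10,000.
    sample = [0.0] * 9999 + [1.3]
    print(f"Activations Density: {activation_density(sample):.3%}")
```

Under this reading, a density of 0.027% would correspond to the feature firing on roughly 27 of every 100,000 tokens.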