INDEX
Explanations
phrases related to sensitive topics or information
references to sensitive topics or issues
Negative Logits
Token       Logit
Wolver      -0.76
Helsinki    -0.74
AUT         -0.73
ALK         -0.71
mere        -0.68
RON         -0.65
INST        -0.65
YC          -0.65
Fall        -0.65
AZ          -0.65
Positive Logits
Token         Logit
sensitive     1.55
sensitive     1.12
ivities       1.02
sensit        0.99
ensitive      0.98
sensitivity   0.95
proble        0.90
insensitive   0.85
mble          0.84
vulner        0.82
Activations Density 0.010%
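
As a rough illustration of the density figure above, here is a minimal sketch that assumes "activations density" means the fraction of token positions on which the feature fires (activation above zero). The page does not define the term, so this reading, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of positions where the feature is active.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, exactly one of which is active.
acts = [0.0] * 10_000
acts[1234] = 3.7

density = activation_density(acts)
print(f"{density:.3%}")  # formatted as a percentage with three decimals
```

Under this reading, a density of 0.010% corresponds to roughly one active position per 10,000 tokens.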