INDEX
Explanations
words related to confidentiality and secrecy
references to sensitive information or topics
Negative Logits
FIN       -0.83
LOAD      -0.75
Wolver    -0.74
UNCH      -0.73
AUT       -0.72
Truck     -0.72
INST      -0.70
aneers    -0.69
amaz      -0.68
swick     -0.68
Positive Logits
sensitive      1.17
ivities        1.07
mble           0.96
sensit         0.80
ively          0.79
sensitivity    0.77
itives         0.76
sensitive      0.76
ensitive       0.75
itiz           0.75
Activations Density: 0.016%