Explanations
instances of behavior that is done in secret or covertly
references to covert or secretive activities
Negative Logits
Applic         -0.69
orth           -0.68
Dur            -0.65
consistency    -0.64
ity            -0.63
availability   -0.63
availability   -0.62
pic            -0.61
Citation       -0.61
gradient       -0.61
Positive Logits
secretly       3.59
covert         1.74
clandestine    1.60
quietly        1.54
unconsciously  1.50
privately      1.48
silently       1.42
anonymously    1.40
secret         1.39
undercover     1.33
Activations Density: 0.017%