INDEX
Explanations
phrases related to privacy and confidential information
references to privacy and privileged information
Negative Logits
Token       Logit
grass       -0.69
LER         -0.67
Legion      -0.65
conver      -0.64
ORN         -0.63
Gentle      -0.62
eric        -0.62
spur        -0.61
composed    -0.61
composing   -0.60
Positive Logits
Token       Logit
ilege       1.75
ileged      1.62
ately       1.33
acies       1.31
acy         1.12
aband       1.09
atile       1.05
atis        0.98
ilage       0.98
acist       0.97
Activations Density: 0.036%