INDEX
Explanations
words related to revealing or sharing information
terms related to the disclosure of information
Negative Logits
ichick    -0.77
uve       -0.76
ively     -0.75
igue      -0.72
ouf       -0.72
ENDED     -0.70
helm      -0.67
imore     -0.64
hammad    -0.63
ingu      -0.63
Positive Logits
IGH       0.84
TAIN      0.83
doms      0.75
hiba      0.70
stood     0.69
tesy      0.68
vern      0.67
ysis      0.66
divul     0.65
Letter    0.65
Activations Density 0.042%
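
To make the density figure concrete, here is a minimal sketch of how such a value could be computed. It assumes that density means the fraction of token positions where the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all illustrative assumptions, not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is read as (active tokens) / (all tokens sampled).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 4 of which activate the feature.
sample = [0.0] * 9996 + [1.3, 0.2, 2.7, 0.9]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.040%
```

Under this reading, a density of 0.042% would correspond to roughly 4 active positions per 10,000 sampled tokens, which marks the feature as sparse.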