INDEX
Explanations
informational narratives or reports
references to news stories and significant events
Negative Logits
iolet      -0.73
inia       -0.70
inav       -0.65
udget      -0.62
abbling    -0.62
hire       -0.60
imble      -0.58
entimes    -0.58
ankind     -0.58
razil      -0.57
Positive Logits
liest          1.07
iest           1.01
same           0.90
anew           0.88
himself        0.86
correctly      0.83
equivalent     0.80
anonymously    0.79
behind         0.77
wrong          0.76
Activations Density: 0.394%