INDEX
Explanations
overemphasized or belittled information
words related to triviality and insignificance
Negative Logits
ioch        -0.70
artney      -0.70
elson       -0.70
angelo      -0.69
oother      -0.66
anwhile     -0.65
yer         -0.63
olan        -0.63
andr        -0.61
ravings     -0.60
Positive Logits
ities       1.03
ity         0.96
istically   0.92
ifiable     0.88
itized      0.87
trivial     0.86
inconven    0.84
innocuous   0.83
izes        0.83
ization     0.83
Activations Density: 0.011%