INDEX
Explanations
words associated with confusion or instability
Negative Logits
Token       Logit
refined     -0.65
ngth        -0.64
catentry    -0.62
ner         -0.58
diction     -0.55
Dial        -0.54
gifted      -0.54
libel       -0.53
chell       -0.53
towed       -0.53
Positive Logits
Token       Logit
ither       0.89
rift        0.85
asa         0.78
oros        0.76
ewater      0.76
acia        0.72
abus        0.72
alys        0.71
oon         0.71
ike         0.69
Activations Density: 0.031%