Explanations
phrases or expressions indicating confusion or absurdity
Negative Logits
rique       -0.70
ahime       -0.69
rica        -0.68
elist       -0.68
emale       -0.68
ngth        -0.66
rican       -0.65
lean        -0.65
essor       -0.64
Citation    -0.64
Positive Logits
except      0.86
together    0.84
toget       0.82
moot        0.78
together    0.73
happening   0.70
transpired  0.66
revolves    0.65
undone      0.64
usional     0.64
Activations Density: 0.094%