INDEX
Explanations
words related to judgment and critique
lists of items or instances separated by commas
Negative Logits
OND      -0.79
oir      -0.72
Tech     -0.67
HAM      -0.66
hers     -0.63
isers    -0.62
CLOSE    -0.62
оÐ       -0.62
OUGH     -0.61
Russ     -0.61
Positive Logits
nevertheless    0.70
disg            0.66
dominates       0.65
etc             0.65
annihil         0.64
circa           0.62
massac          0.62
litter          0.61
behaved         0.61
artifacts       0.61
Activations Density: 0.375%