INDEX
Explanations
phrases describing observations or perceptions
phrases that express perception or subjective opinions
Negative Logits
arta        -0.84
opus        -0.81
veland      -0.74
itus        -0.73
cedented    -0.70
heed        -0.69
isa         -0.68
por         -0.68
holm        -0.67
orio        -0.67
Positive Logits
initially       0.70
hoped           0.68
DEV             0.67
earlier         0.66
originally      0.66
Soviet          0.64
previously      0.63
unsuccessfully  0.62
tremend         0.58
prog            0.56
Activations Density 0.543%
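
The density figure above can be read as the share of sampled token positions on which the feature fires. The page does not say how it is computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of activations strictly above `threshold`.

    Assumption: "density" means the fraction of token positions where
    the feature is active; the source page does not define it.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 1000 token positions, 5 of which have a nonzero activation.
sample = [0.0] * 995 + [0.3, 1.2, 0.7, 2.1, 0.05]
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.543% would mean the feature is active on roughly one token in every 184 sampled.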