INDEX
Explanations
references to thoughts or mental considerations
expressions related to thoughts and perceptions
Negative Logits
poon          -0.73
english       -0.69
jong          -0.66
Estimates     -0.66
ccording      -0.66
Policies      -0.65
deviations    -0.63
aples         -0.63
orously       -0.62
vision        -0.62
Positive Logits
thereof       0.85
presented     0.82
iest          0.78
abl           0.78
of            0.75
lessness      0.74
outwe         0.72
afforded      0.72
forts         0.72
thrill        0.72
Activations Density 0.199%