Explanations
words or phrases related to reasoning or justification
words related to rationality and justification
Negative Logits
sten        -0.66
raped       -0.65
etry        -0.63
cutting     -0.63
Sina        -0.62
chi         -0.61
blackout    -0.59
facts       -0.58
pees        -0.58
charged     -0.57
Positive Logits
insofar         0.85
concern         0.77
altru           0.72
curiosity       0.72
intu            0.70
arily           0.69
indignation     0.68
sympath         0.68
applaud         0.66
;;;;;;;;;;;;    0.65
Activations Density 0.233%