INDEX
Explanations
phrases or sentences explaining reasons or justifications
phrases expressing knowledge or understanding
Negative Logits
xit         -0.83
isode       -0.73
lez         -0.72
anon        -0.71
âĵĺ         -0.71
ESA         -0.70
Appears     -0.70
jri         -0.69
udos        -0.68
âĢİ         -0.67
Positive Logits
outnumbered  0.78
pree         0.73
messed       0.71
collateral   0.71
%%           0.70
outwe        0.69
cheap        0.68
technically  0.67
verte        0.66
scarce       0.65
Activations Density 0.580%