Explanations

Language usage and definition

Negative Logits

Token           Logit
                -0.08
"-phase"        -0.08
" Пост"         -0.08
" Dress"        -0.08
" Мен"          -0.08
"Languages"     -0.08
"лежащ"         -0.08
"рус"           -0.07
" Seam"         -0.07

Positive Logits

Token           Logit
" slang"        0.10
"表示"          0.10
" unfair"       0.09
" shorthand"    0.09
"意味"          0.09
"暗"            0.09
" derog"        0.09
"恶"            0.09
"用于"          0.09
" trolling"     0.09

Activations Density 0.028%