The Monkey Brain
Identifying jargon from wikipedia

When I discovered Simple Wikipedia I realized that I could do a comparative analysis of the words that are most commonly used in “simple explanations” vs. those in the more formal English Wikipedia explanations1. So I collected a list of subjects, then I downloaded all of the pages they had in common across the two sites. These are the (somewhat arbitrary, mostly determined by what overlaps existed) “categories” I looked at:

  • Anthropology
  • Archaeology
  • Astronomy
  • Biology
  • Chemistry
  • Economics
  • Engineering
  • Linguistics
  • Mathematics
  • Music
  • Physics
  • Psychology
  • Sociology

For each subject, I only considered articles that had the same title across both English and Simple Wikipedia – filtering down the number of articles to around 200 per subject. Then I joined together all of the articles in that subject, then found the term frequency of each word. I then considered the difference in term frequencies between them. Words with a high imbalance toward English Wikipedia are considered “jargon-y”, and words with a high imbalance toward Simple Wikipedia are frequently used in simple explanations.

I lower-cased and stemmed the words as a preprocessing step, so “wa” -> “was”, and “species” -> “specie”. To try out for yourself what stemming a word looks like, see this interactive demo. I also filtered out stopwords.2

The results:

+---------------------------+----------------------------+
| Simplest (Anthropology)   | Jargoniest (Anthropology)   |
|---------------------------+----------------------------|
| 1. year                   | 1. meme                    |
| 2. human                  | 2. social                  |
| 3. homo                   | 3. society                 |
| 4. ago                    | 4. division                |
| 5. tool                   | 5. population              |
| 6. found                  | 6. anthropology            |
| 7. people                 | 7. cultural                |
| 8. made                   | 8. labour                  |
| 9. man                    | 9. ax                      |
| 10. specie                | 10. field                  |
+---------------------------+----------------------------+

+--------------------------+---------------------------+
| Simplest (Archaeology)   | Jargoniest (Archaeology)   |
|--------------------------+---------------------------|
| 1. people                | 1. century                |
| 2. year                  | 2. period                 |
| 3. found                 | 3. beaker                 |
| 4. made                  | 4. number                 |
| 5. ancient               | 5. according              |
| 6. tool                  | 6. germanic               |
| 7. ago                   | 7. dynasty                |
| 8. battle                | 8. however                |
| 9. site                  | 9. early                  |
| 10. age                  | 10. within                |
+--------------------------+---------------------------+

+------------------------+-------------------------+
| Simplest (Astronomy)   | Jargoniest (Astronomy)   |
|------------------------+-------------------------|
| 1. earth               | 1. magnitude            |
| 2. asteroid            | 2. due                  |
| 3. planet              | 3. observation          |
| 4. moon                | 4. velocity             |
| 5. star                | 5. observed             |
| 6. space               | 6. located              |
| 7. light               | 7. mission              |
| 8. galaxy              | 8. approximately        |
| 9. called              | 9. result               |
| 10. sun                | 10. spectral            |
+------------------------+-------------------------+

+----------------------+-----------------------+
| Simplest (Biology)   | Jargoniest (Biology)   |
|----------------------+-----------------------|
| 1. called            | 1. within             |
| 2. cell              | 2. due                |
| 3. animal            | 3. however            |
| 4. plant             | 4. level              |
| 5. make              | 5. including          |
| 6. year              | 6. result             |
| 7. organism          | 7. factor             |
| 8. like              | 8. rate               |
| 9. thing             | 9. potential          |
| 10. get              | 10. receptor          |
+----------------------+-----------------------+

+------------------------+-------------------------+
| Simplest (Chemistry)   | Jargoniest (Chemistry)   |
|------------------------+-------------------------|
| 1. chemical            | 1. production           |
| 2. make                | 2. application          |
| 3. made                | 3. due                  |
| 4. used                | 4. may                  |
| 5. compound            | 5. structure            |
| 6. also                | 6. produced             |
| 7. acid                | 7. process              |
| 8. ion                 | 8. concentration        |
| 9. formula             | 9. effect               |
| 10. water              | 10. material            |
+------------------------+-------------------------+

+------------------------+-------------------------+
| Simplest (Economics)   | Jargoniest (Economics)   |
|------------------------+-------------------------|
| 1. people              | 1. card                 |
| 2. money               | 2. social               |
| 3. company             | 3. model                |
| 4. person              | 4. transaction          |
| 5. country             | 5. new                  |
| 6. called              | 6. within               |
| 7. many                | 7. would                |
| 8. thing               | 8. fund                 |
| 9. make                | 9. rate                 |
| 10. good               | 10. billion             |
+------------------------+-------------------------+

+--------------------------+---------------------------+
| Simplest (Engineering)   | Jargoniest (Engineering)   |
|--------------------------+---------------------------|
| 1. energy                | 1. engine                 |
| 2. engineer              | 2. crane                  |
| 3. people                | 3. suspension             |
| 4. make                  | 4. arafat                 |
| 5. university            | 5. exchanger              |
| 6. de                    | 6. method                 |
| 7. work                  | 7. design                 |
| 8. engineering           | 8. hydraulic              |
| 9. compressor            | 9. pump                   |
| 10. year                 | 10. bearing               |
+--------------------------+---------------------------+

+--------------------------+---------------------------+
| Simplest (Linguistics)   | Jargoniest (Linguistics)   |
|--------------------------+---------------------------|
| 1. word                  | 1. speaker                |
| 2. language              | 2. system                 |
| 3. people                | 3. linguistic             |
| 4. linguistics           | 4. translation            |
| 5. text                  | 5. chomsky                |
| 6. study                 | 6. creole                 |
| 7. formula               | 7. theory                 |
| 8. english               | 8. community              |
| 9. part                  | 9. variety                |
| 10. readability          | 10. social                |
+--------------------------+---------------------------+

+--------------------------+---------------------------+
| Simplest (Mathematics)   | Jargoniest (Mathematics)   |
|--------------------------+---------------------------|
| 1. number                | 1. n                      |
| 2. example               | 2. space                  |
| 3. called                | 3. theory                 |
| 4. mathematics           | 4. p                      |
| 5. used                  | 5. r                      |
| 6. line                  | 6. set                    |
| 7. way                   | 7. defined                |
| 8. one                   | 8. thus                   |
| 9. thing                 | 9. k                      |
| 10. mean                 | 10. sequence              |
+--------------------------+---------------------------+

+--------------------+---------------------+
| Simplest (Music)   | Jargoniest (Music)   |
|--------------------+---------------------|
| 1. music           | 1. film             |
| 2. born            | 2. single           |
| 3. american        | 3. tour             |
| 4. singer          | 4. release          |
| 5. known           | 5. show             |
| 6. called          | 6. track            |
| 7. band            | 7. production       |
| 8. died            | 8. would            |
| 9. composer        | 9. number           |
| 10. many           | 10. following       |
+--------------------+---------------------+

+----------------------+-----------------------+
| Simplest (Physics)   | Jargoniest (Physics)   |
|----------------------+-----------------------|
| 1. energy            | 1. system             |
| 2. make              | 2. thus               |
| 3. light             | 3. may                |
| 4. called            | 4. state              |
| 5. electron          | 5. model              |
| 6. nuclear           | 6. phase              |
| 7. thing             | 7. within             |
| 8. atom              | 8. case               |
| 9. one               | 9. due                |
| 10. computer         | 10. result            |
+----------------------+-----------------------+

+-------------------------+--------------------------+
| Simplest (Psychology)   | Jargoniest (Psychology)   |
|-------------------------+--------------------------|
| 1. people               | 1. individual            |
| 2. person               | 2. research              |
| 3. thing                | 3. associated            |
| 4. behaviour            | 4. experience            |
| 5. child                | 5. participant           |
| 6. way                  | 6. state                 |
| 7. psychology           | 7. including             |
| 8. animal               | 8. found                 |
| 9. like                 | 9. social                |
| 10. help                | 10. thus                 |
+-------------------------+--------------------------+

+------------------------+-------------------------+
| Simplest (Sociology)   | Jargoniest (Sociology)   |
|------------------------+-------------------------|
| 1. people              | 1. according            |
| 2. king                | 2. hamas                |
| 3. battle              | 3. member               |
| 4. person              | 4. individual           |
| 5. war                 | 5. role                 |
| 6. english             | 6. including            |
| 7. conflict            | 7. irgun                |
| 8. called              | 8. theory               |
| 9. edward              | 9. within               |
| 10. army               | 10. social              |
+------------------------+-------------------------+

Notes:

  1. This is a much older project (last commit before I touched it up for this blog post: April 29th, 2015), but I’ve just gotten around to posting about it! 

  2. At first, I didn’t strip out stopwords because I figured that differences in stopword usage was itself interesting. For example, “is” is used much more frequently in Simple Wikipedia, which I thought could be interpreted as an insight about sentence structure of simple explanations. But the more I thought about it, the more I realized that Simple Wikipedia uses shorter sentences, so these high-frequency glue verbs will tend to be overrepresented for that alone. In addition, since those same stopwords showed up across every single subject, I deemed them uninteresting enough to be filtered out.

    But in case you’re curious, here are the top “simple” outputs for Anthropology without filtering out stopwords: [wa, is, year, homo, were, are, about, tool, ago, they, specie, found, it] (I ran this at a different point in time than the above analysis, so the non-stopwords words may be different as well) 


Comments


More posts

The Monkey Brain is Sam Zhang's blog of weird datasets and mathematical curiosities.

You can try to read this through an Atom-compatible feed reader, but the posts often rely on embedded HTML and Javascript. For the adventurous, here is the feed.xml.

the zoo
github
sam.zhang.fyi/