Comprehend-ing My Weekly Notes

Just for fun I decided to run AWS Comprehend over the weekly notes on this site. This is how I went about it, what came out of it, and what I think of Comprehend.

Comprehend

AWS Comprehend is a machine learning service that was announced at re:Invent 2017. Or rather, it’s several services for text analysis or natural language processing that are grouped together under a single name. Right now it lets you detect the dominant language, sentiment, key phrases, and entities.

Since I’d be running this over my own writing, the language detection wasn’t something I needed, but the others sounded like fun. I was considering running this on everything I’ve written when I ran into an unfortunate limitation: Comprehend only accepts up to 5000 characters per request. Most of my articles are longer than that, so I decided to limit myself to the posts with the biggest chance of coming in under that limit: my weekly notes.

Especially considering my self-enforced length limit, this seemed like a good option.

Turning the posts into plain text

Naturally, I don’t want any links or HTML entities messing up my results[1], so I needed to clean these up. That turned out to be surprisingly easy. All of my writing on this site is done in Markdown, with a metadata section for Hugo.

I was thinking about writing some kind of parser to strip everything out, but while searching for a good library to do that I noticed a comment about a tool I’ve used in the past: Pandoc.

Pandoc allows you to transform text files from one format to another. Translating from Markdown to plain text therefore strips out everything I don’t want, so I wrote a tiny script to convert all of my weekly notes[2].

#!/bin/bash
# Convert every weekly note from Markdown to plain text.
source_path=${1}
destination_path=${2}

for filename in "${source_path}"/*.md; do
    stripped_filename=$(basename "${filename}")
    # --wrap=none keeps each paragraph on a single line.
    pandoc -f markdown -t plain --wrap=none "${filename}" -o "${destination_path}/${stripped_filename}"
done
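
Assuming the script is saved as to-plain.sh (the name and paths here are mine, purely for illustration), running it looks like this:

./to-plain.sh content/weeknotes plain-notes

The first argument is the directory containing the Markdown files, the second is where the plain text versions end up.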

Running Comprehend

Comprehend offers a batch version, but that only handles 25 strings[3] at a time, so it doesn’t help much. Besides, I wanted all my results separated so that I can do any type of analysis I can think of. Again, a simple bash script for running this sufficed.

#!/bin/bash
source_path=${1}
destination_path=${2}

for filename in "${source_path}"/*; do
    stripped_filename=$(basename "${filename}")
    # Skip anything over Comprehend's 5000 character limit.
    filesize=$(wc -c < "${filename}")
    if [[ ${filesize} -gt 5000 ]]; then
      continue
    fi
    echo "Evaluating ${stripped_filename}"
    aws comprehend detect-entities --language-code en --text "$(cat "${filename}")" > "${destination_path}/${stripped_filename}.entities"
    aws comprehend detect-key-phrases --language-code en --text "$(cat "${filename}")" > "${destination_path}/${stripped_filename}.phrases"
    aws comprehend detect-sentiment --language-code en --text "$(cat "${filename}")" > "${destination_path}/${stripped_filename}.sentiment"
done
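
Running it follows the same pattern as the Pandoc script (again, hypothetical paths), and assumes the AWS CLI is configured with credentials that are allowed to call Comprehend:

./comprehend.sh plain-notes results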

You’ll notice the file size check in the script; this is to prevent me from making calls that go over the size limit. And it turns out I needed it: even stripped down, only 57 of the 97 weekly notes I’ve written were short enough to be parsed.

Let’s take a closer look at the comprehend command before we move on though.

aws comprehend detect-sentiment --language-code en --text "This is my wonderful text!"

Like every AWS CLI command, a comprehend call is constructed of several parts:

  1. aws - the name of the CLI tool
  2. comprehend - the name of the service
  3. detect-sentiment - the name of the command

Everything after this is a parameter; in this case, both are mandatory[4]. The language-code is limited to English and Spanish, however, so you can’t use these services for any other language.
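
For comparison, the language detection command from that footnote only needs the text itself:

aws comprehend detect-dominant-language --text "This is my wonderful text!"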

Detect sentiment

Let’s start with the sentiment results:

Sentiment   Count   Percentage
NEUTRAL        32          56%
POSITIVE       18          32%
MIXED           4           7%
NEGATIVE        3           5%

Now let’s see how to get from the script above to useful output like this. After running the script, I have a folder full of files containing JSON objects, which leads me to my best friend for parsing JSON: jq.

The JSON objects look like the example below, and while I could count each type of Sentiment manually (well, grep for the words and wc -l the result), parsing them properly is more interesting.

{
    "SentimentScore": {
        "Mixed": 0.0962539091706276,
        "Positive": 0.18013328313827515,
        "Neutral": 0.64349365234375,
        "Negative": 0.08011914044618607
    },
    "Sentiment": "NEUTRAL"
}
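
For completeness, the quick-and-dirty count I alluded to would be something like this, run from the results folder:

grep -l '"Sentiment": "NEUTRAL"' *.sentiment | wc -l

That works because each sentiment file contains exactly one Sentiment value, but it means running one command per sentiment and calculating the percentages by hand.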

One interesting feature of jq that I discovered while working on this is the -s or --slurp flag. This flag makes jq wrap all of the objects it parses in an array, which basically means that it puts all of the files I’m parsing into one big array that I can then work with.

In addition, I put the jq query in a separate file, which makes it a little bit easier to deal with. This is triggered with the -f or --from-file flag, followed by the filename.
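
To illustrate what --slurp does, here’s a minimal standalone example (with toy data): two separate JSON objects on stdin become a single two-element array.

echo '{"a": 1} {"a": 2}' | jq -s 'length'
2

With that in place, the full query in sentiment.jq looks like this: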

.
| group_by(.Sentiment)
| map({Sentiment: .[0].Sentiment, Count: length, Percentage: length| (. * (100 / ($TOTAL | tonumber)))})
| sort_by(.Count)
| reverse

There are a couple of things going on here. First, I group by the value of Sentiment. I then map that result: Sentiment itself, plus the length of each grouping, which represents the number of items in it. After that, I also calculate the percentage, which is done in a slightly hacky way.

length| (. * (100 / ($TOTAL | tonumber))) is the complete percentage calculation. It starts with length and then uses that value in a calculation. The calculation is the standard subset * (100 / total) formula for a percentage, except that the total comes in as an argument (which is treated as a string and therefore needs to be converted to a number). For the 32 neutral notes out of 57 in total, for example, that works out to 32 * (100 / 57) ≈ 56%.

Once I’ve got the mapping, I just sort the result into the output I want: descending by number of occurrences.

Let’s have a look at how this is run. As it was all just for my own benefit, I didn’t bother with super beautiful output and just translated the resulting JSON into the table above manually.

#!/bin/bash
source_path=${1}
# Deliberately left unquoted below so the glob expands to all sentiment files.
source_search="${source_path}/*.sentiment"

echo "Combined results:"
# First count the total number of results, then pass it in as an argument.
TOTAL=$(cat $source_search | jq -s '. | length')
cat $source_search | jq --arg TOTAL "${TOTAL}" -s -f sentiment.jq

There isn’t anything special here. I take the path, add some specificity to it[5], and run it through jq with the query from the file. The only addition is that I first calculate the total number of items and pass that in as an argument; this is the same argument used in the percentage calculation above. It’s not a great solution, but I couldn’t figure out how to get that number within a single jq command. If you do know, please tell me and I’ll update it here.
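
For what it’s worth, here is one untested sketch: since --slurp already hands the query a single array, binding its length to a variable at the top of the query might make the extra argument unnecessary, something like:

(length) as $TOTAL
| group_by(.Sentiment)
| map({Sentiment: .[0].Sentiment, Count: length, Percentage: (length * 100 / $TOTAL)})
| sort_by(.Count)
| reverse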

Below are the year-specific results. As I only have 2 entries from 2015 that were small enough, I don’t show anything for that year. Please keep in mind that 2018 is still very young, and that percentages are rounded and therefore might not add up to 100%.

Sentiment   2016      2017       2018
NEUTRAL     7 (41%)   19 (61%)   5 (71%)
POSITIVE    7 (41%)   9 (29%)    1 (14%)
NEGATIVE    2 (12%)   1 (3%)     0 (0%)
MIXED       1 (6%)    2 (6%)     1 (14%)

Based on the results, my writing is mostly neutral or positive. That is of course great news, although it means I should aim to make it more positive[6], as that’s always nicer to read.

Entities

Next up, I wanted to do the same for the entities. I’ve limited the results here to the top 10.

Entity       Type           Count   Percentage
AWS          ORGANIZATION     107         5.8%
Google       ORGANIZATION      70         3.8%
Apple        ORGANIZATION      54         2.9%
Docker       TITLE             44         2.4%
Android      TITLE             33         1.8%
Microsoft    ORGANIZATION      27         1.5%
iOS          TITLE             24         1.3%
Jenkins      PERSON            24         1.3%
Windows      TITLE             23         1.3%
Kubernetes   TITLE             23         1.3%

These results are interesting not only in what they show of my writing, but also in what Comprehend sees as an entity and what type it assigns to it (Jenkins as a PERSON, for example).

I’m not surprised that AWS is my most used term in the weekly notes; it’s one of my main interests, and they have a lot of updates. The remaining terms all crop up once in a while, so they make sense too. I didn’t expect Windows to show up that much though, and Kubernetes is probably mostly down to the last year, as my interests grew in that direction.

But let’s have a look at the code behind this. This time the JSON object in each file contains a nested array, which means I had to deal with that.

{
    "Entities": [
        {
            "Text": "past week",
            "Score": 0.7420563101768494,
            "Type": "DATE",
            "BeginOffset": 114,
            "EndOffset": 123
        },
        ...
    ]
}

That said, aside from the jq query, everything I did is very similar to the sentiment analysis, so I’ll just focus on the query. I’ve again split it up for readability.

[.[]
| .Entities
| .[]
| select(.Type != "QUANTITY")
| select(.Type != "DATE")]
| group_by(.Text)
| map({Item: .[0].Text, Type: .[0].Type, Count: length, Percentage: length| (. * (100 / ($TOTAL | tonumber))) })
| sort_by(.Count)
| reverse
| [limit(10;.[])]

You may notice the first 5 lines being wrapped in an array; this ensures that the nested values from the .Entities are collected together before the grouping. If I didn’t do that, it would result in multiple lists of results, one per file.

The only other major differences between this and the sentiment query are the limit at the end and the select filters that strip out quantities and dates. Without those filters, the results contained entities like “a couple” and “Australia”, which I was not interested in at all. I do still include them in the total used for the percentage calculation though.

Yearly results for 2016 and 2017 are below. Adding 2018 to the table didn’t help with clarity, and didn’t add much either, so I’ve left it out.

2016 Entity   2016 Count   2017 Entity   2017 Count
AWS           27 (4%)      AWS           61 (6%)
Apple         25 (4%)      Google        47 (5%)
Jenkins       20 (3%)      Docker        32 (3%)
Google        15 (2%)      Apple         27 (3%)
Microsoft     14 (2%)      Kubernetes    21 (2%)
Android       12 (2%)      Azure         20 (2%)
iOS           12 (2%)      Android       20 (2%)
Lambda        12 (2%)      Oracle        16 (2%)
Linux         11 (2%)      Cloudflare    15 (2%)
Facebook      10 (2%)      Microsoft     13 (1%)

Not very surprising results, although there is a clear shift to container technologies in 2017.

Key Phrases

In a way this is the least interesting part, except for seeing if I use certain phrases a bit too much. Or should I say, a lot.

Phrase       Count
a lot           83
AWS             55
Google          46
Apple           40
things          38
Docker          32
people          31
a couple        27
Kubernetes      27
a bit           26

I’ve left out the percentages here, as aside from the first one they’re all under 1%. You can see a lot of overlap with the entities, and not really a lot of actual phrases. When I looked at the raw data I noticed that most of the phrases[7] are unique, which makes the results pretty useless for this kind of analysis.

[[.[]
| .KeyPhrases]
| flatten
| .[]
| select(.Text != "[1]")
| select(.Text != "[2]")
| select(.Text != "[3]")]
| group_by(.Text)
| map({Text: .[0].Text, Count: length, Percentage: length| (. * (100 / ($TOTAL | tonumber)))})
| sort_by(.Count)
| reverse
| [limit(10;.[])]

The jq query is very similar to the entities one, with the only difference being that I filtered out some bracketed values to hide my tendency for footnotes[8], as they’re not very helpful.

Verdict

Comprehend is an interesting tool to play around with, and I had fun with it as a learning experience, but it seems to be at an early stage of its development. In its current form I can see how it would be useful in a number of situations, which the marketing already seems aimed at. That said, there are several improvements I’d like to see, even just in the current functionality.

First of all, that 5000 character limit has to go. Either increase it by quite a bit, or get rid of it completely. AWS charges per 100 characters, so the most likely reason I can imagine is some limitation in the machine learning. Whatever the reason, if Comprehend can’t even parse the articles on a site like this, it becomes far less useful.

At this time the sentiment analysis is mostly a toy, as it only distinguishes between positive and negative. While that can be useful for, say, parsing communication to and from a helpdesk, it’s not much use beyond that. It would be good to see more refinement here, even in the form of sub-sentiments: for example, a POSITIVE rating for an article, but with high scores for snark and sarcasm and maybe some happiness mixed in. Yes, this is more complex, but I never said I would ask for the easy stuff.

The entities seem to work reasonably well at this point, but I can’t tell if Comprehend is limited by the domains it knows. Naturally it will have been trained on tech-related items, so I got good matches, but I don’t know if it will perform as well in other domains.

As I said when discussing them, the key phrases don’t seem to be very useful at this point. The phrases themselves are too unique unless they’re very short[9]. One option here could be to match similar phrases, even if they’re not exactly the same.

With regards to additional features, the obvious end goal for a service like this would be the ability to summarise the content you provide it. In that regard, the key components seem to be there, but they’re not working together yet. A first step would be the ability to get a combined result from a single call, instead of calling 3 different APIs.

I am interested in seeing where Comprehend will go in the future. Not only in how it will expand its current capabilities, but also if the results it gives will change over time. In order to test that, I put my code and the raw and “comprehended” data up on GitHub so I can compare it in maybe a year.


  1. Or causing articles to be excluded because of the length.

  2. As mentioned in Writing on iOS, the Markdown files aren’t all in a single directory anymore, but there was only 1 exception so I did that one manually.

  3. aka files

  4. Except in the case of detect-dominant-language, where you obviously don’t need to provide the language code.

  5. There are similar source variables for year specific results.

  6. When appropriate.

  7. With over 7000 key phrases in total.

  8. Not that I think there’s anything wrong with those.

  9. Unless my writing is unusual in the amount I repeat myself.
