Automatic text analyzer. Extracts the key context keywords and phrases from a text, either as an individual text or as part of a larger corpus. Because the literature describes many different approaches, the program is meant to be flexible and efficient, so that different approaches can be tested just by changing program arguments. It provides key modules for the underlying data structure, text analysis and visualization that are easy for other users to reuse, as there are not many good NLP libraries in Java.
The user provides a list of stopwords and a text file, and the program automatically extracts the keywords it finds most relevant in the text file, or extracts a list of key phrases to form an abstract.
Optionally, the user can provide a full text corpus, in which case the program aims to find relevance relative to that corpus.
The program writes results with their relative weights to the console, so that it can easily be used as part of a more comprehensive analysis or project. The output only needs to be saved to a text file and can then be exported to Excel or MATLAB, for example.
As an extra feature, the program may output an HTML file with a word-cloud visualization.
Run nutshell with command line arguments as follows:
usage: nutshell -f <source.txt> -om|-os|-oa <n> (-c <directory>) (-v) (-sc <option>)
-c <arg>      Optional: Corpus differential analysis vs all .txt files in the supplied dir
-f <arg>      Source .txt file
-h            Show this help
-oa <arg>     Abstract output <n>
-om <arg>     Multi-word keyword output <n>
-os <arg>     Single-word keyword output <n>
-sc <arg>     Optional: Scoring options:[DEGREE, WEIGHTED_DEGREE, ENTROPY, RELATIVE_DEGREE, FREQUENCY]
-stop <arg>   Optional: Stopwords file (default is stopwords_EN.txt)
-v            Optional: Create Visualization nutshell.html file
Nutshell has two modes of operation:
- Analyze a single .txt file (DEFAULT)
- Differential analysis vs a corpus of text (-c <directory>): the program automatically scans all .txt files in the supplied directory.
Each mode of operation may output single-word keywords (-os <n>), composite keywords of one or more words (-om <n>), or key phrases (-oa <n>). For key phrases, only punctuation acts as a delimiter; stopwords are kept in the phrase but do not add points to its weight.
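The delimiter rule above can be sketched in Java as follows. This is a minimal illustration, not Nutshell's actual code: the class name CandidateSplitter, the method split and the punctuation set are hypothetical, and it assumes that for keyword output (-os, -om) stopwords act as delimiters in addition to punctuation, which the text implies but does not state outright.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutshell: splits text into candidate word sequences.
final class CandidateSplitter {

    // Punctuation always ends a candidate. A stopword also ends it when
    // splitOnStopwords is true (assumed keyword behaviour); when false,
    // stopwords stay inside the candidate (key phrase behaviour).
    static List<List<String>> split(String text, Set<String> stopwords, boolean splitOnStopwords) {
        List<List<String>> candidates = new ArrayList<>();
        // Assumed punctuation set; the real delimiter list may differ.
        for (String phrase : text.toLowerCase(Locale.ROOT).split("[.,;:!?()\"\\n]+")) {
            List<String> current = new ArrayList<>();
            for (String word : phrase.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // empty phrase between two delimiters
                }
                if (splitOnStopwords && stopwords.contains(word)) {
                    // A stopword closes the candidate collected so far.
                    if (!current.isEmpty()) {
                        candidates.add(current);
                    }
                    current = new ArrayList<>();
                } else {
                    current.add(word);
                }
            }
            if (!current.isEmpty()) {
                candidates.add(current);
            }
        }
        return candidates;
    }
}
```

Calling CandidateSplitter.split(text, stopwords, true) yields keyword candidates, while passing false keeps whole punctuation-delimited phrases with their stopwords, as in the -oa case.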
The -om and -os outputs can also produce an HTML file named nutshell.html with a word-cloud visualisation of the results by adding the -v argument. -oa can also output a visualization, though at this point it is experimental only.
Nutshell requires a text file with stopwords, in any language. By default it searches for a file named stopwords_EN.txt, though any .txt file may be provided with the option -stop <filename>.
Nutshell builds a weighted directed graph of word co-occurrences. Words can be scored on this graph with the following options (-sc):
- WEIGHTED_DEGREE (DEFAULT): each word's degree multiplied by the corresponding edge weight.
- DEGREE: Word degree.
- ENTROPY: entropy = Sum(prob(w) x log(prob(w))). This score is more meaningful for corpus scoring, where the probability of finding a word is taken as the relative frequency of that word in the corpus.
- FREQUENCY: relative frequency of the word
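As a rough illustration of the graph and three of the scores listed above, here is a minimal Java sketch. It is not Nutshell's implementation: the class CooccurrenceGraph and its methods are hypothetical names. It assumes an edge a -> b counts how often b directly follows a, that DEGREE counts distinct incoming plus outgoing edges, and that WEIGHTED_DEGREE is the sum of the weights on those edges.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a weighted directed word co-occurrence graph.
final class CooccurrenceGraph {
    // edges.get(a).get(b) = number of times word b directly follows word a.
    private final Map<String, Map<String, Integer>> edges = new HashMap<>();
    private final Map<String, Integer> counts = new HashMap<>();
    private int totalWords = 0;

    // Adds one word sequence (for example a candidate keyword) to the graph.
    void addSequence(List<String> words) {
        for (int i = 0; i < words.size(); i++) {
            counts.merge(words.get(i), 1, Integer::sum);
            totalWords++;
            if (i + 1 < words.size()) {
                edges.computeIfAbsent(words.get(i), k -> new HashMap<>())
                     .merge(words.get(i + 1), 1, Integer::sum);
            }
        }
    }

    // DEGREE (assumed definition): distinct outgoing plus distinct incoming edges.
    int degree(String word) {
        int out = edges.getOrDefault(word, Map.of()).size();
        int in = 0;
        for (Map<String, Integer> targets : edges.values()) {
            if (targets.containsKey(word)) {
                in++;
            }
        }
        return out + in;
    }

    // WEIGHTED_DEGREE (assumed definition): sum of weights over the same edges.
    int weightedDegree(String word) {
        int sum = 0;
        for (int weight : edges.getOrDefault(word, Map.of()).values()) {
            sum += weight;
        }
        for (Map<String, Integer> targets : edges.values()) {
            sum += targets.getOrDefault(word, 0);
        }
        return sum;
    }

    // FREQUENCY: occurrences of the word divided by all words seen so far.
    double frequency(String word) {
        return totalWords == 0 ? 0.0 : counts.getOrDefault(word, 0) / (double) totalWords;
    }
}
```

Feeding the sequences produced for a file through addSequence and then sorting words by one of the three methods gives a ranked keyword list under these assumptions.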
When a word is compared against the full corpus, its score in the file under analysis is taken relative to its score in the corpus, with the exception of entropy, which is treated as additive.
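One possible reading of this rule, sketched in Java. Everything here is an assumption for illustration: the class CorpusScoring and its methods are hypothetical, "relative" is read as the file score divided by the corpus score, "additive" as the sum of the two, and the entropy sum is taken over the words of a keyword using the formula exactly as written above (without the leading minus sign of Shannon's definition).

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of combining a file score with a corpus score.
final class CorpusScoring {

    // Assumed combination rule: entropy scores are added, all others are divided.
    // If the corpus score is zero the file score is returned unchanged (an assumption).
    static double combine(double fileScore, double corpusScore, boolean isEntropy) {
        if (isEntropy) {
            return fileScore + corpusScore;
        }
        return corpusScore == 0.0 ? fileScore : fileScore / corpusScore;
    }

    // Sum(prob(w) x log(prob(w))) over the given words, where prob(w) is the
    // relative frequency of w in the corpus. Words with zero probability are skipped.
    static double entropy(List<String> words, Map<String, Double> corpusProbability) {
        double sum = 0.0;
        for (String word : words) {
            double p = corpusProbability.getOrDefault(word, 0.0);
            if (p > 0.0) {
                sum += p * Math.log(p);
            }
        }
        return sum;
    }
}
```

With this reading, a word that scores 0.5 in the file and 0.25 in the corpus gets a relative score of 2.0, so words that are unusually prominent in the file compared with the corpus rank higher.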
See pom.xml for the Maven dependencies. To build with Maven, run mvn install; the jar is created in target/.