
Description
Hi,
I am following the steps provided here to train my model.
I have pre-processed the datapack, but when I try to run the "Build Data Structures and extract anchor text" step, the job fails with a "GC overhead limit exceeded" error.

I have increased the MapReduce and Hadoop memory settings to 15 GB, and I have also passed options via
-Dmapreduce.reduce.java.opts and -Dmapreduce.reduce.memory.mb.
My system has 8 cores and 32 GB of RAM, and I am using Java 8. This is the command I am running:
hadoop \
jar target/FEL-0.1.0-fat.jar \
com.yahoo.semsearch.fastlinking.io.ExtractWikipediaAnchorText \
-Dmapreduce.map.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapreduce.reduce.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dyarn.app.mapreduce.am.env="JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" \
-Dmapred.job.map.memory.mb=15144 \
-Dmapreduce.map.memory.mb=15144 \
-Dmapreduce.reduce.memory.mb=15144 \
-Dmapred.child.java.opts="-Xmx15g" \
-Dmapreduce.map.java.opts='-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC' \
-Dmapreduce.reduce.java.opts="-Xmx15g -XX:NewRatio=8 -XX:+UseSerialGC" \
-input wiki/${WIKI_MARKET}/${WIKI_DATE}/pages-articles.block \
-emap wiki/${WIKI_MARKET}/${WIKI_DATE}/entities.map \
-amap wiki/${WIKI_MARKET}/${WIKI_DATE}/anchors.map \
-cfmap wiki/${WIKI_MARKET}/${WIKI_DATE}/alias-entity-counts.map \
-redir wiki/${WIKI_MARKET}/${WIKI_DATE}/redirects
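For reference, I noticed while reading about YARN memory settings that the JVM heap (-Xmx) is usually kept below the container size (mapreduce.*.memory.mb) so that off-heap overhead still fits inside the container, whereas in my command both are 15 GB. Here is a sketch of what I believe the adjusted options would look like (the 12g heap and 15360 MB container size are illustrative values I picked, not numbers from the FEL documentation):

# Illustrative adjustment (my assumption, not from the FEL docs): keep the JVM
# heap at roughly 80% of the YARN container size so off-heap overhead fits.
-Dmapreduce.map.memory.mb=15360 \
-Dmapreduce.map.java.opts=-Xmx12g \
-Dmapreduce.reduce.memory.mb=15360 \
-Dmapreduce.reduce.java.opts=-Xmx12g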
Could you please suggest why this might be happening?
Pardon me, as I am a novice to Hadoop and Java.