Skip to content

Conversation

@lewismc
Copy link
Member

@lewismc lewismc commented Oct 20, 2025

Proposed fix for https://issues.apache.org/jira/browse/NUTCH-3130
This is a fairly straightforward PR. Some grunt work involved...
I took a bit of time to investigate issues with and options for removing instances of @override finalize()... this may need more investigation and certainly some testing to make sure this PR doesn't break any behavior.

@lewismc lewismc self-assigned this Oct 20, 2025
int threshold = Math.min(3, searched.length() / TRESHOLD_DIVIDER);
for (int i = 0; i < normalized.length && value == null; i++) {
if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < threshold) {
if (StringUtils.compareIgnoreCase(searched, normalized[i]) < threshold) { //.getLevenshteinDistance(searched, normalized[i]) < threshold) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to investigate whether this is a suitable replacement and also need to remove thie comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpellCheckedMetadata is used only by protocol-http and protocol-httpclient. We could deprecate it, use CaseInsensitiveMetadata instead (see NUTCH-3002) and later remove the class SpellCheckedMetadata entirely. Nowadays, spell-checking HTTP headers sounds odd, while 20 years ago it might have been a good idea.

Changing the behavior in opposite to the name does not seem the right way.

If we want to keep the class, we need to use LevenshteinDistance.


@Override
protected void finalize() throws Throwable {
shutDown();
Copy link
Member Author

@lewismc lewismc Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we can simply remove the call to shutdown. I need to further investigate options and confirm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for me.


@Override
protected void finalize() {
/**
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure on this right now. I need to do some more research.

@lewismc
Copy link
Member Author

lewismc commented Oct 20, 2025

I did want to also note, there are 15 instances of the @deprecated annotation across 9 files in the codebase:
src/java/org/apache/nutch/tools/CommonCrawlFormatFactory.java - 1 instance
src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java - 2 instances
src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java - 3 instances
src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java - 3 instances
src/java/org/apache/nutch/util/NutchJob.java - 1 instance
src/java/org/apache/nutch/protocol/RobotRulesParser.java - 1 instance
src/java/org/apache/nutch/protocol/ProtocolStatus.java - 2 instances
src/java/org/apache/nutch/net/protocols/ProtocolException.java - 1 instance
src/java/org/apache/nutch/crawl/Generator.java - 1 instance

I'm not sure whether is the correct time to be removing these methods but if we decide to keep them we should definitely augment the Javadoc to detail the replacement API usage.

Copy link
Contributor

@sebastian-nagel sebastian-nagel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thanks, @lewismc!

Some work still to be done. Let me know whether I shall take over.

int threshold = Math.min(3, searched.length() / TRESHOLD_DIVIDER);
for (int i = 0; i < normalized.length && value == null; i++) {
if (StringUtils.getLevenshteinDistance(searched, normalized[i]) < threshold) {
if (StringUtils.compareIgnoreCase(searched, normalized[i]) < threshold) { //.getLevenshteinDistance(searched, normalized[i]) < threshold) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpellCheckedMetadata is used only by protocol-http and protocol-httpclient. We could deprecate it, use CaseInsensitiveMetadata instead (see NUTCH-3002) and later remove the class SpellCheckedMetadata entirely. Nowadays, spell-checking HTTP headers sounds odd, while 20 years ago it might have been a good idea.

Changing the behavior in opposite to the name does not seem the right way.

If we want to keep the class, we need to use LevenshteinDistance.


@Override
protected void finalize() throws Throwable {
shutDown();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for me.

installExtensions(this.fRegisteredPlugins);
} catch (PluginRuntimeException e) {
LOG.error("Could not install extensions.", e.toString());
LOG.error("Could not install extensions. {}", e.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Or: LOG.error("Could not install extensions:", e);

@lewismc
Copy link
Member Author

lewismc commented Dec 18, 2025

I didn't see you had responded here @sebastian-nagel . I will revisit this PR, update it and perform some more testing.
I'm thinking about profiling some crawl cycles to test removal of finalize() methods. I am not confident with those proposals right now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants