GSoC 2016 Acceptance for Apache Nutch

Last year, I successfully completed my Google Summer of Code 2015 project, a Spark backend for Apache Gora. This year, I have been accepted to Google Summer of Code 2016 with my proposal for Apache Nutch.

Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project has diversified and now comprises two codebases, namely:

Nutch 1.x: A well matured, production ready crawler. 1.x enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.

Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area; storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

Definition of The Problem

Nutch 2.x has a REST API and a web application, but neither has a security layer. A security layer should be implemented that covers authentication and authorization, supports several authentication mechanisms, and comes with documentation and a refactoring of the existing code. My proposal therefore covers the design, development and implementation of this security layer. The work applies specifically to the Nutch 2.x codebase.
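To make the authorization half of that concrete, here is a minimal sketch in Java of a role-based permission check. The roles, operations and the permission table are all hypothetical names chosen for illustration; they are not taken from the Nutch codebase or from the proposal, and a real implementation would most likely rely on a security framework rather than a hand-written table.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: none of these names come from Nutch.
final class AuthorizationSketch {

    // Hypothetical roles a caller could hold after authenticating.
    enum Role { ADMIN, USER }

    // Hypothetical operations the REST API could expose.
    enum Operation { READ_STATUS, CHANGE_CONFIG, RUN_JOB }

    // Assumed policy: admins may do everything, users may only read status.
    private static final Map<Role, Set<Operation>> PERMISSIONS = Map.of(
        Role.ADMIN, EnumSet.allOf(Operation.class),
        Role.USER, EnumSet.of(Operation.READ_STATUS));

    // Returns true when the given role is permitted to perform the operation.
    static boolean isAllowed(Role role, Operation operation) {
        return PERMISSIONS.getOrDefault(role, Set.of()).contains(operation);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(Role.ADMIN, Operation.CHANGE_CONFIG)); // true
        System.out.println(isAllowed(Role.USER, Operation.CHANGE_CONFIG));  // false
    }
}
```

Authentication answers "who is calling?", while a check like this one answers "is that caller allowed to do this?", which is why the two are listed as separate steps in the plan.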

Background

Nutch already has a REST API through which administration and configuration tasks can be performed. My proposal offers a comprehensive security layer under NUTCH-1756. The existing code should be refactored, a security layer should be added, and the API documentation should be generated programmatically (e.g. with Miredot or Swagger).

Suggested Steps for Proposed Method

The suggested schedule and timeline are as follows:

  1. Analyzing the Problem
    • The problem will be analyzed in more detail.
  2. Authentication Implementation
    • HTTP Basic authentication
    • HTTP Digest authentication
    • SSL client authentication
    • Kerberos authentication
  3. Authorization Implementation
    • Authorization will be implemented
  4. API Documentation
    • API Documentation implementation
  5. Test
    • Implementation tests will be written and run.
  6. Documentation
    • Documentation will be prepared.
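
As an illustration of the simplest mechanism in step 2, HTTP Basic authentication, the sketch below decodes an `Authorization` header and checks the credentials. It is only a sketch under stated assumptions: the class name, the method names and the in-memory user map are hypothetical, passwords are kept in plain text purely for brevity, and none of this is the code that will end up in Nutch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hand-rolled check, not Nutch's implementation.
final class BasicAuthSketch {

    // Decodes "Basic <base64(user:password)>" into {user, password}.
    // Returns an empty Optional when the header is missing or malformed.
    static Optional<String[]> decode(String authorizationHeader) {
        if (authorizationHeader == null
                || !authorizationHeader.regionMatches(true, 0, "Basic ", 0, 6)) {
            return Optional.empty();
        }
        try {
            byte[] raw = Base64.getDecoder().decode(authorizationHeader.substring(6).trim());
            String pair = new String(raw, StandardCharsets.UTF_8);
            int colon = pair.indexOf(':');
            if (colon < 0) {
                return Optional.empty();
            }
            // The user name cannot contain ':', so split at the first colon.
            return Optional.of(new String[] {
                pair.substring(0, colon), pair.substring(colon + 1) });
        } catch (IllegalArgumentException e) {
            // Thrown by the decoder when the payload is not valid Base64.
            return Optional.empty();
        }
    }

    // Checks the decoded credentials against a hypothetical user store.
    // A real store would hold salted password hashes, not plain text.
    static boolean authenticate(String authorizationHeader, Map<String, String> users) {
        return decode(authorizationHeader)
            .map(credentials -> credentials[1].equals(users.get(credentials[0])))
            .orElse(false);
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("admin", "secret");
        String header = "Basic " + Base64.getEncoder()
            .encodeToString("admin:secret".getBytes(StandardCharsets.UTF_8));
        System.out.println(authenticate(header, users));      // true
        System.out.println(authenticate("Basic !!!", users)); // false
    }
}
```

Because Basic authentication only Base64-encodes the credentials, it is normally combined with TLS. That is one reason the plan also lists Digest, SSL client and Kerberos authentication.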

You can check my proposal on the Google Summer of Code web site. Starting this year, Google Melange is no longer used for GSoC. I’ll try to share my experiences during GSoC 2016 on this blog.

Resources:

[1] https://issues.apache.org/jira/browse/NUTCH-1756
[2] https://wiki.apache.org/nutch/NutchRESTAPI
[3] https://summerofcode.withgoogle.com/projects/#6099177868099584
[4] https://issues.apache.org/jira/browse/GORA-386
[5] https://issues.apache.org/jira/browse/NUTCH-2243
[6] https://issues.apache.org/jira/browse/NUTCH-2022
[7] https://github.com/apache/nutch
[8] https://en.wikipedia.org/wiki/Apache_Nutch

kamaci
