warceater

command module
v0.0.0-...-b788d29 Latest
Published: Mar 9, 2023 | License: Apache-2.0 | Imports: 2 | Imported by: 0

README

warceater

WARCeater - a reader for WARC forum scrapes (as found on archive.org) that builds a searchable full-text index of them.

It was an experiment to see whether reconstructing a forum from its archived state (WARC files as found on archive.org) was feasible.

The idea is to work in three phases. The first phase parses the WARC files and uses a set of HTML parsers and XPath/CSS selectors to extract 'forum post' objects, which are stored as JSON objects. The second phase indexes these JSON objects into a disk-based full-text search index (Bleve), so that posts can easily be found by content or by ID. The last phase puts a simple web UI on top that reformats the JSON objects into pages, so that threads can be shown in a sensible way.
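As a rough illustration of the middle of that pipeline, here is a minimal sketch of how JSON-encoded posts might be loaded and made searchable by content or by ID. It is not the code from this repository: the `Post` fields, the `Index` type and its methods are hypothetical names chosen for the example, and a toy in-memory inverted index stands in for the disk-based Bleve index so that the sketch needs only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Post is a hypothetical shape for an extracted forum post.
// The field names are assumptions, not the project's actual schema.
type Post struct {
	ID       string `json:"id"`
	ThreadID string `json:"thread_id"`
	Author   string `json:"author"`
	Body     string `json:"body"`
}

// Index is a toy in-memory inverted index (term -> set of post IDs).
// It only stands in for a real full-text index such as Bleve.
type Index struct {
	terms map[string]map[string]struct{}
	posts map[string]Post
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{
		terms: make(map[string]map[string]struct{}),
		posts: make(map[string]Post),
	}
}

// tokenize lowercases s and splits it on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// AddJSON decodes one JSON-encoded post and indexes the words of its body.
func (ix *Index) AddJSON(data []byte) error {
	var p Post
	if err := json.Unmarshal(data, &p); err != nil {
		return err
	}
	ix.posts[p.ID] = p
	for _, t := range tokenize(p.Body) {
		if ix.terms[t] == nil {
			ix.terms[t] = make(map[string]struct{})
		}
		ix.terms[t][p.ID] = struct{}{}
	}
	return nil
}

// Search returns the sorted IDs of posts whose body contains the given word.
func (ix *Index) Search(term string) []string {
	ids := make([]string, 0)
	for id := range ix.terms[strings.ToLower(term)] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// Get looks up a post by its ID.
func (ix *Index) Get(id string) (Post, bool) {
	p, ok := ix.posts[id]
	return p, ok
}

func main() {
	ix := NewIndex()
	raw := []string{
		`{"id":"p1","thread_id":"t1","author":"alice","body":"Parsing WARC files in Go"}`,
		`{"id":"p2","thread_id":"t1","author":"bob","body":"Full-text search works"}`,
	}
	for _, r := range raw {
		if err := ix.AddJSON([]byte(r)); err != nil {
			fmt.Println("skipping post:", err)
		}
	}
	fmt.Println(ix.Search("warc"))
}
```

Keeping the thread ID on every post is what would let a web UI group search hits back into threads; a real index would also handle persistence, ranking and phrase queries, which this sketch deliberately leaves out.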

It all worked, but it needs some real maintenance and refactoring if you want to use it for something that supports multiple formats.

This was a very early project to learn some Go basics, so the code quality will not be great.

Documentation

Overview

Copyright © 2021 NAME HERE <EMAIL ADDRESS>

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Directories

Path Synopsis
pkg
