Show recent deleted pages #102

Open
KrzysztofMadejski opened this issue Jul 7, 2018 · 3 comments
@KrzysztofMadejski
Member

A deleted page is one whose URL now answers with 404.

A deleted page may also show up as a redirect:

  • to the homepage
  • to a dedicated 404 page
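
The three cases above could be told apart with a small check. This is only a minimal sketch, assuming each crawl result gives a status code, the requested URL, and the final URL after redirects. The function name `looks_deleted`, the `not_found_paths` parameter, and the example URLs are hypothetical and not part of the project.

```python
from urllib.parse import urlsplit


def _norm(path):
    # Treat "", "/" and trailing slashes uniformly.
    return path.rstrip("/") or "/"


def looks_deleted(status_code, requested_url, final_url, not_found_paths=("/404",)):
    """Return a short reason string if the page looks deleted, else None."""
    if status_code == 404:
        return "404"

    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    redirected = (requested.netloc, _norm(requested.path)) != (final.netloc, _norm(final.path))
    if not redirected:
        return None

    # Redirect to the homepage of the same site.
    if final.netloc == requested.netloc and _norm(final.path) == "/":
        return "redirect-to-homepage"

    # Redirect to a dedicated "not found" page (the path list is an assumption).
    if _norm(final.path) in not_found_paths:
        return "redirect-to-404-page"

    return None
```

A redirect to any other page returns `None` here, since a moved page is not necessarily a deleted one.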
KrzysztofMadejski added the enhancement (New feature or request) label Jul 7, 2018
@KrzysztofMadejski
Member Author

KrzysztofMadejski commented Jul 7, 2018

Draft approach using ES, based on the "search deleted phrases" query:

  • would require indexing data.web_objects_revisions.object_id to build the buckets
  • may not work because too many revisions could match a bucket
  • SQL might be more efficient
GET _search
{
  // return no hits; all the data comes from the aggregations
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dataset": "web_objects_revisions"
        }
      }
    }
  },
  "aggs": {
    "top-urls": {
      // split all revisions into buckets, one object/resource per bucket
      "terms": {
        "field": "data.web_objects_revisions.object_id",
        "size": 10
        // ordering is disabled in this draft:
        // "order": { "data.web_objects_revisions.timestamp": "desc" }
      },
      "aggs": {
        "top_hits": {
          "top_hits": {
            // one sample revision per bucket (object/resource) is enough
            "size": 1
          }
        },
        // last revision that answered 200 (the page was still there)
        "matching": {
          "filter": {
            "term": {
              "http_code": 200
            }
          },
          "aggs": { "last_seen": { "max": { "field": "data.web_objects_revisions.timestamp" }}}
        },
        // first revision that did not answer 200 (the page may be gone)
        "not_matching": {
          "filter": {
            "bool": {
              "must_not": {
                "term": {
                  "http_code": 200
                }
              }
            }
          },
          "aggs": { "first_seen": { "min": { "field": "data.web_objects_revisions.timestamp" }}}
        },
        // show only deleted pages: keep (bucket_selector) the buckets whose
        // first non-200 revision comes after their last 200 revision
        "deleted_pages": {
          "bucket_selector": {
            "buckets_path": {
              "first_not_matching_date": "not_matching>first_seen",
              "last_matching_date": "matching>last_seen"
            },
            "script": "params.first_not_matching_date > params.last_matching_date"
          }
        }
      }
    }
  }
}
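
The selection logic of the draft query above can be checked outside ES. The following is a rough sketch in plain Python, not the project's code: it assumes revisions arrive as `(object_id, timestamp, http_code)` tuples, and the function name `deleted_objects` is made up for illustration. Objects that have no 200 revision, or no non-200 revision, are left out, which is an assumption about how the draft should treat them.

```python
def deleted_objects(revisions):
    """Mirror the draft aggregation on an iterable of
    (object_id, timestamp, http_code) tuples.

    Returns {object_id: (last_ok, first_bad)} for objects whose first
    non-200 revision is later than their last 200 revision.
    """
    last_ok = {}    # per object: latest timestamp with code 200 ("matching")
    first_bad = {}  # per object: earliest timestamp with another code ("not_matching")

    for object_id, timestamp, http_code in revisions:
        if http_code == 200:
            last_ok[object_id] = max(timestamp, last_ok.get(object_id, timestamp))
        else:
            first_bad[object_id] = min(timestamp, first_bad.get(object_id, timestamp))

    # Same comparison as the bucket_selector script in the draft.
    return {
        oid: (last_ok[oid], first_bad[oid])
        for oid in last_ok.keys() & first_bad.keys()
        if first_bad[oid] > last_ok[oid]
    }
```

The comparison is strict about history: a page that failed once and later came back is not reported, because its first non-200 revision precedes its last 200 revision.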
  

@KrzysztofMadejski
Member Author

Early draft of SQL approach:

SELECT r.object_id, r.timestamp, r.code, p.timestamp AS problem_timestamp, p.code AS problem_code
    FROM web_objects_revisions r
    INNER JOIN
        -- id of the last problematic (code >= 400) revision of each object
        (SELECT object_id, MAX(id) AS revision_id FROM web_objects_revisions WHERE code >= 400 GROUP BY object_id) AS last_problematic_revisions
        ON r.object_id = last_problematic_revisions.object_id
    INNER JOIN
        -- code and timestamp of that revision; one way to fill the lookup the draft left open
        web_objects_revisions p ON p.id = last_problematic_revisions.revision_id;
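
One possible direction for the SQL approach is to report only objects whose latest revision is problematic. The sketch below runs against an in-memory SQLite database so it can be tried without the real data. The table layout (`id`, `object_id`, `timestamp`, `code`), the sample rows, and the function names are assumptions, and, like the draft, it relies on `id` growing with time.

```python
import sqlite3

# Objects whose most recent revision (highest id) has an error code.
QUERY = """
SELECT r.object_id, r.timestamp, r.code
FROM web_objects_revisions r
INNER JOIN (
    SELECT object_id, MAX(id) AS revision_id
    FROM web_objects_revisions
    GROUP BY object_id
) AS latest ON r.id = latest.revision_id
WHERE r.code >= 400
ORDER BY r.timestamp DESC
"""


def recent_deleted_pages(conn):
    """Return (object_id, timestamp, code) rows, newest first."""
    return conn.execute(QUERY).fetchall()


def demo_connection():
    """Build a tiny in-memory table with made-up sample revisions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE web_objects_revisions ("
        "id INTEGER PRIMARY KEY, object_id INTEGER, timestamp TEXT, code INTEGER)"
    )
    conn.executemany(
        "INSERT INTO web_objects_revisions VALUES (?, ?, ?, ?)",
        [
            (1, 10, "2018-07-01", 200),
            (2, 10, "2018-07-05", 404),  # object 10 ends on a 404
            (3, 20, "2018-07-02", 404),
            (4, 20, "2018-07-06", 200),  # object 20 recovered
            (5, 30, "2018-07-03", 200),  # object 30 never failed
        ],
    )
    return conn
```

With the sample rows, only object 10 is reported, since it is the only one whose last revision has a code of 400 or above.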

@KrzysztofMadejski
Member Author

@danielmacyszyn ready for styling: http://archiwum.io/deleted-pages
