Compare commits

...

2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Ventilaar | 1918a03e05 | add recently added view | 2024-04-02 18:42:56 +02:00 |
| Ventilaar | ed4f8b03eb | Update readme to reflect current status of project | 2024-03-30 22:46:03 +01:00 |
8 changed files with 167 additions and 41 deletions

View File

@@ -3,27 +3,67 @@
This project will be awesome, but only if I invest enough time. This software will replace my
current cronjob yt-dlp archive service.
Partially inspired by [hobune](https://github.com/rebane2001/hobune). While that project is amazing on its own, it's just not scalable.
## The idea
The new setup will either run fully in Flask, including the task that checks the
YouTube channels every x hours, or Flask will be used as the GUI frontend and a separate
script will do the channel archiving. I have not decided yet.
What currently works is that the GUI frontend queries a separate database while a cronjob
handles the downloading of new videos from a list of channels.
With over 250k videos, scaling the current cronjob yt-dlp archive task is just really hard. File types change, things get partially downloaded and such.
yt-dlp is partially to blame, because it's a package that needs to change all the time, and some of those changes are not accounted for.
yt-dlp will still do the downloads, but a Flask frontend will be developed to make all downloaded videos easily indexable.
For it to be quick (unlike hobune) a database has to be implemented. This could be solved by a static-site-generator type of software, but that is not my choice.
The whole software package will use PostgreSQL as the data backend and Celery for background tasks.
Current development, however, uses MongoDB just because it's easy.
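For the Celery variant, the periodic channel check could look roughly like the sketch below. This is only an illustration assuming a RabbitMQ broker on localhost; the task body and schedule are hypothetical, not the project's actual code.
```python
from celery import Celery
from celery.schedules import crontab

app = Celery('ayta', broker='amqp://guest:guest@localhost:5672/')

@app.task(ignore_result=True)
def check_channels():
    """Enumerate archived channels and queue any new video ids."""
    ...  # placeholder: today this logic still lives in the cronjob

# run the check every x hours (here: 6) via celery beat
app.conf.beat_schedule = {
    'check-channels': {'task': check_channels.name,
                       'schedule': crontab(minute=0, hour='*/6')},
}
```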
## How it works currently (legacy)
In the legacy folder you will find files that are currently in my archive project. How it works is
that I have a cronjob running every 24 hours which then runs yt-dlp with a config file. In that config
file a channel list contains all the channels that yt-dlp needs to update. If a new video has been
uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles at that time
(rip community captions, will not forget you) and a JSON file with all the rest of the metadata. Oh,
and also the thumbnail.
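As an illustration only, the cron entry plus yt-dlp config could look something like this; the paths, schedule and filenames are hypothetical, the options simply mirror the behaviour described above:
```
# /etc/cron.d/yt-archive -- hypothetical schedule, once a day
0 3 * * * archiver yt-dlp --config-location /opt/archive/yt-dlp.conf

# /opt/archive/yt-dlp.conf -- options in the spirit described above
# prefer a 720p version of the video
-f "bv*[height<=720]+ba/b[height<=720]"
# the channel list, plus an archive file so finished ids are skipped
--batch-file /opt/archive/channels.txt
--download-archive /opt/archive/seen.txt
# only look at the first 50 uploads per channel (see below)
--playlist-end 50
# metadata sidecar, all subtitles available at that time, and the thumbnail
--write-info-json
--write-subs
--sub-langs all
--write-thumbnail
-o "/data/videos/%(channel_id)s/%(upload_date)s/%(id)s/%(title)s.%(ext)s"
```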
This works, but it is very slow and uses lots of "API" calls to YouTube, which will sometimes get
the IP blocked. This is why full channel upload pages are no longer downloaded; I have limited it to the first 50 videos per channel.
## Goals
Some goals have been set up which prioritise functionality for the software package.
The starting point is that info.json files of videos are loaded into the MongoDB database, from which Flask
generates the pages for channels and videos. This has major limitations which will not be described right now
but are reflected in the goals.
### Stage 1
Tasks which have to be finished before the GUI frontend is usable as a manager and as a user, in no particular order.
- [x] Have videos and channels listed on a page
- [x] Have a secured admin page where the database can be managed
- [x] Have working video streaming
- [x] CI/CD pipeline for quicker deployment
- [x] Add caching to speed up pages
- [x] Add ratelimiting for expensive pages
- [x] Ability to show cronjob logs to easily troubleshoot
### Stage 2
Extra functionality for further development of features.
- [x] Fix video titles on disk with slugs
- [x] Working search functionality
- [x] Video reporting functionality
- [x] Ability (for external applications) to queue up video ids for download
- [x] Add websub requesting and receiving ability. (not fully usable yet without celery tasks)
- [ ] OIDC or Webauthn logins instead of static argon2 passwords
### Stage 3
Mainly focused on retiring the cronjob based scripts and moving them to Celery based tasks.
- [ ] Manage videos by ID instead of on a per-channel basis
- [ ] Download videos from the queue
- [ ] Manage websub callbacks
### Stage 4
MongoDB finally has its limitations.
- [ ] Migrate to PostgreSQL
### Stage ...
Since this is my flagship software, more features will be added.
It may take some time since this is just a hobby for me, and I'm not a programmer by title.
## Things learned
### Video playlists
@@ -50,26 +90,22 @@ If you swap the channel name for the channel id, the folders will never change.
### Storage structure
The following folder structure is pretty nice for using static scripts. The one drawback
is that you can't search for video IDs or titles, because the search takes too long.
This is mainly why we need a new system using a database.
```
./videos/{channel_id}/{upload_date}/{video_id}/video_title.mp4
```
For the new system, using a blob-like storage will be key. I had the following in mind: it will be an independent
random key and not the YouTube video ID, because I have noticed that multiple real videos exist under the same key when
uploaders replace old videos.
This is mainly why we need a new system using a database, mainly for search.
The following structure is easily scalable and usable in an object storage format.
```
-| data
| - videos
| - 128bit_random_id.mp4
| - subtitles
| - same_random_id_EN.srt
| - same_random_id_DE.srt
| - thumbnails
| - 128bit_random_id.jpg
./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
```
## API things learned
### YouTube push notifications in API form exist
Using the PubSubHubbub service provided by Google, we will implement downloading videos based on uploads.
The API is based on WebSub, which is well documented.
The hub will send Atom XML notifications when a channel uploads a video and when a video is deleted.
The goal is to download a video when a notification gets through, and to run a full channel sync when a video is deleted.
This will run alongside periodic full channel polling, to download videos which the hub has not notified us about.
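A minimal sketch of the subscription request, assuming the requests library; the callback URL is hypothetical and has to match a route the app actually exposes:
```python
import requests

HUB = 'https://pubsubhubbub.appspot.com/subscribe'
TOPIC = 'https://www.youtube.com/xml/feeds/videos.xml?channel_id={}'

def subscribe(channel_id, callback_url):
    """Ask Google's WebSub hub to send upload/delete notifications."""
    r = requests.post(HUB, data={
        'hub.mode': 'subscribe',
        'hub.topic': TOPIC.format(channel_id),
        'hub.callback': callback_url,  # e.g. https://example.org/api/websub
    })
    r.raise_for_status()  # the hub replies 202, then verifies the callback

# The hub verifies by GETting the callback with a hub.challenge parameter,
# which must be echoed back; notifications then arrive as Atom XML POSTs.
```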
### Etag is useful
When we call the API for 50 items in a playlist, we also get an ETag back.
This is a sort of hash of the returned data.
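A sketch of how that can be used, with placeholder playlist id and API key: sending the stored ETag back as If-None-Match makes the API answer 304 Not Modified when the data is unchanged, so unchanged playlists can be skipped.
```python
import requests

URL = 'https://www.googleapis.com/youtube/v3/playlistItems'
params = {'part': 'snippet', 'playlistId': 'UU_PLACEHOLDER',
          'maxResults': 50, 'key': 'API_KEY'}

first = requests.get(URL, params=params).json()
etag = first['etag']  # the hash-like token described above

again = requests.get(URL, params=params, headers={'If-None-Match': etag})
if again.status_code == 304:
    print('playlist unchanged, nothing to re-process')
```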

View File

@@ -1,15 +1,13 @@
def create_app(test_config=None):
import os, secrets
from flask import Flask
from ayta.extensions import limiter, caching, celery_init_app
from ayta.extensions import limiter, caching, celery_init_app, oidc
from werkzeug.middleware.proxy_fix import ProxyFix
from . import filters
config = {'MONGO_CONNECTION': os.environ.get('AYTA_MONGOCONNECTION', 'mongodb://root:example@192.168.66.140:27017'),
'S3_CONNECTION': os.environ.get('AYTA_S3CONNECTION', '192.168.66.111:9001'),
'S3_ACCESSKEY': os.environ.get('AYTA_S3ACCESSKEY', 'lnUiGClFVXVuZbsr'),
'S3_SECRETKEY': os.environ.get('AYTA_S3SECRETKEY', 'Qz9NG7rpcOWdK2WL'),
'OIDC_CLIENT_SECRETS': os.environ.get('AYTA_OIDC_PATH', 'client_secrets.json'),
'CACHE_TYPE': os.environ.get('AYTA_CACHETYPE', 'SimpleCache'),
'CACHE_DEFAULT_TIMEOUT': int(os.environ.get('AYTA_CACHETIMEOUT', 6)),
'SECRET_KEY': os.environ.get('AYTA_SECRETKEY', secrets.token_hex(32)),
@@ -18,6 +16,10 @@ def create_app(test_config=None):
'CELERY': dict(broker_url=str(os.environ.get('AYTA_CELERYBROKER', 'amqp://guest:guest@192.168.66.140:5672/')),
task_ignore_result=True,)
}
# Static configuration settings, do not change
config['OIDC_CALLBACK_ROUTE'] = '/api/oidc/callback' # why is this extension not using it? Maybe I should implement OIDC myself?
app = Flask(__name__)
app.config.from_mapping(config)
@@ -26,6 +28,9 @@ def create_app(test_config=None):
caching.init_app(app)
celery_init_app(app)
if app.config['OIDC_CLIENT_SECRETS']:
oidc.init_app(app)
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)
app.jinja_env.filters['pretty_duration'] = filters.pretty_duration

View File

@@ -8,12 +8,15 @@ bp = Blueprint('channel', __name__, url_prefix='/channel')
@bp.route('')
@caching.cached(unless=caching_unless)
def base():
channels = {}
channels = []
channelIds = get_nosql().list_all_channels()
for channelId in channelIds:
channels[channelId] = get_nosql().get_channel_info(channelId)
channels[channelId]['video_count'] = get_nosql().get_channel_videos_count(channelId)
channel = get_nosql().get_channel_info(channelId)
channel['video_count'] = get_nosql().get_channel_videos_count(channelId)
channels.append(channel)
channels = sorted(channels, key=lambda x: x.get('added_date'), reverse=True)
return render_template('channel/index.html', channels=channels)
@@ -47,4 +50,17 @@ def orphaned():
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
return render_template('channel/orphaned.html', videos=videos)
@bp.route('/recent')
@caching.cached(unless=caching_unless)
def recent():
videoIds = get_nosql().get_recent_videos()
videos = []
for videoId in videoIds:
videos.append(get_nosql().get_video_info(videoId, limited=True))
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
return render_template('channel/recent.html', videos=videos)

View File

@@ -3,10 +3,12 @@ from flask_limiter.util import get_remote_address
from flask_caching import Cache
from flask import Flask, request, session
from celery import Celery, Task
from flask_oidc import OpenIDConnect
def celery_init_app(app: Flask) -> Celery:
class FlaskTask(Task):
def __call__(self, *args: object, **kwargs: object) -> object:
@@ -46,3 +48,4 @@ limiter = Limiter(
caching = Cache()
oidc = OpenIDConnect()

View File

@@ -196,10 +196,21 @@ class Mango:
ids = [result['id'] for result in results]
return tuple(ids)
def get_recent_videos(self, count=99):
""" Returns a TUPLE of the YouTube video IDs most recently added to the info_json collection """
# ObjectIds embed a creation timestamp, so sorting on _id descending approximates insertion order
results = self.info_json.find({}, {'_id': 0, 'id': 1}, sort=[('_id', pymongo.DESCENDING)]).limit(count)
return tuple(result['id'] for result in results)
def get_video_info(self, videoId, limited=False):
if limited:
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'epoch': 1, 'title_slug': 1}
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'uploader': 1, 'epoch': 1, 'title_slug': 1}
else:
projection = {}

View File

@@ -19,7 +19,17 @@
</div>
</div>
<div class="row">
<div class="col s12 m-4 filterable">
<div class="col s6 m-4 filterable">
<a href="{{ url_for('channel.recent') }}">
<div class="card black-text">
<div class="card-content center">
<span class="card-title">Recent videos</span>
<p class="grey-text">The last videos to have been added to the archive</p>
</div>
</div>
</a>
</div>
<div class="col s6 m-4 filterable">
<a href="{{ url_for('channel.orphaned') }}">
<div class="card black-text">
<div class="card-content center">
@@ -31,12 +41,12 @@
</div>
{% for channel in channels %}
<div class="col s6 l4 m-4 filterable">
<a href="{{ url_for('channel.channel', channelId=channel) }}">
<a href="{{ url_for('channel.channel', channelId=channel.get('id')) }}">
<div class="card black-text">
<div class="card-content">
<span class="card-title">{{ channels[channel].get('original_name') }}</span>
<p class="grey-text">{{ channels[channel].get('id') }}</p>
<p><b>Added:</b> {{ channels[channel].get('added_date')|pretty_time }} | <b>Active:</b> {{ channels[channel].get('active') }} | <b>Videos:</b> {{ channels[channel].get('video_count') }}</p>
<span class="card-title">{{ channel.get('original_name') }}</span>
<p class="grey-text">{{ channel.get('id') }}</p>
<p><b>Added:</b> {{ channel.get('added_date')|pretty_time }} | <b>Active:</b> {{ channel.get('active') }} | <b>Videos:</b> {{ channel.get('video_count') }}</p>
</div>
</div>
</a>

View File

@@ -5,7 +5,7 @@
{% block content %}
<div class="row">
<div class="col s12">
<h4>Channels lising page</h4>
<h4>Videos listing page</h4>
</div>
</div>
<div class="divider"></div>
@@ -30,6 +30,7 @@
</a>
<div class="card-content activator">
<span class="card-title">{{ video.get('title') }}</span>
<p><b>{{ video.get('uploader') }}</b></p>
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
</div>
<div class="card-reveal">

View File

@@ -0,0 +1,44 @@
{% extends 'material_base.html' %}
{% block title %}Recent videos{% endblock %}
{% block description %}The last videos to have been added to the archive{% endblock %}
{% block content %}
<div class="row">
<div class="col s12">
<h4>Videos listing page</h4>
</div>
</div>
<div class="divider"></div>
<div class="row">
<div class="col s6 l9">
<h5>Recent videos</h5>
<p>The last 99 videos to have been added to the archive.</p>
</div>
<div class="col s6 l3 m-4 input-field">
<input id="filter_query" type="text">
<label for="filter_query">Filter results</label>
</div>
</div>
<div class="row">
{% for video in videos %}
<div class="col s6 l4 m-4 filterable">
<div class="card medium black-text">
<a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
<div class="card-image">
<img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title_slug') }}.jpg">
</div>
</a>
<div class="card-content activator">
<span class="card-title">{{ video.get('title') }}</span>
<p><b>{{ video.get('uploader') }}</b></p>
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
</div>
<div class="card-reveal">
<span class="card-title truncate">{{ video.get('title') }}</span>
<p style="white-space: pre-wrap;">{{ video.get('description') }}</p>
</div>
</div>
</div>
{% endfor %}
</div>
{% endblock %}