Compare commits

...

2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Ventilaar | 1918a03e05 | add recently added view | 2024-04-02 18:42:56 +02:00 |
| Ventilaar | ed4f8b03eb | Update readme to reflect current status of project | 2024-03-30 22:46:03 +01:00 |
8 changed files with 167 additions and 41 deletions

View File

@@ -3,27 +3,67 @@
This project will be awesome, but only if I invest enough time. This software will replace my
current cronjob yt-dlp archive service.
Partially inspired by [hobune](https://github.com/rebane2001/hobune). While that project is amazing on its own, it's just not scalable.
## The idea
The new setup will either run fully in Flask, including the task that checks the
YouTube channels every x hours, or Flask will be used as the GUI frontend and a separate
script will do the channel archiving. I have not decided yet.
What currently works is that the GUI frontend queries a separate database while a cronjob
handles the downloading of new videos from a list of channels.
With over 250k videos, scaling the current cronjob yt-dlp archive task is just really hard. File types change, things get partially downloaded and such.
yt-dlp is partially to blame, because it's a package that needs to change all the time, and some of those changes are not accounted for.
yt-dlp will still do the downloads, but a Flask frontend will be developed to make all downloaded videos easily indexable.
For it to be quick (unlike hobune) a database has to be implemented. This could be solved by a static-site-generator type of software, but that is not my choice.
The whole software package will use PostgreSQL as the data backend and Celery for background tasks.
Current development, however, uses MongoDB just because it's easy.
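For the Celery variant, the periodic channel check could look roughly like the sketch below. This is only an illustration assuming a RabbitMQ broker on localhost; the task body and schedule are hypothetical, not the project's actual code.
```python
from celery import Celery
from celery.schedules import crontab

app = Celery('ayta', broker='amqp://guest:guest@localhost:5672/')

@app.task(ignore_result=True)
def check_channels():
    """Enumerate archived channels and queue any new video ids."""
    ...  # placeholder: today this logic still lives in the cronjob

# run the check every x hours (here: 6) via celery beat
app.conf.beat_schedule = {
    'check-channels': {'task': check_channels.name,
                       'schedule': crontab(minute=0, hour='*/6')},
}
```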
## How it works currently (legacy)
In the legacy folder you will find files that are currently in my archive project. How it works is
that I have a cronjob running every 24 hours which then runs yt-dlp with a config file. In that config
file a channel list contains all the channels that yt-dlp needs to update. If a new video has been
uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles at that time
(rip community captions, will not forget you) and a JSON file with all the rest of the metadata. Oh,
and also the thumbnail.
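As an illustration only, the cron entry plus yt-dlp config could look something like this; the paths, schedule and filenames are hypothetical, the options simply mirror the behaviour described above:
```
# /etc/cron.d/yt-archive -- hypothetical schedule, once a day
0 3 * * * archiver yt-dlp --config-location /opt/archive/yt-dlp.conf

# /opt/archive/yt-dlp.conf -- options in the spirit described above
# prefer a 720p version of the video
-f "bv*[height<=720]+ba/b[height<=720]"
# the channel list, plus an archive file so finished ids are skipped
--batch-file /opt/archive/channels.txt
--download-archive /opt/archive/seen.txt
# only look at the first 50 uploads per channel (see below)
--playlist-end 50
# metadata sidecar, all subtitles available at that time, and the thumbnail
--write-info-json
--write-subs
--sub-langs all
--write-thumbnail
-o "/data/videos/%(channel_id)s/%(upload_date)s/%(id)s/%(title)s.%(ext)s"
```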
This works, but it is very slow and uses lots of "API" calls to YouTube, which will sometimes get
the IP blocked. This is why full channel upload pages are no longer downloaded; I have limited it to the first 50 videos per channel.
## Goals
Some goals have been set up which prioritise functionality for the software package.
The starting point is that info.json files of videos are loaded into the MongoDB database, from which Flask
generates the pages for channels and videos. This has major limitations which will not be described right now
but are reflected in the goals.
### Stage 1
Tasks which have to be finished before the GUI frontend is usable as a manager and as a user, in no particular order.
- [x] Have videos and channels listed on a page
- [x] Have a secured admin page where the database can be managed
- [x] Have working video streaming
- [x] CI/CD pipeline for quicker deployment
- [x] Add caching to speed up pages
- [x] Add ratelimiting for expensive pages
- [x] Ability to show cronjob logs to easily troubleshoot
### Stage 2
Extra functionality for further development of features.
- [x] Fix video titles on disk with slugs
- [x] Working search functionality
- [x] Video reporting functionality
- [x] Ability (for external applications) to queue up video ids for download
- [x] Add websub requesting and receiving ability. (not fully usable yet without celery tasks)
- [ ] OIDC or Webauthn logins instead of static argon2 passwords
### Stage 3
Mainly focused on retiring the cronjob based scripts and moving them to Celery based tasks.
- [ ] Manage videos by ID instead of on a per-channel basis
- [ ] Download videos from the queue
- [ ] Manage websub callbacks
### Stage 4
MongoDB finally has its limitations.
- [ ] Migrate to PostgreSQL
### Stage ...
Since this is my flagship software, more features will be added.
It may take some time since this is just a hobby for me, and I'm not a programmer by title.
## Things learned
### Video playlists
@@ -50,26 +90,22 @@ If you swap the channel name for the channel id, the folders will never change.
### Storage structure
The following folder structure is pretty nice for using static scripts. The one drawback
is that you can't search for video IDs or titles, because the search takes too long.
This is mainly why we need a new system using a database.
```
./videos/{channel_id}/{upload_date}/{video_id}/video_title.mp4
```
For the new system, using a blob-like storage will be key. I had the following in mind: it will be an independent
random key and not the YouTube video ID, because I have noticed that multiple real videos exist under the same key when
uploaders replace old videos.
This is mainly why we need a new system using a database, mainly for search.
The following structure is easily scalable and usable in an object storage format.
```
-| data
| - videos
| - 128bit_random_id.mp4
| - subtitles
| - same_random_id_EN.srt
| - same_random_id_DE.srt
| - thumbnails
| - 128bit_random_id.jpg
./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
```
## API things learned
### YouTube push notifications in API form exist
Using the PubSubHubbub service provided by Google, we will implement downloading videos based on uploads.
The API is based on WebSub, which is well documented.
The hub will send Atom XML notifications when a channel uploads a video and when a video is deleted.
The goal is to download a video when a notification gets through, and to run a full channel sync when a video is deleted.
This will run alongside periodic full channel polling, to download videos which the hub has not notified us about.
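A minimal sketch of the subscription request, assuming the requests library; the callback URL is hypothetical and has to match a route the app actually exposes:
```python
import requests

HUB = 'https://pubsubhubbub.appspot.com/subscribe'
TOPIC = 'https://www.youtube.com/xml/feeds/videos.xml?channel_id={}'

def subscribe(channel_id, callback_url):
    """Ask Google's WebSub hub to send upload/delete notifications."""
    r = requests.post(HUB, data={
        'hub.mode': 'subscribe',
        'hub.topic': TOPIC.format(channel_id),
        'hub.callback': callback_url,  # e.g. https://example.org/api/websub
    })
    r.raise_for_status()  # the hub replies 202, then verifies the callback

# The hub verifies by GETting the callback with a hub.challenge parameter,
# which must be echoed back; notifications then arrive as Atom XML POSTs.
```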
### Etag is useful
When we call the API for 50 items in a playlist, we also get an ETag back.
This is a sort of hash of the returned data.
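A sketch of how that can be used, with placeholder playlist id and API key: sending the stored ETag back as If-None-Match makes the API answer 304 Not Modified when the data is unchanged, so unchanged playlists can be skipped.
```python
import requests

URL = 'https://www.googleapis.com/youtube/v3/playlistItems'
params = {'part': 'snippet', 'playlistId': 'UU_PLACEHOLDER',
          'maxResults': 50, 'key': 'API_KEY'}

first = requests.get(URL, params=params).json()
etag = first['etag']  # the hash-like token described above

again = requests.get(URL, params=params, headers={'If-None-Match': etag})
if again.status_code == 304:
    print('playlist unchanged, nothing to re-process')
```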

View File

@@ -1,15 +1,13 @@
def create_app(test_config=None):
import os, secrets
from flask import Flask
from ayta.extensions import limiter, caching, celery_init_app
from ayta.extensions import limiter, caching, celery_init_app, oidc
from werkzeug.middleware.proxy_fix import ProxyFix
from . import filters
config = {'MONGO_CONNECTION': os.environ.get('AYTA_MONGOCONNECTION', 'mongodb://root:example@192.168.66.140:27017'),
'S3_CONNECTION': os.environ.get('AYTA_S3CONNECTION', '192.168.66.111:9001'),
'S3_ACCESSKEY': os.environ.get('AYTA_S3ACCESSKEY', 'lnUiGClFVXVuZbsr'),
'S3_SECRETKEY': os.environ.get('AYTA_S3SECRETKEY', 'Qz9NG7rpcOWdK2WL'),
'OIDC_CLIENT_SECRETS': os.environ.get('AYTA_OIDC_PATH', 'client_secrets.json'),
'CACHE_TYPE': os.environ.get('AYTA_CACHETYPE', 'SimpleCache'),
'CACHE_DEFAULT_TIMEOUT': int(os.environ.get('AYTA_CACHETIMEOUT', 6)),
'SECRET_KEY': os.environ.get('AYTA_SECRETKEY', secrets.token_hex(32)),
@@ -18,6 +16,10 @@ def create_app(test_config=None):
'CELERY': dict(broker_url=str(os.environ.get('AYTA_CELERYBROKER', 'amqp://guest:guest@192.168.66.140:5672/')),
task_ignore_result=True,)
}
# Static configuration settings, do not change
config['OIDC_CALLBACK_ROUTE'] = '/api/oidc/callback' # why is this extension not using it? Maybe I should implement OIDC myself?
app = Flask(__name__)
app.config.from_mapping(config)
@@ -26,6 +28,9 @@ def create_app(test_config=None):
caching.init_app(app)
celery_init_app(app)
if app.config['OIDC_CLIENT_SECRETS']:
oidc.init_app(app)
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)
app.jinja_env.filters['pretty_duration'] = filters.pretty_duration

View File

@@ -8,12 +8,15 @@ bp = Blueprint('channel', __name__, url_prefix='/channel')
@bp.route('')
@caching.cached(unless=caching_unless)
def base():
channels = {}
channels = []
channelIds = get_nosql().list_all_channels()
for channelId in channelIds:
channels[channelId] = get_nosql().get_channel_info(channelId)
channels[channelId]['video_count'] = get_nosql().get_channel_videos_count(channelId)
channel = get_nosql().get_channel_info(channelId)
channel['video_count'] = get_nosql().get_channel_videos_count(channelId)
channels.append(channel)
channels = sorted(channels, key=lambda x: x.get('added_date'), reverse=True)
return render_template('channel/index.html', channels=channels)
@@ -47,4 +50,17 @@ def orphaned():
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
return render_template('channel/orphaned.html', videos=videos)
@bp.route('/recent')
@caching.cached(unless=caching_unless)
def recent():
videoIds = get_nosql().get_recent_videos()
videos = []
for videoId in videoIds:
videos.append(get_nosql().get_video_info(videoId, limited=True))
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
return render_template('channel/recent.html', videos=videos)

View File

@@ -3,10 +3,12 @@ from flask_limiter.util import get_remote_address
from flask_caching import Cache
from flask import Flask, request, session
from celery import Celery, Task
from flask_oidc import OpenIDConnect
def celery_init_app(app: Flask) -> Celery:
class FlaskTask(Task):
def __call__(self, *args: object, **kwargs: object) -> object:
@@ -46,3 +48,4 @@ limiter = Limiter(
caching = Cache()
oidc = OpenIDConnect()

View File

@@ -196,10 +196,21 @@ class Mango:
ids = [result['id'] for result in results]
return tuple(ids)
def get_recent_videos(self, count=99):
""" Returns a TUPLE of the YouTube video IDs most recently added to the info_json collection """
# ObjectIds embed a creation timestamp, so sorting on _id descending approximates insertion order
results = self.info_json.find({}, {'_id': 0, 'id': 1}, sort=[('_id', pymongo.DESCENDING)]).limit(count)
return tuple(result['id'] for result in results)
def get_video_info(self, videoId, limited=False):
if limited:
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'epoch': 1, 'title_slug': 1}
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'uploader': 1, 'epoch': 1, 'title_slug': 1}
else:
projection = {}

View File

@@ -19,7 +19,17 @@
</div>
</div>
<div class="row">
<div class="col s12 m-4 filterable">
<div class="col s6 m-4 filterable">
<a href="{{ url_for('channel.recent') }}">
<div class="card black-text">
<div class="card-content center">
<span class="card-title">Recent videos</span>
<p class="grey-text">The last videos to have been added to the archive</p>
</div>
</div>
</a>
</div>
<div class="col s6 m-4 filterable">
<a href="{{ url_for('channel.orphaned') }}">
<div class="card black-text">
<div class="card-content center">
@@ -31,12 +41,12 @@
</div>
{% for channel in channels %}
<div class="col s6 l4 m-4 filterable">
<a href="{{ url_for('channel.channel', channelId=channel) }}">
<a href="{{ url_for('channel.channel', channelId=channel.get('id')) }}">
<div class="card black-text">
<div class="card-content">
<span class="card-title">{{ channels[channel].get('original_name') }}</span>
<p class="grey-text">{{ channels[channel].get('id') }}</p>
<p><b>Added:</b> {{ channels[channel].get('added_date')|pretty_time }} | <b>Active:</b> {{ channels[channel].get('active') }} | <b>Videos:</b> {{ channels[channel].get('video_count') }}</p>
<span class="card-title">{{ channel.get('original_name') }}</span>
<p class="grey-text">{{ channel.get('id') }}</p>
<p><b>Added:</b> {{ channel.get('added_date')|pretty_time }} | <b>Active:</b> {{ channel.get('active') }} | <b>Videos:</b> {{ channel.get('video_count') }}</p>
</div>
</div>
</a>

View File

@@ -5,7 +5,7 @@
{% block content %}
<div class="row">
<div class="col s12">
<h4>Channels lising page</h4>
<h4>Videos listing page</h4>
</div>
</div>
<div class="divider"></div>
@@ -30,6 +30,7 @@
</a>
<div class="card-content activator">
<span class="card-title">{{ video.get('title') }}</span>
<p><b>{{ video.get('uploader') }}</b></p>
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
</div>
<div class="card-reveal">

View File

@@ -0,0 +1,44 @@
{% extends 'material_base.html' %}
{% block title %}Recent videos{% endblock %}
{% block description %}The last videos to have been added to the archive{% endblock %}
{% block content %}
<div class="row">
<div class="col s12">
<h4>Videos listing page</h4>
</div>
</div>
<div class="divider"></div>
<div class="row">
<div class="col s6 l9">
<h5>Recent videos</h5>
<p>The last 99 videos to have been added to the archive.</p>
</div>
<div class="col s6 l3 m-4 input-field">
<input id="filter_query" type="text">
<label for="filter_query">Filter results</label>
</div>
</div>
<div class="row">
{% for video in videos %}
<div class="col s6 l4 m-4 filterable">
<div class="card medium black-text">
<a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
<div class="card-image">
<img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title_slug') }}.jpg">
</div>
</a>
<div class="card-content activator">
<span class="card-title">{{ video.get('title') }}</span>
<p><b>{{ video.get('uploader') }}</b></p>
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
</div>
<div class="card-reveal">
<span class="card-title truncate">{{ video.get('title') }}</span>
<p style="white-space: pre-wrap;">{{ video.get('description') }}</p>
</div>
</div>
</div>
{% endfor %}
</div>
{% endblock %}