Compare commits
2 Commits
Author | SHA1 | Date
---|---|---
 | 1918a03e05 |
 | ed4f8b03eb |
README.md
@@ -3,27 +3,67 @@

This project will be awesome, if only I invest enough time. This software will replace my
current cronjob yt-dlp archive service.

Partially inspired by [hobune](https://github.com/rebane2001/hobune). While that project is amazing on its own, it's just not scalable.

## The idea

The new setup will either be fully running in Flask, including the task that checks the
YouTube channels every x hours, or Flask will be used as the GUI frontend and a separate
script will do the channel archiving. I have not decided yet.

What currently works is that the GUI frontend calls to a separate database while a cronjob
handles the downloading of new videos from a list of channels.
Having over 250k videos, scaling the current cronjob yt-dlp archive task is just really hard. File types change, things get partially downloaded, and so on.
Partially yt-dlp is to blame, because it's a package that needs to change all the time, and some of those changes are not accounted for.
yt-dlp will still do the downloads, but a Flask frontend will be developed to make all downloaded videos easily indexable.
For it to be quick (unlike hobune) a database has to be implemented. This could also be solved by a static site generator type of software, but that is not my choice.

The whole software package will use PostgreSQL as the data backend and Celery for background tasks.
Current development however is using MongoDB, just because it's easy.
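
If the Celery route wins, the periodic channel check could be a beat task. A minimal sketch; the task name, module path and 6-hour interval are assumptions, not existing repo code:

```python
# Sketch of the Celery variant: a beat task that checks channels every x hours.
# The task name and interval are assumptions, not existing repo code.
from celery import Celery

app = Celery('ayta', broker='amqp://guest:guest@localhost:5672/')

app.conf.beat_schedule = {
    'archive-channels': {
        'task': 'ayta.tasks.archive_channels',
        'schedule': 6 * 60 * 60,  # every 6 hours, in seconds
    },
}

@app.task(name='ayta.tasks.archive_channels')
def archive_channels():
    """Check every tracked channel for new uploads and queue downloads."""
    ...  # would shell out to yt-dlp per channel, like the cronjob does today
```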

## How it works currently (legacy)

In the legacy folder you will find the files that currently make up my archive project. How it works is
that I have a cronjob running every 24 hours which then runs yt-dlp with a config file. In that config
file a channel list contains all the channels that yt-dlp needs to update. If a new video has been
uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles at that time
(rip community captions, will not forget you) and a json file with all the rest of the metadata. Oh,
and also the thumbnail.

This works, but it is very slow and uses lots of "API" calls to YouTube, which will sometimes get
the IP blocked. This is why full channel upload pages are not downloaded anymore; I have limited it to the first 50 videos.
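
For reference, the cron entry plus yt-dlp config described above could look roughly like this sketch; the file names and the cron time are illustrative, not the actual legacy files:

```
# crontab entry (illustrative time of day):
# 0 3 * * * yt-dlp --config-location /etc/ayta/yt-dlp.conf

# yt-dlp.conf
# Channel list, one URL per line
--batch-file channels.txt
# Only check the first 50 uploads per channel
--playlist-end 50
# 720p version of the video
-f bestvideo[height<=720]+bestaudio/best[height<=720]
# Subtitles, metadata json and thumbnail
--write-subs
--write-info-json
--write-thumbnail
# Remember what has already been downloaded
--download-archive archive.txt
-o videos/%(channel_id)s/%(upload_date)s/%(id)s/%(title)s.%(ext)s
```

The output template mirrors the storage structure described further down.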

## Goals

Some goals have been set up which prioritise functionality for the software package.
The starting status is that info.json files of videos are loaded into the MongoDB database, from which Flask
generates pages for channels and videos. But this has major limitations which will not be described right now;
they will be reflected in the goals.

### Stage 1

Tasks which have to be finished before the GUI frontend is usable as a manager and user, in no particular order.
- [x] Have videos and channels listed on a page
- [x] Have a secured admin page where the database can be managed
- [x] Have working video streaming
- [x] CI/CD pipeline for quicker deployment
- [x] Add caching to speed up pages
- [x] Add ratelimiting for expensive pages
- [x] Ability to show cronjob logs to easily troubleshoot

### Stage 2

Extra functionality for further development of features.
- [x] Fix video titles on disk with slugs
- [x] Working search functionality
- [x] Video reporting functionality
- [x] Ability (for external applications) to queue up video IDs for download
- [x] Add WebSub requesting and receiving ability (not fully usable yet without Celery tasks)
- [ ] OIDC or WebAuthn logins instead of static argon2 passwords

### Stage 3

Mainly focused on retiring the cronjob based scripts and moving them to Celery based tasks.
- [ ] Manage videos by IDs instead of on a per-channel basis
- [ ] Download videos from the queue
- [ ] Manage WebSub callbacks

### Stage 4

MongoDB finally has its limitations.
- [ ] Migrate to PostgreSQL

### Stage ...

Since this is the flagship software that I have developed, more features will be added.
It may take some time, since this is just a hobby for me and I'm not a programmer by title.


## Things learned

### Video playlists

@@ -50,26 +90,22 @@ If you swap the channel name to channel id. The folders will never change.

### Storage structure

The following folder structure is pretty nice for using static scripts. The one drawback
is that you can't search for video IDs or titles, because the search takes too long.

```
./videos/{channel_id}/{upload_date}/{video_id}/video_title.mp4
```

For the new system, using blob-like storage will be key. I had the following in mind: it will be an independent
random key and not the YouTube video ID, because I have noticed that multiple real videos can exist under the same ID when
uploaders replace old videos.
This is mainly why we need a new system using a database, mainly for search.

The following structure is easily scalable and usable in an object storage format.
```
-| data
 | - videos
 |   - 128bit_random_id.mp4
 | - subtitles
 |   - same_random_id_EN.srt
 |   - same_random_id_DE.srt
 | - thumbnails
 |   - 128bit_random_id.jpg
./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
```
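
A minimal sketch of how those keys could be generated and mapped onto the layout above; `new_storage_id` and `object_keys` are hypothetical helper names, not repo code:

```python
import secrets

def new_storage_id() -> str:
    """128-bit random ID, hex encoded (hypothetical helper)."""
    return secrets.token_hex(16)  # 16 bytes == 128 bits

def object_keys(storage_id, subtitle_langs=()):
    """Map one archived video onto the blob layout sketched above."""
    keys = {
        'video': f'data/videos/{storage_id}.mp4',
        'thumbnail': f'data/thumbnails/{storage_id}.jpg',
    }
    for lang in subtitle_langs:
        keys[f'subtitle_{lang}'] = f'data/subtitles/{storage_id}_{lang}.srt'
    return keys

# Example: the EN/DE subtitle pair from the tree above
print(object_keys(new_storage_id(), ('EN', 'DE')))
```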

## API things learned

### YouTube push notifications in API form exist

Using the pubsubhubbub service provided by Google, we will implement downloading videos based on uploads.
The API is based on WebSub, which is greatly documented.

The hub will give xml+atom notifications when a video is uploaded by a channel and when a video is deleted.
The goal is to download a video when a notification gets through, and run a full channel sync when a video is deleted.
This will run next to periodic full channel polling, to download videos which the hub has not notified us about.
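
A rough sketch of both halves of that flow: subscribing a channel at Google's hub and answering the verification handshake. The callback route and the `handle_notification` hook are assumptions, not the repo's actual code:

```python
# Sketch: subscribe a channel at Google's WebSub hub and answer the
# verification handshake. Route path and handler hook are assumptions.
import requests
from flask import Flask, request

app = Flask(__name__)
HUB = 'https://pubsubhubbub.appspot.com/subscribe'

def subscribe(channel_id, callback_url):
    topic = f'https://www.youtube.com/xml/feeds/videos.xml?channel_id={channel_id}'
    r = requests.post(HUB, data={'hub.mode': 'subscribe',
                                 'hub.topic': topic,
                                 'hub.callback': callback_url})
    r.raise_for_status()

def handle_notification(xml_body):
    """Hypothetical hook: parse the atom entry and queue a download."""
    print(xml_body[:200])

@app.route('/api/websub/callback', methods=['GET', 'POST'])
def websub_callback():
    if request.method == 'GET':
        # the hub verifies the subscription by asking us to echo its challenge
        return request.args.get('hub.challenge', ''), 200
    handle_notification(request.data)  # POST: xml+atom upload/delete notice
    return '', 204
```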

### Etag is useful

When we call the API for 50 items in a playlist we also get an etag back.
This is a sort of hash of the returned data.
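
The useful part is that the YouTube Data API honours `If-None-Match`: replaying the stored etag returns a 304 with no body when the playlist page is unchanged, so unchanged channels can be detected cheaply. A sketch, assuming a v3 API key:

```python
import requests

API_KEY = '...'  # YouTube Data API v3 key (placeholder)
URL = 'https://www.googleapis.com/youtube/v3/playlistItems'

def fetch_page(playlist_id, etag=None):
    """Fetch up to 50 playlist items, skipping the body if nothing changed."""
    headers = {'If-None-Match': etag} if etag else {}
    r = requests.get(URL, headers=headers, params={
        'part': 'snippet',
        'playlistId': playlist_id,
        'maxResults': 50,
        'key': API_KEY,
    })
    if r.status_code == 304:   # same etag: playlist page is unchanged
        return None, etag
    r.raise_for_status()
    data = r.json()
    return data['items'], data['etag']  # store the etag for the next poll
```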
@@ -1,15 +1,13 @@
def create_app(test_config=None):
    import os, secrets
    from flask import Flask
    from ayta.extensions import limiter, caching, celery_init_app, oidc
    from werkzeug.middleware.proxy_fix import ProxyFix

    from . import filters

    config = {'MONGO_CONNECTION': os.environ.get('AYTA_MONGOCONNECTION', 'mongodb://root:example@192.168.66.140:27017'),
              'S3_CONNECTION': os.environ.get('AYTA_S3CONNECTION', '192.168.66.111:9001'),
              'S3_ACCESSKEY': os.environ.get('AYTA_S3ACCESSKEY', 'lnUiGClFVXVuZbsr'),
              'S3_SECRETKEY': os.environ.get('AYTA_S3SECRETKEY', 'Qz9NG7rpcOWdK2WL'),
              'OIDC_CLIENT_SECRETS': os.environ.get('AYTA_OIDC_PATH', 'client_secrets.json'),
              'CACHE_TYPE': os.environ.get('AYTA_CACHETYPE', 'SimpleCache'),
              'CACHE_DEFAULT_TIMEOUT': int(os.environ.get('AYTA_CACHETIMEOUT', 6)),
              'SECRET_KEY': os.environ.get('AYTA_SECRETKEY', secrets.token_hex(32)),
@@ -18,6 +16,10 @@ def create_app(test_config=None):
              'CELERY': dict(broker_url=str(os.environ.get('AYTA_CELERYBROKER', 'amqp://guest:guest@192.168.66.140:5672/')),
                             task_ignore_result=True,)
              }

    # Static configuration settings, do not change
    config['OIDC_CALLBACK_ROUTE'] = '/api/oidc/callback'  # why is this extension not using it? maybe I should implement OIDC myself?

    app = Flask(__name__)
    app.config.from_mapping(config)
@@ -26,6 +28,9 @@ def create_app(test_config=None):
    caching.init_app(app)
    celery_init_app(app)

    if app.config['OIDC_CLIENT_SECRETS']:
        oidc.init_app(app)

    app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)

    app.jinja_env.filters['pretty_duration'] = filters.pretty_duration
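
For context, an app factory like this is usually driven from a small WSGI entry point. A hypothetical `wsgi.py`, assuming `create_app` is exported from the `ayta` package root:

```python
# Hypothetical wsgi.py; assumes create_app is importable from the ayta package
from ayta import create_app

app = create_app()

if __name__ == '__main__':
    app.run(debug=True)  # production would point gunicorn/uwsgi at wsgi:app
```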
@@ -8,12 +8,15 @@ bp = Blueprint('channel', __name__, url_prefix='/channel')
@bp.route('')
@caching.cached(unless=caching_unless)
def base():
    channels = []
    channelIds = get_nosql().list_all_channels()

    for channelId in channelIds:
        channel = get_nosql().get_channel_info(channelId)
        channel['video_count'] = get_nosql().get_channel_videos_count(channelId)
        channels.append(channel)

    channels = sorted(channels, key=lambda x: x.get('added_date'), reverse=True)

    return render_template('channel/index.html', channels=channels)
@@ -47,4 +50,17 @@ def orphaned():

    videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)

    return render_template('channel/orphaned.html', videos=videos)


@bp.route('/recent')
@caching.cached(unless=caching_unless)
def recent():
    videoIds = get_nosql().get_recent_videos()

    videos = []
    for videoId in videoIds:
        videos.append(get_nosql().get_video_info(videoId, limited=True))

    videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)

    return render_template('channel/recent.html', videos=videos)
@@ -3,10 +3,12 @@ from flask_limiter.util import get_remote_address

from flask_caching import Cache

from celery import Celery, Task

from flask_oidc import OpenIDConnect

from flask import Flask, request, session

def celery_init_app(app: Flask) -> Celery:
    class FlaskTask(Task):
        def __call__(self, *args: object, **kwargs: object) -> object:
@@ -46,3 +48,4 @@ limiter = Limiter(

caching = Cache()

oidc = OpenIDConnect()
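
With `oidc` initialised in `create_app`, individual views could then be gated by flask-oidc's login decorator. A sketch, with the admin blueprint name assumed:

```python
# Sketch: gating a view behind the OIDC login (blueprint name assumed)
from flask import Blueprint
from ayta.extensions import oidc

bp = Blueprint('admin', __name__, url_prefix='/admin')

@bp.route('')
@oidc.require_login  # redirects to the identity provider when not logged in
def dashboard():
    return 'admin dashboard'
```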
@@ -196,10 +196,21 @@ class Mango:
        ids = [result['id'] for result in results]

        return tuple(ids)

    def get_recent_videos(self, count=99):
        """ Returns a tuple of the YouTube video IDs most recently added to the info_json collection """
        result = self.info_json.find({}, {'_id': 0, 'id': 1}, sort=[('_id', pymongo.DESCENDING)]).limit(count)

        ids = []
        for doc in result:
            ids.append(doc['id'])

        return tuple(ids)

    def get_video_info(self, videoId, limited=False):
        if limited:
            projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'uploader': 1, 'epoch': 1, 'title_slug': 1}
        else:
            projection = {}
@@ -19,7 +19,17 @@
      </div>
    </div>
    <div class="row">
      <div class="col s6 m-4 filterable">
        <a href="{{ url_for('channel.recent') }}">
          <div class="card black-text">
            <div class="card-content center">
              <span class="card-title">Recent videos</span>
              <p class="grey-text">The last videos to have been added to the archive</p>
            </div>
          </div>
        </a>
      </div>
      <div class="col s6 m-4 filterable">
        <a href="{{ url_for('channel.orphaned') }}">
          <div class="card black-text">
            <div class="card-content center">
@@ -31,12 +41,12 @@
    </div>
    {% for channel in channels %}
    <div class="col s6 l4 m-4 filterable">
      <a href="{{ url_for('channel.channel', channelId=channel.get('id')) }}">
        <div class="card black-text">
          <div class="card-content">
            <span class="card-title">{{ channel.get('original_name') }}</span>
            <p class="grey-text">{{ channel.get('id') }}</p>
            <p><b>Added:</b> {{ channel.get('added_date')|pretty_time }} | <b>Active:</b> {{ channel.get('active') }} | <b>Videos:</b> {{ channel.get('video_count') }}</p>
          </div>
        </div>
      </a>
@@ -5,7 +5,7 @@
{% block content %}
<div class="row">
  <div class="col s12">
    <h4>Videos listing page</h4>
  </div>
</div>
<div class="divider"></div>
@@ -30,6 +30,7 @@
      </a>
      <div class="card-content activator">
        <span class="card-title">{{ video.get('title') }}</span>
        <p><b>{{ video.get('uploader') }}</b></p>
        <p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
      </div>
      <div class="card-reveal">

ayta/templates/channel/recent.html (new file, 44 lines)

@@ -0,0 +1,44 @@
{% extends 'material_base.html' %}
{% block title %}Recent videos{% endblock %}
{% block description %}The last videos to have been added to the archive{% endblock %}

{% block content %}
<div class="row">
  <div class="col s12">
    <h4>Videos listing page</h4>
  </div>
</div>
<div class="divider"></div>
<div class="row">
  <div class="col s6 l9">
    <h5>Recent videos</h5>
    <p>The last 99 videos to have been added to the archive.</p>
  </div>
  <div class="col s6 l3 m-4 input-field">
    <input id="filter_query" type="text">
    <label for="filter_query">Filter results</label>
  </div>
</div>
<div class="row">
  {% for video in videos %}
  <div class="col s6 l4 m-4 filterable">
    <div class="card medium black-text">
      <a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
        <div class="card-image">
          <img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title_slug') }}.jpg">
        </div>
      </a>
      <div class="card-content activator">
        <span class="card-title">{{ video.get('title') }}</span>
        <p><b>{{ video.get('uploader') }}</b></p>
        <p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
      </div>
      <div class="card-reveal">
        <span class="card-title truncate">{{ video.get('title') }}</span>
        <p style="white-space: pre-wrap;">{{ video.get('description') }}</p>
      </div>
    </div>
  </div>
  {% endfor %}
</div>
{% endblock %}