Compare commits
4 Commits
SHA1 |
---|
1be9729720 |
1918a03e05 |
ed4f8b03eb |
7266a437d1 |
84
README.md
@@ -3,27 +3,67 @@
|
|||||||
This project will be awesome, only if I invest enough time. This software will replace my
|
This project will be awesome, only if I invest enough time. This software will replace my
|
||||||
current cronjob yt-dlp archive service.
|
current cronjob yt-dlp archive service.
|
||||||
|
|
||||||
Partially inspired by [hobune](https://github.com/rebane2001/hobune). While that project is amazing by it's own, it's just not scaleable.
|
Partially inspired by [hobune](https://github.com/rebane2001/hobune). While that project is amazingby it's own, it's just not scaleable.
|
||||||
|
|
||||||
## The idea
|
## The idea
|
||||||
The new setup will either be fully running in flask, including the task that checks the
|
Having over 250k videos, scaling the current cronjob yt-dlp archive task is just really hard. Filetypes change, things get partially downloaded and such.
|
||||||
youtube channels every x hours. Or Flask will be used as the gui frontend, and a seperate
|
Partially yt-dlp is to blame because it's a package that needs to change all the time. But with this some changes are not accounted for.
|
||||||
script will do the channel archiving. I have not desided yet.
|
yt-dlp will still do the downloads. But a flask frontend will be developed to make all downloaded videos easily indexable.
|
||||||
|
For it to be quick (unlike hobune) a database has to be implemented. This could get solved by a static site generator type of software, but that is not my choice.
|
||||||
What currently works is that the gui frontend calls to a seperate database while a cronjob
|
|
||||||
handles the downloading of new videos from a list of channels.
|
|
||||||
|
|
||||||
|
The whole software package will use postgresql as a data backend and celery as background tasks.
|
||||||
|
Currently development however is using mongodb just because it's easy.
|
||||||
|
|
||||||
## How it works currently(legacy)
|
## How it works currently(legacy)
|
||||||
In the legacy folder you will find files that are currently in my archive project. How it works is
|
In the legacy folder you will find files that are currently in my archive project. How it works is
|
||||||
that I have a cronjob running every 6 hours what then runs yt-dlp with a config file. In that config
|
that I have a cronjob running every 24 hours what then runs yt-dlp with a config file. In that config
|
||||||
file a channel list contains all the channels that yt-dlp needs to update. If a new video has been
|
file a channel list contains all the channels that yt-dlp needs to update. If a new video has been
|
||||||
uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles at that time
|
uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles at that time
|
||||||
(rip community captions, will not forget you) and a json file with all the rest of the metadata. Oh
|
(rip community captions, will not forget you) and a json file with all the rest of the metadata. Oh
|
||||||
and also the thumbnail.
|
and also the thumbnail.
|
||||||
|
|
||||||
This works. But is very slow and uses lots of "API" calls to youtube, which will sometimes will get
|
This works. But is very slow and uses lots of "API" calls to youtube, which will sometimes will get
|
||||||
the IP blocked. This needs to be overhauled.
|
the IP blocked. This is why full channel upload pages are not downloaded anymore, I have limited to first 50 videos.
|
||||||
|
|
||||||
|
## Goals
|
||||||
|
Some goals have been set up which will prioritise functionality for the software package.
|
||||||
|
The starting status is that info.json files of videos are loaded into the mongodb database on which flask
|
||||||
|
will generate a page for channels and videos to load. But this has major limitations which will not be described right now
|
||||||
|
but will be reflected in the goals.
|
||||||
|
|
||||||
|
### Stage 1
|
||||||
|
Tasks which have to be finished before the GUI frontend is usable as a manager and user in no perticular order.
|
||||||
|
- [x] Have videos and channels listed on a page
|
||||||
|
- [x] Have a secured admin page where the database can be managed
|
||||||
|
- [x] Have working video streaming
|
||||||
|
- [x] CI/CD pipeline for quicker deployment
|
||||||
|
- [x] Add caching to speed up pages
|
||||||
|
- [x] Add ratelimiting for expensive pages
|
||||||
|
- [x] Ability to show cronjob logs to easily troubleshoot
|
||||||
|
|
||||||
|
### Stage 2
|
||||||
|
Extra functionality for further development of features.
|
||||||
|
- [x] Fix video titles on disk with slugs
|
||||||
|
- [x] Working search functionality
|
||||||
|
- [x] Video reporting functionality
|
||||||
|
- [x] Ability (for external applications) to queue up video ids for download
|
||||||
|
- [x] Add websub requesting and receiving ability. (not fully usable yet without celery tasks)
|
||||||
|
- [] OIDC or Webauthn logins instead of static argon2 passwords
|
||||||
|
|
||||||
|
### Stage 3
|
||||||
|
Mainly focused on retiring the cronjob based scripts and moving it to celery based tasks
|
||||||
|
- [] manage videos by ID's instead of per channel basis
|
||||||
|
- [] download videos from queue
|
||||||
|
- [] Manage websub callbacks
|
||||||
|
|
||||||
|
### Stage 4
|
||||||
|
Mongodb finally has it's limitations.
|
||||||
|
- [] Migrate to postgresql
|
||||||
|
|
||||||
|
### Stage ...
|
||||||
|
Since this is my flagship software which I have developed more features will be added.
|
||||||
|
It may take some time since this is just a hobby for me. And I'm not a programmer by title.
|
||||||
|
|
||||||
|
|
||||||
## Things learned
|
## Things learned
|
||||||
### Video playlists
|
### Video playlists
|
||||||
@@ -50,26 +90,22 @@ If you swap the channel name to channel id. The folders will never change.
|
|||||||
### Storage structure
|
### Storage structure
|
||||||
The following folder structure is pretty nice for using static scripts. The one drawback
|
The following folder structure is pretty nice for using static scripts. The one drawback
|
||||||
is that you can't search for video id's or titles. Because the search takes too long.
|
is that you can't search for video id's or titles. Because the search takes too long.
|
||||||
This is mainly why we need a new system using a database.
|
This is mainly why we need a new system using a database mainly for search.
|
||||||
```
|
|
||||||
./videos/{channel_id}/{upload_date}/{video_id}/video_title.mp4
|
|
||||||
```
|
|
||||||
For the new system using a blob like storage will be key. I had the following in mind. It will be an independant
|
|
||||||
random key and not the YouTube video ID because I have notices that multiple real videos exist under the same key by
|
|
||||||
uploaders who replace old videos.
|
|
||||||
|
|
||||||
|
The following structure is easily scaleable and usable in a object storage format.
|
||||||
```
|
```
|
||||||
-| data
|
./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
|
||||||
| - videos
|
|
||||||
| - 128bit_random_id.mp4
|
|
||||||
| - subtitles
|
|
||||||
| - same_random_id_EN.srt
|
|
||||||
| - same_random_id_DE.srt
|
|
||||||
| - thumbnails
|
|
||||||
| - 128bit_random_id.jpg
|
|
||||||
```
|
```
|
||||||
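Review note: the new layout `./videos/{channel_id}/{video_id}/video-title-slug-format.info.json` can be sketched as a small path builder. This is only an illustration: `slugify` and `info_json_path` are hypothetical names, and the slug rules shown here (lowercase ASCII, dashes between words) are an assumption, not necessarily what the project uses.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def slugify(title: str) -> str:
    """Hypothetical slug helper; the project's real slug rules may differ."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    # Fall back to a fixed name when nothing usable is left.
    return slug or "untitled"


def info_json_path(channel_id: str, video_id: str, title: str) -> PurePosixPath:
    """Build videos/{channel_id}/{video_id}/{slug}.info.json as a relative path."""
    return PurePosixPath("videos") / channel_id / video_id / f"{slugify(title)}.info.json"
```

Because the channel ID and video ID form the directory part, a renamed channel or retitled video only affects the file name, and the same key layout maps directly onto object storage prefixes.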
|
|
||||||
## API things learned
|
## API things learned
|
||||||
|
### YouTube push notifications in API form exist
|
||||||
|
Using the pubsubhubbub service provided by Google we will implement downloading videos based on uploads.
|
||||||
|
The API is based on WebSub which is greatly documented.
|
||||||
|
|
||||||
|
The hub will give xml+atom notifications when a video is uploaded by a channel and when a video is deleted.
|
||||||
|
The goal is to download a video when a notification gets trough, and run a full channel sync when a video is deleted.
|
||||||
|
This will be next to periodic full channel polling to download videos which the hub has not notified us about.
|
||||||
|
|
||||||
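Review note: the notification handling described above (download on upload, full channel sync on delete) could look roughly like the sketch below. The namespaces and element names (`yt:videoId`, `yt:channelId`, `at:deleted-entry`) are my understanding of the Atom payloads the hub sends and should be checked against real notifications; `parse_notification` and the returned action names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed to be used in the hub's Atom notifications.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
    "at": "http://purl.org/atompub/tombstones/1.0",
}

# Hand-written sample of an upload notification, for illustration only.
UPLOAD_SAMPLE = """\
<feed xmlns:yt="http://www.youtube.com/xml/schemas/2015"
      xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <yt:videoId>VIDEO_ID</yt:videoId>
    <yt:channelId>CHANNEL_ID</yt:channelId>
  </entry>
</feed>
"""


def parse_notification(body: str) -> dict:
    """Map a notification body to an action for the archive."""
    root = ET.fromstring(body)

    # Upload (or update): the feed carries a regular Atom entry.
    entry = root.find("atom:entry", NS)
    if entry is not None:
        return {
            "action": "download",
            "video_id": entry.findtext("yt:videoId", namespaces=NS),
            "channel_id": entry.findtext("yt:channelId", namespaces=NS),
        }

    # Deletion: a tombstone whose ref is assumed to look like "yt:video:VIDEO_ID".
    deleted = root.find("at:deleted-entry", NS)
    if deleted is not None:
        video_id = deleted.get("ref", "").rsplit(":", 1)[-1]
        # The channel is looked up elsewhere (e.g. in the database) in this sketch.
        return {"action": "sync_channel", "video_id": video_id, "channel_id": None}

    return {"action": "ignore", "video_id": None, "channel_id": None}
```

Since the callback endpoint receives data from the network, a hardened XML parser is advisable in a real deployment; the standard library parser is used here only to keep the sketch self-contained.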
### Etag is useful
|
### Etag is useful
|
||||||
When we will call the api for 50 items in a playlist we also get an etag back.
|
When we will call the api for 50 items in a playlist we also get an etag back.
|
||||||
This is a sort of hash of the returned data.
|
This is a sort of hash of the returned data.
|
||||||
|
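Review note: the etag idea from the README can be sketched as a comparison against the etag stored from the previous poll. This assumes the API response is a dict with `etag` and `items` keys; `sync_playlist_page` and `fetch_page` are hypothetical names, and the actual HTTP call is left to the caller.

```python
from typing import Callable, Optional, Tuple


def sync_playlist_page(
    fetch_page: Callable[[], dict],
    known_etag: Optional[str],
) -> Tuple[Optional[str], Optional[list]]:
    """Return (etag, items); items is None when the page has not changed."""
    page = fetch_page()
    etag = page.get("etag")

    # Same etag as last time: the first 50 items are unchanged, skip processing.
    if etag is not None and etag == known_etag:
        return etag, None

    # New or missing etag: treat the page as changed and hand back its items.
    return etag, page.get("items", [])
```

The caller stores the returned etag per channel and passes it in on the next poll, so unchanged channels cost one request and no further processing.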
@@ -1,15 +1,13 @@
|
|||||||
def create_app(test_config=None):
|
def create_app(test_config=None):
|
||||||
import os, secrets
|
import os, secrets
|
||||||
from flask import Flask
|
from flask import Flask
|
||||||
from ayta.extensions import limiter, caching, celery_init_app
|
from ayta.extensions import limiter, caching, celery_init_app, oidc
|
||||||
from werkzeug.middleware.proxy_fix import ProxyFix
|
from werkzeug.middleware.proxy_fix import ProxyFix
|
||||||
|
|
||||||
from . import filters
|
from . import filters
|
||||||
|
|
||||||
config = {'MONGO_CONNECTION': os.environ.get('AYTA_MONGOCONNECTION', 'mongodb://root:example@192.168.66.140:27017'),
|
config = {'MONGO_CONNECTION': os.environ.get('AYTA_MONGOCONNECTION', 'mongodb://root:example@192.168.66.140:27017'),
|
||||||
'S3_CONNECTION': os.environ.get('AYTA_S3CONNECTION', '192.168.66.111:9001'),
|
'OIDC_CLIENT_SECRETS': os.environ.get('AYTA_OIDC_PATH', None),
|
||||||
'S3_ACCESSKEY': os.environ.get('AYTA_S3ACCESSKEY', 'lnUiGClFVXVuZbsr'),
|
|
||||||
'S3_SECRETKEY': os.environ.get('AYTA_S3SECRETKEY', 'Qz9NG7rpcOWdK2WL'),
|
|
||||||
'CACHE_TYPE': os.environ.get('AYTA_CACHETYPE', 'SimpleCache'),
|
'CACHE_TYPE': os.environ.get('AYTA_CACHETYPE', 'SimpleCache'),
|
||||||
'CACHE_DEFAULT_TIMEOUT': int(os.environ.get('AYTA_CACHETIMEOUT', 6)),
|
'CACHE_DEFAULT_TIMEOUT': int(os.environ.get('AYTA_CACHETIMEOUT', 6)),
|
||||||
'SECRET_KEY': os.environ.get('AYTA_SECRETKEY', secrets.token_hex(32)),
|
'SECRET_KEY': os.environ.get('AYTA_SECRETKEY', secrets.token_hex(32)),
|
||||||
@@ -18,6 +16,10 @@ def create_app(test_config=None):
|
|||||||
'CELERY': dict(broker_url=str(os.environ.get('AYTA_CELERYBROKER', 'amqp://guest:guest@192.168.66.140:5672/')),
|
'CELERY': dict(broker_url=str(os.environ.get('AYTA_CELERYBROKER', 'amqp://guest:guest@192.168.66.140:5672/')),
|
||||||
task_ignore_result=True,)
|
task_ignore_result=True,)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Static configuration settings, do not change
|
||||||
|
|
||||||
|
config['OIDC_CALLBACK_ROUTE'] = '/api/oidc/callback' # why is this excension not using it? maybe i should implement oidc by myself?
|
||||||
|
|
||||||
app = Flask(__name__)
|
app = Flask(__name__)
|
||||||
app.config.from_mapping(config)
|
app.config.from_mapping(config)
|
||||||
@@ -26,6 +28,9 @@ def create_app(test_config=None):
|
|||||||
caching.init_app(app)
|
caching.init_app(app)
|
||||||
celery_init_app(app)
|
celery_init_app(app)
|
||||||
|
|
||||||
|
if app.config['OIDC_CLIENT_SECRETS']:
|
||||||
|
oidc.init_app(app)
|
||||||
|
|
||||||
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)
|
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)
|
||||||
|
|
||||||
app.jinja_env.filters['pretty_duration'] = filters.pretty_duration
|
app.jinja_env.filters['pretty_duration'] = filters.pretty_duration
|
||||||
|
@@ -8,12 +8,15 @@ bp = Blueprint('channel', __name__, url_prefix='/channel')
|
|||||||
@bp.route('')
|
@bp.route('')
|
||||||
@caching.cached(unless=caching_unless)
|
@caching.cached(unless=caching_unless)
|
||||||
def base():
|
def base():
|
||||||
channels = {}
|
channels = []
|
||||||
channelIds = get_nosql().list_all_channels()
|
channelIds = get_nosql().list_all_channels()
|
||||||
|
|
||||||
for channelId in channelIds:
|
for channelId in channelIds:
|
||||||
channels[channelId] = get_nosql().get_channel_info(channelId)
|
channel = get_nosql().get_channel_info(channelId)
|
||||||
channels[channelId]['video_count'] = get_nosql().get_channel_videos_count(channelId)
|
channel['video_count'] = get_nosql().get_channel_videos_count(channelId)
|
||||||
|
channels.append(channel)
|
||||||
|
|
||||||
|
channels = sorted(channels, key=lambda x: x.get('added_date'), reverse=True)
|
||||||
|
|
||||||
return render_template('channel/index.html', channels=channels)
|
return render_template('channel/index.html', channels=channels)
|
||||||
|
|
||||||
@@ -47,4 +50,17 @@ def orphaned():
|
|||||||
|
|
||||||
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
|
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
|
||||||
|
|
||||||
return render_template('channel/orphaned.html', videos=videos)
|
return render_template('channel/orphaned.html', videos=videos)
|
||||||
|
|
||||||
|
@bp.route('/recent')
|
||||||
|
@caching.cached(unless=caching_unless)
|
||||||
|
def recent():
|
||||||
|
videoIds = get_nosql().get_recent_videos()
|
||||||
|
|
||||||
|
videos = []
|
||||||
|
for videoId in videoIds:
|
||||||
|
videos.append(get_nosql().get_video_info(videoId, limited=True))
|
||||||
|
|
||||||
|
videos = sorted(videos, key=lambda x: x.get('epoch', 0), reverse=True)
|
||||||
|
|
||||||
|
return render_template('channel/recent.html', videos=videos)
|
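Review note: the channel listing now sorts on `x.get('added_date')` without a default, while the video listings use `x.get('epoch', 0)`. If any channel document lacks `added_date`, comparing `None` with a real value raises `TypeError` in Python 3. A possible None-safe sort key, assuming missing dates should simply sort last (`newest_first` is a hypothetical helper):

```python
def newest_first(channels: list) -> list:
    """Sort channel dicts by added_date, newest first, undated channels last."""

    def sort_key(channel: dict) -> tuple:
        added = channel.get("added_date")
        # The first element separates dated from undated channels, so a real
        # value is never compared against the placeholder 0.
        return (added is not None, added if added is not None else 0)

    return sorted(channels, key=sort_key, reverse=True)
```

This works for any comparable `added_date` type (epoch numbers or `datetime` objects), since undated entries are only ever compared with each other.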
@@ -3,10 +3,12 @@ from flask_limiter.util import get_remote_address
|
|||||||
|
|
||||||
from flask_caching import Cache
|
from flask_caching import Cache
|
||||||
|
|
||||||
from flask import Flask, request, session
|
|
||||||
|
|
||||||
from celery import Celery, Task
|
from celery import Celery, Task
|
||||||
|
|
||||||
|
from flask_oidc import OpenIDConnect
|
||||||
|
|
||||||
|
from flask import Flask, request, session
|
||||||
|
|
||||||
def celery_init_app(app: Flask) -> Celery:
|
def celery_init_app(app: Flask) -> Celery:
|
||||||
class FlaskTask(Task):
|
class FlaskTask(Task):
|
||||||
def __call__(self, *args: object, **kwargs: object) -> object:
|
def __call__(self, *args: object, **kwargs: object) -> object:
|
||||||
@@ -46,3 +48,4 @@ limiter = Limiter(
|
|||||||
|
|
||||||
caching = Cache()
|
caching = Cache()
|
||||||
|
|
||||||
|
oidc = OpenIDConnect()
|
@@ -196,10 +196,21 @@ class Mango:
|
|||||||
ids = [result['id'] for result in results]
|
ids = [result['id'] for result in results]
|
||||||
|
|
||||||
return tuple(ids)
|
return tuple(ids)
|
||||||
|
|
||||||
|
def get_recent_videos(self, count=99):
|
||||||
|
""" Returns a SET of YouTube video ID's which have been added last to the info_json collection """
|
||||||
|
result = self.info_json.find({}, {'_id': 0, 'id': 1}, sort=[('_id', pymongo.DESCENDING)]).limit(count)
|
||||||
|
|
||||||
|
ids = []
|
||||||
|
|
||||||
|
for id in result:
|
||||||
|
ids.append(id['id'])
|
||||||
|
|
||||||
|
return tuple(ids)
|
||||||
|
|
||||||
def get_video_info(self, videoId, limited=False):
|
def get_video_info(self, videoId, limited=False):
|
||||||
if limited:
|
if limited:
|
||||||
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'epoch': 1, 'title_slug': 1}
|
projection = {'_id': 1, 'id': 1, 'title': 1, 'upload_date': 1, 'description': 1, 'channel_id': 1, 'uploader': 1, 'epoch': 1, 'title_slug': 1}
|
||||||
else:
|
else:
|
||||||
projection = {}
|
projection = {}
|
||||||
|
|
||||||
|
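Review note: `get_recent_videos` sorts on `_id` descending, which works as a "recently added" order because the first 4 bytes of a MongoDB ObjectId hold its creation time in seconds since the Unix epoch. A small standard-library sketch of that relationship (`objectid_created_at` is a hypothetical helper; it assumes the documents use default, driver-generated ObjectIds):

```python
from datetime import datetime, timezone


def objectid_created_at(object_id_hex: str) -> datetime:
    """Decode the creation time embedded in a 24-character ObjectId hex string."""
    # The leading 8 hex characters are a big-endian count of seconds since the epoch.
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

The order is only approximate at one-second resolution and reflects when the document was inserted into `info_json`, not when the video was uploaded, which is why the route sorts the result again on `epoch`.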
@@ -19,7 +19,17 @@
|
|||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="row">
|
<div class="row">
|
||||||
<div class="col s12 m-4 filterable">
|
<div class="col s6 m-4 filterable">
|
||||||
|
<a href="{{ url_for('channel.recent') }}">
|
||||||
|
<div class="card black-text">
|
||||||
|
<div class="card-content center">
|
||||||
|
<span class="card-title">Recent videos</span>
|
||||||
|
<p class="grey-text">The last videos to have been added to the archive</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</a>
|
||||||
|
</div>
|
||||||
|
<div class="col s6 m-4 filterable">
|
||||||
<a href="{{ url_for('channel.orphaned') }}">
|
<a href="{{ url_for('channel.orphaned') }}">
|
||||||
<div class="card black-text">
|
<div class="card black-text">
|
||||||
<div class="card-content center">
|
<div class="card-content center">
|
||||||
@@ -31,12 +41,12 @@
|
|||||||
</div>
|
</div>
|
||||||
{% for channel in channels %}
|
{% for channel in channels %}
|
||||||
<div class="col s6 l4 m-4 filterable">
|
<div class="col s6 l4 m-4 filterable">
|
||||||
<a href="{{ url_for('channel.channel', channelId=channel) }}">
|
<a href="{{ url_for('channel.channel', channelId=channel.get('id')) }}">
|
||||||
<div class="card black-text">
|
<div class="card black-text">
|
||||||
<div class="card-content">
|
<div class="card-content">
|
||||||
<span class="card-title">{{ channels[channel].get('original_name') }}</span>
|
<span class="card-title">{{ channel.get('original_name') }}</span>
|
||||||
<p class="grey-text">{{ channels[channel].get('id') }}</p>
|
<p class="grey-text">{{ channel.get('id') }}</p>
|
||||||
<p><b>Added:</b> {{ channels[channel].get('added_date')|pretty_time }} | <b>Active:</b> {{ channels[channel].get('active') }} | <b>Videos:</b> {{ channels[channel].get('video_count') }}</p>
|
<p><b>Added:</b> {{ channel.get('added_date')|pretty_time }} | <b>Active:</b> {{ channel.get('active') }} | <b>Videos:</b> {{ channel.get('video_count') }}</p>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</a>
|
</a>
|
||||||
|
@@ -5,7 +5,7 @@
|
|||||||
{% block content %}
|
{% block content %}
|
||||||
<div class="row">
|
<div class="row">
|
||||||
<div class="col s12">
|
<div class="col s12">
|
||||||
<h4>Channels lising page</h4>
|
<h4>Videos lising page</h4>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<div class="divider"></div>
|
<div class="divider"></div>
|
||||||
@@ -25,11 +25,12 @@
|
|||||||
<div class="card medium black-text">
|
<div class="card medium black-text">
|
||||||
<a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
|
<a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
|
||||||
<div class="card-image">
|
<div class="card-image">
|
||||||
<img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title') }}.jpg">
|
<img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title_slug') }}.jpg">
|
||||||
</div>
|
</div>
|
||||||
</a>
|
</a>
|
||||||
<div class="card-content activator">
|
<div class="card-content activator">
|
||||||
<span class="card-title">{{ video.get('title') }}</span>
|
<span class="card-title">{{ video.get('title') }}</span>
|
||||||
|
<p><b>{{ video.get('uploader') }}</b></p>
|
||||||
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
|
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
|
||||||
</div>
|
</div>
|
||||||
<div class="card-reveal">
|
<div class="card-reveal">
|
||||||
|
44
ayta/templates/channel/recent.html
Normal file
@@ -0,0 +1,44 @@
|
|||||||
|
{% extends 'material_base.html' %}
|
||||||
|
{% block title %}Recent videos{% endblock %}
|
||||||
|
{% block description %}The last videos to have been added to the archive{% endblock %}
|
||||||
|
|
||||||
|
{% block content %}
|
||||||
|
<div class="row">
|
||||||
|
<div class="col s12">
|
||||||
|
<h4>Videos lising page</h4>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="divider"></div>
|
||||||
|
<div class="row">
|
||||||
|
<div class="col s6 l9">
|
||||||
|
<h5>Recent videos</h5>
|
||||||
|
<p>The last 99 videos to have been added to the archive.</p>
|
||||||
|
</div>
|
||||||
|
<div class="col s6 l3 m-4 input-field">
|
||||||
|
<input id="filter_query" type="text">
|
||||||
|
<label for="filter_query">Filter results</label>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
<div class="row">
|
||||||
|
{% for video in videos %}
|
||||||
|
<div class="col s6 l4 m-4 filterable">
|
||||||
|
<div class="card medium black-text">
|
||||||
|
<a href="{{ url_for('watch.base') }}?v={{ video.get('id') }}">
|
||||||
|
<div class="card-image">
|
||||||
|
<img loading="lazy" src="https://archive.ventilaar.net/videos/automatic/{{ video.get('channel_id') }}/{{ video.get('id') }}/{{ video.get('title_slug') }}.jpg">
|
||||||
|
</div>
|
||||||
|
</a>
|
||||||
|
<div class="card-content activator">
|
||||||
|
<span class="card-title">{{ video.get('title') }}</span>
|
||||||
|
<p><b>{{ video.get('uploader') }}</b></p>
|
||||||
|
<p class="grey-text">{{ video.get('id') }} | {{ video.get('upload_date')|pretty_time }}</p>
|
||||||
|
</div>
|
||||||
|
<div class="card-reveal">
|
||||||
|
<span class="card-title truncate">{{ video.get('title') }}</span>
|
||||||
|
<p style="white-space: pre-wrap;">{{ video.get('description') }}</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
{% endblock %}
|