amazing-ytdlp-archive

This project will be awesome, but only if I invest enough time. This software will replace my current cronjob-based yt-dlp archive service.

Partially inspired by hobune. While that project is amazing on its own, it's just not scalable.

The idea

With over 250k videos, scaling the current cronjob yt-dlp archive task is just really hard. File types change, things get partially downloaded, and so on. yt-dlp is partially to blame, because it's a package that needs to change all the time, and some of those changes are not accounted for. yt-dlp will still do the downloads, but a Flask frontend will be developed to make all downloaded videos easily indexable. For it to be quick (unlike hobune), a database has to be implemented. This could also be solved with a static site generator type of software, but that is not my choice.

The whole software package will use PostgreSQL as the data backend and Celery for background tasks. Development, however, currently uses MongoDB just because it's easy.

How it works currently (legacy)

In the legacy folder you will find the files that currently make up my archive project. How it works is that I have a cronjob running every 24 hours, which then runs yt-dlp with a config file. In that config file a channel list contains all the channels that yt-dlp needs to update. If a new video has been uploaded, yt-dlp will automatically download a 720p version of the video, all subtitles available at that time (rip community captions, will not forget you) and a JSON file with all the rest of the metadata. Oh, and also the thumbnail.

This works, but it is very slow and uses lots of "API" calls to YouTube, which will sometimes get the IP blocked. This is why full channel upload pages are not downloaded anymore; I have limited it to the first 50 videos.
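
For illustration only (these are not the actual files from the legacy folder): a comparable setup could look like the sketch below. The flags are standard yt-dlp options; the paths and the cron schedule are made up.

# crontab entry: run the archive update once every 24 hours
0 4 * * * yt-dlp --config-locations /etc/archive/yt-dlp.conf

# /etc/archive/yt-dlp.conf
# channels.txt holds one channel (uploads playlist) URL per line
--batch-file /etc/archive/channels.txt
# only look at the newest 50 entries of each channel to save "API" calls
--playlist-items 1:50
--format "bestvideo[height<=720]+bestaudio/best[height<=720]"
# all subtitles available at download time, the metadata sidecar and the thumbnail
--write-subs --write-auto-subs
--write-info-json
--write-thumbnail
# skip video ids that were already downloaded
--download-archive /srv/archive/downloaded.txt
--output "/srv/archive/videos/%(channel_id)s/%(id)s/%(title)s.%(ext)s"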

Goals

Some goals have been set up to prioritise functionality for the software package. The starting point is that the info.json files of videos are loaded into the MongoDB database, from which Flask generates pages for channels and videos. But this has major limitations, which will not be described right now but will be reflected in the goals.
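
As a rough sketch of that starting point (a minimal version, assuming a local MongoDB and database/collection names that are not necessarily the ones this project uses):

import json
from pathlib import Path

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
videos = client["archive"]["videos"]               # assumed database/collection names

# Walk ./videos/{channel_id}/{video_id}/*.info.json and upsert one document per video,
# keyed by the YouTube video id, so Flask can query channels and videos quickly.
for info_file in Path("./videos").glob("*/*/*.info.json"):
    info = json.loads(info_file.read_text(encoding="utf-8"))
    videos.update_one({"_id": info["id"]}, {"$set": info}, upsert=True)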

Stage 1

Tasks which have to be finished before the GUI frontend is usable as a manager and as a user, in no particular order.

  • Have videos and channels listed on a page
  • Have a secured admin page where the database can be managed
  • Have working video streaming
  • CI/CD pipeline for quicker deployment
  • Add caching to speed up pages
  • Add ratelimiting for expensive pages
  • Ability to show cronjob logs to easily troubleshoot

Stage 2

Extra functionality for further development of features.

  • Fix video titles on disk with slugs
  • Working search functionality
  • Video reporting functionality
  • Ability (for external applications) to queue up video ids for download
  • Add WebSub requesting and receiving ability (not fully usable yet without Celery tasks)
  • OIDC or WebAuthn logins instead of static Argon2 passwords

Stage 3

Mainly focused on retiring the cronjob-based scripts and moving them to Celery-based tasks.

  • Manage videos by ID instead of on a per-channel basis
  • Download videos from the queue
  • Manage WebSub callbacks

Stage 4

MongoDB finally hits its limitations.

  • Migrate to PostgreSQL

Stage ...

Since this is my flagship piece of software, more features will be added. It may take some time, since this is just a hobby for me and I'm not a programmer by title.

Things learned

Video playlists

Using the following URL for a YouTube channel in yt-dlp references the uploads playlist for that particular channel. Not using the /videos path in this format will download some videos that were not uploaded by this channel.

https://www.youtube.com/channel/UCdBK94H6oZT2Q7l0-b0xmMg/videos

This is why my archive is tainted with channels referencing only one video.

However, by simply changing the second character C to the character U, so that you get the following string, you can reference the "uploaded by this channel" playlist directly.

https://www.youtube.com/playlist?list=UUdBK94H6oZT2Q7l0-b0xmMg

Both links reference the same thing: the videos uploaded by this channel.
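
A small sketch of that rewrite in Python (assuming the usual "UC..." channel id prefix; the function name is mine):

def uploads_playlist_id(channel_id: str) -> str:
    """Turn a channel id ("UC...") into the id of its uploads playlist ("UU...")."""
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel id starting with UC")
    return "UU" + channel_id[2:]

print("https://www.youtube.com/playlist?list=" + uploads_playlist_id("UCdBK94H6oZT2Q7l0-b0xmMg"))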

Channel names

Creators change their names pretty often, but channel IDs are static. So if you create a folder structure by channel name, you will get multiple folders for the same channel. If you swap the channel name for the channel ID, the folders will never change.

Storage structure

The following folder structure is pretty nice for use with static scripts. The one drawback is that you can't search for video IDs or titles, because the search takes too long. This is mainly why we need a new system using a database, mainly for search.

The following structure is easily scalable and usable in an object storage format.

./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
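
As a hedged sketch of how such a path could be built (the slug rules and function names here are mine, not necessarily what the existing scripts do):

import re
from pathlib import Path

def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into single dashes."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"

def info_json_path(root: Path, channel_id: str, video_id: str, title: str) -> Path:
    # ./videos/{channel_id}/{video_id}/video-title-slug-format.info.json
    return root / channel_id / video_id / f"{slugify(title)}.info.json"

print(info_json_path(Path("videos"), "UCdBK94H6oZT2Q7l0-b0xmMg", "dQw4w9WgXcQ", "Some Video Title!"))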

API things learned

YouTube push notifications in API form exist

Using the PubSubHubbub service provided by Google, we will implement downloading videos based on upload notifications. The API is based on WebSub, which is well documented.

The hub will send Atom XML notifications when a video is uploaded by a channel and when a video is deleted. The goal is to download a video when a notification comes through, and to run a full channel sync when a video is deleted. This will run alongside periodic full channel polling, to download videos which the hub has not notified us about.
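
A minimal sketch of the receiving side in Flask (the route and the way notifications are handed off are assumptions; echoing hub.challenge on GET is what the WebSub spec requires for subscription verification):

from flask import Flask, request

app = Flask(__name__)

@app.route("/websub/callback", methods=["GET", "POST"])
def websub_callback():
    if request.method == "GET":
        # Subscription verification: the hub expects hub.challenge to be echoed back.
        return request.args.get("hub.challenge", ""), 200
    # Notification: the hub POSTs an Atom XML document describing the uploaded/deleted video.
    atom_xml = request.get_data(as_text=True)
    # A real handler would parse the feed here and queue a download or a full channel sync.
    print(atom_xml[:200])
    return "", 204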

Etag is useful

When we call the API for 50 items in a playlist, we also get an etag back. This is a sort of hash of the returned data, so if we compare the returned etag with the previous one we can eliminate most playlist calls. What I mean by that is: if we call for the latest uploaded videos and the etag is the same as on the previous call, we can just skip calling the rest of the playlist, saving API calls.

But sometimes we could call it anyway, in case private videos became public, since the etag only covers the last 50 videos. Saving the last etag in the database and sending it along with the API call will result in a 304 (Not Modified) response when nothing has changed for stale YouTube channels.
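
A hedged sketch of that flow against the YouTube Data API v3 playlistItems endpoint (the request parameters and the If-None-Match header are part of the public API; how the etag is stored and reused is an assumption):

import requests

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"

def fetch_first_page(playlist_id: str, api_key: str, previous_etag: str = ""):
    headers = {"If-None-Match": previous_etag} if previous_etag else {}
    resp = requests.get(API_URL, params={
        "part": "snippet",
        "playlistId": playlist_id,
        "maxResults": 50,
        "key": api_key,
    }, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # nothing changed since the stored etag, skip paging through the rest
    data = resp.json()
    return data["etag"], data["items"]  # persist the new etag for the next poll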

The etag is present in every response from the call and looks like this:

{
  "kind": "youtube#playlistItemListResponse",
  "etag": "pGXxpygXglKdyZa1rK8GZq0elbo",
  "nextPageToken": "EAAaBlBUOkNBVQ",
  "items": [
    {
    ...