The ScratchDB project began a few years ago as a way to watch the Scratch forums and see what the Scratch Team deleted. One of my friends’ posts were consistently being deleted for seemingly random reasons, and I wanted to find trends. In the end, I never really analyzed that data or looked much further into it, as the project rapidly grew beyond that initial purpose.
Fast forward a few years, and I am currently a student at Georgia Tech in Atlanta studying Computer Science. Even with only 1.5 semesters of college-level programming knowledge (besides the ??? years of self-taught information), I have seen many aspects of ScratchDB that could be revised or improved on. I currently have no plans to fix what was in v2 or v3, as at this point I bet I could barely recognize much of the codebase, but I do have a plan in mind, and hopefully in the next few weeks things will start falling together.
A New Plan
ScratchDBv3’s database is the same database as ScratchDBv2’s, and they both use the same indexing system (literally the same one: a single instance feeds the database they both pull from). It’s held up quite well, but it is starting to show its age in its parsing and timing setup.
Starting out, the main part that I need to replace is the ranking system. Currently v2 loads all of this information into RAM when it starts up (which takes a good 3-5 minutes to read and parse), while v3 keeps a rolling update in Redis. In v4, I plan on moving to a central place that keeps all live ranking data, rather than reading from a database. To begin this new project, I will create an open-source database in Rust (🚀) that handles nothing but keeping a rapidly updated database of ranks as the indexing system pulls in updates. This would go for the following categories:
- total project loves
- total project favorites
- (the same 4, but per region)
I may try to do per-project rankings, but that comes down to how well I can manage memory in this new revision (and once it is open-sourced, help would be greatly appreciated!).
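As a rough illustration of what "keeping a live database of ranks" could look like (this is not the actual v4 code, and it is written in TypeScript rather than Rust only to keep the example short), here is a minimal in-memory leaderboard for a single category. The names `RankBoard`, `update`, `rankOf`, and `top` are all hypothetical.

```typescript
// Hypothetical sketch: an in-memory leaderboard for one category (e.g. loves).
// None of these names come from ScratchDB; they are for illustration only.

type Entry = { user: string; score: number };

// Higher scores first; ties broken by username so the order is deterministic.
function compare(a: Entry, b: Entry): number {
  if (a.score !== b.score) return b.score - a.score;
  return a.user < b.user ? -1 : a.user > b.user ? 1 : 0;
}

class RankBoard {
  private scores = new Map<string, number>(); // user -> current score
  private sorted: Entry[] = [];               // always kept in rank order

  // Binary search for the index where `e` is, or where it should be inserted.
  private indexOf(e: Entry): number {
    let lo = 0;
    let hi = this.sorted.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (compare(this.sorted[mid], e) < 0) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Called whenever the indexer reports a fresh total for a user.
  update(user: string, score: number): void {
    const old = this.scores.get(user);
    if (old !== undefined) {
      // Remove the stale entry before inserting the new one.
      this.sorted.splice(this.indexOf({ user, score: old }), 1);
    }
    this.scores.set(user, score);
    const entry = { user, score };
    this.sorted.splice(this.indexOf(entry), 0, entry);
  }

  // 1-based rank, or undefined if the user has never been indexed.
  rankOf(user: string): number | undefined {
    const score = this.scores.get(user);
    if (score === undefined) return undefined;
    return this.indexOf({ user, score }) + 1;
  }

  // The first `n` entries in rank order.
  top(n: number): Entry[] {
    return this.sorted.slice(0, n);
  }
}
```

One board per category (and per region) could then live in something like a `Map<string, RankBoard>`. The array `splice` calls make each update linear in the number of users, so a real implementation at this scale would probably want a balanced tree or skip list instead; the sorted array is just the simplest thing that shows the idea.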
Currently, timing comes down to storing a time in the database and hoping the indexer doesn’t fall behind what it set previously. (I’m not explaining this very well. Basically, once it gets info on a user, it puts the updated user info into the database and also calculates a time the user should be checked next; then, hopefully, at that next time the indexer isn’t overwhelmed and can index them again.) This system has worked, but it has been left up to some very simple math rather than what really matters: user interaction. I’ve gotten a lot of complaints about how slowly people are updated on the site, and I hope I can fix this issue in the next revision, perhaps by indexing and serving live rather than serving possibly months-old data. So, for this next system, I hope to rewrite it in either TypeScript (which I now have some knowledge of) or Go (purely to get some learning in it).
Lastly comes the front-end API. Currently the v3 API relies heavily on caching so that it doesn’t have to touch the database too much. In v4, this will be changed to talk mostly to the Rust (🚀)-based database discussed earlier. Otherwise, much of the API should be similar to v3’s, since on its own it’s had nearly every endpoint people have needed.
Another addition will be live data. Currently v2 and v3 serve old data; with v4 I hope to provide some “lightning” endpoints that’ll give up-to-date data for a user or a category. This may also include some form of websocket for forum posts, since some people have requested more rapid updates there.
When it comes to the actual implementation, I will keep it quite similar to v2 and v3: an Express-based API, this time written in TypeScript instead of plain Node.js, to keep things a bit cleaner. I plan on possibly open-sourcing the new API, but this is by no means a promise.
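To make the "user interaction instead of simple math" idea a bit more concrete, here is one possible sketch (purely an assumption, not how the current indexer or the planned one works): the delay until a user's next check shrinks as they are requested more often, and a small min-heap always hands back whoever is due first. Every name and constant here (`nextDelayMs`, `IndexQueue`, the 5 minute and 30 day bounds) is made up for the example.

```typescript
// Hypothetical sketch: pick each user's next check time from how often they
// are requested, instead of a fixed interval. All constants are made up.

const MIN_DELAY_MS = 5 * 60 * 1000;            // never re-check faster than 5 minutes
const MAX_DELAY_MS = 30 * 24 * 60 * 60 * 1000; // never wait longer than 30 days

// More recent API requests for a user -> shorter delay until the next index.
function nextDelayMs(recentRequests: number): number {
  const delay = MAX_DELAY_MS / (1 + recentRequests * recentRequests);
  return Math.max(MIN_DELAY_MS, Math.round(delay));
}

type Job = { user: string; dueAt: number };

// Minimal binary min-heap ordered by dueAt, so the earliest-due user is first.
class IndexQueue {
  private heap: Job[] = [];

  // Queue a user for their next check, based on how much they are requested.
  schedule(user: string, now: number, recentRequests: number): void {
    this.heap.push({ user, dueAt: now + nextDelayMs(recentRequests) });
    this.siftUp(this.heap.length - 1);
  }

  // Returns the next user whose check is due at `now`, or undefined if none is.
  popDue(now: number): string | undefined {
    if (this.heap.length === 0 || this.heap[0].dueAt > now) return undefined;
    const top = this.heap[0];
    const last = this.heap.pop()!;
    if (this.heap.length > 0) {
      this.heap[0] = last;
      this.siftDown(0);
    }
    return top.user;
  }

  private siftUp(i: number): void {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.heap[parent].dueAt <= this.heap[i].dueAt) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }

  private siftDown(i: number): void {
    const n = this.heap.length;
    for (;;) {
      let smallest = i;
      const l = 2 * i + 1;
      const r = l + 1;
      if (l < n && this.heap[l].dueAt < this.heap[smallest].dueAt) smallest = l;
      if (r < n && this.heap[r].dueAt < this.heap[smallest].dueAt) smallest = r;
      if (smallest === i) break;
      [this.heap[smallest], this.heap[i]] = [this.heap[i], this.heap[smallest]];
      i = smallest;
    }
  }
}
```

In this sketch the indexer would call `popDue(Date.now())` in a loop, index whoever comes back, and then `schedule` them again with their latest request count. If the indexer falls behind, the heap still returns the most overdue users first, rather than relying on everything happening exactly on time.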
I’ve let this project go dormant for quite some time, and people are starting to wonder when it’ll simply go offline or disappear. I’ve simply moved on to many other things, and haven’t done much new development since then. I’m now taking some classes relating to data structures and algorithms, and these have shown me how much better ScratchDB could be.
The last thing that caused this update is, surprisingly, my English class. The course is set up so that by the end of the semester I need to research something and produce a result. In this case, my research is designing a new database and better methods for building this system, and the result is most of the code and, of course, the new API. I don’t expect this project to be finished by the end of the semester, but perhaps some pieces will start to fall into place.
ScratchDB has been the peak of what I’ve made so far, processing terabytes of data and serving millions of requests (see photo below), and I really don’t want to let the project go gently into that good night. I would like to provide something that I feel I could show people, and explain how it works in the backend. By making ScratchDBv4 I can finally do this.
This project feels a tad ambitious, being effectively a full rewrite and possibly a full re-index, but at the end of the day, it’s experience that I can use in the future. I’ve been asked, “why do you run this when you aren’t paid anything?” and the answer is that it’s simply fun to run and provides a great thing to have on a resume, as it shows I can work on large projects with large real-world datasets.
Hopefully along the way I can provide some updates, although based on my track record of a lot of radio silence (side note: I am a DJ on WREK 91.1 FM, although I currently do things at midnight [Thursday morning, Wednesday night] so not too much going on there), I may not. I guess we shall see.
- ScratchDBv4 should hopefully be coming soon
- I am now no longer in high school and somehow made it into a university
- pieces of it will be open sourced and may require some help
- this project isn’t dead even though no updates have come out in over 2 years
- None of this may happen, but it should hopefully be happening
Whelp, here goes nothing!
(at this point, the writing is over; below this I’m just providing some transparency)
Also, for those who have asked, here are specific numbers relating to storage, processing, and requests.
Numbers (As of posting):
- 12,756,854 users
- 6,138,400 forum posts
- 572,355 forum topics
- 1,905,424 forum edits
- 94,633,751 projects
- 61,693,216 pieces of user history
- 2.84 TB of project JSON
- 292.6 GiB of MySQL databases
Counts given by phpMyAdmin estimates and ZFS data