EPUB Support (#178)

* Added book filetype detection and reorganized tests due to size of file
* Added ability to get basic Parse Info from Book and Pages.
* We can now scan books and get them in a library with cover images.
* Take the first image in the epub if the cover isn't set.
* Implemented the ability to unzip the epub to cache. Implemented a test api to load html files.
* Just some test code to figure out how to approach this.
* Fixed some merge conflicts
* Removed some dead code from merge
* Snapshot: I can now load everything properly into the UI by rewriting the urls before I send them back. I don't notice any lag from this method. It can be optimized further.
* Implemented a way to load the content in the browser not via an iframe.
* Added a note
* Anchor mapping is complete. New anchors are updated so references now resolve to javascript:void() for the UI to take care of internally loading, and the appropriate page is mapped to it. Anchors that are external have target="_blank" added so they don't force you out of the app, and styles are of course inlined.
* Oops, I need this
* Table of contents api implemented (rough) and some small enhancements to codebase for books.
* GetBookPageResources now only loads files from within the book. Nested chapter list support and images now use html parsing instead of string parsing.
* Fonts are now remapped to load from endpoint.
* book-resources now uses a key, ensuring the file is in proper format for lookup. Changed chapter list based on structure with one HEADER and nested chapters.
* Properly handle svg resource requests and, when there are part anchors that are clickable, make sure we handle them in the UI by adding a kavita-page handler.
* Add Chapter group page even if one isn't set by using first page (without part) from nestedChildren.
* Added extra debug code for issue #163.
* Added new user preferences for books and updated the css so we scope it to our reading section.
* Cleaned up style code
* Implemented ability to save book preferences and some cleanup on existing apis.
* Added an api for checking if a user has read something in a library type before.
* Forgot to make sure the has reading progress is against a user lol.
* Remove cacheservice code for books, since we use an in-memory method
* Handle svg images as well
* Enhanced cover image extraction to check for a "cover" image if the cover image wasn't set in OPF before falling back to the first image.
* Fixed an issue with special books not properly generating metadata due to not having filename set.
* Cleanup, removed warmup task code from startup/program and changed taskscheduler to schedule tasks on startup only (or if tasks are changed from UI).
* Code cleanup
* Code cleanup
* So much code. Lots of refactors to try to test scanner service. Moved a lot of the queries into Extensions to make them easier to test, even though it's hacky. Support @font-face src:url swaps with ' and ". Source summary information from epubs.
* Well...baseURL needs to come from BE and not from UI lol.
* Adjusted migrations so default values match Entity
* Removed comment
* I think I finally fixed #163! The issue was that when I checked if it had a parserInfo, I wasn't considering that the chapter range might have a - in it (0-6), and so when the code to check if range could parse out a number failed, it treated it like a special and checked range against info's filename.
* Some bugfixes
* Lots of testing, extracting code to make it easier to test. This code is buggy, but fixed a bug where 1) if we changed the normalization code, we would remove the whole db during a scan and 2) we weren't actually removing series properly. Other than that, code is being extracted to remove duplication and centralize logic.
* More code cleanup and test cleanup to ensure scan loop is working as expected and matches expectations from tests.
* Cleaned up the code and made it so if I change normalization, which I do in this branch, it won't break existing DBs.
* Some comic parser changes for partial chapter support.
* Added some code for directory service and scanner service along with python code to generate test files (not used yet). Fixed up all the tests.
* Code smells
2021-04-28 16:16:22 -05:00
commit a01613f80f
parent 2b99c8abfa
103 changed files with 5017 additions and 2480 deletions
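The commit message says fonts are remapped to load from an endpoint and that @font-face src:url swaps work with both ' and ". The sketch below shows one way the FontSrcUrlRegex added in Parser.cs could drive such a rewrite. The pattern is copied from the diff; the RewriteFontUrls helper and the URL prefix passed to it are hypothetical and are not taken from the commit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontUrlRewriteSketch
{
    // Same pattern as Parser.FontSrcUrlRegex in this commit:
    // group 1 = "src:url(" plus an optional quote, group 2 = the path, group 3 = optional quote plus ")".
    private static readonly Regex FontSrcUrlRegex = new Regex(
        "(src:url\\(\"?'?)([a-z0-9/\\._]+)(\"?'?\\))",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical helper: prefixes every matched font path with apiBase,
    // keeping the original quoting style intact.
    public static string RewriteFontUrls(string css, string apiBase)
    {
        return FontSrcUrlRegex.Replace(css, m =>
            m.Groups[1].Value + apiBase + m.Groups[2].Value + m.Groups[3].Value);
    }

    public static void Main()
    {
        var css = "@font-face { font-family: 'Body'; src:url('fonts/body.ttf'); }";
        // The prefix below is an assumed endpoint shape, for illustration only.
        Console.WriteLine(RewriteFontUrls(css, "/api/book/1/book-resources?file="));
        // prints: @font-face { font-family: 'Body'; src:url('/api/book/1/book-resources?file=fonts/body.ttf'); }
    }
}
```

As written, the pattern only matches paths made of letters, digits, `/`, `.` and `_`, with no space between `src:` and `url(`, so a declaration such as `src: url('my-font.ttf')` would pass through unchanged.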
--- a/API/Parser/Parser.cs
+++ b/API/Parser/Parser.cs
@@ -9,14 +9,19 @@ namespace API.Parser
 {
    public static class Parser
    {
-        public static readonly string ArchiveFileExtensions = @"\.cbz|\.zip|\.rar|\.cbr|.tar.gz|.7zip";
+        public static readonly string ArchiveFileExtensions = @"\.cbz|\.zip|\.rar|\.cbr|\.tar.gz|\.7zip";
+        public static readonly string BookFileExtensions = @"\.epub";
        public static readonly string ImageFileExtensions = @"^(\.png|\.jpeg|\.jpg)";
+        public static readonly Regex FontSrcUrlRegex = new Regex("(src:url\\(\"?'?)([a-z0-9/\\._]+)(\"?'?\\))", RegexOptions.IgnoreCase | RegexOptions.Compiled);
+
        private static readonly string XmlRegexExtensions = @"\.xml";
        private static readonly Regex ImageRegex = new Regex(ImageFileExtensions, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static readonly Regex ArchiveFileRegex = new Regex(ArchiveFileExtensions, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static readonly Regex XmlRegex = new Regex(XmlRegexExtensions, RegexOptions.IgnoreCase | RegexOptions.Compiled);
+        private static readonly Regex BookFileRegex = new Regex(BookFileExtensions, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        private static readonly Regex CoverImageRegex = new Regex(@"(?<![[a-z]\d])(?:!?)(cover|folder)(?![\w\d])", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        
+
        private static readonly Regex[] MangaVolumeRegex = new[]
        {
            // Dance in the Vampire Bund v16-17
@@ -56,6 +61,10 @@ namespace API.Parser

        private static readonly Regex[] MangaSeriesRegex = new[]
        {
+            // [SugoiSugoi]_NEEDLESS_Vol.2_-_Disk_The_Informant_5_[ENG].rar
+            new Regex(
+                @"^(?<Series>.*)( |_)Vol\.?\d+",
+                RegexOptions.IgnoreCase | RegexOptions.Compiled),
            // Ichiban_Ushiro_no_Daimaou_v04_ch34_[VISCANS].zip
            new Regex(
            @"(?<Series>.*)(\b|_)v(?<Volume>\d+-?\d*)( |_)",
@@ -126,10 +135,7 @@ namespace API.Parser
            new Regex(
                @"^(?!Vol)(?<Series>.*)( |_)Chapter( |_)(\d+)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled),
-            // [SugoiSugoi]_NEEDLESS_Vol.2_-_Disk_The_Informant_5_[ENG].rar
-            new Regex(
-                @"^(?<Series>.*)( |_)Vol\.?\d+",
-                RegexOptions.IgnoreCase | RegexOptions.Compiled),
+           
            // Fullmetal Alchemist chapters 101-108.cbz
            new Regex(
                @"^(?!vol)(?<Series>.*)( |_)(chapters( |_)?)\d+-?\d*",
@@ -238,21 +244,21 @@ namespace API.Parser
        
        private static readonly Regex[] ComicChapterRegex = new[]
        {
-            // 04 - Asterix the Gladiator (1964) (Digital-Empire) (WebP by Doc MaKS)
-            new Regex(
-                @"^(?<Volume>\d+) (- |_)?(?<Series>.*(\d{4})?)( |_)(\(|\d+)",
-                RegexOptions.IgnoreCase | RegexOptions.Compiled),
-            // 01 Spider-Man & Wolverine 01.cbr
-            new Regex(
-                @"^(?<Volume>\d+) (?:- )?(?<Series>.*) (\d+)?",
-                RegexOptions.IgnoreCase | RegexOptions.Compiled),
+            // // 04 - Asterix the Gladiator (1964) (Digital-Empire) (WebP by Doc MaKS)
+            // new Regex(
+            //     @"^(?<Volume>\d+) (- |_)?(?<Series>.*(\d{4})?)( |_)(\(|\d+)",
+            //     RegexOptions.IgnoreCase | RegexOptions.Compiled),
+            // // 01 Spider-Man & Wolverine 01.cbr
+            // new Regex(
+            //     @"^(?<Volume>\d+) (?:- )?(?<Series>.*) (\d+)?", // NOTE: Why is this here without a capture group?
+            //     RegexOptions.IgnoreCase | RegexOptions.Compiled),
            // Batman & Wildcat (1 of 3)
            new Regex(
                @"(?<Series>.*(\d{4})?)( |_)(?:\((?<Chapter>\d+) of \d+)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled),
            // Teen Titans v1 001 (1966-02) (digital) (OkC.O.M.P.U.T.O.-Novus)
            new Regex(
-                @"^(?<Series>.*)(?: |_)v(?<Volume>\d+)(?: |_)(c? ?)(?<Chapter>\d+)",
+                @"^(?<Series>.*)(?: |_)v(?<Volume>\d+)(?: |_)(c? ?)(?<Chapter>(\d+(\.\d)?)-?(\d+(\.\d)?)?)(c? ?)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled),
            // Batman & Catwoman - Trail of the Gun 01, Batman & Grendel (1996) 01 - Devil's Bones, Teen Titans v1 001 (1966-02) (digital) (OkC.O.M.P.U.T.O.-Novus)
            new Regex(
@@ -262,6 +268,10 @@ namespace API.Parser
            new Regex(
                @"^(?<Series>.*)(?: |_)#(?<Volume>\d+)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled),
+            // Invincible 070.5 - Invincible Returns 1 (2010) (digital) (Minutemen-InnerDemons).cbr
+            new Regex(
+                @"^(?<Series>.*)(?: |_)(c? ?)(?<Chapter>(\d+(\.\d)?)-?(\d+(\.\d)?)?)(c? ?)-",
+                RegexOptions.IgnoreCase | RegexOptions.Compiled),
        };

        private static readonly Regex[] ReleaseGroupRegex = new[]
@@ -350,7 +360,7 @@ namespace API.Parser
        {
            // All Keywords, does not account for checking if contains volume/chapter identification. Parser.Parse() will handle.
            new Regex(
-                @"(?<Special>Specials?|OneShot|One\-Shot|Omake|Extra( Chapter)?|Art Collection)",
+                @"(?<Special>Specials?|OneShot|One\-Shot|Omake|Extra( Chapter)?|Art Collection|Side( |_)Stories)",
                RegexOptions.IgnoreCase | RegexOptions.Compiled),
        };

@@ -366,17 +376,34 @@ namespace API.Parser
        public static ParserInfo Parse(string filePath, string rootPath, LibraryType type = LibraryType.Manga)
        {
            var fileName = Path.GetFileName(filePath);
+            ParserInfo ret;

-            var ret = new ParserInfo()
+            if (type == LibraryType.Book)
            {
-                Chapters = type == LibraryType.Manga ? ParseChapter(fileName) : ParseComicChapter(fileName),
-                Series = type == LibraryType.Manga ? ParseSeries(fileName) : ParseComicSeries(fileName),
-                Volumes = type == LibraryType.Manga ? ParseVolume(fileName) : ParseComicVolume(fileName),
-                Filename = fileName,
-                Format = ParseFormat(filePath),
-                FullFilePath = filePath
-            };
-            
+                ret = new ParserInfo()
+                {
+                    Chapters = ParseChapter(fileName) ?? ParseComicChapter(fileName),
+                    Series = ParseSeries(fileName) ?? ParseComicSeries(fileName),
+                    Volumes = ParseVolume(fileName) ?? ParseComicVolume(fileName),
+                    Filename = fileName,
+                    Format = ParseFormat(filePath),
+                    FullFilePath = filePath
+                };
+            }
+            else
+            {
+                ret = new ParserInfo()
+                {
+                    Chapters = type == LibraryType.Manga ? ParseChapter(fileName) : ParseComicChapter(fileName),
+                    Series = type == LibraryType.Manga ? ParseSeries(fileName) : ParseComicSeries(fileName),
+                    Volumes = type == LibraryType.Manga ? ParseVolume(fileName) : ParseComicVolume(fileName),
+                    Filename = fileName,
+                    Format = ParseFormat(filePath),
+                    Title = Path.GetFileNameWithoutExtension(fileName),
+                    FullFilePath = filePath
+                };
+            }
+
            if (ret.Series == string.Empty)
            {
                // Try to parse information out of each folder all the way to rootPath
@@ -412,6 +439,8 @@ namespace API.Parser
            }

            var isSpecial = ParseMangaSpecial(fileName);
+            // Only treat the file as a special when no volume or chapter could be parsed. Some files, such as
+            // v20 c171-180+Omake, contain a special keyword (Omake) alongside valid volume/chapter information.
            if (ret.Chapters == "0" && ret.Volumes == "0" && !string.IsNullOrEmpty(isSpecial))
            {
                ret.IsSpecial = true;
@@ -426,6 +455,7 @@ namespace API.Parser
        {
            if (IsArchive(filePath)) return MangaFormat.Archive;
            if (IsImage(filePath)) return MangaFormat.Image;
+            if (IsBook(filePath)) return MangaFormat.Book;
            return MangaFormat.Unknown;
        }

@@ -520,7 +550,7 @@ namespace API.Parser
            
            return "0";
        }
-        
+
        public static string ParseComicVolume(string filename)
        {
            foreach (var regex in ComicVolumeRegex)
@@ -735,6 +765,10 @@ namespace API.Parser
        {
            return ArchiveFileRegex.IsMatch(Path.GetExtension(filePath));
        }
+        public static bool IsBook(string filePath)
+        {
+            return BookFileRegex.IsMatch(Path.GetExtension(filePath));
+        }

        public static bool IsImage(string filePath, bool suppressExtraChecks = false)
        {
@@ -749,13 +783,13 @@ namespace API.Parser
        
        public static float MinimumNumberFromRange(string range)
        {
-            var tokens = range.Split("-");
+            var tokens = range.Replace("_", string.Empty).Split("-");
            return tokens.Min(float.Parse);
        }

        public static string Normalize(string name)
        {
-            return name.ToLower().Replace("-", "").Replace(" ", "").Replace(":", "").Replace("_", "");
+            return Regex.Replace(name.ToLower(), "[^a-zA-Z0-9]", string.Empty);
        }

        /// <summary>
@@ -773,6 +807,10 @@ namespace API.Parser
            return path.Contains("__MACOSX");
        }

-        
+
+        public static bool IsEpub(string filePath)
+        {
+            return Path.GetExtension(filePath).ToLower() == ".epub";
+        }
    }
 }
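The last hunks of Parser.cs change two small helpers that the commit message ties to the normalization change and to the fix for #163 (chapter ranges such as 0-6). The sketch below copies both and compares the old and new Normalize on one title. NormalizeSketch and OldNormalize are illustrative names, and parsing with the invariant culture through a lambda is an assumption added here so the example behaves the same on any machine.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class NormalizeSketch
{
    // Previous behaviour (the removed line): only '-', ' ', ':' and '_' were stripped.
    public static string OldNormalize(string name)
    {
        return name.ToLower().Replace("-", "").Replace(" ", "").Replace(":", "").Replace("_", "");
    }

    // New behaviour (the added line): everything except ASCII letters and digits is stripped.
    public static string Normalize(string name)
    {
        return Regex.Replace(name.ToLower(), "[^a-zA-Z0-9]", string.Empty);
    }

    // As in the diff, underscores are dropped before splitting the range on '-'.
    // Assumption: invariant culture, so "1.5" parses regardless of the machine locale.
    public static float MinimumNumberFromRange(string range)
    {
        var tokens = range.Replace("_", string.Empty).Split('-');
        return tokens.Min(t => float.Parse(t, CultureInfo.InvariantCulture));
    }

    public static void Main()
    {
        Console.WriteLine(OldNormalize("Spider-Man & Wolverine")); // spiderman&wolverine
        Console.WriteLine(Normalize("Spider-Man & Wolverine"));    // spidermanwolverine
        Console.WriteLine(MinimumNumberFromRange("0-6"));          // 0
        Console.WriteLine(MinimumNumberFromRange("1.5-3"));        // 1.5
    }
}
```

Because the two Normalize versions can return different keys for the same title, series stored under the old key would no longer match after a rescan, which is the situation the commit message describes guarding against. The new pattern also strips every non-ASCII character, so a title written entirely in, say, Japanese normalizes to an empty string.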
--- a/API/Parser/ParserInfo.cs
+++ b/API/Parser/ParserInfo.cs
@@ -7,16 +7,36 @@ namespace API.Parser
    /// </summary>
    public class ParserInfo
    {
-        // This can be multiple
+        /// <summary>
+        /// Represents the parsed chapters from a file. By default, will be 0 which means nothing could be parsed.
+        /// <remarks>The chapters can only be a single float or a range of floats, e.g. 1-2. Floats should mainly be multiples of 0.5, representing specials.</remarks>
+        /// </summary>
        public string Chapters { get; set; } = "";
+        /// <summary>
+        /// Represents the parsed series from the file or folder
+        /// </summary>
        public string Series { get; set; } = "";
-        // This can be multiple
+        /// <summary>
+        /// Represents the parsed volumes from a file. By default, will be 0 which means that nothing could be parsed.
+        /// If Volumes is 0 and Chapters is 0, the file is a special. If Chapters is non-zero, then no volume could be parsed.
+        /// <example>Beastars Vol 3-4 will map to "3-4"</example>
+        /// <remarks>The volumes can only be a single int or a range of ints, e.g. 1-2. Float-based volumes are not supported.</remarks>
+        /// </summary>
        public string Volumes { get; set; } = "";
+        /// <summary>
+        /// Filename of the underlying file
+        /// <example>Beastars v01 (digital).cbz</example>
+        /// </summary>
        public string Filename { get; init; } = "";
+        /// <summary>
+        /// Full filepath of the underlying file
+        /// <example>C:/Manga/Beastars v01 (digital).cbz</example>
+        /// </summary>
        public string FullFilePath { get; set; } = "";

        /// <summary>
-        /// <see cref="MangaFormat"/> that represents the type of the file (so caching service knows how to cache for reading)
+        /// <see cref="MangaFormat"/> that represents the type of the file
+        /// <remarks>Mainly used to show in the UI and so caching service knows how to cache for reading.</remarks>
        /// </summary>
        public MangaFormat Format { get; set; } = MangaFormat.Unknown;

@@ -26,8 +46,38 @@ namespace API.Parser
        public string Edition { get; set; } = "";

        /// <summary>
-        /// If the file contains no volume/chapter information and contains Special Keywords <see cref="Parser.MangaSpecialRegex"/>
+        /// If the file contains no volume/chapter information or contains Special Keywords <see cref="Parser.MangaSpecialRegex"/>
        /// </summary>
        public bool IsSpecial { get; set; } = false;
+
+        /// <summary>
+        /// Used for specials or books, stores what the UI should show.
+        /// <remarks>Manga does not use this field</remarks>
+        /// </summary>
+        public string Title { get; set; } = string.Empty;
+        
+        /// <summary>
+        /// If the ParserInfo has the IsSpecial tag or both volumes and chapters are default aka 0
+        /// </summary>
+        /// <returns></returns>
+        public bool IsSpecialInfo()
+        { 
+            return (IsSpecial || (Volumes == "0" && Chapters == "0"));
+        }
+
+        /// <summary>
+        /// Merges non empty/null properties from info2 into this entity.
+        /// </summary>
+        /// <param name="info2"></param>
+        public void Merge(ParserInfo info2)
+        {
+            if (info2 == null) return;
+            Chapters = string.IsNullOrEmpty(Chapters) || Chapters == "0" ? info2.Chapters: Chapters;
+            Volumes = string.IsNullOrEmpty(Volumes) || Volumes == "0" ? info2.Volumes : Volumes;
+            Edition = string.IsNullOrEmpty(Edition) ? info2.Edition : Edition;
+            Title = string.IsNullOrEmpty(Title) ? info2.Title : Title;
+            Series = string.IsNullOrEmpty(Series) ? info2.Series : Series;
+            IsSpecial = IsSpecial || info2.IsSpecial;
+        }
    }
 }
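The new IsSpecialInfo() and Merge() members on ParserInfo are easiest to follow with concrete values. The sketch below uses a trimmed copy of the class (ParserInfoSketch, a name chosen here) that keeps only the fields Merge touches. The two example instances, one parsed from a filename and one from some second source, are assumptions for illustration; the diff does not show where Merge is called.

```csharp
using System;

// Trimmed stand-in for API.Parser.ParserInfo with only the members used by Merge.
public class ParserInfoSketch
{
    public string Chapters { get; set; } = "";
    public string Series { get; set; } = "";
    public string Volumes { get; set; } = "";
    public string Edition { get; set; } = "";
    public string Title { get; set; } = "";
    public bool IsSpecial { get; set; }

    // True when flagged as special, or when both volumes and chapters are the default "0".
    public bool IsSpecialInfo()
    {
        return IsSpecial || (Volumes == "0" && Chapters == "0");
    }

    // Same rules as the diff: a value on this instance wins unless it is empty
    // (or "0" for Chapters/Volumes), in which case info2's value is taken.
    public void Merge(ParserInfoSketch info2)
    {
        if (info2 == null) return;
        Chapters = string.IsNullOrEmpty(Chapters) || Chapters == "0" ? info2.Chapters : Chapters;
        Volumes = string.IsNullOrEmpty(Volumes) || Volumes == "0" ? info2.Volumes : Volumes;
        Edition = string.IsNullOrEmpty(Edition) ? info2.Edition : Edition;
        Title = string.IsNullOrEmpty(Title) ? info2.Title : Title;
        Series = string.IsNullOrEmpty(Series) ? info2.Series : Series;
        IsSpecial = IsSpecial || info2.IsSpecial;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical inputs: nothing but the series could be parsed from the filename.
        var fromFilename = new ParserInfoSketch { Series = "Beastars", Volumes = "0", Chapters = "0" };
        var fromOtherSource = new ParserInfoSketch { Volumes = "3-4", Chapters = "0", Title = "Beastars Vol 3-4" };

        Console.WriteLine(fromFilename.IsSpecialInfo()); // True
        fromFilename.Merge(fromOtherSource);
        Console.WriteLine(fromFilename.Volumes);         // 3-4
        Console.WriteLine(fromFilename.Title);           // Beastars Vol 3-4
        Console.WriteLine(fromFilename.IsSpecialInfo()); // False
    }
}
```

One edge case follows directly from the ternaries: when this instance holds "0" and info2 holds an empty string, the "0" is replaced by the empty string, so a caller relying on the "0" default may want to check for that after merging.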