Extracting data from a file

FB-Extractor – extract data from a .fb2 file

Hello everybody! Not long ago there was a discussion on Facebook: how do you clean up the library on a local device? Since I mostly read .fb2 files, today's topic is one solution to that problem: extracting data from a .fb2 file. And along the way we tidy up the hard drive 🙂

The crux of the problem: when we download a book, it most often has a completely unreadable file name (for example, Piz_Novyy-yazyk-telodvizheniy.RoTD5Q.393030.fb2; yes, you can more or less work it out, but still...). Since it is hard to guess what a book is from such gibberish, the question arose: is it possible to determine the author and title somehow and rename the file accordingly? Quite possible!

So what is a .fb2 file? In fact, it is an XML document. And since it is an XML document, it can and should be parsed 🙂
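To make the "it is just XML" point concrete, here is a minimal sketch that reads the title and author with the standard library's ElementTree. It is an illustration, not the code used later in this post (which works with regular expressions). The sample document, the `read_title_info` helper and the names in it are made up for the example; the namespace is the one FictionBook 2.0 files normally declare.

```python
import xml.etree.ElementTree as ET

# Namespace that FictionBook 2.0 documents normally declare on the root element
FB2_NS = {"fb": "http://www.gribuser.ru/xml/fictionbook/2.0"}

# A tiny made-up document, just enough for the example
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <description>
    <title-info>
      <author>
        <first-name>Ivan</first-name>
        <last-name>Petrov</last-name>
      </author>
      <book-title>Sample Book</book-title>
    </title-info>
  </description>
  <body/>
</FictionBook>"""


def read_title_info(xml_text):
    """Return title, first name and last name from FB2 text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    info = root.find("fb:description/fb:title-info", FB2_NS)
    if info is None:
        return {}

    def text_of(path):
        # Missing or empty elements become an empty string
        node = info.find(path, FB2_NS)
        return node.text.strip() if node is not None and node.text else ""

    return {
        "title": text_of("fb:book-title"),
        "first_name": text_of("fb:author/fb:first-name"),
        "last_name": text_of("fb:author/fb:last-name"),
    }


print(read_title_info(SAMPLE))
```

A real parser also copes with tags that span several lines or carry attributes, which simple patterns tend to miss; the price is that a badly broken file raises a parse error instead of being matched partially.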

The first problem I ran into was determining the file encoding. In theory, I could open the file in an arbitrary encoding, look for the encoding declaration inside the file itself, and work from there. But that code turned out so bulky that I switched to the chardet library, which detects the following encodings:

  • ASCII, UTF-8, UTF-16 (2 variants), UTF-32 (4 variants)
  • Big5, GB2312, EUC-TW, HZ-GB-2312, ISO-2022-CN (Traditional and Simplified Chinese)
  • EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP (Japanese)
  • EUC-KR, ISO-2022-KR (Korean)
  • KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, windows-1251 (Cyrillic)
  • ISO-8859-5, windows-1251 (Bulgarian)
  • ISO-8859-1, windows-1252 (Western European languages)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255 (Visual and Logical Hebrew)
  • TIS-620 (Thai)
As a result, the code takes a .fb2 file, parses it, finds the book's author and title, and saves the file under a name of the form “Author-Title”. You could, of course, add the genre and so on. I tested the code on the couple of thousand books I have on hand. It seems to work.
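For comparison, the "look for the encoding header inside the file" idea mentioned earlier can be sketched with the standard library alone. This is only a rough illustration under simplifying assumptions, not the approach used in this post: `declared_encoding` is a hypothetical helper, it only looks at a byte order mark and at an ASCII-readable XML declaration, and it falls back to a default when neither is present.

```python
import re

# Matches the encoding attribute of an XML declaration at the start of a file
DECLARATION = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(head, default="utf-8"):
    """Guess the encoding from the first bytes of an XML file (hypothetical helper)."""
    # A byte order mark takes priority over the declaration.
    # UTF-32 is checked first because its little-endian mark starts like the UTF-16 one.
    if head.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        return "utf-32"
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = DECLARATION.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


head = b'<?xml version="1.0" encoding="windows-1251"?>\n<FictionBook>'
print(declared_encoding(head))
```

The sketch trusts whatever the file claims about itself, so a wrong or missing declaration goes unnoticed. Guessing from the bytes themselves, as chardet does, does not depend on the declaration at all.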

The complete code looks like this:

    from chardet.universaldetector import UniversalDetector  # detects the file encoding
    import glob    # finds files with a given extension
    import os      # file system operations
    import random  # random prefix for clashing file names
    import re      # regular expressions for pulling tags out of the text

    # Patterns for the fields we need. The non-greedy (.+?) stops at the first closing tag.
    reg_title = r'<book-title>(.+?)</book-title>'
    reg_last_name = r'<last-name>(.+?)</last-name>'
    reg_first_name = r'<first-name>(.+?)</first-name>'
    # Marker of a trial fragment ("End of the trial fragment."). It matches text
    # inside the book, so it has to stay in Russian.
    reg_trial = r'<title><p>Конец ознакомительного фрагмента\.</p></title>'


    def first_word(matches, default):
        """Return the leading word of the first match, or default if nothing was found."""
        if not matches:
            return default
        word = re.match(r'\w*', matches[0].strip()).group()
        return word or default


    # normpath tidies up slashes, so the path can be typed in any usual form
    path = os.path.normpath(input('Where are the source files? : \n').strip())

    print('Taking data from:', path)
    os.chdir(path)
    files = glob.glob('*.fb2')
    print('Books to process:', len(files))
    for file in files:
        print('\n\nProcessing file:', file)
        detector = UniversalDetector()
        with open(file, 'rb') as raw_file:
            for line in raw_file:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        file_encoding = detector.result['encoding']
        print(file_encoding)
        if not file_encoding:  # chardet could not determine the encoding
            continue

        # errors='replace' keeps a wrong guess from aborting the whole run
        with open(file, 'r', encoding=file_encoding, errors='replace') as book:
            stroke = book.read()

        title = re.findall(reg_title, stroke)
        title = title[0] if title else ''
        # Drop punctuation that is unwanted or not allowed in file names
        title = re.sub(r'[\'":#()\[\]!?\\/.«»,*<>|-]', '', title).strip()

        last_name = first_word(re.findall(reg_last_name, stroke), 'NoLastName')
        first_name = first_word(re.findall(reg_first_name, stroke), 'NoFirstName')
        is_trial = re.search(reg_trial, stroke) is not None

        print('Title:', title)
        print('Last name:', last_name)
        print('First name:', first_name)
        print('Demo?', is_trial)

        new_file_name = last_name + ' ' + first_name + ' ' + title + '.fb2'
        if is_trial:
            new_file_name = 'Demo ' + new_file_name
        print('The file will be renamed to:', new_file_name)
        try:
            os.rename(file, new_file_name)
        except OSError:
            # Most likely the name is already taken, so add a random prefix
            os.rename(file, 'copy ' + str(random.randrange(1, 10000)) + ' ' + new_file_name)

    print('All done!')
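One weak spot of renaming in bulk is name clashes: two editions of the same book produce the same target name. Below is a small sketch of one way to handle that more predictably than with a random number. `safe_name` and `unique_path` are hypothetical helpers written for this example, and the set of stripped characters is an assumption based on what Windows does not allow in file names.

```python
import os
import re

# Characters that Windows does not allow in file names
FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_name(text):
    """Remove forbidden characters and collapse runs of whitespace."""
    cleaned = FORBIDDEN.sub("", text)
    return " ".join(cleaned.split())


def unique_path(directory, base, ext=".fb2"):
    """Return a path in directory that does not exist yet, adding (1), (2), ... if needed."""
    candidate = os.path.join(directory, base + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, f"{base} ({counter}){ext}")
        counter += 1
    return candidate


if __name__ == "__main__":
    name = safe_name('Petrov Ivan  What? A/B: "C"')
    print(unique_path(".", name))
```

A counter keeps copies of the same book next to each other when the folder is sorted by name. Note that the existence check and the rename are two separate steps, so this is fine for a one-off cleanup script but not for several processes working in the same folder.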

The code is also available on GitHub.

And as always, if you have any questions, comments or ideas, write to me by email or on Telegram.